ARTIFICIAL INTELLIGENCE
AgenticOps: Securing and Managing AI Agents in Production
Discover how AgenticOps extends DevOps to secure, observe, and manage AI agents, ensuring reliable and effective operation in enterprise environments.
10 min read · 2,198 words · Dec 16, 2025
As AI agents become integral to enterprise operations, managing their deployment and lifecycle demands a specialized approach. AgenticOps builds on existing IT capabilities like AIOps and ModelOps, addressing the unique challenges of securing, monitoring, and responding to incidents involving AI agents. This new operational framework emphasizes establishing robust identities, extending observability, upgrading incident management, tracking key performance indicators, and integrating user feedback to ensure AI agents deliver reliable, repeatable outcomes at scale. Organizations must adopt these practices to navigate the complexities of AI agent management.

The Rise of AI Agents and the Need for AgenticOps
Artificial intelligence agents are rapidly transforming enterprise operations by combining sophisticated language and reasoning models with the capacity to execute actions through automations and application programming interfaces. These agents are becoming increasingly adept at handling complex operations, with protocols like the Model Context Protocol (MCP) making it easier for agents to discover and integrate with external tools, data sources, and other agents. Early organizational adoption often sees these AI agents embedded within existing Software as a Service applications, assisting with tasks ranging from recruitment in human resources to resolving intricate supply-chain challenges in operations.
Beyond supporting existing applications, innovative companies are actively developing proprietary AI agents to augment specific workflows, cater to industry-specific demands, and enrich customer experiences. This development phase requires a thorough consideration of core principles, architectural designs, non-functional requirements, and robust testing methodologies. These foundational steps are crucial for safely deploying experimental AI agents and for promoting them into full production environments. The rapid rollout of these advanced AI systems, however, introduces new operational and security complexities.
This evolving landscape necessitates a fresh approach to IT operations, giving rise to "AgenticOps." This new framework extends traditional DevOps practices and IT service management functions to encompass the unique requirements of securing, observing, monitoring, and responding to incidents involving AI agents. AgenticOps represents a critical evolution in how organizations manage and maintain the integrity and performance of their AI-powered workforces.
Understanding AgenticOps
AgenticOps is not an entirely new concept but rather a natural progression and integration of several established IT operational capabilities, adapted for the distinct characteristics of AI agents. It builds upon the foundations laid by earlier innovations in operational intelligence and automation. This comprehensive approach is designed to manage the full lifecycle of AI agents effectively, from development to deployment and ongoing maintenance.
One foundational component is AIOps, which emerged to tackle the challenge of managing an overwhelming number of independent monitoring tools. AIOps platforms centralize log files and other observability data, leveraging machine learning to correlate alerts and consolidate them into manageable incidents. This capability provides a unified view of operational health. Another critical precursor is ModelOps, a dedicated capability focused on monitoring machine learning models in production environments. ModelOps tracks model drift and other operational issues, ensuring that AI models remain accurate and perform as expected over time.
Furthermore, AgenticOps incorporates principles from platform engineering, automation of IT processes, and the use of generative AI within IT operations. These elements collectively enhance collaboration among IT teams and streamline the resolution of incidents, contributing to a more efficient operational ecosystem. The goal is to create a synergy where AI agents not only perform tasks but are also managed by intelligent systems that can learn and adapt.
DJ Sampath, SVP of the AI software and platform group at Cisco, highlights three core requirements for effective AgenticOps: centralizing data from disparate operational silos, fostering seamless collaboration between humans and AI agents, and leveraging purpose-built AI language models specifically trained to understand networks, infrastructure, and applications. Sampath notes that "AI agents with advanced models can help network, system, and security engineers configure networks, understand logs, run queries, and address issue root causes more efficiently and effectively." These requirements address the distinct challenges of managing AI agents compared to traditional applications, web services, or standalone AI models.
The outputs of AI agents can vary significantly, unlike the more predictable behavior of traditional applications. This variability necessitates a shift in how operational success is measured. Rajeev Butani, chairman and CEO of MediaMint, states, "AI agents in production need a different playbook because, unlike traditional apps, their outputs vary, so teams must track outcomes like containment, cost per action, and escalation rates, not just uptime." He emphasizes that the ultimate measure of success lies in proving that agents can deliver reliable and repeatable outcomes at scale, rather than merely avoiding incidents. This new paradigm for measurement underscores the need for specialized AgenticOps practices.
Establishing Robust AgenticOps Practices
As organizations delve deeper into the development and deployment of AI agents, adopting specific AgenticOps practices becomes paramount. These practices ensure the secure, efficient, and reliable operation of AI agents, safeguarding against potential risks while maximizing their benefits. By integrating these strategies early, IT teams can build a solid foundation for scaling their AI agent workforce.
Securing AI Agent Identities and Access
A fundamental practice in AgenticOps involves rigorously establishing AI agent identities and their corresponding security profiles. This entails defining precisely what data and application programming interfaces (APIs) agents are authorized to access. A recommended approach is to provision AI agents in a manner similar to human users, assigning them unique identities, authorizations, and entitlements. This can be achieved through established identity and access management (IAM) platforms such as Microsoft Entra ID, Okta, or Oracle Identity and Access Management. These platforms provide the necessary infrastructure to manage digital identities effectively.
Jason Sabin, CTO of DigiCert, emphasizes the importance of robust identity management for AI agents: "Because AI agents adapt and learn, they need strong cryptographic identities, and digital certificates make it possible to revoke access instantly if an agent is compromised or goes rogue." Securing agent identities in this way, much like machine identities, ensures digital trust and accountability across the entire security architecture. This proactive security measure is critical for mitigating the risks associated with autonomous systems.
Architects, DevOps engineers, and security leaders must collaborate to define initial standards for IAM and digital certificates during the rollout of AI agents. It is important to anticipate that these capabilities will evolve as the number of AI agents scales. As the agent workforce expands, specialized tools and configurations may become necessary to maintain optimal security and management. This iterative approach allows for adaptation to increasing complexity and evolving threats.
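As a concrete illustration of one such initial standard, the minimal sketch below provisions an AI agent as its own first-class identity and requests a narrowly scoped access token through a standard OAuth 2.0 client-credentials flow, which IAM platforms such as Entra ID and Okta support. The token endpoint, client ID, scope names, and secret handling here are illustrative assumptions, not any specific vendor's API; real deployments would pull the credential from a secrets manager or use a client certificate.

```python
import os
import requests

# Hypothetical identity-provider token endpoint; Entra ID, Okta, and similar
# platforms expose an equivalent OAuth 2.0 client-credentials endpoint.
TOKEN_URL = "https://idp.example.com/oauth2/v1/token"

def get_agent_token(agent_client_id: str, scopes: list[str]) -> str:
    """Authenticate the AI agent as its own identity and return a scoped access token."""
    response = requests.post(
        TOKEN_URL,
        data={
            "grant_type": "client_credentials",
            "client_id": agent_client_id,
            # In production, load this from a secrets manager or replace it
            # with certificate-based client authentication.
            "client_secret": os.environ["AGENT_CLIENT_SECRET"],
            # Scopes limit the agent to the specific APIs it is entitled to call.
            "scope": " ".join(scopes),
        },
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["access_token"]

if __name__ == "__main__":
    # Example: a recruiting agent entitled only to read candidate records.
    token = get_agent_token(
        agent_client_id="recruiting-agent-01",
        scopes=["candidates.read"],
    )
    print("Issued short-lived token for recruiting-agent-01")
```

Because the token is short-lived and tied to a revocable agent identity, disabling a compromised agent reduces to revoking that identity in the IAM platform, in line with the certificate-based revocation Sabin describes.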
Enhancing Observability and Monitoring for AI Agents
The hybrid nature of AI agents, combining elements of applications, data pipelines, AI models, integrations, and APIs, demands an extension and integration of existing DevOps practices. Platform engineering, for instance, must evolve to consider unstructured data pipelines, Model Context Protocol (MCP) integrations, and critical feedback loops for AI models. This holistic view ensures that all components contributing to an agent's operation are accounted for.
Christian Posta, Global Field CTO of Solo.io, underscores the pivotal role of platform teams in this evolution: "Platform teams will play an instrumental role in moving AI agents from pilots into production." He adds that this involves "evolving platform engineering to be context aware, not just of infrastructure, but of the stateful prompts, decisions, and data flows that agents and LLMs rely on." This enhanced awareness provides organizations with crucial observability, security, and governance, without hindering the self-service innovation that AI teams require.
Similarly, traditional observability and monitoring tools must be upgraded to diagnose more than just uptime, reliability, errors, and performance. AI agents introduce new metrics and behavioral patterns that need to be tracked. Federico Larsen, CTO of Copado, explains that "AI agents require multi-layered monitoring, including performance metrics, decision logging, and behavior tracking." He recommends implementing proactive anomaly detection using machine learning to identify deviations from expected patterns before they impact business operations. Furthermore, establishing clear escalation paths with human-in-the-loop override capabilities is essential when AI agents make unexpected decisions.
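A minimal sketch of what this layered approach might look like follows: each agent decision is logged with its latency and a confidence score, a simple statistical check flags deviations from the recent baseline, and flagged decisions are escalated to a human reviewer. The record fields and thresholds are illustrative assumptions, not a specific vendor's schema.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev

@dataclass
class DecisionRecord:
    """One logged agent decision: what it did, how long it took, how confident it was."""
    agent_id: str
    action: str
    latency_ms: float
    confidence: float  # model-reported confidence in [0, 1]

@dataclass
class AgentMonitor:
    """Tracks recent decisions and flags anomalies for human review."""
    history: list[DecisionRecord] = field(default_factory=list)
    min_confidence: float = 0.6   # illustrative threshold
    latency_sigma: float = 3.0    # flag latencies beyond 3 standard deviations

    def record(self, decision: DecisionRecord) -> None:
        if self._is_anomalous(decision):
            self._escalate(decision)
        self.history.append(decision)

    def _is_anomalous(self, decision: DecisionRecord) -> bool:
        if decision.confidence < self.min_confidence:
            return True
        if len(self.history) < 20:  # not enough baseline yet
            return False
        latencies = [d.latency_ms for d in self.history[-100:]]
        mu, sigma = mean(latencies), pstdev(latencies) or 1.0
        return abs(decision.latency_ms - mu) > self.latency_sigma * sigma

    def _escalate(self, decision: DecisionRecord) -> None:
        # Hand off to a human-in-the-loop queue (pager, ticket, review UI).
        print(f"ESCALATE: {decision.agent_id} took action '{decision.action}' "
              f"with confidence {decision.confidence:.2f}")
```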
Currently, several observability, monitoring, and incident management platforms are evolving to support AI agent capabilities, including BigPanda, Cisco AI Canvas, Datadog LLM observability, and SolarWinds AI Agent. DevOps teams should define the minimal required configurations and standards for platform engineering, observability, and monitoring for their initial AI agent deployments. Concurrently, they must continuously monitor vendor capabilities and evaluate new tools as AI agent development becomes more widespread and sophisticated.
Optimizing Incident Management and Performance Tracking
The introduction of AI agents significantly elevates the complexity of incident management and root cause analysis, pushing the boundaries of traditional IT operations. Site Reliability Engineers (SREs), who already face challenges in pinpointing root causes for application and data pipeline issues, will encounter even greater hurdles with AI agents. Effective strategies are needed to quickly diagnose and resolve problems that arise from these autonomous systems.
Upgrading Incident Management and Root Cause Analysis
When an AI agent "hallucinates" by providing incorrect responses or automates improper actions, SREs and IT operations teams must act swiftly to resolve these issues. This requires an in-depth ability to trace the agent's data sources, models, reasoning pathways, entitlements, and underlying business rules to identify the precise root causes. Traditional debugging methods often fall short in this complex environment, necessitating a more comprehensive approach to problem-solving.
Kurt Muehmel, head of AI strategy at Dataiku, highlights the limitations of conventional observability: "Traditional observability falls short because it only tracks success or failure, and with AI agents, you need to understand the reasoning pathway: which data the agent used, which models influenced it, and what rules shaped its output." He further elaborates that incident management transforms into an inspection process, where the root cause is not merely a system crash, but potentially the agent using stale data due to an unrefreshed upstream model. Enterprises, therefore, require specialized tools that can inspect decision provenance and fine-tune orchestration, allowing teams to delve deeper into the agent's internal workings.
Andy Sen, CTO of AppDirect, advises repurposing real-time monitoring tools and utilizing logging and performance metrics to track AI agents' behavior. He recommends maintaining existing procedures for root cause analysis and post-incident reviews when incidents occur. Crucially, this collected data should be fed back to the agent as input for continuous improvement. This integrated approach not only enhances the performance of AI agents but also ensures a secure and efficient operational environment by closing the feedback loop. To prepare for this, IT operations teams should select appropriate tools and train SREs in critical concepts such as data lineage, provenance, and data quality. These areas will be essential for upskilling staff to effectively support incident and problem management related to AI agents.
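One way to make that kind of inspection possible is to capture a provenance record alongside every agent response: which data sources were consulted and when they were last refreshed, which model version produced the answer, and which business rules were applied. The structure below is a hypothetical illustration of such a record under those assumptions, not a feature of any particular tool mentioned in this article.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class SourceRef:
    """A data source the agent consulted, with its last refresh time."""
    name: str
    last_refreshed: datetime

@dataclass(frozen=True)
class ProvenanceRecord:
    """What an SRE needs to trace an agent's reasoning pathway after an incident."""
    request_id: str
    model_version: str
    sources: tuple[SourceRef, ...]
    rules_applied: tuple[str, ...]
    produced_at: datetime

    def stale_sources(self, max_age_hours: float = 24.0) -> list[str]:
        """Return sources older than the allowed freshness window."""
        cutoff = max_age_hours * 3600
        return [
            s.name for s in self.sources
            if (self.produced_at - s.last_refreshed).total_seconds() > cutoff
        ]

# Example: a root-cause check that points at stale upstream data.
record = ProvenanceRecord(
    request_id="req-4821",
    model_version="pricing-agent-v12",
    sources=(SourceRef("inventory_feed", datetime(2025, 12, 10, tzinfo=timezone.utc)),),
    rules_applied=("discount_cap_10pct",),
    produced_at=datetime(2025, 12, 15, tzinfo=timezone.utc),
)
print(record.stale_sources())  # ['inventory_feed'] -> likely root cause
```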
Tracking Key Performance Indicators for AI Agents
Beyond traditional uptime and system performance metrics, DevOps organizations are accustomed to a more comprehensive view of application reliability, often utilizing error budgets to drive continuous improvements and reduce technical debt. With AI agents, this holistic approach becomes even more critical, necessitating new Key Performance Indicators (KPIs) and metrics to continuously track agent behaviors and their benefits to end-users.
Experts identify three crucial areas where new metrics are needed. Craig Wiley, senior director of product for AI/ML at Databricks, suggests that "defining KPIs can help you establish a proper monitoring system." He provides an example: if accuracy must stay above 95%, a drop below that threshold can trigger alert mechanisms, giving the organization centralized visibility and a coordinated response. This proactive monitoring ensures agents meet performance thresholds.
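Wiley's accuracy example might translate into a check like the sketch below, where a rolling accuracy KPI is computed from labeled evaluation results and an alert fires when it drops under the 95% threshold. The alerting hook is a placeholder for whatever centralized response system an organization actually runs.

```python
def rolling_accuracy(outcomes: list[bool]) -> float:
    """Fraction of evaluated agent responses judged correct."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0

def check_accuracy_kpi(outcomes: list[bool], threshold: float = 0.95) -> None:
    """Trigger an alert when the accuracy KPI falls below the agreed threshold."""
    accuracy = rolling_accuracy(outcomes)
    if accuracy < threshold:
        # Placeholder: route to the organization's central alerting system
        # (an incident platform, pager, or chat channel).
        print(f"ALERT: agent accuracy {accuracy:.1%} below {threshold:.0%} threshold")
    else:
        print(f"OK: agent accuracy {accuracy:.1%}")

# Example: last 200 evaluated responses, 188 judged correct -> fires an alert.
check_accuracy_kpi([True] * 188 + [False] * 12)
```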
Jacob Leverich, co-founder and CPO of Observe, Inc., points out that "With AI agents, teams may find themselves taking a heavy dependency on model providers, so it becomes critical to monitor token usage and understand how to optimize costs associated with the use of LLMs." Cost optimization is a significant operational concern, especially with the consumption-based pricing often used for large language models. Ryan Peterson, EVP and CPO at Concentrix, emphasizes that "Data readiness isn't a one-time check; it requires continuous audits for freshness and accuracy, bias testing, and alignment to brand voice." He suggests that metrics such as knowledge base coverage, update frequency, and error rates are the true tests of AI-ready data. Leaders must define a comprehensive model of operational metrics for AI agents, applicable to both third-party and internally developed proprietary agents.
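A rough sketch of the kind of cost tracking Leverich describes: token counts per completed agent action aggregated into a cost-per-action figure. The per-token prices and usage numbers here are illustrative assumptions; real figures would come from the model provider's usage reporting and current price list.

```python
from dataclasses import dataclass

@dataclass
class ActionUsage:
    """Token usage reported for one completed agent action."""
    action_id: str
    prompt_tokens: int
    completion_tokens: int

# Illustrative per-1K-token prices; substitute the provider's actual price list.
PROMPT_PRICE_PER_1K = 0.003
COMPLETION_PRICE_PER_1K = 0.015

def cost_per_action(usage: list[ActionUsage]) -> float:
    """Average LLM spend per completed agent action."""
    total = sum(
        u.prompt_tokens / 1000 * PROMPT_PRICE_PER_1K
        + u.completion_tokens / 1000 * COMPLETION_PRICE_PER_1K
        for u in usage
    )
    return total / len(usage) if usage else 0.0

usage = [
    ActionUsage("ticket-123", prompt_tokens=4_200, completion_tokens=900),
    ActionUsage("ticket-124", prompt_tokens=6_800, completion_tokens=1_400),
]
print(f"Cost per action: ${cost_per_action(usage):.4f}")
```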
Incorporating User Feedback for Agent Usefulness
A frequently overlooked aspect in DevOps and IT operations is the importance of tracking customer and employee satisfaction. While end-user metrics and feedback are often delegated to product management and stakeholders, this oversight becomes particularly detrimental when supporting AI agents. Their direct interaction with users necessitates a more integrated approach to feedback.
Saurabh Sodani, chief development officer at Pendo, states that "Managing AI agents in production starts with visibility into how they operate and what outcomes they drive." He elaborates on the need to connect agent behavior to the user experience, focusing not just on whether an agent responds, but "whether it actually helps someone complete a task, resolve an issue, or move through a workflow, all the while being compliant." This level of insight allows teams to monitor performance, swiftly respond to issues, and continuously improve how agents support users across interactive, autonomous, and asynchronous modes. User feedback is not merely an optional data point; it is essential operational data that must be integrated into AIOps and incident management frameworks. This feedback is crucial not only for resolving immediate issues with AI agents but also for providing critical input that can refine and improve the AI agent's language and reasoning models over time, leading to more useful and effective agents.
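To make that concrete, the sketch below derives two of the outcome metrics mentioned earlier, containment rate (interactions the agent resolved without human handoff) and escalation rate, from simple feedback events. The event shape is an assumption for illustration; in practice these signals would come from the product analytics or feedback pipeline that feeds AIOps.

```python
from dataclasses import dataclass

@dataclass
class FeedbackEvent:
    """Outcome of one agent-assisted interaction, as reported by the user or workflow."""
    session_id: str
    task_completed: bool      # did the user finish what they came to do?
    escalated_to_human: bool  # did the agent hand off to a person?

def outcome_metrics(events: list[FeedbackEvent]) -> dict[str, float]:
    """Containment and escalation rates over a batch of interactions."""
    n = len(events)
    if n == 0:
        return {"containment_rate": 0.0, "escalation_rate": 0.0}
    contained = sum(e.task_completed and not e.escalated_to_human for e in events)
    escalated = sum(e.escalated_to_human for e in events)
    return {
        "containment_rate": contained / n,
        "escalation_rate": escalated / n,
    }

events = [
    FeedbackEvent("s1", task_completed=True, escalated_to_human=False),
    FeedbackEvent("s2", task_completed=False, escalated_to_human=True),
    FeedbackEvent("s3", task_completed=True, escalated_to_human=False),
]
print(outcome_metrics(events))  # {'containment_rate': 0.67, 'escalation_rate': 0.33} approx.
```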
In conclusion, as AI agents become more prevalent, IT operations must proactively adopt the tools and practices required to manage them effectively in production environments. Organizations should begin by meticulously tracking end-user impacts and business outcomes, then progressively delve deeper into monitoring the agentâs performance in decision-making and response generation. Relying solely on system-level metrics will prove insufficient for comprehensive monitoring and resolution of issues with AI agents, making a human-centric and outcome-focused approach indispensable.