Evaluation & Observability for Production Agentic Systems: Metrics, Tracing, and Drift Detection Beyond the Demo
Most teams building agentic AI systems eventually reach the same moment, typically some time after a successful internal demo and some time before production users can confidently say “it’s working great”: the realisation that they have no reliable way to know whether the system is working correctly at scale. The demo worked. Individual test cases look fine. But whether the system performs well on the full, messy distribution of real inputs (making the right decisions, using its tools correctly, escalating appropriately, handling edge cases gracefully) is genuinely unclear, because the observability infrastructure to answer those questions either doesn’t exist or isn’t being used.
This final post in the Expert series addresses that gap directly. Evaluation and observability for production agentic systems is a discipline in its own right, distinct from conventional software monitoring and from the model performance monitoring covered in the MRM post, requiring its own architecture and its own organisational ownership. Getting it right is what separates agentic AI systems that improve over time from ones that silently degrade.
Why Conventional Monitoring Is Insufficient for Agentic Systems
Traditional application monitoring watches infrastructure health — response times, error rates, uptime — and tells you whether your system is running. Traditional model monitoring watches performance metrics against known ground truth — accuracy, precision, recall — and tells you whether your model is predicting correctly. Both of these are necessary for agentic systems, and neither is sufficient on its own.
What’s missing is a layer that watches the agent’s reasoning and behaviour — whether it’s taking the right sequence of actions, using its tools in appropriate ways, reasoning correctly from retrieved context to conclusions, and handling genuinely novel inputs gracefully rather than confidently producing wrong outputs. This is the observability gap that makes production agentic systems hard to govern: the parts of the system’s behaviour that matter most for safety, quality, and compliance are precisely the parts that conventional monitoring is least able to see.
The Full Observability Stack for Agentic Systems
Layer 1 — Infrastructure observability (necessary baseline). Latency, error rates, token consumption, and cost per task — the conventional monitoring metrics — provide the foundation and the early warning system for gross failures. An agent that’s suddenly consuming ten times its normal token budget per task is behaving anomalously in a way infrastructure monitoring will catch even if the specific nature of the anomaly isn’t yet clear. This layer should be fully in place from day one of any production deployment; it’s the easiest layer to build and the one with the most mature tooling.
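As a rough illustration of the token-budget example above, the sketch below flags a task whose token usage is far above the recent median. The function name `is_token_anomaly`, the 5x threshold, and the minimum history size are all assumptions chosen for illustration, not recommended values; a real deployment would tune them per task type.

```python
from statistics import median

def is_token_anomaly(history, current, ratio_threshold=5.0, min_history=20):
    """Flag a task whose token usage far exceeds the recent median.

    history: token counts from recent comparable tasks (hypothetical input).
    current: token count of the task being checked.
    """
    if len(history) < min_history:
        return False  # not enough baseline data to judge
    baseline = median(history)
    if baseline <= 0:
        return False  # degenerate baseline, nothing sensible to compare against
    return current / baseline >= ratio_threshold

# Illustrative baseline: 30 tasks using between 1000 and 1300 tokens each.
history = [1000 + (i % 7) * 50 for i in range(30)]
normal = is_token_anomaly(history, 1300)    # within the usual range
spike = is_token_anomaly(history, 12000)    # roughly ten times the median
```

The median is used instead of the mean so that a handful of earlier spikes does not inflate the baseline and mask the next one.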
Layer 2 — Trace-level observability (the distinctive layer). This is the layer unique to agentic systems and the one most commonly missing or incomplete in early production deployments. A trace captures the full internal reasoning path of an agent for a given task: every LLM call with its full input and output, every tool invocation with its parameters and results, every branching decision and which path was taken, and the full state at each point in the workflow graph. Without this, when something goes wrong — or right, and you want to understand why and replicate it — the system is effectively a black box: you can see what went in and what came out, but not the reasoning path in between.
Building trace-level observability correctly means three things: instrumenting the orchestration layer to capture this information consistently for every task; storing it in a way that’s queryable and retention-compliant (a real consideration given how much sensitive data traces may contain); and building tooling that lets a human reviewer inspect a specific trace efficiently, as a structured, human-readable representation of what the agent actually did and why rather than thousands of lines of raw JSON.
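To make the idea of a trace concrete, here is a minimal sketch of one possible in-memory shape for a trace, plus a compact renderer for human review. The `Span` and `Trace` classes, their field names, and the sample task are hypothetical; production systems would more likely build on an established tracing library and add redaction for sensitive fields.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Span:
    kind: str                 # "llm_call", "tool_call", or "decision"
    name: str
    inputs: dict[str, Any]
    outputs: dict[str, Any]
    duration_ms: float

@dataclass
class Trace:
    task_id: str
    spans: list[Span] = field(default_factory=list)

    def record(self, kind, name, inputs, outputs, duration_ms):
        """Append one step of the agent's path to the trace."""
        self.spans.append(Span(kind, name, inputs, outputs, duration_ms))

    def summary(self, max_chars=60):
        """Render the trace as one short line per step for a human reviewer."""
        lines = [f"Trace {self.task_id} ({len(self.spans)} steps)"]
        for i, s in enumerate(self.spans, start=1):
            out = str(s.outputs)
            if len(out) > max_chars:
                out = out[:max_chars] + "..."  # truncate long outputs
            lines.append(f"{i}. [{s.kind}] {s.name} ({s.duration_ms:.0f} ms) -> {out}")
        return "\n".join(lines)

# Illustrative task: one LLM call, one tool call, one routing decision.
trace = Trace("task-001")
trace.record("llm_call", "plan", {"prompt": "Classify the claim"},
             {"text": "Need policy lookup"}, 420.0)
trace.record("tool_call", "policy_lookup", {"policy_id": "P-123"},
             {"status": "found"}, 85.0)
trace.record("decision", "route", {"confidence": 0.62},
             {"path": "escalate_to_human"}, 1.0)
report = trace.summary()
```

Keeping the full inputs and outputs in the stored record while truncating only in the rendered view preserves the detail needed for investigation without forcing reviewers to read it all.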
Layer 3 — Quality evaluation (the hardest layer). Traces tell you what the agent did. Quality evaluation tells you whether what the agent did was correct — a fundamentally harder question, because “correct” for a complex agentic task often requires genuine expert judgment to assess, not a simple comparison against a ground-truth label. Several complementary approaches are necessary rather than any single method being sufficient:
Human evaluation sampling — a regularly-drawn sample of completed tasks reviewed by qualified domain experts against a defined rubric. This is the most direct measure of actual quality, the most expensive, and the most resistant to the failure modes that affect automated evaluation. The sample should be specifically designed to oversample the kinds of cases most likely to be challenging: edge cases, high-stakes decisions, cases where the agent’s confidence might not reliably track its accuracy.
LLM-as-judge evaluation — using a separate, capable language model to evaluate agent outputs against defined criteria, at higher volume than human review alone can sustain. This approach has real limitations (an LLM judge can be wrong, can be inconsistently calibrated, and can miss exactly the subtle failure modes that matter most in domain-specific applications) and should be used to supplement human evaluation at scale, not replace it. The LLM judge’s own quality needs to be validated periodically against human judgments on the same cases.
Automated consistency checks — validating that the agent’s outputs satisfy logical, structural, or factual constraints that can be verified programmatically: the correct output format is produced, stated policy references match the actual retrieved policy document, required fields are present, numerical outputs fall within plausible ranges. These checks can run at high volume and catch a useful category of errors quickly, but they say nothing about the harder quality dimensions that require genuine judgment to assess.
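The automated consistency checks described above can be sketched as a single function that returns a list of failures. The field names (`decision`, `amount`, `policy_refs`), the plausible amount range, and the policy identifiers are hypothetical stand-ins; the actual constraints depend entirely on the use case.

```python
REQUIRED_FIELDS = ("decision", "amount", "policy_refs")  # hypothetical output schema

def check_output(output, retrieved_policy_ids, amount_range=(0.0, 1_000_000.0)):
    """Return a list of human-readable check failures (empty means all passed)."""
    failures = []

    # Structural check: every required field must be present.
    for name in REQUIRED_FIELDS:
        if name not in output:
            failures.append(f"missing field: {name}")

    # Range check: numeric outputs must fall within a plausible range.
    amount = output.get("amount")
    if amount is not None:
        if isinstance(amount, bool) or not isinstance(amount, (int, float)):
            failures.append("amount is not numeric")
        elif not amount_range[0] <= amount <= amount_range[1]:
            failures.append(f"amount out of range: {amount}")

    # Grounding check: cited policies must appear in the retrieved context.
    for ref in output.get("policy_refs", []):
        if ref not in retrieved_policy_ids:
            failures.append(f"policy reference not in retrieved context: {ref}")

    return failures

good = check_output(
    {"decision": "approve", "amount": 2500.0, "policy_refs": ["P-123"]},
    {"P-123", "P-456"},
)
bad = check_output(
    {"decision": "approve", "amount": -5, "policy_refs": ["P-999"]},
    {"P-123"},
)
```

Returning every failure, not just the first, makes the results easier to aggregate into per-check failure rates over time.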
Layer 4 — Drift detection (the continuous vigilance layer). A system that was performing well at launch can silently degrade over time through several mechanisms: the distribution of real-world inputs shifts away from the distribution the system was developed and tested against (input drift); the performance of the underlying language model changes through a provider-side update (model drift); the knowledge base becomes stale or introduces conflicting information (knowledge drift); or the external tools the agent calls begin returning different results than they previously did (tool drift). Catching any of these requires monitoring specifically designed to detect distributional change rather than simply tracking average performance metrics, which can remain stable while the distribution of errors shifts in ways that matter enormously.
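One common way to quantify distributional change is the Population Stability Index (PSI), which compares how a numeric signal is spread across bins in a baseline window versus a current window. The sketch below is one possible implementation, not a prescribed method: the equal-width binning, the smoothing constant, and the sample data are assumptions, and the signal could be anything measurable per task, such as input length or number of tool calls. A frequently quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but those cut-offs are conventions and should be calibrated against what counts as meaningful drift for the use case.

```python
import math

def psi(baseline, current, bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Bins are equal-width over the baseline's range; current values outside
    that range are clamped into the first or last bin.
    """
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp into the edge bins
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm below is defined.
        return [max(c / len(values), eps) for c in counts]

    p, q = proportions(baseline), proportions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))

# Illustrative signal: a per-task measurement spread evenly over 0..99,
# then the same signal shifted upward by 50.
baseline = [i % 100 for i in range(1000)]
shifted = [50 + i % 100 for i in range(1000)]

same_score = psi(baseline, baseline)
shift_score = psi(baseline, shifted)
```

Because PSI compares whole distributions, it can surface a shift even when a headline average metric has barely moved.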
Building an Evaluation Dataset That Actually Reflects Production
One of the most common evaluation failures in production agentic systems is an evaluation dataset that reflects what developers expected users to ask, rather than what users actually ask. These two things diverge significantly, consistently, and in ways that are predictable in hindsight but hard to fully anticipate in advance. The practical solution: treat the production trace store as a continuous source of evaluation candidates, with a systematic process for identifying interesting or challenging cases from real production traffic (flagged by the quality monitoring layer, surfaced by anomaly detection, or drawn through stratified random sampling) and incorporating them into the evaluation dataset on an ongoing basis. An evaluation dataset that grows and evolves with real production patterns is dramatically more useful than one frozen at the development stage.
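A minimal sketch of that selection process, assuming each stored trace is a dictionary with hypothetical `id`, `intent`, and `flagged` keys: every flagged trace is kept, and the remainder is sampled evenly per intent so that rare request types are not drowned out by common ones.

```python
import random
from collections import defaultdict

def stratified_sample(traces, key, per_stratum, seed=0):
    """Draw up to `per_stratum` traces from each stratum defined by `key`."""
    rng = random.Random(seed)  # seeded so the draw is reproducible
    groups = defaultdict(list)
    for t in traces:
        groups[key(t)].append(t)
    sample = []
    for stratum in sorted(groups):
        members = groups[stratum]
        k = min(per_stratum, len(members))
        sample.extend(rng.sample(members, k))
    return sample

def eval_candidates(traces, per_stratum=5, seed=0):
    """Keep every flagged trace, then add a stratified sample of the rest."""
    flagged = [t for t in traces if t.get("flagged")]
    rest = [t for t in traces if not t.get("flagged")]
    return flagged + stratified_sample(rest, lambda t: t["intent"], per_stratum, seed)

# Illustrative trace store: 100 traces, 10% of them "dispute", two flagged.
traces = [
    {"id": i, "intent": "dispute" if i % 10 == 0 else "balance", "flagged": i in (3, 47)}
    for i in range(100)
]
candidates = eval_candidates(traces)
```

In this sketch the candidates would still go through human review and labelling before entering the evaluation dataset; the sampling only decides what reviewers look at.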
Organisational Ownership: Who Is Responsible for What
A recurring difference between organisations that maintain production agentic system quality well and those that maintain it poorly is how clearly ownership is assigned for each layer of the observability stack. Infrastructure observability typically sits naturally with the platform or DevOps team and is rarely a gap. Trace-level observability more often falls between owners: it requires instrumentation from the engineering team, storage and tooling decisions from the platform team, and actual use by the product and quality teams, and when that coordination isn’t explicit, it tends not to happen fully. Quality evaluation sampling requires genuine domain expertise to conduct correctly, so ownership needs to sit with people who understand what correct agent behaviour looks like for the specific use case, not with a generic quality engineering team. Drift detection requires both data science skill (to design appropriate statistical tests) and domain knowledge (to distinguish meaningful drift from acceptable variation), and often needs a dedicated owner to ensure it actually runs on schedule rather than languishing as a recurring low-priority backlog item.
Connecting Observability to Continuous Improvement
The purpose of this entire observability infrastructure is not just to know when something is wrong; it’s to create the feedback loops that allow the system to get better over time rather than remaining static. Concretely: human evaluation results feed into prompt refinement, knowledge base updates, and the identification of cases that should trigger automatic escalation. Drift detection triggers investigation into root causes and, where appropriate, formal model risk reviews under the MRM framework discussed in the previous post. Anomaly detection in the trace layer surfaces edge cases that reveal gaps in testing coverage and should be incorporated into future test suites. All of this requires making the observability outputs genuinely actionable: not dashboards that are glanced at occasionally and generate no follow-through, but signals integrated into the development and operations workflow in a way that creates genuine accountability for acting on what the monitoring reveals.
A Maturity Model for Agentic AI Observability
Teams building or evaluating their own observability capability tend to fall somewhere on a spectrum worth naming explicitly:
Level 1 — Infrastructure only. Latency and error rates are monitored. No visibility into reasoning, quality, or drift. Common at initial production deployment; dangerously insufficient for sustained production reliance.
Level 2 — Traces stored, rarely reviewed. Trace-level data is captured but lives in a data store that nobody has good tooling to actually use. Quality evaluation is ad hoc and infrequent. Drift goes largely undetected until it’s obvious from business metrics.
Level 3 — Active quality monitoring. Regular human evaluation sampling with documented results. LLM-judge evaluation running at volume. Traces are queryable and reviewed when issues are suspected. Drift detection exists but may be incomplete in coverage.
Level 4 — Continuous improvement loops. Evaluation results systematically feed back into development priorities. Drift detection is automated and triggers defined response processes. The evaluation dataset is continuously refreshed from production traffic. Observability is a first-class capability with clear organisational ownership and genuine influence over how the system evolves.
Mature production deployments in regulated industries should be aiming for Level 4, while most honest assessments of where organisations currently sit would land somewhere between Level 1 and Level 3. In most organisations, then, the gap between the current state and the standard that consequential, regulated production agentic AI systems deserve is still meaningful, and worth closing with genuine urgency.
This is the final post in the Expert series. Across thirty posts — Elementary, Intermediate, and Expert — this series has covered the conceptual foundations of GenAI and Agentic AI through to production architecture, regulatory compliance, security, governance, cost engineering, and observability, with a consistent focus on how these technologies apply in banking, financial services, and insurance. Thank you for reading.
