TL;DR
- AI agent observability is distinct from traditional APM: agents fail non-deterministically, through reasoning errors and policy violations, not just crashes and latency spikes.
- The full instrumentation stack runs from model inference through orchestration, tool calls, data access, output rendering, and the customer-visible surface.
- Multi-agent systems require span propagation across agent boundaries, making failure attribution qualitatively harder than in single-agent architectures.
- Most observability stacks stop at agent behaviour and miss the output layer: what data was surfaced to which customer, under what permissions, and whether access was logged.
- When an agent starts touching customer data in a multi-tenant product, lightweight tracing is no longer sufficient; governance and observability must be treated as a coupled system.
What is AI agent observability?
AI agent observability is the practice of instrumenting, monitoring, and evaluating the behaviour of autonomous AI agents throughout their full execution stack. That stack runs from model inference and tool calls through data access and customer-visible outputs.
Unlike traditional application observability, which tracks deterministic code paths and infrastructure metrics, AI agent observability must account for non-deterministic reasoning, multi-step decision chains, and dynamic tool use. It also has to address the downstream trust implications of what the agent produces.
The scope extends beyond the model layer. It includes what the agent retrieved, what it decided to surface, to whom it surfaced it, and whether that access was governed and auditable.
Traditional APM tools like Datadog or New Relic excel at latency histograms, error rates, and infrastructure saturation: problems with knowable answer spaces. An AI agent operating on customer data introduces a different category of uncertainty. The same prompt can produce different tool calls, different retrieval paths, and different outputs across runs.
OpenTelemetry provides a vendor-neutral instrumentation foundation that many agent frameworks now build on, but the specification was designed for services, not reasoning loops. Capturing a trace that faithfully represents an agent's decision chain requires additional semantic layers that standard span hierarchies do not carry by default.
That gap is where AI agent observability begins.
Traditional observability vs AI agent observability
Traditional APM tools such as Datadog, and services instrumented with OpenTelemetry, excel at answering deterministic questions: did the service respond, how fast, and where did latency spike? AI agent observability asks different questions, because the failure modes are different.
| Dimension | Traditional / APM observability | AI agent observability |
|---|---|---|
| Primary failure mode | Service crash, timeout, error rate | Wrong output, hallucination, skipped tool call |
| Determinism | Same input → same output | Same input → variable output |
| What breaks silently | Rarely; errors surface as exceptions | Frequently; bad reasoning produces no exception |
| Core signal | Latency, throughput, error rate | Trace quality, evaluation score, output accuracy |
| Auditability unit | Request / span | Reasoning step, tool call, retrieved document |
| Failure detection | Automated alerting on thresholds | Requires evaluation pipelines and human review |
| Multi-tenancy concern | Request isolation | Data access scope, row-level permission boundaries |
What to instrument in a production AI agent
Knowing observability matters is not the same as knowing where to put the probes. A production AI agent touches at least six distinct layers, and each one has different failure modes, different owners, and different consequences when something goes wrong.
OpenTelemetry's semantic conventions for generative AI (released in experimental status in 2024) provide a starting point for standardising spans across the model inference layer. Tools like LangSmith cover that layer well for LangChain-based systems, offering detailed token-level traces and evaluation hooks out of the box. The gap is that most instrumentation frameworks stop at the orchestration boundary and never reach the data access or output rendering layers.
Here is what to capture at each layer:
- Model inference: prompt text (or a hash where PII is present), token counts, latency, model version, temperature settings, and finish reason. Cost per call is optional early on; it becomes mandatory at scale.
- Orchestration: step sequence, branching decisions, tool selection rationale, retries, and total wall-clock time across the full chain.
- Tool calls: which tool was invoked, with what parameters, and what it returned. For database or API tools, log the query or request shape — not just a success/failure flag.
- Data access: which dataset or table was queried, under which identity, and whether row-level security or access policies were evaluated. This is the layer most teams skip.
- Output rendering: what the agent produced — the chart type, the metric definition used, the summary text — and which user or tenant received it.
- Customer-visible surface: confirmation that the output matched the user's permission scope, with a persistent audit record.
The last two layers are where customer-facing trust is either established or broken.
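As a minimal sketch of how the layers listed above could share one record shape, the following uses only the Python standard library. The `LayerEvent` class, the layer names, the attribute keys, and the model version string are all illustrative assumptions, not part of any tracing standard or vendor API.

```python
import hashlib
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Any

# Hypothetical layer names mirroring the list above.
LAYERS = (
    "model_inference", "orchestration", "tool_call",
    "data_access", "output_rendering", "customer_surface",
)


def prompt_fingerprint(prompt: str) -> str:
    """Return a SHA-256 hex digest so PII-bearing prompts are never stored raw."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


@dataclass
class LayerEvent:
    """One record per layer per step, always tied to a trace and a tenant."""
    trace_id: str
    layer: str
    tenant_id: str
    attributes: dict[str, Any] = field(default_factory=dict)
    timestamp: float = field(default_factory=time.time)


def emit(event: LayerEvent, sink: list[str]) -> None:
    """Reject unknown layer names, then append the event as one JSON line."""
    if event.layer not in LAYERS:
        raise ValueError(f"unknown layer: {event.layer}")
    sink.append(json.dumps(asdict(event), sort_keys=True))


log: list[str] = []
emit(LayerEvent("t-1", "model_inference", "acme", {
    "prompt_sha256": prompt_fingerprint("Revenue for jane@example.com?"),
    "model_version": "example-model-v1",  # hypothetical value
    "input_tokens": 412,
    "output_tokens": 96,
    "finish_reason": "stop",
}), log)
emit(LayerEvent("t-1", "data_access", "acme", {
    "table": "invoices",
    "identity": "svc-agent",
    "rls_evaluated": True,
}), log)
```

Keeping `tenant_id` as a required field on every event, rather than an optional attribute, is one way to make the data access and output layers hard to skip.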
Observability in multi-agent systems
Single-agent tracing is a solved problem compared to what multi-agent architectures demand.
When one agent delegates to another (a planner spawning a retrieval agent, which in turn calls a summarisation agent), the causal chain spans multiple independent processes, often with non-deterministic execution order. Standard distributed tracing protocols handle span propagation across service boundaries reasonably well. But agent-to-agent calls introduce a layer that infrastructure tracing was not designed for: reasoning state, intermediate outputs, and tool-use decisions that each agent makes autonomously.
If a downstream agent produces a hallucinated result, tracing the failure back to the originating agent requires that every span in the chain carries consistent trace context. It also requires that each agent logs not just its inputs and outputs but the specific tool calls and model responses that shaped its behaviour. LangSmith handles this reasonably for LangChain-native orchestration, offering a thread-level trace hierarchy that groups agent interactions into coherent sessions.
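The propagation requirement can be sketched in a few lines, assuming three in-process agents and a hand-rolled context object. The `TraceContext` class and the agent functions are hypothetical; the ID sizes loosely follow the W3C Trace Context convention (16-byte trace ID, 8-byte span ID), and a real system would more likely rely on an OpenTelemetry SDK than on this stand-in.

```python
import secrets
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TraceContext:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str] = None

    @staticmethod
    def new_root() -> "TraceContext":
        return TraceContext(secrets.token_hex(16), secrets.token_hex(8))

    def child(self) -> "TraceContext":
        """Same trace, new span, parent link pointing at the caller."""
        return TraceContext(self.trace_id, secrets.token_hex(8), self.span_id)


SPANS: list[dict] = []


def record(ctx: TraceContext, agent: str, tool_calls: list, output) -> None:
    """Log the tool calls and output alongside the trace context."""
    SPANS.append({
        "trace_id": ctx.trace_id, "span_id": ctx.span_id,
        "parent_span_id": ctx.parent_span_id,
        "agent": agent, "tool_calls": tool_calls, "output": output,
    })


def summariser(parent: TraceContext, docs: list[str]) -> str:
    ctx = parent.child()
    summary = f"{len(docs)} documents summarised"
    record(ctx, "summariser", [], summary)
    return summary


def retriever(parent: TraceContext, query: str) -> str:
    ctx = parent.child()
    docs = ["doc-1", "doc-2"]  # stand-in for a real retrieval call
    record(ctx, "retriever", [{"tool": "search", "query": query}], docs)
    return summariser(ctx, docs)


def planner(query: str) -> str:
    ctx = TraceContext.new_root()
    result = retriever(ctx, query)
    record(ctx, "planner", [], result)
    return result


def chain_to_root(span_id: str) -> list[str]:
    """Walk parent links from any span back to the originating agent."""
    by_id = {s["span_id"]: s for s in SPANS}
    chain, current = [], by_id.get(span_id)
    while current is not None:
        chain.append(current["agent"])
        current = by_id.get(current["parent_span_id"])
    return chain


planner("q3 revenue by region")
leaf = next(s for s in SPANS if s["agent"] == "summariser")
lineage = chain_to_root(leaf["span_id"])
```

Here `lineage` walks from the summariser back through the retriever to the planner, which is the attribution path needed when the final output is wrong.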
The harder problem is shared state. When two agents read from or write to the same memory or retrieval store mid-workflow, attribution of errors becomes genuinely ambiguous. Identifying which agent corrupted state, or which retrieval step introduced a bad context window, requires instrumenting the store itself, not just the agents querying it.
Consider a concrete failure: a planner agent and a summarisation agent both query the same retrieval store within milliseconds of each other. The summarisation agent returns a figure that looks plausible but is drawn from a stale context window the planner had already refreshed. The trace shows two clean spans. It does not show that they read inconsistent state. Without store-level instrumentation, you cannot reconstruct what happened.
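One way to make that inconsistency visible is to version every write and log the version each read observed. The sketch below is an in-memory stand-in under simplifying assumptions: `InstrumentedStore`, `StoreAccess`, and the key names are hypothetical, and the stale read is modelled as a read that happens before the planner's refresh rather than a cached context window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoreAccess:
    trace_id: str
    agent: str
    op: str       # "read" or "write"
    key: str
    version: int  # version observed (read) or produced (write)


class InstrumentedStore:
    """In-memory key-value store that versions writes and logs every access."""

    def __init__(self) -> None:
        self._data: dict[str, tuple[int, str]] = {}
        self.accesses: list[StoreAccess] = []

    def write(self, trace_id: str, agent: str, key: str, value: str) -> int:
        version = self._data.get(key, (0, ""))[0] + 1
        self._data[key] = (version, value)
        self.accesses.append(StoreAccess(trace_id, agent, "write", key, version))
        return version

    def read(self, trace_id: str, agent: str, key: str) -> str:
        version, value = self._data[key]
        self.accesses.append(StoreAccess(trace_id, agent, "read", key, version))
        return value


def inconsistent_reads(accesses: list[StoreAccess], trace_id: str) -> dict[str, list[int]]:
    """Return keys that were read at more than one version within one trace."""
    seen: dict[str, set[int]] = {}
    for access in accesses:
        if access.trace_id == trace_id and access.op == "read":
            seen.setdefault(access.key, set()).add(access.version)
    return {key: sorted(vs) for key, vs in seen.items() if len(vs) > 1}


store = InstrumentedStore()
store.write("seed", "loader", "ctx:revenue", "Q2 figures")
stale = store.read("t-1", "summariser", "ctx:revenue")   # observes version 1
store.write("t-1", "planner", "ctx:revenue", "Q3 figures")
fresh = store.read("t-1", "planner", "ctx:revenue")      # observes version 2
conflicts = inconsistent_reads(store.accesses, "t-1")
```

Both reads succeed and would appear as clean spans; only the access log shows that the two agents worked from different versions of the same key.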
The output-layer gap: what most observability stacks miss
A SaaS product ships an AI assistant that generates revenue summaries for enterprise customers. The agent trace looks clean: tool calls resolved, latency within budget, no hallucinations flagged. What the trace does not show is that a misconfigured access policy caused the agent to query a dataset scoped to a different tenant, and the summary it rendered contained that tenant's figures. The agent behaved correctly. The output was wrong in ways that mattered.
This is the output-layer gap. Most production observability stacks, including well-regarded tools like LangSmith, which excels at trace capture, evaluation pipelines, and prompt versioning, instrument agent behaviour thoroughly. They tell you what the agent did. They do not tell you what the agent surfaced to which customer, under which permissions, with which data definitions in scope at query time.
The distinction is consequential. Agent-behaviour observability covers reasoning steps, tool call sequences, token usage, and latency. Data-output observability covers what a specific user in a specific tenant context was actually shown, whether row-level security held at the data layer, and whether the metrics used were the governed definitions or a raw, ungoverned query result.
For internal agents, this gap is uncomfortable. For agents powering customer-facing analytics inside a multi-tenant product, it is a trust and compliance failure waiting to happen.
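A minimal sketch of an output-layer check, assuming each result row carries a `tenant_id` column and that the session's tenant is known before rendering. `Session`, `TenantScopeViolation`, and `check_tenant_scope` are hypothetical names. A check like this is a last line of defence and a source of audit records, not a substitute for row-level security at the data layer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    user_id: str
    tenant_id: str


class TenantScopeViolation(Exception):
    """Raised when a result set contains rows from another tenant."""


def check_tenant_scope(session: Session, rows: list[dict], audit_log: list[dict]) -> list[dict]:
    """Record what is about to be shown, and refuse to render foreign rows."""
    foreign = sorted({r["tenant_id"] for r in rows if r["tenant_id"] != session.tenant_id})
    audit_log.append({
        "user_id": session.user_id,
        "tenant_id": session.tenant_id,
        "row_count": len(rows),
        "foreign_tenants": foreign,
        "allowed": not foreign,
    })
    if foreign:
        raise TenantScopeViolation(
            f"rows from {foreign} reached a session scoped to {session.tenant_id!r}"
        )
    return rows


audit: list[dict] = []
session = Session(user_id="u-17", tenant_id="acme")
safe_rows = check_tenant_scope(session, [{"tenant_id": "acme", "revenue": 1200}], audit)
```

The audit entry is written before the exception is raised, so a blocked render still leaves a record of what almost reached the customer.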
Data-output observability for customer-facing analytics
When an AI agent drives a customer-facing dashboard or generated summary inside a SaaS product, the trust surface shifts substantially. It is no longer enough to know that the agent completed its reasoning correctly.
You need to know which tenant's data it queried, whether row-level security held under query load, which metric definitions it used, and exactly what each customer was shown at what time.
This is data-output observability: logging and auditing not just agent behaviour but the governed output layer beneath it. Most infrastructure-focused tracing tools handle the model and orchestration layers well. What those tools do not surface is whether the access policy attached to a specific customer's session was enforced, or whether a metric surfaced to one tenant carries the same trusted definition applied to another.
A second scenario illustrates why this matters. An enterprise customer opens a support ticket: their usage summary shows a spike they cannot explain. Your agent trace confirms the query completed without errors. But there is no record of which version of the "active users" metric definition the agent used, because that definition changed between the previous report and this one. The trace tells you the agent ran. It cannot tell you what calculation the customer was shown or why it changed.
Each query an agent issues against a multi-tenant data layer should be attributable to a specific tenant context. Metric definitions should be versioned and auditable, not reconstructed from prompt history. Access policies need to be declared at the data model layer, not inferred from application logic at runtime.
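The versioning requirement might look like the following sketch. `MetricRegistry`, `record_output`, and the SQL-like expression strings are illustrative assumptions; the point is only that each output record pins the exact definition version that produced the number the customer saw.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class MetricDefinition:
    name: str
    version: int
    expression: str


class MetricRegistry:
    """Append-only registry: publishing never overwrites an earlier version."""

    def __init__(self) -> None:
        self._versions: dict[str, list[MetricDefinition]] = {}

    def publish(self, name: str, expression: str) -> MetricDefinition:
        history = self._versions.setdefault(name, [])
        definition = MetricDefinition(name, len(history) + 1, expression)
        history.append(definition)
        return definition

    def latest(self, name: str) -> MetricDefinition:
        return self._versions[name][-1]

    def get(self, name: str, version: int) -> MetricDefinition:
        return self._versions[name][version - 1]


def record_output(audit_log: list[dict], tenant_id: str, user_id: str,
                  metric: MetricDefinition, value: float) -> dict:
    """Store what was shown, to whom, and under which definition version."""
    entry = {
        "tenant_id": tenant_id,
        "user_id": user_id,
        "metric": metric.name,
        "metric_version": metric.version,
        "value": value,
        "shown_at": datetime.now(timezone.utc).isoformat(),
    }
    audit_log.append(entry)
    return entry


registry = MetricRegistry()
outputs: list[dict] = []

# Illustrative expressions only, not a real schema.
registry.publish("active_users", "COUNT(DISTINCT user_id) WHERE seen_within_days <= 30")
record_output(outputs, "acme", "u-17", registry.latest("active_users"), 1840)

registry.publish("active_users", "COUNT(DISTINCT user_id) WHERE seen_within_days <= 7")
record_output(outputs, "acme", "u-17", registry.latest("active_users"), 2210)
```

With records like these, the support ticket in the scenario above becomes answerable: the two reports carry different `metric_version` values, and `registry.get` returns the expression behind each one.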
Platforms like Embeddable handle this by keeping access policies and row-level security as explicit, code-defined constructs that persist regardless of how the query was initiated. For broader context, the pieces on what embedded analytics is and on examples of user-facing analytics provide useful grounding.
Governance and auditability: observability is not enough
Observability tells you what happened. Governance determines what was allowed to happen. For production systems, you need both, and they must be coupled, not bolted on separately.
Tracing an agent's tool calls is valuable. But if those calls accessed a customer's data without enforcing row-level security, the trace is a forensic record of a breach, not a control.
Audit trails, environment promotion gates, saved metric definitions, and role-aware access policies form the governance layer that makes observability actionable rather than merely descriptive. OpenTelemetry (the CNCF's vendor-neutral instrumentation standard) handles trace collection well; it has no opinion on whether the data access the agent performed was authorised for that tenant.
A third failure pattern makes this concrete. An agent powering a self-serve analytics feature receives a prompt asking for a breakdown by region. The model selects a tool that queries a table the user's role is not permitted to access directly. The agent returns a plausible chart. No exception is raised. The trace shows a successful tool call. Only a governance layer that evaluated the access policy at query time, and blocked or redacted the result, would have prevented that data from reaching the customer.
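A query-time gate for that scenario could be sketched as below. The `POLICY` mapping, role names, and table names are hypothetical, and a production system would evaluate policy at the data layer rather than in a Python wrapper; the sketch only shows the difference between logging a call and deciding whether it may run.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical role-to-table policy, declared up front rather than inferred at runtime.
POLICY: dict[str, set[str]] = {
    "analyst": {"orders_summary"},
    "admin": {"orders_summary", "orders_raw"},
}


@dataclass(frozen=True)
class Decision:
    role: str
    table: str
    allowed: bool


decisions: list[Decision] = []


def governed_query(role: str, table: str, run_query: Callable[[str], list]) -> list:
    """Evaluate the policy before the tool runs, and log the decision either way."""
    allowed = table in POLICY.get(role, set())
    decisions.append(Decision(role, table, allowed))
    if not allowed:
        raise PermissionError(f"role {role!r} may not query {table!r}")
    return run_query(table)


def fake_query(table: str) -> list:
    """Stand-in for the real data access tool."""
    return [{"table": table, "region": "EMEA", "total": 42}]


rows = governed_query("analyst", "orders_summary", fake_query)

try:
    governed_query("analyst", "orders_raw", fake_query)
    denied = False
except PermissionError:
    denied = True
```

The denied call never reaches `fake_query`, so there is no result to render, and the `decisions` log holds both the allowed and the blocked attempt.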
That is why governance must be structural. Platforms like Embeddable address this by coupling access policies and row-level security directly to the data layer agents query, so governance is enforced at the source rather than applied as an afterthought in application logic.
Production best practices and the full-observability threshold
Lightweight instrumentation is often enough for internal agents in early development. The threshold shifts the moment an agent touches customer data in a multi-tenant product: at that point, partial visibility is a liability, not a pragmatic tradeoff.
A practical investment framework, in order of priority:
1. Instrument spans and traces first. OpenTelemetry-compatible tracing (supported natively by tools like Arize Phoenix and LangSmith) gives you the foundation everything else builds on. Without it, evaluation and alerting have nothing to anchor to.
2. Add evaluation pipelines before go-live. Automated evals should run against a fixed golden dataset on every deploy. According to DORA research, teams that integrate quality gates into CI/CD detect regressions significantly earlier than those relying on post-production monitoring.
3. Set cost and latency budgets with hard alerts. Per-request token spend and p95 latency should have explicit thresholds; silent cost overruns are a common production failure mode.
4. Schedule red-teaming cadences. Adversarial testing should recur quarterly at minimum, because model updates and prompt changes both reopen previously closed attack surfaces.
5. Add output-drift alerting last, not first. It is the most valuable signal in production but requires stable traces and evaluations to distinguish genuine drift from instrumentation noise.
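The evaluation and budget steps can be combined into a single deploy gate, sketched here under stated assumptions: exact-match scoring, a nearest-rank p95, and placeholder thresholds. `GoldenCase`, `deploy_gate`, and the default limits are hypothetical and would need tuning per product.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    prompt: str
    expected: str


def pass_rate(agent: Callable[[str], str], cases: list[GoldenCase]) -> float:
    """Fraction of golden cases the agent answers exactly (after stripping)."""
    passed = sum(1 for case in cases if agent(case.prompt).strip() == case.expected)
    return passed / len(cases)


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile, using integer arithmetic for the rank."""
    if not values:
        raise ValueError("p95 needs at least one value")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def deploy_gate(agent: Callable[[str], str], cases: list[GoldenCase],
                latencies_ms: list[float], tokens_per_request: list[int],
                min_pass_rate: float = 0.9,    # illustrative threshold
                max_p95_ms: float = 2000.0,    # illustrative threshold
                max_tokens: int = 4000) -> list[str]:
    """Return a list of failure messages; an empty list means the deploy may proceed."""
    failures = []
    rate = pass_rate(agent, cases)
    if rate < min_pass_rate:
        failures.append(f"eval pass rate {rate:.2f} is below {min_pass_rate:.2f}")
    latency = p95(latencies_ms)
    if latency > max_p95_ms:
        failures.append(f"p95 latency {latency:.0f} ms exceeds {max_p95_ms:.0f} ms")
    worst = max(tokens_per_request)
    if worst > max_tokens:
        failures.append(f"request used {worst} tokens, budget is {max_tokens}")
    return failures


golden = [GoldenCase("2+2?", "4"), GoldenCase("Capital of France?", "Paris")]
canned = {"2+2?": "4", "Capital of France?": "Paris"}


def stub_agent(prompt: str) -> str:
    """Stand-in for the real agent call."""
    return canned.get(prompt, "unknown")


problems = deploy_gate(stub_agent, golden, [120.0] * 20, [800] * 20)
```

In a CI job, a non-empty `problems` list would fail the build. Exact-match scoring is the simplest possible evaluator and is rarely sufficient for free-text outputs, so the scoring function is the first thing to replace.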
Full observability is not free. Expect meaningful engineering overhead to instrument, maintain, and triage at each layer, and budget for it explicitly.
Frequently Asked Questions
What is the difference between AI agent observability and traditional application observability?
Traditional observability tracks deterministic code paths using latency, error rates, and throughput. AI agent observability must account for non-deterministic reasoning, multi-step tool use, and outputs that can be wrong without throwing an exception. The failure modes are qualitatively different, so the instrumentation must be too.
What should you instrument in a production AI agent?
At minimum: model inference traces, tool call logs, retrieved documents, output content, data access scope, per-tenant permission boundaries, and cost and latency per run. Evaluation scores against known-good outputs are also essential once the agent touches production customer data.
How does observability work in multi-agent systems where agents call other agents?
Each sub-agent call must carry a shared trace context so the full reasoning chain can be reconstructed as a single hierarchy rather than disconnected spans. Without propagated trace IDs, you lose the ability to attribute a bad output to the specific step that caused it.
Why is model-level tracing insufficient when an AI agent surfaces data to end customers?
Tracing tells you what the agent did, but not whether the data it accessed was permitted for that specific customer or tenant. In a multi-tenant product, you also need to log what was shown to whom, under which access policies, and whether row-level security held at query time.
When does a production AI system need full observability versus lightweight instrumentation?
Lightweight tracing is usually sufficient for internal prototypes or single-user tools. The threshold shifts once an agent queries customer data in a multi-tenant product. At that point, per-tenant access logging, evaluation pipelines, and auditable output records become non-negotiable rather than optional.


