Agent Observability Stack — APM Misses Hallucinations
AI agents fail silently. Fluent output, wrong answer, no error code. This is the stack that closes that blind spot — OpenTelemetry, LangSmith, Sentry, and hard cost...
The stack (6 tools)
- OpenTelemetry + OpenLLMetry: CNCF-governed standard prevents vendor lock-in; gen_ai semantic conventions work across every downstream platform
- LangSmith: Native LangChain/LangGraph integration, P50/P99 per-span latency, multi-turn evals on production traffic
- Langfuse: MIT license, ClickHouse backend for high-cardinality queries, full data residency control
- Sentry AI Agent Monitoring: Connects agent spans to infrastructure root causes — slow Postgres, API timeouts — in a single trace
- Galileo Luna-2: Sub-200ms latency makes 100% production traffic scoring economically viable for the first time
- Braintrust: Production failures auto-convert to regression test cases; Loop generates scorers from plain-language descriptions
TL;DR
- Standard APM (Datadog, New Relic) cannot detect agent failures — hallucinations and wrong tool calls return HTTP 200
- OpenTelemetry + OpenLLMetry provides vendor-neutral instrumentation using gen_ai semantic conventions across every trace backend
- LangSmith for LangChain-heavy stacks; Langfuse self-hosted for data residency requirements — both accept OTLP
- Galileo Luna-2 enables 100% production traffic evaluation at $0.02/M tokens and sub-200ms latency — the eval economics finally work
- Braintrust closes the feedback loop: production failures become CI/CD regression tests automatically
- This stack fires after code is in production — it does not replace pre-merge review or local debugging
The observability problem for AI agents is structurally different from anything we’ve dealt with before: failures look like success. A hallucinated API call that returns a well-formed result doesn’t surface in your error budget — it surfaces in your next customer escalation. Your APM stack tracks latency and HTTP status codes. It cannot tell you whether your agent called the wrong tool, produced a fluent but incorrect output, or silently burned $400 in tokens on a task that should have cost $2.
The standard response is to bolt on a tracing SDK and call it done. That only gives you what happened, not whether it was correct. The stack I’d actually run in 2026 layers OpenTelemetry for infrastructure continuity, a dedicated agent trace platform for reasoning visibility, and continuous evals on production traffic — not just in CI. If you’re not running evals in prod, you’re flying blind after every deploy.
Why Standard APM Fails for Agents
Consider what your APM dashboard shows when an agent processes a request: one HTTP 200, perhaps 800ms latency, maybe a database query. What it doesn’t show is that the agent made 15 sequential LLM calls, chose the wrong retrieval tool on call 7, hallucinated a document reference, and produced a coherent-sounding but factually wrong answer — all within that 800ms window.
Traditional APM was built for deterministic services. Request in, logic runs, response out. Failure modes are timeouts, exceptions, and bad status codes. Agents break this model in two specific ways.
First, the trace structure is deeply nested in a way APM tools aren’t designed to represent. A single user-facing request might contain agentic spans wrapping LLM spans wrapping retrieval spans wrapping tool execution spans. An agent running a research task will generate 15+ child spans before returning. Most APM tools collapse this or flatten the hierarchy into something unreadable.
Second — and this is the critical difference — agents fail semantically, not technically. The HTTP handshake succeeds. The JSON is valid. The response parses correctly. The failure is that the content is wrong, the tool selection was suboptimal, or the reasoning chain drifted from the user’s intent. There is no error code for this. Your error budget stays clean while your users experience degraded quality. You cannot close this gap with better dashboards on your existing APM. It requires a different instrumentation model.
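The distinction is easy to see in code. This is a toy sketch (all names are illustrative, not from any SDK): every check a standard APM performs passes, while a simple grounding check — standing in for a real evaluator model — catches the hallucinated citation.

```python
# Sketch: a "successful" agent response that fails semantically.
# All function and field names here are illustrative.
import json

def technical_checks_pass(status_code: int, body: str) -> bool:
    """What standard APM sees: status code OK, payload parses."""
    if status_code != 200:
        return False
    try:
        json.loads(body)
        return True
    except json.JSONDecodeError:
        return False

def semantic_check_passes(answer: str, source_docs: list[str]) -> bool:
    """Toy grounding check: the cited doc must actually exist.
    Real systems use an eval model here, not substring matching."""
    return any(doc in answer for doc in source_docs)

body = json.dumps({"answer": "Per doc-99, the rate limit is 10 req/s."})
docs = ["doc-1", "doc-2"]  # doc-99 was hallucinated

assert technical_checks_pass(200, body)       # APM verdict: healthy
assert not semantic_check_passes(json.loads(body)["answer"], docs)  # eval verdict: wrong
```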
Stack Overview
Five layers, each solving a distinct part of the visibility problem:
- OpenTelemetry + OpenLLMetry: Instruments every LLM call, tool invocation, and retrieval step into standardized gen_ai spans — the data foundation everything else depends on
- LangSmith or Langfuse: Stores agent traces and runs online evaluations against production traffic — the reasoning visibility layer
- Sentry AI Agent Monitoring: Connects agent spans to infrastructure context (database latency, API timeouts, frontend errors) — the root cause layer
- Galileo Luna-2: Scores 100% of production traces in real time at sub-200ms latency — the continuous eval layer
- Braintrust: Converts production failures into regression tests and blocks deploys when quality drops — the feedback loop layer
```mermaid
graph TD
A[Agent Request] --> B[OpenTelemetry Collector]
B --> C[LangSmith / Langfuse]
B --> D[Sentry AI Agent Monitoring]
B --> E[Galileo Luna-2 Evaluators]
C --> F[Online Evals + Trace UI]
D --> G[Full-Stack Root Cause]
E --> H[Production Scoring / Guardrails]
H --> I[Braintrust Eval Dataset]
I --> J[CI/CD Regression Gate]
J --> K[PR Block on Quality Drop]
```
Components
OpenTelemetry + OpenLLMetry (OTel 1.37+)
OpenTelemetry is the instrumentation standard you use regardless of which backend you choose. Version 1.37+ ships with gen_ai semantic conventions that standardize how LLM interactions appear in traces: gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens. These attributes work the same whether you’re calling Anthropic, OpenAI, Bedrock, or a local model.
OpenLLMetry, originally built by Traceloop and acquired by ServiceNow in March 2026, auto-instruments 40+ LLM providers with a single import. You don’t write custom span logic for every SDK — it wraps your existing clients and emits OTLP-compatible traces. Instrumentation overhead is under 1ms per LLM call; your 100ms–30s API latency dominates entirely.
| Tool | Difference | Switch if |
|---|---|---|
| Vendor-specific SDKs | Tighter integration, less setup, locks trace format to one backend | You’re certain you’ll never switch backends |
| Manual OTel spans | Full control over trace structure | You have non-standard agent architectures |
LangSmith vs. Langfuse — Pick One
These two tools do the same job: store your agent traces and run evaluations against them. The choice is about your deployment model, not features.
LangSmith offers native integration with LangChain and LangGraph, P50/P99 latency breakdowns per span, and multi-turn eval support that covers conversation-level quality. The Insights Agent auto-clusters usage patterns to surface anomalies. Online evals run asynchronously — they score production traces without adding latency to user requests. Cloud-first; self-hosting requires an enterprise agreement.
Langfuse is the right choice when LangSmith’s cloud-first model conflicts with your data residency requirements. The January 2026 acquisition by ClickHouse has strengthened the self-hosting story, not weakened it — ClickHouse’s official statement: “Our roadmap stays the same…we remain committed to open source and self-hosting.” Langfuse was already running on ClickHouse internally; the acquisition means officially supported single-stack deployment. The OTLP ingestion endpoint (/api/public/otel) accepts standard OTel traces without modification. MIT license, unlimited self-hosting, 50K units/month on managed cloud.
Choose LangSmith if your team runs LangChain or LangGraph and eval workflow maturity matters more than data residency. Choose Langfuse if you need traces to stay on your infrastructure or your trace volume makes LangSmith’s pricing prohibitive at scale.
Sentry AI Agent Monitoring (April 2026)
The gap this fills becomes obvious once you’ve debugged a few agent failures: isolated AI observability creates a different kind of blind spot. You can see that your research agent made 8 LLM calls and cost $0.04. What you can’t see is that the reason it made 8 calls instead of 3 is that your search_docs tool is hitting a slow Postgres query, causing the agent to retry with a rephrased query each time.
Sentry’s April 2026 AI Agent Monitoring connects your gen_ai spans to the full backend context — database queries, external API latency, infrastructure errors — in a single trace view. Python and Node.js SDKs auto-instrument OpenAI and Anthropic clients via OpenAIIntegration() and AnthropicIntegration(). No duplicate instrumentation — it integrates with OTel 1.37+ gen_ai conventions directly.
| Tool | Difference | Switch if |
|---|---|---|
| Datadog APM + LLM Observability | Single-vendor full-stack, more mature APM features | You’re already on Datadog and consolidation value outweighs Sentry’s agent UX |
| Honeycomb | Better for exploratory query analysis | You run custom query-heavy investigations more than exception-driven debugging |
Galileo Luna-2
This is where the eval economics finally work. Running LLM-as-judge scorers on 100% of production traffic has been theoretically correct but practically unaffordable — GPT-3.5-based evals cost $6,248/month for 1M traces. Luna-2, Galileo’s family of fine-tuned Llama 3B/8B evaluators, runs at $0.02 per million tokens with a 152ms average latency. Even running 10–20 metrics simultaneously, Luna-2 stays under 200ms on L4 GPUs. That’s a 97% cost reduction against frontier LLM-as-judge evals — the first time 100% production traffic coverage is economically viable.
For real-time guardrails (blocking bad outputs before they reach users), set mode="blocking" — this adds ~152ms before the response is returned. For async scoring with no latency impact, use mode="async".
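The latency trade-off between the two modes is the whole decision. This sketch is illustrative only — the `evaluate()` client below is a stand-in, not Galileo's actual SDK — but it shows where the ~152ms lands in each mode:

```python
# Illustrative only: blocking vs. async eval placement in the request path.
# evaluate() is a hypothetical stand-in for a Luna-2 scorer call.
import asyncio

async def evaluate(output: str) -> bool:
    await asyncio.sleep(0.152)          # stand-in for ~152ms scoring latency
    return "doc-99" not in output       # toy check standing in for a real evaluator

async def respond(output: str, mode: str = "async") -> str:
    if mode == "blocking":
        # Guardrail: user waits ~152ms extra; bad outputs never ship.
        if not await evaluate(output):
            return "[response withheld by guardrail]"
        return output
    # Async: return immediately; score in the background for dashboards/alerts.
    asyncio.ensure_future(evaluate(output))
    return output
```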
| Tool | Difference | Switch if |
|---|---|---|
| Braintrust online evals | More flexible scorer authoring; higher cost per eval | You need custom eval logic that Luna-2’s pre-trained metrics don’t cover |
| RAGAS | Open-source RAG framework; ~$7,994/month at 1M traces | You’re evaluating RAG pipelines specifically and want full framework control |
Braintrust
Braintrust closes the loop every other tool leaves open: it converts production failures into regression tests automatically. When a production trace fails an eval, you add it to your eval dataset with one click. Those failures run in CI/CD on every PR, and the test blocks the merge if quality drops.
Loop, Braintrust’s AI assistant, generates scorers from plain-language descriptions — write “score whether the agent selected the correct tool for the user’s intent” and Loop generates a production-ready scorer, sample dataset, and test cases. Scorers run both online (production sampling) and offline (full CI suite) from the same code. The free tier covers 1M spans/month — sufficient for agents handling under 500 traces per day.
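The mechanism Braintrust automates is worth seeing in miniature. This pure-Python sketch is hypothetical — every name is invented, and Braintrust wraps all of this in its own SDK and UI — but it shows the loop: a failing production trace is promoted into the regression dataset, and CI gates on that dataset from then on.

```python
# Hypothetical sketch of the production-failure -> regression-test loop.
regression_dataset: list[dict] = []

def score(trace: dict) -> float:
    """Stand-in scorer: 1.0 if the agent picked the expected tool."""
    return 1.0 if trace["tool_called"] == trace["expected_tool"] else 0.0

def ingest_production_trace(trace: dict) -> None:
    """The 'one click': failed evals join the regression dataset."""
    if score(trace) < 1.0:
        regression_dataset.append(trace)

def ci_gate(candidate_agent) -> bool:
    """Block the merge unless the candidate passes every promoted regression."""
    return all(score(candidate_agent(t)) == 1.0 for t in regression_dataset)

# A production failure gets promoted...
ingest_production_trace(
    {"input": "refund policy", "tool_called": "web_search", "expected_tool": "search_docs"}
)

# ...and any future agent version must now pass it in CI.
fixed_agent = lambda t: {**t, "tool_called": t["expected_tool"]}
assert ci_gate(fixed_agent)
```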
Setup: Instrument and Route Traces
The core pattern is: OTel init → collector → OTLP export to your chosen backends. Everything else plugs into this pipe.
```python
# agent_init.py — core instrumentation setup
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.anthropic import AnthropicInstrumentor

# Point at your OTel Collector — collector routes to Langfuse/LangSmith + Sentry
provider = TracerProvider()
exporter = OTLPSpanExporter(endpoint="http://localhost:4317")
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

# Auto-instruments Anthropic SDK; captures gen_ai.usage.* attributes on every LLM call
AnthropicInstrumentor().instrument()
```
The collector config routes a single OTLP stream to multiple backends simultaneously — you update the collector, not your application code, when you add or swap backends. The full collector YAML and cost attribution span processor (which maps token counts to USD at export time using a model pricing table) are in the reference repo.
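A minimal collector pipeline along these lines illustrates the fan-out (a sketch, not the full reference config — exporter endpoints, header names, and env var names are placeholders to verify against each backend's docs):

```yaml
# otel-collector.yaml — fan one OTLP stream out to multiple backends (sketch)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch: {}
exporters:
  otlphttp/langfuse:
    endpoint: https://cloud.langfuse.com/api/public/otel
    headers:
      Authorization: "Basic ${env:LANGFUSE_BASIC_AUTH}"   # base64(public_key:secret_key)
  otlp/sentry:
    endpoint: ${env:SENTRY_OTLP_ENDPOINT}                 # placeholder endpoint
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/langfuse, otlp/sentry]
```

Adding a backend is one new entry under `exporters` plus one name in the pipeline list — no application redeploy.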
Cost attribution is the piece teams skip and regret. The math on why: an agent trace calling Claude 3 Sonnet with 50K input tokens and 20K output tokens costs $0.45. A buggy agent running a retry loop at 1,000 traces per day costs $450/day — and every trace returns HTTP 200. With per-trace cost tracking and a Slack webhook that fires when any single trace exceeds $5, you catch the loop within minutes. Set the thresholds when you build the stack, not after your first incident.
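The attribution logic itself is small. A sketch — the per-model prices below are assumptions (Sonnet-class and 4o-class rates at time of writing) and should come from a maintained pricing table, and `notify_slack` is a stand-in for a real webhook POST:

```python
# Sketch: map gen_ai token counts to USD at export time and alert on outliers.
PRICING_PER_MTOK = {  # (input_usd, output_usd) per million tokens — assumed rates
    "claude-3-sonnet": (3.00, 15.00),
    "gpt-4o": (2.50, 10.00),
}

def notify_slack(message: str) -> None:
    """Stand-in for a real Slack webhook POST."""
    print(message)

def trace_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    in_price, out_price = PRICING_PER_MTOK[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

def check_trace(model: str, input_tokens: int, output_tokens: int,
                threshold_usd: float = 5.0) -> float:
    cost = trace_cost_usd(model, input_tokens, output_tokens)
    if cost > threshold_usd:
        notify_slack(f"Trace cost ${cost:.2f} exceeds ${threshold_usd:.2f} threshold")
    return cost

# The worked example above: 50K in / 20K out at Sonnet-class rates
assert round(trace_cost_usd("claude-3-sonnet", 50_000, 20_000), 2) == 0.45
```

Wire this into a span processor so every exported trace carries a cost attribute, and the $5 alert threshold becomes a one-line check.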
Pricing
| Component | License | Free Tier | Paid from | Note |
|---|---|---|---|---|
| OpenTelemetry | Apache 2.0 | Unlimited | — | Infrastructure costs only (self-host collector) |
| OpenLLMetry | Apache 2.0 | Unlimited | — | ServiceNow-owned; OSS commitment unchanged |
| LangSmith | Commercial | Dev tier | Volume-based | Self-hosting requires enterprise agreement |
| Langfuse | MIT | 50K units/month cloud; unlimited self-hosted | Managed cloud plans | Self-hosting on ClickHouse cheapest at scale |
| Sentry | Commercial | 5K errors/month | $29/month (Team) | AI Agent Monitoring on Team tier+ |
| Galileo Luna-2 | SaaS | Limited trial | ~$0.02/M tokens | No lightweight CPU self-hosting option |
| Braintrust | Commercial | 1M spans/month | $249/month (Pro) | Free tier sufficient for <500 traces/day |
As of April 2026. At 100 traces/day, the free tiers of Braintrust and Langfuse cover you — you pay only for Sentry ($29/month) and Galileo usage. At 10K traces/day with 100% Luna-2 eval coverage and Braintrust Pro, expect $450–800/month.
Galileo Luna-2 runs on Galileo’s SaaS inference infrastructure. If your data residency requirements prohibit sending trace data to Galileo, use Braintrust’s LLM-as-judge evals instead — at higher per-eval cost and latency.
When This Stack Fits
- You’re running autonomous agents in production — agents that take actions or generate outputs without human review on every response. The eval-in-prod layer is non-negotiable here.
- You have more than one LLM provider in your stack — mixing Anthropic, OpenAI, and local models makes cost attribution impossible without per-span tracking. OTel handles this uniformly.
- Your agents make multi-step decisions — any workflow where the agent selects tools, chains LLM calls, or retrieves context needs nested span visibility. Flat APM traces are useless.
- You deploy agent updates frequently — the Braintrust CI/CD gate pays for itself the first time it catches a prompt change that degrades hallucination rate by 15% before it hits production.
When This Stack Does Not Fit
- You’re prototyping — LangSmith’s free tier plus basic OTel instrumentation is sufficient. Don’t configure Luna-2 evaluators until you have production traffic to evaluate.
- Your agents run as offline batch jobs — real-time guardrails and sub-200ms evals don’t add value when there’s no user-facing response to block. Use Braintrust’s offline eval pipeline instead.
- You need pre-merge code review for agent-generated code — this stack starts after the code is deployed and the agent is live. Catching bugs before they merge is a separate problem (see agentic-code-verification-stack).
- Your total trace volume is under 50 traces/day — at that scale, manual trace review in Langfuse’s free tier is faster than building an eval pipeline. Add automation when manual review becomes the bottleneck.
This stack does not replace code review, testing, or local debugging (see ai-debugging-stack-close-the-loop). It fires after your agent is in production and making real decisions. Treat it as your production quality layer — not your development quality layer. Those are different problems with different tools.
The thing nobody tells you about running evals only in CI: you’ll catch 60–70% of regressions before deploy, feel good about your coverage, and miss the rest. Prompt drift, tool selection degradation, and hallucination rate changes under real traffic patterns don’t appear in your test dataset — they appear in production, gradually, until a customer escalation makes the pattern visible. Luna-2 at $0.02/M tokens means there’s no longer a cost reason to accept that blind spot. The economics changed in 2026. Your eval strategy should change with them.