How to Build a Production-Grade Claude Agent Observability Stack Without Paying for SaaS

When a Claude agent burns through 200K tokens on a single run because it got stuck in a tool-calling loop, you find out two days later from a billing alert. By then the user has churned, the bug is buried under newer changes, and the only trace is a print() line you wrote at 11pm last Tuesday.

Claude agent observability is not optional once you ship agents to real users. The real question is whether you pay $99 to $499 a month for LangSmith, Arize, or Helicone, or whether you self-host a stack that gives you 90% of the value for the cost of a small VPS.

This article walks through a free, self-hosted observability stack built around three planes:

Per-run trace logging with OpenTelemetry spans
Cost tracking with input/output token attribution per tool
A three-panel dashboard you can build in either Grafana + Loki or a 50-line FastAPI + HTMX page

By the end you will have a TracingAgentLoop wrapper that emits structured spans for every agent step, a cost calculator keyed to current Claude pricing, and a dashboard that surfaces success rate, p50/p95 latency, and daily spend.

Why agents need different observability than regular services

A traditional FastAPI service has a request, some database calls, maybe a downstream API hit, then a response. The trace is linear and short. Three to five spans tell the whole story.

An agent loop is not linear. One user message can trigger 14 model calls, 9 tool invocations, 3 retries on a flaky search API, and a final synthesis step. If you log each model call as a separate trace, you lose the parent-child structure that tells you "this whole conversation was one logical operation." If you log it as a single trace with no internal structure, you can never answer "which tool was slow on Tuesday's spike?"

The instrumentation has to mirror the agent's actual decision tree: one root span per run, child spans per model call, grandchild spans per tool invocation. OpenTelemetry handles this naturally because spans are already a tree. You just need to be disciplined about parent context.
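To make the parent-context idea concrete, here is a toy sketch that uses only the standard library. It is not the OpenTelemetry API. The span() helper and the finished list are hypothetical names, and contextvars stands in for the "current span" that a real tracer tracks for you.

```python
import contextvars
import itertools
from contextlib import contextmanager

# Toy stand-in for a tracer: the id of whichever span is currently open.
_current_span = contextvars.ContextVar("current_span", default=None)
_ids = itertools.count(1)
finished = []  # finished spans as dicts, in completion order


@contextmanager
def span(name):
    # A new span's parent is whatever span is open at the moment it starts.
    record = {"id": next(_ids), "parent_id": _current_span.get(), "name": name}
    token = _current_span.set(record["id"])
    try:
        yield record
    finally:
        _current_span.reset(token)  # restore the previous parent on exit
        finished.append(record)


with span("agent.run") as run:
    with span("agent.model_call") as first_call:
        with span("agent.tool_call") as tool:
            pass
    with span("agent.model_call") as second_call:
        pass

for s in finished:
    print(s)
```

Nothing in the nested code passes a parent explicitly. The tree falls out of which span was open when the next one started, which is why a span opened outside the run's context ends up orphaned.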

Plane 1: Per-run trace with OpenTelemetry

OpenTelemetry is the open standard maintained by the CNCF. The Python SDK lives at https://github.com/open-telemetry/opentelemetry-python and the language docs are at https://opentelemetry.io/docs/languages/python/. Both are free. You can export spans to Jaeger, Tempo, Loki, or a local file. For solo operators a local file rotated daily is enough to start.

Here is the wrapper. It opens a root span around your agent's run() method and exposes model_call() and tool_call() context managers that the agent uses to open a child span around each model call and tool invocation:

import json
import time
import uuid
from contextlib import contextmanager
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (
    BatchSpanProcessor,
    ConsoleSpanExporter,
)

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("claude.agent")


class TracingAgentLoop:
    """Wrap an agent and emit one root span per run.

    The agent calls model_call() and tool_call() around its own model and
    tool invocations so those spans nest under the root span.
    """

    def __init__(self, wrapped):
        self.wrapped = wrapped

    def run(self, user_input: str) -> dict:
        run_id = str(uuid.uuid4())
        with tracer.start_as_current_span("agent.run") as root:
            root.set_attribute("agent.run_id", run_id)
            root.set_attribute("agent.user_input_len", len(user_input))
            t0 = time.monotonic()
            try:
                result = self.wrapped.run(user_input)
            except Exception:
                # Failed runs still need a status for the success-rate panel.
                root.set_attribute("agent.status", "error")
                raise
            finally:
                root.set_attribute(
                    "agent.latency_ms", int((time.monotonic() - t0) * 1000)
                )
            root.set_attribute("agent.tokens_in", result["tokens_in"])
            root.set_attribute("agent.tokens_out", result["tokens_out"])
            root.set_attribute("agent.status", result["status"])
            return result

    @contextmanager
    def model_call(self, model: str):
        with tracer.start_as_current_span("agent.model_call") as span:
            span.set_attribute("model.name", model)
            t0 = time.monotonic()
            try:
                yield span
            finally:
                # Record latency even when the call raises.
                span.set_attribute(
                    "model.latency_ms", int((time.monotonic() - t0) * 1000)
                )

    @contextmanager
    def tool_call(self, tool_name: str, args: dict):
        with tracer.start_as_current_span("agent.tool_call") as span:
            span.set_attribute("tool.name", tool_name)
            # Truncate so large arguments do not bloat every span.
            span.set_attribute("tool.args_json", json.dumps(args)[:512])
            t0 = time.monotonic()
            try:
                yield span
            finally:
                span.set_attribute(
                    "tool.latency_ms", int((time.monotonic() - t0) * 1000)
                )

A few choices worth flagging. Args are truncated to 512 chars because tool arguments can include large file contents, and storing those in every span balloons disk usage. The user input length is captured but not the input itself, which keeps PII out of logs by default. Latency is recorded explicitly in milliseconds because OpenTelemetry's auto-duration is in nanoseconds and dashboard math is cleaner with ms.

Once you swap ConsoleSpanExporter for OTLPSpanExporter, the same spans flow to Tempo, Jaeger, or any OTLP-compatible backend without code changes.
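A minimal sketch of that swap, assuming the opentelemetry-exporter-otlp package is installed and a collector or backend is listening for OTLP over gRPC. The endpoint http://localhost:4317 and the service name claude-agent are placeholders, not values from this article.

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Assumed values: point the endpoint at your own collector, Tempo, or Jaeger.
resource = Resource.create({"service.name": "claude-agent"})
provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
    )
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("claude.agent")
```

Only the provider setup changes. The span names and attributes set by the wrapper stay the same.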

Plane 2: Cost tracking with per-tool attribution

The Anthropic API returns token counts in the response usage field. The trick is attributing those tokens correctly when one tool call triggers another model call.

Pricing as of early 2026, per million tokens, from https://docs.anthropic.com/en/docs/about-claude/pricing:

Model	Input	Output
Opus 4.7	$15.00	$75.00
Sonnet 4.6	$3.00	$15.00
Haiku 4.5	$0.80	$4.00

Prompt caching cuts input costs by roughly 90% on hits and adds a 25% premium on writes. Build the calculator to handle cached and uncached tokens separately:

PRICING = {
    "claude-opus-4-7":   {"in": 15.00, "out": 75.00, "cache_read": 1.50,  "cache_write": 18.75},
    "claude-sonnet-4-6": {"in":  3.00, "out": 15.00, "cache_read": 0.30,  "cache_write":  3.75},
    "claude-haiku-4-5":  {"in":  0.80, "out":  4.00, "cache_read": 0.08,  "cache_write":  1.00},
}


def calc_cost(model: str, usage: dict) -> float:
    """Return USD cost for one model response.

    `usage` mirrors the Anthropic API response.usage object.
    """
    p = PRICING[model]

    def tokens(key: str) -> int:
        # Cache fields can be missing or None, so treat both as zero.
        return usage.get(key) or 0

    cost = 0.0
    cost += (tokens("input_tokens")                / 1_000_000) * p["in"]
    cost += (tokens("output_tokens")               / 1_000_000) * p["out"]
    cost += (tokens("cache_read_input_tokens")     / 1_000_000) * p["cache_read"]
    cost += (tokens("cache_creation_input_tokens") / 1_000_000) * p["cache_write"]
    return round(cost, 6)
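As a quick check of the arithmetic, here is a worked example using the Sonnet 4.6 rates listed above. The token counts are made up for illustration: a 200K-token prompt of which 150K is served from cache, plus 4K output tokens.

```python
# Rates in USD per million tokens, copied from the Sonnet 4.6 row above.
RATE_IN, RATE_OUT, RATE_CACHE_READ = 3.00, 15.00, 0.30


def usd(tokens: int, rate_per_million: float) -> float:
    return tokens / 1_000_000 * rate_per_million


# Hypothetical run: 50K uncached input, 150K cache-read input, 4K output.
uncached_in, cached_in, out = 50_000, 150_000, 4_000

with_cache = (
    usd(uncached_in, RATE_IN) + usd(cached_in, RATE_CACHE_READ) + usd(out, RATE_OUT)
)
without_cache = usd(uncached_in + cached_in, RATE_IN) + usd(out, RATE_OUT)

print(f"with cache:    ${with_cache:.3f}")
print(f"without cache: ${without_cache:.3f}")
```

At those rates the cached run costs about $0.255 against $0.66 uncached, which is why the calculator has to price the two kinds of input token separately.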

Tag the cost on the span so it travels with the trace:

# Assumes `client = anthropic.Anthropic()` was created at startup.
MODEL = "claude-sonnet-4-6"
with self.model_call(model=MODEL) as span:
    response = client.messages.create(...)
    usage = response.usage.model_dump()
    cost = calc_cost(MODEL, usage)
    span.set_attribute("model.cost_usd", cost)
    span.set_attribute("model.tokens_in", usage["input_tokens"])
    span.set_attribute("model.tokens_out", usage["output_tokens"])

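Because spans are a tree, the dashboard can roll up cost two ways. Sum model.cost_usd across all spans for a per-day total. Sum model.cost_usd filtered to descendants of a specific agent.tool_call span for per-tool attribution of the model calls a tool makes itself. Tokens from a tool result that gets fed back to the model are billed on the next agent.model_call span, which sits beside the tool span rather than under it, so tag that span with the name of the tool whose result it consumed if you want those tokens attributed too. The per-tool number is the one you need when a search tool quietly racks up $40 a day because it returns 8K-token result blobs that get fed back to the model.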
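The roll-up can be sketched over a flat span export. The record shape here (id, parent, name, attrs) and the sample spans are assumptions for illustration. A real backend would run the equivalent as a query instead.

```python
from collections import defaultdict

# Hypothetical flat span export: one dict per finished span.
spans = [
    {"id": "r1", "parent": None, "name": "agent.run", "attrs": {}},
    {"id": "m1", "parent": "r1", "name": "agent.model_call",
     "attrs": {"model.cost_usd": 0.012}},
    {"id": "t1", "parent": "m1", "name": "agent.tool_call",
     "attrs": {"tool.name": "search"}},
    {"id": "m2", "parent": "t1", "name": "agent.model_call",
     "attrs": {"model.cost_usd": 0.030}},
    {"id": "t2", "parent": "m1", "name": "agent.tool_call",
     "attrs": {"tool.name": "read_file"}},
    {"id": "m3", "parent": "r1", "name": "agent.model_call",
     "attrs": {"model.cost_usd": 0.008}},
]


def enclosing_tool(span, by_id):
    """Return the name of the nearest agent.tool_call ancestor, or None."""
    parent_id = span["parent"]
    while parent_id is not None:
        parent = by_id[parent_id]
        if parent["name"] == "agent.tool_call":
            return parent["attrs"]["tool.name"]
        parent_id = parent["parent"]
    return None


def rollup(spans):
    """Return (total cost, cost per tool) from spans carrying model.cost_usd."""
    by_id = {s["id"]: s for s in spans}
    total = 0.0
    per_tool = defaultdict(float)
    for s in spans:
        cost = s["attrs"].get("model.cost_usd")
        if cost is None:
            continue
        total += cost
        tool = enclosing_tool(s, by_id)
        if tool is not None:
            per_tool[tool] += cost
    return round(total, 6), {k: round(v, 6) for k, v in per_tool.items()}


total, per_tool = rollup(spans)
print(total, per_tool)
```

Only m2 sits under a tool span, so search is charged for it, while m1 and m3 count toward the total alone.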

A useful comparison: LangSmith charges $39/seat/month for the Plus tier and $99/seat/month for the Enterprise starter. Helicone's Pro tier is $50/month with 100K requests included. Self-hosting OpenTelemetry plus Loki on a 4GB VPS costs about $5 to $12 a month and has no per-request ceiling. The trade-off is that you write the dashboards yourself. For a solo operator with three to ten agents in production, the math favors self-hosting until you cross roughly 500K requests per month, at which point ops overhead starts dominating.

Plane 3: The minimal dashboard

You have two reasonable options. Pick based on whether you already run any observability infra.

Option A: Grafana + Loki

If you run any other services, you probably want Grafana. Loki (https://grafana.com/docs/loki/latest/) ingests structured logs cheaply because it indexes labels, not log content. Configure the OpenTelemetry collector to write spans as JSON to Loki, then build three panels:

Success rate: sum(rate({app="agent"} | json | agent_status="ok" [5m])) / sum(rate({app="agent"} | json | agent_status!="" [5m]))
p95 latency: quantile_over_time(0.95, {app="agent"} | json | unwrap agent_latency_ms [5m]) by (app)
Daily cost: sum(sum_over_time({app="agent"} | json | unwrap model_cost_usd [24h]))

Grafana renders all three on one dashboard. Add an alert on the cost panel set to ping Telegram or email if today's spend crosses 2x the trailing 7-day average. That single alert catches roughly 80% of the runaway-loop bugs that previously burned a weekend of tokens.
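The alert condition itself is simple enough to state in a few lines. This is a sketch of the rule only, not of Grafana's alerting configuration. The function name spend_alert and the sample figures are hypothetical.

```python
from statistics import mean


def spend_alert(
    today_usd: float, last_7_days_usd: list[float], factor: float = 2.0
) -> bool:
    """Return True when today's spend exceeds `factor` times the trailing mean."""
    if not last_7_days_usd:
        return False  # no baseline yet, so stay quiet
    return today_usd > factor * mean(last_7_days_usd)


# Hypothetical daily totals for the previous seven days, in USD.
history = [4.10, 3.80, 4.40, 5.00, 3.90, 4.20, 4.60]

print(spend_alert(9.75, history))  # above twice the trailing average
print(spend_alert(6.00, history))  # elevated, but under the threshold
```

Comparing against a trailing average rather than a fixed dollar figure means the threshold follows normal growth in traffic instead of needing manual updates.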

Option B: 50 lines of FastAPI + HTMX

If you do not have Grafana and do not want to install it, a single Python file gives you the same three numbers. HTMX (https://htmx.org/docs/) does live refresh with no JavaScript build step:

from fastapi import FastAPI, Request
from fastapi.responses import HTMLResponse
from fastapi.templating import Jinja2Templates
import sqlite3

app = FastAPI()
templates = Jinja2Templates(directory="templates")


def percentile(values: list[int], p: float) -> int:
    if not values:
        return 0
    values = sorted(values)
    k = int(p * (len(values) - 1))
    return values[k]


@app.get("/", response_class=HTMLResponse)
def dashboard(request: Request):
    db = sqlite3.connect("traces.db")
    try:
        rows = db.execute("""
            SELECT status, latency_ms, cost_usd FROM runs
            WHERE created_at > datetime('now', '-24 hours')
        """).fetchall()
    finally:
        db.close()  # release the connection on every request
    # With no runs in the window, report 0% rather than dividing by zero.
    success = sum(1 for r in rows if r[0] == "ok") / len(rows) if rows else 0.0
    p95 = percentile([r[1] for r in rows], 0.95)
    daily_cost = sum(r[2] for r in rows)
    return templates.TemplateResponse("dashboard.html", {
        "request": request,
        "success_rate": f"{success * 100:.1f}%",
        "p95_ms": p95,
        "daily_cost": f"${daily_cost:.2f}",
    })

The template uses hx-trigger="every 30s" on a wrapper div to refresh without a full page reload. You get a dashboard that runs in 40MB of RAM, needs no services beyond the Python process and a SQLite file, and surfaces the three numbers you actually look at every morning.

A typical layout: three big-number cards across the top (success %, p95 ms, $ spent today), a sparkline of hourly cost underneath, and a 20-row table of the slowest runs in the last hour at the bottom. The whole thing fits on one screen and renders in under 50ms because SQLite reads from page cache.
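The dashboard query assumes a runs table in traces.db. Here is a minimal sketch of one way to create and fill it. The status, latency_ms, cost_usd, and created_at columns are inferred from the SELECT above, while the run_id column and the record_run helper are assumptions.

```python
import sqlite3

# Assumed schema: only the four columns the dashboard reads are required.
SCHEMA = """
CREATE TABLE IF NOT EXISTS runs (
    run_id     TEXT PRIMARY KEY,
    status     TEXT NOT NULL,
    latency_ms INTEGER NOT NULL,
    cost_usd   REAL NOT NULL,
    created_at TEXT NOT NULL DEFAULT (datetime('now'))
)
"""


def record_run(db, run_id, status, latency_ms, cost_usd):
    """Insert one finished run; created_at defaults to the current UTC time."""
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT INTO runs (run_id, status, latency_ms, cost_usd) "
            "VALUES (?, ?, ?, ?)",
            (run_id, status, latency_ms, cost_usd),
        )


db = sqlite3.connect(":memory:")  # use "traces.db" to match the dashboard
db.execute(SCHEMA)
record_run(db, "run-1", "ok", 1840, 0.0123)
record_run(db, "run-2", "error", 9200, 0.0450)

rows = db.execute(
    "SELECT status, latency_ms, cost_usd FROM runs "
    "WHERE created_at > datetime('now', '-24 hours')"
).fetchall()
print(rows)
```

Calling record_run at the end of each agent run, with values taken from the root span, keeps the SQLite table and the traces in step.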

When to graduate to a paid platform

Self-hosting is right for the first 12 to 24 months. The signals that you should move to LangSmith, Arize, or Helicone are concrete:

You need shared dashboards for a team of 5+ engineers who do not want shell access to the VPS
You want LLM-as-judge eval pipelines (LangSmith has the most mature offering here)
You hit 1M+ agent requests per month and the storage cost of raw spans crosses what a SaaS tier charges
You need SOC 2 evidence for an enterprise contract and do not want to harden the infra yourself

Until any of those triggers fires, the OpenTelemetry + Loki + Grafana stack catches the bugs that actually matter. The cost-attribution panel pays for itself the first time it surfaces a tool that quietly costs $30 a day in tokens.

What to instrument next

Once the three planes are live, layer three derived metrics on top. The raw spans already carry the data:

Cost per successful run: daily cost divided by successful run count, plotted over a 14-day window. A rising line means agents are taking more turns to converge, often a regression in tool quality.
Tool retry rate per tool: count spans with tool.retry_attempt > 0, an attribute you set on the tool_call span each time a call is retried. A tool that retries 30% of the time points to a degrading upstream service or a timeout that is set too tight.
Output-to-input token ratio: average output tokens divided by average input tokens. If output grows faster than input, the model is rambling, often because the system prompt has drifted out of focus.
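The three metrics above reduce to a few lines each. This sketch works over hypothetical in-memory records, where the field names and sample values are assumptions. In practice the same sums would run as queries against the span store.

```python
from collections import Counter

# Hypothetical records pulled from the span store.
runs = [
    {"status": "ok", "cost_usd": 0.04, "tokens_in": 12_000, "tokens_out": 900},
    {"status": "ok", "cost_usd": 0.06, "tokens_in": 20_000, "tokens_out": 1_100},
    {"status": "error", "cost_usd": 0.10, "tokens_in": 48_000, "tokens_out": 2_000},
]
tool_calls = [
    {"tool": "search", "retry_attempt": 0},
    {"tool": "search", "retry_attempt": 1},
    {"tool": "search", "retry_attempt": 0},
    {"tool": "read_file", "retry_attempt": 0},
]


def cost_per_successful_run(runs):
    """Total cost (failed runs included) divided by the number of ok runs."""
    ok = sum(1 for r in runs if r["status"] == "ok")
    return sum(r["cost_usd"] for r in runs) / ok if ok else None


def retry_rate_by_tool(tool_calls):
    """Fraction of each tool's calls that were retries."""
    totals, retried = Counter(), Counter()
    for call in tool_calls:
        totals[call["tool"]] += 1
        if call["retry_attempt"] > 0:
            retried[call["tool"]] += 1
    return {tool: retried[tool] / n for tool, n in totals.items()}


def output_input_ratio(runs):
    """Output tokens divided by input tokens across all runs."""
    tokens_in = sum(r["tokens_in"] for r in runs)
    return sum(r["tokens_out"] for r in runs) / tokens_in if tokens_in else None


print(cost_per_successful_run(runs))
print(retry_rate_by_tool(tool_calls))
print(output_input_ratio(runs))
```

The cost of failed runs stays in the numerator of the first metric on purpose, so a day with many expensive failures pushes the line up instead of hiding them.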

These three catch the slow-burn issues that single-run traces miss. They cost nothing extra to compute because the data already lives in your span store.

The full stack — decorator, cost calculator, dashboard — is about 200 lines of code and one Docker Compose file. That is the price of knowing, every morning, exactly what your agents did yesterday and what they cost you.
