Isolate the Anthropic SDK Behind One Adapter

Agent codebases rot fastest at the LLM seam. A from anthropic import Anthropic in three modules turns into seven, then twelve. Each callsite invents its own retry logic, its own system prompt assembly, its own model string. When Sonnet 4.6 ships and you want to A/B against 4.5, you grep for claude-sonnet and find nine places to edit. When a test needs to run offline, you monkey-patch anthropic.Anthropic and pray the mock matches the real shape.

The fix is boring and well-understood: put every Anthropic call behind one adapter module. The non-obvious part is what belongs inside the adapter versus what stays at the callsite. Get that boundary wrong and you either build a god-object that leaks prompt logic, or a thin pass-through that solves nothing.

What the adapter owns

A working adapter for an agent codebase owns five things:

SDK instantiation — exactly one Anthropic() client per process, configured from env once.
Model selection — model IDs live as named constants, not string literals at callsites.
Retry + timeout policy — exponential backoff on RateLimitError and APIConnectionError, hard cap on total wall time.
Token + cost accounting — every call records input/output tokens and dollar cost to a metrics sink.
Prompt versioning hook — system prompts are looked up by ID, not pasted inline at the callsite.

What it does NOT own: the actual prompt content, the business logic that decides which prompt to use, or the parsing of tool calls. Those stay at the use-case layer. The adapter is a transport with policy, not a planner.
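
The prompt versioning hook can be as small as a dictionary lookup. A minimal sketch, assuming a hypothetical PromptRegistry that the use-case layer populates at startup; the class, its method names, and the two example prompts are illustrative and not part of any SDK:

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PromptRegistry:
    """Hypothetical lookup table: prompt ID -> system prompt text."""

    _prompts: dict[str, str] = field(default_factory=dict)

    def register(self, prompt_id: str, system: str) -> None:
        # Refuse silent overwrites so an edited prompt has to take a new ID.
        if prompt_id in self._prompts:
            raise ValueError(f"prompt_id already registered: {prompt_id}")
        self._prompts[prompt_id] = system

    def get(self, prompt_id: str) -> str:
        try:
            return self._prompts[prompt_id]
        except KeyError:
            raise KeyError(f"unknown prompt_id: {prompt_id}") from None


registry = PromptRegistry()
registry.register("plan.decompose", "You split a task into ordered steps.")
registry.register("plan.decompose.v2", "You split a task into at most five steps.")
```

The prompt text still lives with the use case that registers it; the registry only guarantees that every system prompt reaching the adapter has a stable ID attached.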

Minimum viable adapter

Here is the shape that has held up across several production agent codebases. Single file, ~120 lines, no abstractions you don't use.

from __future__ import annotations

import logging
import time
from dataclasses import dataclass

from anthropic import Anthropic, APIConnectionError, RateLimitError
from anthropic.types import Message

logger = logging.getLogger(__name__)

# Model IDs as constants — never literal strings at callsites
SONNET_4_6 = "claude-sonnet-4-6"
HAIKU_4_5 = "claude-haiku-4-5-20251001"
OPUS_4_7 = "claude-opus-4-7"

# Pricing per million tokens (USD) — keep in sync with Anthropic pricing page
PRICING = {
    SONNET_4_6: {"input": 3.00, "output": 15.00},
    HAIKU_4_5: {"input": 1.00, "output": 5.00},
    OPUS_4_7: {"input": 15.00, "output": 75.00},
}


@dataclass(frozen=True)
class LlmResult:
    text: str
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    prompt_id: str
    latency_ms: int


class AnthropicAdapter:
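    def __init__(self, client: Anthropic | None = None) -> None:
        # The SDK retries on its own by default; turn that off so the policy
        # in _call_with_retry is the only retry loop in effect.
        self._client = client or Anthropic(max_retries=0)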

    def complete(
        self,
        *,
        prompt_id: str,
        system: str,
        user: str,
        model: str = SONNET_4_6,
        max_tokens: int = 2048,
        max_retries: int = 3,
    ) -> LlmResult:
        start = time.monotonic()
        msg = self._call_with_retry(
            system=system, user=user, model=model,
            max_tokens=max_tokens, max_retries=max_retries,
        )
        elapsed_ms = int((time.monotonic() - start) * 1000)
        return self._build_result(msg, model, prompt_id, elapsed_ms)

    def _call_with_retry(
        self, *, system: str, user: str, model: str,
        max_tokens: int, max_retries: int,
    ) -> Message:
        delay = 1.0
        for attempt in range(max_retries):
            try:
                return self._client.messages.create(
                    model=model,
                    max_tokens=max_tokens,
                    system=system,
                    messages=[{"role": "user", "content": user}],
                )
            except (RateLimitError, APIConnectionError) as exc:
                if attempt == max_retries - 1:
                    raise
                logger.warning("anthropic retry %d/%d: %s", attempt + 1, max_retries, exc)
                time.sleep(delay)
                delay *= 2
        raise RuntimeError("unreachable")

    def _build_result(
        self, msg: Message, model: str, prompt_id: str, elapsed_ms: int,
    ) -> LlmResult:
        text = "".join(b.text for b in msg.content if b.type == "text")
        cost = (
            msg.usage.input_tokens * PRICING[model]["input"]
            + msg.usage.output_tokens * PRICING[model]["output"]
        ) / 1_000_000
        return LlmResult(
            text=text, model=model,
            input_tokens=msg.usage.input_tokens,
            output_tokens=msg.usage.output_tokens,
            cost_usd=cost, prompt_id=prompt_id, latency_ms=elapsed_ms,
        )

Notice the helper split. complete is a flat dispatch; _call_with_retry handles transport failures; _build_result normalizes the response. Each method has one job. No nested try blocks, no nested if chains. That matters less for line count and more for the diff: when you change retry policy, you touch one method.
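
The ownership list promises a hard cap on total wall time, which the minimal adapter above does not enforce. One way to add it is a standalone helper that stops retrying once the next backoff sleep would overrun a deadline. This is a sketch under assumptions: call_with_deadline, its parameter names, and the injectable sleep and clock are hypothetical. In the adapter, retryable would be (RateLimitError, APIConnectionError).

```python
from __future__ import annotations

import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_deadline(
    fn: Callable[[], T],
    *,
    retryable: tuple[type[BaseException], ...],
    max_attempts: int = 3,
    max_total_s: float = 30.0,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Retry fn with exponential backoff until max_attempts or the deadline."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    deadline = clock() + max_total_s
    delay = base_delay_s
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retryable:
            remaining = deadline - clock()
            # Give up when out of attempts, or when the next sleep
            # would carry us past the wall-time cap.
            if attempt == max_attempts or delay >= remaining:
                raise
            sleep(delay)
            delay *= 2
    raise AssertionError("unreachable")
```

The cap here bounds the backoff loop, not a single in-flight request; a per-request timeout still has to be set on the client. Passing sleep and clock as parameters keeps the retry policy testable without real waiting.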

Why prompt_id is a parameter

Every call carries a prompt_id. This is the lever that unlocks the rest of the adapter's value. With it, you can:

Log structured rows: prompt_id=plan.decompose model=sonnet-4-6 cost=$0.0034 latency=820ms
Aggregate cost-by-prompt across a day: cheaper to debug than cost-by-endpoint
A/B prompts behind a flag: route 10% of plan.decompose calls to plan.decompose.v2
Catch regressions: if plan.decompose p95 latency jumps from 800ms to 2200ms after a prompt edit, the metrics show it inside an hour
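
The A/B item can be done with deterministic bucketing, so the same request key always lands on the same variant. A sketch, assuming a hypothetical route_prompt helper and a caller-supplied key such as a task or session ID; neither exists in the adapter above:

```python
from __future__ import annotations

import hashlib


def route_prompt(base_id: str, variant_id: str, key: str, percent: int) -> str:
    """Return variant_id for roughly `percent`% of keys, base_id otherwise."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # Mix the variant name into the hash so each experiment buckets differently.
    digest = hashlib.sha256(f"{variant_id}:{key}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return variant_id if bucket < percent else base_id


# The chosen ID is what gets passed to adapter.complete(prompt_id=...).
prompt_id = route_prompt("plan.decompose", "plan.decompose.v2", key="task-1234", percent=10)
```

Because the returned ID flows into the logged prompt_id, the two arms separate cleanly in the metrics with no extra bookkeeping.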

Without prompt_id, your token logs become an undifferentiated stream and you lose the ability to attribute cost or quality to specific prompts.
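
A sketch of the cost-by-prompt rollup, assuming results are collected in memory. CallRecord is a trimmed stand-in for LlmResult, and cost_by_prompt and the tool.select ID are hypothetical; a real metrics sink would more likely emit each row to a metrics backend and aggregate there.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class CallRecord:
    """Trimmed stand-in for LlmResult: only the fields the rollup needs."""

    prompt_id: str
    cost_usd: float
    latency_ms: int


def cost_by_prompt(records: Iterable[CallRecord]) -> dict[str, float]:
    """Sum dollar cost per prompt_id."""
    totals: dict[str, float] = defaultdict(float)
    for rec in records:
        totals[rec.prompt_id] += rec.cost_usd
    return dict(totals)


records = [
    CallRecord("plan.decompose", 0.0034, 820),
    CallRecord("plan.decompose", 0.0041, 790),
    CallRecord("tool.select", 0.0009, 310),
]
totals = cost_by_prompt(records)
```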

Testability — the actual reason this is worth doing

Mocking the Anthropic SDK directly is fragile. The SDK's response shapes change between versions, the streaming API has subtly different types from the non-streaming one, and any test that patches anthropic.Anthropic ends up coupled to internal SDK structure.

The adapter solves this by giving you a single seam to fake:

from typing import Any

# LlmResult comes from the adapter module; Planner is the use case under test.

class FakeAdapter:
    def __init__(self, response_text: str) -> None:
        self._response = response_text

    def complete(self, *, prompt_id: str, **kwargs: Any) -> LlmResult:
        return LlmResult(
            text=self._response, model="fake",
            input_tokens=10, output_tokens=20, cost_usd=0.0,
            prompt_id=prompt_id, latency_ms=5,
        )

def test_planner_handles_empty_plan():
    adapter = FakeAdapter(response_text="[]")
    planner = Planner(llm=adapter)
    assert planner.decompose("noop task") == []

Use cases inject the adapter (or a fake). Tests run offline, take milliseconds, and don't break when the SDK ships a minor version bump.
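
To keep fakes honest, the use case can depend on a structural interface instead of the concrete adapter class. A sketch using typing.Protocol: LlmPort, RecordingFake, and Summarizer are hypothetical names, and LlmResult is repeated only so the block runs on its own.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Protocol


@dataclass(frozen=True)
class LlmResult:  # same fields as the adapter's LlmResult
    text: str
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    prompt_id: str
    latency_ms: int


class LlmPort(Protocol):
    """What use cases need from the adapter, and nothing more."""

    def complete(self, *, prompt_id: str, system: str, user: str) -> LlmResult: ...


class RecordingFake:
    """Fake that returns canned text and remembers every call it received."""

    def __init__(self, response_text: str) -> None:
        self._response = response_text
        self.calls: list[dict[str, Any]] = []

    def complete(self, *, prompt_id: str, system: str, user: str, **kwargs: Any) -> LlmResult:
        self.calls.append({"prompt_id": prompt_id, "system": system, "user": user, **kwargs})
        return LlmResult(self._response, "fake", 10, 20, 0.0, prompt_id, 5)


class Summarizer:
    """Hypothetical use case; it only knows about LlmPort."""

    def __init__(self, llm: LlmPort) -> None:
        self._llm = llm

    def summarize(self, text: str) -> str:
        result = self._llm.complete(
            prompt_id="doc.summarize", system="Summarize in one sentence.", user=text,
        )
        return result.text.strip()
```

Recording the calls lets a test assert on which prompt_id a use case sent, not just on what it did with the reply.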

Adapter vs framework: when to skip LangChain

A reasonable question: why not use LangChain or LlamaIndex, which already wrap the LLM call? For a focused agent codebase where you understand the prompt flow and want deterministic behavior, hand-rolled adapters win on three axes:

Surface area: ~120 lines of your code over ~50k lines of framework code you don't control
Debuggability: when a call fails, the stack trace points at five frames, not fifty
Cost visibility: your accounting integrates directly with your metrics backend, no scraping framework callbacks

Frameworks make sense when you genuinely need their composability — agents-of-agents, complex retrieval graphs, swappable vector stores. For a codebase that calls Sonnet from four or five use cases, the adapter pattern is roughly 80% of the framework value at 5% of the dependency cost.

Migration recipe for an existing codebase

If your codebase already has Anthropic calls scattered around:

Grep for Anthropic( and messages.create( — you now have a list of every callsite.
Build the adapter module first; do not modify callsites yet.
Pick the smallest callsite. Replace its body with adapter.complete(prompt_id=..., system=..., user=...). Run its tests.
Repeat for each callsite. Each migration is one PR, isolated.
Once all callsites use the adapter, delete the direct from anthropic import Anthropic lines outside the adapter module. Add a lint rule (Ruff's banned-api check, TID251, with a per-file ignore for the adapter module, or a CI grep) that fails on direct imports anywhere else.

The lint rule is what makes this stick. Without it, the next contributor adds a sixth direct import and you lose the invariant.
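
For the CI-grep option, a short script built on the standard library's ast module is somewhat more robust than a regex, since it ignores comments and string literals. A sketch, assuming the adapter lives at a hypothetical path llm/adapter.py; the function names and that path are illustrative:

```python
from __future__ import annotations

import ast
from pathlib import Path

ALLOWED = {Path("llm/adapter.py")}  # hypothetical adapter location


def imports_anthropic(source: str) -> bool:
    """True if the source imports the anthropic package directly."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            if any(alias.name.split(".")[0] == "anthropic" for alias in node.names):
                return True
        elif isinstance(node, ast.ImportFrom):
            # level > 0 is a relative import, which cannot be the SDK.
            if node.level == 0 and (node.module or "").split(".")[0] == "anthropic":
                return True
    return False


def find_direct_imports(root: Path) -> list[Path]:
    """Return files under root, other than the adapter, that import anthropic."""
    offenders = []
    for path in sorted(root.rglob("*.py")):
        rel = path.relative_to(root)
        if rel in ALLOWED:
            continue
        if imports_anthropic(path.read_text(encoding="utf-8")):
            offenders.append(rel)
    return offenders
```

A CI step would call find_direct_imports on the source directory and exit non-zero when the list is non-empty. Point it at your package root, not the repository root, so it does not walk a virtualenv.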

What doesn't belong in the adapter

Resist these temptations:

Prompt templating — Jinja, f-strings, prompt composition belong at the use-case layer. The adapter receives finished system + user strings.
Tool-call parsing — if you're using tool use, parse the response at the use case. The adapter returns text and usage; what the text means is domain logic.
Caching — prompt caching is a transport-level concern for Anthropic's API (the cache_control field), and that does belong in the adapter. But application-level result caching ("we already asked this question yesterday") belongs at the use case where you know the cache key semantics.
Streaming — if you need streaming, add a second method stream(...) that returns an iterator of deltas. Don't try to make complete async-generator-shaped.
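
For the transport-level prompt caching mentioned in the list above, the adapter can wrap the system string in a text block carrying cache_control before handing it to messages.create. A sketch of the payload shaping only; build_system_blocks is a hypothetical helper, and whether a given prompt is long enough to be cached depends on the model, so check the current Anthropic documentation before relying on it:

```python
from __future__ import annotations

from typing import Any


def build_system_blocks(system: str, *, cache: bool) -> list[dict[str, Any]]:
    """Shape the `system` argument as text blocks, optionally marked cacheable."""
    block: dict[str, Any] = {"type": "text", "text": system}
    if cache:
        # "ephemeral" is the cache type the Messages API uses for prompt caching.
        block["cache_control"] = {"type": "ephemeral"}
    return [block]
```

Callsites keep passing a plain string. The adapter decides, per prompt_id or per model, whether to set cache=True, and passes the result as system=build_system_blocks(...).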

The payoff

Six months in, a codebase with one adapter and forty use cases will have: one place to edit when SDK signatures change, one place to flip when a new model ships, one place to add structured logging, one place to enforce a per-tenant budget cap. A codebase without the adapter will have those same concerns scattered across forty files, and each migration takes a Friday afternoon instead of fifteen minutes.
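
As one illustration of the budget-cap point, here is a sketch of a per-tenant guard the adapter could consult before each call and update after it. TenantBudget, BudgetExceeded, and the tenant_id argument are all hypothetical; the adapter shown earlier has no tenant concept, and this in-memory version is not shared across processes.

```python
from __future__ import annotations

from collections import defaultdict


class BudgetExceeded(RuntimeError):
    """Raised before a call when the tenant has no budget left."""


class TenantBudget:
    """Hypothetical in-memory spend tracker, keyed by tenant ID."""

    def __init__(self, cap_usd: float) -> None:
        self._cap = cap_usd
        self._spent: dict[str, float] = defaultdict(float)

    def check(self, tenant_id: str) -> None:
        # Call before sending a request.
        if self._spent[tenant_id] >= self._cap:
            raise BudgetExceeded(f"{tenant_id} reached ${self._cap:.2f} cap")

    def record(self, tenant_id: str, cost_usd: float) -> None:
        # Call after a request, with LlmResult.cost_usd.
        self._spent[tenant_id] += cost_usd

    def remaining(self, tenant_id: str) -> float:
        return max(0.0, self._cap - self._spent[tenant_id])
```

Since cost is only known after a call returns, the last permitted call can overshoot the cap by its own cost. That is usually an acceptable trade for not having to estimate tokens up front.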

The adapter pattern isn't novel. The reason to write it down for agent codebases is that the temptation to call the SDK directly is high — the SDK is genuinely pleasant to use — and that pleasantness is exactly what makes the scatter happen.
