When Your Local AI Agent Forgets: The Silent 1024-Token Context Ceiling in Ollama
You wire up a tidy little local agent. It calls tools, summarizes results, decides the next step, calls another tool, and eventually answers the user. Everything is fast, free, and runs on your own machine via Ollama. For the first few turns it looks brilliant. Then, somewhere around the fourth or fifth tool call, the agent starts producing answers that ignore earlier context. It re-asks for the user's filename. It "forgets" the schema you injected two turns ago. It hallucinates a function signature you patiently corrected. Nothing in the logs screams "error." The HTTP responses are still 200 OK. The model still produces fluent prose.
You have just been bitten by Ollama's silent context ceiling. By default, Ollama caps the working context at a value so small that any non-trivial agent loop overruns it within minutes, and instead of raising an error, the runtime quietly trims the oldest tokens off the front of your prompt. The conversation continues without warning. The agent's memory is gone, and the only signal you get is degraded behavior. This article walks through the trap, why it exists, how to detect it from outside the model, and how to raise the ceiling deliberately so your agent's working memory matches your architecture diagram.
The default is smaller than you think
Open-source local models usually advertise long context windows. Llama 3 family checkpoints commonly support 8K or higher. Mistral and Qwen variants push past 32K. So most builders assume that when they pull a model with ollama pull llama3 and start a chat, they get the full advertised window. They do not. The runtime parameter that actually controls the working context is num_ctx, and unless you set it explicitly, Ollama applies a conservative default: historically 2048 tokens, and on some legacy configurations even smaller. The model card's max-context advertisement is a capability, not a configuration. The runtime ceiling is whatever num_ctx resolves to at request time.
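One way to see the gap between capability and configuration is to ask the server what it knows about a model. The sketch below assumes a local server on the default port and that the /api/show response carries a newline-separated parameters string and a model_info map with an architecture-prefixed context_length key; the helper names are hypothetical, and the field layout should be checked against the API reference for your Ollama version.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: default local endpoint


def parse_num_ctx(parameters: str):
    """Return num_ctx from a Modelfile-style parameter listing, or None if unset."""
    for line in parameters.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "num_ctx":
            return int(parts[1])
    return None


def describe_context(info: dict) -> dict:
    """Separate the advertised capability from the configured ceiling."""
    model_info = info.get("model_info", {})
    advertised = next(
        (v for k, v in model_info.items() if k.endswith(".context_length")), None
    )
    return {
        "advertised_max": advertised,
        "configured_num_ctx": parse_num_ctx(info.get("parameters", "")),
    }


def show_model(model: str) -> dict:
    """POST /api/show and decode the JSON body (needs a running server)."""
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/show",
        data=json.dumps({"model": model}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Calling describe_context(show_model("llama3")) against a live server would report both numbers side by side. A configured_num_ctx of None means the model definition sets nothing, so the runtime default applies unless the request overrides it.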
For a single-shot completion this is fine. A user types a question, the model answers, the session ends. But agent loops are not single-shot. An agent loop typically replays the full message history on every step so the model can reason about prior tool results. Each step adds the previous tool output, the previous assistant message, sometimes a chain-of-thought scratchpad, and the next user-visible instruction. Token counts accumulate fast. A modest ReAct-style agent invoking three tools with JSON payloads can easily blow past 2048 tokens before it has finished working on the user's first request.
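The arithmetic behind that claim is easy to sketch. The token sizes below are illustrative assumptions, not measurements, and the function name is hypothetical; the point is only that a replayed history grows linearly with each step while the ceiling stays fixed.

```python
def steps_until_overflow(base_tokens: int, tokens_per_step: int, num_ctx: int) -> int:
    """Return the first step whose replayed prompt no longer fits in num_ctx."""
    if tokens_per_step <= 0:
        raise ValueError("tokens_per_step must be positive")
    step = 0
    total = base_tokens
    while total <= num_ctx:
        step += 1
        total += tokens_per_step
    return step


# Assumed sizes for a small ReAct-style loop:
BASE = 400 + 50        # system prompt with tool schemas + first user message
PER_STEP = 150 + 350   # assistant message + JSON tool result added per step

print(steps_until_overflow(BASE, PER_STEP, 2048))  # 4
print(steps_until_overflow(BASE, PER_STEP, 8192))  # 16
```

With these assumed sizes the prompt overflows a 2048-token window on the fourth step, which matches the point at which the agent in the opening scenario starts to misbehave. The sketch ignores the room the model also needs for its own output, so the real margin is smaller.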
The behavior at the ceiling is not what most people expect. Ollama does not return HTTP 413 Payload Too Large. It does not include a warning field in the response. It silently drops tokens from the front of the prompt to make the request fit, then runs the model on the truncated remainder. From the client's perspective the call succeeded. From the model's perspective, the system prompt that explained the tool schema is simply gone.
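A toy model makes the consequence concrete. The function below is not Ollama's implementation; it only mimics the behavior described above (keep the newest tokens, drop the oldest) on a prompt whose tokens are labelled by where they came from. All sizes are assumptions.

```python
def trim_front(prompt_tokens: list, num_ctx: int) -> list:
    """Toy model: keep only the newest num_ctx tokens, dropping the oldest."""
    if num_ctx <= 0:
        raise ValueError("num_ctx must be positive")
    if len(prompt_tokens) <= num_ctx:
        return list(prompt_tokens)
    return prompt_tokens[-num_ctx:]


# Label each token with its origin so we can see what survives.
prompt = ["SYS"] * 300 + ["TURN1"] * 900 + ["TURN2"] * 900 + ["TURN3"] * 400
seen = trim_front(prompt, 2048)

print(len(prompt), len(seen))   # 2500 2048
print("SYS" in seen)            # False: the system prompt is gone
print(seen.count("TURN1"))      # 748 of the original 900 remain
```

The request is only about 20% over budget, yet the entire system prompt is among the tokens that disappear, because it sits at the very front.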
Why "silent" is worse than "broken"
A noisy failure mode trains you to fix the underlying problem. Silent truncation, by contrast, lets the agent continue almost working, which is the worst place to debug from. You spend an afternoon convinced that the model has degraded, or that your prompt template is malformed, or that the tool-result formatter is corrupting JSON. None of those are true. The model is doing exactly what it was asked to do given the inputs it actually saw, which is a strict subset of the inputs you sent.
This pattern is not unique to Ollama. Every inference runtime that caps n_ctx at load time has a similar trap. llama.cpp, which Ollama wraps, exposes the same parameter at the C API level. The difference is that when you use llama.cpp directly you almost always notice the ceiling, because you set it explicitly at model load. Ollama hides the load step behind its convenient ollama run and /api/chat interfaces, so the default leaks into production agents.
The cost of the leak compounds. Every additional turn that fits in the truncated window pushes earlier turns further toward the chopping block. By the time you notice that the agent has forgotten the user's name, you cannot tell from the API response which tokens were dropped, because the response only contains the completion, not the residual prompt the runtime actually saw.
Detecting the ceiling from the outside
Because the runtime does not warn you, detection has to be active. Three signals work well in practice.
The first is response-shape regression. Maintain a small fixture of multi-turn dialogues where the correct answer depends on a fact mentioned in the first user message. Run the fixture against your agent at every CI step. If the agent ever fails the fixture without any code change, your context is being trimmed somewhere.
The second is explicit token accounting on the client. Before each call to /api/chat, run the request payload through the same tokenizer the model uses (for Llama-family models, the SentencePiece or tiktoken-compatible counter shipped with the upstream weights) and compare the count to your configured num_ctx. If your client thinks the request is below the limit but the model still loses context, you have evidence that num_ctx is not what you think it is.
The third is to log the eval_count and prompt_eval_count fields that Ollama returns on each call. These come from the underlying llama.cpp evaluator. If prompt_eval_count flatlines at the same value across turns even though your message history is growing, you have caught the truncation red-handed. The model is being shown the same prefix length every time, which means tokens are being removed to make room.
Here is the minimal probe many teams add to their agent harness. It sends a deliberately oversized request and inspects the response metadata for evidence of trimming:
curl -s http://localhost:11434/api/chat \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llama3",
    "messages": [
      {"role": "system", "content": "You are a precise assistant. Repeat any number you see."},
      {"role": "user", "content": "REMEMBER_THIS_NUMBER=42. Now: a long filler block of about three thousand tokens..."},
      {"role": "user", "content": "What was REMEMBER_THIS_NUMBER?"}
    ],
    "options": { "num_ctx": 2048 },
    "stream": false
  }' | jq '{prompt_eval_count, eval_count, content: .message.content}'
If the response content is anything other than "42," and prompt_eval_count is close to 2048 rather than the actual token length of your messages, you are watching the ceiling silently shave the front of your prompt.
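The second and third signals can also be checked programmatically inside the harness. The sketch below is one possible shape, with hypothetical function names: it uses a crude four-characters-per-token estimate as a stand-in for the model's real tokenizer (an assumption you should replace), and flags a run of turns where the server-reported prompt_eval_count stays flat while the client-side estimate keeps growing.

```python
def crude_token_estimate(messages: list) -> int:
    """Rough client-side count: about 4 characters per token (an assumption)."""
    return sum(len(m["content"]) for m in messages) // 4


def flatlined(reported: list, estimates: list, window: int = 3) -> bool:
    """True when the last `window` reported counts are nearly constant
    while the client estimates for the same turns kept increasing."""
    if len(reported) < window or len(estimates) < window:
        return False
    recent_rep = reported[-window:]
    recent_est = estimates[-window:]
    # "Nearly constant": spread within 2% of the largest recent value.
    server_flat = max(recent_rep) - min(recent_rep) <= 0.02 * max(recent_rep)
    client_growing = all(b > a for a, b in zip(recent_est, recent_est[1:]))
    return server_flat and client_growing


# Per-turn values an agent loop might have logged (made-up numbers).
reported = [900, 1500, 2048, 2048, 2047]   # prompt_eval_count from responses
estimates = [880, 1490, 2300, 2900, 3500]  # client-side estimates
print(flatlined(reported, estimates))      # True
```

In a real loop you would append one entry to each list per call and raise an alert, or fail the run, as soon as flatlined returns True.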
Raising the ceiling deliberately
There are two places to set num_ctx: in the Modelfile that defines a custom model, or per-request via the options field on /api/chat and /api/generate. Both paths are documented in the Ollama API reference and the Modelfile reference.
A Modelfile change creates a new named variant that always loads with a larger window. This is the right choice when you control the model registry and want every consumer of myagent-llama3 to get the same ceiling. The downside is that a larger num_ctx value costs VRAM at load time, because the key-value cache is sized for the full window regardless of whether any given request actually needs it. If you flip a 7B model from 2048 to 32768 you should expect to consume several additional gigabytes of resident memory, and you may be unable to load the model at all on hardware that was previously sufficient.
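A back-of-the-envelope estimate shows where that memory goes. The sketch below counts only the key-value cache, assuming 16-bit cache entries and the published shape shared by Mistral 7B and Llama 3 8B (32 layers, 8 key-value heads, head dimension 128); it ignores weights, compute buffers, and any cache quantization your runtime may apply, so treat the result as a lower bound rather than a prediction.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   num_ctx: int, bytes_per_elem: int = 2) -> int:
    """Keys plus values: 2 tensors per layer, n_kv_heads * head_dim per token."""
    return 2 * n_layers * n_kv_heads * head_dim * num_ctx * bytes_per_elem


GIB = 2 ** 30
for ctx in (2048, 32768):
    size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, num_ctx=ctx)
    print(f"num_ctx={ctx}: {size / GIB:.2f} GiB")
# num_ctx=2048: 0.25 GiB
# num_ctx=32768: 4.00 GiB
```

Under these assumptions the cache alone grows by 3.75 GiB. Older architectures that keep one key-value head per attention head need proportionally more per token.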
The per-request approach lets you size the window to the actual conversation. You pay the memory cost only when a long request arrives. The downside is that every client must remember to set it. Forget the option in one code path (a debug script, a fallback retry, a notebook used by another team) and that path will silently regress to the default ceiling. In agent codebases the per-request approach is usually wrapped inside the HTTP client so it cannot be forgotten. The Modelfile approach is usually layered on top so the default is also safe.
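A minimal sketch of such a wrapper follows, using only the standard library. The function names and the DEFAULT_NUM_CTX value are hypothetical; the request shape (model, messages, options.num_ctx, stream) mirrors the curl probe shown earlier.

```python
import json
import urllib.request

DEFAULT_NUM_CTX = 8192  # hypothetical project-wide setting


def build_chat_payload(model: str, messages: list,
                       num_ctx: int = DEFAULT_NUM_CTX, **options) -> dict:
    """Build an /api/chat body that always carries an explicit num_ctx."""
    merged = dict(options)
    merged["num_ctx"] = num_ctx  # set last so it is never missing
    return {"model": model, "messages": messages,
            "options": merged, "stream": False}


def chat(model: str, messages: list, num_ctx: int = DEFAULT_NUM_CTX,
         base_url: str = "http://localhost:11434", **options) -> dict:
    """Send the request; every code path that uses this inherits num_ctx."""
    payload = build_chat_payload(model, messages, num_ctx, **options)
    req = urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

If debug scripts, retries, and notebooks all go through chat rather than building requests by hand, there is no code path left in which the option can be omitted.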
A reasonable convention is: pick a num_ctx value that matches the model's documented maximum, or the largest value your hardware can hold if that is smaller, build a Modelfile variant at that ceiling, and have the agent's HTTP client additionally pass num_ctx per request as defense in depth. That way a stale Modelfile, a model swap, or a deployment that pulled an upstream image does not silently regress the ceiling.
What this means for agent architecture
Once you know the ceiling is real and configurable, several agent-design rules of thumb shift.
Memory budgeting becomes a first-class concern. Treat num_ctx as a hard resource just like VRAM and CPU. Build a tokenizer-aware reservation system: every sub-component of your prompt (system instructions, tool schemas, scratchpad, recent dialogue, retrieved chunks) declares an upper bound, and the orchestration layer refuses to pack a request whose reservations exceed the configured ceiling. This is the same discipline good RAG systems already apply to retrieved context; the only news here is that the local-inference layer needs the same accounting.
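One possible shape for that reservation check is sketched below. The class and function names, the component sizes, and the 512-token output reserve are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reservation:
    name: str
    max_tokens: int  # upper bound this component promises to respect


class BudgetExceeded(Exception):
    pass


def check_budget(reservations: list, num_ctx: int,
                 reserve_for_output: int = 512) -> int:
    """Return the spare tokens, or refuse if reservations exceed num_ctx."""
    total = sum(r.max_tokens for r in reservations) + reserve_for_output
    if total > num_ctx:
        raise BudgetExceeded(
            f"reserved {total} tokens but num_ctx is only {num_ctx}"
        )
    return num_ctx - total


plan = [
    Reservation("system_instructions", 400),
    Reservation("tool_schemas", 600),
    Reservation("scratchpad", 300),
    Reservation("recent_dialogue", 1500),
    Reservation("retrieved_chunks", 1200),
]
print(check_budget(plan, num_ctx=8192))  # 3680 tokens of headroom
```

The same plan checked against a 2048-token ceiling raises BudgetExceeded before any request is sent, which turns the silent failure into a loud one.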
Long-running agents need a summarization step. When the rolling dialogue starts to approach the budget (say, 70% of num_ctx), collapse older turns into a shorter recap before they get silently trimmed by the runtime. The trick is that you control the collapse, so you can preserve the facts the agent will need (user name, current task, open subgoals) while sacrificing low-value scaffolding (acknowledgments, verbose tool outputs, intermediate reasoning).
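A sketch of that collapse step is shown below. It assumes the first message is the system prompt, and it takes the token counter and the summarizer as injected callables, since both depend on your model and are outside the scope of this example. The function name and the choice to store the recap as a system message are assumptions.

```python
def compact_history(messages: list, count_tokens, summarize, num_ctx: int,
                    threshold: float = 0.7, keep_recent: int = 4) -> list:
    """Collapse older turns into one recap once usage crosses the threshold."""
    if keep_recent < 1:
        raise ValueError("keep_recent must be at least 1")
    used = sum(count_tokens(m["content"]) for m in messages)
    if used < threshold * num_ctx or len(messages) <= keep_recent + 1:
        return messages  # still within budget, or nothing old enough to fold
    system = messages[0]                 # assumption: system prompt comes first
    older = messages[1:-keep_recent]     # turns to be folded into the recap
    recent = messages[-keep_recent:]     # newest turns are kept verbatim
    recap = {
        "role": "system",
        "content": "Recap of earlier turns: " + summarize(older),
    }
    return [system, recap, *recent]
```

In practice summarize would be another model call with a prompt that names the facts to preserve (user name, current task, open subgoals), and count_tokens would wrap the model's own tokenizer.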
Tool result handling deserves special care. Many agents store raw JSON tool outputs in the message history. A single noisy tool call can flood the context with several hundred tokens of repeated keys and quoted strings. Wrap tool results in a normalizer that produces a compact textual summary plus a stable reference key (a hash, an ID, a local file path) that downstream calls can resolve again. The full payload lives outside the context window; only the summary stays in the conversation.
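As a rough illustration, the normalizer below stores the full payload in an external store keyed by a content hash and returns only a one-line description for the message history. The function names and the summary format are hypothetical, and a plain dict stands in for whatever storage you actually use.

```python
import hashlib
import json


def describe(payload) -> str:
    """Produce a compact, human-readable description of a tool payload."""
    if isinstance(payload, dict):
        return "object with keys: " + ", ".join(sorted(map(str, payload)))
    if isinstance(payload, list):
        return f"list of {len(payload)} items"
    return repr(payload)[:80]


def normalize_tool_result(tool_name: str, payload, store: dict) -> str:
    """Keep the full payload outside the context; return a short summary."""
    raw = json.dumps(payload, sort_keys=True)
    # A content hash gives a stable key: the same payload maps to the same ref.
    ref = hashlib.sha256(raw.encode("utf-8")).hexdigest()[:12]
    store[ref] = payload
    return f"[{tool_name} ref={ref}] {describe(payload)}"


store = {}
line = normalize_tool_result("search", {"hits": [1, 2, 3], "total": 3}, store)
print(line)
```

Only the returned line goes into the conversation. A follow-up tool that needs the details can look them up in the store by ref instead of forcing the model to carry the raw JSON through every turn.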
Finally, your evaluation harness needs a memory-spanning test. Single-turn benchmarks will never detect a context regression. Add at least one fixture where the answer depends on information mentioned ten or twenty turns earlier. Run it on every model change, every prompt change, every config bump. If the silent ceiling drops again (because a new model has a different default, because an upgrade reset a config, or because a refactor accidentally stripped the num_ctx option), the fixture will fail loudly and you will catch the regression at CI time instead of in front of a confused user.
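A fixture of that kind can be quite small. In the sketch below the agent is any callable that takes a message list and returns the reply text; the fact key, fact value, and filler wording are made up for illustration. In a real suite the filler turns should be long enough that the whole dialogue clearly exceeds the default ceiling.

```python
def build_memory_fixture(fact_key: str, fact_value: str, filler_turns: int) -> list:
    """Plant a fact in the first turn, pad with filler, then ask for the fact."""
    messages = [
        {"role": "user", "content": f"Remember this: {fact_key}={fact_value}."},
        {"role": "assistant", "content": "Noted."},
    ]
    for i in range(filler_turns):
        messages.append({"role": "user", "content": f"Unrelated question number {i}."})
        messages.append({"role": "assistant", "content": f"Unrelated answer number {i}."})
    messages.append({"role": "user", "content": f"What was {fact_key}?"})
    return messages


def passes_memory_fixture(agent, fact_key: str = "PROJECT_CODE",
                          fact_value: str = "K7-ALPHA",
                          filler_turns: int = 20) -> bool:
    """True if the agent's reply still contains the fact from the first turn."""
    reply = agent(build_memory_fixture(fact_key, fact_value, filler_turns))
    return fact_value in reply
```

Wired into CI, a single assert passes_memory_fixture(my_agent) is enough to turn a silent context regression into a failing build; my_agent here stands for whatever entry point your harness exposes.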
The takeaway
Ollama is a delightful tool for running models locally. Its default context ceiling, however, is sized for casual chat, not for the kinds of long, tool-heavy reasoning loops that AI agents demand. The dangerous part is not the default value itself but the silence around it: the runtime trims your prompt without telling you, your HTTP calls return 200, and your agent degrades in ways that look like model regression instead of configuration drift. Set num_ctx explicitly, log prompt_eval_count, build memory-spanning fixtures into your test suite, and design your agent's working memory around a token budget you actually own. The local-inference stack rewards this kind of discipline. It also punishes its absence: quietly, persistently, and at exactly the moment your users start to trust the agent.