What we were sold
We have spent years chasing a myth.
The promise was simple: give the model more tokens, and it will understand more. We built context windows that stretch into the hundreds of thousands, then millions. We treated the input buffer like a storage tank, assuming volume alone would solve the problem of memory and reasoning.
Then came the panic over context rot. The reports were dire. Performance would hold steady, then collapse. The cliff was real. The panic was misplaced.
Today, the landscape has shifted. The crisis is no longer universal. It has relocated.
Where it went
The crisis didn't end, it relocated.
If you run a dense, multi-step workflow on a modern frontier system, you'll notice something unexpected. The model doesn't simply stretch further. It reads differently. It learns to skim, flag, and return to key passages without losing its thread. The early "lost in the middle" failure that plagued the first generation of long-context models has largely been overcome. The degradation curve that once dropped like a stone now flattens out well beyond the practical limits of most applications.
But take that exact same prompt and run it on a local model or a smaller open-weight system, and the rot hits fast. Much faster. Well before the advertised window fills, these systems begin to lose their grip. The cliff hasn't moved. It's just been buried under the floorboards of larger architectures. The real limit isn't the number of tokens we can load. It's how we manage the complexity of those tokens before the signal drowns in noise.
Frontier models
Frontier systems have largely conquered the early collapse.
The improvement isn't magic. It's the result of targeted training and architectural tweaks that force the model to develop internal routing mechanisms for long sequences. When fed a dense report or a multi-turn conversation, the system doesn't treat every paragraph equally. It learns to allocate attention dynamically, weighting recent outputs and initial instructions heavily while selectively returning to critical mid-sequence data.
The U-shaped recall curve that once defined long-context failure has been flattened. This isn't a guarantee. It's a threshold. Frontier models have moved the cliff far enough out that most practical applications never see it. But they aren't immune forever. They're simply better at buying time.
Local and small models
Local and smaller models hit the wall early.
The rot hasn't disappeared. It's just relocated to the systems that run on consumer hardware, private servers, and edge devices. These models optimize their internal representations for short to medium contexts during training, so they fail to maintain discriminative capacity as sequences grow.
It looks fine at first. The model responds coherently. It follows instructions. It generates plausible text. Then, as the sequence approaches a critical threshold, the degradation strikes abruptly. Recall drops. Reasoning fractures. The model stops tracking constraints it was just following. This happens at a much smaller fraction of the advertised window than on frontier systems. A twenty-thousand-token window on a smaller model might effectively behave like a four-thousand-token window in practice.
The advertised capacity is a lie. The operational reality is what matters.
Why it happens
Attention dilution is the engine of the collapse.
The math is unforgiving. Transformer attention relies on a softmax function that enforces a strict zero-sum constraint: the total attention mass across all tokens must always equal exactly one. As the sequence length increases, that mass spreads thin. A relevant token that once commanded eighty-eight percent of the model's focus can drop to a non-dominant twelve percent. The signal doesn't vanish. It gets buried under uniform noise.
This isn't a bug in the training data. It's a feature of the normalization function. Early tokens, or attention sinks, attract disproportionately high scores regardless of semantic relevance. They act as fixed dumping grounds that continuously siphon attention away from later, contextually critical information. The model isn't forgetting. It's diluting.
Compounding factors
Positional bias and memory eviction compound the damage.
Models naturally weight recent and initial tokens more heavily, creating a U-shaped performance curve across the input sequence. Information in the middle suffers a steep reduction in accuracy. In conversational agents, this is particularly destructive. System instructions occupy the initial positions. The most recent tool outputs or conversation turns sit at the terminal positions. The intermediate layers, which typically hold historical reasoning, retrieved knowledge chunks, and previous action outputs, fall into a vulnerable zone where attention weights are systematically suppressed.
Add finite memory constraints, and the key-value cache forces a hard eviction strategy once the window fills. The memory footprint scales quadratically with sequence length. New information actively pushes older history out of the fixed-size queue. Critical early instructions or domain definitions get overwritten. Agents drop specific guidelines. They replace precise definitions with vague approximations. This is silent drift. It looks like competence until it isn't.
There's a big difference between a model's full context capacity and what it can actually use well. Overload an LLM the way too much information overwhelms you, and the quality drops, often before you notice.
— Mandelson Fleurival
What actually works
Treat context like a workspace, not storage.
The mitigation strategies aren't theoretical. They're operational. We need to stop treating context like storage and start treating it like a workspace, and three moves carry the load.
- Short isolated sessionsKeep it lean
- Keep the active context lean. You don't need to feed the model everything at once. You break the workflow into discrete, focused passes.
- Compaction and summarizationPrune
- Prune dead weight before it dilutes the signal. Compress historical turns, discard low-signal retrievals, and keep only the structural backbone of the conversation.
- A lean orchestrator with sub-agentsMost important
- Use a lean orchestrating model that delegates context-heavy work to sub-agents. The orchestrator stays focused on strategy, decision-making, and constraint tracking. The heavy lifting gets pushed down to specialized workers that handle their own narrow contexts.
The orchestrator doesn't hold the entire conversation. It holds the map. The sub-agents hold the terrain. This isn't a workaround. It's the only reliable path forward.
Honest limits
No amount of prompting fixes the architecture.
No amount of clever prompting or window stretching will fix the underlying architecture. Scaling parameters and expanding context limits will keep yielding diminishing returns. Diverse architectural families converge on the same structural inductive biases. They all lack the native mechanism for reliable document-scale reasoning.
Even frontier models will eventually hit a wall. The rot isn't a bug in the current design. It's a feature of how sequence models aggregate distant dependencies. Audio models exhibit the same collapse. State-space models converge on the same scaling curves. The limitation is fundamental. It's baked into the math. We can mitigate it. We can delay it. We can't eliminate it without changing how the model processes sequence data.
Takeaway
Build better workflows, not bigger buckets.
The future of long-context AI isn't about building bigger buckets. It's about building better workflows. We need to accept that context is a finite resource, not an infinite one, and manage it like one. Keep the active window tight. Delegate the heavy lifting. Let the models do what they do best.
The rot will stop mattering when we stop feeding it.
Share this post
Keep reading
More worth your time
Blog post · AI, Building, Process
Catching the 7,000-character write
When your local model's tool call drops a required parameter and a long file almost gets thrown away.
Read post →
Blog post · AI, Building, Process
The bottom-up edit rule
When a model queues five edits against one file, working top-down is a bug. Here's the order that fixed it.
Read post →Stay in the loop
New posts, same voice.
Get a short email when I publish something new. No weekly digests, no link dumps — just the essays.