Last week I watched a coding agent lose its mind at the 35-minute mark. It had 120K tokens of accumulated input — well within the model's advertised 200K window — and it started confidently refactoring files it had already fixed, contradicting instructions it had acknowledged two minutes earlier. The window wasn't full. The model's brain was.
Researchers at Chroma tested 18 frontier models and found every single one degrades as input grows. Not near the limit — continuously, from the very first kilotoken. They call it context rot, and once you understand the mechanics, you stop trusting long prompts.
The Numbers Are Brutal
The headline finding comes from multi-document retrieval tasks. Hand a model 20 documents and ask it to find information in one of them. Where you place the target document matters more than what's in it.
| Document position | Retrieval accuracy |
|---|---|
| 1 (beginning) | ~75% |
| 5–15 (middle) | 45–55% |
| 20 (end) | ~72% |
That's a 20-plus percentage point gap driven entirely by position, not relevance.
Scale it up and things get uglier. The NoLiMa benchmark tested 13 frontier models at 32K tokens, a quarter or less of most advertised limits. Eleven of them dropped below 50% of their baseline accuracy. GPT-4o went from 99.3% to 69.7%. These aren't research toys; they're the models people ship production systems on.
The practical rule: effective reasoning capacity sits at roughly 60–70% of the number on the spec sheet. Your 200K model is a 130K model. Your 128K model is an 85K model. Every token beyond that threshold takes up attention and bills you without contributing to the answer.
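If you want that rule as code, here's a trivial budgeting helper; the 0.65 discount is just the midpoint of the 60–70% range, an assumption to tune against your own evals, not a measured constant:

```python
def effective_budget(advertised_window: int, discount: float = 0.65) -> int:
    """Plan prompts against a discounted window: assume only ~60-70%
    of the advertised context supports reliable reasoning."""
    return int(advertised_window * discount)

print(effective_budget(200_000))  # 130000: the "200K" model
print(effective_budget(128_000))  # 83200: the "128K" model, roughly 85K
```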
Three Mechanisms Compounding the Decay
Lost-in-the-middle gets all the press, but two other forces make it worse.
Attention dilution is pure arithmetic. At 10K tokens, self-attention manages 100 million pairwise relationships. At 100K tokens, that's 10 billion. Each token's share of the model's focus shrinks with scale. The model doesn't just ignore the middle — it gradually loses grip on everything.
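The dilution is easy to see in a back-of-envelope loop. This illustrates the quadratic pair count and the shrinking per-token share, not any specific attention implementation:

```python
# Softmax attention weights for each query sum to 1, so the average
# weight available per attended token falls as 1/n.
for n_tokens in (10_000, 100_000):
    pairs = n_tokens ** 2        # pairwise query-key interactions
    avg_share = 1 / n_tokens     # average attention weight per token
    print(f"{n_tokens:>7} tokens: {pairs:.0e} pairs, ~{avg_share:.0e} avg share")
```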
Distractor interference is the sneaky one. Semantically similar but irrelevant content actively misleads retrieval. When a coding agent accumulates 80K tokens of file exploration — grepping dependencies, reading test fixtures, scanning dead ends — those plausible-but-wrong code snippets compete with the actual answer for attention weight. This isn't passive noise. It's adversarial noise your own agent generated.
Chain-of-Thought Backfires Here
This one stung. In short inputs, CoT prompting reliably boosts reasoning. In long inputs, it can degrade performance. The reasoning chain generates additional tokens, which means more pairwise attention relationships, which means faster dilution. You asked the model to think harder, and the thinking itself crowded out the signal it needed.
The instinct when an agent gets confused at 80K tokens is to add structured reasoning — "let me think through this step by step." That instinct is wrong. You're adding fuel to the problem that's already burning.
Fixes That Survive Production
None of the solutions involve expanding the window. They all involve shrinking what goes into it.
Subagent isolation delivers the biggest gains. Instead of one agent accumulating 150K tokens, spawn child agents with fresh, bounded windows. Each subagent handles a focused subtask — scan a directory, run a test suite, analyze a single module — then returns a compressed 2K-token summary to the parent. Morph reports a 90% improvement over single-agent approaches with this architecture. The parent model never sees the noise; it only sees conclusions.
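A minimal sketch of the pattern (this is not Morph's implementation; call_model is a hypothetical stand-in for whatever LLM client you use, and the 2K summary budget comes from the figure above):

```python
def call_model(prompt: str) -> str:
    """Placeholder for a single LLM call through your client of choice."""
    raise NotImplementedError

def run_subagent(task: str, context: str, summary_budget: int = 2_000) -> str:
    """Run one focused subtask in a fresh window, then compress the result
    so the parent only ever sees conclusions, never the exploration noise."""
    raw = call_model(f"Task: {task}\n\nRelevant input:\n{context}")
    return call_model(
        f"Summarize the findings below in under {summary_budget} tokens, "
        f"keeping only conclusions and file references:\n\n{raw}"
    )

def orchestrate(subtasks: list[tuple[str, str]]) -> str:
    """Parent agent: its prompt holds compressed summaries, not raw output."""
    summaries = [run_subagent(task, ctx) for task, ctx in subtasks]
    return call_model("Combine these subtask reports:\n\n" + "\n\n".join(summaries))
```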
Observation masking tackles the token bloat that accumulates during multi-turn tool use. Your agent ran git log twenty turns ago and that 3K-token output is still sitting in the conversation. Replace it with a pointer: [git-log: 47 commits, latest abc1234]. Studies show this achieves over 50% cost reduction while matching problem-solving performance. The information isn't gone — the agent can re-fetch if needed. It's just not polluting the attention budget.
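A sketch of the masking step, assuming a chat-style message list; the role and field names are illustrative, not any particular framework's schema:

```python
def mask_old_observations(messages: list[dict], keep_last: int = 3) -> list[dict]:
    """Collapse tool outputs older than the last `keep_last` tool turns
    into one-line pointers. The agent can re-run the tool if it needs
    the full output again."""
    tool_turns = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    stale = set(tool_turns[:-keep_last]) if keep_last else set(tool_turns)
    masked = []
    for i, m in enumerate(messages):
        if i in stale:
            pointer = (f"[{m.get('tool_name', 'tool')}: output elided, "
                       f"{len(m['content'])} chars; re-run to refresh]")
            m = {**m, "content": pointer}
        masked.append(m)
    return masked
```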
Memory pointers push the same principle further. Instead of embedding raw data in the prompt, store it externally and reference it by ID. "See file_analysis_7 for the dependency graph." One evaluation measured 84% token reduction in web search tasks with negligible accuracy loss.
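The mechanics fit in a few lines. An in-process dict stands in here for what would be a database or on-disk store in production, and the ID format mirrors the example above:

```python
class MemoryStore:
    """External storage the agent references by ID instead of inlining."""

    def __init__(self) -> None:
        self._items: dict[str, str] = {}
        self._counter = 0

    def put(self, label: str, data: str) -> str:
        """Store raw data outside the prompt; return a short pointer."""
        self._counter += 1
        key = f"{label}_{self._counter}"   # e.g. "file_analysis_7"
        self._items[key] = data
        return key

    def get(self, key: str) -> str:
        """Dereference a pointer only when the agent actually needs it."""
        return self._items[key]

store = MemoryStore()
ref = store.put("file_analysis", "<multi-thousand-token dependency graph>")
prompt_line = f"See {ref} for the dependency graph."  # ~10 tokens, not thousands
```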
Position-aware structuring is the simplest lever. Put critical instructions at the very beginning and very end of your prompt. Never bury constraints in the middle of a long system prompt. The U-curve is a fact of the architecture — design around it instead of pretending it isn't there.
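In practice that means assembling prompts edge-first. A sketch, with section labels that are purely illustrative:

```python
def build_prompt(constraints: str, reference: str, task: str) -> str:
    """Pin constraints to both ends of the prompt, where the U-curve
    says retrieval is strongest; park bulk reference in the middle."""
    return "\n\n".join([
        f"CONSTRAINTS (do not violate):\n{constraints}",        # primacy slot
        f"REFERENCE MATERIAL:\n{reference}",                    # weakest slot
        f"TASK:\n{task}",
        f"REMINDER, constraints still apply:\n{constraints}",   # recency slot
    ])
```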
The Arms Race Nobody Won
The industry spent 2025 racing to expand windows. Gemini hit 1M tokens. Claude and Llama followed. But if meaningful degradation kicks in at 32K and effective capacity tops out at 60–70% of the advertised number, most of that headroom is decorative.
The people shipping reliable agents aren't the ones with the biggest input buffers. They're the ones who got ruthless about what goes in.