You've been debugging your prompt for an hour. You've tried different phrasings, added examples, restructured the whole thing. The model still gives garbage. Here's a thought: maybe the prompt was never the problem.

The Chroma Study Nobody Can Ignore

Chroma Research recently tested 18 frontier models — Claude Opus 4, GPT-4.1, Gemini 2.5 Pro, Qwen3-235B, and more — on a simple question: does adding more context make models worse, even when task difficulty is held constant?

The answer is yes. Universally. Every single model degraded as input length grew. Not because the task got harder. Not because the instructions got confusing. Just... more tokens in, worse output out.

They called it context rot, and it's an architectural property of transformers, not a bug you can patch with better training data.

The study controlled for task difficulty: the needle-in-a-haystack task was identical regardless of how much surrounding text was packed in. The only variable was input length, and performance still dropped. Lower semantic similarity between the question and the needle made the decline steeper, and adding distractor passages compounded it further.

The Weirdest Finding

Here's the one that broke my mental model: models perform worse when the surrounding context is coherent.

Chroma found that shuffling the haystack — destroying its logical flow — actually improved retrieval performance. When the filler text reads like a well-structured document, the model tries to follow the narrative thread and gets lost. When it's jumbled nonsense, the model apparently has an easier time spotting what doesn't belong.

This has direct implications for RAG pipelines. If you're stuffing retrieved chunks into the context window and carefully ordering them by relevance or chronology, you might be making things worse. The model isn't reading your context like a human reader — it's attending to it through a softmax distribution, and coherent text creates attention sinks that pull focus away from the actual answer.
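One way to act on this finding is to treat chunk ordering as something you measure rather than assume. The sketch below (hypothetical helper names, not from the study) shows a deterministic shuffle you can A/B test against your current relevance-ordered assembly:

```python
import random

def assemble_context(chunks, shuffle=True, seed=0):
    """Join retrieved chunks into one context string.

    Per Chroma's finding, a shuffled (incoherent) haystack can retrieve
    better than a carefully ordered one. Shuffling is offered here as an
    ablation to run against your own pipeline, not as a guaranteed win.
    """
    ordered = list(chunks)
    if shuffle:
        # Seeded shuffle so the A/B comparison is reproducible.
        random.Random(seed).shuffle(ordered)
    return "\n\n".join(ordered)

chunks = [f"chunk {i}: retrieved passage ..." for i in range(5)]
coherent = assemble_context(chunks, shuffle=False)
scrambled = assemble_context(chunks, shuffle=True)
```

Run your eval set against both variants; the only change is ordering, so any quality delta is attributable to coherence.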

Multi-Turn Is Where It Really Hurts

Single-turn degradation is one thing. Multi-turn conversations are where context rot becomes catastrophic.

PromptHub's research found that accuracy can drop by 39% after just two back-and-forths. Not twenty. Not fifty. Two.

They tested three recovery strategies:

| Strategy | How it works | Recovery |
| --- | --- | --- |
| CONCAT | Collect all info, send as one fresh prompt | 95.1% of baseline |
| RECAP | Re-send all prior shards on the final turn | 66-77% of baseline |
| SNOWBALL | Prepend prior context at each turn | +12-15 pts over naive |

CONCAT dominates. The takeaway is blunt: if you care about output quality, stop having long conversations. Collect your information, then make one clean call.

What Actually Works

Forget the theoretical fixes. Here's what I've seen work in production:

Kill multi-turn when quality matters. If you're building an agent that gathers information across steps, don't let the conversation accumulate. Collect context externally, then send a single consolidated prompt for the final generation. The CONCAT pattern isn't elegant, but 95.1% vs 60% isn't a close call.
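The CONCAT pattern is simple enough to sketch. Here `call_model` stands in for whatever LLM client you use (a hypothetical placeholder, not a real API):

```python
def concat_call(shards, question, call_model):
    """CONCAT pattern: gather every piece of context first, then make
    one consolidated call instead of a multi-turn conversation.

    `call_model` is a stand-in for your LLM client function that takes
    a prompt string and returns a completion.
    """
    context = "\n\n".join(shards)
    prompt = (
        "Use the context below to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    # One clean call; no accumulated turn history to rot.
    return call_model(prompt)
```

The point is that information gathering (tool calls, retrieval, user clarification) happens outside the conversation, and the model only ever sees one fresh, consolidated prompt.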

Front-load and back-load your critical content. The "lost in the middle" effect is well-documented by now — models attend more to the beginning and end of context windows. If you're injecting retrieved documents, put the most relevant ones first and last. Bury the lower-confidence matches in the middle where the model was going to half-ignore them anyway.
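A minimal way to implement that ordering, assuming your retriever returns (document, score) pairs (hypothetical shape, adapt to your own):

```python
def edge_load(docs_with_scores):
    """Order documents so the highest-scoring ones sit at the start and
    end of the context, with lower-confidence matches in the middle,
    working with the 'lost in the middle' attention pattern rather
    than against it.
    """
    ranked = sorted(docs_with_scores, key=lambda d: d[1], reverse=True)
    front, back = [], []
    for i, (doc, _score) in enumerate(ranked):
        # Alternate: best to the front, second-best to the back, etc.
        (front if i % 2 == 0 else back).append(doc)
    # Best first, second-best last, weakest documents centered.
    return front + back[::-1]
```

With five documents ranked d1 (best) through d5 (worst), this yields the order d1, d3, d5, d4, d2: the strongest matches at the edges, the weakest in the dead zone.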

Watch your context utilization, not just your token count. The Chroma study tested models well within their stated context windows and still found degradation. A model with a 200K context window doesn't perform equally well at 10K and 100K input tokens. "Fits in the window" is necessary but nowhere near sufficient.
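In practice this means budgeting context as a fraction of the window, not against its hard limit. A sketch, where the 0.5 budget is an illustrative default rather than a measured threshold:

```python
def context_utilization(prompt_tokens, window=200_000, budget=0.5):
    """Return (utilization ratio, over_budget flag).

    'Fits in the window' is necessary but not sufficient: degradation
    shows up well below the stated limit, so cap usable context at a
    fraction of the window and alert when a call exceeds it.
    """
    ratio = prompt_tokens / window
    return ratio, ratio > budget
```

Logging this ratio per call surfaces the pipelines that quietly creep toward the window limit, which is where quality erodes first.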

Compress aggressively. Prompt compression methods like LLMLingua-2 can shrink inputs 2-5x with limited quality loss. If your pipeline is stuffing 50K tokens of context when 15K would do, you're paying a quality tax on every call — not just a cost tax.
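To show where compression sits in a pipeline, here is a deliberately naive extractive version: keep the sentences with the most word overlap with the query, up to a word budget. This is a toy stand-in for learned compressors like LLMLingua-2, not their algorithm:

```python
def compress_by_overlap(context, query, budget_words=50):
    """Toy extractive compression: keep the sentences sharing the most
    words with the query, within a word budget. Real compressors
    (e.g. LLMLingua-2) are learned and token-level; this only
    illustrates the shape of the pipeline step.
    """
    q_words = set(query.lower().split())
    sentences = [s.strip() for s in context.split(".") if s.strip()]
    scored = sorted(
        sentences,
        key=lambda s: len(q_words & set(s.lower().split())),
        reverse=True,
    )
    kept, used = [], 0
    for s in scored:
        n = len(s.split())
        if used + n > budget_words:
            continue
        kept.append(s)
        used += n
    # Restore original order so the compressed context stays readable.
    kept.sort(key=sentences.index)
    return ". ".join(kept) + ("." if kept else "")
```

Even this crude heuristic makes the trade-off concrete: every sentence you drop is budget reclaimed for the tokens that actually answer the question.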

The Claude-Specific Angle

One detail from Chroma's study caught my eye: Claude models showed the lowest hallucination rates across the board, but Claude Opus 4 also showed the most pronounced performance gap between focused (300-token) and full (113K-token) inputs. It's more cautious — it abstains rather than guessing when context gets noisy — but that caution means the gap between "clean context" and "bloated context" is even starker.

If you're building on Claude and wondering why your carefully crafted system prompt seems to lose its grip after a few turns, this is probably why. The model isn't forgetting your instructions. It's drowning in accumulated context noise.

The Uncomfortable Takeaway

We spent 2024 and 2025 celebrating ever-larger context windows. Million-token models! Stuff entire codebases in! The context window race was treated like the parameter count race before it — bigger is better, always.

Context rot says: not really. Bigger windows give you capacity, but every token you add past what's strictly necessary degrades the model's ability to use the tokens that matter. The skill isn't maximizing how much context you can fit. It's minimizing how much context you actually send.

The best prompt engineers I know aren't writing longer prompts. They're writing shorter ones.