Why more context makes a model less reliable

June 2026 · Kingsfield

The industry sells bigger context windows as the fix for accuracy. A 2025 study measured the opposite effect.

Frontier models now ship context windows of a million tokens or more. The advice that comes with them is consistent: give the model everything. The whole matter file. The full set of regulations. The stack of fifty contracts. Then ask. More context, the pitch goes, means a better answer.

In 2025, Chroma published a technical report titled Context Rot: How Increasing Input Tokens Impacts LLM Performance. It tested models across a range of context lengths and found that accuracy degrades as the window fills, even on simple tasks, and that the degradation worsens under the multi-step reasoning real work demands. Filling the window does not make the model sharper. The measured effect runs the other way.

Why the cost stays hidden

The loss is easy to miss, because the model keeps writing well. Fluent prose comes out of a full window the same as an empty one. The accuracy that slips does not show up in the grammar.

The loss shows up in the specific claims and the citations, the parts a reader is least likely to check line by line. A brief that reads smoothly but cites an authority for a proposition the authority does not support is exactly the output this curve produces.

Three patterns sitting on the curve

Dump-the-corpus retrieval. Paste a large body of authority into the window and ask the model to find the answer. The bigger the paste, the less reliably the model distinguishes what any single source actually says.

The multi-document workflow. Combine fifty contracts or two hundred brief sections in one context and ask for a comparison. The same failure, at the working level.

The bigger-window roadmap. The promise that the next, larger window will fix it. A larger window runs against the same curve. On the simple-discrimination tasks that decide accuracy, more room makes the problem worse, not better.

What the lawyer already does

An experienced lawyer holds an enormous amount of context: the doctrine, the facts, the judge's habits, the posture of the matter. They still work one issue at a time. The broad context is judgment, and it lives with the lawyer. The individual ruling is decided on a small, clear input. A verification step should work the same way: rule on one citation against the authority it points to, on the smallest evidence that settles the question, rather than by scanning everything at once.

Bigger windows are sold as the road to reliability. On the tasks where reliability is actually measured, they move in the wrong direction.

Kingsfield is a judge for legal AI: it rules ACCEPT, REJECT, or INCONCLUSIVE on every citation an AI tool produces, on input from which client identifiers have already been stripped on your own machine. See how it works.
