
The citation that looked perfect and did not exist

June 2026 · Kingsfield

Generators fabricate citations at a measurable rate. Asking the same model to verify its own output does not reliably catch them.

In 2023, lawyers filed a brief in federal court that cited Varghese v. China Southern Airlines. The citation came with a reporter volume and page, a stated holding, and, once the judge asked to see the decision, a quoted excerpt. None of it was real. A generative model had invented the case and its citation, and when asked for the text, it produced the excerpt too. The court sanctioned the lawyers. The matter, Mata v. Avianca, became the reference point for what these tools do when no one checks them.

The list has grown since. Park v. Kim. Garner v. Kadince. Al-Hamim v. Star Hearthstone. Different courts, the same failure: a filing went out resting on authority that did not support the point it was cited for, or did not exist at all.

This is not a rare event that only careless users run into. Measured against legal-citation benchmarks, frontier models produce a fabricated or unsupported citation in a large share of legal answers; in some studies, the majority of them. On a per-citation basis, a Stanford RegLab analysis in 2024 found that roughly one in five citations from a leading model did not hold up. A lawyer who runs ten cites through a generator and trusts the output is, on those numbers, filing about two bad ones on average.
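The ten-citation arithmetic can be made explicit. This is a minimal sketch that assumes each citation fails independently at the same one-in-five rate, which real errors need not do; the rate is the one quoted above, not new data.

```python
# Assumption: each citation is bad independently with the same probability.
# Real errors can cluster (e.g. one bad source feeding several cites).
p_bad = 0.2   # "roughly one in five", the per-citation figure quoted above
n = 10        # citations run through the generator

expected_bad = n * p_bad                 # mean number of bad citations
p_at_least_one = 1 - (1 - p_bad) ** n    # chance the filing contains any bad citation

print(f"expected bad citations: {expected_bad:.1f}")
print(f"P(at least one bad):    {p_at_least_one:.2f}")
```

Under that assumption, the expected count is 2.0, and the chance that a ten-citation filing contains at least one bad citation is about 0.89: only 0.8^10, roughly 11%, of such filings come out clean.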

Why a model cannot check its own work

The common response is to ask the model to check itself, or to add a second model that reviews the first. Both run into the same wall. A model re-reading its own fabrication draws on the same training distribution that produced it. A second model, trained on largely overlapping data in similar ways, tends to carry the same blind spots. What looks like a second opinion is usually the first opinion, restated with confidence.

A reviewer that learned from the same data as the generator shares its blind spots. That is fluent agreement, not scrutiny.

A narrower technique is to look up whether the cited case exists, or whether the regulation number resolves. That catches the pure inventions. It misses the harder errors: a real case cited for a proposition it does not support, or the correct case with the wrong pinpoint. Those pass the lookup, because the citation itself is real. The only way to catch them is to read the authority.

What actually catches it

The reliable check is the unglamorous one. Resolve the citation to the authority it points to, read what that authority says, and decide whether it supports the claim. Accept, reject, or inconclusive. One citation at a time. That is a different job from writing the brief, and it has to be performed by something that did not write the brief and has no incentive to agree with it.
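The per-citation check described above can be sketched as a small function. This is a minimal illustration under stated assumptions, not Kingsfield's implementation: `verify`, `Citation`, `Verdict`, the resolver and judge interfaces, the toy corpus, and the case names Doe v. Roe and Smith v. Jones are all hypothetical. The judge is the part that has to actually read the authority; here it is a deliberately crude substring match so the example runs. The `exists` column shows what a lookup-only check reports for the same inputs.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Verdict(Enum):
    ACCEPT = "ACCEPT"
    REJECT = "REJECT"
    INCONCLUSIVE = "INCONCLUSIVE"


@dataclass(frozen=True)
class Citation:
    cite: str          # the citation as the generator wrote it
    proposition: str   # the claim the citation is offered for


# Hypothetical interfaces: a resolver returns the authority's text, or None
# if the citation does not resolve; a judge returns True (supports),
# False (does not support), or None (cannot tell from this text).
Resolver = Callable[[str], Optional[str]]
Judge = Callable[[str, str], Optional[bool]]


def verify(c: Citation, resolve: Resolver, judge: Judge) -> Verdict:
    text = resolve(c.cite)
    if text is None:
        # Assumes the resolver's coverage is complete, so "not found"
        # means "does not exist" rather than "lookup failed".
        return Verdict.REJECT
    supported = judge(text, c.proposition)
    if supported is None:
        return Verdict.INCONCLUSIVE
    return Verdict.ACCEPT if supported else Verdict.REJECT


# Toy corpus and judge, for illustration only. A real support judgment
# means reading the authority; substring matching is a crude stand-in.
CORPUS = {
    "Doe v. Roe": "The limitations period is tolled while the debtor is in bankruptcy.",
    "Smith v. Jones": "",  # resolves, but the text on hand is empty
}


def toy_judge(text: str, proposition: str) -> Optional[bool]:
    if not text:
        return None  # nothing to read, so no ruling either way
    return proposition.lower() in text.lower()


if __name__ == "__main__":
    cites = [
        Citation("Varghese v. China Southern Airlines", "the limitations period is tolled"),
        Citation("Doe v. Roe", "the limitations period is tolled"),
        Citation("Doe v. Roe", "punitive damages are available"),
        Citation("Smith v. Jones", "the limitations period is tolled"),
    ]
    for c in cites:
        exists = c.cite in CORPUS  # what a lookup-only check would report
        verdict = verify(c, CORPUS.get, toy_judge)
        print(f"{c.cite} | exists={exists} | {verdict.value}")
```

The third row is the error a lookup misses: Doe v. Roe resolves, so an existence check passes it, but the authority says nothing about punitive damages and the verdict is REJECT. Keeping the judge as a separate parameter mirrors the point above that the judgment should come from something other than the generator.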

The court in Mata did not care which model produced the citation. The signature on the brief belonged to the lawyer. Verification is the step that decides whether that signature is safe to give.

Kingsfield is a judge for legal AI: it rules ACCEPT, REJECT, or INCONCLUSIVE on every citation an AI tool produces, working on input whose client identifiers have already been stripped on your own machine.
