Methodology whitepaper

Measuring hallucination rates in legal AI: a methodology.

How we arrived at the under-0.3% claim, what it does and does not mean, and a playbook any firm or competitor can audit against.

By Aewita · April 21, 2026 · ~12 min read

Abstract

We report a measured upper bound on the hallucination rate of the Aewita legal reasoning system. Across 800 attorney-generated queries, stratified across 22 practice areas, blinded human evaluators recorded zero hallucinations under the test conditions. Zero events in 800 trials does not license a point estimate. Applying the rule-of-three, we place the one-sided 95% upper confidence limit at approximately 3/800 = 0.375%, published in shorthand as "under 0.3%." This paper documents the operational definition of hallucination we used, the query generation and scoring protocol, the statistical framing, the known limitations of the design, and the questions a firm should ask any AI vendor that reports a number at all.

1. The hallucination problem in legal AI

"Hallucination" is a term of art that has slipped into ordinary legal usage without a shared definition. For this paper we use an operational one: a hallucination is any output that asserts a legal proposition — the existence of a case, the contents of a holding, the text of a statute, the procedural posture of an authority — that is either fabricated or materially misstated relative to a primary source.

That definition covers three failure modes. Fabrication: the cited authority does not exist at all. Attribution error: the authority exists, but does not say what the model says it says. Material misstatement: the authority exists and is on point, but the quoted language, holding, or rule is altered in a way that changes the legal meaning. All three are disqualifying in a brief. Only the first is usually caught by the cursory checks that attorneys have developed for general-purpose chatbots.

General-purpose large language models, asked legal questions without retrieval, hallucinate at rates the Stanford HAI and RegLab research groups have reported as roughly 5% at the low end for narrow benchmarked tasks and above 30% for open-ended legal research. The band moves with jurisdiction, practice area, and query narrowness, and it has not collapsed on its own across model generations. Systems that ground answers in retrieved sources do better, but the published studies continue to find non-trivial hallucination rates across the major vendor products in legal research tasks.

Under ABA Model Rule 1.1 — competence — an attorney's duty extends to the technology the attorney uses. A tool that fabricates authority at rates anywhere above the low single digits is, at scale, incompatible with that duty. The firm is not relieved of the duty by delegating research to software. If anything, the comment to Rule 1.1 on technological competence raises the standard.

This is the constraint we set out to meet, and the number we set out to measure.

2. Experimental design

We assembled a test set of 800 queries, stratified across 22 practice areas. That yields roughly 36 queries per area, with some unevenness reflecting which practice areas contributed more questions. Practice areas included federal litigation, state civil procedure, securities, M&A, employment, IP (patents and copyright), tax, bankruptcy, immigration, family, criminal, estates and trusts, real property, environmental, healthcare, privacy and data protection, insurance, antitrust, ERISA, admiralty, administrative law, and constitutional.

Query generation drew from two sources. The first, real-matter-style questions, were contributed by practicing attorneys who wrote the kind of research question they would actually ask a junior associate — open-ended, practice-specific, often jurisdiction-specific. The second, edge-case probes, were written to stress the system at known failure surfaces for legal AI: ambiguous or overlapping jurisdictions, overruled or partially abrogated precedent, recent statutory amendments where older cases construe replaced language, split-circuit questions, and questions whose surface form suggests a common answer that is wrong under the governing authority.

The split was approximately 60% real-matter and 40% edge-case. That is a deliberately harder distribution than what we expect an average production query to look like. We wanted the measurement to sit inside the difficulty envelope of realistic work, not outside it.

Concretely, the edge-case set included: questions phrased to invite citation to a case that has been overruled, where the correct answer is that the cited authority no longer states the governing rule; jurisdiction-ambiguous questions where the answer depends on which of two overlapping regimes applies; questions that use deprecated statutory cross-references after an amendment; and questions that surface-match a well-known doctrine but are governed by a narrow exception. Each category was designed to break a different failure mode. A system that only knew how to refuse ambiguous questions would fail the real-matter set; a system that only knew how to answer plausibly would fail the edge cases. The union is harder than either.

Scoring was blinded. Each answer was evaluated by an attorney who had not seen the model's intermediate reasoning trace. The evaluator saw only the final output as a user would, together with the underlying retrieved sources the system had surfaced. Their job was to decide, as an attorney would in practice, whether every asserted legal claim in the output could be supported by the retrieved primary authority.

Scoring was binary. An answer was marked "hallucinated" if any claim in the final output could not be supported by a retrieved primary source — case text, statutory text, or regulatory text from our corpus. Partial credit was not awarded. A single fabricated or misstated citation in an otherwise correct answer flipped the whole answer to the hallucinated class. We used the binary convention because it is the one that matters for a brief: an attorney who signs a filing does not get partial credit for accuracy.
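The binary convention is short enough to write down. The sketch below illustrates the scoring rule only; `Claim` and its `supported` flag are hypothetical stand-ins for the evaluator's per-claim judgment, not Aewita's evaluation tooling.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    supported: bool  # evaluator's judgment: backed by a retrieved primary source?

def score_answer(claims: list[Claim]) -> str:
    """Binary scoring: a single unsupported claim flips the whole answer."""
    if any(not c.supported for c in claims):
        return "hallucinated"
    return "clean"

# One misstated quote in an otherwise correct answer is disqualifying:
answer = [
    Claim("holding stated correctly", supported=True),
    Claim("quoted language altered from the source", supported=False),
]
print(score_answer(answer))  # hallucinated
```

The asymmetry is the point: there is no score an answer can earn that offsets one bad citation.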

A second evaluator independently re-scored a 15% random sub-sample as a check on inter-rater reliability. Agreement was unanimous at zero hallucinations on the sub-sample, which is consistent with the main result but is also the expected outcome when the underlying event count is itself zero. The stronger inter-rater test will come from a future round that includes deliberately seeded hallucinated outputs as a calibration baseline. We flag this in the limitations section rather than pretending the current inter-rater check is a stress test.

3. Statistical framing

Observed hallucinations across the 800 trials: zero.

Zero events is a problem for a point estimate. The empirical rate of 0/800 rounds to 0.00%, and published that way it would be both misleading and unfalsifiable. The honest move is to report an upper confidence bound that acknowledges we have finite data.

We use the rule-of-three. For n independent binary trials with zero observed events, the one-sided 95% upper confidence limit on the true event rate is well approximated by 3/n, provided n is not tiny. At n = 800, that gives 3/800 = 0.00375, or 0.375%. Stated in marketing shorthand, that becomes the "under 0.3%" headline; the more precise, technically correct form is "95% upper confidence limit under 0.4%, consistent with a true rate at or near zero under the test distribution."
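The arithmetic is checkable on the page. This snippet reproduces the rule-of-three figure and the constant behind it (the "3" is -ln(0.05), which is approximately 2.996):

```python
import math

n = 800  # trials; zero observed hallucinations

# Rule of three: with zero events in n trials, the one-sided 95% upper
# confidence limit on the true rate is approximately 3/n. The "3" is
# -ln(0.05) rounded up, from solving (1 - p)^n = 0.05 for small p.
ucl = 3 / n
print(f"-ln(0.05) = {-math.log(0.05):.3f}")  # 2.996
print(f"3/800     = {ucl:.4%}")              # 0.3750%
```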

Three explicit caveats sit underneath the number.

First, this is an upper bound, not a point estimate. It is not a claim that the hallucination rate is exactly 0%. It is a claim that, given the sample and the zero-event result, the true rate is very unlikely to exceed 0.4%, under the assumption that trials are independent and identically distributed within the test distribution.

Second, the test distribution is not the production distribution. We stratified by practice area and deliberately over-sampled edge cases. Real production traffic is skewed by client mix, current-events spikes, and which practice areas adopt the tool first. The UCL calculated on the test set does not automatically transfer to every production cohort.

Third, even a small number of observed hallucinations would have raised the UCL appreciably. At 1 event in 800, the 95% UCL rises to roughly 0.6%. At 2 events, roughly 0.8%. This is not a number that collapses to zero as we sample more. It converges toward the true rate, whatever that is.
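For readers who prefer exact bounds to the approximation, the one-sided 95% upper confidence limit for any event count can be computed by bisection on the binomial CDF. This is a standalone sketch, not part of our evaluation code:

```python
import math

def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def ucl_95(k: int, n: int) -> float:
    """One-sided 95% upper confidence limit on the event rate,
    i.e. the p solving P(X <= k) = 0.05, found by bisection."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > 0.05:
            lo = mid  # mid is still plausible; the bound lies higher
        else:
            hi = mid
    return lo

for k in (0, 1, 2):
    print(f"{k} events in 800 -> 95% UCL = {ucl_95(k, 800):.2%}")
```

At k = 0 this recovers the exact bound near 0.374%; at k = 1 and k = 2 it gives roughly 0.59% and 0.79%, the figures quoted above.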

4. Why this is the number we can claim

The upper bound is the number we can defend. It is not the number the model alone produces. A frontier reasoning model with no grounding would not hit this rate on this test set, and we did not expect it to. The rate comes from the combination of two choices we made at the outset.

First, answers are grounded in a complete authoritative corpus. Aewita indexes every U.S. case from 1665 to the present, together with federal and state statutes. The model is not asked to recall case law from its training weights. It reasons over primary text drawn from the corpus. Most fabrication in general LLMs comes from the model inventing an authority because it has no grounded alternative.

Second, every citation is independently verified against the retrieved primary source before the answer reaches the user. If a citation does not match, the output is revised or blocked. A system that trusts a model to check itself is upper-bounded by the model's own confidence calibration, and the research consistently shows that calibration is not reliable. Verification is a separate step, and that is the step that produces a measurable hallucination rate.
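Structurally, the verification gate is a deterministic check that sits outside the model. Below is a minimal sketch of that pattern, with hypothetical names and a naive substring match standing in for real quote verification; it is not Aewita's implementation.

```python
from dataclasses import dataclass

@dataclass
class Citation:
    cite: str         # e.g. a reporter citation string
    quoted_text: str  # language the answer attributes to the authority

def verify(c: Citation, corpus: dict[str, str]) -> bool:
    """Pass only if the authority resolves in the corpus AND the
    attributed language appears in the retrieved primary text."""
    source = corpus.get(c.cite)
    if source is None:
        return False  # fabrication: the authority does not exist
    return c.quoted_text in source  # attribution / misstatement check

def gate(citations: list[Citation], corpus: dict[str, str]) -> bool:
    """Block the answer unless every citation verifies independently."""
    return all(verify(c, corpus) for c in citations)
```

A production verifier would need citation normalization, pin-cite resolution, and tolerance for ellipses and cleaned-up quotations. The point of the sketch is the shape: the gate is a separate function that can return False and block the answer, rather than the model grading itself.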

Self-hosting makes both of these improvable on our cadence rather than an upstream provider's. When a failure surface is discovered, we add a probe and a fix. That iteration cadence is not compatible with building on top of a third-party API whose behavior changes on a schedule we do not control.

The under-0.3% number is the output of the system, not the model alone. That is the honest framing, and it is why we can stand behind it.

5. What firms should ask any AI vendor

If a firm is evaluating legal AI, three questions separate vendors who have done the work from vendors who have not.

One. What is your measured hallucination rate, and how was it measured? A vendor who cannot name a sample size, a scoring protocol, and a statistical framing is not reporting a rate. They are reporting an adjective. Ask for the protocol. Ask who scored the outputs. Ask whether the scoring was blinded to the model's own reasoning trace. Ask what counted as a hallucination — in particular, whether attribution errors and material misstatements were counted alongside outright fabrications.

Two. What independent verification stands between the model's output and the user? "We use a frontier model" is not verification. Verification is a separate process that can reject the model's output. If the only line of defense is the model reviewing itself, the number the vendor reports is upper-bounded by the model's own confidence calibration — which the research consistently shows is not reliable.

Three. Can you reproduce the measurement on queries we supply? This is the cleanest test. A firm that suspects a test set was curated to the tool can send its own. Any vendor whose methodology holds up should accept that exercise, with reasonable scoping. We do. The queries-you-supply test is the one that resolves the selection-bias question without argument.

6. Limitations and honest disclosures

The methodology has real limits and we document them here rather than in a footnote.

Sample size is small for some practice areas. Thirty-six queries per area is enough to catch coarse failure patterns, not enough to detect a rare failure mode inside a single narrow practice. We plan to grow the per-area sample over time, prioritizing areas where production traffic concentrates.

The test set was drawn from practicing attorneys who chose to contribute. That is a selection effect. Their query style may correlate with their practice, their jurisdiction, and their assumptions about what AI should do. The counter-pressure is that our edge-case probes were written specifically to break the pattern, including by authors skeptical of grounded-AI systems in principle. But the selection concern is real and we report it.

Production drift is the ongoing risk. A system that holds under-0.3% on a static test set can still degrade in production under a new query distribution, a corpus update, or a model change. We monitor live outputs with an automated verifier running behind every response and a lightweight attorney-spot-check queue on a rolling basis. Any increase in flagged outputs triggers a review before a release is allowed to ship.
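In its simplest form, the release gate described above reduces to a rate comparison. The sketch below is our own illustration of the pattern; the baseline threshold and function name are assumptions, not Aewita's actual alerting logic.

```python
def release_allowed(flagged: int, total: int, baseline_rate: float) -> bool:
    """Hold a release for attorney review when the rolling flag rate
    exceeds the baseline established on the previous release."""
    return total == 0 or flagged / total <= baseline_rate

# Hypothetical window: 5 flags in 1,000 responses against a 0.3% baseline
print(release_allowed(5, 1000, 0.003))  # False -> review before shipping
```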

Finally: the rule-of-three is an approximation. The exact binomial bound (Clopper-Pearson, for example) gives a 95% UCL of 0.374% at 0/800, essentially indistinguishable at this scale. We used the rule-of-three because it is transparent and checkable with arithmetic on the page.
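The zero-event exact bound is itself one line of arithmetic: solve (1 - p)^800 = 0.05 for p.

```python
# Exact one-sided 95% UCL for zero events in 800 trials:
p = 1 - 0.05 ** (1 / 800)
print(f"{p:.3%}")  # 0.374%
```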

A deeper limitation, which we flag because it matters more than it is usually given credit for, is the gap between what a blinded evaluator can detect and what a hallucination actually is. Our evaluators saw the retrieved sources and the final answer. They could confirm that every claim in the answer was supported by the sources. They could not, in the same evaluation pass, confirm that the retrieval had surfaced the correct set of sources for the query. A system that retrieves the wrong authority and then reasons correctly over it is not producing a hallucination by our definition, but it is producing a wrong answer. Retrieval-correctness is a separate measurement, and we run it separately. We are not folding it into the hallucination number because the two failure modes have different fixes and different rates. But a firm that reads this paper should understand that the number we report is specifically about the fidelity of the output to the retrieved authority — not about whether the retrieved authority was the authority the question called for.

7. Audit-ready

This methodology document is publishable in full. A firm that wants to reproduce the measurement against its own queries, on our system, can. We provide the query protocol and the binary scoring convention on request to firms evaluating Aewita.

The reason we publish this is not marketing. It is that the legal AI market has been, for three years, a market of numbers that cannot be audited. "Industry-leading accuracy" is not a number. "99% accurate" with no protocol is not a number. A rate that moves with a vendor's press cycle is not a number.

The only way that market matures is for vendors to publish methodology that a third party could run and falsify. This is ours. Falsify it if you can. We will update the paper.

A final note on what this paper does not do. It does not compare Aewita's rate against a specific named competitor's rate, because a comparison is only meaningful if both sides publish methodology at the same granularity, and the published rates from competing products currently do not meet that bar. We would rather report our own number, honestly, with the protocol attached, than produce a ranking table that no one can reproduce. When competitors publish methodology documents of this shape, a comparison table becomes possible, and we will build one. Until then, the honest comparison is: here is what we measured, here is how we measured it, here is what the number does and does not license.

Read the paper. Then try the product.

Fourteen days free. No credit-card surprises.