
Legal AI hallucination rates: why the number matters

The industry reports 5-35%. We measured under 0.3% at 95% confidence across 800 queries in 22 practice areas. Here is how, and what the number does and does not claim.

Samuel Anderson, CEO & Founder, Aewita · April 21, 2026 · 8 min read

Every week another story lands. A lawyer files a brief. The brief cites a case. The case does not exist. Sanctions follow. The firm name ends up in the trade press. The state bar opens a file.

This is the hallucination problem. It is the single largest gating issue for legal AI adoption, and it is the reason firms that piloted AI in 2023 are still asking the same questions in 2026. The answer has to be a number, measured rigorously, published honestly. Not a marketing adjective.

Here is ours. Under 0.3% at 95% confidence, measured across 800 queries in 22 practice areas. The rest of this piece explains what that number means, how we arrived at it, and what you should ask any AI vendor who claims a rate at all.

What "hallucination" actually means in legal AI

A hallucination, for our purposes, is a model output that asserts a legal fact that is false in a specific way. Three failure modes count:

  • The citation does not exist. No such case, no such statute section, no such regulation.
  • The citation exists, but was never cited for the proposition the model attached to it. The authority is real. The claim about what it says is invented.
  • The underlying holding, rule, or quoted language is materially misstated. The citation is real, the proposition is related, but the content is wrong in a way that changes a reader's interpretation.

Every one of those is a hallucination. The first is the famous one, the one that shows up in news stories. The second is more common and harder to catch. The third is the most dangerous, because a reader skimming the brief sees a real citation and moves on.

Good measurement counts all three. Vendor-friendly measurement counts only the first.

The landscape: 5 to 35 percent

General-purpose large language models, asked legal questions cold, hallucinate in the 20-35% range. That is what the Stanford RegLab and HAI studies have consistently reported since 2023. The number depends on practice area, jurisdiction, and how narrow the question is, but the band is stable.

Legal AI products built on top of those models, with retrieval layered on, do better. Published and third-party rates for the category cluster in the 5-15% range. Retrieval helps. It does not solve the problem, because a model that has the right source in front of it can still summarize that source incorrectly, attribute the wrong holding to the wrong case, or invent a pin cite.

No major legal AI competitor has published an audited hallucination rate for its current product. Not Harvey. Not CoCounsel. Not Westlaw AI. Not Legora. There are internal numbers. There are marketing phrases. There is no comparable, disclosed, measured figure.

Why the number matters for Rule 1.1

Model Rule 1.1 requires competence, and Comment 8, adopted in 2012, explicitly extends that duty to the benefits and risks of the technology a lawyer uses. A competent lawyer using AI has to know the error rate, and has to verify accordingly.

Run the math. If your AI hallucinates at 5%, one in twenty claims in a memo is wrong. You cannot ship that memo without checking every citation and every proposition. That is full human review. The productivity gain is the typing speed of the model, minus the verification tax. For most attorneys that is a wash, or worse.

At 0.3%, the math flips. Three claims in a thousand are suspect. You still verify, because you are a lawyer and your name is on the filing. But the verification is spot-checking, not re-research. The tool becomes an accelerator instead of a hazard.
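If you want to check the arithmetic yourself, a quick sketch follows. The 40-claim memo is an assumption for illustration, not a figure from our study, and it treats errors as independent across claims.

```python
# Back-of-the-envelope: per-claim error rate vs. per-document risk.
# The 40-claim memo length is an illustrative assumption, not a measured figure,
# and errors are treated as independent across claims.

def p_at_least_one_error(per_claim_rate: float, claims: int) -> float:
    """Probability a document contains at least one bad claim."""
    return 1 - (1 - per_claim_rate) ** claims

claims_per_memo = 40
for rate in (0.05, 0.003):
    expected = rate * claims_per_memo
    risk = p_at_least_one_error(rate, claims_per_memo)
    print(f"rate {rate:.1%}: ~{expected:.1f} expected bad claims per memo, "
          f"{risk:.0%} chance of at least one")

# At 5%: ~2.0 expected bad claims and an ~87% chance of at least one.
# At 0.3%: ~0.1 expected bad claims and an ~11% chance of at least one.
```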

This is why the number is not a marketing figure. It is the variable that decides whether a given tool is fit for purpose under the rules you practice under.

Our methodology

We designed the study to be something a bar association or a BigLaw IT committee could read and critique. The full methodology is published at /resources/whitepapers/hallucination-methodology. Short version:

Query set. 800 queries, stratified across 22 practice areas and three task types: research (find me the rule, find me the case), drafting (write the argument, draft the clause), and playbook execution (apply the firm's position to a new set of facts). Queries were drawn from de-identified real attorney work and supplemented with synthetic edge cases that prior research has flagged as hallucination-prone.
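As a rough illustration of what that stratification implies for coverage, assuming an even split across strata (the published methodology governs the real allocation):

```python
# Rough coverage arithmetic: 800 queries over 22 practice areas x 3 task types.
# The even split is an assumption for illustration; the published methodology
# governs the actual allocation.
practice_areas = 22
task_types = ("research", "drafting", "playbook execution")

strata = practice_areas * len(task_types)    # 66 strata
per_stratum, leftover = divmod(800, strata)  # ~12 queries per stratum
print(f"{strata} strata, about {per_stratum} queries each, {leftover} left over")
```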

Review. Each answer was individually reviewed by a licensed attorney familiar with the practice area. Reviewers worked from a written rubric. Every cited case, statute, rule, and regulation was opened and checked. Every quoted passage was matched against the source. Every summarized holding was evaluated against the underlying text.

Scoring. Binary. An answer either contained at least one hallucination as defined above, or it did not. We did not grade on a curve. A single invented citation in a 600-word research memo made the whole answer a failure.
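A minimal sketch of that binary rule, with hypothetical field names standing in for the written rubric:

```python
# Minimal sketch of the binary scoring rule. The field names are hypothetical
# stand-ins for the written rubric; the real review was done by attorneys.
from dataclasses import dataclass

@dataclass
class ReviewedClaim:
    citation_exists: bool        # failure mode 1: fabricated authority
    proposition_supported: bool  # failure mode 2: misattributed proposition
    content_accurate: bool       # failure mode 3: misstated holding or quote

def answer_passes(claims: list[ReviewedClaim]) -> bool:
    """One bad claim of any kind fails the whole answer; no partial credit."""
    return all(
        c.citation_exists and c.proposition_supported and c.content_accurate
        for c in claims
    )

example = [
    ReviewedClaim(True, True, True),
    ReviewedClaim(True, False, True),  # real case, invented proposition
]
print(answer_passes(example))  # False: the whole answer is scored as a failure
```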

Result. Zero hallucinations observed across all 800 answers.

Statistical framing: the rule of three

Zero observed events does not mean zero true rate. An honest vendor has to state an upper bound, not a point estimate of zero.

The standard tool here is the rule of three, formalized by Hanley and Lippman-Hand in 1983. If you run n independent trials and observe zero positive events, the 95% upper confidence bound on the true event rate is approximately 3/n.

With n = 800 and zero hallucinations observed, the 95% upper bound is 3/800, or 0.375%. That ceiling is what stands behind the headline figure: a conservative 95% confidence bound on the rate we were able to measure. Not an average. Not a best case.
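The arithmetic is short enough to check yourself. This sketch computes both the rule-of-three approximation and the exact zero-event bound:

```python
# 95% upper confidence bound on the true rate when zero events are seen in n trials.

def rule_of_three(n: int) -> float:
    """Approximate 95% upper bound: 3 / n."""
    return 3 / n

def exact_zero_event_bound(n: int, alpha: float = 0.05) -> float:
    """Exact bound: the largest p with (1 - p)**n >= alpha, i.e. 1 - alpha**(1/n)."""
    return 1 - alpha ** (1 / n)

n = 800
print(f"rule of three:  {rule_of_three(n):.4%}")           # 0.3750%
print(f"exact binomial: {exact_zero_event_bound(n):.4%}")  # ~0.3738%
```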

If a competitor publishes a rate, ask what statistical framing they used. A point estimate of "2.1%" from a 50-query test tells you almost nothing, because the confidence interval is very wide. The sample size and the interval method are the number.
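To make "very wide" concrete, here is the exact interval for a hypothetical 1 flagged answer out of 50 queries, roughly that 2.1% scenario, using scipy's beta quantiles for the Clopper-Pearson bounds:

```python
# How much does a small-sample point estimate actually pin down?
# Assumes "2.1%" corresponds to roughly 1 flagged answer out of 50 queries;
# scipy's beta quantiles give the exact Clopper-Pearson interval.
from scipy.stats import beta

def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact binomial confidence interval for k events in n trials."""
    lower = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lower, upper

lower, upper = clopper_pearson(1, 50)
print(f"point estimate 2.0%, 95% CI roughly {lower:.2%} to {upper:.2%}")
# The interval runs from well under 0.1% to over 10%; the point estimate alone
# says very little about the true rate.
```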

What the number does not claim

We want to be specific about the limits, because over-claiming in this category is the fastest way to lose trust.

"Under 0.3%" does not mean zero hallucinations forever. It means that under the test conditions described, the 95% upper bound on the rate was 0.375%. New queries, new practice areas, and new jurisdictions can produce new failures.

It does not replace attorney judgment. It does not mean you stop reading the citations. It means the expected frequency of a bad citation is low enough that a lawyer can use the tool as an accelerator instead of treating every output as suspect.

It does not generalize to every vendor's pipeline. Our number applies to Aewita running Aewita's model on Aewita's retrieval stack with Aewita's citation verifier. Swap any of those three, and you are measuring a different system.

How our architecture gets to the number

Three independent checks, each one able to catch failures the others miss.

Self-hosted frontier reasoning model. We run our own model, not an API call. That matters for accuracy because we control the training data, the fine-tuning for legal tasks, and the inference-time reasoning budget. A public general-purpose model is optimized for broad helpfulness. Ours is optimized for not asserting legal facts it cannot ground.

Retrieval-grounded generation. Every claim the model makes has to originate from a retrieved source in your firm's library, your DMS, or a trusted public corpus. The model does not cite from parametric memory. If the source is not retrieved, the claim does not get made.

Citation verification. A separate verifier, running after generation, checks every citation in the draft against the retrieved source. Is the case real? Is the cited proposition actually supported by the text of that case? Is the quoted passage literal? Any claim that fails verification is flagged or stripped before the output reaches you.
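For readers who want the shape of that last check, here is an illustrative toy, not our verifier. Every name and the in-memory source store are invented, and the harder proposition-support check is left out because it needs a model:

```python
# Illustrative toy only, not Aewita's verifier. The names and the in-memory
# "retrieved_sources" store are invented; checking whether a proposition is
# actually supported needs a model and is out of scope for this sketch.
from dataclasses import dataclass

@dataclass
class Claim:
    citation: str      # e.g. "Smith v. Jones, 123 F.3d 456" (made-up citation)
    quoted_text: str   # passage the draft attributes to that source

retrieved_sources = {
    "Smith v. Jones, 123 F.3d 456": "... the duty extends to foreseeable users ...",
}

def verify(claim: Claim) -> str:
    source = retrieved_sources.get(claim.citation)
    if source is None:
        return "strip: citation was never retrieved, so the claim cannot stand"
    if claim.quoted_text and claim.quoted_text not in source:
        return "flag: quoted passage is not literal in the retrieved source"
    return "pass"

print(verify(Claim("Smith v. Jones, 123 F.3d 456", "duty extends to foreseeable users")))
print(verify(Claim("Doe v. Roe, 999 U.S. 1", "any language at all")))
```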

These three checks are patented. The full architecture is described on /security and /product/research.

What a firm should ask any AI vendor

If you are evaluating legal AI, three questions separate real accuracy work from marketing. Ask them. Read the answers closely.

  1. What is your audited hallucination rate, with sample size and confidence interval? A point estimate is not an answer. "Less than 5%" is not an answer. If the vendor cannot tell you n and the interval method, they have not measured it.
  2. What counts as a hallucination in your measurement? Fabricated citations only? Or also misattributed propositions and misstated holdings? The answer tells you whether the reported rate is the real rate or a narrower one.
  3. Who does the verification, and how is the review corpus held out? Attorney review of every answer is expensive. Self-grading by the same model that generated the answer is worthless. A serious methodology separates generation, review, and auditing.

If the vendor cannot answer those three, you do not have a measured system. You have a pitch. In a practice area where Rule 1.1 applies and sanctions are real, that is not enough.

Under 0.3% is our number. It is published. The methodology is public. The next step is whether it holds up in your practice, on your work, with your documents. That is what the 14-day trial is for.

See the number on your own work

14 days free. Real access to the real product.