Self-hosted legal AI: the architecture argument.
Why every competitor calls OpenAI or Anthropic, what that means for ABA Rule 1.6, and what it would take to build a legal AI that does not.
Abstract
Nearly every product in the current legal AI market — including the ones backed by billion-dollar valuations and the ones sold inside incumbent research platforms — is a front end on a third-party large language model served by OpenAI, Anthropic, or a similar provider. This paper describes the practical consequences of that architectural choice for ABA Model Rule 1.6 duties of confidentiality. It lays out what "self-hosted" means concretely, what it takes to build a legal AI that runs inference inside a single vendor's controlled infrastructure, and the tradeoffs a firm should weigh when it evaluates vendors. We built Aewita as a self-hosted system because we believe the Rule 1.6 analysis, done honestly, forces that architecture for U.S. attorney workflows at scale.
1. The industry's open secret: everyone is a wrapper
The legal AI market, as of April 2026, is a market of interfaces sitting on top of a very small number of underlying models. Harvey's product runs on OpenAI. Thomson Reuters' CoCounsel ships on a stack that includes OpenAI and Anthropic models under a Westlaw retrieval layer. Westlaw Precision AI and Lexis+ AI both expose third-party LLMs under branded user interfaces, with retrieval from their respective proprietary corpora. Legora is a European-origin product running on third-party frontier models. This is not a secret. It appears in vendor disclosures, procurement paperwork, and SOC 2 subprocessor lists. It is the default.
It is the default because it is a rational business decision. Training or hosting a frontier reasoning model is capital-intensive and operationally heavy. Wrapping an API ships faster, scales more easily, and lets a small team compete with a large one on the dimension of product experience. The explosion of legal AI startups between 2023 and 2026 runs on that asymmetry. Without API access, there is no industry.
But a rational business decision at the vendor level produces a structural problem at the industry level: every legal AI product becomes, architecturally, a client of the same small number of non-legal-industry infrastructure providers. The data that flows into those products flows through those providers. And the duties that govern that data are not the vendor's duties. They are the lawyer's.
2. Why wrappers happened
The economic case for a wrapper is clean. An API call to a frontier model costs the vendor some fraction of a cent to a few cents, depending on the model, the prompt length, and the context window used. Running the same inference on self-managed hardware costs substantially more per call in infrastructure amortization, and requires the vendor to take on operational complexity they would otherwise offload — model updates, GPU capacity planning, availability, security at the weights layer. For a startup competing on time-to-market, the calculation is not close.
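The arithmetic behind that calculation can be made concrete with a toy model. Every number below is an illustrative assumption for the sketch, not an actual price from any provider or an actual cost figure from Aewita:

```python
# Back-of-envelope comparison of per-answer cost: metered API wrapper vs.
# self-hosted inference. All prices and throughput figures are invented
# for illustration.

def api_cost_per_answer(prompt_tokens, output_tokens,
                        price_in_per_mtok=1.00, price_out_per_mtok=4.00):
    """Cost of one answer via a metered frontier-model API (USD)."""
    return (prompt_tokens * price_in_per_mtok
            + output_tokens * price_out_per_mtok) / 1_000_000

def self_hosted_cost_per_answer(gpu_hour_cost=4.00, answers_per_gpu_hour=40):
    """Amortized GPU cost of one answer on vendor-operated hardware (USD),
    ignoring the fixed engineering and capacity-planning overhead."""
    return gpu_hour_cost / answers_per_gpu_hour

wrapper = api_cost_per_answer(prompt_tokens=2_000, output_tokens=800)
hosted = self_hosted_cost_per_answer()
print(f"wrapper: ${wrapper:.4f}  self-hosted: ${hosted:.4f}")
```

Under these assumed numbers the wrapper answer costs about half a cent and the self-hosted answer roughly twenty times that, which is the shape of the gap the paragraph above describes; the real figures depend on model size, batching, and utilization.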
The hidden cost is in what the vendor can credibly promise about what happens to a client's data after it leaves the browser. Every wrapper ultimately bottoms out in the provider's terms of service: what the provider retains, for how long, for what purposes, under what subpoenas and what jurisdictional regimes. No amount of vendor-level contract language on top can override what the underlying provider actually does with the bytes. The wrapper vendor can promise what they, the wrapper vendor, will do. They cannot promise what the underlying LLM provider will do, because they are not the underlying LLM provider.
This is the structural privilege problem. It is not an indictment of any particular provider's practices. It is a claim about what an attorney can represent to a client about where the bytes end up.
3. ABA Rule 1.6 and "reasonable efforts"
Model Rule 1.6(c) reads: "A lawyer shall make reasonable efforts to prevent the inadvertent or unauthorized disclosure of, or unauthorized access to, information relating to the representation of a client." Comment 18 to the Rule, and the ABA's subsequent formal opinion on electronic communications, make clear that "reasonable efforts" is a contextual, multi-factor analysis. The factors the Comment lists include the sensitivity of the information, the likelihood of disclosure absent additional safeguards, the cost of additional safeguards, the difficulty of implementing them, and the extent to which the safeguards adversely affect the lawyer's ability to represent clients.
ABA Formal Opinion 477R, Securing Communication of Protected Client Information (2017), is the starting point for how this analysis gets applied to cloud services, and Formal Opinion 512 (2024) extends the same framework to generative AI tools. The core move in both is to refuse a bright-line rule and instead require a fact-specific analysis of the service, the sensitivity of the representation, and the vendor's posture on disclosure and unauthorized access. That framework is what governs a firm's use of AI vendors today.
Inside that framework, the architectural choice of an AI vendor matters. A legal AI product that transmits client-matter text to a third-party LLM provider's inference infrastructure creates a disclosure surface that did not exist before the firm adopted the tool. Whether that surface is "reasonable" under Rule 1.6 depends on the sensitivity of the matter, the terms under which the provider handles the data, the safeguards the vendor has layered on top, and whether the client has consented with informed understanding of where the bytes actually go. Those are all open questions that the firm, not the vendor, has to answer for each matter.
Self-hosted infrastructure does not eliminate the Rule 1.6 analysis. It simplifies it. If inference happens inside a single vendor's U.S.-controlled infrastructure, with no outbound API call during inference, the disclosure surface the firm has to analyze is the vendor's. One contract, one subprocessor list, one SOC 2 scope, one incident-response plan. That is a reasonable-efforts analysis a firm can actually complete.
4. What "self-hosted" means concretely for Aewita
We use the term precisely. For Aewita, self-hosted means four things.
The model weights run on infrastructure Aewita controls. The weights are deployed to machines inside our production environment. We do not stream prompts to an external inference endpoint owned by a third-party LLM vendor. The inference path is internal end-to-end.
No outbound API calls to third-party LLMs during inference. A user's query is answered by our model, on our machines, against our corpus. It is not proxied, forwarded, or sampled to a third-party model for any part of the answer generation pipeline. This includes auxiliary steps that are often offloaded to external models in wrapper products — query rewriting, re-ranking, summarization. Those happen locally.
Every byte of a client query stays inside Aewita's U.S. infrastructure. Queries are not replicated to non-U.S. regions for load balancing. Logs that contain query content are scoped and retained under the privacy policy that governs the primary service. There is no secondary data pipeline exporting queries to an analytics vendor that runs its own model on them.
Model updates are deployed, not called. When we ship a model update, it is rolled out to our inference fleet under our release process. It is not a version bump on a dependency that updates underneath us on a third party's schedule. This is why the hallucination rate we report is reproducible at a point in time against a known build.
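The second property above — no outbound API calls during inference — is a network-layer claim, which means it can be checked at the network layer. A minimal sketch of that check, with address ranges and connection records invented for illustration (this is not Aewita's actual tooling):

```python
import ipaddress

# Given connections observed while a query was being answered, flag any
# destination outside the vendor's own address space. The CIDR ranges and
# the sample connection log below are hypothetical.

INTERNAL_NETS = [ipaddress.ip_network(n)
                 for n in ("10.0.0.0/8", "172.16.0.0/12")]  # assumed ranges

def external_destinations(connections):
    """Return (ip, host) pairs that leave the vendor-controlled networks."""
    flagged = []
    for dest_ip, dest_host in connections:
        addr = ipaddress.ip_address(dest_ip)
        if not any(addr in net for net in INTERNAL_NETS):
            flagged.append((dest_ip, dest_host))
    return flagged

observed = [
    ("10.2.14.7", "inference-fleet.internal"),   # model server: internal
    ("10.5.0.21", "retrieval-index.internal"),   # corpus lookup: internal
    ("104.18.0.5", "api.example-llm.com"),       # outbound LLM call: flagged
]
print(external_destinations(observed))
```

A self-hosted inference path should produce an empty list here for every query; a wrapper cannot, because the answer itself transits the flagged connection.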
5. What it takes to actually build this
The path to a self-hosted legal AI is not mysterious. It is, however, a path that chooses against the dominant business shortcut. What we committed to, at a high level:
We benchmarked the frontier reasoning models available to us and picked the top performer on legal work. We host that model on infrastructure we operate. We indexed every U.S. case from 1665 to the present and every federal and state statute, so answers draw from primary authority rather than the model's training recollection. And every answer is independently verified against the retrieved primary sources before it reaches the user — a discrepancy means the output is revised or blocked.
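The verification step is the easiest of these to state as a contract. Aewita's actual technique is not disclosed (the next section notes it is patent-pending), so the following is a deliberately naive sketch of the contract only: an answer ships only if every quoted passage is actually found in the retrieved source it cites.

```python
import re

# Naive citation-verification gate: illustrative only, not Aewita's method.
# quotes: list of (source_id, quoted_text); sources: {source_id: full_text}.

def _normalize(text):
    """Collapse whitespace and case so trivial formatting differences
    do not defeat the match."""
    return re.sub(r"\s+", " ", text).strip().lower()

def verify_answer(quotes, sources):
    """Return (ok, failures); failures lists quotes not supported by
    their cited source. A non-empty failures list means the answer is
    revised or blocked rather than shown to the user."""
    failures = []
    for source_id, quoted in quotes:
        body = _normalize(sources.get(source_id, ""))
        if _normalize(quoted) not in body:
            failures.append((source_id, quoted))
    return (not failures, failures)

sources = {"case-1": "The court held that notice was adequate."}
ok, bad = verify_answer([("case-1", "notice was adequate")], sources)
ok2, bad2 = verify_answer([("case-1", "notice was defective")], sources)
```

Real verification has to handle paraphrase, pin cites, and holdings stated across sentences, which substring matching cannot; the sketch only shows where the gate sits in the pipeline.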
None of these choices is novel in the abstract. What is new in this market is a single vendor owning the stack end-to-end and not offloading the most capital-intensive piece — the model — to a third party. We have patents pending on the specifics of how we do this. The fact of self-hosting is the architectural claim. The techniques inside are the competitive moat, and we keep them there.
6. Economic tradeoffs
Self-hosting is more expensive per inference than calling an API. A vendor who built on API calls can push their marginal cost per answer close to zero and pocket the margin. We cannot. Inference on our own hardware costs what GPU time costs, and that is not close to zero at current prices.
We eat the difference on purpose. Aewita is priced at $99 per month per seat, or $720 per year per seat, with no seat minimums. That number does not reflect the cheapest architecture we could have built. It reflects the architecture we think is compatible with Rule 1.6 for U.S. attorneys, run at a price that is actually accessible to solo and small-firm practice — the segment that has historically been priced out of enterprise legal research tools.
The tradeoff is explicit. If an attorney's only concern is per-query marginal cost, a wrapper product can be cheaper. If the concern is the disclosure surface that accompanies cheap inference, the calculus looks different. We think the second concern is the one the ABA comments actually require the attorney to reason about. We built around that requirement, and we priced the product to make the architecture reachable rather than reserved for enterprise-only buyers.
7. What a firm's IT team should audit
A firm that is evaluating any legal AI vendor — us included — should ask a specific set of questions. These are the questions we would ask.
Is inference on infrastructure the vendor controls? Literally: when the user hits submit, does the request leave the vendor's network before the answer comes back? If the answer is "it goes to OpenAI" or "it goes to Anthropic," that is not per se disqualifying, but it is a fact the firm's Rule 1.6 analysis has to accommodate.
Can the vendor produce a SOC 2 Type II, or equivalent, that covers the inference path specifically? Some vendors have a SOC 2 that covers the application layer but not the inference layer, because the inference layer is someone else's infrastructure. Ask explicitly what is in scope.
Who are the subprocessors? Any third party that touches query content — the LLM provider, the embedding provider, the analytics vendor, the logging vendor — is a subprocessor for purposes of the firm's privilege analysis, whether or not the vendor labels them that way on a data-processing addendum.
What is the incident-response posture for a breach at the underlying LLM provider? If an OpenAI- or Anthropic-scale incident exposes queries that transited through a vendor's wrapper, what does the vendor do? When do they notify the firm? What is their forensic capacity into the provider's environment? For most wrapper vendors the honest answer is "we would wait for the provider's disclosure." A firm should know that answer in advance.
What is the data retention policy at the LLM provider layer, not just the vendor layer? The vendor's policy can say queries are not retained. The LLM provider's policy may or may not agree. For self-hosted vendors these collapse to one policy. For wrappers they are two.
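The five questions above reduce to a checklist a firm's IT team can score per vendor. The field names and the sample vendor below are invented for illustration; the point is that each "no" is another party the firm's Rule 1.6 analysis has to reach.

```python
# Illustrative audit checklist derived from the questions above.
# Keys and descriptions are this sketch's own naming, not a standard.

AUDIT_ITEMS = [
    ("inference_on_vendor_infra",
     "Inference stays on vendor-controlled infrastructure"),
    ("soc2_covers_inference",
     "SOC 2 Type II scope includes the inference path"),
    ("no_query_touching_subprocessors",
     "No third party touches query content"),
    ("breach_response_independent",
     "Incident response does not wait on an upstream provider"),
    ("single_retention_policy",
     "One retention policy governs the full query path"),
]

def disclosure_gaps(answers):
    """Return descriptions of audit items a vendor fails.
    answers: {item_key: bool}; a missing key counts as a failure."""
    return [desc for key, desc in AUDIT_ITEMS if not answers.get(key, False)]

# A typical wrapper architecture fails every item, because each one
# depends on the layer the wrapper vendor does not control.
wrapper_vendor = {key: False for key, _ in AUDIT_ITEMS}
print(disclosure_gaps(wrapper_vendor))
```

A vendor that passes all five collapses the analysis to a single contract and a single subprocessor list, which is the simplification section 3 describes.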
8. Privilege is an architecture
Privilege is not a promise. It is a property of a system. A system that transmits client-matter text to a third-party inference provider has to justify that transmission under Rule 1.6, every time, for every matter, with a client-specific analysis. A system that does not transmit is not relieved of confidentiality duties, but it sharply reduces the number of parties the analysis has to reach.
We built Aewita as a self-hosted platform because, after working through the Rule 1.6 analysis honestly, the self-hosted architecture was the one compatible with attorneys using AI at the scale their actual practice requires. That is an architectural claim, and architecture is testable. A firm's IT team can audit it. A client's counsel can audit it. We welcome both.
There is a version of the legal AI market that matures into a vendor ecosystem where the architecture question is asked and answered honestly, on every procurement, in every RFP. That is the market we want to compete in. This paper is part of asking for it.
Read the paper. Then try the product.
Fourteen days free. No credit-card surprises.