CommitLLM: How to Verify an LLM Inference
You send a prompt to an LLM API. The provider says it ran Llama 70B. Maybe it did. Maybe it served a smaller model to save money, changed the quantization, altered the decode settings, or patched the answer after generation. Today you usually cannot tell. You get text back, an invoice, and a promise.
For casual use, a promise is often enough. For enterprise procurement, regulated systems, benchmark evaluation, or agent workflows making consequential decisions, it is not. If the model behind the answer matters, “trust us” is not a satisfying interface.
Today you get two unsatisfying options. Statistical fingerprinting can give you evidence, but not exact per-response verification. Zero-knowledge proofs can give you much stronger guarantees, but the prover cost is still too high for production serving. You either get weak signals or strong proofs you cannot afford.
We built CommitLLM to sit in that gap. The provider keeps the normal GPU serving path. No proving circuit. No per-response proof generation. The model answers normally, returns a compact receipt, and only opens internal trace data if the client asks for an audit.
#How one audited response works
At a high level, the protocol is simple:
- The provider commits to the deployment surface: model weights, quantization, tokenizer, chat template, decode policy, and post-processing.
- The client sends a prompt and gets back both the model output and a receipt binding that response to the committed deployment.
- Most of the time, that is enough. The normal path stays fast.
- If the response matters, the client challenges specific tokens and internal states.
- The provider opens the requested trace data, and the verifier checks it on CPU against the committed model and configuration.
The point is not to prove every inference up front. The point is to make cheating risky and cheap to detect without forcing the provider to run a cryptographic proving farm.
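The steps above can be sketched end to end. Everything in this toy is illustrative, not the actual CommitLLM wire format: the field names, the hash-based commitment, and the flat trace are assumptions, and a real system would use Merkle openings so that only challenged positions are revealed rather than the whole trace.

```python
import hashlib, json

def h(obj) -> str:
    """Hash a canonical JSON serialization (toy commitment)."""
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()

# 1. Provider commits to the deployment surface before serving.
deployment_commitment = h({"weights_root": "placeholder-merkle-root",
                           "quant": "W8A8",
                           "decode": {"temperature": 0.7, "top_p": 0.9}})

# 2. Each response carries a receipt binding the output to that commitment.
def respond(prompt: str):
    output = "placeholder model output"
    # Per-token trace data, kept private by the provider until challenged.
    trace = {f"tok{i}": h({"pos": i, "state": f"state-{i}"}) for i in range(4)}
    receipt = {"commitment": deployment_commitment,
               "trace_root": h(trace),
               "binding": h({"prompt": prompt, "output": output})}
    return output, receipt, trace

# 3-4. On challenge, the provider opens trace data for the requested
#      positions, and the verifier checks it against the receipt.
def open_and_verify(receipt, opened_trace, positions) -> bool:
    if h(opened_trace) != receipt["trace_root"]:
        return False                       # opened data must match the receipt
    return all(f"tok{p}" in opened_trace for p in positions)

out, receipt, trace = respond("Which model is this?")
assert open_and_verify(receipt, trace, positions=[1, 3])
```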
#Why this is cheap enough to matter
The practical insight is that transformers spend most of their time doing matrix multiplication. If you can check those multiplies cheaply, the rest becomes manageable.
The trick that makes this work is old. Freivalds published it in 1977. It gives you a way to test whether a matrix multiplication was done correctly without fully recomputing it.
Suppose the provider claims it computed z = W @ x for some public weight matrix W. Recomputing W @ x directly is expensive. But if the verifier holds a secret random vector r and has precomputed v = r^T @ W, then checking whether v · x == r · z costs only two dot products. If the provider used the wrong weights or produced the wrong output, the check fails with overwhelming probability over the choice of r.
That covers the expensive shell of the transformer: W_q, W_k, W_v, W_o, W_gate, W_up, W_down, and LM_head. The remaining operations, RMSNorm, RoPE, SiLU, and the quantization bridges, are small enough to replay exactly.
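Here is a minimal sketch of the Freivalds check over integer matmuls, in the spirit of a W8A8 deployment where weights and activations are quantized. The dimensions, the entry ranges, and the way the secret vector is drawn are all illustrative assumptions, not the CommitLLM implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in = 64, 128

# Public committed weights and a claimed computation z = W @ x,
# with int8-range entries as in a W8A8 deployment.
W = rng.integers(-127, 128, size=(d_out, d_in), dtype=np.int64)
x = rng.integers(-127, 128, size=d_in, dtype=np.int64)
z = W @ x                                  # honest provider output

# One-time verifier precomputation: fold a secret random vector into W.
# (Magnitudes chosen so all dot products stay well inside int64 range.)
r = rng.integers(1, 2**31, size=d_out, dtype=np.int64)   # secret
v = r @ W                                  # v = r^T W

def freivalds_ok(x, z) -> bool:
    # Per-check cost is two dot products, not a full matmul:
    # r^T (W x) must equal (r^T W) x.
    return int(v @ x) == int(r @ z)

assert freivalds_ok(x, z)                  # honest output passes

z_cheat = z.copy()
z_cheat[0] += 1                            # tamper with a single entry
assert not freivalds_ok(x, z_cheat)        # r[0] >= 1, so this must fail
```

In integer arithmetic the associativity r^T (W x) = (r^T W) x holds exactly, so any tampered entry shifts r · z by a nonzero multiple of the corresponding secret coordinate and the check fails deterministically here (and with overwhelming probability in general).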
#What the receipt binds
The receipt does not just bind “some model ran.” It binds the full surface that changes what comes out:
- Model identity: a Merkle root over the checkpoint
- Quantization scheme and configuration
- Tokenizer, chat template, preprocessing
- Decode policy: temperature, top-k, top-p, penalties, stop conditions
- Output post-processing
Change any of these and the receipt changes. The provider commits before learning which tokens and layers the verifier will challenge.
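One simple way to realize "change any of these and the receipt changes" is to hash a canonical serialization of the whole surface. The sketch below assumes that construction; the field names and values are placeholders, not the real CommitLLM commitment format.

```python
import hashlib, json

def commit(surface: dict) -> str:
    """Commit to the full deployment surface via a canonical hash."""
    blob = json.dumps(surface, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

surface = {
    "weights_merkle_root": "placeholder-root",      # Merkle root over checkpoint
    "quantization": {"scheme": "W8A8"},
    "tokenizer": "placeholder-tokenizer",
    "chat_template": "placeholder-template",
    "decode_policy": {"temperature": 0.7, "top_p": 0.9, "top_k": 50},
    "post_processing": ["strip_stop_tokens"],
}

c1 = commit(surface)
# Silently lowering temperature (or any other bound field) yields a
# different commitment, so the receipt no longer matches.
c2 = commit({**surface,
             "decode_policy": {**surface["decode_policy"], "temperature": 0.2}})
assert c1 != c2
```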
#Where the guarantees are exact and where they are not
We are honest about what the protocol can and cannot do.
Exact. Shell matmuls (via Freivalds), quantization bridges, embedding lookup, the final-token tail from a captured boundary state, LM-head binding, logits, decode replay, and output-policy replay. Each is checked by algebraic verification or canonical recomputation: if it is wrong, the check fails.
Approximate. The attention interior. Native FP16/BF16 attention is not bit-reproducible across GPUs. We constrain attention from both sides (shell-verified Q/K/V going in, committed post-attention output coming out) but we do not pretend it is exact.
Statistical. Prefix/KV provenance in routine audit mode. The commitment binding is exact, but unopened positions are covered by challenge sampling. Deep audit upgrades this to exact.
Fail-closed. Anything the verifier does not know how to replay is rejected. No silent best-effort fallbacks.
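The fail-closed rule can be made concrete as a dispatch with no default branch. The handlers below are stand-ins for the real checks; the operation names and the exception are illustrative assumptions.

```python
class AuditRejected(Exception):
    """Raised when the verifier cannot replay a claimed operation."""

VERIFIERS = {
    "shell_matmul": lambda step: True,   # stand-in for the Freivalds check
    "rmsnorm":      lambda step: True,   # stand-in for exact recomputation
    "decode":       lambda step: True,   # stand-in for decode-policy replay
}

def verify_step(step: dict) -> bool:
    handler = VERIFIERS.get(step["op"])
    if handler is None:
        # Fail closed: unknown operations are rejected outright,
        # never approximated on a best-effort basis.
        raise AuditRejected(f"no verifier for op {step['op']!r}")
    return handler(step)

assert verify_step({"op": "rmsnorm"})
try:
    verify_step({"op": "custom_fused_kernel"})   # nothing knows how to replay this
except AuditRejected:
    pass
```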
#Numbers
The prototype adds roughly 12-14% overhead during generation. That is the first important number because it means the normal serving path still looks like normal serving. You are not replacing inference with a proof system. You are adding auditability to it.
For Llama 70B, routine audit costs about 1.3 ms per challenged token, while a full single-token audit costs about 10 ms. Verification runs on CPU. No client-side GPU is required. That is the second important number: the verifier can be lightweight even when the model is not.
On the corrected replay path for Qwen2.5-7B-W8A8 and Llama-3.1-8B-W8A8, the attention mismatch beyond 1k tokens is narrow: worst-case L_inf of 8 and 9, with more than 99.8% of elements staying within one quantization bucket. In plain English, the only part we do not claim as exact is already confined to a tight corridor.
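A verifier can enforce that corridor directly: accept an opened attention output only if its deviation from the verifier's own replay stays inside an L_inf bound and almost all elements stay within one quantization step. The thresholds and shapes below are assumptions for demonstration, not the protocol's actual acceptance criteria.

```python
import numpy as np

def within_corridor(opened, replayed, scale,
                    max_linf_steps=16.0, min_frac_one_step=0.998):
    """Accept iff the mismatch, measured in quantization steps, is small."""
    steps = np.abs(opened - replayed) / scale      # error in quant steps
    return (steps.max() <= max_linf_steps and
            np.mean(steps <= 1.0) >= min_frac_one_step)

rng = np.random.default_rng(0)
replayed = rng.standard_normal(4096)               # verifier's own replay
scale = 0.05                                       # assumed quant step size

# Sub-bucket GPU drift (here: at most half a quantization step) passes.
drift = rng.integers(0, 2, size=4096) * (0.5 * scale)
assert within_corridor(replayed + drift, replayed, scale)

# A uniformly shifted output (30 steps everywhere) is rejected.
assert not within_corridor(replayed + 30 * scale, replayed, scale)
```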
#Why not ZK
ZK proofs give you a transferable proof object that anyone can verify. That is a stronger property than what CommitLLM provides. The cost is that ZK prover overhead is still orders of magnitude too high for production LLM serving.
CommitLLM makes a different bet: interactive audit, client-held verifier key, small normal-path overhead. A fully disclosed audit transcript can be re-checked by third parties, but the receipt itself is not a succinct proof. For enterprise, regulated deployments, and decentralized compute, the interactive model is enough and the economics work today.
#Open work
We need more model families beyond Qwen and Llama, tighter analysis of adversarial freedom after the attention corridor, stronger KV provenance, and Lean formalization of the core protocol properties.
LLM infrastructure has made a strange peace with unverifiability. A provider puts a model name on a dashboard and the customer accepts it because there is no practical alternative. I do not think that equilibrium lasts. If model provenance matters, the interface should not be a logo and a promise. It should be a receipt and the ability to audit it.