Verifiable inference · OGONG Labs

OGONG rests on a single asymmetry: verifying an answer is far cheaper than producing it. Denote by rho the ratio of verification cost to generation cost. On datacenter GPUs rho is approximately 1%, a hundredfold reduction; even on Apple Silicon, the weakest backend we target, it remains near 5%.

That number is the whole game. If verifying is cheap, the network can check nearly every answer, and once almost everything is checked you no longer need providers to post a large slashable deposit to stay honest. Cheap full-coverage checking replaces the bond.

Once almost everything is checked, the cash bond is no longer needed.The whole game

Verify by re-prefill, not re-generation

Prior commitment schemes verify by re-running or re-generating the answer. Our contribution is to show you can verify with a single teacher-forced prefill of the prompt plus the claimed output, with no autoregressive decoding, and that this is sound, not merely cheaper. Generation is one slow step per token; a teacher-forced prefill does the whole sequence in one batched pass. That is where the 100x comes from.

We validated the substitution empirically: a teacher-forced re-prefill reproduces generation-time hidden states to relative-L2 of about 0.005 and log-probabilities to total variation of about 0.005, with the argmax token matching at every position. A purpose-built score mode engine path emits per-token logprobs and hidden states from one prefill without generating.

The commitment

A provider generates in fixed windows of 32 tokens and emits a small leaf per window. The leaves form a Merkle tree whose root is the commit_root. Each leaf binds two complementary fingerprints of how the output was produced:

Hidden-state sketch. A commitment over the model's per-token last hidden states, captured as a sign-random-projection (SRP) sketch: a fixed, public bank of random plus-or-minus-one directions, identical for provider and verifier. Every direction mixes all coordinates, so the checked subspace is not the provider's to choose. This departs from the magnitude-top-k sketch of prior work, where a provider picks the checked coordinates and a substitute can hide in the rest. The fixed projection closes that hiding attack and makes the comparison well-conditioned.
Logprob digest. The top-k log-probabilities at every decode position in the window. Committing all positions is strictly stronger than sampling a few.

The provider signs a per-reply record binding request, response, and model identity, with a hybrid post-quantum signature (Ed25519 and ML-DSA-44), pushed to a validator at end-of-stream so the commitment is anchored even if the provider goes offline.

record = ( reply_id, req_hash, resp_hash, model_root, commit_root, n_tokens, t0, t1 )
sig    = Ed25519(record)  ||  ML-DSA-44(record)

The audit: the Golden Eyes

A provider that serves a cheaper model in place of the one it promised is wearing a disguise. The network's auditors, the Golden Eyes (named for the fiery gaze that sees through any transformation), catch it. A validator decides whether to audit a reply using a coverage rate drawn from a threshold-BLS randomness beacon, a drand-style construction where no coalition can predict or steer the draw and no validator can move it by withholding. Because the draw is unpredictable and an audit can run any time in the reply's window, a provider cannot tell which replies are checked, so it cannot serve the real model only when watched.

Verifier selection is model-aware: the verifier is drawn from other providers already serving the same model_root, since only a peer running the same model can teacher-force it. The verifier fetches the request and claimed output, runs one teacher-forced prefill, recomputes both fingerprints, and returns Accept or Reject.

Check	Rejects if	Catches
Hidden-state (SRP)	relative-L2 > 0.10	wrong or smaller model, distillation mimics, coarse quantization
Logprob	KS sup-norm > 0.10	near-lossless quant fraud (Q8 to Q4_K_M)

An honest re-run scores essentially zero on both; a substituted model scores about 1.1 on the hidden-state check, an order-of-magnitude margin that is decisive on a single reply. Verifiers are paid a flat fee per audit regardless of verdict, so they are neutral, and validators slip in honeypot audits carrying a known-bad output; a verifier that rubber-stamps one is itself slashed.

Every modality: diffusion, audio, video

This is the part the cheap text checks could not reach. Commitment methods like TOPLOC and logprob spot-checks work only for autoregressive text, because they need a per-token distribution. Diffusion and flow models produce no such thing; they denoise a continuous latent over N steps. Prior image-inference checks lean on heavy zero-knowledge proofs, generic optimistic re-execution, or output fingerprinting; our move is a lightweight, diffusion-specific trajectory commitment: a Merkle root over (step, latent digest) at sampled steps plus the final latent. A verifier re-runs one reference denoising step and checks that its predicted next latent matches the committed one within a tolerance band, at cost rho of about 1/N.

We implemented and measured it on three independent engines, a 3.5B diffusion-transformer audio model, a Euler latent-diffusion image model, and video. An honest re-run reproduces each sampled step to relative-L2 of zero; a 5% conditioning perturbation diverges to 0.27. The cheap-check result carries from text to every modality.

Vision-language models A generated token's hidden state attends over the input image, so the existing hidden-state commitment already binds correct image processing for free: a different image moves the first token's hidden state by about 0.30.

Zero bond: honesty without capital at risk

The usual choice between "post a large slashable bond" and "subjective scoring with no bond" is an artifact of verification cost. We prove that under full-coverage cheap verification a correctness bond is redundant: simply forfeiting the cheated request's escrowed fee already makes honesty the best response, because market viability forces the gain from cheating below the fee. The result is machine-checked, in Z3 and Lean 4, including the Ville maximal inequality behind the false-ejection bound, with PRISM-games reproducing the honesty boundary.

Sybil resistance is reassigned from staked capital to a proof-of-distinct-GPU throughput challenge that bounds an operator's identity fraction by its share of physical throughput, and that challenge doubles as verification duty, so the anti-Sybil cost is not burned, it is the audit. Individual verdicts feed a sequential probability ratio test, so honest cross-hardware noise will not eject a provider while persistent cheating crosses the threshold in a number of audits that grows only logarithmically.

Bigger than one GPU: verified split inference

With the bond gone, the network can do something a bonded design cannot. A model too large for any single GPU is sharded across a cohort of independent commodity machines, each running a contiguous range of layers. The activation tensor passes from one segment to the next, and at every boundary the provider signs the activations it consumed and produced, so the commitments chain: the output of one segment is the input of the next, and a forged boundary cannot pass. The same cheap re-check verifies each segment on its own, a bad shard is ejected without disturbing the rest, and settlement pays each provider for exactly the layers it served under a conservation invariant. A bonded design cannot reach casual scale here, because its capital barrier multiplies by every shard. Built and measured end to end: a two-shard cohort reproduces the monolithic model to a relative L2 of about 1e-5, and a machine-checked corollary (Z3, Lean 4) proves sharding cannot weaken the deterrence. See Verified split inference.

The honest caveat

We state the open gate plainly. On the Verified tier the statistical soundness has a measured, not proven margin: the honest cross-hardware drift tail (different GPUs, flash-attention, accumulation precision) has not been shown to sit clear of near-lossless quant fraud, specifically Q8 billed as f16, where the activation signal (about 0.016) is the same magnitude as the teacher-forcing reproduction noise floor (about 0.007 to 0.015). For that one band the cheap statistical check alone may not separate, and the network falls back to the Confidential (TEE) tier plus weight declaration and pricing. The diffusion verification and the zero-bond mechanism are the cleanest novelties; the verified-tier numbers are promising but unsettled, and we say so.

Next: Trust tiers → · Full protocol docs ↗