Trajectory Commitments

Cheap verifiable inference for diffusion and codec models.

Abstract

A provider paid to run a 3.5-billion-parameter diffusion model has an obvious move: run a smaller model, or fewer denoising steps, and keep the difference. The consumer receives a plausible waveform or image and cannot tell the difference. The same incentive motivates verifiable inference for language models, but the language-model defenses do not transfer, for a structural reason: they commit a per-token output distribution and spot-check it, and a diffusion model produces no such distribution. Instead, it integrates an ODE or SDE from noise over N steps, and the consumer sees only the final latent. We give a verification primitive for this setting and measure it. The provider commits a trajectory: a Merkle root over (step index, latent digest) pairs at sampled steps, plus the final latent. A verifier re-runs one reference denoising step from a committed latent and checks that its prediction matches the next committed latent within a tolerance, at cost rho of about 1/N of the generation. On an honest re-run the step reproduces exactly (relative-L2 = 0); a substituted computation diverges far outside any reasonable tolerance. We implemented this on three independent engines, one per modality: a 3.5B diffusion-transformer audio (flow) model, a 1.5B Euler latent-diffusion image model, and a latent video-diffusion model. The primitive holds on all three. We are precise about the one thing it does not give for free: the accept tolerance is a measured quantity set from the honest cross-hardware reproduction tail, not a proven constant, and we say exactly where that leaves the guarantee.
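To make the commitment concrete, here is a minimal sketch of a Merkle root over (step index, latent digest) leaves, with an inclusion proof for a single step. The details are assumptions for illustration, not taken from the paper: SHA-256 as the hash, an 8-byte big-endian step index, the 0x00/0x01 domain-separation prefixes, duplication of the last node on odd levels, and the function names themselves.

```python
from __future__ import annotations

import hashlib
import struct


def leaf_hash(step: int, latent: bytes) -> bytes:
    """Hash one (step index, latent digest) pair into a Merkle leaf."""
    latent_digest = hashlib.sha256(latent).digest()
    # Assumed encoding: 0x00 prefix, 8-byte big-endian step, then the digest.
    return hashlib.sha256(b"\x00" + struct.pack(">Q", step) + latent_digest).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    """Hash two child nodes into their parent (0x01 prefix, assumed)."""
    return hashlib.sha256(b"\x01" + left + right).digest()


def merkle_root(leaves: list[bytes]) -> bytes:
    """Fold the leaves pairwise up to a single 32-byte root."""
    if not leaves:
        raise ValueError("cannot commit to an empty trajectory")
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [node_hash(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def merkle_proof(leaves: list[bytes], index: int) -> list[tuple[bytes, bool]]:
    """Return the sibling path for leaves[index] as (sibling, sibling_is_left)."""
    proof = []
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        level = [node_hash(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return proof


def verify_proof(leaf: bytes, proof: list[tuple[bytes, bool]], root: bytes) -> bool:
    """Recompute the root from one leaf and its sibling path."""
    current = leaf
    for sibling, sibling_is_left in proof:
        current = node_hash(sibling, current) if sibling_is_left else node_hash(current, sibling)
    return current == root


# Toy trajectory: serialized latents at three sampled steps, the last being final.
sampled = {0: b"latent-at-step-0", 10: b"latent-at-step-10", 49: b"final-latent"}
steps = sorted(sampled)
leaves = [leaf_hash(s, sampled[s]) for s in steps]
root = merkle_root(leaves)          # this is what the provider publishes
proof_for_step_10 = merkle_proof(leaves, steps.index(10))
```

In this sketch the provider publishes only `root`. When the verifier challenges a step, the provider reveals that latent together with its sibling path, and `verify_proof` confirms the revealed latent is the one that was committed before the challenge was known.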

Contributions

A verification primitive for diffusion and codec models, where per-token methods structurally cannot apply: the trajectory commitment.
A single-step re-check at cost rho of about 1/N; an honest re-run reproduces the step exactly (relative-L2 = 0) while a substitution diverges far outside tolerance.
Implemented and measured on three independent engines: a 3.5B diffusion-transformer audio model, a 1.5B Euler latent-diffusion image model, and a latent video-diffusion model.
An honest account of the limit: the accept tolerance is measured from the cross-hardware reproduction tail, not proven.
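The single-step re-check in the second contribution can be sketched as follows. This is a toy under stated assumptions, not the paper's implementation: latents are plain lists of floats, the reference step is an explicit Euler update, `reference_velocity` and `cheap_velocity` are hypothetical stand-ins for the paid-for model and a substituted one, and `TOLERANCE` is a placeholder, since the paper sets the real value from the measured cross-hardware reproduction tail.

```python
import math
from typing import Callable, List, Sequence

Velocity = Callable[[Sequence[float], float], List[float]]


def euler_step(x: Sequence[float], t: float, dt: float, velocity: Velocity) -> List[float]:
    """One explicit Euler denoising step: x_next = x + dt * v(x, t)."""
    v = velocity(x, t)
    return [xi + dt * vi for xi, vi in zip(x, v)]


def relative_l2(predicted: Sequence[float], committed: Sequence[float]) -> float:
    """||predicted - committed||_2 / ||committed||_2."""
    diff = math.sqrt(sum((p - c) ** 2 for p, c in zip(predicted, committed)))
    norm = math.sqrt(sum(c * c for c in committed))
    return diff / norm if norm > 0 else diff


def check_step(x_k: Sequence[float], x_next: Sequence[float], t: float, dt: float,
               reference: Velocity, tolerance: float) -> bool:
    """Re-run one reference step from x_k and compare with the committed x_next."""
    predicted = euler_step(x_k, t, dt, reference)
    return relative_l2(predicted, x_next) <= tolerance


def reference_velocity(x: Sequence[float], t: float) -> List[float]:
    # Hypothetical stand-in for the model the consumer paid for.
    return [-xi * (1.0 - t) for xi in x]


def cheap_velocity(x: Sequence[float], t: float) -> List[float]:
    # Hypothetical stand-in for a substituted, cheaper computation.
    return [-0.5 * xi for xi in x]


N = 50                 # denoising steps in the full generation (toy value)
DT = 1.0 / N
TOLERANCE = 1e-4       # placeholder; the real tolerance is a measured quantity

x_k = [0.3, -1.2, 0.8, 2.0]   # committed latent at the challenged step
t_k = 0.2

honest_next = euler_step(x_k, t_k, DT, reference_velocity)
cheated_next = euler_step(x_k, t_k, DT, cheap_velocity)

honest_error = relative_l2(euler_step(x_k, t_k, DT, reference_velocity), honest_next)
cheated_error = relative_l2(euler_step(x_k, t_k, DT, reference_velocity), cheated_next)

honest_ok = check_step(x_k, honest_next, t_k, DT, reference_velocity, TOLERANCE)
cheated_ok = check_step(x_k, cheated_next, t_k, DT, reference_velocity, TOLERANCE)
```

The verifier runs one step out of N, which is where the cost of about 1/N comes from. In this toy the honest error is exactly zero only because the re-run executes the same arithmetic on the same machine; across different hardware the honest error forms a small tail, which is why the tolerance has to be measured.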
