Cheap verification beyond text. Diffusion and codec models expose no per-token distribution to spot-check, so the defenses built for language models do not transfer. We built a check that does not need one.
A provider paid to run a large diffusion model has an obvious incentive to defect: substitute a smaller model, or truncate the denoising schedule, and keep the savings. The consumer sees a plausible image or waveform and cannot detect the substitution.
A diffusion model integrates from noise over N steps, and the only thing the consumer sees is the final latent. So the provider commits to the trajectory: a Merkle root over (step index, latent digest) pairs at sampled steps, plus the final latent. A verifier re-runs one reference denoising step from a committed latent and checks that its prediction matches the next committed latent within a tolerance. Re-running one step out of N costs about 1/N of the generation.
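The commitment described above can be sketched in a few lines. This is a minimal illustration, not the implementation used here: the float32 serialization, the SHA-256 hash, the leaf and node prefixes, and every function name (`latent_digest`, `leaf_hash`, `merkle_proof`, and so on) are assumptions made for the example. Duplicating the last node on odd levels is a simplification that a production scheme would likely replace.

```python
import hashlib
import struct

def latent_digest(latent):
    """SHA-256 over the latent packed as little-endian float32 (assumed serialization)."""
    return hashlib.sha256(struct.pack(f"<{len(latent)}f", *latent)).digest()

def leaf_hash(step, digest):
    # Leaf binds the step index to the latent digest; 0x00 marks a leaf.
    return hashlib.sha256(b"\x00" + struct.pack("<I", step) + digest).digest()

def node_hash(left, right):
    # 0x01 marks an interior node so a leaf cannot be passed off as a node.
    return hashlib.sha256(b"\x01" + left + right).digest()

def build_levels(leaves):
    """Return every tree level, leaves first; odd levels duplicate their last node."""
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        if len(cur) % 2:
            cur = cur + [cur[-1]]
        levels.append([node_hash(cur[i], cur[i + 1]) for i in range(0, len(cur), 2)])
    return levels

def merkle_root(leaves):
    return build_levels(leaves)[-1][0]

def merkle_proof(leaves, index):
    """Sibling hashes from leaf to root, each tagged with whether the sibling sits on the right."""
    proof = []
    for level in build_levels(leaves)[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        sibling = index ^ 1
        proof.append((level[sibling], sibling > index))
        index //= 2
    return proof

def verify_proof(root, leaf, proof):
    """Recompute the root from one leaf and its sibling path."""
    acc = leaf
    for sibling, sibling_is_right in proof:
        acc = node_hash(acc, sibling) if sibling_is_right else node_hash(sibling, acc)
    return acc == root

# Provider side: commit to latents at sampled steps (toy two-element latents).
trajectory = {0: [0.5, -1.0], 10: [0.4, -0.8], 20: [0.1, -0.2]}
steps = sorted(trajectory)
leaves = [leaf_hash(s, latent_digest(trajectory[s])) for s in steps]
root = merkle_root(leaves)

# Verifier side: the provider opens step 10, and the verifier checks it against the root.
opened = leaf_hash(10, latent_digest(trajectory[10]))
assert verify_proof(root, opened, merkle_proof(leaves, 1))
```

Because the root is published before the verifier picks a step, the provider cannot swap in a different latent after the fact: any change to a committed latent changes its digest and breaks the opened path.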
We implemented it on three independent engines and modalities: a 3.5B diffusion-transformer audio model, a 1.5B Euler latent-diffusion image model, and a latent video-diffusion model. On an honest re-run the step reproduces exactly (relative L2 error of zero); a substituted computation diverges far outside any reasonable tolerance. The cheap-check result carries from text to every modality we tested.
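The spot check itself reduces to one re-run step and one relative-L2 comparison. The sketch below uses a toy Euler step in pure Python to show the shape of that check. The velocity functions stand in for the reference and substituted models, and the names `spot_check`, `reference_velocity`, and `substitute_velocity` and the tolerance value are all hypothetical, chosen only for illustration.

```python
import math

def relative_l2(predicted, committed):
    """||predicted - committed|| / ||committed||, falling back to the absolute norm at zero."""
    num = math.sqrt(sum((p - c) ** 2 for p, c in zip(predicted, committed)))
    den = math.sqrt(sum(c ** 2 for c in committed))
    return num / den if den > 0 else num

def euler_step(velocity, latent, t, dt):
    """One explicit Euler update: x_next = x + dt * v(x, t)."""
    return [x + dt * v for x, v in zip(latent, velocity(latent, t))]

def reference_velocity(latent, t):
    # Toy stand-in for the reference model's prediction (assumption).
    return [-x * (1.0 + t) for x in latent]

def substitute_velocity(latent, t):
    # Toy stand-in for a cheaper substituted model (assumption).
    return [-0.5 * x for x in latent]

def spot_check(latent_k, latent_k1, t, dt, tol):
    """Re-run one reference step from latent_k and compare with the committed next latent."""
    predicted = euler_step(reference_velocity, latent_k, t, dt)
    return relative_l2(predicted, latent_k1) <= tol

x = [1.0, -2.0, 0.5]
honest_next = euler_step(reference_velocity, x, 0.5, 0.1)
cheated_next = euler_step(substitute_velocity, x, 0.5, 0.1)

TOL = 1e-3  # illustrative tolerance, not a value from the experiments
assert spot_check(x, honest_next, 0.5, 0.1, TOL)       # honest step is accepted
assert not spot_check(x, cheated_next, 0.5, 0.1, TOL)  # substituted step is rejected
```

In this toy the honest step reproduces bit-for-bit because the same deterministic function runs twice. A nonzero tolerance leaves room for engines where the re-run is not exactly reproducible.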