Verified split inference

The frontier no longer fits on one card. A model of several hundred billion parameters will not load on a casual provider's GPU, and renting a multi-GPU datacenter node reintroduces precisely the gatekeeper the network exists to remove. Split inference shards the model across a cohort of independent machines, each holding a contiguous range of layers, and verifies each range on its own.

How a cohort serves one model

A lead assembles the cohort, and the router matches providers, each advertising the layer-range segment it can serve. The activation tensor flows from one segment to the next, and at every boundary a provider commits and signs the activations it consumed and produced. The commitments chain: the output of segment k is the input of segment k+1, so a forged or substituted boundary cannot pass unnoticed.

Verified per segment

The same cheap re-check that covers a single-node request, at a small constant fraction rho of the generation cost, applies to each segment independently. A verifier re-executes one segment's layer range and matches its signed boundary commitments. A shard that quietly runs a smaller computation is rejected and ejected on its own, while the honest shards in the cohort are untouched. That sharding cannot weaken this deterrence is a machine-checked corollary, in Z3 and Lean 4: a cheating shard saves only its slice of the compute while forfeiting the same franchise, so a smaller shard is in fact more deterred.

Paid per slice, under a conservation invariant

Settlement divides the fee across the cohort so that each provider is paid for exactly the layers it served, the parts sum to the whole with the lead taking the rounding residual, and a shard that fails its audit forfeits its slice while the rest are paid in full. The split is itself the on-chain evidence of who served which layers.

How a single answer is verified → · Read the paper →