Run the same query three times and a model will not give you the same answer three times. Decoding is stochastic. But some claims survive the variance — they reappear, run after run, paraphrase after paraphrase — and the ones that survive are functionally more visible than the ones that flicker.
Reproducibility, then, is a property worth measuring and worth engineering. A claim reproduces reliably when it is consistent with the model’s prior, well-sourced, and repeated across the corpus the model trusts.
We score this directly: N runs, measure the rate at which a target claim recurs. It is the closest thing the generative era has to a rank-tracking metric, and it behaves — usefully — like one.
