Reproducibility as a ranking signal.

Ask GPT-5 Pro the same question three times in a row and you will get three related but non-identical answers. The factual content overlaps; the phrasing varies; which sources get cited and in what order shifts. This is not a bug. Sampling temperature, top-p truncation, and the model’s own internal stochasticity guarantee it.¹ The interesting question is not how to suppress the variance — you cannot, not at any production decoding setting practitioners actually use — but how to characterise what survives it.

The framing this note proposes: a claim that reappears across N runs and M paraphrases of the same intent is functionally more visible than a claim that appears in 2 of 30 generations. The ephemeral claim is, for any practical purpose, not in the model’s working answer for that topic. The persistent claim is. Reproducibility — the rate at which a target statement recurs across controlled re-runs — is the closest analogue the generative era has to the rank-tracking metric the practice spent twenty years building tools around.

The metric, defined precisely

Let T be a target claim — a specific atomic statement you want the model to surface. Let Q be a set of paraphrases of the underlying intent (the same question asked with different surface wordings). Let N be the number of re-runs per paraphrase. Define the reproducibility score of T under prompt distribution Q as the fraction of (paraphrase × run) pairs in which T (or a paraphrase of T) appears in the model’s response. Concretely:

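R(T | Q, N) = 1 / (|Q| · N) · Σᵢ Σⱼ 𝟙[ T ∈ M(qᵢ, j) ]

Here i runs over the |Q| paraphrases, j runs over the N re-runs, M(qᵢ, j) is the model’s response to paraphrase qᵢ on run j, and 𝟙[·] is 1 when the claim is present in that response and 0 otherwise.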

A claim with R = 0.85 is in the model’s working answer 85% of the time it should be relevant. A claim with R = 0.10 is flickering — present but unreliable, and unreliable claims are not visibility, they are noise.²
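A minimal sketch of the computation, assuming every response has already been tagged for whether T is present. The function name `reproducibility_score` and the nested-list layout (one inner list per paraphrase, one boolean per run) are illustrative assumptions, not part of the original framework.

```python
from typing import Sequence


def reproducibility_score(hits: Sequence[Sequence[bool]]) -> float:
    """Return R(T | Q, N) for a single claim.

    hits[i][j] is True when the target claim T (or a paraphrase of it)
    appeared in the response to paraphrase q_i on run j.
    """
    if not hits:
        raise ValueError("need at least one paraphrase")
    runs = len(hits[0])
    if runs == 0 or any(len(row) != runs for row in hits):
        raise ValueError("every paraphrase needs the same non-zero number of runs")
    total_pairs = len(hits) * runs            # |Q| * N
    present = sum(sum(row) for row in hits)   # sum of the indicator values
    return present / total_pairs


# Hypothetical tagging result: 4 paraphrases, 3 runs each (made-up data).
example_hits = [
    [True, True, True],
    [True, False, True],
    [True, True, True],
    [False, True, True],
]
score = reproducibility_score(example_hits)
print(f"R = {score:.2f}")
```

In this made-up example the claim is present in 10 of the 12 (paraphrase × run) pairs.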

Why some claims reproduce and others don’t

Across the 47,800-probe corpus described in the statement-level-visibility paper, per-claim reproducibility scores were not uniformly distributed, and the structure of that distribution is the actionable finding. Three properties predicted reproducibility, in roughly this order of effect size:

1. Consistency with the model’s prior. Claims that align with what the model already “knows” from training reproduce at very high rates because the retrieval system does not have to overcome the prior to surface them. Claims that contradict the prior reproduce inconsistently — sometimes the retrieved passage wins, sometimes the prior does, and the toss is approximately stochastic. This is one of the cleanest signals available for “what does the model believe about my topic” — high reproducibility against your claim means alignment; low reproducibility means the prior is pulling against you.

2. Source corroboration density. A claim that appears in many trusted sources across the corpus reproduces in generation at near-ceiling rates. A claim with a single source, even an authoritative one, reproduces inconsistently because the retrieval system has fewer chances to surface it. The fix is not to write the same claim ten times on your own site (which fails the authority check) but to ensure the same claim is corroborated by independently authored sources.

3. Retrieval-survival under the chunker. A claim that chunks well (front-loaded, qualifier-adjacent, self-contained) has more chances to be retrieved in any given run, and therefore more chances to be surfaced. This is the chunking discipline (see the field note on chunking) instrumented as a visibility lever.

A workable measurement protocol

For practical use in client engagements and ongoing visibility tracking, the protocol that has earned its keep is straightforward enough to run in a spreadsheet:

1. Define 5–10 target claims per topic, with their intended phrasings.
2. Define 8–15 paraphrases per claim: different surface wordings of the same underlying intent. Draw them from the client’s real query data, not from a keyword tool’s autocomplete.
3. Run each paraphrase against the target model 3 times, with default temperature.
4. Tag each response: claim present (verbatim, paraphrased, absorbed, contradicted) or absent.
5. Compute R per claim per model per week.
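The final aggregation step can be sketched in a few lines, assuming the tagged responses have been exported as rows. The `Observation` record, the `weekly_r` function, and the sample rows are hypothetical. Counting "contradicted" as present follows the tagging scheme above and can be changed through `present_tags`.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set, Tuple

# Tags that count as "claim present", following the tagging scheme above.
PRESENT_TAGS = {"verbatim", "paraphrased", "absorbed", "contradicted"}


class Observation(NamedTuple):
    """One tagged response (hypothetical record layout)."""
    week: str
    model: str
    claim: str
    paraphrase: str
    run: int
    tag: str  # one of PRESENT_TAGS, or "absent"


def weekly_r(
    observations: Iterable[Observation],
    present_tags: Set[str] = PRESENT_TAGS,
) -> Dict[Tuple[str, str, str], float]:
    """Return {(claim, model, week): R} from tagged observations."""
    present = defaultdict(int)
    total = defaultdict(int)
    for obs in observations:
        key = (obs.claim, obs.model, obs.week)
        total[key] += 1
        present[key] += obs.tag in present_tags  # True counts as 1
    return {key: present[key] / total[key] for key in total}


# Made-up rows: one claim, two paraphrases, three runs each.
rows = [
    Observation("2026-W01", "model-a", "claim-1", "p1", 1, "verbatim"),
    Observation("2026-W01", "model-a", "claim-1", "p1", 2, "absent"),
    Observation("2026-W01", "model-a", "claim-1", "p1", 3, "paraphrased"),
    Observation("2026-W01", "model-a", "claim-1", "p2", 1, "absorbed"),
    Observation("2026-W01", "model-a", "claim-1", "p2", 2, "absent"),
    Observation("2026-W01", "model-a", "claim-1", "p2", 3, "absent"),
]
scores = weekly_r(rows)
for key, r in scores.items():
    print(key, round(r, 2))
```

At the stated ranges, one topic costs between 5 × 8 × 3 = 120 and 10 × 15 × 3 = 450 model calls per model per week, which is why a spreadsheet is enough for a single client but not for a panel of models.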

Tracked over time, R curves behave a lot like rank curves used to — they drift, they correlate with content changes, they decay when competitors publish better-sourced versions of the same claim. The temporal signal is arguably more useful than rank ever was, because R measures the model’s belief, not just its surface presentation.

What R is not

Reproducibility is not the whole story. A claim with R = 0.95 that nobody cares about is not visibility, it is busywork. R has to be paired with the prompt-distribution weighting term from the statement-level framework — how much an audience actually asks the question — to convert reproducibility into something that maps to commercial outcomes. A claim that reproduces 95% of the time on a query 8 people ask per month is less valuable than a claim that reproduces 60% of the time on a query 8,000 people ask. The score has to be weighted before it is acted upon.
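The comparison above can be worked through as plain arithmetic. This sketch assumes the simplest possible weighting, R multiplied by monthly query volume. The actual weighting term in the statement-level framework is not reproduced here, and `weighted_visibility` is a hypothetical name.

```python
def weighted_visibility(r: float, monthly_queries: int) -> float:
    """Expected number of monthly responses that surface the claim.

    Assumes a plain product of R and query volume; the framework's own
    weighting term may differ.
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must be between 0 and 1")
    if monthly_queries < 0:
        raise ValueError("monthly_queries must be non-negative")
    return r * monthly_queries


niche = weighted_visibility(0.95, 8)      # high R, tiny audience
broad = weighted_visibility(0.60, 8000)   # lower R, large audience
print(f"niche: {niche:.1f} expected appearances per month")
print(f"broad: {broad:.1f} expected appearances per month")
```

Under that assumption the niche claim surfaces roughly 7.6 times a month and the broad one roughly 4,800 times, a gap of more than two orders of magnitude despite the lower R.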

It is also worth saying clearly: reproducibility is a property of the model’s behavior toward your claim, not a property of the claim alone. Two models on the same claim can have wildly different R values. Report it per model, or report it as a panel average across a fixed model set with the panel composition disclosed — never as a single global number.
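One way to keep the panel honest is to make the report carry its own composition. The sketch below is an assumption about how such a report might be structured. `panel_report`, the model labels, and the R values are all made up for illustration.

```python
from statistics import mean
from typing import Any, Dict, Mapping


def panel_report(per_model_r: Mapping[str, float]) -> Dict[str, Any]:
    """Bundle per-model R values with the panel mean and the panel membership."""
    if not per_model_r:
        raise ValueError("panel must contain at least one model")
    return {
        "panel": sorted(per_model_r),               # disclosed composition
        "per_model": dict(per_model_r),             # never dropped from the report
        "panel_mean": mean(list(per_model_r.values())),
    }


# Made-up R values for one claim across a fixed three-model panel.
report = panel_report({"model-a": 0.90, "model-b": 0.35, "model-c": 0.70})
print(report["panel"], round(report["panel_mean"], 2))
```

Keeping the per-model values alongside the mean preserves the spread: a panel mean near 0.65 here hides one model at 0.35.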

Why this matters

The rank tracker was the foundational instrument of the SEO era because it gave the practice a feedback loop. Reproducibility scoring is the analogous instrument for the generative era, and the practitioners and tools that build it first will own the measurement layer of GEO (generative engine optimization) the way Moz and Ahrefs owned rank-tracking through the 2010s. The metric is not exotic. The infrastructure to compute it at scale (across paraphrases, across models, across time) is what will be expensive, and what will compound.

References

Sasson, G. (2026). Statement-level visibility, or: why ranking a page no longer matters. Algoholic, Vol. III, Essay 04. — The framework R operationalises at the per-claim level.
Sasson, G. (2026). A taxonomy of LLM citation behavior across 14 frontier models. Algoholic, Vol. III, Essay 03. — The behavior classes R aggregates over (verbatim, paraphrase, absorption, contradiction).
Min, S., Krishna, K., Lyu, X., et al. (2023). FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. EMNLP 2023. — Methodologically adjacent — atomic-claim presence scoring, reused here as a visibility signal rather than a factuality one.
Es, S., James, J., Espinosa-Anke, L., & Schockaert, S. (2024). RAGAS: Automated Evaluation of Retrieval Augmented Generation. EACL 2024 Demo Track. — Closest published instrumentation pattern; the open-source baseline for reproducing the R protocol.
Holtzman, A., Buys, J., Du, L., Forbes, M., & Choi, Y. (2020). The Curious Case of Neural Text Degeneration. ICLR 2020. — The nucleus-sampling / top-p paper; explains the structural source of the variance R measures against.
Sasson, G. (2026). Chunking is the new pagination. Algoholic, Vol. III, Essay 07. — The retrieval-survival mechanism that drives the third reproducibility lever in §3.

Footnotes

¹ Even at temperature 0, current frontier models exhibit non-determinism: floating-point addition is non-associative, and GPU matrix multiplications can accumulate in a different order depending on how a request is batched with other concurrent requests. The “deterministic” decode path is in practice a low-variance decode path, not a zero-variance one. Treating it as zero-variance has been the source of several embarrassing reproduction failures in 2025-vintage benchmark reporting.
² Operating thresholds vary by use case. For brand-defining claims (“our company is the X that does Y”) an R below 0.70 should be treated as not-visible and prioritised for content work. For long-tail informational claims, the threshold can be relaxed to 0.40, because the cost of intermittent absence is lower when the claim is auxiliary.

Gilad Sasson

aka Algoholic · גלעד ששון

Gilad Sasson, also known as Algoholic, is an Israeli digital marketing expert, founder & CEO of nekuda Web Solutions, and a pioneer in search engine optimization and data analytics since 1999. Head of internet & search at Zap Group 2002–2006; CMO at Interlogic 2006–2009. Speaker at SMX Israel, TNW Amsterdam, Web Summit Dublin, DMIEXPO.
