The rolling reference · v1.0 · May 30, 2026
How the empirical claims are measured.
Every working paper carries a Methods block disclosing the window, sample, and protocol for its specific empirical claims. This page is the rolling reference behind those blocks — the shared infrastructure, model panel, extraction rubric, and limitations that every individual study inherits and is entitled to depart from.
The standard fourteen-model panel.
Empirical studies aggregate across a fixed panel of fourteen frontier models drawn from seven vendors. The panel composition is reviewed quarterly; substitutions are noted in the panel changelog at the bottom of this page. Reporting per-model results is permitted; relying on any single model's number as decisive is not.
| Vendor | Model | Build / version | Role on the panel |
|---|---|---|---|
| OpenAI | GPT-5 Pro | Rolling weekly builds | Flagship closed-book + RAG hybrid |
| OpenAI | GPT-5 Standard | Stable monthly | Volume-mode reference |
| Anthropic | Claude Sonnet 4.6 | claude-sonnet-4-6 | Reasoning-emphasis baseline |
| Anthropic | Claude Opus 4.7 | claude-opus-4-7 | Long-context baseline |
| Gemini 2.5 Pro | Stable | Web-grounded retrieval baseline | |
| Gemini 2.5 Flash | Stable | Latency-sensitive RAG baseline | |
| Mistral | Mistral Large 3 | Stable | EU-resident model baseline |
| Perplexity | Sonar Pro | Rolling | Citation-first retrieval system |
| Perplexity | Sonar Standard | Rolling | Volume citation-first baseline |
| You.com | Genius | Stable | Multi-agent retrieval baseline |
| Phind | Pro | Stable | Technical-corpus specialist |
| Brave | Leo | Stable | Privacy-mode retrieval baseline |
| Kagi | Assistant | Stable | Paid-search aligned baseline |
| Meta (self-hosted) | Llama-3.3 70B | Pinned weights | Open-weights baseline |
Substitution policy: A model leaves the panel when (a) the vendor end-of-lifes it, (b) it drifts more than 20% from its panel debut on standard probes, or (c) a successor from the same vendor consistently outperforms it on three quarterly reviews. Substitutions are disclosed in the per-paper Methods block.
Extract, probe, compare.
Every empirical claim on this site is produced by some version of the three-operation framework. The operations are mechanical; the choices about which statements to track and which prompts to use are editorial. This section documents both.
- 01
Extract
Decompose target documents into atomic statements. Each statement is the smallest unit that can carry a truth-value — an assertion, a quantified claim, a named-entity attribution, a recommendation with its qualifier. Tag by type (factual, evaluative, prescriptive, historical) and by whether the statement is measurable as written.
Rubric maintained per vertical. Inter-rater agreement target: κ ≥ 0.75. Documents whose claim density falls below 0.4 atomic-statements-per-100-words are flagged as structurally unrecoverable and excluded from the corpus. - 02
Probe
Run a controlled prompt set against each model in the panel. Prompts drawn from real audience query data (search logs, sales-call transcripts, support tickets, customer interviews) — never from keyword-tool autocomplete. Each prompt repeated three times to dampen stochastic-decoding noise.
Standard audit: 40–80 prompts per topic. Position-defining studies: 200+ prompts. Temperature held at model default unless temperature-sensitivity is the dependent variable. Per-statement scoring: verbatim / paraphrased-with-source / silently-absorbed / contradicted / absent. - 03
Compare
Re-run the probe against a held-out set of competing documents to estimate marginal contribution — not just 'did my claim appear' but 'did it appear instead of a competitor's.' Visibility is relative; a claim that always loses to a better-sourced rival has effective visibility near zero.
Competitor sets are matched on commercial intent, not on superficial similarity. The compare step is the operation that flips a 'high raw visibility' document into a 'low relative visibility' one — typically 25–35% of documents in the top raw-visibility quartile fall out of the top quartile after compare.
Assumptions every paper inherits.
These are the standing methodological commitments. A paper that departs from any of them discloses the departure in its Methods block.
- All retention curves are mean across the panel, weighted by model traffic-share where known and uniformly otherwise. 95% bootstrap CIs reported in the underlying spreadsheet.
- All claim-presence judgments use a fine-tuned classifier with 10% human spot-checking. Inter-rater κ for the spot-check sample is reported per-paper.
- All temporal claims use the publication date of the paper; if the underlying audit window differs, the difference is disclosed in the Methods block of that paper.
- All cross-model comparisons hold prompt content constant per condition. Where prompts vary by language (Hebrew/English studies), matched-pair design is used.
What the framework gets wrong, by our own measurement.
A methodology that does not name its own failure modes is propaganda. Five limitations that apply across every empirical claim on this site:
-
Per-model variance is large week-to-week
Flagship models exhibit 8–15% per-statement variance week-to-week, even at nominally stable version pins. Panel aggregation dampens this but cannot eliminate it. Reading single-model numbers as decisive is a misuse of the data.
-
Audit windows are short
Most studies use 30- or 90-day windows. The substrate moves faster than that. We are extending key studies to 180 days; expect headline durability gaps to widen, not close.
-
Enterprise-RAG systems are out of scope (mostly)
The standard panel is consumer-facing open-web models. Enterprise-RAG behaviour deserves its own panel and study series; what is reported here generalises to those systems only with caveats.
-
Claim-matching has a measurable error rate
The fine-tuned classifier scores ~93% precision and ~89% recall against human judgment on the spot-check sample. Headline numbers should be read with an implicit ±3-point uncertainty band that future paper revisions will make explicit.
-
Weighting v(c) is set by hand
The per-claim value weighting in the visibility formula is editorial judgment. There is no obvious way to automate it without producing a metric that optimises against itself. Two reviewers grading the same corpus will produce different weightings; we report the range rather than collapsing it.
Replicate. Disagree. Publish counter-results.
Methodology that is not reproducible is not science. Three concrete affordances for serious replication:
- Probe sets are shareable. Email algoholic@algoholic.com with the paper you want to reproduce and the use you intend; serious requests receive the prompt set, the extraction rubric, and the response-tagging schema.
- Datasets are public-by-default. Aggregate data tables for each published paper are committed to github.com/nekudai-google/algoholic.com under
data/. Per-probe response logs are retained for one year and shared on request under a non-redistribution agreement (to protect the underlying model providers' terms of service). - Counter-results are welcomed and credited. A replication that fails to reproduce a headline finding is treated as a research contribution — acknowledged by name in the next paper revision, with the disconfirming data published alongside ours.
Changelog.
- v1.0 — 2026-05-30. Initial public methodology reference. Panel composition: 14 models as listed. Standing assumptions and limitations crystallised from the v2.0 revisions of the Volume III working papers.