The rolling reference · v1.0 · May 30, 2026

How the empirical claims are measured.

Every working paper carries a Methods block disclosing the window, sample, and protocol for its specific empirical claims. This page is the rolling reference behind those blocks — the shared infrastructure, model panel, extraction rubric, and limitations that every individual study inherits and is entitled to depart from.

The standard fourteen-model panel.

Empirical studies aggregate across a fixed panel of fourteen frontier models drawn from seven vendors. The panel composition is reviewed quarterly; substitutions are noted in the panel changelog at the bottom of this page. Reporting per-model results is permitted; relying on any single model's number as decisive is not.

VendorModelBuild / versionRole on the panel
OpenAI GPT-5 Pro Rolling weekly builds Flagship closed-book + RAG hybrid
OpenAI GPT-5 Standard Stable monthly Volume-mode reference
Anthropic Claude Sonnet 4.6 claude-sonnet-4-6 Reasoning-emphasis baseline
Anthropic Claude Opus 4.7 claude-opus-4-7 Long-context baseline
Google Gemini 2.5 Pro Stable Web-grounded retrieval baseline
Google Gemini 2.5 Flash Stable Latency-sensitive RAG baseline
Mistral Mistral Large 3 Stable EU-resident model baseline
Perplexity Sonar Pro Rolling Citation-first retrieval system
Perplexity Sonar Standard Rolling Volume citation-first baseline
You.com Genius Stable Multi-agent retrieval baseline
Phind Pro Stable Technical-corpus specialist
Brave Leo Stable Privacy-mode retrieval baseline
Kagi Assistant Stable Paid-search aligned baseline
Meta (self-hosted) Llama-3.3 70B Pinned weights Open-weights baseline

Substitution policy: A model leaves the panel when (a) the vendor end-of-lifes it, (b) it drifts more than 20% from its panel debut on standard probes, or (c) a successor from the same vendor consistently outperforms it on three quarterly reviews. Substitutions are disclosed in the per-paper Methods block.

Extract, probe, compare.

Every empirical claim on this site is produced by some version of the three-operation framework. The operations are mechanical; the choices about which statements to track and which prompts to use are editorial. This section documents both.

  1. 01

    Extract

    Decompose target documents into atomic statements. Each statement is the smallest unit that can carry a truth-value — an assertion, a quantified claim, a named-entity attribution, a recommendation with its qualifier. Tag by type (factual, evaluative, prescriptive, historical) and by whether the statement is measurable as written.

    Rubric maintained per vertical. Inter-rater agreement target: κ ≥ 0.75. Documents whose claim density falls below 0.4 atomic-statements-per-100-words are flagged as structurally unrecoverable and excluded from the corpus.
  2. 02

    Probe

    Run a controlled prompt set against each model in the panel. Prompts drawn from real audience query data (search logs, sales-call transcripts, support tickets, customer interviews) — never from keyword-tool autocomplete. Each prompt repeated three times to dampen stochastic-decoding noise.

    Standard audit: 40–80 prompts per topic. Position-defining studies: 200+ prompts. Temperature held at model default unless temperature-sensitivity is the dependent variable. Per-statement scoring: verbatim / paraphrased-with-source / silently-absorbed / contradicted / absent.
  3. 03

    Compare

    Re-run the probe against a held-out set of competing documents to estimate marginal contribution — not just 'did my claim appear' but 'did it appear instead of a competitor's.' Visibility is relative; a claim that always loses to a better-sourced rival has effective visibility near zero.

    Competitor sets are matched on commercial intent, not on superficial similarity. The compare step is the operation that flips a 'high raw visibility' document into a 'low relative visibility' one — typically 25–35% of documents in the top raw-visibility quartile fall out of the top quartile after compare.

Assumptions every paper inherits.

These are the standing methodological commitments. A paper that departs from any of them discloses the departure in its Methods block.

  • All retention curves are mean across the panel, weighted by model traffic-share where known and uniformly otherwise. 95% bootstrap CIs reported in the underlying spreadsheet.
  • All claim-presence judgments use a fine-tuned classifier with 10% human spot-checking. Inter-rater κ for the spot-check sample is reported per-paper.
  • All temporal claims use the publication date of the paper; if the underlying audit window differs, the difference is disclosed in the Methods block of that paper.
  • All cross-model comparisons hold prompt content constant per condition. Where prompts vary by language (Hebrew/English studies), matched-pair design is used.

What the framework gets wrong, by our own measurement.

A methodology that does not name its own failure modes is propaganda. Five limitations that apply across every empirical claim on this site:

  1. Per-model variance is large week-to-week

    Flagship models exhibit 8–15% per-statement variance week-to-week, even at nominally stable version pins. Panel aggregation dampens this but cannot eliminate it. Reading single-model numbers as decisive is a misuse of the data.

  2. Audit windows are short

    Most studies use 30- or 90-day windows. The substrate moves faster than that. We are extending key studies to 180 days; expect headline durability gaps to widen, not close.

  3. Enterprise-RAG systems are out of scope (mostly)

    The standard panel is consumer-facing open-web models. Enterprise-RAG behaviour deserves its own panel and study series; what is reported here generalises to those systems only with caveats.

  4. Claim-matching has a measurable error rate

    The fine-tuned classifier scores ~93% precision and ~89% recall against human judgment on the spot-check sample. Headline numbers should be read with an implicit ±3-point uncertainty band that future paper revisions will make explicit.

  5. Weighting v(c) is set by hand

    The per-claim value weighting in the visibility formula is editorial judgment. There is no obvious way to automate it without producing a metric that optimises against itself. Two reviewers grading the same corpus will produce different weightings; we report the range rather than collapsing it.

Replicate. Disagree. Publish counter-results.

Methodology that is not reproducible is not science. Three concrete affordances for serious replication:

  • Probe sets are shareable. Email algoholic@algoholic.com with the paper you want to reproduce and the use you intend; serious requests receive the prompt set, the extraction rubric, and the response-tagging schema.
  • Datasets are public-by-default. Aggregate data tables for each published paper are committed to github.com/nekudai-google/algoholic.com under data/. Per-probe response logs are retained for one year and shared on request under a non-redistribution agreement (to protect the underlying model providers' terms of service).
  • Counter-results are welcomed and credited. A replication that fails to reproduce a headline finding is treated as a research contribution — acknowledged by name in the next paper revision, with the disconfirming data published alongside ours.

Changelog.

  • v1.0 — 2026-05-30. Initial public methodology reference. Panel composition: 14 models as listed. Standing assumptions and limitations crystallised from the v2.0 revisions of the Volume III working papers.