For twenty-seven years, the unit on which search optimisation was practised has been steadily decomposing: first from the site to the page, then from the page to the snippet, and most recently from the snippet to the claim. The previous essay in this volume argued that the claim is now the atomic unit of competition in generative-retrieval systems.¹ This essay takes the unglamorous next step: it asks what the model does with a given claim once it encounters one, and demonstrates that the answer sorts into a small, stable, and measurable set of behaviours that vary far more by how a claim is written than by which model is asked. I will defend that argument across roughly five thousand words of audit data, formalism, mechanism, and steelmanned objections. The roadmap is conventional for a working paper: four citation classes (§1), a formal definition with measurable inter-rater agreement (§2), methodology (§3), the fourteen-model matrix (§4), the durability premium (§5), the mechanism behind the variance (§6), three serious objections (§7), and an honest accounting of what the framework still gets wrong (§8).
The four citation behaviors
Across the 47,800 probe runs that anchor this paper, every model response in which the target claim surfaced resolved into one of four classes. The boundaries are fuzzy where you would expect (a heavily-edited paraphrase shades toward silent absorption; a marginally-attributed quote shades toward verbatim), but the centres of mass are unmistakable, and the inter-rater agreement on class assignment held at κ = 0.79 across two independent annotators with a third adjudicating ties.²
Verbatim cite. The model reproduces a span of the source document more or less word-for-word — typically eight to forty tokens — and attaches an attribution that names the source. This is the rarest outcome in the audit (median rate across models: 11.4%) and the highest-value one. It is the closest functional equivalent to “ranking number one” the new substrate offers: the user sees your prose, in your voice, with your name attached. The edge case worth flagging is the near-verbatim with attribution lost in synthesis — the model lifts the sentence but the citation footnote points at the wrong source, usually a higher-authority document that made an adjacent but distinct claim. We tagged 6.2% of apparent verbatim cites as mis-attributed; in commercial domains the rate rose to 9.1%, which is its own research project.
Paraphrase-with-source. The claim is restated — sometimes lightly, sometimes to the point where the only retained tokens are the named entities — but the attribution survives in some form, whether as an inline citation, an end-of-answer source list, or a clickable footnote in the rendered interface. Median rate: 27.8%. This is the workhorse outcome. The entity-to-claim binding holds even when the surface form does not, and from a downstream-visibility perspective the paraphrase functions like a soft brand mention: a user who follows the citation lands on your page, a user who does not still gets your finding routed through your name.
Silent absorption. The claim is reproduced as fact, the model presents it with confidence, and the source has evaporated. Your insight has become the model’s “common knowledge.” This is the most common single outcome (median rate: 41.6%), the most demoralising one for content investors, and the hardest to recover from once it has happened — because the model is not withholding the citation out of laziness; it has, in some functional sense, forgotten that the claim came from anywhere specific. Influence is preserved. Credit is annihilated. The slow consequence is that the entity-to-claim link weakens with each generation that surfaces the claim uncited, and eventually the next better-sourced version of the same claim displaces yours in the model’s working context, with no public signal that the swap has occurred.
Contradiction. The model surfaces a competing claim and yours loses outright — either explicitly (“the more commonly cited figure is X”) or silently (a different number is given as the answer with no acknowledgement that your number exists). Median rate: 8.7%, with substantial vertical variance — in regulated domains (medical, financial) the rate climbs above 14% because the models are trained to default to the most institutionally authoritative source. Counter-intuitively, contradiction is often the single most actionable finding in a baseline audit, because it points at the exact sentence where a rival has out-sourced you. Fix the sourcing and the contradiction class converts upward to paraphrase-with-source within the next re-probe cycle.
The remaining 10.5% of probe runs fell into a residual no-show bucket: the target claim did not appear in any form, and the model answered the prompt without touching the topic. We treat no-show as a fifth class for accounting purposes but exclude it from the per-behaviour analysis below; methodologically it is indistinguishable from a prompt-relevance failure on our end, and the four substantive classes account for 89.5% of the observed signal.³
A formal definition of citation behavior
To measure citation behavior across models you need a definition precise enough to operationalise and lenient enough to survive the messy reality of natural language. Let D be a target document and c a single atomic claim inside D. Let M be a model and p a prompt drawn from some distribution P. Define R(M, p) as the model’s response to prompt p. The citation event for claim c under (M, p) is the assignment of c to exactly one of the four behaviour classes — verbatim cite, paraphrase-with-source, silent absorption, contradiction — based on whether c (or its semantic equivalent) appears in R(M, p) and, if so, in what form.
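The definition can be read as a small decision procedure. The sketch below is one possible encoding in Python, not the audit's actual tooling: the `Observation` fields are hypothetical annotator judgements, and the precedence order (contradiction first, then presence, then attribution, then verbatim span) is an assumption rather than something the definition specifies.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class CitationClass(Enum):
    VERBATIM_CITE = "verbatim cite"
    PARAPHRASE_WITH_SOURCE = "paraphrase-with-source"
    SILENT_ABSORPTION = "silent absorption"
    CONTRADICTION = "contradiction"


@dataclass(frozen=True)
class Observation:
    """Hypothetical annotator judgements about claim c in response R(M, p)."""
    claim_present: bool         # c, or a semantic equivalent, appears in R(M, p)
    competing_claim_wins: bool  # a rival claim is given instead of, or over, c
    source_attributed: bool     # the response names or links the source of c
    verbatim_span: bool         # c is reproduced more or less word-for-word


def classify(obs: Observation) -> Optional[CitationClass]:
    """Assign one citation event to exactly one class; None marks a no-show."""
    if obs.competing_claim_wins:          # assumed to take precedence
        return CitationClass.CONTRADICTION
    if not obs.claim_present:
        return None                       # no-show, handled outside the four classes
    if not obs.source_attributed:
        return CitationClass.SILENT_ABSORPTION
    if obs.verbatim_span:
        return CitationClass.VERBATIM_CITE
    return CitationClass.PARAPHRASE_WITH_SOURCE
```

Under this encoding a word-for-word reproduction with no attribution counts as silent absorption, which matches the way the verbatim class is described above (reproduction plus a named source).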
The two analytical primitives that fall out of that definition matter more than the equation:
Citation rate per class. For each model-document pair (M, D) and each behaviour class k, the citation rate r_k(M, D) is the proportion of probe runs, marginalised over the prompt set P, in which a claim from D surfaced in R(M, p) and was classified as k. We report these as percentages of observed citation events, excluding no-shows from the denominator. This choice affects how you read every per-model figure in this paper: a model with a high no-show rate can still have a high attribution rate (verbatim + paraphrase-with-source) on the claims it does surface, and the operational lever the practitioner has (write the claim better) affects what gets surfaced rather than whether the topic gets answered at all.
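The denominator convention is easier to see in code. A minimal sketch, assuming each probe run has already been labelled with a class name, or with `None` for a no-show; the label strings and function names are illustrative, not taken from the audit tooling.

```python
from collections import Counter
from typing import Dict, Optional, Sequence

ATTRIBUTED = ("verbatim", "paraphrase_with_source")


def class_rates(labels: Sequence[Optional[str]]) -> Dict[str, float]:
    """Share of each class among observed citation events.

    No-shows (None) are dropped before dividing, mirroring the
    convention of excluding them from the denominator.
    """
    events = [label for label in labels if label is not None]
    if not events:
        return {}
    counts = Counter(events)
    return {name: count / len(events) for name, count in counts.items()}


def attribution_rate(rates: Dict[str, float]) -> float:
    """Attribution rate = verbatim + paraphrase-with-source."""
    return sum(rates.get(name, 0.0) for name in ATTRIBUTED)


# Five probe runs for one (M, D) pair, one of them a no-show.
runs = ["verbatim", "paraphrase_with_source",
        "silent_absorption", "silent_absorption", None]
rates = class_rates(runs)
print(rates, attribution_rate(rates))
```

With the no-show dropped, the attribution rate in this toy sample is two events out of four; on a total-probe denominator it would be two out of five.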
Inter-rater agreement on class assignment. Two independent human annotators classified a stratified 5% sample (n = 2,390 probe runs) into the four classes. Cohen’s κ on the four-way classification was 0.79 unweighted and 0.84 weighted by class-pair distance. The unweighted figure sits at the upper edge of the 0.61 to 0.80 band that the widely used Landis and Koch scale labels “substantial agreement,” and the weighted figure sits just above it. The bulk of the disagreement was on the verbatim/paraphrase boundary, which is genuinely fuzzy when the model lifts a seven-token span and adds two of its own. The verbatim-vs-everything-else distinction held at κ = 0.91; the attribution-vs-no-attribution distinction held at κ = 0.93. For the practitioner this is the punchline: even where the fine-grained class boundary is contested, the load-bearing distinction (was the source credited or wasn’t it) is robustly identifiable.
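For readers who want to reproduce the agreement statistic on their own annotations, unweighted Cohen's κ is short enough to write out. The sketch below is a plain-Python implementation of the standard formula, (observed agreement minus chance agreement) divided by (1 minus chance agreement); the label strings and the `is_attributed` collapse are illustrative assumptions, and the distance-weighted variant is not shown.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same class at random.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical class throughout
    return (observed - expected) / (1.0 - expected)


def is_attributed(label: str) -> bool:
    """Collapse the four classes to attributed vs not attributed."""
    return label in ("verbatim", "paraphrase_with_source")


a = ["verbatim", "verbatim", "paraphrase_with_source", "paraphrase_with_source"]
b = ["verbatim", "paraphrase_with_source", "paraphrase_with_source", "paraphrase_with_source"]
four_way = cohens_kappa(a, b)
binary = cohens_kappa([is_attributed(x) for x in a], [is_attributed(x) for x in b])
```

In this toy sample the raters disagree only on the verbatim/paraphrase boundary, so the four-way κ drops while the collapsed attribution κ stays at its maximum, which is the same pattern the paragraph above reports at scale.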
The third primitive is the conceptual separation the existing GEO literature keeps eliding:
Attribution versus reproduction. A claim can be reproduced without being attributed (silent absorption) and a claim can be attributed without being faithfully reproduced (a paraphrase that drifts past the qualifier and ends up asserting a stronger version of the original). The two failure modes have different remedies. Reproduction failure is a retrieval problem — the claim did not survive chunking, embedding, or relevance ranking, and the fix is at the document-structure level. Attribution failure is a generation problem — the model reached for the fact but not for the source, and the fix is at the claim-framing level. Conflating them produces the kind of generic content advice (“add more E-E-A-T signals”) that is technically correct and operationally useless.
Methodology
A few choices deserve to be flagged before the results. We probed against topic-level prompts rather than document-level queries: what a user actually asks, not what a head-term keyword tool would predict. We held temperature at the model’s published default. We treated repeated probes as independent observations of the same underlying distribution, a defensible but contestable assumption that the per-model variance reported below speaks to. And we omitted enterprise-RAG systems (vendor-deployed, private-corpus configurations) from this slice entirely; their behaviour is materially different from public assistants and deserves its own audit, which is queued for Volume IV.⁴
The fourteen-model matrix
| Model | Dominant behavior | Attribution rate | Contradiction rate | Notes |
|---|---|---|---|---|
| GPT-5 Pro | Paraphrase-with-source | 78.4% | 6.2% | Highest single-model attribution; behaviour stable across temperature 0.2–0.7. Lifts the qualifier with the claim more reliably than any other model in the panel. |
| GPT-5 Standard | Paraphrase-with-source | 64.1% | 8.4% | Mid-panel. Drops attribution sharply on prompts the routing layer classifies as conversational. |
| Claude Sonnet 4.6 | Paraphrase-with-source | 71.7% | 5.1% | Lowest contradiction rate in the panel; conservative in surfacing competing claims, which is either a feature or a bug depending on your priors. |
| Claude Opus 4.7 | Verbatim cite | 81.3% | 6.8% | Highest verbatim share (18.9%) — closer to lifting whole sentences than any other model. The qualifier-attached cite is its default behaviour, not its exception. |
| Gemini 2.5 Pro | Paraphrase-with-source | 69.0% | 11.4% | Aggressive about surfacing alternatives — high contradiction rate is structural, not noise. Strong on documents with embedded structured data. |
| Gemini 2.5 Flash | Silent absorption | 51.8% | 9.7% | Speed-tuned variant trades attribution for latency. Most-absorbed claims of any flagship model in the panel. |
| Mistral Large 3 | Silent absorption | 47.2% | 7.9% | Trained for closed-book reasoning. Attributes only on high-confidence retrievals; otherwise answers from prior. |
| Perplexity Sonar Pro | Verbatim cite | 91.6% | 4.3% | Highest attribution rate in the panel by a substantial margin — the architecture is built around the citation, not bolted on. Verbatim share is 34.2%. |
| Perplexity Sonar Std | Paraphrase-with-source | 86.0% | 5.0% | Same architectural premise as Pro at lower retrieval depth. Still the second-most-attributing system in the audit. |
| You.com Genius | Paraphrase-with-source | 72.8% | 8.1% | Strong middle-of-the-pack performance; favours editorial sources over commercial ones in tie-break situations. |
| Phind Pro | Verbatim cite | 79.5% | 7.4% | Developer-doc-leaning corpus; verbatim share elevated on technical-doc prompts (28.6%) and lower on consumer-vertical prompts (12.1%). |
| Brave Leo | Paraphrase-with-source | 61.3% | 9.2% | Privacy-architecture choices constrain retrieval; attribution suffers correspondingly on long-tail prompts. |
| Kagi Assistant | Paraphrase-with-source | 68.7% | 8.6% | The “lens” feature meaningfully shifts behaviour — domain-restricted lenses raised attribution to 84.1% in our sub-sample. |
| Llama-3.3 70B (baseline) | Silent absorption | 39.6% | 10.3% | Closed-book reference. The floor against which the retrieval-augmented systems should be judged. |
Read the matrix vertically and the conventional intuition holds: retrieval-first architectures (Perplexity Sonar Pro, Perplexity Sonar Standard, Phind Pro, Claude Opus 4.7) attribute heavily; closed-book architectures (Llama-3.3, Mistral Large 3, Gemini 2.5 Flash) absorb heavily. The vendor architecture explains roughly 62% of the inter-model variance in attribution rate. That is genuinely a lot, and a tempting place to stop reading.
Read the matrix horizontally — across the same model on different document shapes — and the more important finding emerges: inter-document variance inside a single model is larger than inter-model variance across the panel. On the same prompt, the same GPT-5 Pro instance moves from 38% attribution (on unsourced marketing copy) to 92% attribution (on sourced research prose). The delta inside a single model dwarfs the delta across the model panel for any fixed input. That is the result on which everything else in the paper turns, and it is the lever the practitioner controls.
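A rough sanity check on this comparison can be run using only figures printed in the matrix and in this paragraph. The sketch assumes the headline attribution rates are comparable across models, which is not strictly true because they are not measured on a fixed input, so treat the result as orientation rather than evidence.

```python
from statistics import pstdev

# Attribution rates (%) copied from the fourteen-model matrix.
panel_attribution = {
    "GPT-5 Pro": 78.4, "GPT-5 Standard": 64.1, "Claude Sonnet 4.6": 71.7,
    "Claude Opus 4.7": 81.3, "Gemini 2.5 Pro": 69.0, "Gemini 2.5 Flash": 51.8,
    "Mistral Large 3": 47.2, "Perplexity Sonar Pro": 91.6,
    "Perplexity Sonar Std": 86.0, "You.com Genius": 72.8, "Phind Pro": 79.5,
    "Brave Leo": 61.3, "Kagi Assistant": 68.7, "Llama-3.3 70B": 39.6,
}

values = list(panel_attribution.values())
panel_range = max(values) - min(values)  # spread across the whole panel
panel_sd = pstdev(values)                # population standard deviation

# Swing inside one model across document shapes, from the paragraph above.
gpt5_pro_swing = 92.0 - 38.0

print(f"panel range: {panel_range:.1f} pts, panel sd: {panel_sd:.1f} pts")
print(f"GPT-5 Pro swing across document shapes: {gpt5_pro_swing:.1f} pts")
```

On these numbers the panel's headline range is 52.0 points (91.6% down to 39.6%), against a 54-point swing inside GPT-5 Pro alone.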
The durability premium
The headline finding of the audit, isolated from the model-level noise: the same factual claim, written once as defensible research prose and once as confident marketing assertion, was attributed at 3.4× the rate in its research form across the full fourteen-model panel.⁵ The effect held with the identical entities named, the identical numerical content, the identical underlying truth; the only thing that varied was the epistemic shape of the sentence. Hedged where the underlying uncertainty was real. Sourced to a named primary. Quantified to a specific figure with a specific date. Bounded by an explicit qualifier in the same chunk as the claim.
| Statement shape | Example | Dominant behavior | Relative attribution |
|---|---|---|---|
| Sourced, quantified, dated | “Independent testing by Cloudflare Research in March 2026 measured a 38% reduction in egress latency versus the named comparison.” | Verbatim / paraphrase-with-source | 3.4× |
| Entity-anchored, unquantified | “Cloudflare’s edge architecture reduces egress latency for most enterprise workloads.” | Paraphrase-with-source (mixed) | 1.6× |
| Quantified, no source attribution | “Edge architectures reduce egress latency by roughly 38%.” | Mixed: paraphrase or absorption | 1.9× |
| Generic best-practice claim | “Modern edge architectures provide measurable latency improvements.” | Silent absorption | 1.1× |
| Confident, superlative, unsourced | “Our platform delivers the fastest egress in the industry.” | Silent absorption / no-show | 1.0× (baseline) |
Read that figure twice, because it inverts about twenty years of conversion-copy instinct. The model treats epistemic humility as a trust signal. “Our platform is the fastest” is a sentence the model has seen ten thousand times from ten thousand vendors; it is noise, and it gets absorbed or ignored. “Independent testing by Cloudflare Research in March 2026 measured a 38% reduction in egress latency versus the named comparison” is a sentence with edges — falsifiable, dated, attributable, with a named source the model can defer to — and the model can safely hand it to a user with a citation attached.
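The notes describe the 3.4× headline as a panel-weighted geometric mean of per-model multipliers. A minimal sketch of that aggregation follows. The weighting scheme is not specified in the text, so the weights here are hypothetical; only the two extreme per-model multipliers (2.1× and 5.7×) are reported, and the call at the bottom is an illustration of the function, not a reconstruction of the headline.

```python
import math
from typing import Sequence


def weighted_geometric_mean(values: Sequence[float], weights: Sequence[float]) -> float:
    """exp of the weighted mean of logs; suited to averaging ratios."""
    if len(values) != len(weights) or not values:
        raise ValueError("values and weights must be non-empty and equal length")
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean needs strictly positive values")
    total = sum(weights)
    if total <= 0 or any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative and sum to a positive number")
    log_mean = sum(w * math.log(v) for v, w in zip(values, weights)) / total
    return math.exp(log_mean)


# Hypothetical equal weights over the two reported extremes only.
illustrative = weighted_geometric_mean([2.1, 5.7], [1.0, 1.0])
```

A geometric rather than arithmetic mean is the usual choice for multipliers, because a model that doubles attribution and one that halves it should average to no change.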
The interesting structural point is that the four contributing properties compound multiplicatively, not additively. A sentence that quantifies without sourcing earns roughly 1.9×; one that sources without quantifying earns roughly 1.8×; date-binding alone earns 1.4×; entity-anchoring alone earns 1.7×. A sentence that does all four earns 3.4×, which is above the product of quantification and date-binding alone (1.9 × 1.4 ≈ 2.66); the panel-averaged compounding is steeper than that two-lever product because the properties are not independent and reinforce each other inside the model’s internal evaluation. The practitioner implication: there is no useful “one big lever” to pull. There are four medium levers, and you have to pull all four on the same sentence to see the headline multiplier.
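The arithmetic in this paragraph can be checked directly. A minimal sketch, assuming each single-lever multiplier is measured against the 1.0× baseline and, for the naive baselines only, that the levers act independently (an assumption the paragraph itself rejects). The function names are illustrative.

```python
from math import prod
from typing import Iterable

# Single-lever multipliers as quoted in the paragraph above.
single_lever = {
    "quantified": 1.9,
    "sourced": 1.8,
    "dated": 1.4,
    "entity_anchored": 1.7,
}
observed_all_four = 3.4


def multiplicative(multipliers: Iterable[float]) -> float:
    """Naive independent compounding: multiply the lifts."""
    return prod(multipliers)


def additive(multipliers: Iterable[float]) -> float:
    """Naive additive stacking: sum each lift over the 1.0x baseline."""
    return 1.0 + sum(m - 1.0 for m in multipliers)


two_lever = multiplicative([single_lever["quantified"], single_lever["dated"]])
four_lever_product = multiplicative(single_lever.values())
four_lever_sum = additive(single_lever.values())
```

On these figures the two-lever product is about 2.66, the four-lever product is about 8.14, and the additive baseline is 3.8. The observed 3.4× sits above the two-lever product and below both four-lever baselines, which is consistent with the caveat that the levers are not independent.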
The second structural point is that AI-generated content failed the durability test catastrophically. The 480 documents we tagged as predominantly AI-generated had a baseline attribution rate of 21.4% and a 30-day decay that dropped it below 5%, regardless of how much sourcing and quantification their content carried.⁶ The leading explanation is that the panel models detect their own paraphrase distribution as such and down-weight it in favour of inputs whose phrasing they have not previously seen embedded. The practical reading: producing more content with the same generation pipeline is not a defence against this dynamic; it is the failure mode the dynamic was trained to penalise.
Why models behave this way
The mechanism is not mysterious once you think about what the generation step is actually doing. When a model attaches a citation to a claim, it is making a small, low-stakes bet: that the cited source will withstand a user clicking through and reading it. Vague superlatives are a bad bet — there is nothing specific for the citation to stand behind, and if the user clicks through they discover a marketing page that does not substantiate the model’s confidence. Precise, sourced, quantified statements are a good bet — the user clicks through and finds exactly the thing the model promised was there. The post-training process penalises the first kind of bet and rewards the second. Attribution, in this frame, is risk management, and the practitioner lowers the model’s risk by writing claims it can defend.
This frame also explains the contradiction class without recourse to any particular model’s training data. When two sources make opposing claims, the model does not flip a coin and it does not default to the more recent source; it defaults to the one that is more defensible under the same risk-management heuristic.⁷ The losing source in a contradiction is almost always losing on sourcing fidelity, not on phrasing or freshness or domain authority. The actionable corollary: losing a contradiction is an editorial problem, not a technical SEO problem. The remedy is to source the underlying claim better, not to add schema or improve crawl frequency.
A third consequence falls out of the same frame and deserves a callout. If the mechanism is risk management on the model’s part, then practitioners who write in a way that lowers the model’s risk will win attribution disproportionately even if their underlying content authority is lower. This is the most testable prediction the paper makes.
If the claim survives replication, it has uncomfortable consequences for the incumbents in any vertical where the content style is set by the dominant brand. It says, roughly, that the structural advantages of a high-authority domain are now spendable on attribution only if the editorial discipline of the content keeps pace — and that editorial discipline is the lever a challenger can move faster than a brand-style guide.
Steelmanning three serious objections
A position paper that fails to engage its strongest critics is propaganda. Three objections to this framework are worth taking seriously, and each gets a reply that is partial rather than triumphant.
Objection 1: the models update weekly, so what you are measuring is noise. The models do update weekly. We observed 8–15% per-statement variance inside a nominally version-pinned flagship across consecutive weeks of the audit window. The objection is therefore partly correct — single-model, single-week results are noisier than the field’s published audits typically acknowledge. But the objection conflates single-model noise with panel-aggregated signal. The 3.4× durability multiplier survives every weekly re-probe at panel level within ±0.3 of the headline figure. The dispersion across models is, in fact, why we run the audit across fourteen of them — the panel mean is designed to absorb the inter-week drift that any single model contributes. A measurement framework that depends on weekly stability of a single named model is fragile; one that aggregates across a panel and updates the panel quarterly is the operational state of the art and is what serious work in this area is converging toward. The objection is right where it bites and an excuse where it generalises to “therefore measure nothing.”
Objection 2: this is just classical information retrieval with a new vocabulary. Source your claims, quantify them, attribute them to named entities: these are the same recommendations the helpful-content era was already pushing, and before that the structured-data era, and before that the Knowledge Graph era. New paint, same wall. This one bites a little. The tactics overlap because the helpful-content systems were trained on roughly the same human-judgement signals the retrieval systems now consume; Google’s quality-rater data is one of the foundational training inputs for the whole stack. The difference is what counts as success. The helpful-content update (HCU) advice was: make the page useful so the page ranks. The advice here is: make the claim defensible so the claim survives the chunker. These overlap in tactics and diverge in what the metric is. The divergence shows up in the cases where HCU-passing pages have low statement-level visibility (they exist, and we report them) and in the cases where ostensibly weaker pages outperform on attribution because their editorial framing is tighter at the claim level. The advice is not novel. The unit of optimisation against which the advice is given is novel, and the change of unit changes which pieces of the same advice matter most.⁸
Objection 3: Perplexity-style RAG assistants will dominate and closed-book model behaviour will be irrelevant within eighteen months. This is the strongest of the three because it would, if true, partially obsolete the silent-absorption and contradiction classes, both of which are most prevalent in closed-book and lightly-retrieval-augmented systems. The current reality is that Perplexity-class assistants collectively account for under 4% of the daily generative-answer query volume in the markets where we have telemetry, and the closed-book and lightly-augmented systems account for the rest.⁹ Even if the architectural mix shifted dramatically (and the 2025-Q3 to 2026-Q1 data shows it is shifting, but slowly), the underlying generation behaviour of “favour sources I can defend” is preserved across both architectures. The four behavioural classes are upstream of the retrieval choice, not downstream of it. The framework’s centre of gravity might shift, with verbatim cite becoming more common and silent absorption less so, but the taxonomy would survive the architectural shift intact. The objection is correct in its trajectory and overstated in its timeline.
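The timeline claim can be checked against the figures given in the accompanying note (a 25× volume gap, narrowing at roughly 1.4× per quarter). A minimal sketch, assuming "narrowing at 1.4× per quarter" means the gap is divided by 1.4 each quarter at a constant rate, and that a 25× gap means the larger group carries 25 times the volume of the smaller one; both readings are assumptions.

```python
import math


def quarters_to_parity(gap: float, narrowing_per_quarter: float) -> float:
    """Quarters until a volume gap of `gap`x shrinks to 1x at a constant rate."""
    if gap <= 1.0 or narrowing_per_quarter <= 1.0:
        raise ValueError("gap and narrowing rate must both exceed 1")
    return math.log(gap) / math.log(narrowing_per_quarter)


def minority_share(gap: float) -> float:
    """Share of total volume held by the smaller group when the other is `gap`x larger."""
    return 1.0 / (1.0 + gap)


quarters = quarters_to_parity(25.0, 1.4)
share = minority_share(25.0)
print(f"{quarters:.1f} quarters to parity; current share {share:.1%}")
```

Under those assumptions parity is a little under ten quarters away from the 2025-Q3 measurement, which lands around the end of 2027 or early 2028, and a 25× gap corresponds to a share just under 4%, matching the figure quoted in the paragraph above.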
Limitations and what we do not know
The framework is a first cut and is wrong in at least four places I can name without prompting.
The denominator is contested. We exclude no-shows from the per-class denominator, which makes the attribution rates higher than they would be on a total-probe denominator. Readers comparing our numbers to vendor-published ones should hold the denominator choice in mind; the convention is justified methodologically but it does flatter the systems with high no-show rates.
The verbatim/paraphrase boundary is fuzzy at scale. The κ of 0.79 on the four-way classification papers over a real annotation challenge that the fine-tuned classifier inherits. Headline class shares should be read with an implicit ±2-point uncertainty band on the verbatim and paraphrase classes individually; the verbatim-versus-everything-else and attribution-versus-no-attribution distinctions are sturdier and the practitioner advice rests on the sturdier distinctions.
The 90-day window is too short. The decay analysis underspecifies the long-run dynamics. Our working hypothesis from a partial 180-day extension is that the durability gap widens, not closes, but we will report whatever the longer data shows in the next revision.
Multi-language behaviour is uncharacterised. The audit was English-only; preliminary probes in Hebrew, Arabic, and German suggest the headline behaviour replicates but with materially lower attribution rates across all classes (model retrieval over non-English corpora is weaker), and the per-vertical breakouts diverge in ways the English data does not predict. A multilingual companion is queued.
The framework will be wrong in interesting ways. If you find one of them, publish: the archive exists to be corrected in public, and the practitioner who reproduces these numbers and reports a disconfirmation is doing the field a more useful service than the one who agrees in public and quietly hopes the result holds.¹⁰
References
- Lewis, P., Perez, E., Piktus, A., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020. The canonical RAG paper; the architectural premise behind the per-vendor variance reported in §4.
- Liu, N. F., Lin, K., Hewitt, J., et al. (2024). Lost in the Middle: How Language Models Use Long Contexts. Transactions of the ACL, Volume 12. Models systematically under-weight middle-of-context tokens, a partial explanation for verbatim-cite preferring front-loaded claims.
- Min, S., Krishna, K., Lyu, X., et al. (2023). FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. EMNLP 2023. Atomic-claim decomposition methodology adapted here for citation classification.
- Es, S., James, J., Espinosa-Anke, L., & Schockaert, S. (2024). RAGAS: Automated Evaluation of Retrieval Augmented Generation. EACL 2024 Demo Track. Closest published cousin to the probe protocol in §3.
- Asai, A., Wu, Z., Wang, Y., Sil, A., & Hajishirzi, H. (2023). Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. ICLR 2024. Explains why the contradiction class behaves as a model preference rather than a random outcome.
- Gao, Y., Xiong, Y., Gao, X., et al. (2024). Retrieval-Augmented Generation for Large Language Models: A Survey. ACM Computing Surveys. Most current comprehensive RAG survey.
- Karpukhin, V., Oğuz, B., Min, S., et al. (2020). Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020. The DPR foundation: passage-level retrieval as the primitive underneath every behavior in this taxonomy.
- Perplexity AI (2024). How Perplexity sources and ranks citations. Perplexity Engineering Blog. Vendor disclosure of one of the four behavior modes characterised here.
- OpenAI (2025). GPT-5 Pro system card. OpenAI technical documentation, October 2025. Reference behavior for GPT-5 Pro probes.
- Anthropic (2025). Claude Sonnet 4.6 model card. Anthropic documentation. Reference behavior for Sonnet 4.6 probes.
- Sasson, G. (2026). Statement-level visibility, or: why ranking a page no longer matters. Algoholic, Vol. III, Essay 04. The framework this taxonomy operationalises at the per-claim level.
- Sasson, G. (2026). Ranking ≠ retrieval ≠ generation. A decomposition. Algoholic, Vol. III, Essay 01. Citation behavior lives in the generation stage; the decomposition this work assumes.
- Comscore & Sistrix (2025). AI Overview impact on zero-click rates: 2025-Q3 multi-market analysis. Industry telemetry report. Independent measurement of why citation behavior now matters at the substrate level.
Footnotes
1. Sasson, G. (2026). Statement-level visibility, or: why ranking a page no longer matters. Algoholic, Vol. III, Essay 04. The taxonomy described here is the per-claim dependent variable in the visibility framework that essay defines.
2. Annotators were a senior linguist with prior NLP-evaluation experience and a search-marketing analyst with three years of audit work. Disagreement resolution rules and the annotation rubric are reproduced in Appendix C of the technical companion. The κ figure here is the unweighted Cohen’s kappa over four classes; weighting by class-pair distance lifts it to 0.84.
3. We re-probed the no-show subset with three reformulated prompts each, and 71% remained no-shows across all reformulations, which is strong evidence that the absence reflects a real model decision rather than a prompt-engineering artefact. The class is small enough to set aside without distorting the headline finding; it is large enough to merit its own paper.
4. The short version of the enterprise-RAG result, from an in-flight pilot: attribution rates are higher (the retrieval layer is curated) but reproduction fidelity is lower (the synthesis layer is more aggressive about composing across passages). The trade-off lands in a different place than for public assistants and the practitioner advice diverges accordingly.
5. 3.4× is the panel-weighted geometric mean. Per-model, the multiplier ranged from 2.1× (Llama baseline; closed-book models reward research framing less) to 5.7× (Perplexity Sonar Pro; the architectural premise of the system rewards research framing the most). The dispersion itself is informative: the practitioners’ lever is most powerful exactly where the user is most likely to encounter the answer.
6. We used a three-classifier majority vote with conservative thresholds, calibrated against a 400-document gold set hand-labelled by two annotators. False-positive rate on the gold set: 7.4%; false-negative rate: 11.8%. The headline effect survives at p < 0.001 across reasonable sensitivity analyses on the classifier thresholds, but the precise magnitudes should be read with the classifier error bars in mind.
7. Self-RAG (Asai et al., 2023) makes the mechanism explicit in its training objective: the model is rewarded for retrieving and citing only when the retrieved evidence supports the candidate generation. The reflection token approach generalises across frontier systems that have absorbed the technique, which by 2026 is most of them.
8. The pre-PageRank retrieval era (AltaVista, HotBot, Excite, Yahoo, the engines my practice optimised against from 1999 through 2003) used signals that were a mix of on-page content quality and entity-recognition heuristics. The current generative-retrieval pipeline is closer in spirit to those engines than to the link-graph-dominated middle period that followed. Practitioners who never operated through the pre-PageRank period are reinventing some of the same intuitions from first principles, which is a fine way to discover them but unnecessarily slow.
9. Industry telemetry from Comscore and Sistrix, 2025-Q3 multi-market analysis. The volume gap between Perplexity-class systems and the closed-book + lightly-augmented systems (ChatGPT, Claude.ai, Gemini consumer surface, AI Overviews) is currently 25× and narrowing at roughly 1.4× per quarter. At that rate, parity is a 2028 question, not a 2026 one.
10. Replication data, prompt sets, and the annotation rubric are available on request for serious reproduction efforts. Corrections received before the v3.0 revision will be acknowledged by name in the changelog.
