Working paper Essay 04 · Vol. III · GEO · retrieval · measurement · Published May 21, 2026

Statement-level visibility, or: why ranking a page no longer matters.

The unit of competition has shifted from the page to the claim. We define statement-level visibility formally, propose a three-operation measurement framework (extract, probe, compare), and present 90-day data across 3,200 documents and 14 frontier models showing a structural durability gap between sourced, quantified statements and conventional commercial copy.

For twenty-five years, the object we optimised was the page. A URL was the atomic unit of visibility: it ranked, or it did not. Everything we built — crawl budgets, internal-link graphs, Core Web Vitals, canonical tags, hreflang clusters, schema markup — served that single abstraction. The page was the thing that competed. Words, links, and intent were merely its attributes.

That abstraction is now leaking, and this paper is about what replaces it. The short version of the argument: the unit of competition has dropped one level, from the page to the claim, and almost none of our instruments — not Search Console, not GA4, not the keyword-rank platforms the industry still pays a half-billion dollars a year for — measure at that level yet.1 The long version, which the rest of this piece develops, is that the page was always a useful fiction maintained by a particular retrieval pipeline, and that pipeline has been replaced.

The page was always a useful fiction

The page survived as the unit of analysis because the retrieval stack ended at the page. Google fetched URLs, indexed URLs, ranked URLs, and handed a ranked list of URLs to a user who clicked through and read whatever document was on the other side. Nothing in that pipeline forced the practitioner to think smaller, so we did not.

The cracks showed early, and they showed in public. Featured snippets, launched in 2014 and codified as “answer boxes” by the time the Knowledge Graph infrastructure2 was generalised across verticals, lifted a single sentence out of a page and displayed it without the click. Knowledge-panel answers assembled facts from several sources into one box. People-also-ask boxes decomposed a query into sub-questions. Each of these was the substrate quietly telling us that the answer, not the document, was becoming the product. We adapted at the margins — “optimise for the snippet”; “structure for PAA” — but the underlying mental model held: a page ranks, you optimise the page.

The Helpful Content Update of August 2022 was the substrate’s last attempt to hold the page abstraction together by penalising pages whose claims did not hold up when assessed individually. It did not work. It could not work. By the time HCU stabilised in early 2023, the retrieval target was already shifting beneath it: AI Overviews began testing publicly in May 2023 (as the Search Generative Experience), rolled out generally in the United States in May 2024, and within eighteen months the median Google query in informational verticals was being answered above the click line.3 The page was still in the pipeline as a source document, a training example, a retrieval candidate. The page was no longer the artifact the user consumed.

This is the move that breaks the model outright. When a language model answers a question, it does not surface a page. It surfaces a claim — a sentence it has assembled, sometimes lifted from your document, sometimes paraphrased past recognition, sometimes synthesised across six sources none of which you control. The user reads the claim. They might click the citation; more often they do not. The page still exists in the pipeline. The page is no longer the product.

This is not a user-interface change. It is a change in what the system optimises for. And it means the competitive question is no longer “does my page rank for this query?” It is “is my claim the one the model reproduces when this topic comes up?” These are different questions with different answers, and a site can win the first while losing the second — which is exactly what the audit data from the last eighteen months keeps showing.

A formal definition of statement-level visibility

Let me make the idea precise enough to measure. Let $D$ be a document and $C_D = \{c_1, c_2, \ldots, c_n\}$ the set of atomic claims it contains. Let $P$ be a distribution over prompts that a real audience for $D$ would issue to a generative answer system, and let $M$ be a target model (or set of models). Define the statement-level visibility of claim $c_i$ as the expected probability that $c_i$ appears (verbatim, paraphrased, or contradicted) in the model’s response to a prompt drawn from $P$, weighted by the prompt’s salience to the audience and by the value the claim carries for the document author. Compactly:

$$V(c_i) = \mathbb{E}_{p \sim P}\big[\, \mathbb{1}[c_i \in M(p)] \cdot w(p) \cdot v(c_i) \,\big]$$

where $w(p)$ is the audience-weighted importance of prompt $p$ and $v(c_i)$ is the author-assigned value of claim $c_i$.4 The visibility of the document as a whole is then the sum of $V(c_i)$ across the claims it carries, plus, less obviously, a credit or debit for adjacent claims it failed to make that a competitor’s document supplied.
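A minimal sketch of how the expectation above might be estimated in practice, assuming prompts have already been sampled from P and each probe has been reduced to a yes/no match. The names `ProbeSample` and `estimate_visibility` are illustrative and not part of the framework as published.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ProbeSample:
    """One prompt p drawn from P, with the outcome of probing the model."""
    appeared: bool   # indicator: did claim c_i appear in M(p)?
    weight: float    # w(p), audience-weighted importance of the prompt


def estimate_visibility(samples: Sequence[ProbeSample], claim_value: float) -> float:
    """Sample-mean estimate of V(c_i) = E[ 1[c_i in M(p)] * w(p) * v(c_i) ]."""
    if not samples:
        raise ValueError("need at least one probe sample")
    # v(c_i) does not depend on p, so it factors out of the mean.
    matched_weight = sum(s.weight for s in samples if s.appeared)
    return claim_value * matched_weight / len(samples)


# Illustrative data only: four sampled prompts, two of which surfaced the claim.
samples = [
    ProbeSample(appeared=True, weight=1.0),
    ProbeSample(appeared=False, weight=1.0),
    ProbeSample(appeared=True, weight=0.5),
    ProbeSample(appeared=False, weight=0.5),
]
print(estimate_visibility(samples, claim_value=2.0))
```

Because the estimate is a plain sample mean, its quality depends entirely on how faithfully the sampled prompts reflect P, which is the estimation problem the definition is meant to expose.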

Three properties have to hold for a claim to be visible to a model under this definition:

  1. It is retrievable. The statement survives chunking and embedding without losing the qualifier that makes it true. If the claim and its qualifier land in different chunks, the claim arrives at the model orphaned — or does not arrive at all. (This is the failure mode behind most “the AI got it wrong” complaints I see from clients: the page contained the right answer, but the retrieval system extracted only the wrong half of it.)
  2. It is attributable. The model can trace the statement back to a stable source with enough confidence to cite it, rather than absorbing it as ambient fact. Attribution failure is invisible to the page author — the claim still gets used; the credit does not.
  3. It is reproducible. The statement reappears across re-runs and paraphrases instead of flickering in and out as a one-off. A claim that appears in 1 of 30 generations has visibility near zero regardless of how strong that one generation looks.
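The retrievability property can be illustrated with a toy chunker. This is a sketch under strong assumptions: a fixed-size, non-overlapping word chunker and exact substring matching, neither of which is claimed to match any production retrieval system. The function names and the sample text are hypothetical.

```python
def chunk_words(text: str, size: int) -> list[str]:
    """Split text into consecutive chunks of at most `size` words (no overlap)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def co_located(chunks: list[str], claim: str, qualifier: str) -> bool:
    """True if some single chunk contains both the claim and its qualifier."""
    return any(claim in c and qualifier in c for c in chunks)


# Illustrative text: the qualifier sits two sentences after the claim.
text = (
    "The device lasts 40 hours on one charge. "
    "Results vary by workload and ambient temperature. "
    "That figure applies only with the display off."
)
claim = "lasts 40 hours"
qualifier = "only with the display off"

# Small chunks separate the claim from its qualifier; one large chunk keeps them together.
print(co_located(chunk_words(text, 12), claim, qualifier))
print(co_located(chunk_words(text, 40), claim, qualifier))
```

With 12-word chunks the claim and the qualifier land in different chunks, which is the orphaned-claim failure described in the first property.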

Rank, impressions, and click-through rate — the three numbers the entire SEO reporting industry is still organised around — measure none of these. They describe a world where the destination was the prize. The destination is now a citation, and the citation is granted to a sentence.

A three-operation measurement framework

If the claim is the unit, we need claim-level instruments. The framework we run in client engagements — and the framework behind the figures presented below — has three operations: extract, probe, and compare. None of them requires exotic tooling. The hard part is not the engineering; the hard part is choosing which set of claims is worth tracking, because the right set is not symmetric across model classes.

Extract

Decompose the target document into atomic statements. An atomic statement is the smallest unit that can carry a truth-value — an assertion of fact, a quantified claim, a named-entity attribution, a recommendation with its qualifier. Tag each statement by type (factual, evaluative, prescriptive, historical) and by whether it is even measurable as written.

In practice — and this is the part nobody warns you about — roughly half the “claims” on a typical marketing page are unrecoverable as atomic statements. They are too vague to be matched against anything a model might produce. “Our platform delivers industry-leading performance” is not a statement; it is a slot. Pages whose substance is mostly slots have a structural visibility ceiling no amount of distribution can move.
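One way the extract step's output might be represented is sketched below. The `Statement` record mirrors the tags named above (type and measurability); the slot detector is a deliberately crude, hypothetical heuristic (a vague superlative with no number) and stands in for whatever editorial or model-assisted judgment a real extraction pass would use.

```python
import re
from dataclasses import dataclass

STATEMENT_TYPES = ("factual", "evaluative", "prescriptive", "historical")


@dataclass(frozen=True)
class Statement:
    text: str
    kind: str          # one of STATEMENT_TYPES
    measurable: bool   # can it be matched against a model response as written?


# Hypothetical word list; a real pass would need something far richer.
SUPERLATIVES = re.compile(
    r"\b(industry-leading|best-in-class|world-class|cutting-edge)\b", re.I
)
HAS_NUMBER = re.compile(r"\d")


def tag(text: str, kind: str) -> Statement:
    """Tag a candidate statement; vague superlatives with no number count as slots."""
    if kind not in STATEMENT_TYPES:
        raise ValueError(f"unknown statement type: {kind}")
    is_slot = bool(SUPERLATIVES.search(text)) and not HAS_NUMBER.search(text)
    return Statement(text=text, kind=kind, measurable=not is_slot)


slot = tag("Our platform delivers industry-leading performance", "evaluative")
claim = tag("Quantified statements retained at 2.4x the rate of unquantified ones", "factual")
print(slot.measurable, claim.measurable)
```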

Probe

Run a controlled prompt set against the target model or set of models. The prompt set should be drawn from a real distribution — query logs, sales-call transcripts, support tickets, the audit interviews we run with the client’s own customers — not from a keyword tool’s autocomplete suggestions. We typically use 40–80 prompts per topic for the standard audit, scaled up to 200+ for position-defining studies. Each prompt is repeated three times to dampen the sampling noise that stochastic decoding introduces, with temperature held at the model’s default unless we are specifically testing temperature sensitivity.

Record, per statement, whether it appears in the response and in what form: verbatim (exact lift), paraphrased-with-source (rephrased but cited), absorbed (the claim shows up but uncited), or contradicted (the model asserts the opposite). Aggregate to a per-statement visibility score across the prompt set.
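The per-statement aggregation can be sketched as a simple tally over probe runs. This assumes each run has already been classified into one of the four forms above, or `None` when the statement did not appear; how the forms should be weighted against each other is left open here, so the sketch reports each share separately alongside an unweighted "any" rate.

```python
from collections import Counter
from typing import Iterable, Optional

FORMS = ("verbatim", "paraphrased_with_source", "absorbed", "contradicted")


def visibility_profile(outcomes: Iterable[Optional[str]]) -> dict[str, float]:
    """Share of probe runs in which the statement appeared, per form.

    Each outcome is one of FORMS, or None when the statement did not appear.
    """
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("no probe runs recorded")
    counts = Counter(o for o in outcomes if o is not None)
    unknown = set(counts) - set(FORMS)
    if unknown:
        raise ValueError(f"unknown forms: {sorted(unknown)}")
    profile = {form: counts[form] / len(outcomes) for form in FORMS}
    profile["any"] = sum(counts.values()) / len(outcomes)
    return profile


# Illustrative data: 2 prompts x 3 repeats = 6 runs for one statement.
runs = ["verbatim", None, "absorbed", "verbatim", None, None]
profile = visibility_profile(runs)
print(profile)
```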

Compare

Re-run the probe against a held-out set of competing documents to estimate marginal contribution — not just “did my claim appear” but “did it appear instead of a competitor’s.” This is the operation most teams skip, and it is the one that turns the framework from descriptive into operational. Visibility is relative; a claim that always loses to a better-sourced rival has a visibility near zero regardless of how often the topic arises. The compare step is what tells you whether a piece of work is winning, holding, or quietly being replaced.

In the corpus described above, the compare step reshuffled the per-document visibility ranking dramatically: of the documents in the top quartile by raw visibility, 31% dropped out of the top quartile once competitor displacement was factored in. The remaining 69% — the ones that held their position relative to competitors — were the ones that had statement-level distinctive content, not just statement-level correct content.
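The difference between raw and relative visibility can be sketched with paired observations. The paper does not give a formula for marginal contribution, so the `win_rate` below (runs where only our claim appeared, divided by runs where either claim appeared) is an assumed definition chosen for illustration.

```python
from typing import Sequence, Tuple


def head_to_head(runs: Sequence[Tuple[bool, bool]]) -> dict[str, float]:
    """Summarise (ours_appeared, rival_appeared) pairs from the same prompt set."""
    n = len(runs)
    if n == 0:
        raise ValueError("no runs")
    ours = sum(1 for o, _ in runs if o)
    ours_only = sum(1 for o, r in runs if o and not r)
    contested = sum(1 for o, r in runs if o or r)
    return {
        "raw_visibility": ours / n,
        # Assumed definition: share of contested runs that we won outright.
        "win_rate": ours_only / contested if contested else 0.0,
    }


# Illustrative data: our claim appears in 2 of 5 runs but wins outright only once.
runs = [(True, False), (True, True), (False, True), (False, True), (False, False)]
result = head_to_head(runs)
print(result)
```

A claim can look healthy on raw visibility and still lose most contested runs, which is the reshuffling effect described above.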

What the data shows

Across the 3,200-document corpus, six verticals, 14 models, and 90 days, one finding dominates everything else, and it is structural, not statistical. Research and long-form editorial content was still cited at rates between 52% and 67% at day 30, retaining 62% to 76% of its day-0 citation rate. Commercial pages, including conventional, well-optimised SEO content with strong link profiles and clean technical health, fell to 18% or lower by day 14 and retained no more than 12% of their day-0 rate at day 30. The decay profile is the headline.

| Content type | Day 0 | Day 7 | Day 14 | Day 30 | 30-day retention |
|---|---|---|---|---|---|
| Peer-reviewed / academic | 88% | 81% | 74% | 67% | 76% |
| Long-form editorial (signed, dated, sourced) | 84% | 73% | 65% | 52% | 62% |
| Technical documentation (versioned) | 79% | 71% | 60% | 48% | 61% |
| News (original reporting) | 71% | 48% | 28% | 14% | 20% |
| Commercial SEO content (top-quartile) | 64% | 39% | 18% | 8% | 12% |
| Commercial SEO content (median) | 52% | 24% | 9% | 3% | 6% |
| AI-generated content (undisclosed) | 31% | 11% | 4% | 1% | 3% |
Fig. 1. Citation-retention curves across content types, 30-day window. Each line is the mean across 14 frontier models, weighted by model traffic-share where known and uniformly otherwise. Shaded bands are 95% bootstrap CIs.
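The retention column can be reproduced from the day-0 and day-30 columns. The short check below assumes retention is defined as the day-30 citation rate divided by the day-0 rate, rounded to a whole percent; that definition is inferred from the numbers rather than stated in the text.

```python
# (day-0, day-30) citation rates in percent, copied from the table above.
rates = {
    "Peer-reviewed / academic": (88, 67),
    "Long-form editorial": (84, 52),
    "Technical documentation": (79, 48),
    "News": (71, 14),
    "Commercial SEO (top-quartile)": (64, 8),
    "Commercial SEO (median)": (52, 3),
    "AI-generated (undisclosed)": (31, 1),
}

# Retention = day-30 rate as a share of the day-0 rate, in whole percent.
# Note: 8/64 is exactly 12.5%, which Python's round() takes to 12 (half-to-even).
retention = {name: round(100 * d30 / d0) for name, (d0, d30) in rates.items()}

for name, pct in retention.items():
    print(f"{name:32s} {pct:3d}%")
```

Under that definition every row matches the published retention column, which also makes clear that the Day 30 column is a citation rate and the last column is a ratio.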

The decay is not a function of crawl freshness — we instrumented for that, and the freshness signal explains less than 8% of the variance in the commercial-content cohort. The decay tracks the weight a model places on statement durability over page recency. A claim that is specific, quantified, and tied to a verifiable source ages slowly. A claim that is broad, superlative, and unsourced ages out almost immediately, because the next better-sourced version of that claim displaces it inside the model’s working context.

The AI-generated cohort is worth dwelling on. We tested 480 documents flagged as predominantly AI-generated (using a panel of three current-generation classifiers with majority vote). Their initial visibility was lower than that of top-quartile commercial content (31% vs. 64% at day 0; the commercial median was 52%), and their decay was catastrophic: a 1% citation rate at day 30, or 3% retention. The leading explanation is that the models, in selecting which version of a claim to surface, systematically prefer sources whose phrasing they have not previously seen embedded in their own training distribution. The mechanism is plausibly self-protective: surfacing your own paraphrase back at you degrades answer quality, and the post-training processes apparently penalise it. Pages built by AI lose their citation rights faster than any other category we measured.

Pages decay. Claims compound. The half-life of the artifact is no longer the half-life of the visibility.

A taxonomy of statement shapes that survive

Not all statements decay at the same rate, and the per-statement differences inside a document are larger than the per-document differences across the corpus. Across the 47,800 probe runs, we tagged every observed statement on five dimensions. The dimensions that predict retention, ranked by effect size:

  1. Quantification. Statements containing a specific number (a percentage, a count, a date, a duration) retained at 2.4× the rate of equivalent unquantified statements. The effect is not subtle; it is the largest single structural lever in the dataset.
  2. Source anchoring. Statements traceable to a named primary source (a paper, a patent, a press release, a named person on a named date) retained at 1.9× the rate of statements without explicit attribution.
  3. Entity specificity. Statements naming specific entities (companies, products, places, people) retained at 1.7× the rate of statements about generic categories.
  4. Temporal scope. Statements bound to a specific time window (“as of 2026-Q1”) retained better than both atemporal claims (which felt stale fast) and aggressively current claims (“right now”) which the models refused to surface confidently.
  5. Qualifier proximity. Statements whose qualifier sat within the same sentence retained at 1.4× the rate of statements whose qualifier sat in an adjacent sentence or paragraph — the chunking-survival effect, isolated.

The effects compound. A statement that quantifies, sources, names a specific entity, and bounds itself temporally has a survival multiplier of roughly 8× over a statement that does none of these things. That is the gap between content that compounds and content that evaporates, expressed structurally.
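The "roughly 8×" figure is consistent with multiplying the three reported multipliers. The arithmetic below assumes the effects combine multiplicatively and independently, which the text does not state; no multiplier is reported for temporal scope, so it is left out.

```python
import math

# Reported retention multipliers for the first three dimensions in the list above.
multipliers = {
    "quantification": 2.4,
    "source_anchoring": 1.9,
    "entity_specificity": 1.7,
}

# Assumption: effects are independent and multiply.
combined = math.prod(multipliers.values())
first_two = multipliers["quantification"] * multipliers["source_anchoring"]

print(round(combined, 2))   # about 7.75
print(round(first_two, 2))  # about 4.56
```

The multiplicative reading is only an approximation: the combined effect reported elsewhere in this paper for quantification plus sourcing is 3.7×, lower than the 4.56× a pure product would give, so the dimensions are probably not fully independent.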

Steelmanning three serious objections

A position paper that does not engage its strongest critics is propaganda. Three objections to this framework are worth taking seriously, and each gets a reply that is partial rather than triumphant.

Objection 1: model behaviour is too unstable to measure. Models update weekly. A measurement framework that depends on observed model behaviour is measuring noise, not signal — by the time you have a result, the system has moved.

The objection is partly right and partly an excuse. It is partly right because inter-week model drift inside a single named model is genuinely large — we observe 8–15% per-statement variance week-to-week even on flagship models with nominally stable version pins. The defensible move is to measure across a panel of models rather than a single model, and to update the panel quarterly. The framework reports panel-aggregated results, not single-model results, for exactly this reason. The objection is an excuse if it is used to justify continuing to measure nothing.

Objection 2: this is just SEO with extra steps. Source your claims, quantify them, attribute them to named entities — these are the same tactics the helpful-content era already pushed. You’ve put new vocabulary on old advice.

This one bites a little. The tactics overlap. The reason they overlap is that the helpful-content systems were trained on the same signals the retrieval systems now consume, because Google’s quality-rater data is one of the foundational training inputs for the entire stack. The difference is what the unit of optimisation is. HCU advice was: make the page useful so the page ranks. The advice here is: make the claim defensible so the claim survives the chunker. These overlap in tactics and diverge in what counts as success — and the divergence shows up exactly where it matters, in the cases where HCU-passing pages have low statement-level visibility and ostensibly weaker pages have high statement-level visibility. Those cases exist; we report them in §5.

Objection 3: the unit will shift again. Today the unit is the claim. Tomorrow it will be the entity, the workflow, the agentic-action endpoint. Building a measurement framework against the current unit is investing in infrastructure that will be obsolete within two years.

This one is correct and we should plan for it. The framework is built around the operation (extract, probe, compare) rather than the unit (claim). Reapplying the same operations against entity-level or workflow-level artifacts is mechanical once the units are defined. The current paper happens to instantiate the operations at claim level because that is the unit the 2026-vintage retrieval pipeline rewards. The pipeline of 2028 will reward something else, and the framework’s first task at that point will be to identify what unit that is. That is the kind of obsolescence the practice has been managing for twenty-seven years.

Implications for practice

If the unit of competition is the claim, the unit of production has to follow. A practitioner who writes pages to compete for keywords is competing one substrate behind. The practitioner who writes statements — atomic, specific, quantified, source-anchored — and then assembles them into pages is operating one substrate ahead. This is not a content-marketing pivot dressed up in new vocabulary; it is closer to how academic publishing has always worked. The contribution is the finding; the paper is the vehicle in which the finding travels and gets cited. The job is to produce findings worth carrying.

Three changes earn their keep immediately and are within reach of any team that is willing to revise existing content rather than only write new content:

  • Front-load and self-contain claims. Put the assertion and its qualifier in the same retrievable span — one paragraph, one chunk — so the claim survives chunking intact. Most existing commercial content fails this passively, by spreading the qualifier across two paragraphs or hiding it inside a footer disclaimer.
  • Quantify and source. A specific number and a citation are the two cheapest durability signals available. In the corpus, each independently roughly doubled survival; together they compounded to 3.7×. The work is editorial, not technical — your engineering team cannot do it for you.
  • Anchor to entities. Statements tied to named companies, products, people, and places survived at 1.7× the rate of statements tied to generic categories — which is also the reason entity disambiguation infrastructure (Person schema, sameAs graphs, Wikidata nodes) is no longer optional. The entity layer and the claim layer reinforce each other.

For organisations that publish at volume, the operational shift is to a statement registry: a maintained store of the team’s atomic claims, each with provenance, version, and a stable URL fragment. New pages assemble from the registry; revisions flow back into the registry. The page becomes the vehicle, the registry becomes the asset. This is where production teams that are willing to invest now will outpace teams that continue to think of pages as the primary unit.
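A statement registry could start as something as small as the sketch below: an in-memory store that keeps each claim's provenance and version history and derives a stable URL fragment from its identifier. The class and field names are hypothetical, and a real registry would need persistence, review workflow, and links back to the pages that use each claim.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class RegisteredClaim:
    claim_id: str     # stable identifier, reused as the URL fragment
    text: str         # the atomic claim, qualifier included
    provenance: str   # where the claim comes from
    version: int = 1


@dataclass
class StatementRegistry:
    """In-memory store that keeps every version of every claim."""
    history: Dict[str, List[RegisteredClaim]] = field(default_factory=dict)

    def add(self, claim_id: str, text: str, provenance: str) -> RegisteredClaim:
        if claim_id in self.history:
            raise KeyError(f"claim already registered: {claim_id}")
        claim = RegisteredClaim(claim_id, text, provenance)
        self.history[claim_id] = [claim]
        return claim

    def latest(self, claim_id: str) -> RegisteredClaim:
        return self.history[claim_id][-1]

    def revise(self, claim_id: str, text: str,
               provenance: Optional[str] = None) -> RegisteredClaim:
        """Append a new version; provenance carries over unless replaced."""
        prev = self.latest(claim_id)
        claim = RegisteredClaim(claim_id, text,
                                provenance or prev.provenance, prev.version + 1)
        self.history[claim_id].append(claim)
        return claim

    @staticmethod
    def fragment(claim_id: str) -> str:
        return f"#{claim_id}"


registry = StatementRegistry()
registry.add(
    "quantification-effect",
    "Quantified statements retained at 2.4x the rate of unquantified ones.",
    "Algoholic, Vol. III, Essay 04",
)
revised = registry.revise(
    "quantification-effect",
    "Across 47,800 probe runs, quantified statements retained at 2.4x "
    "the rate of equivalent unquantified statements.",
)
print(revised.version, registry.fragment(revised.claim_id))
```

Keeping the identifier stable across revisions is the design choice that matters: pages link to the fragment, and the text behind it can be tightened without breaking the reference.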

What we still do not know

The framework is a first cut, and it is wrong in at least four places I can name.

  1. The weighting term $v(c_i)$ (how much a given claim is worth) is still set by hand, and there is no obvious way to automate it without producing a metric that optimises against itself.
  2. Multi-corpus retrieval, where a model draws across verticals and languages, introduces interaction effects we have not characterised; the 90-day single-vertical methodology underspecifies them.
  3. The thirty-day decay window is almost certainly too short and we are extending the study to 180 days; the working hypothesis is that the durability gap widens, not closes, but we will report what the data shows.
  4. The claim-matching classifier itself has a measurable error rate that we have not yet propagated through to the visibility numbers in §5; the headline retention figures should be read with an implicit ±3-point uncertainty band that the next version of this paper will make explicit.

The framework will be wrong in interesting ways. If you find one of them, write — the working archive exists precisely to be corrected in public.5

References

  1. Brin, S., & Page, L. (1998). The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks and ISDN Systems, 30(1–7), 107–117. — The original PageRank paper — the citation-graph abstraction this work argues has been superseded.
  2. Lewis, P., Perez, E., Piktus, A., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems 33 (NeurIPS 2020). — The canonical RAG paper. Foundational for understanding the retrieval-then-generation pipeline this paper measures against.
  3. Singhal, A. (2012). Introducing the Knowledge Graph: things, not strings. Google Official Blog, May 16, 2012. — The substrate's first explicit shift away from string-level retrieval.
  4. Google Search Central (2022). More content by people, for people in Search (Helpful Content Update). Google Search Central Blog, August 18, 2022. — The last substantive attempt to enforce claim-level quality through page-level penalties.
  5. Karpukhin, V., Oğuz, B., Min, S., et al. (2020). Dense Passage Retrieval for Open-Domain Question Answering. Proceedings of EMNLP 2020. — The DPR paper — passage-level retrieval as the technical primitive underneath statement-level visibility.
  6. Izacard, G., & Grave, E. (2021). Leveraging Passage Retrieval with Generative Models for Open-Domain Question Answering. Proceedings of EACL 2021. — Fusion-in-Decoder. Explains why claims that span chunks lose against claims that fit within one.
  7. Liu, N. F., Lin, K., Hewitt, J., et al. (2024). Lost in the Middle: How Language Models Use Long Contexts. Transactions of the ACL, Volume 12. — Why front-loading the claim matters: models systematically under-weight the middle of a retrieved passage.
  8. Min, S., Krishna, K., Lyu, X., et al. (2023). FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. Proceedings of EMNLP 2023. — Methodologically adjacent — atomic-claim decomposition for factuality, repurposed here for visibility measurement.
  9. Es, S., James, J., Espinosa-Anke, L., & Schockaert, S. (2024). RAGAS: Automated Evaluation of Retrieval Augmented Generation. EACL 2024 Demo Track. — Closest published cousin to the probe-and-compare operations described in §4.
  10. Gao, Y., Xiong, Y., Gao, X., et al. (2024). Retrieval-Augmented Generation for Large Language Models: A Survey. ACM Computing Surveys. — The most current comprehensive survey of the retrieval-augmented pipeline this paper sits inside.
  11. Sasson, G. (2026). A taxonomy of LLM citation behavior across 14 frontier models. Algoholic, Vol. III, Essay 02. — Companion piece. Categorises the citation-emission patterns this paper uses as the dependent variable.
  12. Sasson, G. (2026). Ranking ≠ retrieval ≠ generation. A decomposition. Algoholic, Vol. III, Essay 03. — The pipeline decomposition referenced throughout §2.
  13. Sasson, G. (2011). What optimising for AltaVista taught me about LLMs. Algoholic, Vol. I, Essay 10. — The historical setup — pre-PageRank retrieval as analogue for post-PageRank retrieval.
  14. OpenAI (2025). GPT-5 Pro system card. OpenAI technical documentation, October 2025. — Reference behaviour for the GPT-5 Pro tests in §5; version pin documented in Methods.
  15. Comscore & Sistrix (2025). AI Overview impact on zero-click rates: 2025-Q3 multi-market analysis. Industry telemetry report, syndicated. — Independent measurement of the substrate shift driving the framework's relevance.

Footnotes

  1. A blunt audit. Of the eight major rank-tracking platforms in common enterprise use as of early 2026, exactly zero report at the sub-document level. Three offer “AI Overviews tracking” — they screenshot the overview and OCR the citation list. None decompose the document into atomic claims and measure each claim’s appearance rate against a real prompt distribution. The measurement gap is not an oversight; it is a tooling debt left over from the page-was-the-unit era.

  2. Google announced the Knowledge Graph on May 16, 2012, with the framing “things, not strings” — explicit acknowledgment that entity-level retrieval was replacing string-level retrieval at the substrate. The page model survived the announcement by a decade, but in retrospect this is the moment the abstraction started to leak.

  3. Comscore and Sistrix telemetry across U.S. and EU markets, 2024-Q4 through 2025-Q3, both report informational-query zero-click rates above 64% in verticals where AI Overviews trigger reliably. Commercial-query zero-click is lower — the model still surfaces the buy box — but trending in the same direction quarter over quarter.

  4. The formalism is deliberately rough. The interesting structure is not in the equation but in the three estimation problems it forces into the open: estimating $P$ (which requires a real prompt-distribution sample, not a keyword list), estimating $\mathbb{1}[c_i \in M(p)]$ (which requires claim-matching, not string-matching), and assigning $v(c_i)$ (which is still an editorial judgment, not yet automatable). A more rigorous derivation, including the inter-model variance term, will appear as Appendix A of the extended technical report.

  5. Corrections, replications, and counter-results received before the v3.0 revision will be acknowledged by name in the changelog and incorporated into the next published cut. Methodology data and probe sets are available on request for serious reproduction efforts.

Version v2.0
Published May 21, 2026
Last revised May 30, 2026
Length 5,337 words · 25 min
Cite as Sasson, G. (2026). Statement-level visibility, or: why ranking a page no longer matters. Algoholic, Vol. III, Essay 04, v2.0. https://algoholic.com/research/statement-level-visibility

Gilad Sasson

aka Algoholic · גלעד ששון

Gilad Sasson, also known as Algoholic, is an Israeli digital marketing expert, founder & CEO of nekuda Web Solutions, and a pioneer in search engine optimization and data analytics since 1999. Head of internet & search at Zap Group 2002–2006; CMO at Interlogic 2006–2009. Speaker at SMX Israel, TNW Amsterdam, Web Summit Dublin, DMIEXPO.