Ranking ≠ retrieval ≠ generation. A decomposition.

Abstract

The vocabulary of the present moment treats ranking, retrieval, and generation as near-synonyms — three almost-interchangeable words for “the thing the AI does when you ask it a question.” They are not synonyms. They are three distinct operations, chained in sequence in every retrieval-augmented system running in production today, and each one fails in its own characteristic way. This paper separates them — with notation, with worked examples, with a per-stage table of inputs, outputs, levers, and failure modes — and then audits which of the heuristics twenty-five years of search-engine optimisation gave us still apply to which operation, which have flipped sign, and which were always proxies for things we can now measure directly. The decomposition is the load-bearing methodological move for every measurement framework in the rest of the volume. If you cannot say which of the three operations you are optimising, you are optimising none of them; you are running a metric against the wrong substrate and reading the noise as signal.

Key claims: the scannable version

Ranking, retrieval, and generation are three distinct operations, not three words for the same thing. Each has its own input, output, scoring mechanism, and characteristic failure mode; treating them as synonyms — as most vendor decks and conference talks now do — guarantees you are optimising the wrong substrate and reading the noise as signal.
The conflation of the three operations is the single largest source of bad measurement in GEO work today. Most of the advice currently being sold to enterprise buyers reduces, on inspection, to ranking-era tactics re-skinned for an operation (retrieval, generation) that no longer governs whether their claims appear in the answer.
Retrieval operates below the document level on chunks of 128 to 1,024 tokens, modally 256 to 512. A page that satisfies a query at the document level can fail catastrophically at the chunk level when its assertion sits in paragraph one and its qualifier sits in paragraph four — the retrieval system, which never sees the document, will pull the half-claim.
The three operations multiply, so the binding constraint dominates end-to-end visibility. With retrieval at 5%, no gain in ranking can lift answer-mention probability above 5%; the large gains sit at the binding constraint (usually retrieval, in the corpora I audit), where the factor has the most room to grow.
A claim can win one stage and lose the next. “Win retrieval, lose generation” — your passage is selected, then paraphrased past attribution into a synthesised sentence the model emits as its own — is invisible to every current GEO tool whose primary metric is appearance in the citation list.
Two of the three operations had no SEO-era lever at all. Retrieval and generation simply did not exist as practitioner-facing optimisation targets in the page-was-the-unit substrate, which is why the per-stage audit table has empty cells exactly where the next decade of applied craft has to be built.
Agentic pipelines extend the decomposition, they do not collapse it. Self-RAG and tool-using architectures add a fourth planning operation on top of the three; the independence-of-failure-modes argument holds inside each cycle, and the planning layer carries its own failure modes — planning loops, over-retrieval, under-confidence cycling — that deserve their own instruments.
Some SEO-era heuristics were always proxies for things the new substrate measures directly. Topical authority becomes statement durability across the model panel, entity coverage becomes named-entity recognition rate across retrieved spans, and “domain authority” itself will obsolesce within thirty-six months as claim-level credibility becomes directly measurable.

The vocabulary of the present moment treats ranking, retrieval, and generation as near-synonyms — three almost-interchangeable words for “the thing the AI does when you ask it a question.” Vendor decks slide between them. Conference talks use them in the same sentence as though they refer to the same mechanism. The defensive crouch the SEO industry adopted in late 2024 — “GEO is just SEO with extra steps” — depends on the conflation being invisible, because the moment you separate the operations cleanly the steps stop looking extra and start looking like a different game played on a different board.¹

They are not synonyms. They are three distinct operations, chained in sequence, and each one fails in its own characteristic way. The thesis of this paper is simple and load-bearing for the rest of the volume: the conflation of these three operations is the single largest source of bad measurement in generative-engine optimisation work, and most of the GEO advice currently being sold to enterprise buyers reduces, on inspection, to ranking-era tactics re-skinned for an operation that no longer governs whether their claims appear in the answer. The decomposition is not an academic refinement. It is the prerequisite for every measurement framework in the rest of this volume — statement-level visibility, citation-behaviour taxonomy, retrieval-survival testing, reproducibility scoring. Each of those instruments measures a single operation cleanly only because we have first agreed which operation it is measuring.

I am sensitive to this in the specific way that twenty-seven years of practice makes you sensitive. nekuda was founded in 1999 — the year before PageRank became dominant — and the first SEO work I did was for Excite, HotBot, AltaVista, and Yahoo, engines whose ranking functions bore little resemblance to one another and even less to what came after.² Watching the field collapse those distinctions in the early-2000s “Google is search” consensus was instructive. Watching the field repeat the same flattening today — treating “the LLM” as one machine whose internals we need not understand — is worse than instructive. It is the same mistake, recapitulated by people who lived through the first round.

Ranking, defined

Ranking is the operation classical information retrieval has spent fifty years defining and twenty-five years of commercial SEO has spent trying to influence. Its input is a query and a corpus. Its output is an ordered list of documents from that corpus, sorted by predicted relevance to the query. The mechanism is a scoring function — at its simplest, BM25 over lexical overlap weighted by inverse document frequency; at its most elaborate, a learned deep-network ranker fed hundreds of features that include link-graph statistics, behavioural-engagement signals, freshness, and a long tail of vertical-specific quality indicators. The shape of the operation does not change as the scorer grows more sophisticated. Input: query plus corpus. Output: ordered list of documents.
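The simplest form of that scoring function can be written down in a few lines. What follows is a minimal sketch, assuming whitespace tokenisation, the common default parameters k1 = 1.5 and b = 0.75, and a three-document toy corpus; the `bm25_rank` helper and the corpus are hypothetical illustrations, not any engine's actual ranker.

```python
import math
from collections import Counter

def bm25_rank(query, corpus, k1=1.5, b=0.75):
    """Return document indices ordered by BM25 score, best first."""
    docs = [doc.lower().split() for doc in corpus]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher inverse-document-frequency weight.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalised by document length.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)

# Hypothetical toy corpus.
corpus = [
    "the half-life of compound x is twelve hours",
    "compound y is stable at room temperature",
    "a guide to garden composting",
]
ranking = bm25_rank("half-life of compound x", corpus)
print(ranking)  # [0, 1, 2]
```

Note what the function returns: an ordering over whole documents. Nothing in it knows about passages, context windows, or answers.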

Manning, Raghavan, and Schütze devote the bulk of Introduction to Information Retrieval to this single operation: to the algorithms that compute it efficiently at web scale, to the evaluation metrics (MAP, MRR, nDCG) that score its output against human relevance judgments, to the indexing structures (inverted indices, postings lists, signature files) that make it tractable. Robertson and Zaragoza’s monograph on BM25 is essentially an extended analysis of one ranking function and its variants. The fact that an entire textbook tradition exists for this one operation, with its own evaluation tradition and its own engineering substrate, is the first sign that ranking is not the same thing as retrieval-in-the-modern-sense.

Twenty-five years of commercial SEO is the applied study of this function from the outside. We learned its features by probing it — by changing one variable on a page and measuring rank movement. We built link-acquisition practices because link counts moved rank. We built keyword-density practices because lexical overlap moved rank. We built schema markup practices because entity disambiguation moved rank.³ The entire toolkit was reverse-engineered against the assumption that the SERP — the ordered list of ten blue links — was the artifact whose rank we needed to influence. For two and a half decades this assumption was correct, because the page was the unit a user clicked through to and read. The artifact and the optimisation target coincided.

The ranking operation is still running. Google still ranks. Bing still ranks. The vertical engines (Amazon, YouTube, Pinterest) all still rank. What has changed is that in the new pipeline, the ranked list is no longer the artifact the user consumes. It is an intermediate data structure handed downstream to a retrieval system, which uses it as a candidate set. The ranked list survives. Its role has been demoted.

Retrieval, defined

Retrieval, in the sense the term carries in retrieval-augmented generation systems, is a different operation. Its input is also a query plus a corpus — but its output is not an ordered list of documents. Its output is a small set of chunks: passages of text, typically a few hundred tokens each, extracted from one or more documents, selected for inclusion in a model’s limited context window. The crucial word in that sentence is chunks, not documents.⁴ Retrieval operates below the document level, on spans of text that may or may not carry the surrounding context that makes them meaningful, and the selection mechanism is dense-vector similarity in an embedding space rather than lexical or link-graph relevance.

The Karpukhin et al. DPR paper, published at EMNLP 2020, is the canonical reference for the operation as it is now implemented in production RAG systems. Their architecture trains a pair of BERT-derived encoders — one for the query, one for the passage — to produce dense vectors whose dot-product similarity approximates relevance. The corpus is pre-indexed: every passage is encoded once, the vectors stored in an approximate-nearest-neighbour index (FAISS, ScaNN, HNSW). At query time, only the query side is encoded; the top-K nearest passages are returned. The retrieval is passage-level by design. A document is not the unit. The unit is whatever fragment the chunking pipeline produced upstream.

This single design choice — to retrieve fragments rather than documents — is where the gap with classical IR becomes structural rather than incremental. Twenty-five years of SEO craft taught us to write pages that satisfy a query. The retrieval substrate is not asking the page question. It is asking, of every 300-token window in the corpus, whether that window — as a standalone unit, stripped of its document context — answers the query well enough to be worth one of the limited slots in the model’s context window. The page-level craft does not transfer cleanly. A page that satisfies a query at the level of the whole document can fail catastrophically at the chunk level if its qualifier sits in paragraph four while its assertion sits in paragraph one. The retrieval system, which never sees the document, will pull the assertion without the qualifier and hand the model a half-claim.⁵

This is the operation Lewis et al. wired into the RAG architecture published at NeurIPS 2020 — the architectural source for essentially every commercial generative-answer system you have used in the last three years. The retrieval component sits between an upstream ranker (which may select the candidate documents) and a downstream generator (which composes the answer from the retrieved chunks). It is a distinct component with a distinct loss function and a distinct failure mode, and the engineering team that builds it is almost never the team that built the ranker. In production, retrieval and ranking are literally different services.
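The selection step described above (passages encoded once, the query encoded at request time, the top-K passages returned by dot product) can be sketched in miniature. This is an illustration under stated assumptions: the three-dimensional vectors are hand-written stand-ins for encoder outputs, the passage ids are hypothetical, and the search is exact brute force rather than an approximate-nearest-neighbour index.

```python
def dot(u, v):
    """Dot-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def top_k(query_vec, passage_vecs, k):
    """Return the ids of the k passages most similar to the query.

    Exact brute-force search; a production system would use an ANN index.
    """
    ranked = sorted(
        passage_vecs,
        key=lambda pid: dot(query_vec, passage_vecs[pid]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical pre-computed passage embeddings, keyed by document and chunk.
passage_vecs = {
    "doc1#chunk0": [0.9, 0.1, 0.0],
    "doc1#chunk1": [0.2, 0.8, 0.1],
    "doc2#chunk0": [0.7, 0.3, 0.2],
    "doc3#chunk0": [0.0, 0.1, 0.9],
}
query_vec = [1.0, 0.2, 0.0]  # hypothetical query embedding

retrieved = top_k(query_vec, passage_vecs, k=2)
print(retrieved)  # ['doc1#chunk0', 'doc2#chunk0']
```

The unit being scored is the chunk id, not the document: `doc1#chunk0` is selected while `doc1#chunk1`, from the same page, is left behind.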

Generation, defined

Generation is the third operation. Its input is a prompt — the user’s question, the retrieved context, any system instructions — and its output is new text. The text is composed token by token by a language model conditioned on the prompt and on the model’s parametric prior (the knowledge baked into its weights during training). The output may quote the retrieved context verbatim, paraphrase it, blend it with parametric knowledge, contradict it, or ignore it entirely. The model is not a search engine returning a result. It is a sequence model sampling from a learned distribution, and the retrieved context is one input among several.

This is where citation happens. This is where attribution happens. This is where paraphrase, absorption, and contradiction happen. It is also where the parametric knowledge of the model — the world the model learned during pre-training, plus whatever was tuned in during the post-training process — enters the answer in ways that are often invisible to the user and difficult to attribute. A model that has seen your assertion ten thousand times during training may reproduce it without ever having retrieved it; a model that retrieves your assertion once may paraphrase it past attribution; a model that both retrieves and recognises an assertion may still decide that a competitor’s phrasing is cleaner and quote that instead.

Izacard and Grave’s Fusion-in-Decoder work formalises one architecture for this composition stage — concatenating retrieved passages and letting the decoder attend across all of them — and the Borgeaud et al. RETRO paper explores the scaling limit, in which retrieval happens at trillions-of-tokens scale and the generation stage adjudicates among an enormous number of small retrieved fragments. Different architectures, same three-operation shape: a ranker (or a flat retriever skipping the rank stage), a passage selector, a generator. The generation stage is doing work that has no analogue in classical IR. It is composing.

The textbook tradition for this operation is a decade younger than the ranking textbook tradition, and most of it is still in the form of conference papers rather than consolidated monographs.⁶ The Gao et al. RAG survey, the Self-RAG line of work, the rapidly-evolving evaluation literature (RAGAS, FActScore, the various hallucination benchmarks) — these are the documents where the generation operation is being characterised in real time.

A worked pipeline example

   ┌──────────────────────────────────────────────────────────────┐
   │                          QUERY                               │
   │                "what is the half-life of X?"                 │
   └──────────────────────────────────────────────────────────────┘
                                │
                                ▼
   ┌──────────────────────────────────────────────────────────────┐
   │  [1] RANKING                                                 │
   │      classical IR · BM25 + learned features                  │
   │      input:  query + full corpus                             │
   │      output: top-K candidate URLs (ordered list)             │
   │      failure mode: relevant pages absent from top-K          │
   └──────────────────────────────────────────────────────────────┘
                                │
                                ▼
   ┌──────────────────────────────────────────────────────────────┐
   │  [2] RETRIEVAL                                               │
   │      dense passage retrieval · embedding similarity          │
   │      input:  query + chunks from candidate URLs              │
   │      output: top-N passage spans (unordered set)             │
   │      failure mode: claim and qualifier in different chunks   │
   └──────────────────────────────────────────────────────────────┘
                                │
                                ▼
   ┌──────────────────────────────────────────────────────────────┐
   │  [3] GENERATION                                              │
   │      sequence model · context + parametric prior             │
   │      input:  prompt + retrieved passages                     │
   │      output: answer text (+ optional citations)              │
   │      failure mode: paraphrase past attribution               │
   └──────────────────────────────────────────────────────────────┘
                                │
                                ▼
   ┌──────────────────────────────────────────────────────────────┐
   │                         ANSWER                               │
   │            text the user actually reads + cites              │
   └──────────────────────────────────────────────────────────────┘

Fig. 1. The three-stage pipeline with characteristic failure modes per stage. The arrows show data flow; the boxes show what each stage outputs into the next.

The table below collapses the same information into a per-stage audit. The columns are deliberately asymmetric: the SEO-era lever column documents what the practice actually did for two decades; the new lever column documents what the operation actually responds to today. Where the two columns disagree, the disagreement is the story.

Stage	What it optimises	Classic SEO lever	New lever	Characteristic failure
Ranking	document relevance + authority	links, anchor text, on-page keywords, technical health	same — ranking is still ranking	document never enters the candidate set
Retrieval	passage-level semantic similarity	(no analogue — operation did not exist as a target)	chunk self-containment, claim-qualifier proximity, dense-embedding affinity	claim survives selection but loses its context
Generation	answer composition + citation	(no analogue)	epistemic framing, attribution scaffolding, distinctive phrasing, entity anchoring	claim reproduced without credit, or contradicted by parametric prior

Fig. 2. Per-stage levers and failure modes — the operational shape of the decomposition. The 'classic SEO lever' column documents the historical practice. The 'new lever' column documents what the operation actually responds to in 2026 production systems.

The table makes the empty cells visible. Two of the three operations had no SEO-era lever at all, because the operations themselves did not exist as optimisation targets in the page-was-the-unit substrate. The blank cells are the opportunity surface for the next decade of work.

They fail independently

The reason the distinction is not pedantic is that a claim can win at one operation and lose at the next. Each stage is a filter, and a piece of content can pass two filters and fail the third in ways that are invisible if you are only measuring at the top of the funnel or the bottom. Three cross-cutting failure modes are worth naming separately:

Win retrieval, lose generation

Your passage is selected. The retrieval system pulls it into the model’s context window. The dense-embedding similarity score was high, the chunk was self-contained, everything upstream worked. The model reads the chunk — and then paraphrases past attribution, blending your phrasing with a competitor’s into a synthesised sentence that the model emits as its own. You influenced the answer. You got no credit. From your monitoring perspective the failure is invisible: the page is indexed, the page is referenced in retrieval logs if you have access to them, the output simply does not name you.

This is the failure mode the Liu et al. “Lost in the Middle” paper points toward. It is also the failure mode that makes the entire “appears in citations” metric used by most current GEO tools fundamentally under-instrumented: appearance in the citation list is the output of a process that already discarded most of the information about which retrieved passages actually influenced the answer.
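A crude way to surface this gap, when retrieval logs are available, is to compare the answer text against each retrieved passage and flag sources whose wording shows up in the answer without a matching citation. The sketch below is an assumption-laden illustration: word-trigram overlap is only a rough proxy for influence and will miss heavy paraphrase, the 0.2 threshold is arbitrary, and the domain names, passages, and `uncredited_influence` helper are hypothetical.

```python
def ngrams(text, n=3):
    """Set of lowercase word n-grams in text (whitespace tokenisation)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def uncredited_influence(answer, cited_sources, retrieved, threshold=0.2):
    """Return sources whose passage overlaps the answer but is not cited."""
    answer_grams = ngrams(answer)
    flagged = []
    for source, passage in retrieved.items():
        grams = ngrams(passage)
        # Share of the passage's trigrams that reappear in the answer.
        overlap = len(grams & answer_grams) / len(grams) if grams else 0.0
        if overlap >= threshold and source not in cited_sources:
            flagged.append(source)
    return flagged

# Hypothetical retrieval log, answer text, and citation list.
retrieved = {
    "yours.example": "the compound has a half-life of 12 hours under physiological conditions",
    "rival.example": "clinical guidance recommends dosing twice daily for most adults",
}
answer = (
    "The compound has a half-life of 12 hours, "
    "so clinical guidance recommends dosing twice daily."
)
cited = {"rival.example"}

flagged = uncredited_influence(answer, cited, retrieved)
print(flagged)  # ['yours.example']
```

A citation-list metric would record this answer as a win for the rival and a miss for you, although both passages visibly shaped the sentence.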

Win generation, lose ranking

The inverse failure. The model loves your sentence. It reproduces your phrasing across dozens of unrelated prompts, paraphrases your framing into its own answers, cites you in some fraction of generations. But when a user runs the same query against a list-style answer system — Google’s classical ten blue links, a sources-tab view, a Perplexity citation list — your domain never appears. The generated answer and the classical SERP have diverged. You are winning the substrate the user reads while losing the substrate the user is shown alongside it.

The strategic question this raises is uncomfortable: which substrate is the one you are competing for? For most clients the honest answer is “both, with the weighting shifting toward generation quarter over quarter.” For some clients — particularly those in verticals where the generative answer has already replaced the click — the honest answer is that ranking is now the diagnostic, not the goal.

Win ranking, lose retrieval

The classical case. You rank #1 in the blue links. Your page has the right authority signals, the right link profile, the right technical health. The upstream ranker hands your URL to the retrieval system as a top candidate. And then the retrieval system chunks your page — and the chunk that survives contains your assertion (“X has a half-life of 12 hours”) without the qualifier that makes it true (“…under physiological conditions; ex vivo the figure is closer to 4 hours”). The chunk arrives at the generator orphaned. The generator either uses the wrong half of your claim or rejects it as unsupported and reaches for a competitor’s better-structured chunk.

This is the failure mode that catches the most experienced practitioners off guard, because every signal they are accustomed to reading is green. The diagnostic instrument for it is a retrieval-survival test — feed the page into the candidate chunkers used by the major frontier models, inspect the chunks the page produces, check whether each chunk is meaningful as a standalone unit. The whole instrument is downstream of the realisation that retrieval is a separate operation.
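The orphaned qualifier is easy to reproduce. A minimal sketch, assuming a fixed-size word-window chunker as a stand-in for whatever chunking pipeline a given system actually runs (real chunkers count model tokens and often overlap their windows); the page text, the filler sentences, and the `chunk_words` helper are hypothetical, built around the half-life example above.

```python
def chunk_words(text, size):
    """Split text into consecutive windows of `size` whitespace tokens."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

# Hypothetical page: the assertion opens it, the qualifier closes it.
page = (
    "X has a half-life of 12 hours. "
    "It was first characterised decades ago and is widely used in research settings. "
    "Several suppliers now offer it in purified form for laboratory work. "
    "Note that the 12-hour figure holds under physiological conditions; "
    "ex vivo the figure is closer to 4 hours."
)

chunks = chunk_words(page, size=12)
claim_chunk = next(i for i, c in enumerate(chunks) if "half-life of 12 hours" in c)
qualifier_chunk = next(i for i, c in enumerate(chunks) if "physiological conditions" in c)

print(claim_chunk, qualifier_chunk)  # 0 3
print(chunks[claim_chunk])
```

The chunk carrying the assertion contains no trace of the qualifier. If only that chunk is selected, the generator receives the half-claim.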

The decomposition formalised

Notation makes the structure precise enough to optimise against. Let q be a query, c a claim that appears in some document D, and M the target model. The probability that M mentions c (verbatim, paraphrased, or attributed) in its answer to q decomposes into three factors, each conditioned on the stage before it:

P(answer mentions c | q)
   =  P(D ∈ candidate_set | q)            ← ranking
    × P(chunk(c) ∈ retrieved | D, q)      ← retrieval
    × P(c used | chunk(c) retrieved, q)   ← generation

Three factors, three operations, three independent levers. Each factor is between zero and one. Each can be measured separately — though most of the GEO tooling stack measures only the product, and reports the product as though it were a single number worth tracking.

The decomposition has an immediate consequence: optimising one factor while another sits near zero buys you almost nothing in absolute terms. Write the product as r × s × g for ranking, retrieval, and generation. With retrieval at 5%, a 50% improvement in ranking moves the product from r × 0.05 × g to 1.5r × 0.05 × g. That is a 50% relative gain, but on a product that can never exceed 0.05, and a ranking factor that is already high has little room left to grow because no factor can exceed one. The same 50% improvement applied at the binding constraint (usually retrieval, in the corpora I audit) moves the product from r × 0.05 × g to r × 0.075 × g, and unlike ranking the factor can keep climbing: taking retrieval from 0.05 to 0.5 multiplies end-to-end visibility tenfold. The operations multiply, and the multiplication makes the constraint dominate. Find the binding constraint, fix it, then look at the next one.⁷ This is the operational discipline the decomposition enables and that the conflated view actively prevents.
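The ceiling effect can be checked with a few lines of arithmetic. A minimal sketch; the factor values (ranking 0.8, retrieval 0.05, generation 0.6) are illustrative assumptions, not measurements, and `visibility` is a hypothetical helper.

```python
def visibility(ranking, retrieval, generation):
    """End-to-end mention probability as the product of the stage factors."""
    return ranking * retrieval * generation

# Illustrative factor values (assumptions, not measured data).
baseline = visibility(0.8, 0.05, 0.6)
perfect_ranking = visibility(1.0, 0.05, 0.6)   # ranking pushed to its ceiling
better_retrieval = visibility(0.8, 0.5, 0.6)   # retrieval lifted tenfold

print(f"{baseline:.3f}")          # 0.024
print(f"{perfect_ranking:.3f}")   # 0.030
print(f"{better_retrieval:.3f}")  # 0.240
```

Even a perfect ranking score leaves the product below the 0.05 retrieval ceiling, while the same effort spent on retrieval has an order of magnitude of headroom.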

Where SEO-era heuristics still apply

The decomposition cuts both ways. It tells you, precisely, where the old toolkit still works — a question worth answering honestly so the volume does not read as “throw everything out.” A great deal of SEO craft transfers cleanly to the new pipeline. Some of it transfers to a different operation than it was originally designed for. And some of it stops working entirely. The honest accounting:

Still works — at the same operation it always served. Crawlability and indexability remain prerequisites for retrieval, because a passage that cannot be fetched cannot be chunked, embedded, or selected. The old technical-SEO hygiene — clean HTML, sensible information architecture, working canonicals, hreflang for multilingual content, robots directives, sitemap discoverability — is table stakes for the new game. Nothing about generative retrieval makes any of this less important; if anything, the dense-embedding substrate is more punishing of badly-rendered or partially-fetched content than the classical index ever was.

Still works — but at a different operation than expected. Domain authority still helps at retrieval and generation, but indirectly. It is not link equity moving a URL up an ordered list; it is one more signal the model uses when deciding which of two competing claims to trust. Schema markup, similarly, helps generation ground entities — the structured data is one of the ways the model decides that two phrasings refer to the same company or product — even though “schema for rich snippets” was a ranking-era framing. The heuristic transferred. The operation it serves did not.

Stopped working. Anything that assumed the ranking function as the terminal step. Keyword density as a primary lever. Link velocity as a primary lever. Thin pages spun for the long tail. Aggressive interlinking optimised for PageRank flow. These were tactics aimed at moving a URL up an ordered list of candidate documents. They are at best neutral and at worst counterproductive in the new pipeline, which has a different terminal step and rewards different qualities of the underlying content.

Always was a proxy. Several of the more sophisticated SEO heuristics — topical authority, entity coverage, internal-linking taxonomies — were always proxies for things the substrate could not measure directly. The substrate now can measure those things directly. Topical authority becomes statement durability across the model panel. Entity coverage becomes named-entity recognition rate across the retrieved spans. Internal linking becomes within-document chunk coherence. The proxies are obsoleted by direct measurement; the underlying signals they pointed toward are more important than ever.⁸

The operational programme

Each operation requires its own instrument. The rest of the volume builds those instruments, and the decomposition is what tells us which instrument measures which thing.

For ranking, the existing instrumentation mostly works. Rank trackers still measure rank. Search Console still reports impressions and positions. The infrastructure built over twenty-five years for this operation is mature and remains useful for the operation it was built for. The honest reframing is not that we should throw it out but that we should stop reading its outputs as proxies for the downstream operations they no longer track.

For retrieval, the instrument is the retrieval-survival test: feed your page into the chunkers used by the major frontier models, inspect the chunks the page produces, check whether each chunk is self-contained and meaningful as a standalone unit, score the page on chunk-survival rate. This instrument did not exist three years ago. It does now, and one form of it is sketched in the methodology section of statement-level visibility. The probe protocol — running a controlled prompt set against the model and recording which of your claims appear in the response — is the corresponding instrument for retrieval-conditioned generation.
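One possible scoring rule for such a test, sketched under assumptions: treat each claim as an (assertion, qualifier) pair and count it as surviving only if at least one chunk contains both. The chunk list, the claim pairs, and the `survival_rate` helper are hypothetical, and plain substring matching stands in for whatever matching a real instrument would use.

```python
def survival_rate(chunks, claims):
    """Fraction of (assertion, qualifier) pairs found together in one chunk."""
    survived = 0
    for assertion, qualifier in claims:
        if any(assertion in c and qualifier in c for c in chunks):
            survived += 1
    return survived / len(claims) if claims else 0.0

# Hypothetical chunker output for one page.
chunks = [
    "X has a half-life of 12 hours under physiological conditions.",
    "Ex vivo the figure is closer to 4 hours.",
    "Y is stable at room temperature.",
    "This holds only when Y is stored away from light.",
]
# Hypothetical claims the page is meant to carry.
claims = [
    ("half-life of 12 hours", "physiological conditions"),
    ("stable at room temperature", "away from light"),
]

rate = survival_rate(chunks, claims)
print(rate)  # 0.5
```

The first claim survives because its qualifier shares a chunk with it; the second does not, because the qualifier landed one chunk later.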

For generation, the instruments are attribution-rate measurement and reproducibility scoring across a panel of models. The attribution-rate question is: when the model uses my claim, does it credit me? The reproducibility question is: does my claim reappear across re-runs and paraphrases, or did it flicker once and vanish? These are measurable quantities. The volume’s third essay — the citation-behaviour taxonomy — operationalises the first; the methodology section of statement-level visibility operationalises both.
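Both quantities reduce to simple ratios once the runs are logged. A sketch under assumptions: the `Run` record, the model names, and the five logged outcomes are hypothetical, and deciding whether a given answer "mentions" or "credits" a claim is taken as already done upstream.

```python
from dataclasses import dataclass

@dataclass
class Run:
    model: str            # which panel model produced the answer
    mentions_claim: bool  # did the answer contain the claim in any form?
    credits_source: bool  # did the answer attribute the claim to the source?

def reproducibility(runs):
    """Share of runs in which the claim appears at all."""
    return sum(r.mentions_claim for r in runs) / len(runs)

def attribution_rate(runs):
    """Among runs that use the claim, the share that credit the source."""
    used = [r for r in runs if r.mentions_claim]
    return sum(r.credits_source for r in used) / len(used) if used else 0.0

# Hypothetical probe log across a small model panel.
runs = [
    Run("model-a", True, True),
    Run("model-a", True, False),
    Run("model-b", False, False),
    Run("model-b", True, True),
    Run("model-c", False, False),
]

print(f"{reproducibility(runs):.2f}")   # 0.60
print(f"{attribution_rate(runs):.2f}")  # 0.67
```

Keeping the two ratios separate matters: here the claim appears in three of five runs, but is credited in only two of those three.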

The cross-cutting move is that the three instruments are additive, not substitutive. You need all three to know which of the three operations is your binding constraint, and you need to know your binding constraint before you can sensibly invest in fixing it. The decomposition is the analytical scaffolding for that investment decision. Without it, you are optimising the operation you happen to have a tool for, which is rarely the operation that is costing you visibility.

Steelmanning two objections

A position paper that does not engage its strongest critics is propaganda. Two objections to the decomposition are worth taking seriously, and each gets a reply that is partial rather than triumphant.

Objection 1: this is just IR 101 with new vocabulary. Manning, Raghavan and Schütze defined ranking and retrieval as separate operations in 2008. Lewis et al. wired generation into the pipeline as a separate component in 2020. You are presenting as a novel decomposition something the IR literature has held as basic architectural truth for two decades. The framework adds nothing.

The objection is half right in the strongest possible way. The operations themselves are old; the IR community has had them clean since well before the SEO industry existed. What is new is not the operations but the decomposition’s relevance to the practice. The SEO industry spent twenty-five years operating on a substrate in which only the ranking operation was exposed to the practitioner. Retrieval, in the IR-textbook sense, was hidden inside the index and behaved like a black box that was either correct or not. Generation did not exist at all. The practitioner had only one lever because the substrate exposed only one operation. The new pipeline exposes three. Re-presenting the IR decomposition to a practitioner audience that has never had to think above the ranking layer is not “rediscovering IR 101”; it is doing the translation work that no one else is doing. The academic field solved the problem. The applied field has yet to absorb the solution. This paper is a contribution to the absorption.

Objection 2: agentic pipelines collapse this back to a single operation. Self-RAG, agentic search, the new tool-using architectures — these systems do not have three clean stages. They have a model that decides when to retrieve, what to retrieve, whether to re-retrieve, and how to compose. The three-stage decomposition is already obsolete; you are writing a foundational paper around a transient architecture.

The objection is gesturing at something real but drawing the wrong conclusion. Agentic and self-correcting pipelines do not collapse the decomposition into a single operation. They extend it by adding a fourth — planning, or meta-control: the decision about which retrieval to run and which generation to attempt and whether to revise. The Asai et al. Self-RAG paper is the cleanest worked example: the model emits reflection tokens that trigger re-retrieval, evaluate the retrieved content, and decide whether to continue generating or revise. This is a fourth operation sitting on top of the three. The decomposition still applies inside each cycle. What changes is that the cycles compose, and the planning layer has its own failure modes (planning loops, over-retrieval, under-confidence cycling) that deserve their own analysis.⁹ The decomposition does not become obsolete; it becomes a sub-routine inside a larger loop.

Limitations

The decomposition is a first-cut analytical instrument and it leaves several things out that matter operationally. Naming them honestly:

Agentic loops are flattened. As above, the three-stage view treats the pipeline as a single forward pass. Production systems increasingly run the pipeline in a loop, with the model deciding whether to re-retrieve based on its own assessment of the first retrieval’s adequacy. The decomposition holds inside each cycle, but the cross-cycle dynamics — when does re-retrieval help, when does it loop, how do you measure visibility across a multi-cycle generation — are out of scope here and treated as a separate problem in the later agentic-pipeline work.

Multi-turn refinement is unmodelled. The framework as presented assumes a single-turn query-answer pair. Real conversations refine across turns, and the visibility of a claim in turn three may depend on what was retrieved and generated in turn one. The multi-turn dynamics deserve their own treatment; the single-turn decomposition is the foundation, not the whole building.

Hybrid lexical-plus-dense retrieval has interaction effects. Most production retrieval stacks today combine BM25-style lexical scoring with dense-vector similarity, often with a re-ranker on top. The interaction between these scoring signals produces selection behaviour that is neither pure lexical retrieval nor pure dense retrieval. The decomposition treats retrieval as a single operation; the internal structure of that operation, and the cases where the hybrid scoring inverts the selection a pure-dense system would have made, are not characterised here.
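A small example of the inversion described above, using reciprocal rank fusion (RRF) as the combining rule. RRF is one widely used fusion method, chosen here for illustration; it is an assumption of this sketch, not a claim about what any particular production stack runs. The chunk identifiers and the two input rankings are hypothetical, and `k=60` is the conventional default constant.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists: each item scores sum(1 / (k + rank))."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda c: (-scores[c], c))


# Hypothetical rankings of the same three chunks under each scorer.
lexical = ["chunk_b", "chunk_c", "chunk_a"]  # BM25-style order
dense = ["chunk_a", "chunk_b", "chunk_c"]    # dense-vector order

fused = reciprocal_rank_fusion([lexical, dense])
print(fused)                   # ['chunk_b', 'chunk_a', 'chunk_c']
print(fused[0] != dense[0])    # True: the hybrid inverts the pure-dense pick
```

The pure-dense system would select `chunk_a`; the fused ranking selects `chunk_b`, because `chunk_a` ranks last lexically. A re-ranker on top would add a further layer of the same kind of interaction.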

Cross-language and code-switched corpora compound everything. My practice has run RTL and Hebrew-English corpora since 2002, and the chunk-survival profile of a multilingual document is meaningfully different from a single-language one. The framework as stated is monolingual by default; the multilingual extensions are work in progress.

None of these limitations break the central argument. They mark where the next refinements of the decomposition need to land. The point of publishing the framework in its first-cut form is to make those refinements collaborative rather than private — to write the framework down well enough that someone else can find the place where it is wrong and improve it.

In summary: the eight points to remember

Ranking, retrieval, and generation are three distinct operations, not three words for the same thing. Each has its own input, output, scoring mechanism, and characteristic failure mode; treating them as synonyms — as most vendor decks and conference talks now do — guarantees you are optimising the wrong substrate and reading the noise as signal.
The conflation of the three operations is the single largest source of bad measurement in GEO work today. Most of the advice currently being sold to enterprise buyers reduces, on inspection, to ranking-era tactics re-skinned for an operation (retrieval, generation) that no longer governs whether their claims appear in the answer.
Retrieval operates below the document level on chunks of 128 to 1,024 tokens, modally 256 to 512. A page that satisfies a query at the document level can fail catastrophically at the chunk level when its assertion sits in paragraph one and its qualifier sits in paragraph four — the retrieval system, which never sees the document, will pull the half-claim.
The three operations multiply, so the binding constraint dominates end-to-end visibility. A 50% gain in ranking when retrieval sits at 5% buys almost nothing; the same 50% applied at the binding constraint — usually retrieval, in the corpora I audit — produces a real 50% jump in answer-mention probability.
A claim can win one stage and lose the next. “Win retrieval, lose generation” — your passage is selected, then paraphrased past attribution into a synthesised sentence the model emits as its own — is invisible to every current GEO tool whose primary metric is appearance in the citation list.
Two of the three operations had no SEO-era lever at all. Retrieval and generation simply did not exist as practitioner-facing optimisation targets in the page-was-the-unit substrate, which is why the per-stage audit table has empty cells exactly where the next decade of applied craft has to be built.
Agentic pipelines extend the decomposition, they do not collapse it. Self-RAG and tool-using architectures add a fourth planning operation on top of the three; the independence-of-failure-modes argument holds inside each cycle, and the planning layer carries its own failure modes — planning loops, over-retrieval, under-confidence cycling — that deserve their own instruments.
Some SEO-era heuristics were always proxies for things the new substrate measures directly. Topical authority becomes statement durability across the model panel, entity coverage becomes named-entity recognition rate across retrieved spans, and “domain authority” itself will obsolesce within thirty-six months as claim-level credibility becomes directly measurable.
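The multiplication argument in the fourth point can be made concrete with a minimal sketch. The per-stage probabilities below are hypothetical, chosen only to put retrieval at the 5% figure used above, and the helper names (`visibility`, `binding_constraint`, `lift`) are introduced here for illustration. The lift is measured in absolute percentage points, which is an assumption of this sketch.

```python
from math import prod
from typing import Dict


def visibility(stages: Dict[str, float]) -> float:
    """End-to-end answer-mention probability as the product of the stages."""
    return prod(stages.values())


def binding_constraint(stages: Dict[str, float]) -> str:
    """The stage with the lowest success probability."""
    return min(stages, key=stages.get)


def lift(stages: Dict[str, float], name: str, points: float) -> float:
    """Visibility after adding `points` (absolute) to one stage, capped at 1.0."""
    lifted = dict(stages)
    lifted[name] = min(1.0, lifted[name] + points)
    return visibility(lifted)


# Hypothetical stage probabilities; only the 5% retrieval figure echoes the text.
stages = {"ranking": 0.80, "retrieval": 0.05, "generation": 0.60}
baseline = visibility(stages)

print(binding_constraint(stages))  # retrieval
for name in stages:
    print(f"{name}: {baseline:.3f} -> {lift(stages, name, 0.10):.3f}")
# ranking: 0.024 -> 0.027
# retrieval: 0.024 -> 0.072
# generation: 0.024 -> 0.028
```

Ten points added at the binding constraint triple end-to-end visibility in this toy case, while the same ten points at either other stage barely register. Because each added point at one stage is multiplied by the other two stages, a point is worth most where the stage itself is weakest. A purely relative lift would scale the product by the same factor at any stage, so in that reading the stages differ in headroom and per-point value rather than in the multiplier.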

References

Brin, S., & Page, L. (1998). The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks and ISDN Systems, 30(1–7), 107–117. — The PageRank paper; foundational for the ranking operation as historically practised.
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press. — Canonical IR textbook. The ranking operation is the subject of essentially the whole book.
Robertson, S., & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval, 3(4). — BM25 — still the lexical baseline that dense retrieval competes against in §3.
Karpukhin, V., Oğuz, B., Min, S., et al. (2020). Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020. — DPR — the retrieval operation as currently implemented in production RAG systems.
Lewis, P., Perez, E., Piktus, A., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020. — The canonical RAG paper; the architectural source for the three-stage decomposition.
Izacard, G., & Grave, E. (2021). Leveraging Passage Retrieval with Generative Models for Open-Domain Question Answering. EACL 2021. — Fusion-in-Decoder. Explains the generation operation's dependence on retrieved-passage quality.
Asai, A., Wu, Z., Wang, Y., Sil, A., & Hajishirzi, H. (2023). Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. ICLR 2024. — Self-correcting RAG; the model evaluates its own retrieval — a four-operation extension of the framework.
Borgeaud, S., Mensch, A., Hoffmann, J., et al. (2022). Improving Language Models by Retrieving from Trillions of Tokens (RETRO). ICML 2022. — DeepMind's retrieval-at-scale architecture; informs the retrieval-vs-generation boundary in §4.
Liu, N. F., Lin, K., Hewitt, J., et al. (2024). Lost in the Middle: How Language Models Use Long Contexts. Transactions of the ACL, Volume 12. — Explains the 'win retrieval, lose generation' failure mode in §6.
Gao, Y., Xiong, Y., Gao, X., et al. (2024). Retrieval-Augmented Generation for Large Language Models: A Survey. ACM Computing Surveys. — Most current comprehensive RAG survey; useful for the limitations discussion.
Sasson, G. (2026). Statement-level visibility, or: why ranking a page no longer matters. Algoholic, Vol. III, Essay 04. — The downstream paper this decomposition supports.
Sasson, G. (2026). A taxonomy of LLM citation behavior across 14 frontier models. Algoholic, Vol. III, Essay 03. — Operationalises the generation stage of the decomposition.

¹ I have been collecting the phrase “ranking in ChatGPT” from industry talks since mid-2024. The count is now over two hundred occurrences and it is meaningless in every one of them. ChatGPT does not rank. Its upstream retrieval system ranks (sometimes), and then the model generates an answer that may or may not preserve any order the retriever imposed. The vocabulary error pre-loads the analytical error.
² The full version of the career history is on the CV page; the relevant fact here is that I have personally tuned content for at least five distinct ranking-function generations (pre-PageRank lexical; PageRank-era link authority; Panda/Penguin quality-and-link-graph; BERT/MUM neural semantic; SGE/AI-Overviews generative). The decomposition in this paper is what you get when you try to write down what was actually changing across those transitions, rather than what each vendor said was changing.
³ The reverse-engineering tradition has a name, “the SEO test”, and a literature stretching from the 2002 Brett Tabke “26 steps” guide through the modern A/B-testing platforms run by enterprise SEO teams. The substrate has changed; the methodology of probing it by changing one variable at a time has not. The methodology transfers to retrieval and generation, but the unit you vary is no longer the page; it is the chunk-able span and the claim-level phrasing respectively.
⁴ Typical production chunk sizes as of 2026 range from 128 to 1,024 tokens, with 256–512 the modal setting in the open-source stacks (LangChain, LlamaIndex, Haystack) and somewhat larger windows in the proprietary enterprise systems. The relevant point for this paper is that the chunk size is essentially never the length of a complete article; the retrieval substrate has chosen, by construction, to operate on sub-document fragments.
⁵ The Liu et al. “Lost in the Middle” paper is the now-canonical demonstration that even when long contexts are successfully retrieved, models systematically under-weight information that sits in the middle of the retrieved span. The effect compounds with the chunking problem: claims that survive retrieval may still fail to influence generation if they are not positioned at the start or end of their chunk.
⁶ The asymmetry matters for practice. A practitioner can walk into ranking with fifty years of accumulated academic craft, a canonical textbook, and a stable evaluation tradition. A practitioner walking into generation today is operating against a literature that is still being written, with evaluation methodologies that are themselves contested, on top of a substrate (the frontier models) that updates faster than the papers can be published. The discomfort is structural, not temporary; it will not resolve into textbook stability for at least another five years.
⁷ The “find the binding constraint” framing is borrowed from the theory of constraints in operations management, where the same arithmetic applies: throughput is governed by the slowest stage of a multi-stage process, and investment anywhere else is wasted. The literature on this is old (Goldratt’s The Goal dates to 1984) and has been independently rediscovered in software performance engineering as “Amdahl’s law thinking”. The decomposition imports the same discipline into GEO.
⁸ The pattern in which a proxy outlives the thing it was proxying for, and then becomes obsolete when the underlying thing becomes measurable, recurs across the history of the field. Keyword density was a proxy for topical fit until topical fit became directly measurable via semantic embeddings. Anchor-text exact-match was a proxy for inter-document relationship until knowledge-graph extraction became reliable. The current proxy that will obsolesce next, on my read of the substrate, is “domain authority” itself; measurable claim-level credibility will replace it within the next thirty-six months.
⁹ The four-operation extension (planning, ranking, retrieval, generation) is the working framework for the agentic-pipeline essays later in this volume. The same independence-of-failure-modes argument holds: each operation fails in a distinct way, each requires a distinct instrument. The instruments for the planning layer are the least mature and the most rapidly-evolving, which is part of why those essays are sequenced later.

Gilad Sasson

aka Algoholic · גלעד ששון

Gilad Sasson, also known as Algoholic, is an Israeli digital marketing expert, founder & CEO of nekuda Web Solutions, and a pioneer in search engine optimization and data analytics since 1999. Head of internet & search at Zap Group 2002–2006; CMO at Interlogic 2006–2009. Speaker at SMX Israel, TNW Amsterdam, Web Summit Dublin, DMIEXPO.
