nekuda was founded in 1999, in a one-room office in Tel Aviv, and the first engines I optimised for were Excite, HotBot, AltaVista, and Yahoo’s human-curated directory. PageRank had launched the year before but had not yet swallowed the index. There was no dominant ranking algorithm; there were five major engines, each weighting text differently, none of them agreeing on what a page was worth.[1] The work was crude, frustrating, and, in hindsight, an unexpectedly good education for the world we are now entering. I want to write about it not as nostalgia (most of what we did then was ugly, and I will say so) but because the discipline that the pre-link era forced on us is the discipline generative retrieval is forcing back on the practice today. Twenty years of link-era habit taught a generation of practitioners to think about everything except the text on the page. That habit is now actively misleading. The 1999 instinct, which I had to learn the hard way once and then unlearn the hard way again, is the one that transfers.
The five engines, briefly
To understand what the work actually felt like, you have to picture the market. There were five engines that mattered, and each had a character — a set of tells about what it was rewarding under the hood. You learned them the way a cook learns ovens.
AltaVista was the technical heavyweight. Built inside Digital Equipment Corporation in 1995, run on a wall of Alpha workstations that were, at the time, the fastest general-purpose machines anyone had ever pointed at the web, AltaVista famously claimed an index of roughly sixteen million pages in 1996 and crossed a hundred million by 1998. It survived DEC’s collapse, got passed to Compaq, then to CMGI, then to Overture, and was finally absorbed by Yahoo when Yahoo bought Overture in 2003. Its scoring rewarded straightforward text presence (title-tag terms, meta keywords, body density) but it was less forgiving of obvious manipulation than its competitors. Of the five, AltaVista is the one that felt the most like a retrieval engine rather than a directory or a portal, which is probably why its discipline is the one that has aged best.
HotBot was the fastest, in two senses. It crawled aggressively, and it was the first engine where you could expect a new page to appear in the index within a week rather than a month. It was powered by Inktomi — Berkeley spinout, brilliant team, and the technical core that AltaVista should have learned from but didn’t. HotBot supported genuine boolean syntax when most engines treated user queries as bags of words. Its scoring leaned on term frequency more aggressively than AltaVista’s, which made it gameable in ways AltaVista wasn’t, and also made it useful as a feedback loop: a page that ranked on HotBot a week after launch told you whether your text was saying the thing or just containing it.
Excite was experimenting with what we would now call entity graphs. Its ranking blended classical IR with what they called “concept-based searching”: roughly, an attempt to identify related terms and cluster results by topic. It was clever; it was also unstable. The same query could return quite different results week to week as the concept model adjusted. Excite got to a market cap of around ten billion dollars at the peak of the dot-com run, merged with @Home (which acquired it in 1999), and then collapsed catastrophically. The engineering work was ahead of its time. The business case for it was not.
Lycos had the academic origin — built at Carnegie Mellon in 1994 by Michael Mauldin and his team, the name a contraction of Lycosidae, the wolf-spider family — and for a while was the most-trafficked site on the web. Its scoring was the most traditional of the five: TF-IDF in recognisable form, link counts from the few pages that bothered to link out, some early experiments with anchor text. Lycos was the engine where you could most clearly see classical information-retrieval theory operating in a commercial product.
Yahoo, then, wasn’t really an engine — it was a directory. A staff of human editors classified submitted sites into a topic hierarchy, and you either got accepted into the relevant category or you didn’t. Submission required a $299 fee for commercial sites by 2000, no guarantee of inclusion, and a turnaround time that could run to months. When Yahoo accepted you, your traffic doubled overnight. When they didn’t, you had no recourse. The directory was the most consequential single navigational surface on the web through about 2002, and the work of getting a client into the right Yahoo category was its own discipline — closer to publishing-rights work than to SEO.
The job of a search-engine marketer in 1999 was to optimise across the disagreement among these engines. A page that ranked first on HotBot might sit on page three on AltaVista and not appear at all on Excite. You learned to think in terms of what was robust across engines and what was engine-specific tuning, and you kept a mental table of which signals each engine seemed to weight most heavily: which titles HotBot rewarded that AltaVista shrugged at, which descriptions Excite seemed to favour. The practice was inherently multi-system, in a way that the Google-monoculture era after 2003 made us forget.[2]
How ranking actually worked, pre-PageRank
The substrate question is simple: what were these engines actually measuring? The honest answer, across all five, is that they were measuring presence and proximity of query terms against a document, plus a small set of structural signals about where in the document the terms appeared. There was no authority graph to lean on, because the link graph at any practical scale didn’t yet exist.[3] Term frequency mattered. Term position mattered: terms in the title tag carried more weight than terms in a heading, which carried more weight than terms in body text, which carried more weight than terms in a footer. Meta-keyword tags mattered, on the engines that read them at all. Density mattered, up to a point and then back down again.
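A minimal sketch of that kind of scoring may make the shape concrete. It is not any engine's actual formula: the weights, the density ceiling, and every function name below are hypothetical, chosen only to reproduce the ordering described above (title over heading over body over footer) and the rise-then-fall behaviour of density.

```python
import re
from collections import Counter

# Hypothetical weights: a hit in the title counts more than one in a
# heading, which counts more than one in a footer. Body text is handled
# separately through keyword density.
PRESENCE_WEIGHTS = {"title": 4.0, "heading": 2.0, "footer": 0.25}
BODY_WEIGHT = 1.0
# Hypothetical ceiling: the share of body words a term may occupy
# before further repetition starts to cost score instead of adding it.
DENSITY_CEILING = 0.08


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def density_factor(count, total):
    """Rise linearly to 1.0 at the ceiling, then fall back toward 0.0."""
    if total == 0:
        return 0.0
    density = count / total
    if density <= DENSITY_CEILING:
        return density / DENSITY_CEILING
    return max(0.0, 1.0 - (density - DENSITY_CEILING) / DENSITY_CEILING)


def score(page, query):
    """Score a page (a dict of field name -> text) against a query."""
    body_tokens = tokenize(page.get("body", ""))
    body_counts = Counter(body_tokens)
    total = 0.0
    for term in tokenize(query):
        # Structural signal: is the term present in a weighted position?
        for field, weight in PRESENCE_WEIGHTS.items():
            if term in tokenize(page.get(field, "")):
                total += weight
        # Density signal: rewarded up to the ceiling, penalised beyond it.
        total += BODY_WEIGHT * density_factor(body_counts[term], len(body_tokens))
    return total
```

Under these made-up numbers, a page with the query term in its title outscores an otherwise identical page that only mentions it in the footer, and a body stuffed far past the ceiling earns nothing for the repetition.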
The scoring was crude but it was also legible. You could look at a query, a page, and a SERP, and you could mostly reason out why the engine had ranked the page where it did. You could change three things on the page, wait two weeks for the recrawl, and watch the rank move. Cause and effect were observable in a way that the modern search practitioner, working against opaque deep-learning rankers, would find quaint and almost unbelievable.
What was being measured was, fundamentally, whether the page’s text said the thing. Did the page contain the query terms? Did they appear in load-bearing positions? Did they appear together, in phrases that suggested the page was really about that topic rather than coincidentally mentioning it? That was most of the model. The link era would later add a powerful exogenous signal — an external vote of confidence — that could compensate for weakness in the document itself. The pre-link era couldn’t. If the document didn’t say the thing, no amount of off-page work was going to rescue it, because off-page work didn’t yet exist as a category.
This had two consequences I want to flag. First, the quality of the text was load-bearing in a way that link-era SEO would later let practitioners forget. Second, the work was easily abused in obvious ways — white-text-on-white keyword stuffing, comment-tag spam, doorway pages, meta descriptions packed with two hundred terms — which is why PageRank, when it arrived, felt to most of us like an upgrade. Citation-graph centrality was a better proxy for quality than raw term presence, because raw term presence could be faked without limit. PageRank required collusion, which scaled worse. For about a decade, this was a real improvement. Then it stopped being one, but that is a later part of the story.
The first nekuda clients
The discipline the 1999-2002 work taught was this: the document had to carry itself on its own text, because there was no authority graph to launder weakness through. When a client came in with a page that wasn’t ranking, the conversation was about the page. Did the title say the thing? Did the first paragraph state the claim cleanly? Was the body text actually about the topic, or was it a thin layer of marketing copy spread over a product database that nobody would read? The audit was textual, because the system was textual. There was nothing else to audit.
I spent a lot of those first three years writing copy, or sitting with clients while they rewrote copy, or explaining to clients why the copy they had was the reason they weren’t ranking. Some of them got it. Some of them didn’t, and went and bought meta-keyword-stuffing tools instead, and got penalised, and either learned from it or didn’t. The ones who got it built durable rankings on durable pages. The ones who didn’t had to redo everything every six months as the engines tightened their spam filters, which by 2001 they were doing routinely.
I want to mark something here that I underweighted at the time and that I think about a lot now. The practitioners who survived the AltaVista era with their craft intact were, almost without exception, the ones who had learned to write the page well — not the ones who had learned the trick of the week. The trick-of-the-week practitioners had to relearn their whole craft every time a major engine updated. The write-the-page-well practitioners had to adjust at the margin and were otherwise fine. That sorting effect, which I lived through as a beginner, recurs every time the substrate shifts. We are inside one of those shifts now.
The arrival of PageRank
PageRank launched in September 1998, when Brin and Page incorporated Google, but the algorithm and the architectural argument behind it had been published in their famous paper that April.[4] At the time, almost no practitioner I knew read it. The engines we were working against were AltaVista and HotBot; Google was a curiosity, a project out of Stanford with a clean interface and a strange name. By the end of 1999 it was the search-engine-of-choice among engineers, but it was nowhere on the commercial radar. We optimised for the five engines I described above and treated Google as a footnote.
The dominance crept up. By 2001 Google’s index was meaningfully larger than AltaVista’s. By 2002 the major portals (AOL, and Yahoo itself for a time) were running Google as their backend search provider. By 2003 the smart practitioners were optimising for Google first and the others as afterthoughts; by 2005 the others were largely gone as commercial concerns. The shift was profound, and most of us missed how profound it was at the time. We adapted at the tactical level (buying domains with relevant keywords, getting links from directories, building reciprocal-link networks) without registering that the substrate argument had changed. The question was no longer “does this page contain the term in load-bearing positions?” The question was “do enough other pages link to this page with text suggesting it is about the term?” Those are different questions, with different optimal strategies, and the practice slowly bent itself around the second one over the following five years.
What we lost in the bend — and what I am asking you to feel, because it is the through-line of the rest of this essay — was the discipline of treating the document as having to carry itself. Once links were available as a rescue mechanism for weak pages, a market grew up around supplying the rescue mechanism. The market was rational, given the substrate. It was also the single biggest deformation of the practice in the years I have been doing it.
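For readers who have never seen the link-era question written down, here is a toy power-iteration sketch of PageRank-style centrality. It follows the general idea in the Brin and Page paper, but the function name, the three-page graph in the usage note, and the way dangling pages are handled are illustrative assumptions, not a reconstruction of Google's production system.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank by power iteration.

    `links` maps each page to the list of pages it links out to.
    Returns a dict of page -> score; the scores sum to roughly 1.0.
    """
    # Collect every page that appears as a source or as a target.
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the "random jump" baseline.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped score evenly among its outlinks.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outlinks spreads its score evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank
```

Called on a hypothetical graph such as `{"a": ["c"], "b": ["c"], "c": ["a"]}`, the page with two inbound links comes out on top and the page nobody links to comes out last. Notice that nothing in the function ever reads a word of any page's text, which is the whole point of the passage above.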
The link era taught us to forget the document
Here is the quiet cost of the PageRank era: it taught a generation of practitioners to think about everything except the text. Link building. Anchor-text distribution. Domain authority. Internal PageRank sculpting. Topic authority. Hub-and-spoke architectures. Link-velocity smoothing. Disavow files. Penguin-recovery audits. An enormous craft grew up around moving authority between URLs, and the content itself became almost incidental — a vehicle for the links rather than the thing being judged. You could rank a mediocre page if its link profile was strong enough, and for a long time you could rank a strong page poorly if its link profile was weak.
I want to be fair to that era. It produced real expertise. The practitioners who became excellent at link economics in the 2005-2015 window were not lazy or unserious; they were responding rationally to a ranking system that rewarded what they were doing. The work was harder than it looked. Identifying high-equity link sources, negotiating placements, building durable link assets, anticipating which link patterns would survive the next Penguin update — this was real craft. I respect it and I still draw on parts of it. What I am pointing at is not the craft itself but the habit of mind the craft induced.
The habit was: when a page is not ranking, look at the off-page. Audit the backlink profile. Check the internal links. Look at the anchor-text distribution. Look at the domain’s overall authority score. Compare against competitor backlink profiles. Identify the link gap. Build the link gap. Wait for the recrawl. Watch the rank move. The page itself was almost a constant in the equation — you would optimise it once for keywords, and after that the action was elsewhere. The content team and the SEO team in most agencies were different teams, and the SEO team treated the content team as a service-provider for the real work, which happened in the link graph.
That habit, which was rational for two decades, is now actively misleading. The generation of practitioners who came up entirely inside the link era has, in my experience, the hardest time adjusting to what generative retrieval rewards — because their muscle memory says the answer is in the off-page and the off-page is the thing that no longer matters in the way they expect.
Generative retrieval restored the old constraint
Watch what actually happens inside a model’s context window. A retrieval system has pulled a few passages in. The model is composing an answer from those passages. Inside that window there is no link graph to consult; there is no domain authority to lean on; there are no anchor texts and no internal PageRank flow. There is only the text — how clearly each passage states its claim, how self-contained and legible and defensible that claim is. The model reaches for the passage that most cleanly says the thing.
That is the AltaVista constraint, restored.[5] The oldest instinct in the practice (make the content itself carry the ranking, because there is no shortcut) is suddenly the most modern one. The link-era habit of treating text as a vehicle for authority is the thing that has to be unlearned, and the more thoroughly link-era a practitioner was, the harder the unlearning has tended to be.
There is a precise and slightly alarming symmetry here. In 1999, if your page didn’t say the thing clearly in its own text, no link economy could rescue it because the link economy didn’t exist. In 2026, if your page doesn’t say the thing clearly in its own text, no link economy can rescue it because the link economy has stopped being what the system is consulting at the point of answer-generation. The pipeline still uses links — retrieval is downstream of crawl, crawl is downstream of discovery, discovery still leans on the link graph in places — but the artifact the user consumes is a generated answer, and the generated answer is assembled from text that has to carry itself inside the context window. Links are part of the substrate. They are not part of the artifact. That distinction is everything.
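The substrate-versus-artifact distinction can be sketched in a few lines. This is a deliberately crude stand-in: real systems score passages with learned embeddings rather than term overlap, and the `Passage` fields, function names, and example URLs here are all hypothetical. What it illustrates is only the structural point, that link metadata can exist upstream without ever reaching the text the model composes from.

```python
import re
from dataclasses import dataclass


@dataclass
class Passage:
    url: str            # known to the crawler
    inbound_links: int  # known to the crawler, never shown to the model
    text: str           # the only field that reaches the context window


def terms(text):
    """Lowercased set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(query, passage):
    """Share of query terms that appear in the passage text (toy relevance)."""
    query_terms = terms(query)
    if not query_terms:
        return 0.0
    return len(query_terms & terms(passage.text)) / len(query_terms)


def build_context(query, passages, k=2):
    """Pick the k most relevant passages and join their text only."""
    top = sorted(passages, key=lambda p: overlap(query, p), reverse=True)[:k]
    # URLs and link counts are dropped here: the model sees text, nothing else.
    return "\n\n".join(p.text for p in top)
```

In this sketch a passage with thousands of inbound links and nothing relevant to say loses its place in the window to a barely linked passage that states the answer, because `inbound_links` is never consulted at this stage.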
What transfers
The durable lessons from 1999, the ones that have aged well enough to be worth restating for the 2026 practitioner, are narrower than the nostalgia-merchants want and broader than the nothing-old-applies crowd will admit. I will list them honestly.
First: the discipline of making a single page state its claim clearly, on its own terms, without relying on off-page signals to rescue it. In retrieval terms, that is chunkability and self-containment. The claim and its qualifier in the same retrievable span. The fact and its source in the same paragraph. The number and its date next to each other. These are not new skills; they are the skills the 1999 practice required, expressed in 2026 vocabulary.
Second: the discipline of optimising to a specific engine’s logic rather than to a mythical universal ranking factor. In 1999 this meant understanding that Excite’s concept-clustering, HotBot’s term-frequency sensitivity, and AltaVista’s stricter spam thresholds were different targets. In 2026 it means understanding that GPT-5, Claude Opus, Gemini 2.5, Perplexity Sonar, and the enterprise-RAG systems running on private corpora all reward subtly different things. The 2026 practitioner who optimises to a generic “for AI” target is making the same category error as the 1999 practitioner who optimised to a generic “for search engines” target.
Third — and this is the one I underweight most often when I am tired — the posture that the page has to earn its place on what it actually says. This is not a tactic. It is closer to an ethic. The 1999 substrate enforced it because nothing else was available. The 2002-2022 substrate let us forget it because the link economy was a reliable rescue mechanism. The 2026 substrate is enforcing it again, in a different vocabulary, with different surface manifestations, but the underlying constraint is the same one.
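The first lesson, chunkability and self-containment, is the easiest of the three to show mechanically. The sketch below assumes a document whose headings are marked with a leading `#` (a hypothetical convention for the example, not something every corpus uses) and a hypothetical helper name, `chunk_by_heading`. It splits the text at headings and carries each heading into its chunk, so that a chunk retrieved on its own still says what it is about.

```python
def chunk_by_heading(text):
    """Split text into chunks at '#' headings, prefixing each chunk
    with its heading so it can stand alone when retrieved."""
    chunks = []
    heading, body = "", []
    for line in text.splitlines():
        if line.startswith("#"):
            # A new heading closes the chunk collected so far.
            if body:
                chunks.append((heading, " ".join(body)))
            heading, body = line.lstrip("#").strip(), []
        elif line.strip():
            body.append(line.strip())
    # Close the final chunk.
    if body:
        chunks.append((heading, " ".join(body)))
    return [f"{h}: {b}" if h else b for h, b in chunks]
```

A chunker like this can only keep a claim and its qualifier together if the writer put them in the same section to begin with. The code preserves self-containment; it cannot create it.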
What does not transfer
I want to be careful not to romanticise 1999, and I want to be specific about what we did then that would be punished instantly now.
Keyword density does not transfer. The 1999 instinct to repeat the target term every fifty words is, on an embedding-based retrieval system, indistinguishable from spam — both literally (modern spam classifiers fire on it) and mechanically, in that embedding similarity does not reward literal repetition the way string-matching did. The page that says the thing once, with the right qualifier and a citation, will out-retrieve the page that says the thing twelve times.
Meta-keyword stuffing does not transfer. Google said publicly in 2009 that it does not use the meta keywords tag for web ranking, and the tag is read by no retrieval system that matters in 2026. Time spent on it is time wasted.
Header-tag stuffing does not transfer. The 1999 trick of nesting an H1, two H2s, three H3s, and a string of H4s into a page to load it with keyword hits is a categorical waste of effort against any current retrieval system. Headers still have structural value — they help the chunker — but their keyword-loading function is dead.
Title-tag truncation games do not transfer. The art of getting the maximum number of high-equity terms into the 65-character title window was real work in 2003. The art of writing a title that states the claim the page will deliver, in a sentence a human would actually read, is what works in 2026. The constraint is editorial, not lexical.
Doorway pages do not transfer, hidden text does not transfer, comment-tag spam does not transfer, keyword-loaded alt attributes on decorative images do not transfer, footer-link farms do not transfer. Most of what we did in 1999, even the parts we thought of as legitimate at the time, would be punished instantly now. Embedding-based retrieval reads meaning, not string frequency, and almost every surface trick that worked in the 1999 era was a string-frequency hack.
The clean version is: the tactics of 1999 are mostly extinct. The posture of 1999 — that the document carries itself — is one of the most useful things in the 2026 practitioner’s toolkit, precisely because the intervening twenty years taught most of the field to forget it.
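The mechanical half of the keyword-density point can be shown with a toy comparison. The sketch assumes bag-of-words count vectors as a stand-in for a learned embedding (real embedding models behave differently in detail), and the function names are hypothetical. It contrasts a raw string-frequency score, which grows with every repetition, against a cosine similarity, which is normalised for length and so gains nothing from saying the same thing twelve times.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Term counts for a text; a crude stand-in for an embedding vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def raw_count_score(query, text):
    """1999-style string matching: every extra occurrence adds to the score."""
    counts = bag_of_words(text)
    return sum(counts[term] for term in bag_of_words(query))


def cosine_score(query, text):
    """Length-normalised similarity between the two count vectors."""
    q, d = bag_of_words(query), bag_of_words(text)
    dot = sum(q[term] * d[term] for term in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in d.values())
    )
    return dot / norm if norm else 0.0


once = "blue widgets ship worldwide"
stuffed = " ".join([once] * 12)  # the same sentence repeated twelve times

print(raw_count_score("blue widgets", once), raw_count_score("blue widgets", stuffed))
print(cosine_score("blue widgets", once), cosine_score("blue widgets", stuffed))
```

The raw score for the stuffed page is twelve times that of the single sentence, while the two cosine scores are the same. The toy does not model the spam classifiers mentioned above, so it shows only that repetition stops paying, not that it is actively punished.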
| Engine | Era | What it rewarded | What it punished |
|---|---|---|---|
| AltaVista | 1995–2003 | Title-tag terms, body density, structural HTML | Obvious spam, hidden text, white-on-white |
| HotBot (Inktomi) | 1996–2002 | Term frequency, boolean precision, fast indexing | Less than AltaVista — gameable |
| Excite | 1995–2001 | Concept clusters, related-term breadth | Unpredictable, week-to-week instability |
| Lycos | 1994–2003 | Classical TF-IDF, early anchor-text use | Sparse link graph; little to punish with |
| Yahoo Directory | 1994–2002 | Editorial fit, category accuracy, paid submission | Inclusion was binary; no spam at the rank level |
A personal coda
The through-line of twenty-seven years is not a set of tactics. The tactics have turned over completely, twice. They turned over in 2002-2005 when the link era arrived and rendered most of the 1999 craft obsolete, and they are turning over again in 2024-2026 as generative retrieval renders most of the link-era craft obsolete in turn. The 2026 practitioner who learned in 2015 and the 2005 practitioner who learned in 1999 are in roughly the same position relative to their current substrate: their tactical training is mostly wrong, and they have to relearn the field from the substrate up. This is fine. It is the cost of working in a field where the substrate shifts on a fifteen-year clock.
What does transfer, what is worth carrying across the substrate transitions, is a posture: the document has to earn its place on what it actually says. The link era let us forget that for two decades. Generative retrieval is, in its own strange way, the engines remembering it too.[6]
I write this in 2026, at a desk in Tel Aviv that is half a kilometre from the one-room office where nekuda started, with twenty-seven years between me and the practitioner who first opened HotBot and AltaVista in two browser tabs and tried to work out why his client was ranking on one and not the other. The tabs are different now. The question is the same one. I am not sure whether to find that comforting or alarming, and after twenty-seven years I have decided it is a little of both.
References
- Brin, S., & Page, L. (1998). The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks and ISDN Systems, 30(1–7), 107–117. — The paper that ended the AltaVista era. Cited here for the substrate transition the essay sits at the start of.
- Salton, G., & McGill, M. J. (1983). Introduction to Modern Information Retrieval. McGraw-Hill. — The vector-space model. The conceptual ancestor of the dense-passage retrieval that drives generative answer systems today.
- Robertson, S. E., Walker, S., Jones, S., Hancock-Beaulieu, M., & Gatford, M. (1995). Okapi at TREC-3. Text REtrieval Conference (TREC-3) Proceedings. — BM25 in its formative TREC-3 form. The lexical scoring tradition the AltaVista-era practice was implicitly tuning against.
- Lawrence, S., & Giles, C. L. (1999). Accessibility of information on the web. Nature, 400(6740), 107–109. — The famous 16% paper — quantifying how partial each search engine's coverage actually was in 1999. Empirical justification for why the practice required optimising across multiple engines.
- Bharat, K., & Henzinger, M. R. (1998). Improved Algorithms for Topic Distillation in a Hyperlinked Environment. SIGIR 1998. — Early link-analysis algorithms; representative of the work that was about to displace the pre-PageRank approach.
- Henzinger, M. R. (2000). Link Analysis in Web Information Retrieval. IEEE Data Engineering Bulletin, 23(3). — Contemporaneous overview of the link-era transition from the inside of the field.
- Lewis, P., Perez, E., Piktus, A., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020. — The architectural paper that closed the loop — generative retrieval that, mechanically, restores the AltaVista-era constraint of text-carries-the-rank.
- Sasson, G. (2026). Statement-level visibility, or: why ranking a page no longer matters. Algoholic, Vol. III, Essay 04. — The Volume III paper that operationalises the discipline this memoir traces back to 1999.
- Sasson, G. (2026). Ranking ≠ retrieval ≠ generation. A decomposition. Algoholic, Vol. III, Essay 01. — The decomposition that this memoir's closing argument depends on.
Footnotes
[1] Search-engine market share in mid-1999, by my own measurement at the time and corroborated by the Lawrence & Giles coverage paper that July: AltaVista, Excite, HotBot/Inktomi, Lycos, Yahoo, plus a long tail of Northern Light, WebCrawler, Magellan, Snap, and Ask Jeeves. No single engine held more than about a third of the directed-search market, and the Yahoo directory was still receiving more navigational traffic than any of the algorithmic engines individually.

[2] I am dating the monoculture from 2003, the year Google’s share in most Western markets crossed sixty per cent and Yahoo, having bought Inktomi, began moving its web search onto its own crawl. By the end of 2004 the practical search market in English-speaking geographies was Google, with rounding error attached. In Hebrew, which is where I was working, Google’s dominance took another two years to harden, which is part of why I tend to use 2006 as the personal date the monoculture became total.

[3] The web in 1999 had roughly 800 million publicly indexable pages by Lawrence & Giles’s estimate, and the median page had under five outbound links. Excite’s acquisition of WebCrawler, back in 1996, had produced one of the first large-scale crawls in which the link graph was actually retained as a queryable structure, but no commercial engine yet ranked on it seriously. The PageRank paper had been published, the algorithm worked, but the link graph it required was a research artifact, not a production one.

[4] The paper, “The Anatomy of a Large-Scale Hypertextual Web Search Engine,” is still the cleanest single-document description of the PageRank algorithm and its motivation. Worth re-reading every few years; the substrate argument it makes is almost identically transposable onto the current generative-retrieval substrate, with citation-graph centrality swapped for embedding-space neighbourhood and the user-query distribution replaced by the prompt distribution.

[5] This is the argument I make at greater length in Statement-level visibility, the Vol. III paper that operationalises what this memoir treats as a personal observation. The empirical work in that paper (three months, fourteen models, three thousand documents) finds that the statement-level retention gap between sourced, self-contained content and conventional commercial copy is roughly an order of magnitude. The mechanism it isolates is exactly the one this memoir is pointing at.

[6] The pipeline decomposition this argument leans on (ranking, retrieval, and generation as three distinct operations with three different optimisation targets) is the subject of Ranking ≠ retrieval ≠ generation, the first essay in Vol. III. That paper is the formal version of an intuition the AltaVista years left me with: that the system that retrieves your text is not the system that ranks it, and treating them as one operation is a category error the link era propagated for twenty years.
