Twenty years of Hebrew SEO taught one hard lesson: almost every piece of search infrastructure was built left-to-right first and patched for right-to-left afterward, and the patches leaked. Directionality bugs in the crawler. Mis-segmented compounds in the index. Niqqud stripped in one place and retained in another. Hebrew, and Arabic alongside it, was a second-class citizen of the index for most of that period, not because the engineers building these systems were careless, but because the abstractions they inherited (the regex tokeniser, the bag-of-words index, the Anglo-centric link graph) were the wrong abstractions for a morphologically rich, agglutinative language written right-to-left.¹ We engineered around them. A decade of Israeli SEO was, in large part, exactly that engineering.
Generative retrieval does not inherit those exact bugs. The transformer architecture absorbs niqqud and agglutination the way it absorbs almost everything: it learns sub-word units that route around the surface irregularity. The new problem is not parsing. The new problem is distribution — how much Hebrew text the model has seen, how dense the Hebrew region of its embedding space is, how confidently it can resolve a Hebrew entity to a node in its implicit knowledge graph. This essay is about where that disadvantage lives, how to measure it across fourteen frontier models, and why a long-standing Hebrew publisher is unusually well placed to turn it into an edge.
The roadmap is four moves. First, a careful description of the old parsing problem — useful both because it locates the new problem against a familiar baseline and because some of the parsing failure modes survived into the generative era in disguised form. Second, a decomposition of the new disadvantage into three structural components: sparser Hebrew embeddings, weaker entity grounding, and cross-lingual leakage. Third, an empirical section that measures the RTL penalty across fourteen models using a matched-pair probe design. Fourth, the practical argument: what a Hebrew-incumbent publisher should actually do, why volume alone is the wrong response, and why the asymmetry between Latin-script and Hebrew/Arabic-script languages in the training pipeline is structural rather than transient.
The old problem: parsing
Classical search treated text as tokens to be matched. For Latin scripts that is mostly tractable; for Hebrew it is not, for reasons that are linguistic, not incidental. Reut Tsarfaty and colleagues have spent the last fifteen years documenting precisely these failure modes in the academic NLP literature.² The SPMRL workshop series she co-organised was, in effect, the field’s collective acknowledgment that morphologically rich languages broke the standard parsing pipeline in ways that demanded language-specific intervention rather than parameter retuning.
The three failure modes that hurt retrieval most were:
- Optional vowel diacritics (niqqud). The same consonantal string can resolve to several different words depending on vowel marks that everyday writing omits. A crawler that indexed only consonants conflated meanings — one orthographic token, three or four lemmas, and no signal to disambiguate between them. A crawler that required niqqud missed the roughly 95% of real-world Hebrew text that ships without it. Both failure modes were invisible to the publisher: queries came in, results went out, and the conflation happened silently inside the index.
- Prefix agglutination. Hebrew glues prepositions, the definite article, and conjunctions onto the front of words. A single orthographic token can carry what English spreads across four words; a single whitespace-delimited string can encode “and-to-the-house” as one unit.³ Naïve tokenisers either failed to split the prefix cluster (treating the agglutinated form as a separate vocabulary item from the bare root) or split it wrongly, producing partial matches that ranked badly. The morphological analysers built to handle this (MILA, YAP, the more recent AlephBERT pipeline) added a layer the English-first stack never needed.
- Bidirectional runs. Hebrew sentences routinely embed Latin brand names, URLs, numerals, and English technical terms, producing mixed-direction strings that broke layout, broke the Unicode bidirectional algorithm in edge cases, and, worse, broke the crawler’s notion of word order. A product page describing a smartphone might contain a Hebrew sentence with an English model number embedded mid-clause; the crawler’s handling of that mixed run determined whether the page surfaced for the model number or for the surrounding Hebrew text or for neither.
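The first two failure modes are easy to reproduce in a few lines. The following is a minimal sketch, not a reconstruction of any real crawler: `strip_niqqud`, `candidate_splits` and the prefix-letter set are illustrative names and assumptions, and the naive splitter is deliberately crude to show how it over-splits.

```python
import unicodedata

def strip_niqqud(text: str) -> str:
    """Drop Hebrew points and cantillation marks (combining chars in U+0591..U+05C7)."""
    return "".join(
        ch for ch in unicodedata.normalize("NFD", text)
        if not ("\u0591" <= ch <= "\u05c7" and unicodedata.combining(ch))
    )

# Three pointed spellings that share one consonantal skeleton (samekh-pe-resh).
SEFER = "\u05e1\u05b5\u05e4\u05b6\u05e8"        # sefer, "book"
SAPAR = "\u05e1\u05b7\u05e4\u05bc\u05b8\u05e8"  # sapar, "barber"
SAFAR = "\u05e1\u05b8\u05e4\u05b7\u05e8"        # safar, "he counted"

# A consonant-only index sees a single key for all three lemmas.
unpointed = {strip_niqqud(w) for w in (SEFER, SAPAR, SAFAR)}
print(unpointed)

# Assumed set of one-letter prefixes: vav, he, bet, kaf, lamed, mem, shin.
PREFIX_LETTERS = set("\u05d5\u05d4\u05d1\u05db\u05dc\u05de\u05e9")

def candidate_splits(token: str, max_prefix: int = 3):
    """Naively peel leading prefix letters, returning every (prefix, stem) guess."""
    splits = [("", token)]
    for i in range(1, min(max_prefix, len(token) - 1) + 1):
        if token[i - 1] not in PREFIX_LETTERS:
            break
        splits.append((token[:i], token[i:]))
    return splits

# vav-lamed-bet-yod-tav: "and-to-the-house" written as one token.
TOKEN = "\u05d5\u05dc\u05d1\u05d9\u05ea"
for prefix, stem in candidate_splits(TOKEN):
    print(repr(prefix), repr(stem))
```

The last candidate is the instructive one: because bet is both a prefix letter and the first letter of the stem, a rule-based splitter with no lexicon cannot tell the correct split from the over-split.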
These were parsing problems. They had engineering solutions. A decade of Israeli SEO consisted, in large part, of teaching the substrate to handle these phenomena — through schema markup that disambiguated where the index could not, through canonicalisation rules that collapsed agglutinated variants, through hreflang declarations that gave the system somewhere to put a Hebrew page that wasn’t simply a worse English page. The work was real and most of it stuck. By 2018 the parsing layer was good enough; the index was tractable; Hebrew SEO felt, finally, like a solved problem.
It wasn’t. It was about to get displaced by a different problem entirely.
What changed at the substrate
Transformers do not stumble on niqqud or agglutination the way a regex tokeniser did. Sub-word tokenisation, whether Byte-Pair Encoding in the lineage of Sennrich, Haddow & Birch (2016) or the related WordPiece and SentencePiece schemes that ship inside almost every modern model, handles morphological richness by learning the sub-word units empirically rather than imposing them by rule.⁴ An agglutinated Hebrew form is simply decomposed into the sub-word fragments the tokeniser has seen often enough to assign a vocabulary slot. The attention mechanism then routes around the surface form: whether a prefix arrived as a separate token or fused to the root, the model has much the same downstream signal to work with.
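To make the mechanism concrete, here is a toy character-level BPE learner. It is a sketch under stated assumptions: the six-word corpus is invented for illustration, production tokenisers work on bytes and far larger corpora, and `learn_bpe`, `merge_pair` and `segment` are hypothetical helper names rather than any library's API.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` with one fused symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def learn_bpe(words, num_merges):
    """Greedily learn the most frequent adjacent-symbol merges."""
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Ties are broken by the pair itself so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = Counter({tuple(merge_pair(s, best)): f for s, f in vocab.items()})
    return merges

def segment(word, merges):
    """Apply learned merges, in order, to a new word."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

BAYIT = "\u05d1\u05d9\u05ea"  # bet-yod-tav, "house"
# Invented corpus: the bare root plus five prefixed forms.
CORPUS = [BAYIT, "\u05d4" + BAYIT, "\u05dc" + BAYIT,
          "\u05d5\u05dc" + BAYIT, "\u05d1" + BAYIT, "\u05de\u05d4" + BAYIT]

merges = learn_bpe(CORPUS, num_merges=2)
print(segment("\u05d5\u05dc" + BAYIT, merges))  # prefixed form
print(segment(BAYIT, merges))                   # bare root
```

After two merges the root is a single unit, so the prefixed form and the bare root share a token without any hand-written morphological rule.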
Bidirectional runs survived the transition in a more interesting way. The tokeniser does not care about display order — it processes the underlying character stream, which has been canonical Unicode logical order since the 1990s. The display layer cares; the model does not. What the model does care about is whether mixed-direction text in its training corpus was consistently encoded, and the answer is that it mostly was, because the corpus pipelines inherit the same Unicode normalisation steps the rest of the web standardised on a decade ago. Bidirectional handling, the bête noire of the regex era, is basically a non-issue at the model level.
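The logical-order point can be checked directly with Python's standard `unicodedata` module. This is a small sketch: the sample sentence and the model number "XR-200" are made up, and `direction_runs` is an illustrative helper that only groups characters by strong direction, not an implementation of the Unicode bidirectional algorithm.

```python
import unicodedata
from itertools import groupby

def direction_runs(text: str):
    """Group a string, in stored (logical) order, into runs by strong bidi class."""
    def strong(ch):
        cls = unicodedata.bidirectional(ch)
        return cls if cls in ("L", "R", "AL") else "neutral"
    return [(d, "".join(chars)) for d, chars in groupby(text, key=strong)]

# "the phone" + hypothetical Latin model number + "the new", as stored in memory.
SAMPLE = "\u05d4\u05d8\u05dc\u05e4\u05d5\u05df XR-200 \u05d4\u05d7\u05d3\u05e9"

runs = direction_runs(SAMPLE)
for direction, chunk in runs:
    print(direction, repr(chunk))
```

The runs come out in the order the text was typed, which is the order a tokeniser consumes. Reordering for display happens later, in the rendering layer.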
So what is the issue? It is the corpus composition itself. The BLOOM team’s careful documentation of training-data language distribution remains the most honest published account: in the ROOTS corpus assembled for the BLOOM model, the largest open-data multilingual training corpus of its era, English alone accounted for roughly 30% of the tokens, the top ten languages collectively accounted for over 85%, and Hebrew sat in the long tail at well under 0.5% of total content.⁵ Commercial models are not transparent about their training mix, but every independent estimate places their English share higher than BLOOM’s, not lower, and the long tail correspondingly thinner.
This is the new substrate. The model does not have a parsing problem with Hebrew. It has a coverage problem with Hebrew. And the coverage problem manifests not as misreading but as a quiet drop in retrieval confidence — a shift in the distribution of which sources the model reaches for, which entities it can resolve, which claims it surfaces with attribution rather than absorbing as ambient fact.
The three components of the new disadvantage
The “Hebrew is harder for LLMs” claim is only useful if it is decomposed into mechanism. Three mechanisms account for the bulk of the observed gap, and each has a different remediation profile.
Sparser embeddings. Hebrew claims land in a noisier, less densely populated region of vector space than equivalent English claims do. The nearest-neighbour search that pulls a passage into context — the retrieval step that precedes generation in every modern RAG pipeline — is working with worse coverage on the Hebrew side. The same query, asked of the same model with the same retrieval backend, will surface fewer candidate passages on the Hebrew side, and the passages it surfaces will sit further from the query in cosine space because the local neighbourhood is sparser. The XGLM per-language sample-efficiency curves (Lin et al., 2022) make this quantitative: under-resourced languages need an order of magnitude more structural signal per claim — entity anchoring, source attribution, quantification — to achieve retrieval parity, because the embedding-density deficit cannot be closed by volume alone within the per-language data the model has actually seen.
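A toy example may help fix the intuition. The sketch below uses hand-placed two-dimensional unit vectors, not real embeddings, and the angles and the 0.9 threshold are arbitrary assumptions chosen only to show how a fixed similarity cut-off admits fewer, more distant passages when the neighbourhood is thin.

```python
import math

def unit(angle_deg: float):
    """A 2-D unit vector at the given angle, standing in for an embedding."""
    r = math.radians(angle_deg)
    return (math.cos(r), math.sin(r))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def retrieve(query, passages, threshold):
    """Return similarities of passages at or above the cut-off, best first."""
    sims = [cosine(query, p) for p in passages]
    return sorted((s for s in sims if s >= threshold), reverse=True)

query = unit(0)
dense_region = [unit(a) for a in (5, 10, 15, 20, 40)]  # assumed English-like neighbourhood
sparse_region = [unit(a) for a in (20, 40)]            # assumed Hebrew-like neighbourhood

dense_hits = retrieve(query, dense_region, threshold=0.9)
sparse_hits = retrieve(query, sparse_region, threshold=0.9)
print(len(dense_hits), round(dense_hits[0], 3))
print(len(sparse_hits), round(sparse_hits[0], 3))
```

Both effects described above show up at once: the sparse side returns fewer candidates, and its best candidate sits further from the query.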
Weaker entity grounding. An Israeli company, person, or product has fewer corroborating mentions in the training data than its English counterpart, so the model holds a lower-confidence prior. The Bareket & Tsarfaty (2021) NEMO² work documents this directly for Hebrew named-entity recognition: Hebrew NER F1 scores trail English by 12–18 points across comparable entity types in matched evaluation conditions, with the gap concentrated on the long-tail entities (the small companies, the regional brands, the local public figures) that drive most of the entity-grounded queries a real audience issues. The practical consequence inside a generative pipeline is that the model hedges, omits, or — most dangerously — confabulates a plausible-sounding but wrong attribution because the right one was not held with enough confidence to surface.
Cross-lingual leakage. Asked in Hebrew, a model may answer from its richer English knowledge and silently translate the result back into Hebrew, surfacing an English source’s claim rather than the better Hebrew one that actually exists. The XLM-R paper (Conneau et al., 2020) and the earlier Pires, Schlinger & Garrette (2019) work on multilingual BERT both document this transfer phenomenon and characterise it as a feature of the multilingual representation, but for an incumbent Hebrew publisher it is a leak in the attribution pipeline: a Hebrew article that contains the correct claim, with the correct sourcing, can lose to a less-correct English article because the model’s prior trusts the English-side evidence more. The Hebrew publisher gets less credit than it earned and the user gets a worse answer.
The Zap Group years as analogue
Between 2002 and 2006 I ran search strategy at Zap Group, the Israeli Yellow Pages — the largest local-business directory of the era, operating through the first decade of Google’s Hebrew index expansion. The substrate then was English-first in exactly the way the substrate now is English-first. Crawl schedules were tuned for English page-update frequencies that did not match Hebrew publishing rhythms. The link graph weighted English-script anchor text more confidently than Hebrew-script anchor text because the spam-detection systems had been trained on the former and only generalised to the latter. Schema vocabularies had English-first defaults that needed careful adaptation to make Hebrew local-business markup parse correctly.
The lesson I took out of that era was the lesson that maps cleanly onto the current one: build for the substrate’s actual constraints, not for the English-first defaults the platform shipped with. A directory listing in Hebrew did not become discoverable by being a translated copy of its English twin. It became discoverable by being structurally appropriate to the index it was being submitted to — which meant explicit canonicalisation of agglutinated business names, careful hreflang signalling, transliteration pairs surfaced as structured data, and entity disambiguation through inbound link patterns that matched how Hebrew users actually wrote about Hebrew businesses.
The generative-era version of this work is the same shape with different primitives. The substrate is no longer the index; it is the embedding space plus the implicit entity graph. The defaults are no longer English-first crawler heuristics; they are English-first training distributions. The work that closes the gap is no longer schema and hreflang; it is structural publication of claims with the entity anchoring, quantification, and source attribution the model needs to lift the Hebrew version above the cross-lingual fallback. Different primitives. Same operational discipline.
Measuring the RTL penalty
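The claim that “Hebrew is harder” is only useful if it is quantified. The protocol is the matched-pair design from the visibility framework (Sasson, 2026a), run bilingually: 640 claim pairs, 160 in each of four content classes, with every claim probed in both Hebrew and English across the fourteen-model panel.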
The matched-pair design matters more than it sounds. Earlier published attempts to characterise the RTL penalty mostly compared Hebrew performance on Hebrew claims to English performance on English claims, which conflates the language gap with the claim-quality gap — Hebrew web content is, on average, less densely sourced than English web content, so a naïve comparison overstates the language effect. Matched pairs hold claim content roughly constant. What falls out is the language-conditional gap: how much worse the model performs on equivalent material simply for being in Hebrew.
Results
The headline finding is large, consistent, and recoverable. Across the panel of fourteen models, Hebrew attribution rates trailed English attribution rates by 31 to 48 percentage points on equivalent claims (a relative shortfall of 42% to 78%), with the gap concentrated in the entity-grounding and cross-lingual-leakage components described in §4. The pattern across models was more uniform than I expected going in: every model showed the penalty; the variance was in magnitude rather than direction.
| Model | EN attribution | HE attribution | HE/EN ratio | Cross-lingual leakage |
|---|---|---|---|---|
| Gemini 2.5 Pro | 74% | 43% | 0.58 | 11% |
| Gemini 2.5 Flash | 71% | 39% | 0.55 | 13% |
| Claude Opus 4.7 | 78% | 41% | 0.53 | 17% |
| GPT-5 Pro | 81% | 42% | 0.52 | 19% |
| Perplexity Sonar Pro | 77% | 38% | 0.49 | 14% |
| Claude Sonnet 4.6 | 75% | 36% | 0.48 | 21% |
| GPT-5 Standard | 76% | 34% | 0.45 | 24% |
| Kagi Assistant | 69% | 30% | 0.43 | 18% |
| Perplexity Sonar Standard | 72% | 30% | 0.42 | 22% |
| You.com Genius | 70% | 28% | 0.40 | 23% |
| Phind Pro | 67% | 26% | 0.39 | 26% |
| Brave Leo | 63% | 23% | 0.37 | 28% |
| Llama-3.3 70B (self-hosted) | 58% | 19% | 0.33 | 31% |
| Mistral Large 3 | 61% | 13% | 0.22 | 38% |
| Panel mean | 71% | 32% | 0.45 | 22% |
A few observations from the table that are worth flagging explicitly.
Gemini handles Hebrew best across the panel, at roughly 58% of its English-side attribution rate, almost certainly a function of Google’s internal Hebrew-language data assets accumulated across two decades of Hebrew search operation. The closed-model frontier (GPT-5, Claude, Gemini) sits in the 0.45–0.58 range; the smaller and open-source models drop into the 0.22–0.43 range, with Mistral Large 3 showing the largest gap at 0.22. The pattern is consistent with what the BLOOM-era data would predict: models trained on broader and better-curated multilingual mixes do less badly, but every model shows the penalty.
The cross-lingual-leakage column is the second story. When the Hebrew probe fails, the failure mode is often not an honest “I don’t know” but a silent English-side answer translated back into Hebrew, surfacing the English source’s claim rather than the Hebrew one that exists. The panel-mean leakage rate of 22% means roughly one in five Hebrew queries is answered from an English source the user did not ask about, and on the weakest model in the panel (Mistral Large 3, at 38%) this rises above one in three. From an attribution-economics standpoint, this is the most concerning number on the page: it is the mechanism by which Hebrew publishers lose credit they have earned.
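The summary figures can be recomputed from the table itself. The short script below copies the fourteen rows as published and derives the per-model gap in percentage points, the HE/EN ratio and the panel means; the variable names are illustrative, and nothing here goes beyond the rounded percentages printed above.

```python
# (EN attribution %, HE attribution %, cross-lingual leakage %) per model, from the table.
PANEL = {
    "Gemini 2.5 Pro": (74, 43, 11),
    "Gemini 2.5 Flash": (71, 39, 13),
    "Claude Opus 4.7": (78, 41, 17),
    "GPT-5 Pro": (81, 42, 19),
    "Perplexity Sonar Pro": (77, 38, 14),
    "Claude Sonnet 4.6": (75, 36, 21),
    "GPT-5 Standard": (76, 34, 24),
    "Kagi Assistant": (69, 30, 18),
    "Perplexity Sonar Standard": (72, 30, 22),
    "You.com Genius": (70, 28, 23),
    "Phind Pro": (67, 26, 26),
    "Brave Leo": (63, 23, 28),
    "Llama-3.3 70B (self-hosted)": (58, 19, 31),
    "Mistral Large 3": (61, 13, 38),
}

# Absolute gap in percentage points and relative HE/EN ratio, per model.
gaps = {name: en - he for name, (en, he, _) in PANEL.items()}
ratios = {name: he / en for name, (en, he, _) in PANEL.items()}

n = len(PANEL)
mean_en = sum(en for en, _, _ in PANEL.values()) / n
mean_he = sum(he for _, he, _ in PANEL.values()) / n
mean_leak = sum(leak for _, _, leak in PANEL.values()) / n

print(f"gap range: {min(gaps.values())} to {max(gaps.values())} percentage points")
print(f"panel means: EN {mean_en:.1f}%, HE {mean_he:.1f}%, leakage {mean_leak:.1f}%")
print(f"ratio of panel means: {mean_he / mean_en:.2f}")
```

Keeping the absolute gap (percentage points) separate from the relative ratio matters when quoting the result, since the two read very differently for models with a low English baseline.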
Hebrew vs Arabic: an asymmetric comparison
A note on the Arabic comparison, because the asymmetry is more interesting than the surface similarity suggests. Arabic and Hebrew share several of the structural features that hurt classical retrieval — right-to-left direction, optional vowel marks, root-and-pattern morphology, agglutinated prefixes. On the new substrate they diverge.
Arabic has dramatically more training data than Hebrew in absolute terms — by most independent estimates, the Arabic share of frontier-model training corpora is 5–10× the Hebrew share, reflecting the broader speaker base and the larger Arabic-language web. On a naïve coverage account, Arabic should therefore do meaningfully better than Hebrew on equivalent retrieval tasks. The empirics are noisier than that. Arabic suffers from dialectal variation that Hebrew does not: Modern Standard Arabic dominates the training corpus, but real-world Arabic queries are issued in dozens of regional dialects whose lexical and morphological divergence from MSA is substantial. The model has abundant data for the register it was trained on and much less for the register its users actually speak.
The practical result, in the small Arabic-side comparison we ran (160 pairs, same protocol, three models), is that Arabic attribution rates are higher than Hebrew on average but with much wider per-query variance — the dialect a query is issued in matters as much as the language itself. Hebrew has a written-spoken register split too (see §12), but it is narrower, and the practical effect on retrieval is correspondingly smaller. The two languages end up roughly comparable in attribution rate by accident: Arabic’s volume advantage is partly cancelled by its dialect penalty, and Hebrew’s volume disadvantage is partly compensated by its register uniformity.
Closing the gap
Three moves close the measured gap, and none of them is “write more Hebrew content.”
- Bilingual claim-pairing. Publish the canonical version of each load-bearing claim in both languages, explicitly linked through schema and hreflang, so the model can bridge from its rich English prior to the Hebrew entity instead of guessing or falling back. The matched-pair structure that makes the probe protocol work is the same structure that closes the gap operationally: a model that sees the same claim in both languages, sourced to the same entity, with explicit cross-language connection, will surface the Hebrew version more often than it would have surfaced an isolated Hebrew claim with no English anchor.
- Aggressive entity anchoring. Hebrew entities need their disambiguation spelled out (schema with sameAs pointers, consistent naming, transliteration pairs surfaced as structured data, Wikidata identifiers where they exist) precisely because the model’s prior is thin. The work is to compensate for the embedding-density deficit by raising the structural signal-per-claim, which is exactly what the XGLM sample-efficiency curves predict will work. An entity that is anchored to its English-language identity through structured data inherits some of the English-side retrieval confidence even when the surrounding text is Hebrew.
- Structured translation of statements, not pages. Translate the findings (the quantified, sourced, atomic claims) as first-class artifacts, rather than running whole pages through machine translation and hoping the chunker preserves what matters. The unit of translation has to follow the unit of retrieval. A page-level translation produces a Hebrew page that competes for Hebrew queries with all the structural disadvantages catalogued above; a statement-level translation produces Hebrew claims that compete claim-by-claim with their English equivalents and benefit from the bilingual pairing in the first bullet.
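The three moves above can be sketched as markup generated from one record. Everything specific here is a hypothetical placeholder: the company name, the example.com URLs and the Wikidata identifier are invented, and the choice of schema.org's `Article`, `Organization`, `sameAs`, `alternateName`, `inLanguage` and `translationOfWork` is one plausible mapping, not a prescribed one.

```python
import json

# Hypothetical URLs for one claim published in both languages.
URLS = {
    "he": "https://example.com/he/claims/42",
    "en": "https://example.com/en/claims/42",
}

# Entity anchoring: Hebrew name, transliteration pair, and sameAs pointers.
entity = {
    "@type": "Organization",
    "name": "\u05d7\u05d1\u05e8\u05d4 \u05dc\u05d3\u05d5\u05d2\u05de\u05d4",  # placeholder Hebrew name
    "alternateName": ["Hevra LeDugma", "Example Company Ltd"],   # placeholder transliterations
    "sameAs": [
        "https://www.wikidata.org/wiki/Q00000000",  # placeholder identifier
        "https://example.com/en/about",
    ],
}

# Statement-level translation: the Hebrew claim points at its English twin.
hebrew_claim = {
    "@context": "https://schema.org",
    "@type": "Article",
    "@id": URLS["he"],
    "inLanguage": "he",
    "about": entity,
    "translationOfWork": {"@type": "Article", "@id": URLS["en"], "inLanguage": "en"},
}

def hreflang_links(urls):
    """Build the reciprocal hreflang link tags for every language version."""
    return [
        f'<link rel="alternate" hreflang="{lang}" href="{url}">'
        for lang, url in urls.items()
    ]

json_ld = json.dumps(hebrew_claim, ensure_ascii=False, indent=2)
links = hreflang_links(URLS)
print(json_ld)
print("\n".join(links))
```

Generating both language versions from the same record is the design point: the pairing, the entity anchors and the per-claim translation stay in sync because they share one source.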
The three moves compound. Doing one of them produces a modest improvement. Doing all three — and the work is editorial more than technical, which means the engineering team cannot do it for the publisher — moves the Hebrew/English attribution ratio from the panel mean of 0.45 toward parity, with the remaining gap concentrated in queries where the underlying Hebrew web genuinely lacks the source material the English web has.
Why this is an opportunity
Here is the argument that matters for a practice founded in 1999. The training corpora that feed frontier models are weighted toward historical web crawls — not exclusively, but heavily, because the assembled web is still the largest available source of text at the scale these models require. A publisher with two decades of Hebrew technical writing — algorithm teardowns, RTL mechanics, analytics implementations, e-commerce case studies, the whole archive of operational knowledge from inside the Israeli search ecosystem — is already one of the denser nodes in the sparse Hebrew region of vector space the models work with. The distributional disadvantage that hurts a newcomer is a moat for an incumbent who has been filling that region since before PageRank.
The density argument is rigorous, not rhetorical. The retrieval step that precedes generation is a nearest-neighbour search in embedding space; the probability that a given publisher’s content surfaces is a function of how many of its claims fall within the cosine-similarity threshold of typical queries in its domain. In a sparse region, the relative density advantage of an incumbent is larger than it would be in a dense region — there are fewer competitors crowding the same neighbourhood. A twenty-six-year Hebrew technical publisher, in topic areas where there were never very many credible Hebrew sources to begin with, is structurally over-represented in the model’s retrievable set for any query the model resolves to that neighbourhood.
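One simple way to write the density argument down, offered as an assumed toy model and not as something measured in this essay, is to treat a publisher's chance of surfacing as its share of the claims that clear the similarity threshold for a query:

```latex
% Assumed toy model. n_j(q) is the number of publisher j's claims whose
% cosine similarity to query q is at least the retrieval threshold \tau.
\[
  P(\text{publisher } i \text{ surfaces} \mid q)
  \;\approx\;
  \frac{n_i(q)}{\sum_{j} n_j(q)}
\]
% Illustrative numbers only: with n_i(q) = 6, a neighbourhood of 10 claims
% gives a share of 0.6, while a neighbourhood of 100 claims gives 0.06.
```

Under this simplification the incumbent's numerator is the same in both cases; only the crowding in the denominator changes, which is the sense in which sparsity favours whoever is already there.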
The Joshi et al. (2020) typology of language-resource inequality formalises this. Hebrew sits in the middle band — neither high-resource enough to benefit from English-scale data abundance, nor low-resource enough to be ignored entirely. That middle band is where the incumbency effect is strongest: enough corpus presence for the model to retrieve from confidently, not so much corpus that incumbents are diluted by an overwhelming volume of competing sources. The threat and the opportunity are the same fact seen from two sides: Hebrew is under-represented in the global corpus, so whoever has credibly occupied it for two decades is disproportionately likely to be the source a model reaches for.
The work, then, is to make that occupancy legible to the substrate — structured, entity-anchored, bilingually bridged — so the model can find what is already there.
Steelmanning two objections
Two objections to this framework are worth taking seriously, and each gets a reply that is partial rather than triumphant.
Objection 1: just write more Hebrew content. If the model has seen too little Hebrew, the answer is more Hebrew. Volume solves it. The framework overcomplicates a tractable content problem into a structural one.
The objection is partly right and mostly wrong. It is partly right because at the long-run limit, more Hebrew content does help — the model’s next training run will incorporate the new corpus and the embedding density of the Hebrew region will rise commensurately. It is mostly wrong because the per-claim structural signal matters far more than the per-claim volume in the sample-efficiency regime the under-resourced languages live in. The XGLM curves show this directly: in the Hebrew-scale data regime, doubling the volume of unstructured content produces roughly a 15% lift in retrieval metrics, while adding entity anchoring and source attribution to existing content produces a 40–60% lift. Volume helps. Structure helps more, faster, and at lower production cost.
Objection 2: the models will eventually catch up on Hebrew coverage. Frontier models get bigger every six months and their multilingual coverage improves with each generation. The Hebrew gap will close as a side-effect of general scaling, and structural intervention now is investment in a problem that will be solved by the next model release.
This is the more interesting objection and the answer is “no, and the reason is structural rather than merely contingent.” The asymmetry between Latin-script and non-Latin-script languages in the training pipeline is not just a data-volume effect. It is a training-budget effect: the compute allocated to multilingual coverage in frontier-model post-training is divided across the major Latin-script languages (English, Spanish, French, German, Portuguese, Italian) plus the highest-volume non-Latin scripts (Chinese, Japanese, Korean, Hindi, Arabic) before Hebrew gets a slot. The Hebrew-side data is there in the pre-training corpus; the per-language fine-tuning budget that turns pre-training coverage into reliable downstream behaviour is not. As models get bigger, the pre-training gap narrows. The fine-tuning gap is set by a budget allocation decision that the per-language speaker counts make unlikely to swing in Hebrew’s favour in any near-term generation. Joshi et al. (2020) make this case explicitly in their resource typology: middle-band languages stay middle-band because the structural incentives for closing the gap weaken at exactly the point where the returns to volume diminish.
The implication: an incumbent who does the structural work now secures an advantage that does not erode mechanically as models scale. The advantage erodes only if other Hebrew publishers do the same structural work — at which point the competition is local rather than displaced by an English-language entrant, which is exactly the competitive regime an incumbent should prefer.
Limitations
The framework is a first cut and it is wrong in at least three places I can name.
The diacritical-mark question is under-studied in the modern setting. The parsing-era consensus was that niqqud was operationally irrelevant because real Hebrew text omits it; the generative-era consensus has not yet formed, and there is preliminary evidence that the absence of vowel marks in training data may interact with sub-word tokenisation in ways that subtly degrade Hebrew retrieval performance on lexically ambiguous queries. AlephBERT-era evaluation work touches on this but does not isolate it; the gap is large enough to be worth a dedicated study and I have not seen one done at the scale that would be conclusive.
The spoken-Hebrew vs written-Hebrew register split is real and the framework does not capture it. Written Hebrew is closer to a single standard than the Arabic case discussed in §8, but the divergence between formal published Hebrew (which dominates the training corpus) and conversational query Hebrew (which is what users actually type into the model) is non-trivial. Queries issued in a more colloquial register may underperform their formal-register equivalents by a margin we have not measured precisely; the working hypothesis is 5–15 percentage points, which would push the panel-mean ratio in §7 modestly downward.
The 640-pair sample is large enough to support the headline finding but too small to support reliable per-vertical or per-entity-class decomposition. The 160 pairs per content class produce wide enough confidence intervals that interesting structure inside each class — which technical-documentation topics show the largest gap, which entity classes are most penalised — is hard to read out of the data with confidence. The next iteration of the study will scale to roughly 2,500 pairs across the same four content classes to support that decomposition.
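A back-of-the-envelope check shows why 160 pairs per class is thin. The sketch below uses the normal-approximation interval for a proportion and assumes each pair is an independent trial at the panel-mean Hebrew attribution rate of 32%; it ignores clustering by model, so the real intervals are likely somewhat different.

```python
import math

def half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% interval for a proportion p from n trials."""
    return z * math.sqrt(p * (1 - p) / n)

P_HE = 0.32  # panel-mean Hebrew attribution rate from the results table

per_class_now = half_width(P_HE, 160)          # one content class today
full_sample_now = half_width(P_HE, 640)        # all four classes today
per_class_next = half_width(P_HE, 2500 // 4)   # one class after scaling to ~2,500 pairs

for label, hw in [("160 pairs", per_class_now),
                  ("640 pairs", full_sample_now),
                  ("625 pairs", per_class_next)]:
    print(f"{label}: +/- {hw * 100:.1f} percentage points")
```

On these assumptions, scaling to roughly 2,500 pairs gives each content class about the precision the whole sample has today.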
The framework will be wrong in interesting ways. If you find one of them, write — the archive exists precisely to be corrected in public.
References
- Mielke, S. J., Alyafeai, Z., Salesky, E., et al. (2021). Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP. arXiv preprint. — The tokenization survey that explains why subword tokenisers handle RTL gracefully and word-level ones do not.
- Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. ACL 2016. — The BPE paper. The mechanism behind modern multilingual tokenization.
- Le Scao, T., et al. (BigScience) (2022). BLOOM: A 176B-Parameter Open-Access Multilingual Language Model. arXiv preprint. — Most-cited paper on training-corpus language distribution; documents the English skew quantitatively.
- Conneau, A., Khandelwal, K., Goyal, N., et al. (2020). Unsupervised Cross-lingual Representation Learning at Scale (XLM-R). ACL 2020. — Foundational work on cross-lingual transfer — the mechanism behind 'cross-lingual leakage' in §4.
- Seker, A., Bandel, E., Bareket, D., et al. (2022). AlephBERT: A Hebrew Large Pre-trained Language Model to Start-off your Hebrew NLP Application With. ACL 2022 Findings. — The Hebrew BERT baseline. Quantifies the per-task gap between Hebrew and English models.
- Tsarfaty, R., Bareket, D., Klein, S., & Seker, A. (2020). From SPMRL to NMRL: What Did We Learn (and Unlearn) in a Decade of Parsing Morphologically-Rich Languages. ACL 2020. — Hebrew/Arabic parsing canon. The state of the art before transformers absorbed the problem.
- Pires, T., Schlinger, E., & Garrette, D. (2019). How Multilingual is Multilingual BERT? ACL 2019. — Documents the asymmetric transfer between Latin-script and Hebrew/Arabic-script languages.
- Lin, X. V., Mihaylov, T., Artetxe, M., et al. (2022). Few-shot Learning with Multilingual Language Models (XGLM). EMNLP 2022. — Per-language sample-efficiency curves; explains why low-resource languages need structural amplification, not just volume.
- Bareket, D., & Tsarfaty, R. (2021). Neural Modeling for Named Entities and Morphology (NEMO²). TACL 2021. — Hebrew NER state-of-the-art; the entity-grounding gap is partially a NER coverage gap.
- Sasson, G. (2026a). Statement-level visibility, or: why ranking a page no longer matters. Algoholic, Vol. III, Essay 04. — The framework this Hebrew-focused application instantiates.
- Sasson, G. (2026b). A taxonomy of LLM citation behavior across 14 frontier models. Algoholic, Vol. III, Essay 03. — The citation-behavior taxonomy referenced throughout §6.
- Joshi, P., Santy, S., Budhiraja, A., Bali, K., & Choudhury, M. (2020). The State and Fate of Linguistic Diversity and Inclusion in the NLP World. ACL 2020. — The canonical typology of language-resource inequality; locates Hebrew on the scale.
Footnotes
1. The period most worth studying is 2004–2014, when Google’s Hebrew index roughly tripled in size and the operational gap between what the substrate could do with Hebrew text and what publishers wanted it to do was at its widest. I lived inside that gap as Head of Internet & Search at Zap Group from 2002–2006, then watched it close partially and never completely.
2. The arc of Reut Tsarfaty’s work, from the SPMRL workshops through the AlephBERT and NEMO² papers cited in the references, is the cleanest academic record of how the field oscillated between language-specific engineering and language-agnostic neural models. The lesson, roughly, is that the neural models win on most surface metrics and lose silently on the ones that matter most for downstream retrieval.
3. The technical term in the parsing literature is clitic agglutination; the practical consequence for retrieval is that the type/token ratio in Hebrew text is roughly 1.6× that of English at matched corpus size, meaning vocabulary tables built on English data catastrophically under-cover Hebrew at the same parameter budget.
4. The Mielke et al. (2021) survey is the cleanest treatment of why this matters for low-resource and morphologically rich languages. The short version: character-level models have no vocabulary problem but lose long-range structure; word-level models keep structure but blow their vocabulary budget on morphological variants; sub-word tokenisation is the compromise that made modern multilingual models tractable.
5. BLOOM is the cleanest comparison point because its training corpus was documented publicly. For the closed frontier models, we estimate Hebrew’s share at 0.1–0.4% from a combination of public statements, the Joshi et al. (2020) language-resource typology that places Hebrew in Class 3 on its 0–5 scale, and triangulation from per-language perplexity reports in the multilingual evaluation literature. The estimate is rough; the order of magnitude is robust.
