AI crawler policy generator.
Decide what the AI crawlers may do with your site — train on it, index it for citation, fetch it on demand —
and get a copy-paste robots.txt, llms.txt, ai.txt and meta bundle.
Every rule is annotated with which bots actually honor it, because most "AI blockers" quietly do nothing.
§ 01
The honest reference
Which crawlers actually obey.
A Disallow is a request, not a fence. Some bots honor it, some ignore it, some carve out "user-triggered" exceptions. This is the state of play as of 2026-Q2.
| Crawler | Vendor | Purpose | Compliance | Notes |
|---|---|---|---|---|
| GPTBot | OpenAI | Model training | Honors robots.txt | Bulk crawl for model training. |
| OAI-SearchBot | OpenAI | AI search & citation | Honors robots.txt | Indexes for ChatGPT Search citations. Block this and you lose ChatGPT Search visibility. |
| ChatGPT-User | OpenAI | On-demand user fetch | Honors robots.txt | Fetches a page when a user pastes or links it in ChatGPT. On-demand, not bulk. |
| ClaudeBot | Anthropic | Model training | Honors robots.txt | Bulk crawl for training. |
| Claude-User | Anthropic | On-demand user fetch | Honors robots.txt | On-demand fetch when a user references your URL in Claude. |
| Google-Extended | Google | Model training | Honors robots.txt | The ONLY lever for Gemini/Vertex training. Blocking it does NOT affect Google Search ranking; the two are separate. |
| Googlebot | Google | Classic search | Honors robots.txt | Classic Search crawl. ⚠ Blocking this removes you from Google Search AND AI Overviews. Almost never block. |
| CCBot | Common Crawl | Model training | Honors robots.txt | Feeds a large share of open training datasets. |
| PerplexityBot | Perplexity | AI search & citation | Partial / disputed | Indexes for Perplexity answers. Documented to honor robots.txt; independent reports of stealth fetching exist. |
| Perplexity-User | Perplexity | On-demand user fetch | Partial / disputed | On-demand user fetch. Perplexity states this bypasses robots.txt by design. |
| Applebot-Extended | Apple | Model training | Honors robots.txt | Apple Intelligence training control. (Applebot itself still serves Siri/Spotlight.) |
| Meta-ExternalAgent | Meta | Model training | Honors robots.txt | Meta AI training/crawl. |
| Amazonbot | Amazon | AI search & citation | Honors robots.txt | Powers Alexa answers + Amazon AI. |
| Bytespider | ByteDance | Model training | Ignores robots.txt | ⚠ Widely reported to ignore robots.txt. Blocking is best-effort; pair with a server/WAF rule. |
| cohere-ai | Cohere | Model training | Honors robots.txt | Cohere model crawl. |
| Diffbot | Diffbot | Model training | Honors robots.txt | Knowledge-graph + training data crawler. |
| ImagesiftBot | ImageSift | Image training | Honors robots.txt | Image crawl feeding image-training datasets. |
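For crawlers the table marks as ignoring robots.txt, the block has to happen on the server. The sketch below shows one way that could look as a small WSGI middleware that matches on the user-agent string. The `DENY_UA_TOKENS` list, `should_block`, and `deny_listed_bots` are hypothetical names chosen for illustration, not part of any product or library, and the sample user-agent strings in the usage lines are made up. Because a user-agent header can be spoofed, this is best-effort and would normally sit alongside IP-range rules at the CDN or WAF.

```python
from typing import Optional

# Hypothetical deny list (an assumption for this sketch). Extend it with
# whichever crawlers your own logs show ignoring robots.txt.
DENY_UA_TOKENS = ("bytespider",)


def should_block(user_agent: Optional[str]) -> bool:
    """Return True if the user-agent contains a deny-listed token (case-insensitive)."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in DENY_UA_TOKENS)


def deny_listed_bots(app):
    """Wrap a WSGI app so deny-listed user agents receive a 403 instead of content."""
    def middleware(environ, start_response):
        if should_block(environ.get("HTTP_USER_AGENT")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware


# Illustrative user-agent strings, not real crawler signatures.
print(should_block("Mozilla/5.0 (compatible; Bytespider)"))  # True
print(should_block("Mozilla/5.0 (compatible; GPTBot/1.0)"))  # False
```

The substring match is deliberately loose so that version suffixes and surrounding browser tokens do not defeat it; a stricter deployment might match exact product tokens instead.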
- robots.txt is honor-system. Reputable vendors (OpenAI, Anthropic, Google) respect it. Others (Bytespider) are documented to ignore it. For the bots that ignore it, the only real control is a server / CDN / WAF block on the user-agent or IP range.
- Blocking training ≠ removing what's already trained. These directives are forward-looking. Content already in a training set stays there until the next model generation, if then.
- Don't block Googlebot. It powers both Google Search and AI Overviews. Google-Extended is the separate, safe lever for Gemini training; blocking it leaves Search untouched.
- llms.txt is a content guide, not a fence. It tells models where your best content is; it does not restrict access, and major models do not yet act on it: independent log studies find ~97% of llms.txt files are never read. See the field note for the preregistered test. Ship it for the upside, not for protection.
- The strategic question isn't "block or allow" — it's "which, and why." Most visibility-focused sites should allow citation crawlers (you want to be cited) while blocking bulk training (you don't want to be raw material). That's the "Protect IP, stay citable" preset.
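As a rough illustration of the "Protect IP, stay citable" idea, the sketch below builds a robots.txt that disallows the training crawlers from the table and leaves citation and on-demand crawlers on the default allow rule. The `CRAWLERS` mapping and `build_robots_txt` are hypothetical names for this example, and the actual preset output of the generator may differ. The result is then checked with Python's standard `urllib.robotparser`, which shows how a compliant crawler would read the file; it says nothing about bots that ignore robots.txt. Googlebot is intentionally left out of the mapping so it can never be blocked by accident.

```python
from urllib.robotparser import RobotFileParser

# Crawler -> purpose, taken from the reference table. The short purpose
# labels are an assumption of this sketch; image training is folded into
# "training".
CRAWLERS = {
    "GPTBot": "training",
    "OAI-SearchBot": "search",
    "ChatGPT-User": "on-demand",
    "ClaudeBot": "training",
    "Claude-User": "on-demand",
    "Google-Extended": "training",
    "CCBot": "training",
    "PerplexityBot": "search",
    "Perplexity-User": "on-demand",
    "Applebot-Extended": "training",
    "Meta-ExternalAgent": "training",
    "Amazonbot": "search",
    "Bytespider": "training",
    "cohere-ai": "training",
    "Diffbot": "training",
    "ImagesiftBot": "training",
}


def build_robots_txt(blocked_purposes):
    """Emit one Disallow group per blocked crawler, then a default allow group."""
    groups = [
        f"User-agent: {bot}\nDisallow: /\n"
        for bot, purpose in CRAWLERS.items()
        if purpose in blocked_purposes
    ]
    groups.append("User-agent: *\nAllow: /\n")
    return "\n".join(groups)


robots_txt = build_robots_txt({"training"})

# Read the file the way a compliant crawler would.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

url = "https://example.com/article"
print(parser.can_fetch("GPTBot", url))           # False
print(parser.can_fetch("OAI-SearchBot", url))    # True
print(parser.can_fetch("Google-Extended", url))  # False
print(parser.can_fetch("Googlebot", url))        # True
```

Writing one group per crawler keeps each rule easy to annotate or remove individually, at the cost of a longer file than a single group with several `User-agent` lines.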
Going deeper on the risk side? Read the visibility research, or book an audit: the GEO & LLM-visibility audit ($18K–$28K) includes a full crawler-and-citation posture review.