
AI crawler policy generator.

Decide what AI crawlers may do with your site (train on it, index it for citation, fetch it on demand) and get a copy-paste robots.txt, llms.txt, ai.txt and meta bundle. Every rule is annotated with which bots actually honor it, because most "AI blockers" quietly do nothing.

§ 01
The honest reference

Which crawlers actually obey.

A Disallow is a request, not a fence. Some bots honor it, some ignore it, some carve out "user-triggered" exceptions. This is the state of play as of 2026-Q2.

| Crawler | Vendor | Purpose | Compliance | Notes |
|---|---|---|---|---|
| GPTBot | OpenAI | Model training | Honors robots.txt | Bulk crawl for model training. Honors robots.txt. |
| OAI-SearchBot | OpenAI | AI search & citation | Honors robots.txt | Indexes for ChatGPT Search citations. Block this and you lose ChatGPT-Search visibility. |
| ChatGPT-User | OpenAI | On-demand user fetch | Honors robots.txt | Fetches a page when a user pastes/links it in ChatGPT. On-demand, not bulk. |
| ClaudeBot | Anthropic | Model training | Honors robots.txt | Bulk crawl for training. Honors robots.txt. |
| Claude-User | Anthropic | On-demand user fetch | Honors robots.txt | On-demand fetch when a user references your URL in Claude. |
| Google-Extended | Google | Model training | Honors robots.txt | The ONLY lever for Gemini/Vertex training. Blocking it does NOT affect Google Search ranking; the two are separate. |
| Googlebot | Google | Classic search | Honors robots.txt | Classic Search crawl. ⚠ Blocking this removes you from Google Search AND AI Overviews. Almost never block. |
| CCBot | Common Crawl | Model training | Honors robots.txt | Feeds a large share of open training datasets. Honors robots.txt. |
| PerplexityBot | Perplexity | AI search & citation | Partial / disputed | Indexes for Perplexity answers. Documented to honor robots.txt; independent reports of stealth fetching exist. |
| Perplexity-User | Perplexity | On-demand user fetch | Partial / disputed | On-demand user fetch. Perplexity states this bypasses robots.txt by design. |
| Applebot-Extended | Apple | Model training | Honors robots.txt | Apple Intelligence training control. (Applebot itself still serves Siri/Spotlight.) |
| Meta-ExternalAgent | Meta | Model training | Honors robots.txt | Meta AI training/crawl. Honors robots.txt. |
| Amazonbot | Amazon | AI search & citation | Honors robots.txt | Powers Alexa answers + Amazon AI. Honors robots.txt. |
| Bytespider | ByteDance | Model training | Ignores robots.txt | ⚠ Widely reported to ignore robots.txt. Blocking is best-effort; pair with a server/WAF rule. |
| cohere-ai | Cohere | Model training | Honors robots.txt | Cohere model crawl. Honors robots.txt. |
| Diffbot | Diffbot | Model training | Honors robots.txt | Knowledge-graph + training data crawler. |
| ImagesiftBot | ImageSift | Image training | Honors robots.txt | Image crawl feeding image-training datasets. |
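
For a crawler listed above as ignoring robots.txt, the rule has to live on the server instead. The following is a minimal Python sketch of a user-agent block written as WSGI middleware. The names `BLOCKED_AGENT_TOKENS`, `is_blocked_agent` and `block_crawlers` are illustrative, not part of any tool described here, and the sketch assumes the bot sends its real user-agent string, which a stealth fetcher may not do.

```python
from typing import Callable, Iterable

# Assumption: tokens are matched as case-insensitive substrings of the
# User-Agent header. Bytespider is the one crawler the table marks as
# ignoring robots.txt; extend the tuple as needed.
BLOCKED_AGENT_TOKENS = ("bytespider",)


def is_blocked_agent(user_agent: str,
                     tokens: Iterable[str] = BLOCKED_AGENT_TOKENS) -> bool:
    """Return True if the User-Agent contains any blocked token."""
    ua = user_agent.lower()
    return any(token in ua for token in tokens)


def block_crawlers(app: Callable) -> Callable:
    """Wrap a WSGI app so blocked user-agents receive a 403 response."""
    def middleware(environ, start_response):
        if is_blocked_agent(environ.get("HTTP_USER_AGENT", "")):
            body = b"Forbidden"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Everyone else passes through to the wrapped application.
        return app(environ, start_response)
    return middleware
```

In practice the same match usually belongs at the CDN or WAF layer, ideally paired with an IP-range rule, because a user-agent string is trivial to spoof.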
  • robots.txt is honor-system. Reputable vendors (OpenAI, Anthropic, Google) respect it. Others (Bytespider) are widely reported to ignore it. For the bots that ignore it, the only real control is a server / CDN / WAF block on the user-agent or IP range.
  • Blocking training ≠ removing what's already trained. These directives are forward-looking. Content already in a training set stays there until the next model generation, if then.
  • Don't block Googlebot. It powers Google Search and AI Overviews both. Google-Extended is the separate, safe lever for Gemini training — blocking it leaves Search untouched.
  • llms.txt is a content guide, not a fence. It tells models where your best content is; it does not restrict access, and major models do not yet act on it — independent log studies find ~97% of llms.txt files are never read. See the field note for the preregistered test. Ship it for the upside, not for protection.
  • The strategic question isn't "block or allow" — it's "which, and why." Most visibility-focused sites should allow citation crawlers (you want to be cited) while blocking bulk training (you don't want to be raw material). That's the "Protect IP, stay citable" preset.
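
As an illustration of the "allow citation, block training" split, here is a minimal Python sketch that builds such a robots.txt from the crawler tokens in the table and checks it with the standard library's `urllib.robotparser`. The function name `build_robots_txt` and the exact file layout are assumptions; the generator's real output may differ. On-demand fetchers and Googlebot are deliberately left to the default `*` group, which is also an assumption about the preset.

```python
from urllib.robotparser import RobotFileParser

# Crawler tokens grouped by purpose, copied from the table above.
TRAINING_BOTS = [
    "GPTBot", "ClaudeBot", "Google-Extended", "CCBot", "Applebot-Extended",
    "Meta-ExternalAgent", "Bytespider", "cohere-ai", "Diffbot", "ImagesiftBot",
]
CITATION_BOTS = ["OAI-SearchBot", "PerplexityBot", "Amazonbot"]


def build_robots_txt() -> str:
    """Build a robots.txt that blocks training bots and allows citation bots."""
    lines = ["# Bulk training crawlers: disallowed"]
    for bot in TRAINING_BOTS:
        lines += [f"User-agent: {bot}", "Disallow: /", ""]

    lines.append("# AI search and citation crawlers: allowed")
    for bot in CITATION_BOTS:
        lines += [f"User-agent: {bot}", "Allow: /", ""]

    # Everything else, including Googlebot, falls through to the default group.
    lines += ["# Default", "User-agent: *", "Allow: /"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    robots_txt = build_robots_txt()
    print(robots_txt)

    # Check what a compliant parser would conclude from the file.
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    for agent in ("GPTBot", "OAI-SearchBot", "Googlebot"):
        print(agent, parser.can_fetch(agent, "https://example.com/post"))
```

The parser check only shows what a compliant crawler would decide. A bot that ignores robots.txt never runs this logic, so the Bytespider entry here is best-effort.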