AI crawler policy generator.

Decide what AI crawlers may do with your site (train on it, index it for citation, fetch it on demand) and get a copy-paste bundle: robots.txt, llms.txt, ai.txt and meta tags. Every rule is annotated with which bots actually honor it, because most "AI blockers" quietly do nothing.

What should AI systems be allowed to do?

Block bulk training: Stop GPTBot, ClaudeBot, Google-Extended, CCBot… from training on your content.
Block AI search & citation: Stop Perplexity, ChatGPT-Search, Amazon from indexing you for answers. Usually you want this OFF (you want the citations).
Block on-demand user fetch: Stop ChatGPT-User / Claude-User from fetching a page a user pastes.
Block image training: Add image-AI opt-outs (ImagesiftBot + noimageai meta).

§ 01

The honest reference

Which crawlers actually obey.

A Disallow is a request, not a fence. Some bots honor it, some ignore it, some carve out "user-triggered" exceptions. This is the state of play as of 2026-Q2.

Crawler	Vendor	Purpose	Compliance	Notes
GPTBot	OpenAI	Model training	Honors robots.txt	Bulk crawl for model training. Honors robots.txt.
OAI-SearchBot	OpenAI	AI search & citation	Honors robots.txt	Indexes for ChatGPT Search citations. Block this and you lose ChatGPT-Search visibility.
ChatGPT-User	OpenAI	On-demand user fetch	Honors robots.txt	Fetches a page when a user pastes/links it in ChatGPT. On-demand, not bulk.
ClaudeBot	Anthropic	Model training	Honors robots.txt	Bulk crawl for training. Honors robots.txt.
Claude-User	Anthropic	On-demand user fetch	Honors robots.txt	On-demand fetch when a user references your URL in Claude.
Google-Extended	Google	Model training	Honors robots.txt	The ONLY lever for Gemini/Vertex training. Blocking it does NOT affect Google Search ranking; they're separate.
Googlebot	Google	Classic search	Honors robots.txt	Classic Search crawl. ⚠ Blocking this removes you from Google Search AND AI Overviews. Almost never block.
CCBot	Common Crawl	Model training	Honors robots.txt	Feeds a large share of open training datasets. Honors robots.txt.
PerplexityBot	Perplexity	AI search & citation	Partial / disputed	Indexes for Perplexity answers. Documented to honor robots.txt; independent reports of stealth fetching exist.
Perplexity-User	Perplexity	On-demand user fetch	Partial / disputed	On-demand user fetch. Perplexity states this bypasses robots.txt by design.
Applebot-Extended	Apple	Model training	Honors robots.txt	Apple Intelligence training control. (Applebot itself still serves Siri/Spotlight.)
Meta-ExternalAgent	Meta	Model training	Honors robots.txt	Meta AI training/crawl. Honors robots.txt.
Amazonbot	Amazon	AI search & citation	Honors robots.txt	Powers Alexa answers + Amazon AI. Honors robots.txt.
Bytespider	ByteDance	Model training	Ignores robots.txt	⚠ Widely reported to ignore robots.txt. Blocking is best-effort; pair with a server/WAF rule.
cohere-ai	Cohere	Model training	Honors robots.txt	Cohere model crawl. Honors robots.txt.
Diffbot	Diffbot	Model training	Honors robots.txt	Knowledge-graph + training data crawler.
ImagesiftBot	ImageSift	Image training	Honors robots.txt	Image crawl feeding image-training datasets.
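
As a rough illustration of what a robots.txt built from the table could look like, the Python sketch below disallows the rows whose purpose is model training and leaves every other crawler allowed. It then uses the standard library's urllib.robotparser to show what a crawler that honors the file would conclude. The helper names (build_robots_txt, is_allowed), the TRAINING_BOTS constant and the example.com URL are assumptions made for this example, not the generator's actual code or output.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens copied from the "Model training" rows of the table above.
TRAINING_BOTS = [
    "GPTBot", "ClaudeBot", "Google-Extended", "CCBot", "Applebot-Extended",
    "Meta-ExternalAgent", "Bytespider", "cohere-ai", "Diffbot",
]


def build_robots_txt(blocked_bots):
    """Return robots.txt text that disallows every path for each listed bot."""
    lines = []
    for bot in blocked_bots:
        lines += [f"User-agent: {bot}", "Disallow: /", ""]
    # Every other crawler (Googlebot, citation crawlers, ...) stays allowed.
    lines += ["User-agent: *", "Allow: /"]
    return "\n".join(lines) + "\n"


def is_allowed(robots_txt, user_agent, url="https://example.com/post"):
    """Ask the stdlib parser what a robots.txt-compliant crawler would decide."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


robots_txt = build_robots_txt(TRAINING_BOTS)
print(robots_txt)
```

With this file, is_allowed(robots_txt, "GPTBot") returns False and is_allowed(robots_txt, "OAI-SearchBot") returns True. That only describes the decision a compliant crawler would reach; it says nothing about bots that ignore the file.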

robots.txt is honor-system. Reputable vendors (OpenAI, Anthropic, Google) respect it. Others (Bytespider) are widely reported to ignore it. For bots that ignore it, the only real control is a server, CDN, or WAF block on the user-agent or IP range.
Blocking training ≠ removing what's already trained. These directives are forward-looking. Content already in a training set stays there until the next model generation, if then.
Don't block Googlebot. It powers both Google Search and AI Overviews. Google-Extended is the separate, safe lever for Gemini training; blocking it leaves Search untouched.
llms.txt is a content guide, not a fence. It tells models where your best content is; it does not restrict access, and major models do not yet act on it — independent log studies find ~97% of llms.txt files are never read. See the field note for the preregistered test. Ship it for the upside, not for protection.
The strategic question isn't "block or allow" — it's "which, and why." Most visibility-focused sites should allow citation crawlers (you want to be cited) while blocking bulk training (you don't want to be raw material). That's the "Protect IP, stay citable" preset.
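
For bots that ignore robots.txt, the first point above recommends a server-side block. One way that could look at the application layer is a small WSGI wrapper that answers 403 when the User-Agent header contains a blocked token. This is a minimal sketch, not a WAF: the names block_ai_crawlers, BLOCKED_UA_TOKENS, demo_app and status_for are hypothetical, and the single "bytespider" token is only an example taken from the table.

```python
# Assumption: lowercase substrings to look for in the User-Agent header.
BLOCKED_UA_TOKENS = ("bytespider",)


def block_ai_crawlers(app, blocked_tokens=BLOCKED_UA_TOKENS):
    """Wrap a WSGI app; answer 403 when the User-Agent contains a blocked token."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in blocked_tokens):
            body = b"Forbidden\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain; charset=utf-8"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Not a blocked crawler: hand the request to the wrapped app unchanged.
        return app(environ, start_response)
    return middleware


def demo_app(environ, start_response):
    """Stand-in for the real site; always answers 200."""
    body = b"hello\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def status_for(app, user_agent):
    """Call the app once with a fake request and return the status line."""
    seen = {}

    def start_response(status, headers):
        seen["status"] = status

    environ = {"REQUEST_METHOD": "GET", "PATH_INFO": "/", "HTTP_USER_AGENT": user_agent}
    app(environ, start_response)
    return seen["status"]


protected = block_ai_crawlers(demo_app)
```

Matching on User-Agent only works while the bot sends an honest header. A crawler that spoofs a browser string passes straight through, which is why the IP-range rule at the CDN or WAF may be the stronger option where it is available.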

Going deeper on the risk side? Read the visibility research, or book an audit — the GEO & LLM-visibility audit ($18K–$28K) includes a full crawler-and-citation posture review.