induwara.lk

Block AI Bots in robots.txt — GPTBot, ClaudeBot & more

Stop AI crawlers from scraping your site for model training, without losing Google or Bing search traffic. Tick the bots to block, copy the rules, paste into robots.txt. Runs entirely in your browser — no signup, sources cited.

By Induwara Ashinsana · Updated Jun 25, 2026
Pick the AI bots to block (21 crawlers)
Quick presets

AI training crawlers (Training)

Scrape your pages to train large language models. Blocking them keeps your content out of model training sets and has no effect on Google or Bing.

Aggressive scrapers (Scraper)

General-purpose harvesters that resell or repackage web data. Usually safe to block, since they are not mainstream search engines.

AI search & assistants (AI search)

Index pages so ChatGPT, Claude and Perplexity can cite you in live answers. Leaving these allowed keeps you visible in AI search results.

Path to block: use / for the whole site, or /blog/ for one section.

Output options

Bots blocked: 14
Left allowed: 7 (includes every AI bot you didn't tick)
Search engines: unaffected (Googlebot & Bingbot are never touched)

Paste this at the very top of the robots.txt file in your site's root (so it loads at https://yoursite.com/robots.txt).

Blocked bots — operator & SEO impact

  • GPTBot (Training, OpenAI): No search impact; separate from Google, Bing and ChatGPT search.
  • Google-Extended (Training, Google): Does NOT affect Google Search ranking; Googlebot is a separate token.
  • CCBot (Training, Common Crawl Foundation): No search impact; Common Crawl is not a search engine.
  • anthropic-ai (Training, Anthropic): No search impact; distinct from ClaudeBot, which serves Claude's answers.
  • Applebot-Extended (Training, Apple): Base Applebot (Siri & Spotlight search) is unaffected.
  • meta-externalagent (Training, Meta): No search impact; separate from FacebookBot link previews.
  • cohere-ai (Training, Cohere): No search impact.
  • Bytespider (Scraper, ByteDance): No search impact in Sri Lanka's main engines; known to crawl aggressively.
  • Diffbot (Scraper, Diffbot): No search impact.
  • omgilibot (Scraper, Webz.io): No search impact.
  • ImagesiftBot (Scraper, The Hive / ImageSift): No search impact; primarily collects images.
  • Timpibot (Scraper, Timpi): No impact on Google/Bing.
  • YouBot (Scraper, You.com): Only affects visibility inside You.com.
  • PetalBot (Scraper, Huawei): Only affects Huawei Petal Search, not Google or Bing.

robots.txt is voluntary. Reputable AI companies (OpenAI, Google, Anthropic) honour it, but rogue scrapers can ignore it. For hard enforcement, add server, WAF, or Cloudflare AI-bot blocking rules on top — those sit outside robots.txt and are covered in “How it works” below. Tokens are sourced from each operator's official docs.

How it works

This generator produces a block of Robots Exclusion Protocol rules, the format every well-behaved web crawler reads at /robots.txt. The grammar is standardised in RFC 9309. For each AI crawler you choose to block, the tool emits one group:

User-agent: GPTBot
Disallow: /

User-agent names the exact crawler token, and Disallow tells it which paths to stay out of. Disallow: / means “the whole site”; a path like Disallow: /blog/ blocks only that section. To allow a bot, it is simply left out (under RFC 9309, anything not disallowed is allowed) — or, when comments are on, written as an explicit empty rule for clarity:

User-agent: OAI-SearchBot
Disallow:
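
The group format described above is simple enough to sketch in a few lines of Python. This is only an illustration of the output shape, not the tool's actual source; the function names render_group and build_robots are hypothetical.

```python
def render_group(token, path="/", blocked=True):
    """Return one robots.txt group for a single crawler token."""
    if blocked and not path.startswith("/"):
        raise ValueError("path must start with '/'")
    # An empty Disallow value disallows nothing, so the bot stays allowed.
    rule = f"Disallow: {path}" if blocked else "Disallow:"
    return f"User-agent: {token}\n{rule}\n"


def build_robots(blocked, allowed=(), path="/"):
    """Join one group per token, separated by blank lines."""
    groups = [render_group(token, path) for token in blocked]
    groups += [render_group(token, blocked=False) for token in allowed]
    return "\n".join(groups)


print(build_robots(["GPTBot"], allowed=["OAI-SearchBot"]))
```

Calling build_robots(["GPTBot"], allowed=["OAI-SearchBot"]) reproduces the two groups shown above: a Disallow: / group for GPTBot and an explicit empty Disallow for OAI-SearchBot.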

The 21 crawler tokens are hard-coded from each operator's official documentation and grouped by purpose:

  • Training crawlers (7) feed model pre-training datasets — GPTBot, Google-Extended, CCBot, anthropic-ai, Applebot-Extended, meta-externalagent, cohere-ai. Blocking them keeps your content out of training data.
  • Aggressive scrapers (7) harvest and resell web data — Bytespider, Diffbot, omgilibot and others. Usually safe to block.
  • AI-search assistants (7) index pages so ChatGPT, Perplexity and Claude can cite you in live answers — blocking these can remove your site from AI answer results, so many content owners leave them allowed.
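
One way to picture the grouping is as a small lookup table that presets select from. The sketch below is an assumption about structure, not the tool's real data model: the names CRAWLERS, PRESETS and tokens_for are hypothetical, and the AI-search list holds only the three tokens this page names explicitly.

```python
# Tokens grouped by purpose, as listed on this page.
CRAWLERS = {
    "training": ["GPTBot", "Google-Extended", "CCBot", "anthropic-ai",
                 "Applebot-Extended", "meta-externalagent", "cohere-ai"],
    "scraper": ["Bytespider", "Diffbot", "omgilibot", "ImagesiftBot",
                "Timpibot", "YouBot", "PetalBot"],
    # Only three of the seven AI-search tokens are named on the page;
    # the remaining four are deliberately left out rather than guessed.
    "ai_search": ["OAI-SearchBot", "PerplexityBot", "ClaudeBot"],
}

# Hypothetical preset names mapping to the categories they block.
PRESETS = {
    "block_training_keep_ai_search": ("training", "scraper"),
    "block_everything_ai": ("training", "scraper", "ai_search"),
}


def tokens_for(preset):
    """Return every token blocked by the named preset, in category order."""
    return [token for category in PRESETS[preset] for token in CRAWLERS[category]]


blocked = tokens_for("block_training_keep_ai_search")
print(len(blocked), "bots blocked")
```

With the training and scraper categories selected, the list holds 14 tokens, which matches the "Bots blocked" count shown for that preset. No traditional search crawler appears in any category, so a preset cannot select one.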

Critically, the tool never emits a rule for Googlebot, Bingbot, or any traditional search crawler — so your normal SEO is untouched. That is why blocking Google-Extended is safe: Google's own docs confirm it is separate from Googlebot and has no effect on Search ranking.
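
You can sanity-check this locally with Python's standard-library urllib.robotparser. The sketch below parses two blocked groups and asks whether various crawlers may fetch a page; the URL is a placeholder, and Python's matcher only approximates how each real crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
"""


def is_allowed(rules, user_agent, url):
    """Parse robots.txt text and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)


URL = "https://example.com/recipes/"  # placeholder URL for illustration

for agent in ("GPTBot", "Google-Extended", "Googlebot", "Bingbot"):
    print(agent, is_allowed(RULES, agent, URL))
```

GPTBot and Google-Extended come back as not allowed, while Googlebot and Bingbot come back as allowed, because no group in the file names them and there is no catch-all User-agent: * group.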

One honest caveat: robots.txt is voluntary. Reputable AI companies obey it, but it does not physically block anyone, so a rogue scraper can ignore it. You can layer noai meta tags or X-Robots-Tag HTTP headers on top as further opt-out signals, though these are advisory too; for actual enforcement, add a Cloudflare/WAF AI-bot rule or a server-level block. robots.txt is the correct, universal first step, and the one every major AI operator actually reads.
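
As one possible shape for server-level blocking, here is a minimal WSGI middleware sketch that returns 403 when the User-Agent header contains a blocked token. It is an illustration under assumptions, not a recommended production setup: block_ai_bots and BLOCKED_TOKENS are hypothetical names, the token list is an arbitrary subset, and a scraper that spoofs its User-Agent will pass straight through. Tokens such as Google-Extended are robots.txt control tokens that do not appear in request headers, so this approach does not apply to them.

```python
BLOCKED_TOKENS = ("GPTBot", "CCBot", "Bytespider")  # illustrative subset only


def block_ai_bots(app, tokens=BLOCKED_TOKENS):
    """Wrap a WSGI app and refuse requests whose User-Agent names a blocked bot."""
    lowered = tuple(token.lower() for token in tokens)

    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        # Case-insensitive substring match against each blocked token.
        if any(token in user_agent for token in lowered):
            body = b"Forbidden\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Everything else is passed to the wrapped application unchanged.
        return app(environ, start_response)

    return middleware
```

Usage would be a one-line wrap around an existing WSGI application object, for example application = block_ai_bots(application). A WAF or CDN rule does the same job earlier in the request path and is usually the better place for it.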

Worked examples

Example 1 — Block AI training, keep AI search

A Colombo recipe blogger wants her posts kept out of LLM training sets, but still wants to appear in ChatGPT Search and Perplexity answers with attribution.

  1. Select the 7 training crawlers + 7 scrapers (the "Block training, keep AI search" preset).
  2. Leave the AI-search bots (OAI-SearchBot, PerplexityBot, ClaudeBot …) unticked.
  3. Path stays "/" — block the whole site for those crawlers.
  4. Result: 14 Disallow: / groups; AI-search bots written as explicit empty Disallow so they stay allowed.
# Block — AI training crawlers

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# … CCBot, anthropic-ai, Applebot-Extended,
#     meta-externalagent, cohere-ai …

# Allowed — these AI assistants can still cite your pages

User-agent: OAI-SearchBot
Disallow:

User-agent: PerplexityBot
Disallow:

Example 2 — Block everything AI (maximum protection)

A subscription news site wants zero AI access of any kind.

  1. Use the "Block everything AI" preset.
  2. Every token in all three groups is selected — 21 crawlers total.
  3. Output: 21 User-agent groups, each with Disallow: /. No AI bot is left allowed.
  4. Googlebot and Bingbot are still untouched, so traditional search keeps working.
# Block — AI training crawlers
User-agent: GPTBot
Disallow: /
# … +6 more training crawlers

# Block — Aggressive scrapers
User-agent: Bytespider
Disallow: /
# … +6 more scrapers

# Block — AI search & assistants
User-agent: OAI-SearchBot
Disallow: /
# … +6 more AI-search bots

Example 3 — Block one bot from one section only

A site is happy to be trained on, except for its paid /members/ area, which it wants kept from GPTBot.

  1. Untick every bot except GPTBot.
  2. Change the path to /members/ (it must start with "/").
  3. Turn comments off for a minimal rule.
  4. Result: a single group scoped to that one directory.
User-agent: GPTBot
Disallow: /members/
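
To see how far a scoped rule reaches, the sketch below feeds this group to Python's urllib.robotparser and probes a few paths. The domain and the probed paths are placeholders, and real crawlers may differ in edge cases.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse(["User-agent: GPTBot", "Disallow: /members/"])

SITE = "https://example.com"  # placeholder domain

# Map each probed path to whether GPTBot may fetch it under the rule above.
checks = {
    path: parser.can_fetch("GPTBot", SITE + path)
    for path in ("/members/", "/members/guide.html", "/members", "/blog/post")
}

for path, allowed in checks.items():
    print(path, "allowed" if allowed else "blocked")
```

The rule is a prefix match, so /members/ and everything beneath it is blocked, while /blog/post stays open. Note that /members without the trailing slash is not covered by Disallow: /members/; if that URL serves content, use Disallow: /members instead.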

Spotted a new AI crawler, a renamed token, or a bug?

Email me at [email protected] — most fixes ship within 24 hours.