robots.txt for AI Crawlers — Complete Guide (2026)
Copy-paste templates and the complete directive reference for controlling AI crawlers via robots.txt. Covers GPTBot, ChatGPT-User, OAI-SearchBot, ClaudeBot, Claude-Web, Google-Extended, Applebot-Extended, PerplexityBot, Meta-ExternalAgent, Bingbot, and Bytespider.
What robots.txt is
robots.txt is a plain-text file at the root of your domain (yourdomain.com/robots.txt) that tells web crawlers which URLs they may or may not fetch. The Robots Exclusion Protocol has been a web standard since 1994. Every major AI crawler operating in 2026 — including OpenAI, Anthropic, Google, Microsoft, Apple, Meta, Perplexity — publicly commits to respecting it.
For AI visibility, robots.txt is the single most important configuration on your site. It determines whether your content can be used as training data, cited in real-time AI answers, and indexed by AI-search products like ChatGPT Search.
Approximately 40% of websites accidentally block at least one major AI crawler due to overly strict default robots.txt files inherited from CMS templates or copied from older SEO guides. Always audit yours.
The basic directives
| Directive | Purpose | Example |
|---|---|---|
| User-agent | Names the crawler the next rules apply to. Wildcard * = all crawlers. | User-agent: GPTBot |
| Allow | Explicitly permits a path. Used to override a broader Disallow. | Allow: /docs/ |
| Disallow | Blocks a path. Empty value = block nothing (i.e., allow all). | Disallow: /admin/ |
| Sitemap | Points crawlers to your sitemap.xml. Must be an absolute URL; independent of allow/disallow rules. | Sitemap: https://yourdomain.com/sitemap.xml |
Rules are evaluated per-crawler. A User-agent declaration starts a block; everything until the next User-agent applies only to that bot.
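You can sanity-check this block-scoping behavior with Python's standard-library `urllib.robotparser`. A quick sketch; the paths and the two-block file are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# Two independent blocks: each bot is governed only by its own rules.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /drafts/

User-agent: ClaudeBot
Disallow: /beta/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("GPTBot", "/drafts/post"))   # False (its own block)
print(parser.can_fetch("GPTBot", "/beta/page"))     # True (ClaudeBot's rule doesn't apply)
print(parser.can_fetch("ClaudeBot", "/beta/page"))  # False
```

Note that each crawler sees only the rules in the block naming it; ClaudeBot's Disallow never touches GPTBot.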
Template: allow all AI crawlers
Recommended for any site that wants AI visibility. Drop this into your robots.txt to explicitly permit every major AI crawler. Explicit allow rules are clearer documentation than implicit defaults.
```
# Allow all major AI crawlers — recommended for AEO

User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: Meta-ExternalAgent
Allow: /

User-agent: Meta-ExternalFetcher
Allow: /

User-agent: bingbot
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
```
Template: block all AI crawlers
Use only if you have a clear reason — typically copyright-sensitive publishers or businesses with confidential first-party data. Blocking all AI crawlers means your category gets answered by everyone except you.
```
# Block all major AI crawlers — visibility cost is high

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Bytespider
Disallow: /
```
Template: block training, allow real-time
The middle path — your content doesn't train future LLMs, but AI assistants can still cite you live when users ask questions. Used by news publishers, premium content sites, and brands worried about copyright but unwilling to lose AI search visibility.
```
# Block training crawlers — keep real-time + search visibility

# ── BLOCK: training crawlers ──
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Bytespider
Disallow: /

# ── ALLOW: real-time + search crawlers ──
User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Meta-ExternalFetcher
Allow: /

User-agent: bingbot
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
```
Template: allow only docs and blog
For SaaS products that want AI to learn from public docs and blog posts but not from app pages, account dashboards, or internal tooling.
```
# Allow AI on docs + blog only

User-agent: GPTBot
Disallow: /
Allow: /docs/
Allow: /blog/

User-agent: ChatGPT-User
Disallow: /
Allow: /docs/
Allow: /blog/

User-agent: ClaudeBot
Disallow: /
Allow: /docs/
Allow: /blog/

User-agent: PerplexityBot
Disallow: /
Allow: /docs/
Allow: /blog/

# Block app routes from all AI crawlers
User-agent: *
Disallow: /dashboard/
Disallow: /app/
Disallow: /api/

Sitemap: https://yourdomain.com/sitemap.xml
```
Common mistakes
❌ Blocking all bots with `User-agent: *` `Disallow: /`
This blocks every crawler — Google, Bing, AI bots, social previews. If you wanted to block only AI, you locked yourself out of search and link previews too.
❌ Using `Disallow: /*`
The `*` is treated as a literal in some implementations. Use `Disallow: /` for an absolute block.
❌ Mixing crawler-specific and wildcard rules
If you have `User-agent: *` followed by allow/disallow rules, then add a `User-agent: GPTBot` block, GPTBot ignores the wildcard rules entirely. Each User-agent block is fully self-contained.
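This pitfall is easy to demonstrate with Python's standard-library `urllib.robotparser` (a sketch; the `/internal/` path is an illustrative assumption):

```python
from urllib.robotparser import RobotFileParser

# A wildcard block plus a dedicated GPTBot block. Once GPTBot has its
# own block, the wildcard rules no longer apply to it at all.
ROBOTS_TXT = """\
User-agent: *
Disallow: /internal/

User-agent: GPTBot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# GPTBot matches its dedicated block, so the wildcard Disallow is ignored.
print(parser.can_fetch("GPTBot", "/internal/notes"))         # True
# A bot without a dedicated block still falls under the wildcard rules.
print(parser.can_fetch("PerplexityBot", "/internal/notes"))  # False
```

If you want GPTBot to inherit the wildcard restrictions, repeat them inside the GPTBot block.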
❌ Forgetting the sitemap directive
AI crawlers use sitemap.xml to discover content beyond your homepage. Always include `Sitemap: https://yourdomain.com/sitemap.xml`.
❌ Editing robots.txt only on staging
Verify the live URL — `https://yourdomain.com/robots.txt` — returns the file you intended. CDN caching, framework-level overrides, or path mismatches are common.
❌ Expecting retroactive removal
robots.txt only affects future crawls. Content already in training data isn't removed when you add a Disallow. To opt out of existing data, contact the provider directly.
How to verify your robots.txt is working
- Visit it directly: open `https://yourdomain.com/robots.txt` in a browser. The file should load as plain text.
- Check the response code: it must return `200 OK`. A 404 means no robots.txt exists, and every crawler will fetch everything by default.
- Run an analyzer: paste your URL into a free analyzer that tests each AI crawler against your rules. It catches subtle issues (wildcard conflicts, mixed allow/disallow, encoding problems) automatically.
- Check server logs: after 1–3 days, look for user-agent strings like `GPTBot` or `ClaudeBot` in your access logs. They confirm crawlers are reading the file and following its rules.
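The log check in the last step can be scripted. A minimal sketch in Python (stdlib only); the nginx-style sample lines and the exact crawler list are illustrative assumptions, so adjust both for your server:

```python
from collections import Counter

# Substrings that identify major AI crawlers in the user-agent field.
AI_CRAWLERS = ["GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot",
               "Claude-Web", "PerplexityBot", "Applebot-Extended",
               "Meta-ExternalAgent", "Bytespider", "bingbot"]

def count_ai_hits(log_lines):
    """Count hits per AI crawler across access-log lines, assuming the
    raw user-agent string appears somewhere in each line."""
    hits = Counter()
    for line in log_lines:
        for bot in AI_CRAWLERS:
            if bot.lower() in line.lower():
                hits[bot] += 1
    return hits

# Two fabricated nginx-style log lines for illustration:
sample = [
    '1.2.3.4 - - [01/Mar/2026] "GET /robots.txt HTTP/1.1" 200 "-" "GPTBot/1.2"',
    '5.6.7.8 - - [01/Mar/2026] "GET /docs/ HTTP/1.1" 200 "-" "Mozilla/5.0 ClaudeBot/1.0"',
]
print(count_ai_hits(sample))
```

Seeing a crawler request `/robots.txt` and then obey your rules on subsequent requests is the strongest confirmation the file is working.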
Free robots.txt analyzer
Tests your robots.txt against every major AI crawler in 30 seconds. Per-bot allow/block status with copy-paste fixes.
Frequently asked questions
Where does robots.txt go?
robots.txt is a plain-text file placed at the root of your domain — yourdomain.com/robots.txt. It must be served with HTTP 200 and Content-Type: text/plain. Crawlers fetch it before crawling any other URL on the domain.
How do I allow GPTBot in robots.txt?
Add `User-agent: GPTBot` followed by `Allow: /` on the next line. Or omit any rule for GPTBot — by default, crawlers are allowed if there's no Disallow directive matching them. Explicit allow is clearer documentation.
How do I block ChatGPT entirely?
ChatGPT uses three crawlers: GPTBot (training), ChatGPT-User (real-time browsing), and OAI-SearchBot (the ChatGPT Search index). To block ChatGPT entirely, give each of the three its own block with `Disallow: /`. Blocking just GPTBot still leaves you visible in ChatGPT Search and live queries.
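The three blocks spelled out as a robots.txt fragment:

```
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: OAI-SearchBot
Disallow: /
```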
Will my changes to robots.txt take effect immediately?
AI crawlers re-fetch robots.txt before each crawl session, typically within 24-72 hours of a change. Existing training data isn't removed retroactively — robots.txt only controls future crawls. To remove already-indexed content from a model, you generally need to contact the provider directly.
Can I use Allow and Disallow together?
Yes. The most specific matching rule wins. For example, you can disallow your site root but allow a specific section by putting `Disallow: /` and `Allow: /docs/` in the same GPTBot block. This blocks GPTBot from everything except /docs/. Useful for opening up only documentation while keeping the rest private.
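As a robots.txt fragment:

```
User-agent: GPTBot
Disallow: /
Allow: /docs/
```

Per RFC 9309, the longest (most specific) matching rule wins regardless of order, so `/docs/` stays crawlable here. Some simpler parsers evaluate rules top-down instead, so listing the `Allow` line first is a harmless extra safeguard.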
Do AI crawlers obey robots.txt?
All major AI crawlers — OpenAI's GPTBot/ChatGPT-User/OAI-SearchBot, Anthropic's ClaudeBot/Claude-Web, Google's Google-Extended, Apple's Applebot-Extended, Meta's Meta-ExternalAgent, Perplexity's PerplexityBot, and Microsoft's Bingbot — publicly commit to respecting robots.txt and have been audited by independent researchers. Bytespider (ByteDance) has historically had compliance issues but has been improving in 2025-2026.
What's the difference between blocking GPTBot and Google-Extended?
Blocking GPTBot stops OpenAI from using your content for training. Blocking Google-Extended stops Google from using your content for Gemini training — but Google-Extended is a robots.txt token, not a separate crawler, so blocking it doesn't affect standard Google Search indexing. Both are surgical opt-outs from training data while preserving search visibility.
Updated for 2026 with current user-agent strings for OpenAI, Anthropic, Google, Microsoft, Apple, Meta, Perplexity, and ByteDance crawlers.