Reference

Complete List of AI Crawlers (2026)

Every AI crawler currently fetching public web content — user-agent strings, owner organizations, purpose (training vs real-time vs search), and how to allow or block each one. Updated for 2026.

What AI crawlers are

AI crawlers are automated programs operated by AI companies (OpenAI, Anthropic, Google, Microsoft, Apple, Meta, Perplexity, ByteDance) that fetch public web content. Some add what they read to LLM training data. Others fetch content in real time when a user asks the AI something that needs browsing. A third group powers AI-search products that have their own indexes (ChatGPT Search, Bing AI).

Every AI crawler identifies itself with a user-agent string and, with rare exceptions, respects robots.txt. You can allow or block each one independently.

The short version: if you want ChatGPT, Claude, Perplexity, or Gemini to recommend your product when users ask questions in your category, you need to allow their crawlers. Blocking them means invisibility in AI answers.

Training vs real-time crawlers

Crawler type | What it does | Block effect
Training | Fetches content to add to LLM training corpus. Examples: GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, Meta-ExternalAgent, Bytespider. | Your content won't be in the model's baseline knowledge. Long-term effect.
Real-time | Fetches content during a live user query. Examples: ChatGPT-User, Claude-Web, PerplexityBot, Meta-ExternalFetcher. | AI can't cite your live content in answers. Immediate visibility loss.
Search | Builds a dedicated AI-search index. Examples: OAI-SearchBot (ChatGPT Search), Bingbot (powers ChatGPT Search + Copilot). | You disappear from that AI-search product entirely.

For maximum AI visibility, allow all three categories. Many publishers selectively block training crawlers over copyright concerns while allowing real-time and search crawlers — that pattern preserves AI citation visibility without contributing to training data.

Complete crawler reference

Every major AI crawler in 2026, grouped by owner organization. User-agent strings are the canonical identifiers — copy them exactly into your robots.txt or server logs to filter.
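To see which AI crawlers already hit your site, you can filter access logs for these user-agent substrings. A minimal Python sketch — the log lines and the crawler subset here are illustrative, not the complete list:

```python
# Filter access-log lines for known AI crawlers by user-agent substring.
# The crawler subset and the sample log lines are illustrative examples.
AI_CRAWLERS = {
    "GPTBot": "training",
    "ChatGPT-User": "real-time",
    "OAI-SearchBot": "search",
    "ClaudeBot": "training",
    "PerplexityBot": "real-time",
    "bingbot": "search",
}

def classify(log_line):
    """Return (crawler, purpose) for the first matching AI crawler, else None."""
    lowered = log_line.lower()
    for name, purpose in AI_CRAWLERS.items():
        if name.lower() in lowered:
            return name, purpose
    return None

sample_log = [
    '1.2.3.4 - - [10/Jan/2026] "GET / HTTP/1.1" 200 "Mozilla/5.0 ... GPTBot/1.2; +https://openai.com/gptbot"',
    '5.6.7.8 - - [10/Jan/2026] "GET /pricing HTTP/1.1" 200 "Mozilla/5.0 ... PerplexityBot/1.0"',
    '9.9.9.9 - - [10/Jan/2026] "GET / HTTP/1.1" 200 "Mozilla/5.0 (a regular browser)"',
]

hits = [c for c in map(classify, sample_log) if c is not None]
print(hits)  # [('GPTBot', 'training'), ('PerplexityBot', 'real-time')]
```

Substring matching is the right approach here because the full user-agent strings vary by version and platform, while the bot name itself stays stable.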

OpenAI

Operates ChatGPT and the OpenAI API. Three distinct crawlers, each with a separate purpose.

GPTBot (Training)

Adds content to OpenAI's training corpus. Block this if you don't want your content training future GPT models.

User-agent string

Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.2; +https://openai.com/gptbot
ChatGPT-User (Real-time)

Fires when a ChatGPT user asks a question that requires fetching live web content. Blocking this means ChatGPT can't cite your live page in answers.

User-agent string

Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot
OAI-SearchBot (Search)

Crawls for ChatGPT Search's index. Blocking this removes you from ChatGPT Search results.

User-agent string

Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot

Anthropic

Operates Claude. Two crawlers: training and real-time.

ClaudeBot (Training)

Crawls public content for Claude's training corpus.

User-agent string

Mozilla/5.0 (compatible; ClaudeBot/1.0; +claudebot@anthropic.com)
Claude-Web (Real-time)

Real-time fetch when a Claude user asks something requiring browsing.

User-agent string

Mozilla/5.0 (compatible; Claude-Web/1.0; +https://www.anthropic.com/claude/web)

Google

Operates Gemini, AI Overviews, and AI Mode. Uses a special user-agent token rather than a separate crawler.

Google-Extended (Training)

Not a separate crawler — Google-Extended is a robots.txt token that opts you out of training data for Gemini, Vertex AI, and other Google AI products without affecting standard search indexing.

User-agent string

(token applied to Googlebot)
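A minimal robots.txt sketch of that pattern — standard Googlebot crawling stays allowed for search, while the Google-Extended token opts out of AI training use:

```txt
# Standard Google Search indexing continues
User-agent: Googlebot
Allow: /

# Opt out of Gemini / Vertex AI training use
User-agent: Google-Extended
Disallow: /
```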
GoogleOther (Real-time)

Generic user-agent for non-search Google products. Used by various internal Google tools including AI features.

User-agent string

GoogleOther

Microsoft

Operates Copilot and Bing AI. ChatGPT Search ALSO uses Bing's index, so allowing Bingbot is critical for ChatGPT visibility.

Bingbot (Search)

Powers Bing Search, Microsoft Copilot, AND ChatGPT Search. Block this and you lose visibility on three major AI surfaces at once.

User-agent string

Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm

Perplexity AI

Operates the Perplexity answer engine. Crawls in real time when users ask questions.

PerplexityBot (Real-time)

Real-time fetch when a Perplexity user asks a question. Blocking PerplexityBot makes you invisible on Perplexity entirely.

User-agent string

Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot
Perplexity-User (Real-time)

Direct user-initiated fetches (when a user clicks a citation in a Perplexity answer).

User-agent string

Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; Perplexity-User/1.0; +https://perplexity.ai/perplexity-user

Apple

Operates Apple Intelligence and Siri's AI features.

Applebot-Extended (Training)

Apple's opt-out token for Apple Intelligence training data. Same pattern as Google-Extended — disallows training without removing you from Siri search results.

User-agent string

(token applied to Applebot)
Applebot (Search)

Standard Apple search crawler used by Siri Suggestions and Spotlight Web Search.

User-agent string

Mozilla/5.0 (Device; OS_version) AppleWebKit/WebKit_version (KHTML, like Gecko) Version/Safari_version Safari/WebKit_version (Applebot/Applebot_version)

Meta

Operates Meta AI (Llama-based assistant in Messenger, Instagram, WhatsApp).

Meta-ExternalAgent (Training)

Crawls for Meta AI training data and embeddings.

User-agent string

Meta-ExternalAgent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)
Meta-ExternalFetcher (Real-time)

Real-time fetch for Meta AI live queries.

User-agent string

Meta-ExternalFetcher/1.0

ByteDance

Operates Doubao (China's largest AI assistant) and TikTok's AI features.

Bytespider (Training)

ByteDance's crawler, which has historically had a mixed track record with robots.txt compliance. Allow it if you serve a global audience and want Doubao visibility; many publishers otherwise block it.

User-agent string

Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; spider-feedback@bytedance.com)
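If you decide to block it, a minimal robots.txt sketch — though given the mixed compliance record, some sites also enforce the block at the server or CDN level:

```txt
User-agent: Bytespider
Disallow: /
```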

robots.txt template — allow all AI crawlers

Drop this into /robots.txt at your domain root to explicitly allow every major AI crawler. Recommended for any site that wants to appear in AI answers.

# Allow all AI crawlers — recommended for AEO
# Place at https://yourdomain.com/robots.txt

User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: Meta-ExternalAgent
Allow: /

User-agent: Meta-ExternalFetcher
Allow: /

# Bingbot powers ChatGPT Search and Microsoft Copilot
User-agent: bingbot
Allow: /

# Standard sitemap directive
Sitemap: https://yourdomain.com/sitemap.xml

Free robots.txt analyzer

Check whether your robots.txt is blocking any AI crawler — paste your URL and get a per-bot status.


Should I allow AI crawlers?

Three frameworks publishers use, in increasing order of restrictiveness:

  1. Allow everything

    Maximum AI visibility. Your content trains LLMs and gets cited live. Most SaaS, content businesses, and tool sites pick this — the visibility upside far exceeds the training-data risk.

  2. Allow real-time + search, block training

    You appear in live AI answers but your content isn't used to train baseline models. Common for publishers and brands worried about copyright. Block: GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, Meta-ExternalAgent, Bytespider. Allow: ChatGPT-User, OAI-SearchBot, Claude-Web, PerplexityBot, Meta-ExternalFetcher, Bingbot.

  3. Block everything

    Maximum control, zero AI visibility. Used by some news publishers and high-IP businesses. The trade-off is severe — your category gets answered by everyone except you.
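Of the three, option 2 is the only one that needs per-bot rules. As a sketch, it maps onto robots.txt like this, using the user-agent names from the reference above:

```txt
# Block training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Bytespider
Disallow: /

# Allow real-time and search crawlers
User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Meta-ExternalFetcher
Allow: /

User-agent: bingbot
Allow: /
```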

For most AEO use cases — SaaS, agencies, content businesses, tools — option 1 (allow everything) is the right default. The fastest way to get cited by AI is to be readable by AI.

Frequently asked questions

What user-agent does ChatGPT use?

OpenAI operates three crawlers: GPTBot (training data crawler), ChatGPT-User (real-time queries when a user asks ChatGPT something that requires browsing), and OAI-SearchBot (the ChatGPT search index crawler). All three identify with distinct user-agent strings and can be controlled independently in robots.txt.

What is the user-agent for Claude?

Anthropic uses two crawlers: ClaudeBot (training data) and Claude-Web (real-time when a user asks Claude something requiring web access). Both identify themselves clearly and respect robots.txt directives.

What is Google-Extended?

Google-Extended is Google's user-agent token (not a separate crawler) that lets sites opt out of having their content used to train Gemini and other Google AI products without affecting standard Google Search indexing. Disallowing Google-Extended in robots.txt removes the site from Gemini's training data while keeping it indexed for regular search.

Should I allow all AI crawlers?

If you want your content cited by AI assistants — yes. Blocking AI crawlers means your product doesn't show up when ChatGPT, Claude, Perplexity, or Gemini answer questions about your category. Note that real-time fetch crawlers (ChatGPT-User, Claude-Web, PerplexityBot) don't add to training data — blocking them prevents AI from quoting your live content even when a user explicitly asks for it.

How do I block AI crawlers?

Add User-agent: <bot-name> followed by Disallow: / for each crawler in your robots.txt. For example, to block GPTBot: User-agent: GPTBot then Disallow: /. To allow it explicitly: Disallow: (empty value) or use Allow: /. See the full list of user-agents in this doc.
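You can verify the effect of such rules offline with Python's standard urllib.robotparser. A quick sketch using a hypothetical policy that blocks GPTBot but allows ChatGPT-User:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical policy: block the training crawler, allow the real-time one.
policy = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /
"""

rp = RobotFileParser()
rp.parse(policy.splitlines())  # parse in memory, no network fetch

print(rp.can_fetch("GPTBot", "https://example.com/pricing"))        # False
print(rp.can_fetch("ChatGPT-User", "https://example.com/pricing"))  # True
```

Point rp at your live file with `RobotFileParser("https://yourdomain.com/robots.txt")` plus `rp.read()` to test the policy you actually serve.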

Do AI crawlers respect robots.txt?

All major AI crawlers from OpenAI, Anthropic, Google, Microsoft, and Apple publicly commit to respecting robots.txt directives. Bytespider (ByteDance) and some smaller crawlers have a mixed track record. Independent researchers have audited PerplexityBot, ClaudeBot, and GPTBot and observed them respecting robots.txt.

What's the difference between training crawlers and real-time crawlers?

Training crawlers (GPTBot, ClaudeBot, Google-Extended, Applebot-Extended) fetch content to add to LLM training data — what the model 'knows' out of the box. Real-time crawlers (ChatGPT-User, Claude-Web, PerplexityBot) fetch content in response to a live user query that requires browsing — they don't store the data, just use it to answer that specific question. Most sites should allow both: blocking real-time crawlers makes you invisible during live AI search even if you're in the training data.
