
Technical Checklist for Website AI Crawlability

23 specific checks to make your website readable by ChatGPT, Claude, Perplexity, and Gemini. Covers robots.txt rules, sitemap, llms.txt, JavaScript rendering, and WAF blocking. Free, no signup.

Why AI crawlability matters

AI assistants like ChatGPT and Perplexity decide what to recommend by first crawling the open web. If your robots.txt blocks GPTBot, your Cloudflare firewall challenges AI bots, or your React SPA fails to render without JavaScript — AI sees nothing. Over 40% of websites have at least one of these blockers in place without knowing it. This checklist covers every technical barrier between your site and AI visibility.

2-minute self-check

Answer these 5 questions before diving into the full checklist. They catch 80% of AI crawlability issues.

  1. Can I see my robots.txt at yoursite.com/robots.txt?

    If yes and it lists Allow rules for AI bots, you're fine.

  2. Can I see my sitemap.xml at yoursite.com/sitemap.xml?

    If yes and it contains URLs, AI can discover your pages.

  3. Does yoursite.com/llms.txt return content (not 404)?

    If yes, you have a major advantage over 90% of competitors.

  4. Does 'view source' on your homepage show real product content?

    If yes, SSR is working. If you only see <div id='root'></div>, AI sees nothing.

  5. Is your site reachable from curl with no JS enabled?

    Run: curl -I yoursite.com — expect 200, not 403 or challenge page.
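The five answers above can also be scripted. A minimal Python sketch of the pass/fail logic, assuming you have already fetched each response yourself (the function name, thresholds, and heuristics are illustrative, not a definitive implementation):

```python
def quick_check(robots_txt: str, sitemap_xml: str, llms_status: int,
                homepage_html: str, homepage_status: int) -> dict:
    """Pass/fail for the five self-check questions, given already-fetched responses."""
    # Heuristic: an SPA shell usually ships a single empty mount node and little else.
    shell_only = homepage_html.count("<div") <= 1 and 'id="root"' in homepage_html
    return {
        "robots.txt present":   "User-agent" in robots_txt,
        "sitemap has URLs":     "<loc>" in sitemap_xml,
        "llms.txt exists":      llms_status == 200,
        "content without JS":   not shell_only,
        "reachable without JS": homepage_status == 200,
    }
```

Any False value points you at the matching section of the full checklist below.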

robots.txt & AI Bot Access

9 checks

AI crawlers check robots.txt first. If your rules block or miss key AI bots, your site is invisible regardless of content quality.

robots.txt file exists at /robots.txt

Why: Without it, some crawlers assume restrictive rules. Even an empty file signals that the site knows about robots.txt.

How to check: Visit yoursite.com/robots.txt — expect to see User-agent lines, not a 404.

GPTBot is not blocked

Why: GPTBot is OpenAI's crawler for ChatGPT training and web search. Blocking it removes you from ChatGPT's knowledge.

How to check: Look for 'User-agent: GPTBot' followed by 'Allow: /' or no Disallow rules. Test at /tools/robots-txt-analyzer.

ChatGPT-User is not blocked

Why: ChatGPT-User handles real-time browsing when users ask ChatGPT about a specific site. Blocking it breaks live lookups.

How to check: Search robots.txt for ChatGPT-User; ensure no Disallow: / rule applies to it.

OAI-SearchBot is not blocked

Why: OAI-SearchBot powers ChatGPT's search function. Blocking it removes you from ChatGPT Search results.

How to check: Search robots.txt for OAI-SearchBot; explicit Allow recommended.

ClaudeBot is not blocked

Why: ClaudeBot is Anthropic's crawler. Blocking it removes your site from Claude's training data and web responses.

How to check: Search robots.txt for ClaudeBot; ensure no Disallow: / applies.

PerplexityBot is not blocked

Why: PerplexityBot powers Perplexity's live citations. Blocking it means Perplexity cannot cite your pages.

How to check: Search robots.txt for PerplexityBot; explicit Allow recommended.

Google-Extended is not blocked

Why: Google-Extended is the opt-in crawler for Gemini and AI Overviews. Blocking it removes your site from Google's AI-generated answers.

How to check: Search robots.txt for Google-Extended; ensure no Disallow: / applies.

Applebot-Extended is not blocked

Why: Applebot-Extended fuels Apple Intelligence and Siri. Blocking it removes you from Apple AI results.

How to check: Explicit Allow recommended to future-proof Apple AI visibility.

Meta-ExternalAgent is not blocked

Why: Meta-ExternalAgent is Meta AI's crawler. Relevant if your audience uses Instagram, WhatsApp, or Threads.

How to check: Check robots.txt for Meta-ExternalAgent; Allow by default.
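Rather than eyeballing each group, you can evaluate your robots.txt the same way a compliant crawler does, using Python's standard library (the bot list mirrors the checks above; this is a sketch, not an exhaustive audit):

```python
from urllib.robotparser import RobotFileParser

AI_BOTS = ["GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot",
           "PerplexityBot", "Google-Extended", "Applebot-Extended",
           "Meta-ExternalAgent"]

def allowed_bots(robots_txt: str, path: str = "/") -> dict:
    """Return {bot: True/False} for whether each AI crawler may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, path) for bot in AI_BOTS}
```

Any False entry means that crawler is blocked for the given path under your current rules.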

Page Discovery & Indexing

6 checks

Even if crawlers can reach your root, they need to discover and understand your important pages.

sitemap.xml exists and returns 200 OK

Why: Sitemaps are how AI crawlers discover pages beyond your homepage. Without a sitemap, only pages linked from the homepage get crawled.

How to check: Visit yoursite.com/sitemap.xml. Expect valid XML with <url> entries. Use /tools/sitemap-checker to validate.
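A sitemap can also be validated programmatically. A minimal sketch using Python's standard XML parser (assumes the standard sitemap namespace):

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(xml_text: str) -> list:
    """Extract every <loc> URL from a standard sitemap.xml document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc")]
```

An empty result from a 200 response means the file exists but declares no URLs, which is as bad as having no sitemap. The same function also works on sitemap index files, since their child sitemaps are listed in <loc> elements too.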

Sitemap is referenced in robots.txt

Why: Adding 'Sitemap: https://yoursite.com/sitemap.xml' to robots.txt tells every crawler where to find it without guessing.

How to check: Open robots.txt and look for a 'Sitemap:' line at the top or bottom.

llms.txt exists at /llms.txt

Why: llms.txt is a plain-text summary of your product that AI crawlers read first. Without it, AI has to interpret your full HTML and often gets it wrong.

How to check: Visit yoursite.com/llms.txt. If 404, generate one at /tools/llms-txt-generator.
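For reference, a hypothetical llms.txt along these lines (the convention uses Markdown; every name and URL below is a placeholder):

```markdown
# Example Product

> One-sentence description of what the product does and who it is for.

## Key pages

- [Pricing](https://example.com/pricing): plans and billing details
- [Docs](https://example.com/docs): setup and API reference

## Optional

- [Changelog](https://example.com/changelog): release history
```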

llm.json or llms-full.txt available (optional but helpful)

Why: These extended formats give AI agents machine-readable product metadata and deeper context.

How to check: Optional. Adds 3 extra score points if present.

Homepage returns 200 OK without JavaScript

Why: AI crawlers often do not execute JS. If your homepage relies on client-side rendering, crawlers see an empty shell.

How to check: curl -I yoursite.com should return 200. View source (Ctrl+U) should show real content, not just <div id="root"></div>.
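A quick heuristic for spotting an empty client-side shell, assuming you already have the raw HTML in hand (the threshold is illustrative):

```python
import re

def looks_like_spa_shell(html: str, min_text_chars: int = 200) -> bool:
    """True if the HTML carries almost no visible text, i.e. content likely needs JS."""
    # Drop scripts and styles, then all remaining tags, and measure what text is left.
    stripped = re.sub(r"(?is)<(script|style)\b.*?</\1>", "", html)
    text = re.sub(r"(?s)<[^>]+>", " ", stripped)
    return len(" ".join(text.split())) < min_text_chars
```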

All key pages return 200 OK

Why: Pricing, features, and docs pages must be reachable for AI to recommend them.

How to check: Use /tools/sitemap-checker to confirm every sitemap URL returns 200.

JavaScript & Server-Side Rendering

4 checks

Most AI crawlers do not execute JavaScript. If your content only appears after hydration, AI cannot read it.

Text-to-HTML ratio above 10%

Why: A page that is 95% JavaScript and 5% text suggests heavy client-side rendering. AI crawlers extract little useful content.

How to check: Use /tools/text-ratio-checker. Ratios below 5% are red flags.
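The ratio is also easy to compute locally. A sketch using Python's standard HTML parser (the text extraction is approximate, so treat the result as a rough signal):

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect visible text, skipping anything inside <script> or <style>."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def text_to_html_ratio(html: str) -> float:
    """Visible-text characters divided by total HTML characters, as a fraction."""
    p = _TextExtractor()
    p.feed(html)
    text = " ".join("".join(p.parts).split())
    return len(text) / len(html) if html else 0.0
```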

Main content is in the initial HTML response

Why: AI crawlers fetch once. Content loaded via fetch/XHR after page load is often missed entirely.

How to check: curl yoursite.com | grep 'your product description' — if not found, your content requires JS to render.

Key pages use SSR, SSG, or ISR (not pure CSR)

Why: Next.js App Router, Nuxt, Astro, and static site generators produce HTML that AI can read. Pure create-react-app SPAs do not.

How to check: Identify the framework you use. If it is CRA or a client-side-only React SPA, migrate to Next.js or Astro.

Headings (H1, H2, H3) are in the HTML

Why: AI uses heading structure to understand page sections. If headings are generated by React, AI may miss them.

How to check: Use /tools/heading-checker — all H1-H6 tags should appear in the initial HTML.
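To confirm headings ship in the initial HTML, you can list them straight from the raw response, with no JS execution. A regex-based sketch (sufficient for well-formed pages, not a full HTML parser):

```python
import re

def headings_in_html(html: str) -> list:
    """Return (level, text) for every H1-H6 present in the raw HTML."""
    found = []
    for m in re.finditer(r"(?is)<h([1-6])[^>]*>(.*?)</h\1>", html):
        # Strip any inline tags inside the heading and normalize whitespace.
        text = " ".join(re.sub(r"<[^>]+>", " ", m.group(2)).split())
        found.append((int(m.group(1)), text))
    return found
```

An empty result on a page that visibly shows headings in the browser means they are rendered client-side.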

Firewalls, WAFs & CDN

4 checks

Cloudflare, Vercel, and AWS WAF often block AI crawlers by default. This is the #1 silent reason sites are invisible to AI.

Cloudflare AI bot blocking is disabled or configured to allow

Why: Cloudflare launched an option to block AI crawlers by default in 2024. Many sites have this enabled without realizing it.

How to check: Cloudflare dashboard → Security → Bots → verify 'Block AI Scrapers and Crawlers' is off, or that GPTBot/ClaudeBot/PerplexityBot are explicitly allowed.
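If you keep bot protection on but want to exempt AI crawlers, a custom WAF rule with a 'Skip' action can match their user-agents. A sketch in Cloudflare's rule-expression syntax (verify field names against Cloudflare's current documentation; note user-agent strings can be spoofed):

```text
(http.user_agent contains "GPTBot")
or (http.user_agent contains "OAI-SearchBot")
or (http.user_agent contains "ClaudeBot")
or (http.user_agent contains "PerplexityBot")
```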

Vercel firewall allows AI crawler user-agents

Why: Vercel's attack challenge and bot protection can intermittently block legitimate AI crawlers.

How to check: Vercel dashboard → Firewall → confirm no rules block user-agents containing 'GPTBot', 'ClaudeBot', or 'PerplexityBot'.

Rate limiting allows crawler traffic

Why: Strict per-IP rate limits (e.g., 10 requests/second) can cause AI crawlers to hit 429 errors before reading your site.

How to check: Check your rate limiter configuration. Aim for at least 30 requests/second per IP on crawl paths.
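As one example, a per-IP limit in nginx that stays above that floor (the zone name, size, and numbers are illustrative; adapt them to your server, and note limit_req_zone belongs in the http context):

```nginx
# Allow up to 30 requests/second per client IP, with a burst buffer.
limit_req_zone $binary_remote_addr zone=crawlers:10m rate=30r/s;

server {
    location / {
        limit_req zone=crawlers burst=60 nodelay;
    }
}
```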

No CAPTCHA challenge on main content pages

Why: If Cloudflare or hCaptcha shows a challenge to every visitor, AI crawlers cannot solve it and see only the challenge page.

How to check: Open your homepage in incognito with no cookies. If you see a CAPTCHA, AI does too.

Frequently asked questions

What is AI crawlability?

AI crawlability is whether AI assistants like ChatGPT, Claude, Perplexity, and Gemini can reach, render, and understand your website. It depends on your robots.txt rules, sitemap presence, llms.txt file, server-side rendering, WAF configuration, and structured data. If any of these block AI crawlers, your site is invisible in AI-generated answers.

How do I check if my website is crawlable by AI?

Run a technical AI crawlability audit. Check that your robots.txt allows GPTBot, ChatGPT-User, ClaudeBot, PerplexityBot, and Google-Extended. Confirm your sitemap.xml exists and is valid. Deploy an llms.txt file at your root. Test that your key pages are server-side rendered. Verify your WAF (Cloudflare, Vercel) is not blocking AI bots. Free tools like the AI Exposure Audit check all 23 signals in 60 seconds.

Which AI bots should I allow in robots.txt?

Allow all major AI crawlers: GPTBot (OpenAI), ChatGPT-User (ChatGPT browsing), OAI-SearchBot (ChatGPT Search), ClaudeBot (Anthropic), Claude-Web, PerplexityBot, Google-Extended (Gemini), Applebot-Extended (Apple AI), and Meta-ExternalAgent (Meta AI). Block scrapers and low-value crawlers like Bytespider and CCBot only if you have specific concerns — otherwise allowing them helps your site appear in more AI training data.
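Put together, a permissive robots.txt along those lines might look like this (a sketch; grouping multiple User-agent lines before one rule is valid, and you should adjust any Disallow paths to your site):

```text
# Explicitly allow the major AI crawlers
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: OAI-SearchBot
User-agent: ClaudeBot
User-agent: PerplexityBot
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: Meta-ExternalAgent
Allow: /

# Everyone else
User-agent: *
Allow: /

Sitemap: https://yoursite.com/sitemap.xml
```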

Does JavaScript affect AI crawlability?

Yes. AI crawlers often do not execute JavaScript. If your site is fully client-side rendered (like a client-side-only React SPA), AI crawlers may see an empty page. Use server-side rendering (Next.js App Router, Nuxt, Astro) or static site generation so your content is in the initial HTML. Tools like our text-to-HTML ratio checker reveal this issue.

What is llms.txt and why does it matter for AI crawlability?

llms.txt is a plain-text file at your website root (yoursite.com/llms.txt) that provides AI crawlers with a structured summary of your product: name, description, features, pricing, key pages. It is the fastest way for an AI to understand what your site is about. Without llms.txt, AI has to crawl and interpret your full site, which is slower and less accurate.

Skip the manual work

The AI Exposure Audit runs all 23 crawlability checks plus 25 other AI visibility signals on any URL in 60 seconds. Free, no signup.
