AI crawlers are consuming 4.2% of all web traffic. GPTBot traffic grew 305% between May 2024 and May 2025. Meta’s AI crawlers alone generate 52% of all AI bot traffic.
The numbers are real. The impact is measurable. But here’s the question no one’s answering: beyond allowing these bots in your robots.txt file, what actually makes your content crawlable for AI?
We analyzed the research, tracked the data, and identified the factors that determine whether LLMs can find, understand, and use your content.
The AI Crawlability Problem
AI crawlers behave fundamentally differently from search engines. And most websites aren’t built for how they work.
Traditional search crawlers like Googlebot render JavaScript, follow your sitemap, and index keywords for later retrieval. They’re patient. They’re sophisticated. They understand modern web architecture. They’ve been around for decades, after all.
AI crawlers are different:
- Most don’t render JavaScript. They only see response HTML, not rendered HTML
- They’re looking for extractable data, not just indexable keywords
- They hit sites aggressively: 1,000 to 39,000 requests per minute in some cases
- They don’t always honor robots.txt (despite what the documentation says)
- They’re training models or fetching real-time answers, not building a search index
The result? Content that ranks perfectly in Google can be completely invisible to Claude, ChatGPT, or Gemini.
If your product names live in JavaScript, if your pricing tables require interaction to display, or if your FAQs sit behind accordion menus that exist in the rendered DOM but not in the response HTML, then AI models can’t see them.
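You can test this yourself by fetching a page the way a non-rendering bot does and checking whether your critical strings survive. A minimal sketch, assuming Python and the requests library; the URL and phrases are placeholders:

```python
# A crawl-audit sketch: fetch the page over plain HTTP with no JavaScript
# execution (like most AI crawlers) and check whether critical phrases
# exist in the raw response HTML. URL and phrases are placeholders.
import requests

def visible_without_js(url: str, must_contain: list[str]) -> dict[str, bool]:
    """Return {phrase: present-in-response-HTML?} for each critical phrase."""
    html = requests.get(url, timeout=10).text
    return {phrase: phrase in html for phrase in must_contain}

report = visible_without_js(
    "https://example.com/pricing",  # placeholder URL
    ["Pro plan", "$49/month", "Frequently asked questions"],
)
for phrase, found in report.items():
    print(("OK     " if found else "MISSING"), phrase)
```

Anything that comes back MISSING is invisible to a non-rendering crawler, no matter how it looks in a browser.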
What Actually Affects AI Crawlability
We tracked AI traffic patterns across the research and identified six factors that determine crawlability. Not all of them matter equally.
1. Technical Accessibility (Critical)
This is table stakes. If AI crawlers can’t access your content, nothing else matters.
What the data shows:
- 14% of top domains now use robots.txt rules to manage AI bots
- Sites blocking AI crawlers see zero AI referral traffic (obviously)
- Crawl errors and timeout patterns show up consistently when bots abandon sites
What works:
- Allow relevant crawlers in robots.txt (GPTBot, ClaudeBot, Google-Extended, PerplexityBot); a quick checker sketch follows this list
- Fix server response times. AI bots have shorter timeouts than traditional crawlers
- Monitor server logs for 4xx/5xx errors when AI bots visit (a log-scan sketch closes this section)
- Manage rate limiting carefully. Aggressive throttling blocks legitimate AI traffic
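To sanity-check the first point, you can test what your robots.txt actually permits. A minimal sketch using Python’s standard-library robotparser; the bot tokens and URL are illustrative:

```python
# Parse the live robots.txt and report which AI crawler user agents may
# fetch a given URL. Swap in whichever bots you care about.
from urllib import robotparser

AI_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended", "PerplexityBot"]

def check_ai_access(site: str, path: str = "/") -> dict[str, bool]:
    """Return {bot: allowed?} according to the site's robots.txt."""
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{site}/robots.txt")
    rp.read()  # fetches and parses the live file
    return {bot: rp.can_fetch(bot, f"{site}{path}") for bot in AI_BOTS}

for bot, allowed in check_ai_access("https://example.com").items():
    print(f"{bot}: {'allowed' if allowed else 'blocked'}")
```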
What doesn’t:
- Allowing every AI bot indiscriminately increases server load without ROI
- Blocking all AI bots protects content but eliminates AI discovery entirely
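And for the log-monitoring point, a rough sketch that scans a combined-format access log for AI bot user agents hitting 4xx/5xx responses. The log path, regex, and user-agent tokens are assumptions to adapt to your own setup:

```python
# Count AI-bot requests that got 4xx/5xx responses in a combined-format
# access log. Path, regex, and UA tokens are placeholders.
import re

AI_UA = re.compile(r"GPTBot|ClaudeBot|Google-Extended|PerplexityBot", re.I)
LINE = re.compile(r'"[A-Z]+ (?P<path>\S+) [^"]*" (?P<status>\d{3}) .*"(?P<ua>[^"]*)"$')

errors: dict[tuple, int] = {}
with open("/var/log/nginx/access.log") as log:  # placeholder path
    for raw in log:
        m = LINE.search(raw)
        if not m:
            continue
        bot = AI_UA.search(m["ua"])
        if bot and m["status"][0] in "45":
            key = (bot.group(0), m["status"], m["path"])
            errors[key] = errors.get(key, 0) + 1

for (bot, status, path), count in sorted(errors.items(), key=lambda kv: -kv[1]):
    print(f"{count:5d}  {status}  {bot:15s} {path}")
```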
2. Content Structure (Critical)
AI models need extractable data. Not content that requires interpretation. Not beautifully designed layouts. Just data they can pull directly into answers.
The problem: Most AI crawlers can’t render JavaScript. If your content lives in <script> tags, gets loaded via React, or appears only after client-side rendering, AI models won’t see it.
One analysis found that major ecommerce sites serve product names only in rendered HTML. Response HTML shows empty containers. AI crawlers see nothing.
What works:
- Serve critical content in response HTML, not just rendered HTML
- Use structured data: comparison tables, pricing grids, FAQ schemas (a JSON-LD sketch follows this list)
- Put important information at the top of the page in simple HTML
- Avoid hiding content in dropdowns or accordions that require interaction
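As one concrete option for the FAQ point, schema.org’s FAQPage type puts question-and-answer pairs directly into the response HTML as JSON-LD, so no interaction is needed to expose them. A hedged sketch that generates such a block; the field names follow schema.org, the content is placeholder:

```python
# Emit a schema.org FAQPage block as JSON-LD so FAQ answers live in the
# response HTML rather than behind a client-side accordion.
import json

def faq_jsonld(pairs: list[tuple[str, str]]) -> str:
    """Build a JSON-LD <script> block from (question, answer) pairs."""
    data = {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }
    return f'<script type="application/ld+json">{json.dumps(data)}</script>'

print(faq_jsonld([("What does the Pro plan cost?", "The Pro plan is $49/month.")]))
```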
Real example: A neobank restructured product pages with extractable comparison tables showing interest rates, fees, and account minimums.
AI traffic increased 25%, but they also fixed crawl errors, earned press coverage, and launched new content in the same window. The structure helped, but it wasn’t the only variable.
3. Content Type and Format (High Impact)
Not all content performs equally in AI discovery.
What the data shows:
- Sites that created functional assets (downloadable templates, tools, calculators) saw traffic increases
- Sites that just documented existing content in llms.txt saw no change
- Developer documentation optimized for AI coding assistants drives measurable signups
Content that works:
- Functional assets users can deploy immediately (templates, frameworks, checklists)
- Structured comparisons (“How does X compare to Y?”)
- FAQ content that maps directly to user queries
- API documentation for developer tools
- Data-rich pages with extractable statistics
Content that doesn’t:
- Generic blog posts without unique data
- Marketing copy without specific facts
- Content requiring interpretation or context
- Paywalled content (obviously)
One B2B SaaS platform published 27 downloadable AI templates. Traffic jumped 12.5% two weeks later. But Google organic traffic to those templates rose 18% in the same period. The templates worked because they solved problems, not because AI discovered them differently than search engines.
4. External Validation (Moderate Impact)
AI models appear to factor authority into content selection, similar to how search engines use backlinks and domain authority.
What we see:
- Sites with press coverage from major publications (Bloomberg, WSJ) saw AI traffic increases
- These same sites saw increases across all channels, not just AI
- It’s unclear whether AI models directly assess authority or if authoritative sites simply have better content
What probably works:
- Earning coverage from authoritative sources in your industry
- Building backlinks from high-authority domains
- Creating content that other sites reference and cite
What we can’t prove:
- Whether AI models parse backlink profiles directly
- How they weigh different authority signals
- If citation within AI responses creates a feedback loop
5. Token Efficiency (Niche Impact)
This matters almost exclusively for developer tools and API documentation.
The argument: Clean markdown in llms.txt saves tokens when AI agents parse documentation. Instead of processing complex HTML with navigation, ads, and JavaScript, agents get streamlined content.
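You can put a rough number on that claim yourself. A sketch using the tiktoken library as a stand-in tokenizer (what any given agent actually uses will vary); the snippets are illustrative:

```python
# A back-of-the-envelope comparison: the same fact as cluttered HTML
# versus stripped markdown. Requires tiktoken (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

html_version = (
    '<nav class="site-nav">...</nav><div class="doc-wrap"><h2>Authentication</h2>'
    "<p>Pass your API key in the <code>Authorization</code> header.</p></div>"
)
markdown_version = "## Authentication\nPass your API key in the `Authorization` header."

for label, text in [("HTML", html_version), ("Markdown", markdown_version)]:
    print(f"{label}: {len(enc.encode(text))} tokens")
```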
Who this matters for:
- Developer tools where AI coding assistants (Cursor, GitHub Copilot) are primary distribution channels
- API documentation that agents reference during code generation
Who this doesn’t matter for:
- Ecommerce sites selling physical products
- B2B SaaS targeting non-technical buyers
- Insurance, finance, and healthcare sites explaining coverage
- Content publishers
6. Crawl Frequency Optimization (Low Impact)
Some sites try to optimize how often AI crawlers visit. The data suggests this rarely matters.
What we know:
- Training crawlers return every 6 hours in some cases (compared to search crawlers that visit daily or weekly)
- Fetcher bots access content in real-time, responding to user queries
- No evidence that crawl frequency correlates with AI referral traffic
What doesn’t work:
- Using Crawl-delay in robots.txt to manage frequency
- Optimizing update frequency to match crawler patterns
The exception: if aggressive crawling is causing server issues, rate limiting is legitimate infrastructure protection, not optimization.
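If you do need that protection, a per-user-agent token bucket is one standard approach. A minimal sketch; the rate and burst numbers are illustrative assumptions, not recommendations:

```python
# A token bucket per user agent: requests spend tokens, tokens refill at
# a steady rate, and anything over budget gets refused (serve a 429).
import time
from collections import defaultdict

class BotRateLimiter:
    def __init__(self, rate_per_sec: float = 5.0, burst: int = 20):
        self.rate, self.burst = rate_per_sec, burst
        self.tokens = defaultdict(lambda: float(burst))  # each bot starts full
        self.last = defaultdict(time.monotonic)

    def allow(self, user_agent: str) -> bool:
        """True if this request fits the bot's budget."""
        now = time.monotonic()
        refill = (now - self.last[user_agent]) * self.rate
        self.tokens[user_agent] = min(self.burst, self.tokens[user_agent] + refill)
        self.last[user_agent] = now
        if self.tokens[user_agent] >= 1.0:
            self.tokens[user_agent] -= 1.0
            return True
        return False

limiter = BotRateLimiter()
print(limiter.allow("GPTBot"))  # True until the bucket drains
```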
Your AI Crawlability Score
Here’s a framework to audit your own site. Score each factor 0-2, then total your score; a small scoring helper follows the rubric.
Technical Access (0-2)
- 2: AI crawlers allowed, no crawl errors, fast response times
- 1: Some crawlers blocked or intermittent access issues
- 0: AI crawlers blocked or major technical barriers
Content in Response HTML (0-2)
- 2: All critical content available in response HTML
- 1: Some content requires rendering
- 0: Most content lives in JavaScript/requires rendering
Structured Data (0-2)
- 2: Extensive use of tables, comparisons, FAQ schema
- 1: Some structured elements
- 0: Primarily unstructured prose
Functional Assets (0-2)
- 2: Templates, tools, and calculators users can deploy
- 1: Some downloadable resources
- 0: Only informational content
External Authority (0-2)
- 2: Regular press coverage, strong backlink profile
- 1: Some external validation
- 0: Limited external signals
Score Interpretation:
- 8-10: Highly crawlable. Focus on content quality.
- 5-7: Moderate crawlability. Address technical and structural gaps.
- 0-4: Low crawlability. Major optimization needed.
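If you want to track this audit over time, here’s a toy helper that mirrors the rubric; the factor names are just labels for whatever you record:

```python
# Score each factor 0-2, total them, and map the total onto the
# interpretation bands above.
FACTORS = ["technical_access", "response_html", "structured_data",
           "functional_assets", "external_authority"]

def crawlability_score(scores: dict[str, int]) -> tuple[int, str]:
    total = sum(scores[f] for f in FACTORS)
    if total >= 8:
        verdict = "Highly crawlable. Focus on content quality."
    elif total >= 5:
        verdict = "Moderate crawlability. Address technical and structural gaps."
    else:
        verdict = "Low crawlability. Major optimization needed."
    return total, verdict

total, verdict = crawlability_score({
    "technical_access": 2, "response_html": 1, "structured_data": 1,
    "functional_assets": 0, "external_authority": 1,
})
print(f"{total}/10: {verdict}")
```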
What Actually Works Right Now
Based on actual data, not just internet speculation:
Do this:
- Fix technical barriers first. If bots can’t access content, nothing else matters.
- Structure content for extraction. Use tables, comparisons, and FAQ formats.
- Create functional assets that solve specific problems.
- Serve critical content in response HTML, not just rendered HTML.
- Earn external validation through press and backlinks.
Skip this (for now):
- Implementing llms.txt unless you’re a developer tool
- Optimizing crawl frequency
- Creating content specifically “for AI” without user value
- Chasing every new AI optimization trend without data
The lesson from our research is clear: the fundamentals matter more than the formats.
Sites that grew AI traffic did so because they created useful, structured, accessible content and earned external validation. The same factors that work for traditional search work for AI discovery.
llms.txt won’t save poorly structured content. Allowing GPTBot won’t make generic blog posts discoverable. Token efficiency doesn’t matter if you’re not serving useful information.
The Reality About Control
We’re reaching for control in a system where the rules aren’t written yet.
No major LLM provider has officially committed to using llms.txt. Crawl-to-refer ratios show Anthropic crawling 38,000 pages for every visitor it sends back. And Google’s John Mueller has noted that server logs show AI services don’t even check for llms.txt files.
The infrastructure we’re building, from the markdown files to the optimization guides to the best practices, might matter eventually. Or it might not.
What we know works: useful content, clear structure, technical accessibility, and external validation.
Focus on the fundamentals. Measure what matters. When AI platforms publish official guidelines, adapt. Until then, build for users and let AI discovery follow.
The platforms will change. The formats will evolve. But content that solves real problems will always be discoverable, whether that’s through search engines, AI models, or whatever comes next.