AI Bot Detection: Which Crawlers Visit Your Site and What They Want

Your WordPress site is being visited by AI crawlers right now. Most site owners have no idea which bots are scraping their content, how often they visit, or what they do with what they find. Here's the complete picture.

The AI Crawler Landscape (2026)

| Bot | Operator | Purpose | User-Agent Contains |
|---|---|---|---|
| GPTBot | OpenAI | Training data + browsing | GPTBot |
| ChatGPT-User | OpenAI | Live browsing (user-initiated) | ChatGPT-User |
| ClaudeBot | Anthropic | Training data | ClaudeBot |
| Claude-Web | Anthropic | Live search/citation | Claude-Web |
| PerplexityBot | Perplexity | Real-time search answers | PerplexityBot |
| Google-Extended | Google | Gemini training data | Google-Extended |
| Googlebot (AI Overviews) | Google | AI Overview generation | Googlebot |
| Bingbot (Copilot) | Microsoft | Copilot answers | bingbot |
| Applebot-Extended | Apple | Siri/AI features | Applebot-Extended |
| Bytespider | ByteDance | TikTok AI features | Bytespider |
| CCBot | Common Crawl | Open training datasets | CCBot |
| FacebookBot | Meta | AI features | FacebookExternalHit |
| Cohere-ai | Cohere | Enterprise AI | cohere-ai |
| Meta-ExternalAgent | Meta | Llama training | Meta-ExternalAgent |
| YandexBot (AI) | Yandex | AI search features | YandexBot |
| Amazonbot | Amazon | Alexa/AI features | Amazonbot |
| ImagesiftBot | ImageSift | Image + text training | ImagesiftBot |
| Diffbot | Diffbot | Knowledge graph extraction | Diffbot |
| Timpibot | Timpi | Decentralized search | Timpibot |
| Omgili | Omgili/Webz.io | AI data feeds | omgili |

This list grows monthly. As of mid-2026, a typical WordPress site with moderate traffic sees 5–15 unique AI bots per week.

Detection Methods

User-Agent String Matching

The simplest method, and the most reliable for crawlers that identify themselves honestly. Each bot announces itself via the User-Agent HTTP header:

User-Agent: Mozilla/5.0 AppleWebKit/537.36 (compatible; GPTBot/1.2; +https://openai.com/gptbot)

Pattern matching against known signatures catches the vast majority of AI crawlers. The Zitably plugin ships with 20+ built-in signatures and lets you add custom patterns.
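A minimal sketch of this kind of matching, written in Python for illustration. It is not the plugin's code: the signature list is a small subset of the table above, and `detect_ai_bot` is a hypothetical helper name.

```python
import re
from typing import Optional

# Illustrative subset of the signatures listed in the table; extend as needed.
AI_BOT_SIGNATURES = {
    "GPTBot": r"GPTBot",
    "ChatGPT-User": r"ChatGPT-User",
    "ClaudeBot": r"ClaudeBot",
    "PerplexityBot": r"PerplexityBot",
    "CCBot": r"CCBot",
    "Bytespider": r"Bytespider",
}

# Compile once at import time so each request only pays for the searches.
_COMPILED = [
    (name, re.compile(pattern, re.IGNORECASE))
    for name, pattern in AI_BOT_SIGNATURES.items()
]


def detect_ai_bot(user_agent: Optional[str]) -> Optional[str]:
    """Return the name of the first matching signature, or None."""
    if not user_agent:
        return None
    for name, pattern in _COMPILED:
        if pattern.search(user_agent):
            return name
    return None
```

Matching is case-insensitive here because crawlers are not consistent about capitalization (compare `bingbot` and `GPTBot` in the table).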

Reverse DNS Verification

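For critical applications, verify that a request claiming to come from a known crawler actually originates from its operator's network. For crawlers that support it, such as Googlebot and Bingbot, use forward-confirmed reverse DNS:

  1. Reverse DNS lookup on the client IP
  2. Verify the hostname matches the expected domain (e.g., *.googlebot.com for Googlebot)
  3. Forward DNS lookup to confirm the hostname resolves back to the same IP

This prevents spoofing but adds latency. Some operators, including OpenAI for GPTBot, publish IP ranges instead; for those, check the client IP against the published list. Most analytics use cases don't need either check.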
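The three steps above can be sketched roughly as follows, using only Python's standard `socket` module. The function name `verify_crawler_ip` and the injectable `reverse`/`forward` parameters are assumptions made for this example (they make the check testable without network access); the `.googlebot.com` suffix is used purely as an illustration.

```python
import socket
from typing import Callable, Iterable, List


def _reverse_lookup(ip: str) -> str:
    # Step 1: PTR lookup, returns the primary hostname for the IP.
    return socket.gethostbyaddr(ip)[0]


def _forward_lookup(hostname: str) -> List[str]:
    # Step 3: resolve the hostname back to its addresses.
    return [info[4][0] for info in socket.getaddrinfo(hostname, None)]


def verify_crawler_ip(
    ip: str,
    allowed_suffixes: Iterable[str],
    reverse: Callable[[str], str] = _reverse_lookup,
    forward: Callable[[str], List[str]] = _forward_lookup,
) -> bool:
    """Forward-confirmed reverse DNS check for a claimed crawler IP."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:  # covers socket.herror and socket.gaierror
        return False
    # Step 2: suffixes should start with a dot so that
    # "evilgooglebot.com" does not pass as ".googlebot.com".
    if not any(hostname.endswith(suffix) for suffix in allowed_suffixes):
        return False
    try:
        # Plain string comparison; adequate for IPv4 in this sketch.
        return ip in forward(hostname)
    except OSError:
        return False
```

Because each check costs up to two DNS round trips, a real deployment would likely cache results per IP rather than run this on every request.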

Behavioral Signals

Some AI agents don't identify themselves clearly. Behavioral signals that suggest AI activity:

  • Very high pages-per-session counts (50+ pages within seconds)
  • Systematic crawl patterns (hitting every page in order)
  • No JavaScript execution (no analytics pings)
  • Accept headers requesting text/markdown or application/json
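
One way to combine these signals is a simple additive score, sketched below. Everything here is an assumption for illustration: the `SessionStats` fields, the 60-second window (the list above only says "seconds"), and the equal weighting. A real system would tune these against its own traffic.

```python
from dataclasses import dataclass


@dataclass
class SessionStats:
    """Hypothetical per-session summary built from server logs."""
    pages_requested: int
    duration_seconds: float
    executed_javascript: bool
    accept_header: str = ""
    sequential_crawl: bool = False


def suspicion_score(session: SessionStats) -> int:
    """Count how many of the four behavioral signals a session shows (0-4)."""
    score = 0
    # Signal 1: 50+ pages in a short window (60s is an assumed cutoff).
    if session.pages_requested >= 50 and session.duration_seconds <= 60:
        score += 1
    # Signal 2: systematic, in-order crawl pattern.
    if session.sequential_crawl:
        score += 1
    # Signal 3: no JavaScript execution, so no analytics pings.
    if not session.executed_javascript:
        score += 1
    # Signal 4: Accept header asks for machine-friendly formats.
    accept = session.accept_header.lower()
    if "text/markdown" in accept or "application/json" in accept:
        score += 1
    return score
```

No single signal is conclusive (many privacy-conscious human visitors block JavaScript, for instance), so it is safer to act only on sessions that trip several at once.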

What Each Bot Does with Your Content

Training Data Bots (GPTBot, ClaudeBot, CCBot)

These crawl your site to build training datasets for future model versions. Content they ingest today may influence model outputs months from now. You can opt out via robots.txt, which well-behaved crawlers honor:

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

The trade-off: blocked content won't be in future training data, which makes it harder for those models to cite you.
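
Before deploying rules like these, it can help to confirm which crawlers they actually affect. The sketch below uses Python's standard `urllib.robotparser`; the `is_allowed` helper and the example.com URL are placeholders for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Report whether robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/post"))         # False
print(is_allowed(ROBOTS_TXT, "PerplexityBot", "https://example.com/post"))  # True
```

Crawlers that are not named, and have no matching `User-agent: *` group, remain allowed by default.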

Live Search Bots (PerplexityBot, ChatGPT-User, Google AI)

These fetch content in real-time to answer user queries right now. They're your AI search traffic. Blocking them means immediate loss of AI visibility.

Knowledge Graph Bots (Diffbot, Google-Extended)

These extract structured information to build knowledge bases that AI systems reference. They care most about Schema.org markup, tables, and clearly structured facts.

Why Detection Matters for GEO

You can't optimize what you don't measure. AI bot detection gives you:

  1. Visibility data — Which AI systems are reading your content?
  2. Content priorities — Which pages get crawled most? (Optimize those first)
  3. Format decisions — Which bots request Markdown? (Enable content negotiation for them)
  4. ROI measurement — After GEO optimization, do crawl rates increase?
  5. Competitive intel — If bots stop visiting a page, your content may be losing relevance
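
Per-page crawl counts (item 2) can be approximated from ordinary server access logs. The sketch below assumes the common "combined" log format used by Apache and Nginx; the regular expression, the `BOTS` tuple, and the `crawl_counts` name are all illustrative choices, not part of any plugin.

```python
import re
from collections import Counter
from typing import Iterable

# Matches: "GET /path HTTP/1.1" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

# Illustrative subset of bot signatures to track.
BOTS = ("GPTBot", "ClaudeBot", "PerplexityBot")


def crawl_counts(lines: Iterable[str]) -> Counter:
    """Count requests per (bot, path) pair in combined-format log lines."""
    counts: Counter = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # skip lines that are not in the expected format
        user_agent = match.group("ua").lower()
        for bot in BOTS:
            if bot.lower() in user_agent:
                counts[(bot, match.group("path"))] += 1
                break
    return counts
```

Calling `crawl_counts(open("access.log")).most_common(10)` would then list the ten most-crawled bot and page combinations, assuming the log lives at that path.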

Zitably's Bot Detection

The Zitably plugin detects AI bots on every request and provides:

  • Real-time dashboard — See which bots visited, when, and what they read
  • Per-page analytics — Know which content AI systems prioritize
  • Custom signatures — Add your own bot patterns (Name:regex format)
  • Excluded bots — Optionally exclude bots from Markdown injection
  • Negligible performance impact (pattern matching adds <1ms per request)

All detection runs server-side on WordPress without external API calls or JavaScript tracking.
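
To make the Name:regex idea concrete, here is a hypothetical parser for such entries. It is not the plugin's implementation: the one-entry-per-line layout, the case-insensitive matching, and the choice to silently skip invalid patterns are all assumptions made for this sketch.

```python
import re
from typing import Dict


def parse_custom_signatures(text: str) -> Dict[str, re.Pattern]:
    """Parse 'Name:regex' lines into a name -> compiled pattern mapping."""
    signatures: Dict[str, re.Pattern] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        # Split on the first colon only, so the regex may contain colons.
        name, separator, pattern = line.partition(":")
        name, pattern = name.strip(), pattern.strip()
        if not separator or not name or not pattern:
            continue
        try:
            signatures[name] = re.compile(pattern, re.IGNORECASE)
        except re.error:
            continue  # ignore patterns that do not compile
    return signatures
```

Splitting on the first colon only is a deliberate choice: it lets a pattern such as `https?://crawler\.example` survive intact.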

The Blocking vs. Serving Decision

| Strategy | Pros | Cons |
|---|---|---|
| Block all AI bots | Protect content from training | Invisible in AI search |
| Allow all AI bots | Maximum AI visibility | Content used for training |
| Selective (recommended) | Block training bots, serve search bots | Requires maintenance |
| Serve optimized content | Best visibility + controlled format | Slight server overhead |

Our recommendation: allow search bots (they drive traffic), and serve them optimized Markdown. For training bots, decide based on your business model. If you want to be in future AI answers, allow training too.
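
For the selective strategy, the robots.txt rules can be generated from a list of crawlers you choose to block. The sketch below is one possible approach; `build_robots_txt` is a hypothetical helper, and which bots belong on the list is your own decision.

```python
from typing import Iterable

# Example block list: the training crawlers named earlier in this article.
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "CCBot"]


def build_robots_txt(blocked_agents: Iterable[str]) -> str:
    """Emit one 'Disallow: /' group per blocked crawler."""
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in blocked_agents]
    return "\n\n".join(groups) + "\n"


print(build_robots_txt(TRAINING_BOTS))
```

Crawlers that are not listed, such as PerplexityBot or ChatGPT-User, stay allowed by default, which matches the "block training bots, serve search bots" row in the table.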


See which AI bots visit your site. Install Zitably and check the dashboard within 24 hours.