AI Bot Detection: Which Crawlers Visit Your Site and What They Want
Your WordPress site is being visited by AI crawlers right now. Most site owners have no idea which bots are scraping their content, how often they visit, or what they do with what they find. Here's the complete picture.
The AI Crawler Landscape (2026)
| Bot | Operator | Purpose | User-Agent Contains |
|---|---|---|---|
| GPTBot | OpenAI | Training data + browsing | GPTBot |
| ChatGPT-User | OpenAI | Live browsing (user-initiated) | ChatGPT-User |
| ClaudeBot | Anthropic | Training data | ClaudeBot |
| Claude-Web | Anthropic | Live search/citation | Claude-Web |
| PerplexityBot | Perplexity | Real-time search answers | PerplexityBot |
| Google-Extended | Google | Gemini training data | Google-Extended |
| Googlebot (AI Overviews) | Google | AI Overview generation | Googlebot |
| Bingbot (Copilot) | Microsoft | Copilot answers | bingbot |
| Applebot-Extended | Apple | Siri/AI features | Applebot-Extended |
| Bytespider | ByteDance | TikTok AI features | Bytespider |
| CCBot | Common Crawl | Open training datasets | CCBot |
| FacebookBot | Meta | AI features | FacebookBot |
| Cohere-ai | Cohere | Enterprise AI | cohere-ai |
| Meta-ExternalAgent | Meta | Llama training | Meta-ExternalAgent |
| YandexBot (AI) | Yandex | AI search features | YandexBot |
| Amazonbot | Amazon | Alexa/AI features | Amazonbot |
| ImagesiftBot | ImageSift | Image + text training | ImagesiftBot |
| Diffbot | Diffbot | Knowledge graph extraction | Diffbot |
| Timpibot | Timpi | Decentralized search | Timpibot |
| Omgili | Omgili/Webz.io | AI data feeds | omgili |
This list grows monthly. As of mid-2026, a typical WordPress site with moderate traffic sees 5–15 unique AI bots per week.
Detection Methods
User-Agent String Matching
The simplest method, and reliable for any bot that identifies itself honestly. Each bot announces itself via the User-Agent HTTP header:
User-Agent: Mozilla/5.0 AppleWebKit/537.36 (compatible; GPTBot/1.2; +https://openai.com/gptbot)
Pattern matching against known signatures catches the vast majority of AI crawlers. The Zitably plugin ships with 20+ built-in signatures and lets you add custom patterns.
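As a rough illustration of signature matching, here is a minimal Python sketch. The signature list is a small subset of the table above, and the names `AI_BOT_SIGNATURES` and `detect_ai_bot` are made up for this example; this is not the plugin's actual code.

```python
import re

# Illustrative subset of the signatures in the table above (assumption:
# a case-insensitive substring match is good enough for each one).
AI_BOT_SIGNATURES = {
    "GPTBot": r"GPTBot",
    "ChatGPT-User": r"ChatGPT-User",
    "ClaudeBot": r"ClaudeBot",
    "PerplexityBot": r"PerplexityBot",
    "CCBot": r"CCBot",
    "Bytespider": r"Bytespider",
}

# Compile once at import time so each request only pays for the search.
_COMPILED = {
    name: re.compile(pattern, re.IGNORECASE)
    for name, pattern in AI_BOT_SIGNATURES.items()
}


def detect_ai_bot(user_agent):
    """Return the name of the first matching signature, or None."""
    if not user_agent:
        return None
    for name, pattern in _COMPILED.items():
        if pattern.search(user_agent):
            return name
    return None
```

Calling `detect_ai_bot` with the GPTBot header shown above returns `"GPTBot"`; an ordinary browser User-Agent returns `None`.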
Reverse DNS Verification
For critical applications, verify that a bot claiming to be GPTBot actually comes from OpenAI's IP range:
- Reverse DNS lookup on the client IP
- Verify the hostname matches the expected domain (e.g., *.openai.com for GPTBot)
- Forward DNS lookup to confirm the hostname resolves back to the same IP
This prevents spoofing but adds latency. Most analytics use cases don't need this.
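The three steps can be sketched in Python with the standard `socket` module. Treat this as a sketch under assumptions: the function name `verify_bot_ip` and the injectable lookup parameters are invented for the example, the `.openai.com` suffix comes from the example above, and you should check each operator's documentation for the domains or IP ranges they actually publish.

```python
import socket


def verify_bot_ip(ip, allowed_suffixes, reverse_lookup=None, forward_lookup=None):
    """Forward-confirmed reverse DNS check.

    allowed_suffixes should start with a dot (".openai.com") so that
    "evilopenai.com" does not pass the suffix test.
    The lookup functions are injectable so the logic can be tested offline.
    """
    # Default lookups use the system resolver. gethostbyname_ex is IPv4-only.
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        # Step 1: reverse DNS lookup on the client IP.
        hostname = reverse_lookup(ip).rstrip(".").lower()
        # Step 2: the hostname must belong to an expected domain.
        if not any(hostname.endswith(suffix.lower()) for suffix in allowed_suffixes):
            return False
        # Step 3: forward lookup must resolve back to the same IP.
        return ip in forward_lookup(hostname)
    except OSError:
        # socket.herror and socket.gaierror are OSError subclasses.
        return False
```

Because each check costs up to two DNS round trips, caching the result per IP is a sensible addition if you run this on live traffic.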
Behavioral Signals
Some AI agents don't identify themselves clearly. Behavioral signals that suggest AI activity:
- Very high page-per-session counts (50+ pages in seconds)
- Systematic crawl patterns (hitting every page in order)
- No JavaScript execution (no analytics pings)
- Accept headers requesting text/markdown or application/json
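One way to combine the signals listed above is a simple additive score. The sketch below is only a heuristic: the `SessionStats` fields, the weights, and the thresholds (50 pages in 10 seconds) are assumptions chosen for illustration, not measured values.

```python
from dataclasses import dataclass


@dataclass
class SessionStats:
    """Per-session counters; field names are hypothetical."""
    pages_requested: int
    duration_seconds: float
    executed_javascript: bool
    accept_header: str = ""
    sequential_crawl: bool = False


def ai_agent_score(stats):
    """Return a score from 0 to 5; higher means more bot-like."""
    score = 0
    # Very high page count in a very short session (weighted double).
    if stats.pages_requested >= 50 and stats.duration_seconds <= 10:
        score += 2
    # Pages were requested in a systematic order.
    if stats.sequential_crawl:
        score += 1
    # No analytics ping was seen, so JavaScript probably never ran.
    if not stats.executed_javascript:
        score += 1
    # The client asked for machine-friendly formats.
    accept = stats.accept_header.lower()
    if "text/markdown" in accept or "application/json" in accept:
        score += 1
    return score
```

A score like this is best used to flag sessions for review rather than to block traffic, since privacy tools and API clients can trigger some of the same signals.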
What Each Bot Does with Your Content
Training Data Bots (GPTBot, ClaudeBot, CCBot)
These crawl your site to build training datasets for future model versions. Content they ingest today may influence model outputs months from now. You can block them via robots.txt:
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
But blocking means your content won't be in future training data, which may make it harder for future models to cite you.
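Before deploying rules like these, it can help to confirm they do what you expect. The following sketch uses Python's standard `urllib.robotparser` to test the rules above; the helper name `is_allowed` and the example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# The rules from the example above.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With these rules, `is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/post")` is `False`, while the same call for `"PerplexityBot"` is `True` because no rule names it. Keep in mind that robots.txt is advisory: it only affects crawlers that choose to honor it.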
Live Search Bots (PerplexityBot, ChatGPT-User, Google AI)
These fetch content in real-time to answer user queries right now. They're your AI search traffic. Blocking them means immediate loss of AI visibility.
Knowledge Graph Bots (Diffbot, Google-Extended)
These extract structured information to build knowledge bases that AI systems reference. They care most about Schema.org markup, tables, and clearly structured facts.
Why Detection Matters for GEO
You can't optimize what you don't measure. AI bot detection gives you:
- Visibility data — Which AI systems are reading your content?
- Content priorities — Which pages get crawled most? (Optimize those first)
- Format decisions — Which bots request Markdown? (Enable content negotiation for them)
- ROI measurement — After GEO optimization, do crawl rates increase?
- Competitive intel — If bots stop visiting a page, your content may be losing relevance
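The visibility and content-priority questions above reduce to counting bot hits per page. Here is a minimal sketch, assuming you already have requests as `(path, user_agent)` pairs, for example parsed from an access log. The function names are invented for this example.

```python
from collections import Counter


def crawl_counts(requests, bot_names):
    """Count hits per (bot, path) from an iterable of (path, user_agent) pairs."""
    counts = Counter()
    for path, user_agent in requests:
        ua = user_agent.lower()
        for name in bot_names:
            if name.lower() in ua:
                counts[(name, path)] += 1
                break  # attribute each request to one bot only
    return counts


def top_pages(counts, n=3):
    """Return the n most-crawled paths, summed across all bots."""
    per_page = Counter()
    for (_bot, path), hits in counts.items():
        per_page[path] += hits
    return per_page.most_common(n)
```

Running the same aggregation before and after a round of optimization gives a simple baseline for the ROI question.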
Zitably's Bot Detection
The Zitably plugin detects AI bots on every request and provides:
- Real-time dashboard: See which bots visited, when, and what they read
- Per-page analytics: Know which content AI systems prioritize
- Custom signatures: Add your own bot patterns (Name:regex format)
- Excluded bots: Optionally exclude bots from Markdown injection
- Negligible performance impact: Pattern matching adds <1ms per request
All detection runs server-side on WordPress without external API calls or JavaScript tracking.
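To make the Name:regex idea concrete, here is a hypothetical Python parser for one such entry per line. The plugin's real parsing rules are not documented here, so the choices below (split on the first colon, match case-insensitively, skip invalid lines) are assumptions.

```python
import re


def parse_signatures(text):
    """Parse "Name:regex" lines into a dict of compiled patterns."""
    signatures = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or ":" not in line:
            continue  # skip blank lines and lines without a separator
        # Split on the first colon only, so the regex may contain colons.
        name, pattern = line.split(":", 1)
        name, pattern = name.strip(), pattern.strip()
        if not name or not pattern:
            continue
        try:
            signatures[name] = re.compile(pattern, re.IGNORECASE)
        except re.error:
            continue  # skip patterns that are not valid regular expressions
    return signatures
```

Skipping bad lines silently keeps the sketch short; a real settings screen would more likely report them back to the user.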
The Blocking vs. Serving Decision
| Strategy | Pros | Cons |
|---|---|---|
| Block all AI bots | Protect content from training | Invisible in AI search |
| Allow all AI bots | Maximum AI visibility | Content used for training |
| Selective (recommended) | Block training bots, serve search bots | Requires maintenance |
| Serve optimized content | Best visibility + controlled format | Slight server overhead |
Our recommendation: allow search bots (they drive traffic), and serve them optimized Markdown. For training bots, decide based on your business model. If you want to be in future AI answers, allow training too.
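The selective strategy can be expressed as a small decision function. This is a sketch only: the bot groupings follow the categories used earlier in this article, and the action names (`serve_markdown`, `serve_html`, `block`) are labels invented for the example.

```python
# Groupings taken from the categories earlier in this article.
TRAINING_BOTS = {"GPTBot", "ClaudeBot", "CCBot"}
SEARCH_BOTS = {"PerplexityBot", "ChatGPT-User"}


def choose_action(bot_name, allow_training=False):
    """Decide how to respond to a request from a detected bot.

    bot_name is the detected bot, or None for ordinary visitors.
    """
    if bot_name in SEARCH_BOTS:
        # Search bots drive traffic, so give them the optimized format.
        return "serve_markdown"
    if bot_name in TRAINING_BOTS:
        # Depends on your business model.
        return "serve_html" if allow_training else "block"
    # Humans and unrecognized agents get the normal page.
    return "serve_html"
```

Keeping the two sets in one place is what the "requires maintenance" cost in the table refers to: new bots have to be classified as they appear.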
See which AI bots visit your site. Install Zitably and check the dashboard within 24 hours.