AI Bot User Agents List 2026 — Complete Reference
The complete list of AI crawler and AI training bot user agent strings, with detection notes and compliance information. Use this as your canonical reference when building bot-blocking middleware for any platform.
Quick-copy bot list
Lowercase strings for case-insensitive substring matching. Use this array directly in your middleware. See the full table below for details on each bot.
// AI bot user agents: 2026 reference list
// Use case-insensitive substring matching (see detection section below)
const AI_BOTS = [
  // OpenAI
  'gptbot',            // Training crawler
  'chatgpt-user',      // ChatGPT browsing feature
  'oai-searchbot',     // SearchGPT indexing
  // Anthropic
  'claudebot',         // Training crawler
  'anthropic-ai',      // General Anthropic crawl
  'claude-web',        // Claude web tool
  // Google
  'google-extended',   // Gemini/Vertex AI training (NOT regular Googlebot)
  // Common Crawl (used by most LLMs)
  'ccbot',
  // Other major AI companies
  'bytespider',        // ByteDance / TikTok
  'applebot-extended', // Apple Intelligence training
  'perplexitybot',     // Perplexity AI search
  'diffbot',           // Knowledge graph
  'cohere-ai',         // Cohere
  'facebookbot',       // Meta AI / Llama
  'amazonbot',         // Amazon / Alexa AI
  // Data aggregators
  'omgili',            // Webz.io (also omgilibot)
  'omgilibot',
  // AI search engines
  'iaskspider',        // iAsk.ai
  'youbot',            // You.com
  // Image dataset builders
  'img2dataset',
];
Full bot reference table
| Bot | UA string | Company | Purpose | robots.txt |
|---|---|---|---|---|
| GPTBot | GPTBot | OpenAI | AI training crawler | ✅ Yes |
| ChatGPT-User | ChatGPT-User | OpenAI | ChatGPT web browsing | ✅ Yes |
| OAI-SearchBot | OAI-SearchBot | OpenAI | SearchGPT indexing | ✅ Yes |
| ClaudeBot | ClaudeBot | Anthropic | AI training crawler | ✅ Yes |
| anthropic-ai | anthropic-ai | Anthropic | General Anthropic crawl | ✅ Yes |
| Claude-Web | Claude-Web | Anthropic | Claude web access feature | ✅ Yes |
| Google-Extended | Google-Extended | Google | Gemini / Vertex AI training | ✅ Yes |
| CCBot | CCBot | Common Crawl | Web archive for AI training datasets | ⚠️ Variable |
| Bytespider | Bytespider | ByteDance | AI training / TikTok | ⚠️ Variable |
| Applebot-Extended | Applebot-Extended | Apple | Apple Intelligence training | ✅ Yes |
| PerplexityBot | PerplexityBot | Perplexity AI | AI search indexing | ✅ Yes |
| Diffbot | Diffbot | Diffbot | Knowledge graph / AI data | ⚠️ Variable |
| cohere-ai | cohere-ai | Cohere | AI training | ✅ Yes |
| FacebookBot | FacebookBot | Meta | Meta AI training | ✅ Yes |
| Amazonbot | Amazonbot | Amazon | Alexa / Amazon AI | ✅ Yes |
| omgili | omgili | Webz.io | Data aggregation for AI | ⚠️ Variable |
| iaskspider | iaskspider | iAsk.ai | AI search indexing | ✅ Yes |
| YouBot | YouBot | You.com | AI search indexing | ✅ Yes |
| img2dataset | img2dataset | Various (HuggingFace) | Image dataset collection | ❌ No |
| Scrapy | Scrapy | Open source | Generic scraping framework | ❌ No |
robots.txt compliance is self-reported by operators. "Variable" means documented cases of non-compliance or inconsistent behaviour across deployments.
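The compliance column can also drive enforcement decisions. The sketch below is one hypothetical way to encode it: `ROBOTS_COMPLIANCE` and `needs_server_block` are illustrative names, the values mirror a subset of the table above, and treating unknown tokens as non-compliant is a conservative assumption rather than a documented fact.

```python
# Compliance values copied from a subset of the reference table.
# The mapping name and helper are hypothetical, for illustration only.
ROBOTS_COMPLIANCE = {
    "gptbot": "yes",
    "claudebot": "yes",
    "ccbot": "variable",
    "bytespider": "variable",
    "diffbot": "variable",
    "omgili": "variable",
    "img2dataset": "no",
    "scrapy": "no",
}


def needs_server_block(token: str) -> bool:
    """Return True when robots.txt alone is unlikely to be enough.

    Tokens missing from the mapping are treated as non-compliant
    (an assumption chosen to fail safe).
    """
    return ROBOTS_COMPLIANCE.get(token.lower(), "no") != "yes"


# Bots that should be enforced in middleware, not only in robots.txt.
server_side_only = sorted(t for t in ROBOTS_COMPLIANCE if needs_server_block(t))
```

Bots marked "Variable" or "No" end up in `server_side_only`, which matches the advice to back robots.txt with a server-level block for those crawlers.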
⚠️ Do not block Googlebot
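Googlebot is Google's standard search indexing crawler. Blocking it removes your site from Google Search entirely. Google uses a separate token, Google-Extended, to control Gemini and Vertex AI training. Google documents Google-Extended as a robots.txt control token rather than a distinct HTTP User-Agent string, so the robots.txt rule is what takes effect; keeping the token in your middleware list is harmless but is unlikely to match live requests. Never block Googlebot.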
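A small regression check can guard against a blocklist edit that accidentally catches Googlebot. This is a minimal sketch: the shortened `AI_BOTS` list, the `check_blocklist` helper, and the two sample User-Agent strings are illustrative assumptions, so substitute strings taken from your own logs.

```python
AI_BOTS = ["gptbot", "claudebot", "google-extended", "ccbot", "bytespider"]


def is_ai_bot(user_agent: str) -> bool:
    """Case-insensitive substring match against the blocklist."""
    ua = user_agent.lower()
    return any(bot in ua for bot in AI_BOTS)


# Sample header values; treat the exact strings as illustrative.
GOOGLEBOT_UA = (
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)
GPTBOT_UA = (
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; "
    "GPTBot/1.0; +https://openai.com/gptbot)"
)


def check_blocklist() -> None:
    """Raise AssertionError if the list would block regular Googlebot."""
    assert not is_ai_bot(GOOGLEBOT_UA), "blocklist matches Googlebot"
    assert is_ai_bot(GPTBOT_UA), "blocklist misses GPTBot"
    # Any token that is a substring of 'googlebot' (such as 'bot' or
    # 'google') would match every Googlebot request.
    assert not any(token in "googlebot" for token in AI_BOTS)


check_blocklist()
```

Running the check in CI, or once at application start, turns an overly broad token into a failed build instead of a quiet loss of search traffic.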
How bot detection works
Always use case-insensitive substring matching on the User-Agent header. Never use exact equality — user agent strings include version numbers, platform tokens, and additional metadata that change between versions.
JavaScript / TypeScript
const AI_BOTS = ['gptbot', 'claudebot', 'anthropic-ai', 'google-extended',
  'ccbot', 'bytespider', 'applebot-extended', 'perplexitybot',
  'diffbot', 'cohere-ai', 'facebookbot', 'amazonbot', 'omgili',
  'omgilibot', 'iaskspider', 'youbot', 'img2dataset',
  'chatgpt-user', 'oai-searchbot', 'claude-web'];

function isAIBot(userAgent: string): boolean {
  const ua = userAgent.toLowerCase();
  return AI_BOTS.some(bot => ua.includes(bot));
}
Python
AI_BOTS = [
    'gptbot', 'chatgpt-user', 'oai-searchbot',
    'claudebot', 'anthropic-ai', 'claude-web',
    'google-extended', 'ccbot', 'bytespider',
    'applebot-extended', 'perplexitybot', 'diffbot',
    'cohere-ai', 'facebookbot', 'amazonbot',
    'omgili', 'omgilibot', 'iaskspider', 'youbot', 'img2dataset',
]

def is_ai_bot(user_agent: str) -> bool:
    # Lowercase once, then test each known token as a substring.
    ua = user_agent.lower()
    return any(bot in ua for bot in AI_BOTS)
PHP
// A constant is visible inside functions; a plain global variable is not.
const AI_BOTS = [
    'gptbot', 'chatgpt-user', 'oai-searchbot',
    'claudebot', 'anthropic-ai', 'claude-web',
    'google-extended', 'ccbot', 'bytespider',
    'applebot-extended', 'perplexitybot', 'diffbot',
    'cohere-ai', 'facebookbot', 'amazonbot',
    'omgili', 'omgilibot', 'iaskspider', 'youbot', 'img2dataset',
];

function isAIBot(string $userAgent): bool {
    $ua = strtolower($userAgent);
    foreach (AI_BOTS as $bot) {
        // str_contains() requires PHP 8.0 or later.
        if (str_contains($ua, $bot)) return true;
    }
    return false;
}
Go
import "strings"

var aiBots = []string{
	"gptbot", "chatgpt-user", "oai-searchbot",
	"claudebot", "anthropic-ai", "claude-web",
	"google-extended", "ccbot", "bytespider",
	"applebot-extended", "perplexitybot", "diffbot",
	"cohere-ai", "facebookbot", "amazonbot",
	"omgili", "omgilibot", "iaskspider", "youbot", "img2dataset",
}

// isAIBot reports whether the User-Agent contains a known AI bot token.
func isAIBot(userAgent string) bool {
	ua := strings.ToLower(userAgent)
	for _, bot := range aiBots {
		if strings.Contains(ua, bot) {
			return true
		}
	}
	return false
}
robots.txt directives for all bots
Each AI bot that respects robots.txt has its own User-agent token. List them individually to be explicit. A single User-agent: * disallow would also block Googlebot — don't do that.
# robots.txt: AI training crawler block
User-agent: *
Allow: /

# OpenAI
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: OAI-SearchBot
Disallow: /

# Anthropic
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: Claude-Web
Disallow: /

# Google AI (NOT regular Googlebot)
User-agent: Google-Extended
Disallow: /

# Common Crawl
User-agent: CCBot
Disallow: /

# ByteDance
User-agent: Bytespider
Disallow: /

# Apple AI
User-agent: Applebot-Extended
Disallow: /

# AI search engines
User-agent: PerplexityBot
User-agent: iaskspider
User-agent: YouBot
Disallow: /

# Data aggregators
User-agent: Diffbot
User-agent: cohere-ai
User-agent: FacebookBot
User-agent: Amazonbot
User-agent: omgili
User-agent: omgilibot
Disallow: /
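Before deploying a robots.txt like this, it can help to confirm that the rules block the AI tokens while leaving Googlebot alone. The sketch below uses Python's standard `urllib.robotparser` on a shortened copy of the file. The `allowed` helper and the example.com URL are illustrative, and the parser's matching rules are only an approximation of how any particular crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the rules, enough to exercise both outcomes.
ROBOTS_TXT = """\
User-agent: *
Allow: /

User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: OAI-SearchBot
Disallow: /

User-agent: Google-Extended
Disallow: /
"""


def allowed(agent: str, url: str = "https://example.com/article") -> bool:
    """Return True if the rules above let `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


for agent in ("GPTBot", "ChatGPT-User", "Google-Extended", "Googlebot"):
    print(f"{agent}: {'allowed' if allowed(agent) else 'blocked'}")
```

Because Googlebot matches none of the named groups, it falls through to the `User-agent: *` group and stays allowed.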
How often does this list change?
Major new AI bot user agents appear 2–4 times per year, typically when a large AI company launches a new product or changes their crawling strategy. The core list has been stable since 2023.
How to check for new bots
- OpenAI: platform.openai.com/docs/gptbot
- Anthropic: docs.anthropic.com/claude/docs/claude-crawler
- Google: developers.google.com/search/docs/crawling-indexing/overview-google-crawlers
- Check your server access logs monthly for unknown user agents; new bots almost never announce themselves
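The monthly log check can be partly automated. The sketch below assumes the common "combined" access log format, where the User-Agent is the last double-quoted field on each line. The hint words in `BOT_HINTS`, the shortened `KNOWN_BOTS` list, and the `unknown_bot_agents` helper are illustrative assumptions, not an exhaustive detector.

```python
import re
from collections import Counter

KNOWN_BOTS = ["gptbot", "claudebot", "ccbot", "bytespider", "perplexitybot"]
# Heuristic hint words; an assumption, not a complete rule.
BOT_HINTS = ("bot", "crawler", "spider", "scraper")
# In the combined log format the User-Agent is the last quoted field.
LAST_QUOTED = re.compile(r'"([^"]*)"\s*$')


def unknown_bot_agents(log_lines, known=KNOWN_BOTS):
    """Count bot-like User-Agent strings that match no known token."""
    counts = Counter()
    for line in log_lines:
        match = LAST_QUOTED.search(line)
        if not match:
            continue  # line is not in the expected format
        user_agent = match.group(1)
        lowered = user_agent.lower()
        looks_like_bot = any(hint in lowered for hint in BOT_HINTS)
        already_known = any(token in lowered for token in known)
        if looks_like_bot and not already_known:
            counts[user_agent] += 1
    return counts
```

Feed it the lines of an access log and review `counts.most_common()`. Crawlers you deliberately allow, such as Googlebot, will also appear unless their tokens are added to the `known` list.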
FAQ
How do I detect AI bots by user agent string?
Use case-insensitive substring matching on the User-Agent request header. Convert the incoming header to lowercase and check if it contains any of the known bot strings (also lowercased). Exact full-string matching is too brittle — user agent strings include version numbers that change between releases.
Do AI bots respect the robots.txt Disallow directive?
Major AI companies — OpenAI (GPTBot), Anthropic (ClaudeBot), Google (Google-Extended), Apple (Applebot-Extended), and Perplexity (PerplexityBot) — officially state that their training crawlers respect robots.txt. CCBot, Bytespider, and Diffbot have mixed compliance records. For reliable blocking, use server-level middleware returning a 403 in addition to robots.txt.
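As one possible shape for that server-level block, here is a minimal WSGI middleware sketch that returns 403 to matching user agents. The `BlockAIBots` class, the shortened `AI_BOTS` list, and `demo_app` are hypothetical names for illustration; adapt the idea to whatever framework you actually run.

```python
AI_BOTS = ["gptbot", "claudebot", "ccbot", "bytespider", "diffbot"]


def is_ai_bot(user_agent: str) -> bool:
    ua = user_agent.lower()
    return any(bot in ua for bot in AI_BOTS)


class BlockAIBots:
    """WSGI middleware that answers known AI bots with 403 Forbidden."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        # A missing header is treated as an empty string, so not a bot.
        user_agent = environ.get("HTTP_USER_AGENT", "")
        if is_ai_bot(user_agent):
            body = b"Forbidden"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Everyone else is passed through to the wrapped application.
        return self.app(environ, start_response)


def demo_app(environ, start_response):
    # Hypothetical placeholder application used only for this example.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello"]


app = BlockAIBots(demo_app)
```

Any WSGI server can serve `app`. Requests from listed bots never reach the wrapped application.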
Should I block Googlebot to stop Google AI from using my content?
No. Googlebot is the standard search indexing crawler — blocking it removes your site from Google Search entirely. Google uses a separate token, Google-Extended, for Gemini and Vertex AI training. Block Google-Extended. Never block Googlebot.
What is the difference between GPTBot and ChatGPT-User?
GPTBot is OpenAI's training crawler — it scrapes the web to build training datasets. ChatGPT-User is used when a ChatGPT user triggers a browse/search action. OAI-SearchBot is for SearchGPT indexing. Block all three to prevent any OpenAI access.
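If you want a narrower policy than blocking all three, the distinction above can be encoded as a purpose lookup. This is a sketch under stated assumptions: the purpose labels, the `BOT_PURPOSE` mapping, and the `should_block` helper are hypothetical names, and the default of blocking only training traffic is just an example policy.

```python
# Purpose labels follow the descriptions above; the names are illustrative.
BOT_PURPOSE = {
    "gptbot": "training",
    "chatgpt-user": "user-triggered",
    "oai-searchbot": "search-index",
}


def openai_bot_purpose(user_agent: str):
    """Return the purpose label for an OpenAI bot, or None if no match."""
    ua = user_agent.lower()
    for token, purpose in BOT_PURPOSE.items():
        if token in ua:
            return purpose
    return None


def should_block(user_agent: str,
                 blocked_purposes=frozenset({"training"})) -> bool:
    """Block only the purposes listed in `blocked_purposes`."""
    return openai_bot_purpose(user_agent) in blocked_purposes
```

Passing `blocked_purposes=frozenset(BOT_PURPOSE.values())` reproduces the block-everything behaviour described above.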
How often does this list change?
Major new AI bot user agents appear 2-4 times per year. The core list (GPTBot, ClaudeBot, Google-Extended, CCBot, Bytespider) has been stable since 2023. Check official docs from OpenAI, Anthropic, and Google periodically.
What is CCBot and why is it on the AI bot list?
CCBot is the Common Crawl foundation's crawler. It archives the web into a publicly available dataset. Published training-data descriptions for models such as GPT-3 and Llama list Common Crawl as a source, and the dataset is widely believed to feed many other major LLMs. Blocking CCBot keeps your content out of future Common Crawl snapshots, and therefore out of the datasets built from them.