AI Bot User Agents List 2026 — Complete Reference

The complete list of AI crawler and AI training bot user agent strings, with detection notes and compliance information. Use this as your canonical reference when building bot-blocking middleware for any platform.

Last updated: April 2026 · 20 bots listed

Quick-copy bot list

Lowercase strings for case-insensitive substring matching. Use this array directly in your middleware. See the full table below for details on each bot.

// AI bot user agents — 2026 reference list
// Use case-insensitive substring matching (see detection section below)

const AI_BOTS = [
  // OpenAI
  'gptbot',           // Training crawler
  'chatgpt-user',     // ChatGPT browsing feature
  'oai-searchbot',    // SearchGPT indexing

  // Anthropic
  'claudebot',        // Training crawler
  'anthropic-ai',     // General Anthropic crawl
  'claude-web',       // Claude web tool

  // Google
  'google-extended',  // Gemini/Vertex AI training (NOT regular Googlebot)

  // Common Crawl (used by most LLMs)
  'ccbot',

  // Other major AI companies
  'bytespider',       // ByteDance / TikTok
  'applebot-extended',// Apple Intelligence training
  'perplexitybot',    // Perplexity AI search
  'diffbot',          // Knowledge graph
  'cohere-ai',        // Cohere
  'facebookbot',      // Meta AI / Llama
  'amazonbot',        // Amazon / Alexa AI

  // Data aggregators
  'omgili',           // Webz.io (also omgilibot)
  'omgilibot',

  // AI search engines
  'iaskspider',       // iAsk.ai
  'youbot',           // You.com

  // Image dataset builders
  'img2dataset',
];

Full bot reference table

| Bot | UA string | Company | Purpose | robots.txt |
| --- | --- | --- | --- | --- |
| GPTBot | GPTBot | OpenAI | AI training crawler | ✅ Yes |
| ChatGPT-User | ChatGPT-User | OpenAI | ChatGPT web browsing | ✅ Yes |
| OAI-SearchBot | OAI-SearchBot | OpenAI | SearchGPT indexing | ✅ Yes |
| ClaudeBot | ClaudeBot | Anthropic | AI training crawler | ✅ Yes |
| anthropic-ai | anthropic-ai | Anthropic | General Anthropic crawl | ✅ Yes |
| Claude-Web | Claude-Web | Anthropic | Claude web access feature | ✅ Yes |
| Google-Extended | Google-Extended | Google | Gemini / Vertex AI training | ✅ Yes |
| CCBot | CCBot | Common Crawl | Web archive for AI training datasets | ⚠️ Variable |
| Bytespider | Bytespider | ByteDance | AI training / TikTok | ⚠️ Variable |
| Applebot-Extended | Applebot-Extended | Apple | Apple Intelligence training | ✅ Yes |
| PerplexityBot | PerplexityBot | Perplexity AI | AI search indexing | ✅ Yes |
| Diffbot | Diffbot | Diffbot | Knowledge graph / AI data | ⚠️ Variable |
| cohere-ai | cohere-ai | Cohere | AI training | ✅ Yes |
| FacebookBot | FacebookBot | Meta | Meta AI training | ✅ Yes |
| Amazonbot | Amazonbot | Amazon | Alexa / Amazon AI | ✅ Yes |
| omgili | omgili | Webz.io | Data aggregation for AI | ⚠️ Variable |
| iaskspider | iaskspider | iAsk.ai | AI search indexing | ✅ Yes |
| YouBot | YouBot | You.com | AI search indexing | ✅ Yes |
| img2dataset | img2dataset | Various (HuggingFace) | Image dataset collection | ❌ No |
| Scrapy | Scrapy | Open source | Generic scraping framework | ❌ No |

robots.txt compliance is self-reported by operators. "Variable" means documented cases of non-compliance or inconsistent behaviour across deployments.

⚠️ Do not block Googlebot

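Googlebot is Google's standard search indexing crawler. Blocking it removes your site from Google Search entirely. Google uses a separate token, Google-Extended, to control whether your content is used for Gemini and Vertex AI training. Google-Extended is a robots.txt control token only: Google documents that it has no separate HTTP User-Agent string, so the crawl itself arrives under a regular Google user agent. Block Google-Extended in robots.txt; keeping google-extended in your middleware list is harmless, but that match is not expected to fire. Never block Googlebot.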

How bot detection works

Always use case-insensitive substring matching on the User-Agent header. Never use exact equality — user agent strings include version numbers, platform tokens, and additional metadata that change between versions.

JavaScript / TypeScript

const AI_BOTS = ['gptbot', 'claudebot', 'anthropic-ai', 'google-extended',
  'ccbot', 'bytespider', 'applebot-extended', 'perplexitybot',
  'diffbot', 'cohere-ai', 'facebookbot', 'amazonbot', 'omgili',
  'omgilibot', 'iaskspider', 'youbot', 'img2dataset',
  'chatgpt-user', 'oai-searchbot', 'claude-web'];

function isAIBot(userAgent: string): boolean {
  const ua = userAgent.toLowerCase();
  return AI_BOTS.some(bot => ua.includes(bot));
}

Python

AI_BOTS = [
    'gptbot', 'chatgpt-user', 'oai-searchbot',
    'claudebot', 'anthropic-ai', 'claude-web',
    'google-extended', 'ccbot', 'bytespider',
    'applebot-extended', 'perplexitybot', 'diffbot',
    'cohere-ai', 'facebookbot', 'amazonbot',
    'omgili', 'omgilibot', 'iaskspider', 'youbot', 'img2dataset',
]

def is_ai_bot(user_agent: str) -> bool:
    ua = user_agent.lower()
    return any(bot in ua for bot in AI_BOTS)
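
To see why exact equality fails where substring matching succeeds, the sketch below runs a matcher shaped like is_ai_bot above against two sample headers. The sample User-Agent values are illustrative assumptions (in particular the GPTBot version number), and matches_ai_bot and AI_BOT_TOKENS are hypothetical names used only for this demo.

```python
# Illustrative only: the sample User-Agent values are assumptions,
# not strings copied from live traffic.
AI_BOT_TOKENS = ['gptbot', 'claudebot', 'ccbot']  # short subset for the demo


def matches_ai_bot(user_agent: str) -> bool:
    """Case-insensitive substring match, same approach as is_ai_bot above."""
    ua = user_agent.lower()
    return any(token in ua for token in AI_BOT_TOKENS)


gptbot_ua = ("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
             "compatible; GPTBot/1.0; +https://openai.com/gptbot)")
browser_ua = "Mozilla/5.0 (X11; Linux x86_64; rv:124.0) Gecko/20100101 Firefox/124.0"

print(gptbot_ua == "GPTBot")       # exact equality misses the bot
print(matches_ai_bot(gptbot_ua))   # substring match catches it
print(matches_ai_bot(browser_ua))  # an ordinary browser is left alone
```

The token sits in the middle of a longer header, which is why the comparison has to be a substring test on the lowercased value.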

PHP

// Requires PHP 8.0+ for str_contains().
const AI_BOTS = [
    'gptbot', 'chatgpt-user', 'oai-searchbot',
    'claudebot', 'anthropic-ai', 'claude-web',
    'google-extended', 'ccbot', 'bytespider',
    'applebot-extended', 'perplexitybot', 'diffbot',
    'cohere-ai', 'facebookbot', 'amazonbot',
    'omgili', 'omgilibot', 'iaskspider', 'youbot', 'img2dataset',
];

function isAIBot(string $userAgent): bool {
    $ua = strtolower($userAgent);
    // A constant is visible inside the function; a plain global $AI_BOTS would not be.
    foreach (AI_BOTS as $bot) {
        if (str_contains($ua, $bot)) return true;
    }
    return false;
}

Go

import "strings"

var aiBots = []string{
    "gptbot", "chatgpt-user", "oai-searchbot",
    "claudebot", "anthropic-ai", "claude-web",
    "google-extended", "ccbot", "bytespider",
    "applebot-extended", "perplexitybot", "diffbot",
    "cohere-ai", "facebookbot", "amazonbot",
    "omgili", "omgilibot", "iaskspider", "youbot", "img2dataset",
}

func isAIBot(userAgent string) bool {
    ua := strings.ToLower(userAgent)
    for _, bot := range aiBots {
        if strings.Contains(ua, bot) {
            return true
        }
    }
    return false
}

robots.txt directives for all bots

Each AI bot that respects robots.txt has its own User-agent token. List them individually to be explicit. A single User-agent: * disallow would also block Googlebot — don't do that.

# robots.txt — AI training crawler block

User-agent: *
Allow: /

# OpenAI
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: OAI-SearchBot
Disallow: /

# Anthropic
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: Claude-Web
Disallow: /

# Google AI (NOT regular Googlebot)
User-agent: Google-Extended
Disallow: /

# Common Crawl
User-agent: CCBot
Disallow: /

# ByteDance
User-agent: Bytespider
Disallow: /

# Apple AI
User-agent: Applebot-Extended
Disallow: /

# AI search engines
User-agent: PerplexityBot
User-agent: iaskspider
User-agent: YouBot
Disallow: /

# Other AI companies and data aggregators
User-agent: Diffbot
User-agent: cohere-ai
User-agent: FacebookBot
User-agent: Amazonbot
User-agent: omgili
User-agent: omgilibot
Disallow: /
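
Before deploying, it can help to confirm that rules like these block the AI tokens while leaving Googlebot alone. The following is a minimal sketch using Python's standard urllib.robotparser on a shortened copy of the file; is_allowed and ROBOTS_TXT are hypothetical names, and real crawlers may interpret edge cases differently from Python's parser.

```python
from urllib.robotparser import RobotFileParser

# A shortened copy of the rules above; paste your full file in practice.
ROBOTS_TXT = """\
User-agent: *
Allow: /

User-agent: GPTBot
User-agent: ChatGPT-User
Disallow: /

User-agent: Google-Extended
Disallow: /
"""


def is_allowed(robots_txt: str, agent: str, url: str = "/") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


for agent in ("Googlebot", "GPTBot", "ChatGPT-User", "Google-Extended"):
    print(agent, "allowed" if is_allowed(ROBOTS_TXT, agent) else "blocked")
```

With this input, Googlebot falls through to the catch-all Allow rule and the three AI tokens hit their Disallow groups.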

How often does this list change?

Major new AI bot user agents appear 2–4 times per year, typically when a large AI company launches a new product or changes its crawling strategy. The core list has been stable since 2023.

How to check for new bots

  • OpenAI: platform.openai.com/docs/gptbot
  • Anthropic: docs.anthropic.com/claude/docs/claude-crawler
  • Google: developers.google.com/search/docs/crawling-indexing/overview-google-crawlers
  • Check your server access logs monthly for unknown user agents — new bots almost never announce themselves
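
The monthly log check can be partly automated. The sketch below tallies User-Agent values that look automated but match no known token. It assumes the combined log format (User-Agent as the last quoted field); the hint words, the sample lines, and the name ExampleNewBot are hypothetical, chosen only for illustration.

```python
import re
from collections import Counter
from typing import Iterable

KNOWN_TOKENS = ['gptbot', 'claudebot', 'ccbot', 'bytespider']  # subset; use the full list
# Heuristic hints that a User-Agent is automated; an assumption, tune as needed.
BOT_HINTS = ('bot', 'crawl', 'spider', 'scrape')
# Assumes combined log format, where the User-Agent is the last quoted field.
LAST_QUOTED = re.compile(r'"([^"]*)"\s*$')


def unknown_bots(log_lines: Iterable[str]) -> Counter:
    """Count bot-like User-Agents that match none of the known tokens."""
    counts: Counter = Counter()
    for line in log_lines:
        match = LAST_QUOTED.search(line)
        if not match:
            continue  # line does not end with a quoted field
        ua = match.group(1)
        lowered = ua.lower()
        if any(token in lowered for token in KNOWN_TOKENS):
            continue  # already on the block list
        if any(hint in lowered for hint in BOT_HINTS):
            counts[ua] += 1
    return counts


sample = [
    '203.0.113.5 - - [01/Apr/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "ExampleNewBot/0.1"',
    '203.0.113.6 - - [01/Apr/2026:10:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.7 - - [01/Apr/2026:10:00:02 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 Firefox/124.0"',
]
for ua, hits in unknown_bots(sample).most_common():
    print(hits, ua)
```

Legitimate crawlers such as Googlebot will also show up in this report, so treat the output as a review queue, not a block list.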

FAQ

How do I detect AI bots by user agent string?

Use case-insensitive substring matching on the User-Agent request header. Convert the incoming header to lowercase and check if it contains any of the known bot strings (also lowercased). Exact full-string matching is too brittle — user agent strings include version numbers that change between releases.

Do AI bots respect the robots.txt Disallow directive?

Major AI companies — OpenAI (GPTBot), Anthropic (ClaudeBot), Google (Google-Extended), Apple (Applebot-Extended), and Perplexity (PerplexityBot) — officially state that their training crawlers respect robots.txt. CCBot, Bytespider, and Diffbot have mixed compliance records. For reliable blocking, use server-level middleware returning a 403 in addition to robots.txt.
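
One way to return a 403 at the application layer is a small WSGI middleware, sketched here with only the Python standard library. BlockAIBots, hello_app, and the shortened token list are hypothetical names for illustration; a real deployment would use the full list and might block earlier, at the web server or CDN.

```python
from typing import Callable, Iterable

AI_BOT_TOKENS = ['gptbot', 'claudebot', 'ccbot', 'bytespider']  # subset; use the full list


class BlockAIBots:
    """WSGI middleware that answers 403 when the User-Agent matches a known token."""

    def __init__(self, app: Callable):
        self.app = app

    def __call__(self, environ: dict, start_response: Callable) -> Iterable[bytes]:
        # WSGI exposes the User-Agent header as HTTP_USER_AGENT; it may be absent.
        ua = environ.get('HTTP_USER_AGENT', '').lower()
        if any(token in ua for token in AI_BOT_TOKENS):
            body = b'Forbidden'
            start_response('403 Forbidden', [
                ('Content-Type', 'text/plain'),
                ('Content-Length', str(len(body))),
            ])
            return [body]
        return self.app(environ, start_response)


def hello_app(environ, start_response):
    """Hypothetical placeholder for the application being protected."""
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'hello']


application = BlockAIBots(hello_app)
```

Requests with a missing User-Agent header pass through here; whether to block those as well is a separate policy choice.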

Should I block Googlebot to stop Google AI from using my content?

No. Googlebot is the standard search indexing crawler — blocking it removes your site from Google Search entirely. Google uses a separate token, Google-Extended, for Gemini and Vertex AI training. Block Google-Extended. Never block Googlebot.

What is the difference between GPTBot and ChatGPT-User?

GPTBot is OpenAI's training crawler — it scrapes the web to build training datasets. ChatGPT-User is used when a ChatGPT user triggers a browse/search action. OAI-SearchBot is for SearchGPT indexing. Block all three to prevent any OpenAI access.

How often does this list change?

Major new AI bot user agents appear 2-4 times per year. The core list (GPTBot, ClaudeBot, Google-Extended, CCBot, Bytespider) has been stable since 2023. Check official docs from OpenAI, Anthropic, and Google periodically.

What is CCBot and why is it on the AI bot list?

CCBot is the Common Crawl Foundation's crawler. It archives the web into a publicly available dataset. Filtered Common Crawl data is documented in the training sets of GPT-3 and the original LLaMA, and it is widely reported as a source for many other LLMs. Blocking CCBot keeps your content out of future crawls that feed these datasets.
