How is this different from Google Analytics?

Google Analytics shows you traffic. Shadow shows you traffic, AI bot activity, what AI platforms say about your brand, AND tells you what to do about all of it. It's analytics + AI intelligence + action steps in one tool.

Do I need to install anything?

For basic monitoring (bot detection, AI perception, readiness score) — nope, just enter your URL. For full visitor analytics (clicks, behavior, sessions), add one script tag. One-click integrations for Vercel, Shopify, WordPress, and more.

Will it slow down my site?

No. The script is under 5KB and loads async. Zero impact on page speed or Core Web Vitals. External monitoring has literally no impact — it watches from the outside.

What AI bots does Shadow detect?

All of them. GPTBot, ClaudeBot, PerplexityBot, Google-Extended, Bytespider, Amazonbot, and dozens more. The Shadow Network means new bots get identified across all users instantly.

What do you mean by "actionable steps"?

Shadow doesn't just show you graphs. It says things like: "ChatGPT has your pricing wrong — add structured data to /pricing to fix it" or "Your bounce rate on /features is 68% — here's why and what to change." Specific, do-it-today recommendations.

Can Shadow block bots?

Shadow is a telescope, not a shield. It shows you who's visiting and what AI says about you. It generates block rules and robots.txt configs you can apply — but it doesn't intercept traffic.

Is Shadow GDPR compliant?

Yes. Shadow never collects PII. IP addresses are hashed after classification. No cookies are set on your visitors. All Shadow Network data is anonymized. GDPR compliant by design.

AI Bot User Agents List 2026 — Complete Reference

The complete list of AI crawler and AI training bot user agent strings, with detection notes and compliance information. Use this as your canonical reference when building bot-blocking middleware for any platform.

Last updated: April 2026 · 20 bots listed

Quick-copy bot list

Lowercase strings for case-insensitive substring matching. Use this array directly in your middleware. See the full table below for details on each bot.

// AI bot user agents — 2026 reference list
// Use case-insensitive substring matching (see detection section below)

const AI_BOTS = [
  // OpenAI
  'gptbot',           // Training crawler
  'chatgpt-user',     // ChatGPT browsing feature
  'oai-searchbot',    // SearchGPT indexing

  // Anthropic
  'claudebot',        // Training crawler
  'anthropic-ai',     // General Anthropic crawl
  'claude-web',       // Claude web tool

  // Google
  'google-extended',  // Gemini/Vertex AI training (NOT regular Googlebot)

  // Common Crawl (used by most LLMs)
  'ccbot',

  // Other major AI companies
  'bytespider',       // ByteDance / TikTok
  'applebot-extended',// Apple Intelligence training
  'perplexitybot',    // Perplexity AI search
  'diffbot',          // Knowledge graph
  'cohere-ai',        // Cohere
  'facebookbot',      // Meta AI / Llama
  'amazonbot',        // Amazon / Alexa AI

  // Data aggregators
  'omgili',           // Webz.io (also omgilibot)
  'omgilibot',

  // AI search engines
  'iaskspider',       // iAsk.ai
  'youbot',           // You.com

  // Image dataset builders
  'img2dataset',
];

Full bot reference table

Bot	UA string	Company	Purpose	robots.txt
GPTBot	`GPTBot`	OpenAI	AI training crawler	✅ Yes
ChatGPT-User	`ChatGPT-User`	OpenAI	ChatGPT web browsing	✅ Yes
OAI-SearchBot	`OAI-SearchBot`	OpenAI	SearchGPT indexing	✅ Yes
ClaudeBot	`ClaudeBot`	Anthropic	AI training crawler	✅ Yes
anthropic-ai	`anthropic-ai`	Anthropic	General Anthropic crawl	✅ Yes
Claude-Web	`Claude-Web`	Anthropic	Claude web access feature	✅ Yes
Google-Extended	`Google-Extended`	Google	Gemini / Vertex AI training	✅ Yes
CCBot	`CCBot`	Common Crawl	Web archive for AI training datasets	⚠️ Variable
Bytespider	`Bytespider`	ByteDance	AI training / TikTok	⚠️ Variable
Applebot-Extended	`Applebot-Extended`	Apple	Apple Intelligence training	✅ Yes
PerplexityBot	`PerplexityBot`	Perplexity AI	AI search indexing	✅ Yes
Diffbot	`Diffbot`	Diffbot	Knowledge graph / AI data	⚠️ Variable
cohere-ai	`cohere-ai`	Cohere	AI training	✅ Yes
FacebookBot	`FacebookBot`	Meta	Meta AI training	✅ Yes
Amazonbot	`Amazonbot`	Amazon	Alexa / Amazon AI	✅ Yes
omgili	`omgili`	Webz.io	Data aggregation for AI	⚠️ Variable
iaskspider	`iaskspider`	iAsk.ai	AI search indexing	✅ Yes
YouBot	`YouBot`	You.com	AI search indexing	✅ Yes
img2dataset	`img2dataset`	Various (HuggingFace)	Image dataset collection	❌ No
Scrapy	`Scrapy`	Open source	Generic scraping framework	❌ No

robots.txt compliance is self-reported by operators. "Variable" means documented cases of non-compliance or inconsistent behaviour across deployments.

⚠️ Do not block Googlebot

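Googlebot is Google's standard search indexing crawler. Blocking it removes your site from Google Search entirely. Google uses a separate robots.txt token, Google-Extended, to control whether your content is used for Gemini and Vertex AI training. Block Google-Extended in robots.txt. Google documents it as a control token rather than a separate HTTP user agent, so a middleware match on google-extended is harmless but is not expected to fire. Never block Googlebot.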

How bot detection works

Always use case-insensitive substring matching on the User-Agent header. Never use exact equality — user agent strings include version numbers, platform tokens, and additional metadata that change between versions.

JavaScript / TypeScript

const AI_BOTS = ['gptbot', 'claudebot', 'anthropic-ai', 'google-extended',
  'ccbot', 'bytespider', 'applebot-extended', 'perplexitybot',
  'diffbot', 'cohere-ai', 'facebookbot', 'amazonbot', 'omgili',
  'omgilibot', 'iaskspider', 'youbot', 'img2dataset',
  'chatgpt-user', 'oai-searchbot', 'claude-web'];

function isAIBot(userAgent: string): boolean {
  const ua = userAgent.toLowerCase();
  return AI_BOTS.some(bot => ua.includes(bot));
}

Python

AI_BOTS = [
    'gptbot', 'chatgpt-user', 'oai-searchbot',
    'claudebot', 'anthropic-ai', 'claude-web',
    'google-extended', 'ccbot', 'bytespider',
    'applebot-extended', 'perplexitybot', 'diffbot',
    'cohere-ai', 'facebookbot', 'amazonbot',
    'omgili', 'omgilibot', 'iaskspider', 'youbot', 'img2dataset',
]

def is_ai_bot(user_agent: str) -> bool:
    ua = user_agent.lower()
    return any(bot in ua for bot in AI_BOTS)
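
As a quick check of the substring approach, the sketch below runs the matcher against two sample headers. The user agent strings are illustrative examples in the general shape crawlers send (version numbers vary in practice), and the bot list is shortened for brevity.

```python
# Shortened list for illustration; use the full list in practice.
AI_BOTS = ['gptbot', 'claudebot', 'ccbot', 'perplexitybot']


def is_ai_bot(user_agent: str) -> bool:
    """Return True if any known bot token appears in the header (case-insensitive)."""
    ua = user_agent.lower()
    return any(bot in ua for bot in AI_BOTS)


# Illustrative headers (assumed format, not an authoritative record).
gptbot_ua = ("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
             "compatible; GPTBot/1.2; +https://openai.com/gptbot)")
googlebot_ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

print(is_ai_bot(gptbot_ua))     # True: "gptbot" is a substring of the lowercased header
print(is_ai_bot(googlebot_ua))  # False: no AI bot token matches regular Googlebot
print(gptbot_ua == "GPTBot")    # False: exact equality misses the full header
```

The last line shows why exact equality is too brittle: the token is only one part of a longer header.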

PHP

// Declared as a constant so it is visible inside the function below.
// (A plain $AI_BOTS variable would be out of scope inside isAIBot.)
const AI_BOTS = [
    'gptbot', 'chatgpt-user', 'oai-searchbot',
    'claudebot', 'anthropic-ai', 'claude-web',
    'google-extended', 'ccbot', 'bytespider',
    'applebot-extended', 'perplexitybot', 'diffbot',
    'cohere-ai', 'facebookbot', 'amazonbot',
    'omgili', 'omgilibot', 'iaskspider', 'youbot', 'img2dataset',
];

// Requires PHP 8.0+ for str_contains().
function isAIBot(string $userAgent): bool {
    $ua = strtolower($userAgent);
    foreach (AI_BOTS as $bot) {
        if (str_contains($ua, $bot)) return true;
    }
    return false;
}

Go

import "strings"

var aiBots = []string{
    "gptbot", "chatgpt-user", "oai-searchbot",
    "claudebot", "anthropic-ai", "claude-web",
    "google-extended", "ccbot", "bytespider",
    "applebot-extended", "perplexitybot", "diffbot",
    "cohere-ai", "facebookbot", "amazonbot",
    "omgili", "omgilibot", "iaskspider", "youbot", "img2dataset",
}

func isAIBot(userAgent string) bool {
    ua := strings.ToLower(userAgent)
    for _, bot := range aiBots {
        if strings.Contains(ua, bot) {
            return true
        }
    }
    return false
}

robots.txt directives for all bots

Each AI bot that respects robots.txt has its own User-agent token. List them individually to be explicit. A single User-agent: * disallow would also block Googlebot — don't do that.

# robots.txt — AI training crawler block

User-agent: *
Allow: /

# OpenAI
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: OAI-SearchBot
Disallow: /

# Anthropic
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: Claude-Web
Disallow: /

# Google AI (NOT regular Googlebot)
User-agent: Google-Extended
Disallow: /

# Common Crawl
User-agent: CCBot
Disallow: /

# ByteDance
User-agent: Bytespider
Disallow: /

# Apple AI
User-agent: Applebot-Extended
Disallow: /

# AI search engines
User-agent: PerplexityBot
User-agent: iaskspider
User-agent: YouBot
Disallow: /

# Other AI companies and data aggregators
User-agent: Diffbot
User-agent: cohere-ai
User-agent: FacebookBot
User-agent: Amazonbot
User-agent: omgili
User-agent: omgilibot
Disallow: /
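
If you keep the bot list in code, one option is to generate the file from it so robots.txt and your middleware stay in sync. The following is a minimal sketch under that assumption; `build_robots_txt` and the `groups` mapping are hypothetical names, not part of any library.

```python
def build_robots_txt(groups: dict[str, list[str]]) -> str:
    """Render one Disallow group per label; every other crawler stays allowed."""
    lines = ["User-agent: *", "Allow: /", ""]
    for label, tokens in groups.items():
        lines.append(f"# {label}")
        # Each token gets its own User-agent line, as in the file above.
        lines.extend(f"User-agent: {token}" for token in tokens)
        lines.append("Disallow: /")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical grouping, taken from the directives above.
groups = {
    "OpenAI": ["GPTBot", "ChatGPT-User", "OAI-SearchBot"],
    "Google AI (NOT regular Googlebot)": ["Google-Extended"],
}
print(build_robots_txt(groups))
```

Because Googlebot never appears in `groups`, it falls under the `User-agent: *` group and remains allowed.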

How often does this list change?

Major new AI bot user agents appear 2–4 times per year, typically when a large AI company launches a new product or changes their crawling strategy. The core list has been stable since 2023.

How to check for new bots

→ OpenAI: platform.openai.com/docs/gptbot
→ Anthropic: docs.anthropic.com/claude/docs/claude-crawler
→ Google: developers.google.com/search/docs/crawling-indexing/overview-google-crawlers
→ Check your server access logs monthly for unknown user agents, since new bots almost never announce themselves
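
The monthly log check can be partly automated. The sketch below assumes access logs in the common "combined" format, where the User-Agent is the last double-quoted field on each line. The keyword heuristic in `BOT_HINTS` and the sample log lines are assumptions for illustration, and the known-bot list is shortened.

```python
import re
from collections import Counter

KNOWN_BOTS = ['gptbot', 'claudebot', 'ccbot', 'bytespider']  # shortened; use the full list
BOT_HINTS = ('bot', 'crawler', 'spider', 'scraper')  # assumed heuristic keywords

# In the combined log format the User-Agent is the last double-quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')


def unknown_bot_agents(log_lines):
    """Count user agents that look like bots but match no known token."""
    counts = Counter()
    for line in log_lines:
        match = UA_PATTERN.search(line)
        if not match:
            continue  # line is not in the expected format
        ua = match.group(1)
        lowered = ua.lower()
        looks_like_bot = any(hint in lowered for hint in BOT_HINTS)
        is_known = any(bot in lowered for bot in KNOWN_BOTS)
        if looks_like_bot and not is_known:
            counts[ua] += 1
    return counts


# Made-up sample lines for illustration.
sample_log = [
    '203.0.113.5 - - [01/Apr/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.2)"',
    '203.0.113.9 - - [01/Apr/2026:10:00:05 +0000] "GET /pricing HTTP/1.1" 200 734 "-" "ExampleNewBot/0.1"',
    '198.51.100.7 - - [01/Apr/2026:10:00:09 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]
for ua, hits in unknown_bot_agents(sample_log).most_common():
    print(hits, ua)
```

Regular search crawlers such as Googlebot will also show up in this report, because they contain "bot" and are not on the AI list. Treat the output as a list to review by hand, not as a block list.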

Block these bots on your platform

Now that you have the bot list, implement the block in your specific framework or platform:

Vercel Netlify Next.js Nuxt SvelteKit Astro WordPress Django Laravel Express.js FastAPI Ruby on Rails

Don't see your platform? Browse all guides →

FAQ

How do I detect AI bots by user agent string?

Use case-insensitive substring matching on the User-Agent request header. Convert the incoming header to lowercase and check if it contains any of the known bot strings (also lowercased). Exact full-string matching is too brittle — user agent strings include version numbers that change between releases.

Do AI bots respect the robots.txt Disallow directive?

Major AI companies — OpenAI (GPTBot), Anthropic (ClaudeBot), Google (Google-Extended), Apple (Applebot-Extended), and Perplexity (PerplexityBot) — officially state that their training crawlers respect robots.txt. CCBot, Bytespider, and Diffbot have mixed compliance records. For reliable blocking, use server-level middleware returning a 403 in addition to robots.txt.
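
A server-level block of this kind can be sketched as a small WSGI middleware. This is a minimal illustration, not a hardened implementation: `BlockAIBots` and `hello_app` are hypothetical names, and the bot list is shortened.

```python
AI_BOTS = ['gptbot', 'claudebot', 'ccbot', 'bytespider']  # shortened; use the full list


class BlockAIBots:
    """WSGI middleware that answers 403 when the User-Agent matches a known AI bot."""

    def __init__(self, app, bots=AI_BOTS):
        self.app = app
        self.bots = [bot.lower() for bot in bots]

    def __call__(self, environ, start_response):
        # WSGI exposes the User-Agent header as HTTP_USER_AGENT; it may be absent.
        ua = environ.get('HTTP_USER_AGENT', '').lower()
        if any(bot in ua for bot in self.bots):
            body = b'Forbidden'
            start_response('403 Forbidden', [
                ('Content-Type', 'text/plain'),
                ('Content-Length', str(len(body))),
            ])
            return [body]
        # Not a listed bot: hand the request to the wrapped application.
        return self.app(environ, start_response)


def hello_app(environ, start_response):
    """Placeholder application standing in for your real site."""
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'Hello']


app = BlockAIBots(hello_app)
```

For a local try-out, `app` can be served with `wsgiref.simple_server.make_server` from the standard library. A user agent header is trivially spoofable, so this stops only crawlers that identify themselves honestly.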

Should I block Googlebot to stop Google AI from using my content?

No. Googlebot is the standard search indexing crawler, and blocking it removes your site from Google Search entirely. Google uses a separate robots.txt token, Google-Extended, to control whether your content is used for Gemini and Vertex AI training. Disallow Google-Extended in robots.txt. Never block Googlebot.

What is the difference between GPTBot and ChatGPT-User?

GPTBot is OpenAI's training crawler — it scrapes the web to build training datasets. ChatGPT-User is used when a ChatGPT user triggers a browse/search action. OAI-SearchBot is for SearchGPT indexing. Block all three to prevent any OpenAI access.

How often does this list change?

Major new AI bot user agents appear 2-4 times per year. The core list (GPTBot, ClaudeBot, Google-Extended, CCBot, Bytespider) has been stable since 2023. Check official docs from OpenAI, Anthropic, and Google periodically.

What is CCBot and why is it on the AI bot list?

CCBot is the Common Crawl Foundation's crawler. It archives the web into a publicly available dataset. Common Crawl data was used to train GPT-3 and Llama, and it is a common source for other LLM training sets. Blocking CCBot keeps your content out of future Common Crawl snapshots. It does not remove pages that were already archived.

Is your site protected from AI bots?

Run a free scan to check your robots.txt, meta tags, and overall AI readiness score.