
Monitor AI Bot Traffic
Know What's Crawling You

You can't protect what you can't see. Before blocking AI bots, you need to know which ones are hitting your site, how often, and what pages they're reading. Here's how to find out.

The visibility problem

Most site owners have zero visibility into AI bot traffic. Google Analytics doesn't show it. Your CMS dashboard doesn't show it. The bots crawl silently, extract your content, and feed it to AI models — and you never know it happened.

Why? AI crawlers don't execute JavaScript. They fetch raw HTML via HTTP requests, which means client-side analytics (GA4, Plausible, Fathom) never fire. The only place AI bot traffic is visible is in server access logs and CDN-level analytics.

What sees AI bot traffic?
Server access logs (nginx, Apache, Caddy) — raw HTTP requests with full user agent strings
CDN analytics (Cloudflare, Fastly, Akamai) — request-level data with bot classification
Edge middleware logs (Vercel, Netlify) — request headers visible at the edge
Open Shadow scanner — detects AI bot exposure from your robots.txt + meta tags configuration

What doesn't?
Google Analytics (GA4) — JavaScript-based, never triggered by crawlers
Plausible / Fathom / Matomo — also JavaScript-based (unless using server-side tracking)
WordPress Stats / Jetpack — client-side pixel, invisible to non-JS clients
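You can observe this asymmetry directly: fetch a page the way an AI crawler does — a single HTTP request with a bot user agent and no JavaScript — and the hit lands in your access log but never in GA4. A sketch, where example.com and the log path are placeholders for your own site and server:

```shell
# Simulate an AI crawler hit: plain HTTP, bot user agent, no JS.
# (example.com and the log path below are placeholders.)
curl -s -o /dev/null \
  -A "Mozilla/5.0; compatible; GPTBot/1.1; +https://openai.com/gptbot" \
  -w "status: %{http_code}, bytes: %{size_download}\n" \
  https://example.com/

# The request now appears in the server log with the token intact,
# even though no analytics script ever fired:
tail -1 /var/log/nginx/access.log | grep -o "GPTBot"
```

Run it against your own domain, then check whatever client-side analytics you use — the visit won't be there.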

AI bot user agent reference (2026)

These are the user agent tokens you're looking for in your logs. Each line represents a distinct AI company and purpose.

Major AI bot user agent tokens — grep these in your logs
# OpenAI (3 bots, 3 purposes)
GPTBot          # Training data for GPT models
ChatGPT-User    # Real-time browsing when users ask ChatGPT to read a URL
OAI-SearchBot   # Indexing for ChatGPT Search results

# Anthropic (2 tokens)
ClaudeBot       # Training data for Claude models
anthropic-ai    # Alternative Anthropic identifier

# Google (3 AI-specific bots)
Google-Extended         # Gemini/Bard AI training
Gemini-Deep-Research    # Gemini Advanced deep research feature
Google-NotebookLM       # NotebookLM URL ingestion

# Meta
meta-externalagent  # Llama training (NOT the link preview bot)

# Microsoft
bingbot         # Powers both Bing Search AND Copilot answers

# ByteDance
Bytespider      # TikTok / Doubao AI training

# Amazon (3 bots)
Amazonbot           # AI training for Alexa/Nova
Amzn-SearchBot      # Rufus AI shopping assistant
Amzn-User           # Live query browsing

# AI Search Engines
PerplexityBot       # Perplexity AI search indexing
YouBot              # You.com AI assistant
DuckAssistBot       # DuckDuckGo AI answers

# European AI
MistralBot      # Mistral AI (Le Chat, Mixtral)
DeepSeekBot     # DeepSeek (V3, R1)

# xAI
xAI-Bot         # Grok training (Elon Musk's xAI)

# Data Brokers (feed multiple AI companies)
CCBot           # Common Crawl → GPT, Gemini, Llama, Mistral, etc.
Diffbot         # Commercial data broker → Llama, Mistral, DiffbotLLM
omgili          # Webz.io data broker
omgilibot       # Webz.io (alternative token)
webzio-extended # Webz.io AI training crawler

# Apple
Applebot-Extended   # Apple Intelligence training

# Academic
AI2Bot          # Allen Institute (Semantic Scholar)
Ai2Bot-Dolma    # Allen Institute (OLMo training dataset)

# Enterprise
cohere-ai       # Cohere (Command R, Embed)
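Before building any tooling, a ten-second sanity check: grep your current access log for a handful of these tokens and see what comes back. A sketch — the log path is an assumption, adjust it for your server:

```shell
# Count hits per token in the most recent 10,000 requests.
# (Log path is an assumption — adjust for your setup.)
tail -n 10000 /var/log/nginx/access.log \
  | grep -oiE 'GPTBot|ClaudeBot|Bytespider|PerplexityBot|CCBot|Amazonbot' \
  | sort | uniq -c | sort -rn
```

If this prints anything at all, the full report script below is worth setting up.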

Method 1: Server log analysis

If you have access to your server's access logs (nginx, Apache, or any reverse proxy), you can grep for AI bot user agents directly. This is the most accurate method.

bash — Find all AI bot requests in nginx logs
#!/bin/bash
# ai-bot-report.sh — Scan nginx access logs for AI bot traffic

LOG="/var/log/nginx/access.log"

# All known AI bot tokens (case-insensitive grep)
AI_BOTS="GPTBot|ChatGPT-User|OAI-SearchBot|ClaudeBot|anthropic-ai|Google-Extended|Gemini-Deep-Research|Google-NotebookLM|Bytespider|PerplexityBot|meta-externalagent|Amazonbot|Amzn-SearchBot|Amzn-User|CCBot|Diffbot|cohere-ai|DeepSeekBot|xAI-Bot|MistralBot|YouBot|DuckAssistBot|AI2Bot|Ai2Bot-Dolma|Applebot-Extended|omgili|omgilibot|webzio-extended"

echo "=== AI Bot Traffic Report ==="
echo "Log: $LOG"
echo "Date: $(date)"
echo ""

# Total AI bot requests
TOTAL=$(grep -ciE "$AI_BOTS" "$LOG")
ALL=$(wc -l < "$LOG")
echo "AI bot requests: $TOTAL / $ALL total ($(echo "scale=1; $TOTAL * 100 / $ALL" | bc)%)"
echo ""

# Breakdown by bot
echo "=== Requests per bot ==="
for bot in GPTBot ChatGPT-User OAI-SearchBot ClaudeBot Bytespider PerplexityBot Google-Extended meta-externalagent Amazonbot CCBot Diffbot DeepSeekBot xAI-Bot MistralBot cohere-ai; do
  COUNT=$(grep -ci "$bot" "$LOG")
  if [ "$COUNT" -gt 0 ]; then
    echo "  $bot: $COUNT"
  fi
done
echo ""

# Top 20 pages hit by AI bots
echo "=== Top 20 pages targeted by AI bots ==="
grep -iE "$AI_BOTS" "$LOG" \
  | awk '{print $7}' \
  | sort | uniq -c | sort -rn \
  | head -20
echo ""

# Hourly distribution (when do they crawl?)
echo "=== Hourly distribution ==="
grep -iE "$AI_BOTS" "$LOG" \
  | awk '{print substr($4,14,2)":00"}' \
  | sort | uniq -c | sort -rn \
  | head -10
bash — Daily AI bot report via cron
# Add to crontab (crontab -e):
# Run daily at midnight, write the report to a file
0 0 * * * /opt/scripts/ai-bot-report.sh > /tmp/ai-bot-daily.txt 2>&1

# Or append to a rolling log:
0 0 * * * /opt/scripts/ai-bot-report.sh >> /var/log/ai-bot-reports.log 2>&1

# For Apache logs, change the LOG path:
# LOG="/var/log/apache2/access.log"
# Format is slightly different but grep works the same

Method 2: Dedicated AI bot log file

Instead of grepping through massive access logs, configure nginx to write AI bot requests to a separate log file. This makes monitoring trivial and keeps your main logs clean.

nginx.conf — Separate AI bot log
http {
  # Map AI bot user agents to a flag
  map $http_user_agent $is_ai_bot {
    default 0;
    "~*GPTBot"              1;
    "~*ChatGPT-User"        1;
    "~*OAI-SearchBot"       1;
    "~*ClaudeBot"           1;
    "~*anthropic-ai"        1;
    "~*Google-Extended"     1;
    "~*Bytespider"          1;
    "~*PerplexityBot"       1;
    "~*meta-externalagent"  1;
    "~*Amazonbot"           1;
    "~*CCBot"               1;
    "~*Diffbot"             1;
    "~*DeepSeekBot"         1;
    "~*xAI-Bot"             1;
    "~*MistralBot"          1;
    "~*cohere-ai"           1;
    "~*YouBot"              1;
    "~*DuckAssistBot"       1;
    "~*Applebot-Extended"   1;
    "~*AI2Bot"              1;
    "~*Ai2Bot-Dolma"        1;
  }

  # JSON log format for easy parsing
  log_format ai_bot_json escape=json
    '{'
      '"time":"$time_iso8601",'
      '"ip":"$remote_addr",'
      '"method":"$request_method",'
      '"path":"$uri",'
      '"status":$status,'
      '"ua":"$http_user_agent",'
      '"bytes":$body_bytes_sent,'
      '"referer":"$http_referer"'
    '}';

  server {
    # Write AI bot requests to a separate file
    access_log /var/log/nginx/ai-bots.log ai_bot_json if=$is_ai_bot;

    # Normal access log (all traffic)
    access_log /var/log/nginx/access.log;
  }
}
💡 Why JSON format? JSON logs can be piped directly into jq for analysis, ingested by Elasticsearch/Loki, or processed by any monitoring tool. The structured format makes it trivial to build dashboards and alerts.
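One operational note: a dedicated bot log still grows without bound. A logrotate sketch keeps it in check — the path assumes the Method 2 setup above, and the postrotate signal follows the standard nginx reopen-logs pattern:

```
# /etc/logrotate.d/ai-bots — rotation sketch for the dedicated bot log
/var/log/nginx/ai-bots.log {
    daily
    rotate 30
    compress
    delaycompress
    missingok
    notifempty
    postrotate
        # Ask nginx to reopen its log files after rotation
        [ -f /var/run/nginx.pid ] && kill -USR1 $(cat /var/run/nginx.pid)
    endscript
}
```

With daily rotation in place, "the current log file" also becomes a reasonable proxy for "the last 24 hours" in the jq queries shown later.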

Method 3: Next.js middleware logging

On Vercel or any serverless platform where you don't have access to raw nginx logs, use Next.js Edge Middleware to detect and log AI bot requests.

middleware.ts — AI bot detection and logging
import { NextRequest, NextResponse } from 'next/server';

const AI_BOT_PATTERNS = [
  /GPTBot/i, /ChatGPT-User/i, /OAI-SearchBot/i,
  /ClaudeBot/i, /anthropic-ai/i,
  /Google-Extended/i, /Gemini-Deep-Research/i,
  /Bytespider/i, /PerplexityBot/i,
  /meta-externalagent/i, /Amazonbot/i,
  /CCBot/i, /Diffbot/i, /cohere-ai/i,
  /DeepSeekBot/i, /xAI-Bot/i, /MistralBot/i,
  /YouBot/i, /DuckAssistBot/i,
  /Applebot-Extended/i, /AI2Bot/i,
];

function detectAIBot(ua: string): string | null {
  for (const pattern of AI_BOT_PATTERNS) {
    const match = ua.match(pattern);
    if (match) return match[0];
  }
  return null;
}

export function middleware(req: NextRequest) {
  const ua = req.headers.get('user-agent') || '';
  const bot = detectAIBot(ua);

  if (bot) {
    // Log to your preferred backend
    // Option 1: Console log (appears in Vercel Function Logs)
    console.log(JSON.stringify({
      type: 'ai_bot',
      bot,
      path: req.nextUrl.pathname,
      ip: req.headers.get('x-forwarded-for'),
      timestamp: new Date().toISOString(),
    }));

    // Option 2: Send to analytics endpoint (non-blocking)
    // fetch('https://your-analytics.com/api/bot-hit', {
    //   method: 'POST',
    //   body: JSON.stringify({ bot, path: req.nextUrl.pathname }),
    // }).catch(() => {}); // Fire and forget

    // Option 3: Add custom header for downstream processing
    const res = NextResponse.next();
    res.headers.set('x-ai-bot', bot);
    return res;
  }

  return NextResponse.next();
}

export const config = {
  matcher: ['/((?!_next|favicon.ico|robots.txt|sitemap.xml).*)'],
};
⚠️ Vercel log retention: Vercel free tier retains function logs for 1 hour. Pro plan extends to 3 days. For persistent AI bot analytics, pipe logs to an external service (Axiom, Datadog, Logtail) or use the webhook approach to store hits in a database.
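While logs are still within the retention window, the console.log entries emitted by the middleware are greppable from the command line. A sketch, assuming the Vercel CLI is installed and using my-app.vercel.app as a placeholder for your deployment URL:

```shell
# Pull recent function logs and tally the ai_bot entries emitted by
# the middleware. (CLI availability and the URL are assumptions.)
vercel logs my-app.vercel.app \
  | grep '"type":"ai_bot"' \
  | grep -oE '"bot":"[^"]*"' \
  | sort | uniq -c | sort -rn
```

The second grep extracts just the bot name field from each JSON log line, so the output is a per-bot hit count.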

Method 4: Cloudflare bot analytics

If your site is behind Cloudflare (even the free plan), you already have AI bot visibility built in — you just might not know where to find it.

Free plan: Security → Bots

Cloudflare's free plan shows automated vs human traffic split. While it doesn't break down individual AI bots, it gives you baseline visibility.

Dashboard → Security → Bots → see "Automated" traffic percentage
Enable "AI Labyrinth" to track (and misdirect) AI agents

Pro+ plan: Bot Analytics

Full bot analytics with individual bot identification, request volumes, and time series data.

Dashboard → Security → Bot Analytics → filter by "Automated"
See per-bot request counts, top targeted URLs, and geographic distribution

Any plan: Firewall Events

Create a WAF rule that logs (not blocks) AI bot traffic, giving you a dedicated event stream.

Cloudflare WAF rule — Log (not block) AI bot traffic
# Cloudflare Dashboard → Security → WAF → Custom Rules
# Rule name: "Log AI Bot Traffic"

# Expression (matches any known AI bot user agent):
(http.user_agent contains "GPTBot")
or (http.user_agent contains "ClaudeBot")
or (http.user_agent contains "Bytespider")
or (http.user_agent contains "PerplexityBot")
or (http.user_agent contains "Google-Extended")
or (http.user_agent contains "meta-externalagent")
or (http.user_agent contains "CCBot")
or (http.user_agent contains "Diffbot")
or (http.user_agent contains "DeepSeekBot")
or (http.user_agent contains "xAI-Bot")
or (http.user_agent contains "Amazonbot")

# Action: Log
# (This creates entries in Security Events without
#  blocking any traffic — pure monitoring)

Method 5: Real-time monitoring with alerts

For high-value content, you want to know immediately when a new AI bot starts crawling. Here's a lightweight monitoring setup using the JSON log file from Method 2.

bash — Real-time AI bot monitor with alerts
#!/bin/bash
# ai-bot-monitor.sh — Watch AI bot log in real-time
# Usage: ./ai-bot-monitor.sh

LOG="/var/log/nginx/ai-bots.log"
WEBHOOK="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"

echo "🔍 Monitoring AI bot traffic..."
echo "   Log: $LOG"
echo "   Press Ctrl+C to stop"
echo ""

tail -f "$LOG" | while read -r line; do
  BOT=$(echo "$line" | jq -r '.ua' | grep -oiE 'GPTBot|ChatGPT-User|OAI-SearchBot|ClaudeBot|anthropic-ai|Google-Extended|Bytespider|PerplexityBot|meta-externalagent|Amazonbot|CCBot|Diffbot|DeepSeekBot|xAI-Bot|MistralBot|cohere-ai|YouBot|DuckAssistBot|Applebot-Extended|AI2Bot' | head -n1)
  BOT=${BOT:-unknown}   # fall back if this list lags behind the nginx map
  PATH_HIT=$(echo "$line" | jq -r '.path')
  IP=$(echo "$line" | jq -r '.ip')
  TIME=$(echo "$line" | jq -r '.time')

  # Print to terminal
  echo "[$TIME] $BOT → $PATH_HIT (from $IP)"

  # Alert on new/unusual bots (optional)
  # Uncomment to send Slack alerts:
  # curl -s -X POST "$WEBHOOK" \
  #   -H 'Content-type: application/json' \
  #   -d "{\"text\":\"🤖 AI Bot: $BOT hit $PATH_HIT\"}" \
  #   > /dev/null
done
jq — Quick analytics from the JSON log
# Count requests per bot (whole log — with daily rotation, roughly the last day)
cat /var/log/nginx/ai-bots.log | jq -r '.ua' \
  | grep -oiE 'GPTBot|ClaudeBot|Bytespider|PerplexityBot|Google-Extended|CCBot' \
  | sort | uniq -c | sort -rn

# Top 10 pages hit by AI bots
cat /var/log/nginx/ai-bots.log | jq -r '.path' \
  | sort | uniq -c | sort -rn | head -10

# Unique IPs per bot
cat /var/log/nginx/ai-bots.log \
  | jq -r '[.ua, .ip] | @tsv' \
  | grep -i "GPTBot" | cut -f2 | sort -u | wc -l

# Bandwidth consumed by AI bots
# (the trailing "-" makes paste read stdin on BSD/macOS as well as GNU)
cat /var/log/nginx/ai-bots.log \
  | jq -r '.bytes' | paste -sd+ - | bc \
  | awk '{printf "%.2f MB\n", $1/1048576}'

What to do with the data

Once you know which bots are crawling and what pages they target, you can make informed decisions instead of blindly blocking everything.

Allow AI search bots, block training crawlers
Keep OAI-SearchBot and PerplexityBot (they drive traffic back to you) while blocking GPTBot, ClaudeBot, and Google-Extended (pure extraction, no attribution).
robots.txt guide →
Block aggressive crawlers first
If Bytespider is consuming 30% of your bandwidth while GPTBot is 2%, prioritise blocking Bytespider. Log data tells you where the real load is.
Bytespider blocking guide →
Protect high-value content selectively
If bots are targeting your /research or /reports pages but ignoring /blog, use path-specific blocks in robots.txt or rate limit those sections specifically.
Per-page meta tag guide →
Monitor for new, unknown crawlers
AI companies launch new bots regularly. Set up alerts for any non-standard user agent making repeated requests. If you see a new pattern, check our bot directory.
AI bot directory →
Track changes after blocking
After adding robots.txt rules, monitor your AI bot log to verify the blocks are working. Not all bots respect robots.txt — Bytespider and some others may continue despite Disallow rules.
Advanced agent blocking →
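To make that last check concrete, here's one way to compare a bot's activity before and after a rule change. This is a sketch: it assumes the log is in chronological order, uses the default nginx date format, and the change date and log path are placeholders you'd substitute:

```shell
# Count GPTBot hits before vs. on/after the day you changed robots.txt.
# Assumes chronological log order; date and path are placeholders.
CHANGE="10/Jan/2026"
LOG="/var/log/nginx/access.log"

BEFORE=$(awk -v d="$CHANGE" 'index($0, d) {found=1} !found && /GPTBot/' "$LOG" | wc -l)
AFTER=$(awk -v d="$CHANGE" 'index($0, d) {found=1} found && /GPTBot/' "$LOG" | wc -l)
echo "GPTBot hits before: $BEFORE, on/after: $AFTER"
```

If the on/after count stays high days later, the bot is ignoring your Disallow rules and needs a server-level or WAF block instead.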

Frequently asked questions

How do I know which AI bots are crawling my site?
AI bots identify themselves through user agent strings in HTTP requests. Check your server access logs (nginx, Apache, or CDN logs) and grep for known AI bot tokens like GPTBot, ClaudeBot, Bytespider, PerplexityBot, Amazonbot, and others. Most hosting platforms (Cloudflare, Vercel, Netlify) also expose bot traffic in their analytics dashboards. Open Shadow's free scanner can also detect which AI bots have recently accessed your site.
What percentage of my traffic is AI bots?
Studies from Cloudflare (2025) found that AI bot traffic accounts for 5-15% of total web requests on average, but this varies dramatically by site type. News and media sites see 20-40%+ AI bot traffic. Technical documentation and API docs can see 30%+ from AI crawlers. Small personal blogs may see less than 2%. The ratio has been growing roughly 50% year-over-year since 2023.
Do AI bots show up in Google Analytics?
No — most AI crawlers do not execute JavaScript, so they never trigger Google Analytics (GA4) or similar client-side analytics scripts. This is why server-side log analysis is essential for detecting AI bot traffic. Some exceptions: ChatGPT-User and AI agents that use headless browsers do execute JavaScript, but they're typically filtered by analytics platforms as bot traffic. For accurate AI bot monitoring, you need server access logs or a CDN-level analytics tool.
Which AI bot user agent strings should I look for?
The most common AI bot user agents as of 2026 are: GPTBot (OpenAI training), ChatGPT-User (ChatGPT browsing), OAI-SearchBot (ChatGPT Search), ClaudeBot (Anthropic), anthropic-ai (Anthropic), Bytespider (ByteDance/TikTok), PerplexityBot (Perplexity AI), Google-Extended (Gemini training), Amazonbot (Amazon/Alexa), meta-externalagent (Meta/Llama), CCBot (Common Crawl), Diffbot, cohere-ai, DeepSeekBot, xAI-Bot (Grok), MistralBot, YouBot, AI2Bot, and DuckAssistBot. A comprehensive list includes 50+ distinct AI bot tokens.
How often do AI bots crawl a typical site?
Crawl frequency varies by bot and site authority. GPTBot typically crawls medium-traffic sites daily, with high-authority sites seeing multiple visits per hour. ClaudeBot tends to be less aggressive — weekly for smaller sites. Bytespider is notoriously aggressive, sometimes making thousands of requests per day. PerplexityBot crawls based on search demand — pages that appear in Perplexity answers get re-crawled frequently. Common Crawl (CCBot) runs in batch cycles, typically monthly.
Can I see what pages AI bots are reading most?
Yes — server logs contain the exact URL path for every request. By filtering logs to AI bot user agents and sorting by URL frequency, you can see exactly which pages each AI bot reads most. Common patterns: GPTBot focuses on text-heavy content pages, Bytespider hits everything including assets, PerplexityBot targets pages that match active search queries, and Google-Extended mirrors your sitemap.xml priority. This page-level data helps you make informed decisions about which bots to allow on which content.


See your AI bot exposure — instantly

Open Shadow scans your robots.txt, meta tags, and headers to show exactly which AI bots can access your content — and which are blocked.

Scan My Site — Free →