Monitor AI Bot Traffic
Know What's Crawling You
You can't protect what you can't see. Before blocking AI bots, you need to know which ones are hitting your site, how often, and what pages they're reading. Here's how to find out.
The visibility problem
Most site owners have zero visibility into AI bot traffic. Google Analytics doesn't show it. Your CMS dashboard doesn't show it. The bots crawl silently, extract your content, and feed it to AI models — and you never know it happened.
Why? AI crawlers don't execute JavaScript. They fetch raw HTML via HTTP requests, which means client-side analytics (GA4, Plausible, Fathom) never fire. The only place AI bot traffic is visible is in server access logs and CDN-level analytics.
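A quick sanity check before any tooling: grep your raw access log for a known bot token. The snippet below fabricates one sample log line so it runs anywhere; in practice you would point grep at your real /var/log/nginx/access.log (the sample line and filename here are illustrative):

```shell
# Self-contained demo: the sample line stands in for a real access log.
# Real-world usage: grep -ciE 'GPTBot|ClaudeBot|Bytespider' /var/log/nginx/access.log
printf '%s\n' \
  '203.0.113.7 - - [10/Jan/2026:04:12:01 +0000] "GET /blog/post HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"' \
  > sample-access.log
grep -ciE 'GPTBot|ClaudeBot|Bytespider' sample-access.log   # prints 1
```

If the count is greater than zero, AI bots are already reading your site, whatever your analytics dashboard says.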
AI bot user agent reference (2026)
These are the user agent tokens you're looking for in your logs. Each line represents a distinct AI company and purpose.
# OpenAI (3 bots, 3 purposes)
GPTBot # Training data for GPT models
ChatGPT-User # Real-time browsing when users ask ChatGPT to read a URL
OAI-SearchBot # Indexing for ChatGPT Search results
# Anthropic (2 tokens)
ClaudeBot # Training data for Claude models
anthropic-ai # Alternative Anthropic identifier
# Google (3 AI-specific bots)
Google-Extended # Gemini/Bard AI training
Gemini-Deep-Research # Gemini Advanced deep research feature
Google-NotebookLM # NotebookLM URL ingestion
# Meta
meta-externalagent # Llama training (NOT the link preview bot)
# Microsoft
bingbot # Powers both Bing Search AND Copilot answers
# ByteDance
Bytespider # TikTok / Doubao AI training
# Amazon (3 bots)
Amazonbot # AI training for Alexa/Nova
Amzn-SearchBot # Rufus AI shopping assistant
Amzn-User # Live query browsing
# AI Search Engines
PerplexityBot # Perplexity AI search indexing
YouBot # You.com AI assistant
DuckAssistBot # DuckDuckGo AI answers
# European AI
MistralBot # Mistral AI (Le Chat, Mixtral)
DeepSeekBot # DeepSeek (V3, R1)
# xAI
xAI-Bot # Grok training (Elon Musk's xAI)
# Data Brokers (feed multiple AI companies)
CCBot # Common Crawl → GPT, Gemini, Llama, Mistral, etc.
Diffbot # Commercial data broker → Llama, Mistral, DiffbotLLM
omgili # Webz.io data broker
omgilibot # Webz.io (alternative token)
webzio-extended # Webz.io AI training crawler
# Apple
Applebot-Extended # Apple Intelligence training
# Academic
AI2Bot # Allen Institute (Semantic Scholar)
Ai2Bot-Dolma # Allen Institute (OLMo training dataset)
# Enterprise
cohere-ai # Cohere (Command R, Embed)
Method 1: Server log analysis
If you have access to your server's access logs (nginx, Apache, or any reverse proxy), you can grep for AI bot user agents directly. This is the most accurate method.
#!/bin/bash
# ai-bot-report.sh — Scan nginx access logs for AI bot traffic
LOG="/var/log/nginx/access.log"
# All known AI bot tokens (case-insensitive grep)
AI_BOTS="GPTBot|ChatGPT-User|OAI-SearchBot|ClaudeBot|anthropic-ai|Google-Extended|Gemini-Deep-Research|Google-NotebookLM|Bytespider|PerplexityBot|meta-externalagent|Amazonbot|Amzn-SearchBot|Amzn-User|CCBot|Diffbot|cohere-ai|DeepSeekBot|xAI-Bot|MistralBot|YouBot|DuckAssistBot|AI2Bot|Ai2Bot-Dolma|Applebot-Extended|omgili|omgilibot|webzio-extended"
echo "=== AI Bot Traffic Report ==="
echo "Log: $LOG"
echo "Date: $(date)"
echo ""
# Total AI bot requests
TOTAL=$(grep -ciE "$AI_BOTS" "$LOG")
ALL=$(wc -l < "$LOG")
echo "AI bot requests: $TOTAL / $ALL total ($(echo "scale=1; $TOTAL * 100 / $ALL" | bc)%)"
echo ""
# Breakdown by bot
echo "=== Requests per bot ==="
for bot in GPTBot ChatGPT-User OAI-SearchBot ClaudeBot Bytespider PerplexityBot Google-Extended meta-externalagent Amazonbot CCBot Diffbot DeepSeekBot xAI-Bot MistralBot cohere-ai; do
    COUNT=$(grep -ci "$bot" "$LOG")
    if [ "$COUNT" -gt 0 ]; then
        echo "  $bot: $COUNT"
    fi
done
echo ""
# Top 20 pages hit by AI bots
echo "=== Top 20 pages targeted by AI bots ==="
grep -iE "$AI_BOTS" "$LOG" \
| awk '{print $7}' \
| sort | uniq -c | sort -rn \
| head -20
echo ""
# Hourly distribution (when do they crawl?)
echo "=== Hourly distribution ==="
grep -iE "$AI_BOTS" "$LOG" \
| awk '{print substr($4,14,2)":00"}' \
| sort | uniq -c | sort -rn \
| head -10
# Add to crontab (crontab -e):
# Run daily at midnight, email results
0 0 * * * /opt/scripts/ai-bot-report.sh > /tmp/ai-bot-daily.txt 2>&1
# Or append to a rolling log:
0 0 * * * /opt/scripts/ai-bot-report.sh >> /var/log/ai-bot-reports.log 2>&1
# For Apache logs, change the LOG path:
# LOG="/var/log/apache2/access.log"
# Format is slightly different but grep works the same
Method 2: Dedicated AI bot log file
Instead of grepping through massive access logs, configure nginx to write AI bot requests to a separate log file. This makes monitoring trivial and keeps your main logs clean.
http {
    # Map AI bot user agents to a flag
    map $http_user_agent $is_ai_bot {
        default 0;
        "~*GPTBot" 1;
        "~*ChatGPT-User" 1;
        "~*OAI-SearchBot" 1;
        "~*ClaudeBot" 1;
        "~*anthropic-ai" 1;
        "~*Google-Extended" 1;
        "~*Gemini-Deep-Research" 1;
        "~*Google-NotebookLM" 1;
        "~*Bytespider" 1;
        "~*PerplexityBot" 1;
        "~*meta-externalagent" 1;
        "~*Amazonbot" 1;
        "~*Amzn-SearchBot" 1;
        "~*Amzn-User" 1;
        "~*CCBot" 1;
        "~*Diffbot" 1;
        "~*DeepSeekBot" 1;
        "~*xAI-Bot" 1;
        "~*MistralBot" 1;
        "~*cohere-ai" 1;
        "~*YouBot" 1;
        "~*DuckAssistBot" 1;
        "~*Applebot-Extended" 1;
        "~*AI2Bot" 1;
        "~*Ai2Bot-Dolma" 1;
        "~*omgili" 1;
        "~*webzio-extended" 1;
    }

    # JSON log format for easy parsing
    log_format ai_bot_json escape=json
        '{'
        '"time":"$time_iso8601",'
        '"ip":"$remote_addr",'
        '"method":"$request_method",'
        '"path":"$uri",'
        '"status":$status,'
        '"ua":"$http_user_agent",'
        '"bytes":$body_bytes_sent,'
        '"referer":"$http_referer"'
        '}';

    server {
        # Write AI bot requests to a separate file
        # (if= logs only when $is_ai_bot evaluates to a non-zero value)
        access_log /var/log/nginx/ai-bots.log ai_bot_json if=$is_ai_bot;

        # Normal access log (all traffic)
        access_log /var/log/nginx/access.log;
    }
}
The resulting JSON log can be parsed with jq for analysis, ingested by Elasticsearch/Loki, or processed by any monitoring tool. The structured format makes it trivial to build dashboards and alerts.
Method 3: Next.js middleware logging
On Vercel or any serverless platform where you don't have access to raw nginx logs, use Next.js Edge Middleware to detect and log AI bot requests.
import { NextRequest, NextResponse } from 'next/server';

const AI_BOT_PATTERNS = [
  /GPTBot/i, /ChatGPT-User/i, /OAI-SearchBot/i,
  /ClaudeBot/i, /anthropic-ai/i,
  /Google-Extended/i, /Gemini-Deep-Research/i,
  /Bytespider/i, /PerplexityBot/i,
  /meta-externalagent/i, /Amazonbot/i,
  /CCBot/i, /Diffbot/i, /cohere-ai/i,
  /DeepSeekBot/i, /xAI-Bot/i, /MistralBot/i,
  /YouBot/i, /DuckAssistBot/i,
  /Applebot-Extended/i, /AI2Bot/i,
];

function detectAIBot(ua: string): string | null {
  for (const pattern of AI_BOT_PATTERNS) {
    const match = ua.match(pattern);
    if (match) return match[0];
  }
  return null;
}

export function middleware(req: NextRequest) {
  const ua = req.headers.get('user-agent') || '';
  const bot = detectAIBot(ua);
  if (bot) {
    // Log to your preferred backend
    // Option 1: Console log (appears in Vercel Function Logs)
    console.log(JSON.stringify({
      type: 'ai_bot',
      bot,
      path: req.nextUrl.pathname,
      ip: req.headers.get('x-forwarded-for'),
      timestamp: new Date().toISOString(),
    }));

    // Option 2: Send to analytics endpoint (non-blocking)
    // fetch('https://your-analytics.com/api/bot-hit', {
    //   method: 'POST',
    //   body: JSON.stringify({ bot, path: req.nextUrl.pathname }),
    // }).catch(() => {}); // Fire and forget

    // Option 3: Add custom header for downstream processing
    const res = NextResponse.next();
    res.headers.set('x-ai-bot', bot);
    return res;
  }
  return NextResponse.next();
}

export const config = {
  matcher: ['/((?!_next|favicon.ico|robots.txt|sitemap.xml).*)'],
};
Method 4: Cloudflare bot analytics
If your site is behind Cloudflare (even the free plan), you already have AI bot visibility built in — you just might not know where to find it.
Free plan: Security → Bots
Cloudflare's free plan shows automated vs human traffic split. While it doesn't break down individual AI bots, it gives you baseline visibility.
Pro+ plan: Bot Analytics
Full bot analytics with individual bot identification, request volumes, and time series data.
Any plan: Firewall Events
Create a WAF rule that logs (not blocks) AI bot traffic, giving you a dedicated event stream.
# Cloudflare Dashboard → Security → WAF → Custom Rules
# Rule name: "Log AI Bot Traffic"
# Expression (matches any known AI bot user agent):
(http.user_agent contains "GPTBot")
or (http.user_agent contains "ClaudeBot")
or (http.user_agent contains "Bytespider")
or (http.user_agent contains "PerplexityBot")
or (http.user_agent contains "Google-Extended")
or (http.user_agent contains "meta-externalagent")
or (http.user_agent contains "CCBot")
or (http.user_agent contains "Diffbot")
or (http.user_agent contains "DeepSeekBot")
or (http.user_agent contains "xAI-Bot")
or (http.user_agent contains "Amazonbot")
# Action: Log
# (This creates entries in Security Events without
# blocking any traffic — pure monitoring)
Method 5: Real-time monitoring with alerts
For high-value content, you want to know immediately when a new AI bot starts crawling. Here's a lightweight monitoring setup using the JSON log file from Method 2.
#!/bin/bash
# ai-bot-monitor.sh — Watch AI bot log in real-time
# Usage: ./ai-bot-monitor.sh
LOG="/var/log/nginx/ai-bots.log"
WEBHOOK="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
echo "🔍 Monitoring AI bot traffic..."
echo " Log: $LOG"
echo " Press Ctrl+C to stop"
echo ""
tail -f "$LOG" | while read -r line; do
    BOT=$(echo "$line" | jq -r '.ua' | grep -oiE 'GPTBot|ClaudeBot|Bytespider|PerplexityBot|Google-Extended|DeepSeekBot|xAI-Bot|meta-externalagent|Amazonbot|CCBot|Diffbot')
    PATH_HIT=$(echo "$line" | jq -r '.path')
    IP=$(echo "$line" | jq -r '.ip')
    TIME=$(echo "$line" | jq -r '.time')

    # Print to terminal
    echo "[$TIME] $BOT → $PATH_HIT (from $IP)"

    # Alert on new/unusual bots (optional)
    # Uncomment to send Slack alerts:
    # curl -s -X POST "$WEBHOOK" \
    #   -H 'Content-type: application/json' \
    #   -d "{\"text\":\"🤖 AI Bot: $BOT hit $PATH_HIT\"}" \
    #   > /dev/null
done
# Count requests per bot (whole log; with daily rotation, roughly the last 24 hours)
cat /var/log/nginx/ai-bots.log | jq -r '.ua' \
| grep -oiE 'GPTBot|ClaudeBot|Bytespider|PerplexityBot|Google-Extended|CCBot' \
| sort | uniq -c | sort -rn
# Top 10 pages hit today
cat /var/log/nginx/ai-bots.log | jq -r '.path' \
| sort | uniq -c | sort -rn | head -10
# Unique IPs per bot
cat /var/log/nginx/ai-bots.log \
| jq -r '[.ua, .ip] | @tsv' \
| grep -i "GPTBot" | cut -f2 | sort -u | wc -l
# Bandwidth consumed by AI bots
cat /var/log/nginx/ai-bots.log \
| jq -r '.bytes' | paste -sd+ - | bc \
| awk '{printf "%.2f MB\n", $1/1048576}'
What to do with the data
Once you know which bots are crawling and what pages they target, you can make informed decisions instead of blindly blocking everything.
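For example, if the logs show a training crawler like Bytespider hitting every page while an AI search bot only touches a handful, a targeted robots.txt is often a better first response than a blanket block. A minimal sketch using tokens from the reference above (the /premium/ path is a placeholder for whatever you want fenced off):

```
# Disallow pure training crawlers site-wide
User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

# Keep AI search indexing, but fence off paid content
User-agent: OAI-SearchBot
Disallow: /premium/
```

Remember that robots.txt is advisory: bots that keep appearing in your logs after you disallow them need server-level blocking instead.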