How to Block AI Agents
When robots.txt Isn't Enough
Traditional AI crawlers identify themselves and respect robots.txt. AI agents don't. They launch real browsers, execute JavaScript, and look like human visitors. This guide covers the defences that actually work.
The shift: crawlers → agents
In 2024, the AI content harvesting problem was about crawlers — GPTBot, ClaudeBot, Bytespider. They sent identifiable user agent strings, fetched raw HTML, and (mostly) respected robots.txt. Blocking them was a one-line config change.
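That one-line fix, for reference, was a robots.txt stanza per crawler:

```
# robots.txt: this was enough for the 2024-era crawlers
User-agent: GPTBot
Disallow: /

User-agent: Bytespider
Disallow: /
```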
In 2026, the problem is agents. LLM-powered programs that use web browsing as a tool — launching Chromium, rendering JavaScript, navigating pages, filling forms, and extracting content. They appear as regular Chrome sessions. robots.txt is irrelevant.
The AI agent browsing ecosystem (2026)
AI agents browse through frameworks and tools that drive real browsers. None of them send crawler-specific user agents. All of them render JavaScript.
Defence 1: Headless browser detection
Most AI agent frameworks use headless Chromium (no visible UI). Headless browsers leave fingerprints that differ from real user sessions. Detecting these is your first defence layer.
import { NextRequest, NextResponse } from 'next/server';

const HEADLESS_SIGNALS = [
  // Headless Chromium default viewport width (note: this client hint is
  // only sent if your responses opt in via an Accept-CH header)
  (req: NextRequest) => req.headers.get('sec-ch-viewport-width') === '800',
  // Missing or empty Accept-Language (real browsers always send this)
  (req: NextRequest) => !req.headers.get('accept-language'),
  // Headless Chrome often sends minimal accept headers
  (req: NextRequest) => req.headers.get('accept') === '*/*',
  // No sec-ch-ua hints (real Chrome 90+ always sends these)
  (req: NextRequest) => {
    const ua = req.headers.get('user-agent') || '';
    const isChrome = /Chrome\/\d/.test(ua);
    const hasHints = !!req.headers.get('sec-ch-ua');
    return isChrome && !hasHints;
  },
];

export function middleware(req: NextRequest) {
  const score = HEADLESS_SIGNALS.filter(check => check(req)).length;

  // 2+ signals = likely headless browser
  if (score >= 2) {
    // Option A: Block
    return new NextResponse('Access denied', { status: 403 });
    // Option B: Serve decoy content (recommended)
    // return NextResponse.rewrite(new URL('/honeypot', req.url));
    // Option C: Rate limit aggressively
    // Set a header for a downstream rate limiter:
    // const res = NextResponse.next();
    // res.headers.set('x-bot-score', String(score));
    // return res;
  }
  return NextResponse.next();
}

export const config = {
  matcher: ['/((?!_next|favicon.ico|robots.txt).*)'],
};
You can also check for headless signals in the browser itself and report suspicious sessions back to the server:
<script>
// Detect headless browser signals client-side
(async function () {
  const signals = [];
  // 1. WebDriver flag (Puppeteer/Playwright set this)
  if (navigator.webdriver) signals.push('webdriver');
  // 2. Missing plugins (real browsers have PDF viewer, etc.)
  if (navigator.plugins.length === 0) signals.push('no-plugins');
  // 3. Missing languages
  if (!navigator.languages || navigator.languages.length === 0)
    signals.push('no-languages');
  // 4. Chrome user agent without the Chrome runtime object
  if (window.chrome === undefined && /Chrome/.test(navigator.userAgent))
    signals.push('fake-chrome');
  // 5. Permissions API anomaly. Await this before reporting, otherwise
  // the signal lands after the length check below has already run.
  try {
    const p = await navigator.permissions?.query({ name: 'notifications' });
    if (p?.state === 'denied' && Notification.permission === 'default')
      signals.push('permissions-mismatch');
  } catch (_) { /* Permissions API unavailable */ }

  // Report if 2+ signals detected
  if (signals.length >= 2) {
    fetch('/api/bot-report', {
      method: 'POST',
      body: JSON.stringify({ signals, ua: navigator.userAgent }),
      headers: { 'Content-Type': 'application/json' },
    });
  }
})();
</script>
Agent stealth plugins patch exactly these signals: setting navigator.webdriver = false, injecting fake plugins, spoofing screen dimensions. Client-side detection alone is not sufficient. Layer it with server-side behavioural analysis.
Defence 2: Behavioural analysis
Even when an agent perfectly spoofs browser fingerprints, its behaviour is inhuman. AI agents navigate with machine precision — zero scroll jitter, no mouse movement, instant page transitions, and systematic content extraction patterns.
const express = require('express');
const app = express();
app.set('trust proxy', true); // so req.ip reflects X-Forwarded-For behind a proxy

// Note: this Map grows unbounded; prune stale sessions in production.
const sessionScores = new Map(); // IP → { pages, timestamps, score }

function scoreSession(ip, path) {
  const session = sessionScores.get(ip) || {
    pages: [], timestamps: [], score: 0
  };
  const now = Date.now();
  session.pages.push(path);
  session.timestamps.push(now);

  // Check 1: Request frequency (>10 pages in 30 seconds)
  const recent = session.timestamps.filter(t => now - t < 30_000);
  if (recent.length > 10) session.score += 3;

  // Check 2: Uniform timing (stddev < 200ms between requests)
  if (session.timestamps.length >= 5) {
    const gaps = [];
    for (let i = 1; i < session.timestamps.length; i++) {
      gaps.push(session.timestamps[i] - session.timestamps[i - 1]);
    }
    const mean = gaps.reduce((a, b) => a + b) / gaps.length;
    const stddev = Math.sqrt(
      gaps.reduce((sum, g) => sum + (g - mean) ** 2, 0) / gaps.length
    );
    if (stddev < 200) session.score += 2; // Too regular for a human
  }

  // Check 3: Sequential path access (e.g. /blog/1, /blog/2, /blog/3)
  const pathNums = session.pages
    .map(p => parseInt(p.match(/\d+/)?.[0] || '0'));
  const isSequential = pathNums.every(
    (n, i) => i === 0 || n >= pathNums[i - 1]
  );
  if (isSequential && session.pages.length > 5) session.score += 2;

  sessionScores.set(ip, session);
  return session.score;
}

// In your request handler:
app.use((req, res, next) => {
  const score = scoreSession(req.ip, req.path);
  if (score >= 5) {
    return res.status(429).json({ error: 'Rate limited' });
  }
  next();
});
Defence 3: Honeypot traps
Honeypot links are invisible to humans but visible to bots that parse raw HTML. When an AI agent follows a honeypot link, you've confirmed it's automated — and can block everything from that session.
This is the principle behind Cloudflare AI Labyrinth (launched early 2025) — it serves AI agents an endless maze of realistic but fake AI-generated content, wasting their compute while protecting your real pages.
<!-- Hidden link in your page footer — CSS makes it invisible -->
<a
href="/internal/system-config"
style="position:absolute;left:-9999px;opacity:0;pointer-events:none"
tabindex="-1"
aria-hidden="true"
>
System Configuration
</a>
The honeypot route itself logs and blocks the visitor:
// app/internal/system-config/route.ts (Next.js)
import { NextRequest, NextResponse } from 'next/server';

// In-memory state works on a single long-lived server, but is not shared
// across serverless or edge instances; use Redis or KV in production.
const blockedIPs = new Set<string>();

export async function GET(req: NextRequest) {
  const ip = req.headers.get('x-forwarded-for')?.split(',')[0]?.trim()
    || 'unknown';

  // Log the bot
  console.log(`[HONEYPOT] Bot detected: ${ip}`);
  console.log(`  UA: ${req.headers.get('user-agent')}`);
  console.log(`  Referer: ${req.headers.get('referer')}`);

  // Block this IP for 24 hours
  blockedIPs.add(ip);
  setTimeout(() => blockedIPs.delete(ip), 86_400_000);

  // Serve fake content to waste tokens
  return NextResponse.json({
    config: {
      version: "3.2.1",
      features: Array.from({ length: 100 }, (_, i) =>
        `feature_${i}: Detailed configuration for module ${i}...`
      ),
    },
  });
}

// Export blockedIPs for use in middleware
export { blockedIPs };
Defence 4: Intelligent rate limiting
Standard rate limiting (X requests per minute per IP) catches naive agents but misses distributed ones. Content-aware rate limiting is more effective: limit the amount of unique content a session can access, not just the request count.
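The content-aware idea can be sketched in a few lines of application code. This is a minimal sketch: the 20-page limit, 10-minute window, and function name are illustrative, and a production version would keep state in Redis rather than process memory:

```typescript
// Content-aware limiter: caps *unique* pages per session, not raw requests.
const UNIQUE_PAGE_LIMIT = 20;
const WINDOW_MS = 10 * 60_000;

type Session = { paths: Map<string, number> }; // path → last-seen timestamp
const sessions = new Map<string, Session>();

function allowRequest(ip: string, path: string, now = Date.now()): boolean {
  const session = sessions.get(ip) ?? { paths: new Map<string, number>() };
  // Forget pages last seen outside the window
  for (const [p, t] of session.paths) {
    if (now - t > WINDOW_MS) session.paths.delete(p);
  }
  session.paths.set(path, now);
  sessions.set(ip, session);
  // Repeat visits to the same path don't consume budget;
  // harvesting many distinct pages does.
  return session.paths.size <= UNIQUE_PAGE_LIMIT;
}
```

Unlike a raw request counter, this lets a human re-read one article as often as they like while capping how much of your catalogue a single session can extract.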
# Basic rate limit — 2 requests/second per IP
limit_req_zone $binary_remote_addr zone=general:10m rate=2r/s;

# Stricter limit for content-heavy pages
limit_req_zone $binary_remote_addr zone=content:10m rate=10r/m;

server {
  # General pages — allow small burst
  location / {
    limit_req zone=general burst=10 nodelay;
  }

  # Blog/guide pages — strict limit
  location /blog {
    limit_req zone=content burst=3 nodelay;
  }
  location /guides {
    limit_req zone=content burst=3 nodelay;
  }

  # API endpoints — very strict
  location /api {
    limit_req zone=general burst=5 nodelay;
    limit_req_status 429;
  }
}
# Cloudflare WAF Custom Rule (dashboard → Security → WAF)
# Expression:
(cf.bot_management.score lt 30)
and not (cf.bot_management.verified_bot)
# Action: Managed Challenge
# For AI Labyrinth (Cloudflare dashboard):
# Security → Bots → Enable "AI Labyrinth"
# This serves fake content to detected AI agents
# without returning errors that might trigger retries
Defence 5: TLS fingerprinting
Every browser and HTTP client has a unique TLS handshake fingerprint (JA3/JA4 hash). Headless Chromium has a different JA3 hash than real Chrome — even when the user agent string is identical. This is the hardest signal for agent frameworks to spoof because it operates at the TLS protocol level.
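If you are on Cloudflare with Bot Management, the JA3 hash is exposed directly in WAF expressions, so a rule in the same style as the one above can match a specific TLS fingerprint. A sketch (the hash is a placeholder for one you have observed in your logs; the field requires the Bot Management add-on):

```
# Cloudflare WAF expression — match a specific TLS fingerprint
(cf.bot_management.ja3_hash eq "<ja3-hash-of-offending-client>")
# Action: Block or Managed Challenge
```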
Platform configs: Vercel, Cloudflare, AWS
Vercel (Next.js)
Vercel has no built-in AI agent detection, so you implement it yourself in middleware, as in the headless-detection and honeypot examples above.
Cloudflare
Cloudflare offers the best out-of-the-box protection against AI agents: bot scoring, verified-bot checks, and AI Labyrinth are all configurable from the dashboard, as in the WAF rule shown earlier.
AWS (CloudFront + WAF)
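The usual AWS setup is CloudFront in front of a WAF web ACL running the Bot Control managed rule group (a paid add-on). A sketch via the AWS CLI; the ACL name and metric names are illustrative, and flags are abbreviated — check the wafv2 documentation before running:

```
# Create a CloudFront-scoped web ACL with AWS Bot Control attached
aws wafv2 create-web-acl \
  --name block-ai-agents \
  --scope CLOUDFRONT \
  --region us-east-1 \
  --default-action Allow={} \
  --visibility-config SampledRequestsEnabled=true,CloudWatchMetricsEnabled=true,MetricName=blockAiAgents \
  --rules '[{
    "Name": "bot-control",
    "Priority": 0,
    "Statement": {
      "ManagedRuleGroupStatement": {
        "VendorName": "AWS",
        "Name": "AWSManagedRulesBotControlRuleSet"
      }
    },
    "OverrideAction": { "None": {} },
    "VisibilityConfig": {
      "SampledRequestsEnabled": true,
      "CloudWatchMetricsEnabled": true,
      "MetricName": "botControl"
    }
  }]'
```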
Decision matrix: what to implement
Not every site needs every defence. Match your protection level to your content value and risk.
Frequently asked questions
Do AI agents respect robots.txt?
What is Firecrawl and how does it work?
How can I detect AI agents on my site?
What is the difference between AI crawlers and AI agents?
Can Cloudflare or Vercel block AI agents?
Will blocking AI agents hurt my SEO?
What are honeypot links and do they work against AI agents?
How exposed is your site?
Open Shadow scans your site for AI bot exposure — crawlers and agent vulnerabilities. See exactly what's accessing your content.
Scan My Site — Free →