
How to Block AI Agents
When robots.txt Isn't Enough

Traditional AI crawlers identify themselves and respect robots.txt. AI agents don't. They launch real browsers, execute JavaScript, and look like human visitors. This guide covers the defences that actually work.

The shift: crawlers → agents

In 2024, the AI content harvesting problem was about crawlers — GPTBot, ClaudeBot, Bytespider. They sent identifiable user agent strings, fetched raw HTML, and (mostly) respected robots.txt. Blocking them was a one-line config change.

In 2026, the problem is agents. LLM-powered programs that use web browsing as a tool — launching Chromium, rendering JavaScript, navigating pages, filling forms, and extracting content. They appear as regular Chrome sessions. robots.txt is irrelevant.
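For contrast, the crawler-era fix really was a few lines of config. A robots.txt sketch (crawler tokens drawn from the vendors' published docs; the lists change over time, so verify current names) — and note that none of this touches agents:

```text
# robots.txt — blocks self-identifying AI crawlers; does nothing against agents
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Bytespider
Disallow: /
```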

Crawlers vs Agents — Key Differences
Identity: Crawlers send identifiable user agents. Agents appear as Chrome 124+.
Protocol: Crawlers fetch raw HTTP. Agents launch full browsers with JavaScript execution.
Scale: Crawlers hit millions of pages/day. Agents hit tens-to-hundreds per task, but deeper.
robots.txt: Crawlers (mostly) respect it. Agents don't check it — they're browsers, not crawlers.
Defence: Crawlers → robots.txt + UA blocking. Agents → bot detection, rate limiting, fingerprinting.

The AI agent browsing ecosystem (2026)

These are the frameworks and tools that AI agents use to browse your site. None of them send crawler-specific user agents. All of them render JavaScript.

Firecrawl: Open-source LLM-optimised web scraper
How it works: Launches Chromium per request, extracts clean markdown, handles pagination. Default browsing tool in LangChain, CrewAI, and many agent frameworks.
User agent: Chrome (standard Chromium UA)
Risk level: High — purpose-built for feeding content to LLMs
browser-use: Python library for LLM-controlled browsing
How it works: Gives any LLM (GPT-4, Claude, etc.) direct control of a Playwright browser. The agent sees the page, decides what to click, and extracts what it needs.
User agent: Chrome/Firefox (standard browser UA)
Risk level: High — fastest-growing agent browsing framework on GitHub
Playwright MCP: Model Context Protocol server for browser control
How it works: Exposes Playwright browser automation as MCP tools. Any MCP-compatible AI (Claude, GPT, etc.) can navigate, screenshot, click, and extract from any page.
User agent: Chrome (Playwright default)
Risk level: High — used by Claude Desktop, Cursor, Windsurf, and other coding agents
Stagehand (Browserbase): AI-native browser automation SDK
How it works: Natural-language browser control ("click the login button", "extract all prices"). Runs on Browserbase cloud infrastructure with rotating proxy IPs.
User agent: Chrome (residential proxy rotation)
Risk level: Very high — rotating IPs make IP blocking ineffective
Crawl4AI: Open-source async crawler with LLM extraction
How it works: Async Playwright-based crawling with built-in LLM content extraction. Supports chunking strategies optimised for RAG pipelines.
User agent: Chrome (standard Chromium UA)
Risk level: High — explicitly designed for RAG and LLM data pipelines
Jina Reader API: Commercial URL-to-markdown API
How it works: Prefix any URL with r.jina.ai/ to get a clean markdown version. Used as a browsing tool by many AI agents. Also powers search.jina.ai for AI search.
User agent: Identifies as "jina" in some cases; proxied requests vary
Risk level: Medium — identifiable when using the API directly, but proxied calls are harder to detect
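Of the tools above, only Jina Reader sometimes self-identifies. A minimal sketch of a UA check (the exact token is an assumption — verify against your own access logs before blocking on it):

```typescript
// Heuristic: flag user agents that mention Jina's reader service.
// Proxied Jina requests carry a normal browser UA, so treat a match
// as one signal, not proof.
function looksLikeJinaReader(userAgent: string): boolean {
  return /\bjina\b/i.test(userAgent);
}
```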

Defence 1: Headless browser detection

Most AI agent frameworks use headless Chromium (no visible UI). Headless browsers leave fingerprints that differ from real user sessions. Detecting these is your first defence layer.

middleware.ts — Next.js headless browser detection
import { NextRequest, NextResponse } from 'next/server';

const HEADLESS_SIGNALS = [
  // Headless Chromium default viewport width. Note: the Sec-CH-Viewport-Width
  // client hint is only sent if you opt in via an Accept-CH response header.
  (req: NextRequest) => req.headers.get('sec-ch-viewport-width') === '800',
  // Missing or empty Accept-Language (real browsers always send this)
  (req: NextRequest) => !req.headers.get('accept-language'),
  // Headless Chrome often sends minimal accept headers
  (req: NextRequest) => req.headers.get('accept') === '*/*',
  // No sec-ch-ua hints (real Chrome 90+ always sends these)
  (req: NextRequest) => {
    const ua = req.headers.get('user-agent') || '';
    const isChrome = /Chrome\/\d/.test(ua);
    const hasHints = !!req.headers.get('sec-ch-ua');
    return isChrome && !hasHints;
  },
];

export function middleware(req: NextRequest) {
  const score = HEADLESS_SIGNALS
    .filter(check => check(req))
    .length;

  // 2+ signals = likely headless browser
  if (score >= 2) {
    // Option A: Block
    return new NextResponse('Access denied', { status: 403 });

    // Option B: Serve decoy content (recommended)
    // return NextResponse.rewrite(new URL('/honeypot', req.url));

    // Option C: Rate limit aggressively
    // Set a header for downstream rate limiter
    // const res = NextResponse.next();
    // res.headers.set('x-bot-score', String(score));
    // return res;
  }

  return NextResponse.next();
}

export const config = {
  matcher: ['/((?!_next|favicon.ico|robots.txt).*)'],
};
Client-side detection: inline fingerprinting snippet
<script>
  // Detect headless browser signals client-side
  (async function () {
    const signals = [];

    // 1. WebDriver flag (Puppeteer/Playwright set this)
    if (navigator.webdriver) signals.push('webdriver');

    // 2. Missing plugins (real browsers expose a PDF viewer, etc.)
    if (navigator.plugins.length === 0) signals.push('no-plugins');

    // 3. Missing languages
    if (!navigator.languages || navigator.languages.length === 0)
      signals.push('no-languages');

    // 4. Chrome UA without the Chrome runtime object
    if (window.chrome === undefined && /Chrome/.test(navigator.userAgent))
      signals.push('fake-chrome');

    // 5. Permissions API anomaly (awaited so the result is counted below,
    //    rather than resolving after the report has already been sent)
    try {
      const p = await navigator.permissions.query({ name: 'notifications' });
      if (p.state === 'denied' && Notification.permission === 'default')
        signals.push('permissions-mismatch');
    } catch (_) { /* Permissions API unavailable */ }

    // Report if 2+ signals detected
    if (signals.length >= 2) {
      fetch('/api/bot-report', {
        method: 'POST',
        body: JSON.stringify({ signals, ua: navigator.userAgent }),
        headers: { 'Content-Type': 'application/json' },
      });
    }
  })();
</script>
⚠️ Arms race warning: Sophisticated agent frameworks like Stagehand actively patch headless fingerprints — setting navigator.webdriver = false, injecting fake plugins, spoofing screen dimensions. Client-side detection alone is not sufficient. Layer it with server-side behavioural analysis.

Defence 2: Behavioural analysis

Even when an agent perfectly spoofs browser fingerprints, its behaviour is inhuman. AI agents navigate with machine precision — zero scroll jitter, no mouse movement, instant page transitions, and systematic content extraction patterns.

No mouse events: Real users generate mousemove, mouseover, and click events constantly. Agents that use accessibility trees or DOM extraction generate zero mouse activity.
Zero scroll variation: Humans scroll in irregular bursts with varying speeds. Agents either don't scroll at all (they read the DOM directly) or scroll in perfectly uniform increments.
Instant navigation: Humans spend 15-120 seconds per page on average. Agents fetch content and move on in <2 seconds — often navigating through an entire section in sequential order.
Sitemap-order access: An agent given your sitemap.xml will often access pages in the exact order listed. No human reads a site in sitemap order.
No referrer on deep pages: When an agent navigates directly to /pricing without first visiting / or /features, the missing referrer chain reveals non-human navigation.
Uniform request timing: Requests spaced at exactly 1.0s, 2.0s, or 5.0s intervals. Human timing has natural variance (Gaussian distribution). Machine timing is suspiciously regular.
Server-side session scoring — Express / Node.js example
const sessionScores = new Map(); // IP → { pages, timestamps, score }
// Note: evict stale entries (TTL/LRU) in production, or this map grows unbounded

function scoreSession(ip, path) {
  const session = sessionScores.get(ip) || {
    pages: [], timestamps: [], score: 0
  };

  const now = Date.now();
  session.pages.push(path);
  session.timestamps.push(now);

  // Check 1: Request frequency (>10 pages in 30 seconds)
  const recent = session.timestamps
    .filter(t => now - t < 30_000);
  if (recent.length > 10) session.score += 3;

  // Check 2: Uniform timing (stddev < 200ms between requests)
  if (session.timestamps.length >= 5) {
    const gaps = [];
    for (let i = 1; i < session.timestamps.length; i++) {
      gaps.push(session.timestamps[i] - session.timestamps[i-1]);
    }
    const mean = gaps.reduce((a,b) => a+b) / gaps.length;
    const stddev = Math.sqrt(
      gaps.reduce((sum, g) => sum + (g - mean) ** 2, 0) / gaps.length
    );
    if (stddev < 200) session.score += 2; // Too regular
  }

  // Check 3: Sequential path access
  const pathNums = session.pages
    .map(p => parseInt(p.match(/\d+/)?.[0] || '0'));
  const isSequential = pathNums.every(
    (n, i) => i === 0 || n >= pathNums[i-1]
  );
  if (isSequential && session.pages.length > 5)
    session.score += 2;

  sessionScores.set(ip, session);
  return session.score;
}

// In your request handler:
app.use((req, res, next) => {
  const score = scoreSession(req.ip, req.path);
  if (score >= 5) {
    return res.status(429)
      .json({ error: 'Rate limited' });
  }
  next();
});

Defence 3: Honeypot traps

Honeypot links are invisible to humans but visible to bots that parse raw HTML. When an AI agent follows a honeypot link, you've confirmed it's automated — and can block everything from that session.

This is the principle behind Cloudflare AI Labyrinth (launched early 2025) — it serves AI agents an endless maze of realistic but fake AI-generated content, wasting their compute while protecting your real pages.

Simple honeypot implementation
<!-- Hidden link in your page footer — CSS makes it invisible -->
<a
  href="/internal/system-config"
  style="position:absolute;left:-9999px;opacity:0;pointer-events:none"
  tabindex="-1"
  aria-hidden="true"
>
  System Configuration
</a>

The honeypot route logs and blocks the visitor:

// app/internal/system-config/route.ts (Next.js)
// Note: an in-memory Set resets on serverless cold starts and isn't shared
// across instances — back it with Redis or a WAF rule in production.
import { NextRequest, NextResponse } from 'next/server';
const blockedIPs = new Set<string>();

export async function GET(req: NextRequest) {
  // x-forwarded-for may hold a comma-separated chain; take the first (client) IP
  const ip = (req.headers.get('x-forwarded-for') ?? '')
    .split(',')[0].trim() || 'unknown';

  // Log the bot
  console.log(`[HONEYPOT] Bot detected: ${ip}`);
  console.log(`  UA: ${req.headers.get('user-agent')}`);
  console.log(`  Referer: ${req.headers.get('referer')}`);

  // Block this IP for 24 hours
  blockedIPs.add(ip);
  setTimeout(() => blockedIPs.delete(ip), 86_400_000);

  // Serve fake content to waste tokens
  return NextResponse.json({
    config: {
      version: "3.2.1",
      features: Array.from({ length: 100 }, (_, i) =>
        `feature_${i}: Detailed configuration for module ${i}...`
      ),
    },
  });
}

// Export blockedIPs for use in middleware
export { blockedIPs };
💡 Pro tip: Place multiple honeypot links across your site with different paths. An AI agent that hits 2+ honeypots in a session is confirmed automated with near-zero false positive rate. Combine this with Cloudflare AI Labyrinth for defence in depth.
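The multi-honeypot rule in the tip above can be sketched as a small hit counter. This is an illustrative sketch (the `recordHoneypotHit` helper and the 2-hit threshold are assumptions matching the tip, not a library API):

```typescript
// Track honeypot hits per IP; 2+ distinct honeypot paths = confirmed bot.
const honeypotHits = new Map<string, Set<string>>();
const confirmedBots = new Set<string>();

// Call this from every honeypot route; returns true once the IP is confirmed.
function recordHoneypotHit(ip: string, path: string): boolean {
  const paths = honeypotHits.get(ip) ?? new Set<string>();
  paths.add(path);
  honeypotHits.set(ip, paths);
  if (paths.size >= 2) confirmedBots.add(ip);
  return confirmedBots.has(ip);
}
```

Requiring two distinct paths (rather than two requests) is what keeps the false-positive rate near zero: a human who somehow tabs onto one hidden link will not find a second.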

Defence 4: Intelligent rate limiting

Standard rate limiting (X requests per minute per IP) catches naive agents but misses distributed ones. Content-aware rate limiting is more effective: limit the amount of unique content a session can access, not just the request count.
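A sketch of the content-aware idea: meter unique content pages per session instead of raw requests. Everything here is illustrative — the `/blog|/guides` path filter, the limit of 20, and the session key (use a cookie or fingerprint ID, not just IP) are assumptions, and the in-memory map stands in for Redis with TTLs:

```typescript
// Content-aware limiter: cap distinct content pages per session,
// independent of total request count.
const seenContent = new Map<string, Set<string>>();
const CONTENT_LIMIT = 20; // unique content pages allowed per session window

function allowContentAccess(sessionId: string, path: string): boolean {
  // Only meter content-heavy sections; everything else passes freely
  if (!/^\/(blog|guides)\//.test(path)) return true;
  const seen = seenContent.get(sessionId) ?? new Set<string>();
  seen.add(path);
  seenContent.set(sessionId, seen);
  return seen.size <= CONTENT_LIMIT;
}
```

The point of this design: a human re-reading the same three articles never hits the cap, while an agent systematically harvesting a section does — even if it spaces its requests politely.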

nginx rate limiting configuration
# Basic rate limit — 2 requests/second per IP
limit_req_zone $binary_remote_addr
  zone=general:10m rate=2r/s;

# Stricter limit for content-heavy pages
limit_req_zone $binary_remote_addr
  zone=content:10m rate=10r/m;

server {
  # General pages — allow small burst
  location / {
    limit_req zone=general burst=10 nodelay;
  }

  # Blog/guide pages — strict limit
  location /blog {
    limit_req zone=content burst=3 nodelay;
  }
  location /guides {
    limit_req zone=content burst=3 nodelay;
  }

  # API endpoints — very strict
  location /api {
    limit_req zone=general burst=5 nodelay;
    limit_req_status 429;
  }
}
Cloudflare WAF rule — block headless browsers
# Cloudflare WAF Custom Rule (dashboard → Security → WAF)
# Expression:
(cf.bot_management.score lt 30)
and not (cf.bot_management.verified_bot)

# Action: Managed Challenge

# For AI Labyrinth (Cloudflare dashboard):
# Security → Bots → Enable "AI Labyrinth"
# This serves fake content to detected AI agents
# without returning errors that might trigger retries

Defence 5: TLS fingerprinting

Every browser and HTTP client has a unique TLS handshake fingerprint (JA3/JA4 hash). Headless Chromium has a different JA3 hash than real Chrome — even when the user agent string is identical. This is the hardest signal for agent frameworks to spoof because it operates at the TLS protocol level.

JA3 fingerprinting — hashes the TLS ClientHello parameters (cipher suites, extensions, elliptic curves). Headless Chromium uses different TLS extension ordering than real Chrome.
JA4 fingerprinting — next-gen version that includes ALPN and signature algorithms. Even more precise at distinguishing headless browsers.
Where to implement: Cloudflare (built-in), nginx with ssl_ja3 module, HAProxy, or any TLS-terminating reverse proxy. Not available in application-level middleware (Next.js, Express) because TLS is terminated before the request reaches your app.
⚠️ Limitation: Cloud-based agent services like Stagehand/Browserbase run on real browser instances (not headless) with residential IPs, producing legitimate TLS fingerprints. TLS fingerprinting catches self-hosted agent frameworks but not commercial browser-as-a-service platforms.
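To make the JA3 mechanism concrete, here is a sketch of the hash construction: concatenate the decimal ClientHello fields in a fixed order and MD5 the result. The `ClientHello` shape and sample values are illustrative — in practice your TLS-terminating proxy extracts these from the raw handshake:

```typescript
import { createHash } from 'node:crypto';

// Parsed ClientHello fields, as decimal values (illustrative shape)
interface ClientHello {
  version: number;       // e.g. 771 = TLS 1.2
  ciphers: number[];
  extensions: number[];
  curves: number[];
  pointFormats: number[];
}

// JA3 string: "version,ciphers,extensions,curves,pointFormats",
// each list dash-joined, then MD5-hashed to a 32-char hex fingerprint
function ja3(h: ClientHello): string {
  const str = [
    h.version,
    h.ciphers.join('-'),
    h.extensions.join('-'),
    h.curves.join('-'),
    h.pointFormats.join('-'),
  ].join(',');
  return createHash('md5').update(str).digest('hex');
}
```

Because the hash covers field *ordering*, not just membership, the different TLS extension ordering of headless Chromium yields a different JA3 than real Chrome even when the cipher set is identical.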

Platform configs: Vercel, Cloudflare, AWS

Vercel (Next.js)

Vercel doesn't have built-in AI agent detection. Your options:

1. Edge Middleware — headless detection + behavioural scoring (see code above)
2. Vercel WAF — custom rules on Enterprise plan for rate limiting by path
3. Vercel Firewall + Cloudflare — put Cloudflare in front of Vercel for bot management + AI Labyrinth
4. Honeypot links — zero-cost, works on all Vercel plans

Cloudflare

Best out-of-the-box protection against AI agents:

1. Bot Management (Business/Enterprise) — ML-based bot scoring with JA3 fingerprinting
2. AI Labyrinth (all plans) — serves fake AI-generated content to detected bots
3. Super Bot Fight Mode (Pro+) — automated bot classification with challenge/block options
4. WAF Custom Rules (free) — filter by bot score, user agent patterns, and request characteristics

AWS (CloudFront + WAF)

1. AWS WAF Bot Control — targeted bot detection with managed rule groups
2. Rate-based rules — limit requests per IP with CloudFront integration
3. Lambda@Edge — custom bot detection logic at the CDN edge (equivalent to Vercel middleware)
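A Lambda@Edge viewer-request handler can reuse the same header heuristics as the Next.js middleware earlier. A sketch (the `CfHeaders` shape mirrors CloudFront's lower-cased header format; the scoring function and threshold are assumptions carried over from the middleware example):

```typescript
// Score a CloudFront viewer-request by headless-browser header signals.
interface CfHeader { value: string }
type CfHeaders = Record<string, CfHeader[] | undefined>;

function scoreCloudFrontRequest(headers: CfHeaders): number {
  const get = (name: string) => headers[name]?.[0]?.value ?? '';
  let score = 0;
  if (!get('accept-language')) score++;                    // real browsers send this
  if (get('accept') === '*/*') score++;                    // minimal accept header
  const ua = get('user-agent');
  if (/Chrome\/\d/.test(ua) && !get('sec-ch-ua')) score++; // Chrome UA without client hints
  return score;
}

// In the handler, 2+ signals would return an early 403:
// if (scoreCloudFrontRequest(event.Records[0].cf.request.headers) >= 2)
//   return { status: '403', statusDescription: 'Forbidden' };
```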

Decision matrix: what to implement

Not every site needs every defence. Match your protection level to your content value and risk.

Basic (free): 30 minutes to implement
For: Personal blogs, portfolio sites
Deploy: Honeypot links + basic rate limiting + robots.txt for crawlers
Standard: 2-3 hours to implement
For: Content sites, SaaS docs, e-commerce
Deploy: Everything above + headless detection middleware + Cloudflare free tier (AI Labyrinth)
Advanced: 1-2 days to implement
For: Paywalled publishers, premium content, proprietary data
Deploy: Everything above + behavioural analysis + TLS fingerprinting + Cloudflare Bot Management
Enterprise: ongoing effort
For: News orgs, research databases, legal/financial content
Deploy: Everything above + device fingerprinting + CAPTCHA on suspicious sessions + legal TDMRep headers

Frequently asked questions

Do AI agents respect robots.txt?
No. Traditional AI crawlers like GPTBot or ClaudeBot respect robots.txt because they identify themselves with known user agent strings. AI agents are different — they use headless browsers (Playwright, Puppeteer, Selenium) or tool-calling frameworks (Firecrawl, browser-use, Stagehand) that appear as regular Chrome or Firefox browsers. They don't send crawler-specific user agents and therefore never check robots.txt. This is the fundamental shift: agents browse, crawlers crawl.
What is Firecrawl and how does it work?
Firecrawl is an open-source web scraping framework purpose-built for feeding content to LLMs. It renders JavaScript, extracts clean markdown, handles pagination, and can crawl entire sitemaps. Unlike traditional crawlers, Firecrawl launches a real Chromium browser session per request — making it nearly indistinguishable from a human visitor at the HTTP level. It's used by AI agent frameworks like LangChain, CrewAI, and AutoGPT as a default web browsing tool.
How can I detect AI agents on my site?
AI agents leave detectable patterns: (1) Headless browser fingerprints — navigator.webdriver=true, missing plugins, zero screen dimensions. (2) Behavioural signals — no mouse movement, no scroll jitter, instant page-to-page navigation, zero idle time. (3) Request patterns — sequential page access following sitemap order, uniform timing between requests, hitting every page in a section. (4) Missing signals — no cookies, no referrer on internal pages, no font rendering fingerprint. (5) TLS fingerprinting — headless Chromium has distinctive JA3/JA4 hashes different from real Chrome.
What is the difference between AI crawlers and AI agents?
AI crawlers (GPTBot, ClaudeBot, Bytespider) are purpose-built HTTP clients that fetch raw HTML, identify themselves via user agent strings, and typically respect robots.txt. They operate at infrastructure scale — millions of pages per day. AI agents are LLM-powered programs that use web browsing as a tool. They launch real browsers, execute JavaScript, interact with forms, and navigate like humans. They operate at task scale — tens to hundreds of pages per session, but with much deeper page interaction. The defence strategies are completely different.
Can Cloudflare or Vercel block AI agents?
Cloudflare's Bot Management and AI Labyrinth feature (launched early 2025) specifically target AI agents by serving them fake, AI-generated content that wastes their tokens while protecting real pages. Cloudflare's free tier includes basic bot detection. Vercel's WAF and Edge Middleware can implement custom bot detection but requires manual configuration — no built-in AI agent blocking. AWS WAF and Akamai Bot Manager also offer headless browser detection that catches most agent frameworks.
Will blocking AI agents hurt my SEO?
No. AI agents are not search engine crawlers. Googlebot, Bingbot, and other search indexers identify themselves with known user agents and are trivially whitelisted in any bot detection system. The techniques in this guide — headless browser detection, behavioural analysis, rate limiting — target automated browsing sessions that don't identify themselves. Real users and real search crawlers are unaffected when detection is properly configured.
What are honeypot links and do they work against AI agents?
Honeypot links are invisible links (hidden via CSS) that humans never click but automated scrapers follow. When an AI agent follows a honeypot link, you can immediately flag the session as a bot and block all subsequent requests from that IP or session. This technique is highly effective against AI agents because they typically extract all links from a page's HTML without rendering CSS — so they see and follow links that are invisible to real users. Cloudflare's AI Labyrinth is essentially a scaled-up version of this approach.

How exposed is your site?

Open Shadow scans your site for AI bot exposure — crawlers and agent vulnerabilities. See exactly what's accessing your content.

Scan My Site — Free →
