How to Block AI Bots on HAProxy: Complete 2026 Guide
HAProxy is a high-performance TCP/HTTP load balancer and reverse proxy used in production by GitHub, Reddit, Airbnb, and many high-traffic sites. Unlike nginx or Apache, HAProxy uses an ACL-driven configuration, which makes bot blocking concise and fast. This guide covers ACL-based UA matching, X-Robots-Tag headers, robots.txt serving, rate limiting via stick tables, and logging.
ACL-based UA blocking
HAProxy uses ACLs (Access Control Lists) in the frontend block to match request properties. Block AI bots before the request reaches any backend:
frontend http-in
bind *:80
bind *:443 ssl crt /etc/haproxy/certs/
# AI bot blocking — substring match, case-insensitive
acl is_ai_bot req.hdr(User-Agent) -m sub -i \
GPTBot \
ClaudeBot \
anthropic-ai \
CCBot \
Google-Extended \
AhrefsBot \
Bytespider \
Amazonbot \
Diffbot \
FacebookBot \
cohere-ai \
PerplexityBot \
YouBot
http-request deny status 403 if is_ai_bot
default_backend app

http-request deny in the frontend stops the request before it reaches the backend, saving backend resources (connections, threads, DB queries). Backend rules only fire after HAProxy has selected a server — the request has already consumed resources by then. The ACL values above are space-separated — each additional value on the same acl line is OR logic. Alternatively, list each on a separate acl is_ai_bot line (a repeated ACL name is also OR logic).

Alternative — one value per line (equivalent)
acl is_ai_bot req.hdr(User-Agent) -m sub -i GPTBot
acl is_ai_bot req.hdr(User-Agent) -m sub -i ClaudeBot
acl is_ai_bot req.hdr(User-Agent) -m sub -i anthropic-ai
acl is_ai_bot req.hdr(User-Agent) -m sub -i CCBot
acl is_ai_bot req.hdr(User-Agent) -m sub -i Google-Extended
http-request deny status 403 if is_ai_bot

Repeated acl lines with the same name are ORed together — same result as the multi-value single line.
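The OR semantics of both ACL forms can be sketched in Python (illustrative only; HAProxy performs this matching natively in C):

```python
# Each value is a case-insensitive substring test; multiple values on one
# acl line, or repeated acl lines with the same name, are ORed together.
AI_BOT_TOKENS = [
    "GPTBot", "ClaudeBot", "anthropic-ai", "CCBot", "Google-Extended",
    "AhrefsBot", "Bytespider", "Amazonbot", "Diffbot", "FacebookBot",
    "cohere-ai", "PerplexityBot", "YouBot",
]

def is_ai_bot(user_agent):
    """Mimics: acl is_ai_bot req.hdr(User-Agent) -m sub -i <tokens...>"""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in AI_BOT_TOKENS)
```

For example, is_ai_bot("Mozilla/5.0 (compatible; GPTBot/1.1)") is true, while an ordinary browser UA is not matched.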
ACL match flags: -m sub vs -m reg
HAProxy ACLs support multiple match methods. The two most useful for bot blocking:
| Flag | Method | Use case | Performance |
|---|---|---|---|
| -m sub | Substring match | Bot name appears anywhere in UA string | Fast — string search |
| -m reg | Regex match | Complex patterns, anchoring | Slower — regex engine |
| -m str | Exact match | Exact UA string equality | Fastest — hash lookup |
| -m beg | Prefix match | UA starts with pattern | Fast |
-i makes any match case-insensitive. Always use -i for User-Agent matching — bots sometimes vary their capitalisation across versions.
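The four match methods in the table behave roughly like the following Python expressions (an illustrative sketch of the semantics, not HAProxy internals):

```python
import re

ua = "Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)"
needle = "gptbot"

m_sub = needle in ua.lower()              # -m sub -i: substring anywhere
m_beg = ua.lower().startswith(needle)     # -m beg -i: prefix only
m_str = ua.lower() == needle              # -m str -i: exact equality
m_reg = re.search(r"^Mozilla.*GPTBot", ua, re.IGNORECASE) is not None  # -m reg -i
```

Here m_sub and m_reg match this UA, while m_beg and m_str do not, which is why substring matching is the usual choice: real bot UAs embed the bot name mid-string.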
Regex example (more precise but slower)
acl is_ai_bot req.hdr(User-Agent) -m reg -i \
(GPTBot|ClaudeBot|anthropic-ai|CCBot|Google-Extended|AhrefsBot|Bytespider|Amazonbot|Diffbot|FacebookBot|cohere-ai|PerplexityBot|YouBot)
http-request deny status 403 if is_ai_bot

-m sub -i with a list of bot names is the best balance of clarity, performance, and coverage. Use -m reg only if you need anchoring (^/$) or more complex pattern logic.

X-Robots-Tag response header
Use http-response set-header to add X-Robots-Tag to all responses. Place in the backend block so it applies to responses from your upstream server (not to HAProxy-generated error pages):
backend app
balance roundrobin
option forwardfor
http-response set-header X-Robots-Tag "noai, noimageai"
server app1 127.0.0.1:3000 check
server app2 127.0.0.1:3001 check

Apply to all responses including error pages (frontend placement)
frontend http-in
bind *:80
# Apply X-Robots-Tag to everything, including HAProxy error pages
http-response set-header X-Robots-Tag "noai, noimageai"
default_backend app

In the frontend, http-response set-header applies to all responses including HAProxy-generated 400/403/503 pages. In the backend, it only applies to responses proxied from your upstream server. For SEO headers, backend placement is usually preferable — you don't need X-Robots-Tag on error responses.

Add to existing header (if upstream already sets it)
# Replace the header entirely (preferred — avoid duplicate headers)
http-response set-header X-Robots-Tag "noai, noimageai"
# Or add to existing value
http-response add-header X-Robots-Tag "noai, noimageai"

Serving robots.txt from HAProxy
HAProxy can serve a static robots.txt response directly from the frontend, without forwarding the request to the backend:
frontend http-in
bind *:80
bind *:443 ssl crt /etc/haproxy/certs/
# Serve robots.txt directly from HAProxy
acl is_robots_txt path /robots.txt
http-request return status 200 \
content-type "text/plain" \
string "User-agent: *\nAllow: /\n\nUser-agent: GPTBot\nDisallow: /\n\nUser-agent: ClaudeBot\nDisallow: /\n\nUser-agent: anthropic-ai\nDisallow: /\n\nUser-agent: CCBot\nDisallow: /\n\nUser-agent: Google-Extended\nDisallow: /\n" \
if is_robots_txt
default_backend app

The inline-string form of http-request return requires HAProxy 2.2+. On older versions, use http-request redirect to a static file served by your backend, or upgrade to HAProxy 2.2+.

Serve robots.txt from a file (HAProxy 2.4+)
# haproxy.cfg — HAProxy 2.4+
frontend http-in
bind *:80
acl is_robots_txt path /robots.txt
http-request return status 200 content-type "text/plain" file /etc/haproxy/robots.txt if is_robots_txt
default_backend app

Create /etc/haproxy/robots.txt:
User-agent: *
Allow: /
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: AhrefsBot
Disallow: /
User-agent: PerplexityBot
Disallow: /

Rate limiting with stick tables
HAProxy's stick tables provide per-IP rate limiting without any additional software. This complements UA blocking — limiting aggressive crawlers even if they spoof their User-Agent:
backend rate_limit_table
stick-table type ip size 1m expire 10m store gpc0,http_req_rate(10s)
frontend http-in
bind *:80
# Track request rate per IP using the stick table in rate_limit_table backend
http-request track-sc0 src table rate_limit_table
# Block IPs making more than 100 requests in 10 seconds
acl too_many_requests sc_http_req_rate(0) gt 100
http-request deny status 429 if too_many_requests
# Block known AI bots by UA (evaluated after the rate check above)
acl is_ai_bot req.hdr(User-Agent) -m sub -i \
GPTBot ClaudeBot anthropic-ai CCBot Google-Extended \
AhrefsBot Bytespider Amazonbot Diffbot PerplexityBot
http-request deny status 403 if is_ai_bot
default_backend app

Key stick table options
| Option | Description |
|---|---|
| type ip | Key by client IP address |
| size 1m | Store up to 1 million entries |
| expire 10m | Remove entries after 10 minutes of inactivity |
| store gpc0 | General Purpose Counter 0 — for manual counters |
| store http_req_rate(10s) | Track request rate over a 10-second sliding window |
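As a rough mental model, the http_req_rate(10s) check behaves like the Python sketch below (illustrative only: HAProxy stores an approximated per-period rate in the stick table rather than an exact sliding window, and the names here are made up):

```python
import time
from collections import defaultdict, deque

# Models: stick-table type ip ... store http_req_rate(10s)
#    and: acl too_many_requests sc_http_req_rate(0) gt 100
WINDOW = 10.0  # seconds
LIMIT = 100    # requests per window

_seen = defaultdict(deque)  # ip -> timestamps of requests inside the window

def allow(ip, now=None):
    """Return True if this request is under the limit, else False (429)."""
    if now is None:
        now = time.monotonic()
    q = _seen[ip]
    while q and now - q[0] > WINDOW:
        q.popleft()   # drop requests that have left the 10 s window
    if len(q) >= LIMIT:
        return False  # HAProxy would answer: http-request deny status 429
    q.append(now)
    return True
```

Note that the table is keyed per IP: one client exceeding the limit does not affect others, and entries for idle IPs simply age out (HAProxy's expire 10m does the equivalent cleanup).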
Logging blocked bots
Log blocked bot requests to a separate file for analysis without polluting your main access log:
global
log /dev/log local0
log /dev/log local1 notice
defaults
log global
option httplog
option dontlognull
frontend http-in
bind *:80
acl is_ai_bot req.hdr(User-Agent) -m sub -i \
GPTBot ClaudeBot anthropic-ai CCBot Google-Extended \
AhrefsBot Bytespider Amazonbot Diffbot PerplexityBot
# Capture User-Agent for logging (first 100 chars)
http-request capture req.hdr(User-Agent) len 100
# Add custom log tag for blocked bots
http-request set-log-level warning if is_ai_bot
http-request deny status 403 if is_ai_bot
default_backend app

Custom log format showing blocked bot UA
defaults
log-format "%ci:%cp [%t] %ft %b/%s %Tq/%Tw/%Tc/%Tr/%Tt %ST %B %tsc %ac/%fc/%bc/%sc/%rc %{+Q}r %[capture.req.hdr(0)]"

The %[capture.req.hdr(0)] field outputs the first captured header (the User-Agent in the config above), making it easy to grep blocked bot names from logs.
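A small parser can pull the blocked-bot User-Agents out of such logs. The example line and regex below are a hypothetical sketch matching the log-format above, not a complete HAProxy log grammar:

```python
import re

# Fields, in order: client, [date], frontend, backend/server, timers,
# status, bytes, termination state, connection counts, "request", UA.
LOG_RE = re.compile(
    r'^(?P<client>\S+) \[(?P<ts>[^\]]+)\] (?P<fe>\S+) (?P<be>\S+) '
    r'(?P<timers>\S+) (?P<status>\d{3}) (?P<bytes>\d+) (?P<term>\S+) '
    r'(?P<conns>\S+) "(?P<request>[^"]*)" (?P<ua>.*)$'
)

def blocked_bot_ua(line):
    """Return the captured User-Agent for 403 (denied) requests, else None."""
    m = LOG_RE.match(line)
    if m and m.group("status") == "403":
        return m.group("ua")
    return None

# Hypothetical log line in the format above
example = ('203.0.113.7:51234 [10/Jan/2026:12:00:01.123] http-in app/app1 '
           '0/0/1/2/3 403 187 ---- 1/1/0/0/0 "GET /page HTTP/1.1" '
           'Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)')
```

Running blocked_bot_ua(example) yields the GPTBot User-Agent string; lines with any other status return None.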
Full haproxy.cfg example
global
log /dev/log local0
log /dev/log local1 notice
chroot /var/lib/haproxy
stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
stats timeout 30s
user haproxy
group haproxy
daemon
maxconn 50000
defaults
log global
mode http
option httplog
option dontlognull
option forwardfor
option http-server-close
timeout connect 5s
timeout client 30s
timeout server 30s
errorfile 400 /etc/haproxy/errors/400.http
errorfile 403 /etc/haproxy/errors/403.http
errorfile 429 /etc/haproxy/errors/429.http
errorfile 503 /etc/haproxy/errors/503.http
# Stick table for rate limiting (backend with no servers = pure table)
backend rate_limit_table
stick-table type ip size 1m expire 10m store gpc0,http_req_rate(10s)
frontend http-in
bind *:80
bind *:443 ssl crt /etc/haproxy/certs/example.com.pem alpn h2,http/1.1
# Redirect HTTP to HTTPS
http-request redirect scheme https unless { ssl_fc }
# Capture User-Agent for logging
http-request capture req.hdr(User-Agent) len 100
# Rate limiting — track requests per IP
http-request track-sc0 src table rate_limit_table
acl too_many_requests sc_http_req_rate(0) gt 100
http-request deny status 429 if too_many_requests
# AI bot blocking by User-Agent
acl is_ai_bot req.hdr(User-Agent) -m sub -i \
GPTBot \
ClaudeBot \
anthropic-ai \
CCBot \
Google-Extended \
AhrefsBot \
Bytespider \
Amazonbot \
Diffbot \
FacebookBot \
cohere-ai \
PerplexityBot \
YouBot
http-request set-log-level warning if is_ai_bot
# Serve robots.txt directly (HAProxy 2.4+), placed before the deny so
# blocked bots can still fetch the crawl rules
acl is_robots_txt path /robots.txt
http-request return status 200 content-type "text/plain" \
file /etc/haproxy/robots.txt if is_robots_txt
http-request deny status 403 if is_ai_bot
# ACME challenge passthrough (if using Let's Encrypt)
acl is_acme path_beg /.well-known/acme-challenge/
use_backend acme_backend if is_acme
default_backend app
backend app
balance leastconn
option httpchk GET /health
http-response set-header X-Robots-Tag "noai, noimageai"
http-response set-header Strict-Transport-Security "max-age=31536000; includeSubDomains; preload"
server app1 127.0.0.1:3000 check inter 5s rise 2 fall 3
server app2 127.0.0.1:3001 check inter 5s rise 2 fall 3
backend acme_backend
server acme 127.0.0.1:8080
# Stats page (disable in production or restrict to internal IPs)
listen stats
bind *:8404
stats enable
stats uri /stats
stats refresh 10s
acl local_net src 10.0.0.0/8 172.16.0.0/12 192.168.0.0/16
http-request deny unless local_net

Docker deployment
docker-compose.yml
services:
haproxy:
image: haproxy:2.8-alpine
ports:
- "80:80"
- "443:443"
volumes:
- ./haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg:ro
- ./certs:/etc/haproxy/certs:ro
- ./robots.txt:/etc/haproxy/robots.txt:ro
- haproxy_run:/run/haproxy
depends_on:
- app
restart: unless-stopped
app:
image: your-app:latest
expose:
- "3000"
restart: unless-stopped
volumes:
haproxy_run:

HAProxy 2.4+ is required for http-request return ... file; HAProxy 2.2+ for the inline-string form. Check your version with haproxy -v.

Reload config without downtime
# Send SIGUSR2 for graceful reload (HAProxy 1.8+)
docker kill --signal=SIGUSR2 haproxy_container
# Or use the runtime CLI (the reload command needs the master socket
# in master-worker mode)
echo "reload" | socat stdio /run/haproxy/admin.sock

FAQ
How do I block AI bots by User-Agent in HAProxy?
Define an ACL in the frontend block using req.hdr(User-Agent) -m sub -i for case-insensitive substring matching, then http-request deny status 403 if is_ai_bot. List multiple bot names space-separated on one ACL line (OR logic).
What is the difference between -m sub and -m reg in HAProxy ACLs?
-m sub does substring matching — efficient for simple bot name checks. -m reg uses regular expressions — more flexible but slower. For bot blocking, -m sub is preferred: list each bot name on the same ACL line (space-separated = OR) for fast matching without regex overhead.
How do I add X-Robots-Tag in HAProxy?
Use http-response set-header X-Robots-Tag "noai, noimageai" in the backend block. Backend placement applies only to proxied responses (not HAProxy error pages). Frontend placement applies to all responses including error pages.
Can HAProxy serve robots.txt directly without a backend?
Yes — use http-request return status 200 content-type "text/plain" file /etc/haproxy/robots.txt (HAProxy 2.4+) or an inline string (HAProxy 2.2+). Define an ACL for path /robots.txt and return before forwarding to the backend.
How do I rate-limit AI bots in HAProxy?
Use stick tables with http_req_rate(10s). Define a backend with just a stick-table (no servers), track requests per IP in the frontend with http-request track-sc0 src table rate_limit_table, then deny when sc_http_req_rate(0) gt 100.
Should I block AI bots in the frontend or backend in HAProxy?
Frontend — it stops the request before it reaches the backend, saving backend resources (connections, threads, DB queries). Use http-request deny in the frontend block. X-Robots-Tag headers go in the backend, so they only apply to proxied responses.
Is your site protected from AI bots?
Run a free scan to check your robots.txt, meta tags, and overall AI readiness score.