How to Block AI Bots on Varnish Cache: Complete 2026 Guide
Varnish Cache is a high-performance HTTP accelerator (caching reverse proxy) used by major media publishers, e-commerce platforms, and CDNs. It is configured entirely through VCL (Varnish Configuration Language) — a domain-specific language for HTTP request handling. Bot blocking in Varnish is done in the vcl_recv subroutine, before cache lookup and before any backend hit.
vcl_recv — block bots before cache lookup
vcl_recv is the first subroutine called for every incoming request — it runs before cache lookup, before backend selection, and before any backend connection. This is the correct place to block bots: zero backend load, zero cache pollution.
vcl 4.1;
import std;
sub vcl_recv {
# Block AI training and scraping bots by User-Agent
if (req.http.User-Agent ~ "(?i)(GPTBot|ClaudeBot|anthropic-ai|CCBot|Google-Extended|AhrefsBot|Bytespider|Amazonbot|Diffbot|FacebookBot|cohere-ai|PerplexityBot|YouBot)") {
return(synth(403, "Forbidden"));
}
}
The ~ operator does PCRE regex matching, and the (?i) flag at the start makes the entire pattern case-insensitive. Alternatives are separated by | inside the group. Unlike nginx or HAProxy, Varnish requires a single regex: you cannot list values space-separated.
Note that vcl_hit runs only on cache hits, so a check placed there would let bots through on uncached URLs. vcl_recv runs unconditionally for every request.
Block and log (using std.log)
vcl 4.1;
import std;
sub vcl_recv {
if (req.http.User-Agent ~ "(?i)(GPTBot|ClaudeBot|anthropic-ai|CCBot|Google-Extended|AhrefsBot|Bytespider|Amazonbot|Diffbot|FacebookBot|cohere-ai|PerplexityBot|YouBot)") {
std.log("AI bot blocked: " + req.http.User-Agent);
return(synth(403, "Forbidden"));
}
}
std.log() writes to the Varnish shared memory log (VSL), readable with varnishlog -g request -q "VCL_Log ~ \"AI bot\"".
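Before deploying, the pattern is worth sanity-checking outside Varnish. Python's re syntax accepts this particular PCRE pattern unchanged, so a short sketch (the is_blocked helper is ours, not part of Varnish) shows which User-Agent strings would be caught:

```python
import re

# The same alternation used in the VCL above; (?i) makes it case-insensitive.
AI_BOT_PATTERN = re.compile(
    r"(?i)(GPTBot|ClaudeBot|anthropic-ai|CCBot|Google-Extended|AhrefsBot|"
    r"Bytespider|Amazonbot|Diffbot|FacebookBot|cohere-ai|PerplexityBot|YouBot)"
)

def is_blocked(user_agent: str) -> bool:
    """Return True if the User-Agent matches the AI-bot pattern."""
    return AI_BOT_PATTERN.search(user_agent) is not None

# search() matches anywhere in the string, like Varnish's ~ operator.
print(is_blocked("Mozilla/5.0 (compatible; GPTBot/1.0)"))         # True
print(is_blocked("mozilla/5.0 (compatible; ccbot/2.0)"))          # True (case-insensitive)
print(is_blocked("Mozilla/5.0 (Windows NT 10.0) Firefox/125.0"))  # False
```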
vcl_synth — custom 403 response
When return(synth(403, "Forbidden")) is called in vcl_recv, Varnish calls vcl_synth to build the synthetic response. Customise it to return a clean response body:
sub vcl_synth {
if (resp.status == 403) {
set resp.http.Content-Type = "text/plain; charset=utf-8";
set resp.http.X-Robots-Tag = "noindex";
synthetic("Forbidden" + {"
"});
return(deliver);
}
# Default synth handling for other status codes
return(deliver);
}
{" ... "} is VCL's long-string syntax, equivalent to a here-doc. The newline after Forbidden is inside the long string. Use long strings when a synthetic body contains quotes, special characters, or line breaks; note that the body cannot contain the two-character sequence "} , which terminates the long string.
Return JSON for API consumers
sub vcl_synth {
if (resp.status == 403) {
set resp.http.Content-Type = "application/json; charset=utf-8";
# Key order matters: a long string ends at the first "} sequence,
# so the JSON must not contain a quote immediately followed by }.
synthetic({"{"error":"Forbidden","status":403}"});
return(deliver);
}
}
X-Robots-Tag in vcl_backend_response / vcl_deliver
Add X-Robots-Tag to all responses. Two options depending on when you want to set it:
vcl_backend_response — set on backend response (before caching)
sub vcl_backend_response {
# Add X-Robots-Tag to all backend responses
# This value is cached alongside the object
set beresp.http.X-Robots-Tag = "noai, noimageai";
}
Headers set in vcl_backend_response are stored in Varnish's cache alongside the object, so all subsequent cache hits include the header without another backend request.
vcl_deliver — set on delivery to client (after cache lookup)
sub vcl_deliver {
# Set X-Robots-Tag on every response sent to the client
# Use this if you need to set/override regardless of cache state
set resp.http.X-Robots-Tag = "noai, noimageai";
# Optional: remove internal headers before delivery
unset resp.http.X-Varnish;
unset resp.http.Via;
}
vcl_deliver runs just before the response is sent to the client and can override headers set in vcl_backend_response. Use it when you need unconditional header injection regardless of cache state.
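As a conceptual sketch only (plain Python, not Varnish internals; all names are illustrative), this toy model shows why a header set at delivery time applies to cache hits as well as misses, while a backend-stage header travels with the cached object:

```python
# Toy model of the two header hooks. Not Varnish code.
cache = {}

def backend_fetch(url):
    # Simulates vcl_backend_response: the header is stored with the object.
    obj = {"body": "hello", "headers": {"X-Robots-Tag": "noai, noimageai",
                                        "X-Varnish": "12345"}}
    cache[url] = obj
    return obj

def deliver(url):
    # Simulates vcl_deliver: runs on every response, hit or miss,
    # and can add/override headers without touching the cached copy.
    obj = cache.get(url) or backend_fetch(url)
    resp_headers = dict(obj["headers"])
    resp_headers["X-Robots-Tag"] = "noai, noimageai"  # unconditional override
    resp_headers.pop("X-Varnish", None)               # strip internals
    return resp_headers

print(deliver("/page")["X-Robots-Tag"])   # noai, noimageai (miss: fetched)
print(deliver("/page")["X-Robots-Tag"])   # noai, noimageai (hit: served from cache)
```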
Serving robots.txt from VCL
Serve robots.txt directly from Varnish without a backend hit:
sub vcl_recv {
# Serve robots.txt directly from Varnish (no backend hit)
if (req.url == "/robots.txt") {
return(synth(200, "OK"));
}
# ... rest of vcl_recv
}
sub vcl_synth {
if (resp.status == 200 && req.url == "/robots.txt") {
set resp.http.Content-Type = "text/plain; charset=utf-8";
set resp.http.Cache-Control = "public, max-age=86400";
synthetic({"User-agent: *
Allow: /
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: AhrefsBot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: Diffbot
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: YouBot
Disallow: /
Sitemap: https://example.com/sitemap.xml
"});
return(deliver);
}
if (resp.status == 403) {
set resp.http.Content-Type = "text/plain; charset=utf-8";
synthetic("Forbidden");
return(deliver);
}
}
Rate limiting with vsthrottle
The vsthrottle VMOD provides per-key rate limiting. It's available in the varnish-modules package (open source) and bundled with Varnish Enterprise:
vcl 4.1;
import vsthrottle;
sub vcl_recv {
# Block AI bots by UA first (fastest path)
if (req.http.User-Agent ~ "(?i)(GPTBot|ClaudeBot|anthropic-ai|CCBot|Google-Extended|AhrefsBot|Bytespider|Amazonbot|Diffbot|FacebookBot|cohere-ai|PerplexityBot|YouBot)") {
return(synth(403, "Forbidden"));
}
# Rate limit: 100 requests per 10 seconds per IP
# Key: client IP (use X-Forwarded-For if behind a load balancer)
if (vsthrottle.is_denied(req.http.X-Forwarded-For, 100, 10s)) {
return(synth(429, "Too Many Requests"));
}
}
Install varnish-modules (Ubuntu/Debian)
apt-get install varnish-modules
Install varnish-modules (from source)
git clone https://github.com/varnish/varnish-modules.git
cd varnish-modules
./bootstrap
./configure
make
make install
Using client.ip as the key works for direct connections. If Varnish is behind a load balancer, use req.http.X-Forwarded-For, but validate it first to prevent IP spoofing. For production, prefer a trusted header set by your load balancer (e.g. req.http.X-Real-IP).
VCL ACL for IP-based exceptions
VCL's acl statement defines IP ranges. Use it to whitelist your own crawlers or monitoring services from the bot-blocking rules:
vcl 4.1;
import std;
import vsthrottle;
# Trusted IPs — bypass bot blocking (your own crawlers, monitoring)
acl trusted_crawlers {
"127.0.0.1";
"10.0.0.0"/8;
"192.168.0.0"/16;
"203.0.113.42"; # your monitoring service IP
}
sub vcl_recv {
# Bypass all checks for trusted crawlers
if (client.ip ~ trusted_crawlers) {
return(pass);
}
# Block AI bots
if (req.http.User-Agent ~ "(?i)(GPTBot|ClaudeBot|anthropic-ai|CCBot|Google-Extended|AhrefsBot|Bytespider|Amazonbot|Diffbot|FacebookBot|cohere-ai|PerplexityBot|YouBot)") {
std.log("AI bot blocked: " + req.http.User-Agent);
return(synth(403, "Forbidden"));
}
}
Full VCL example
vcl 4.1;
import std;
import vsthrottle;
# Backend definition
backend default {
.host = "127.0.0.1";
.port = "8080";
.connect_timeout = 5s;
.first_byte_timeout = 30s;
.between_bytes_timeout = 10s;
.probe = {
.url = "/health";
.timeout = 2s;
.interval = 5s;
.window = 5;
.threshold = 3;
}
}
# Trusted IPs — bypass bot blocking
acl trusted_crawlers {
"127.0.0.1";
"10.0.0.0"/8;
"192.168.0.0"/16;
}
sub vcl_recv {
# Health check passthrough
if (req.url == "/health") {
return(pass);
}
# Serve robots.txt from Varnish directly
if (req.url == "/robots.txt") {
return(synth(800, "robots"));
}
# Trusted IPs bypass bot blocking
if (client.ip ~ trusted_crawlers) {
return(pass);
}
# Block AI bots by User-Agent
if (req.http.User-Agent ~ "(?i)(GPTBot|ClaudeBot|anthropic-ai|CCBot|Google-Extended|AhrefsBot|Bytespider|Amazonbot|Diffbot|FacebookBot|cohere-ai|PerplexityBot|YouBot)") {
std.log("AI bot blocked UA: " + req.http.User-Agent);
return(synth(403, "Forbidden"));
}
# Rate limiting: 200 req / 10s per IP + User-Agent combination
if (vsthrottle.is_denied(req.http.X-Forwarded-For + req.http.User-Agent, 200, 10s)) {
return(synth(429, "Too Many Requests"));
}
# Strip cookies on static assets (allow caching)
if (req.url ~ "\.(css|js|png|jpg|jpeg|gif|ico|woff2?|svg)$") {
unset req.http.Cookie;
}
return(hash);
}
sub vcl_backend_response {
# Add X-Robots-Tag to all backend responses (cached with object)
set beresp.http.X-Robots-Tag = "noai, noimageai";
# Cache static assets for 1 day
if (bereq.url ~ "\.(css|js|png|jpg|jpeg|gif|ico|woff2?|svg)$") {
set beresp.ttl = 1d;
set beresp.http.Cache-Control = "public, max-age=86400";
unset beresp.http.Set-Cookie;
}
}
sub vcl_deliver {
# Ensure X-Robots-Tag is on every delivery (including cache hits)
if (!resp.http.X-Robots-Tag) {
set resp.http.X-Robots-Tag = "noai, noimageai";
}
# Add cache status header for debugging
if (obj.hits > 0) {
set resp.http.X-Cache = "HIT";
} else {
set resp.http.X-Cache = "MISS";
}
# Remove Varnish internals from response
unset resp.http.X-Varnish;
unset resp.http.Via;
}
sub vcl_synth {
# robots.txt (custom status 800)
if (resp.status == 800) {
set resp.status = 200;
set resp.http.Content-Type = "text/plain; charset=utf-8";
set resp.http.Cache-Control = "public, max-age=86400";
synthetic({"User-agent: *
Allow: /
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: AhrefsBot
Disallow: /
Sitemap: https://example.com/sitemap.xml
"});
return(deliver);
}
# 403 Forbidden
if (resp.status == 403) {
set resp.http.Content-Type = "text/plain; charset=utf-8";
synthetic("Forbidden");
return(deliver);
}
# 429 Too Many Requests
if (resp.status == 429) {
set resp.http.Content-Type = "text/plain; charset=utf-8";
set resp.http.Retry-After = "60";
synthetic("Too Many Requests");
return(deliver);
}
return(deliver);
}
The synth(800, "robots") call in vcl_recv uses 800 as an internal marker handled in vcl_synth. Varnish allows any status code in synth(), and using a code outside the standard 200–599 range is a common pattern for internal routing logic. Set it back to 200 in vcl_synth before delivering.
Docker deployment
docker-compose.yml
services:
  varnish:
    image: varnish:7.5-alpine
    ports:
      - "80:80"
      - "8443:8443"
    volumes:
      - ./default.vcl:/etc/varnish/default.vcl:ro
    environment:
      - VARNISH_SIZE=256m
    command: >
      -a 0.0.0.0:80,HTTP
      -a 0.0.0.0:8443,PROXY
      -f /etc/varnish/default.vcl
      -s malloc,256m
    depends_on:
      - app
  app:
    image: your-app:latest
    expose:
      - "8080"
# For HTTPS: put nginx or caddy in front of Varnish for TLS termination.
# Varnish does not handle TLS natively in the open-source version.
Reload VCL without restart
# Load new VCL
varnishadm vcl.load newconfig /etc/varnish/default.vcl
# Activate it
varnishadm vcl.use newconfig
# Verify
varnishadm vcl.list
Inspect blocked requests
# Watch all VCL log messages in real time
varnishlog -g request -q 'VCL_Log ~ "AI bot"'
# Count blocked bot requests
varnishstat -f MAIN.s_synth
FAQ
How do I block AI bots by User-Agent in Varnish?
In vcl_recv, use req.http.User-Agent ~ "(?i)(GPTBot|ClaudeBot|...)" then return(synth(403, "Forbidden")). The ~ operator does PCRE regex matching; (?i) makes it case-insensitive.
What is the difference between vcl_recv and vcl_pass in Varnish?
vcl_recv runs for every incoming request before cache lookup — the correct place for bot blocking. vcl_pass runs when a request is explicitly passed to the backend (bypassing cache). Block in vcl_recv so all requests are checked, cached or not.
How do I add X-Robots-Tag in Varnish?
In vcl_backend_response: set beresp.http.X-Robots-Tag = "noai, noimageai" — cached with the object. Or in vcl_deliver: set resp.http.X-Robots-Tag = "noai, noimageai" — applied on every delivery including cache hits, not stored in cache.
Can Varnish serve robots.txt without hitting the backend?
Yes — detect req.url == "/robots.txt" in vcl_recv and call return(synth(800, "robots")). In vcl_synth, set resp.status = 200, set the Content-Type, and use synthetic() with the robots.txt content.
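One way to verify that the synthetic robots.txt behaves as intended is to feed it to Python's standard-library robots.txt parser. This sketch uses an abbreviated two-bot version of the policy above:

```python
from urllib.robotparser import RobotFileParser

# Abbreviated version of the synthetic robots.txt served from vcl_synth.
ROBOTS_TXT = """\
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("GPTBot", "https://example.com/article"))        # False
print(parser.can_fetch("ClaudeBot", "https://example.com/article"))     # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/article"))  # True
```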
How do I rate-limit bots in Varnish?
Install the varnish-modules package for the vsthrottle VMOD. In vcl_recv: vsthrottle.is_denied(req.http.X-Forwarded-For, 100, 10s) returns true if the client exceeded 100 requests in 10 seconds. Return synth(429) if denied.
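The VMOD implements this with a token bucket internally; as a rough illustration of the is_denied(key, limit, period) semantics, here is a sliding-window sketch in Python. It approximates the behaviour only and is not the VMOD's actual algorithm:

```python
import time
from collections import defaultdict, deque

# Per-key request timestamps; approximates vsthrottle's per-key state.
_hits = defaultdict(deque)

def is_denied(key: str, limit: int, period: float, now=None) -> bool:
    """Return True once `key` has made `limit` requests within `period` seconds."""
    now = time.monotonic() if now is None else now
    window = _hits[key]
    while window and window[0] <= now - period:
        window.popleft()          # drop requests older than the window
    if len(window) >= limit:
        return True               # over the limit: deny, do not record
    window.append(now)
    return False

# 3 requests allowed per 10 s window: the 4th within the window is denied.
print([is_denied("10.0.0.1", 3, 10, now=t) for t in (0, 1, 2, 3)])
# [False, False, False, True]
```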
Should I block bots in vcl_recv or at the backend level?
Always in vcl_recv — it fires before cache lookup and before any backend connection. Blocking here means zero backend load from blocked bots. Backend-level blocking wastes a connection and thread for every blocked request.