How to Block AI Bots in R Plumber
Plumber converts R functions into REST API endpoints using roxygen-style comments. If you expose models, data, or research via a Plumber API, AI crawlers can harvest your responses at scale. Plumber uses a filter system — functions registered with pr_filter() that run before route handlers. The Plumber-specific detail: short-circuiting means returning a value without calling forward(). If forward() is called, the pipeline continues; if it is omitted, the return value becomes the response body and no further processing occurs.
1. Bot pattern helper
Define patterns once in a sourced file. In Plumber (Rook spec), the User-Agent header arrives as req$HTTP_USER_AGENT — uppercased, hyphen replaced with underscore, HTTP_ prefix added. Apply tolower() before matching and use grepl(..., fixed = TRUE) for literal substring search without regex overhead.
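The Rook renaming rule is mechanical enough to express as a one-liner (rook_header_name is my own illustrative helper — plumber applies this mapping for you):

```r
# Rook-spec renaming: uppercase, "-" -> "_", then prefix "HTTP_"
rook_header_name <- function(header) {
  paste0("HTTP_", gsub("-", "_", toupper(header), fixed = TRUE))
}

rook_header_name("User-Agent")   # "HTTP_USER_AGENT"
rook_header_name("X-Robots-Tag") # "HTTP_X_ROBOTS_TAG"
```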
# R/ai_bots.R
# Bot-pattern vector — kept in one place so filters and tests share it
AI_BOT_PATTERNS <- c(
"gptbot",
"chatgpt-user",
"claudebot",
"anthropic-ai",
"ccbot",
"google-extended",
"cohere-ai",
"meta-externalagent",
"bytespider",
"omgili",
"diffbot",
"imagesiftbot",
"magpie-crawler",
"amazonbot",
"dataprovider",
"netcraft"
)
#' Check whether a User-Agent string belongs to a known AI bot
#'
#' @param ua Character string — the raw User-Agent header value.
#' @return Logical TRUE if any pattern matches, FALSE otherwise.
is_ai_bot <- function(ua) {
if (is.null(ua) || !nzchar(ua)) return(FALSE)
ua_lower <- tolower(ua)
# grepl(..., fixed = TRUE) — literal substring, faster than regex
any(vapply(AI_BOT_PATTERNS, function(p) grepl(p, ua_lower, fixed = TRUE), logical(1L)))
}
2. pr_filter() — bot blocking filter
Register the filter with pr_filter(name, function). The function receives req, res, and can call forward() to continue. To short-circuit, set res$status, optionally set headers via res$setHeader(), and return a body string without calling forward().
# R/plumber.R
library(plumber)
source("R/ai_bots.R")
# ── Build the router ──────────────────────────────────────────────────────────
pr <- plumber::pr()
# ── AI-bot filter ─────────────────────────────────────────────────────────────
# Filters run in registration order before any route handler.
# Returning a value WITHOUT calling forward() short-circuits the pipeline.
pr <- pr |>
pr_filter("ai_bot_blocker", function(req, res) {
# Always let AI crawlers fetch robots.txt so they learn they're blocked
if (!is.null(req$PATH_INFO) && req$PATH_INFO == "/robots.txt") {
forward()
return(invisible(NULL))
}
ua <- req$HTTP_USER_AGENT # Rook-spec: uppercase, HTTP_ prefix, _ for -
if (is_ai_bot(ua)) {
res$status <- 403L
res$setHeader("X-Robots-Tag", "noai, noimageai")
res$setHeader("Content-Type", "text/plain; charset=utf-8")
# Return WITHOUT calling forward() — pipeline stops here
return("Forbidden")
}
# Pass-through: add header then continue
res$setHeader("X-Robots-Tag", "noai, noimageai")
forward()
})
# ── Routes ────────────────────────────────────────────────────────────────────
#* @get /
function() {
list(message = "Hello")
}
#* @get /robots.txt
#* @serializer text
function(res) {
res$setHeader("Content-Type", "text/plain")
readLines("public/robots.txt", warn = FALSE) |> paste(collapse = "\n")
}
3. Inline pipeline — single-file variant
For smaller APIs, chain pr_filter() directly onto the plumb() call. The pipe operator (|> in R 4.1+ or %>% from magrittr) keeps registration declarative and easy to read.
# Inline pipeline style — no separate source file
library(plumber)
AI_BOT_PATTERNS <- c(
"gptbot", "chatgpt-user", "claudebot", "anthropic-ai",
"ccbot", "google-extended", "cohere-ai", "meta-externalagent",
"bytespider", "omgili", "diffbot", "imagesiftbot",
"magpie-crawler", "amazonbot", "dataprovider", "netcraft"
)
is_ai_bot <- function(ua) {
if (is.null(ua) || !nzchar(ua)) return(FALSE)
ua_lower <- tolower(ua)
any(vapply(AI_BOT_PATTERNS, function(p) grepl(p, ua_lower, fixed = TRUE), logical(1L)))
}
plumb("R/plumber.R") |>
pr_filter("ai_bot_blocker", function(req, res) {
if (identical(req$PATH_INFO, "/robots.txt")) {
forward(); return(invisible(NULL))
}
if (is_ai_bot(req$HTTP_USER_AGENT)) {
res$status <- 403L
res$setHeader("X-Robots-Tag", "noai, noimageai")
return("Forbidden")
}
res$setHeader("X-Robots-Tag", "noai, noimageai")
forward()
}) |>
pr_run(host = "0.0.0.0", port = 8000)
4. Nested router — protect /api/* only
Use pr_mount() to attach a sub-router under a path prefix. Filters on the sub-router only apply to requests routed through that prefix — public health-check or webhook endpoints on the root router are unaffected.
# Nested router — apply bot blocker only to /api/* routes
library(plumber)
source("R/ai_bots.R")
# Protected sub-router
api_router <- pr() |>
pr_filter("ai_bot_blocker", function(req, res) {
if (is_ai_bot(req$HTTP_USER_AGENT)) {
res$status <- 403L
res$setHeader("X-Robots-Tag", "noai, noimageai")
return("Forbidden")
}
res$setHeader("X-Robots-Tag", "noai, noimageai")
forward()
}) |>
pr_get("/data", function() list(data = "sensitive"))
# Root router — public endpoints bypass the filter
root_router <- pr() |>
pr_get("/health", function() list(status = "ok")) |>
pr_mount("/api", api_router)
pr_run(root_router, host = "0.0.0.0", port = 8000)
5. public/robots.txt
Plumber does not serve static files automatically — robots.txt must be an explicit route or served via a static mount. The req$PATH_INFO == "/robots.txt" guard in the filter ensures AI crawlers can still fetch the file and discover they are disallowed.
# public/robots.txt
User-agent: *
Allow: /
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
6. Production entrypoint
A clean entrypoint sources helpers, wires the filter, and reads the port from the environment — standard for Posit Connect or Docker-based deployments.
# entrypoint.R — production startup
library(plumber)
source("R/ai_bots.R")
pr("R/plumber.R") |>
pr_filter("ai_bot_blocker", function(req, res) {
if (identical(req$PATH_INFO, "/robots.txt")) {
forward(); return(invisible(NULL))
}
if (is_ai_bot(req$HTTP_USER_AGENT)) {
res$status <- 403L
res$setHeader("X-Robots-Tag", "noai, noimageai")
return("Forbidden")
}
res$setHeader("X-Robots-Tag", "noai, noimageai")
forward()
}) |>
pr_run(host = "0.0.0.0", port = as.integer(Sys.getenv("PORT", "8000")))
Key points
- Rook header naming: All HTTP headers in Plumber arrive via the Rook spec — uppercased, hyphens become underscores, HTTP_ prefix added. User-Agent becomes req$HTTP_USER_AGENT. Never use the raw header name.
- Short-circuit = no forward(): The only way to stop the pipeline is to return a value without calling forward(). Plumber has no throw-style abort. Set res$status first, then return the body.
- forward() placement: forward() must be the last call in the pass-through path — any code after it still runs, but res modifications made after forward() may not be reflected. Set headers before calling forward().
- grepl fixed vs regex: grepl(p, ua, fixed = TRUE) is a literal substring search — no regex engine, no backtracking. For 16 fixed patterns it is faster than a single alternation regex and easier to audit.
- vapply over sapply: vapply with a typed FUN.VALUE enforces that each element returns exactly one logical — safer than sapply for production code, where a NULL or NA from a bad pattern would silently change behaviour.
- robots.txt bypass: Plumber runs filters on every request — there is no automatic static-file bypass. The PATH_INFO == "/robots.txt" guard is mandatory; without it, AI crawlers can never fetch the file that tells them they are disallowed.
- Posit Connect: Connect proxies requests through its own routing layer, but the HTTP_USER_AGENT header passes through unchanged — the filter works there without modification.
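Because filters are plain R functions, the blocking branch can be unit-tested without starting a server. The sketch below re-declares an abridged is_ai_bot() so it runs standalone, and fakes req/res with environments (mock_req, mock_res, and block_if_bot are my own names; real plumber res objects are R6 — the mock only mimics the fields the filter touches):

```r
# Abridged copies of the section-1 helpers so this snippet runs standalone
AI_BOT_PATTERNS <- c("gptbot", "claudebot", "ccbot")
is_ai_bot <- function(ua) {
  if (is.null(ua) || !nzchar(ua)) return(FALSE)
  ua_lower <- tolower(ua)
  any(vapply(AI_BOT_PATTERNS, function(p) grepl(p, ua_lower, fixed = TRUE), logical(1L)))
}

# Environment-backed mocks — environments give the reference semantics
# the filter relies on when it mutates res
mock_req <- function(ua) {
  req <- new.env()
  req$HTTP_USER_AGENT <- ua
  req
}
mock_res <- function() {
  res <- new.env()
  res$status <- 200L
  res$headers <- list()
  res$setHeader <- function(name, value) res$headers[[name]] <- value
  res
}

# The blocking branch of the filter, extracted for testing
block_if_bot <- function(req, res) {
  if (is_ai_bot(req$HTTP_USER_AGENT)) {
    res$status <- 403L
    res$setHeader("X-Robots-Tag", "noai, noimageai")
    return("Forbidden")
  }
  NULL  # pass-through: the real filter would call forward() here
}

# Bot UA: expect a 403 short-circuit with the body "Forbidden"
res <- mock_res()
body <- block_if_bot(mock_req("Mozilla/5.0 (compatible; GPTBot/1.0)"), res)
stopifnot(identical(body, "Forbidden"), res$status == 403L)

# Browser UA: expect pass-through, status untouched
res2 <- mock_res()
stopifnot(is.null(block_if_bot(mock_req("Mozilla/5.0 Firefox/126.0"), res2)),
          res2$status == 200L)
```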
Framework comparison — data-science API frameworks
| Framework | Middleware hook | Short-circuit | UA header |
|---|---|---|---|
| R Plumber | pr_filter() | return value, no forward() | req$HTTP_USER_AGENT |
| Python FastAPI | @app.middleware("http") | return Response(403) | request.headers["user-agent"] |
| Python Flask | @app.before_request | return make_response("", 403) | request.headers.get("User-Agent") |
| Julia Genie.jl | HTTP.jl middleware | return HTTP.Response(403, ...) | HTTP.header(req, "User-Agent") |
Plumber is unusual in signalling short-circuit vs pass-through by whether forward() was called, rather than by the type of the return value — closer in spirit to Rack (Ruby) or WAI (Haskell) than to Python's middleware or ASGI patterns. The Rook-spec header naming (HTTP_*) is the sharpest onboarding edge for developers coming from any other framework.
Dependencies
The filter uses only base R and the plumber package — no additional dependencies required.
# Install from CRAN
install.packages("plumber")
# Or using renv for reproducible environments
renv::install("plumber")
# Minimum version: plumber >= 1.0.0 (introduced pr_filter / pr_run / pr_mount)
# R >= 4.1.0 for native |> pipe operator
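The version floors in the comments above can be enforced at startup. A fail-fast guard sketch (check_runtime and the message strings are my own):

```r
# Fail fast if the runtime does not meet the documented minimums
check_runtime <- function() {
  if (getRversion() < "4.1.0") {
    stop("R >= 4.1.0 is required for the native |> pipe")
  }
  if (!requireNamespace("plumber", quietly = TRUE)) {
    stop("plumber is not installed; run install.packages(\"plumber\")")
  }
  if (utils::packageVersion("plumber") < "1.0.0") {
    stop("plumber >= 1.0.0 is required for pr_filter()/pr_run()/pr_mount()")
  }
  invisible(TRUE)
}
```

Call check_runtime() at the top of entrypoint.R, before building the router, so a misconfigured host fails with a clear message instead of an obscure parse or lookup error.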