
How to Block AI Bots in R Plumber

Plumber converts R functions into REST API endpoints using roxygen-style comments. If you expose models, data, or research through a Plumber API, AI crawlers can harvest your responses at scale. Plumber's request pipeline is built on filters — functions registered with pr_filter() that run before any route handler. The Plumber-specific detail is how short-circuiting works: if a filter calls forward(), the pipeline continues to the next filter or the matched handler; if it returns a value without calling forward(), that value becomes the response body and no further processing occurs.

1. Bot pattern helper

Define patterns once in a sourced file. In Plumber (Rook spec), the User-Agent header arrives as req$HTTP_USER_AGENT — uppercased, hyphen replaced with underscore, HTTP_ prefix added. Apply tolower() before matching and use grepl(..., fixed = TRUE) for literal substring search without regex overhead.

# R/ai_bots.R
# Bot-pattern vector — kept in one place so filters and tests share it

AI_BOT_PATTERNS <- c(
  "gptbot",
  "chatgpt-user",
  "claudebot",
  "anthropic-ai",
  "ccbot",
  "google-extended",
  "cohere-ai",
  "meta-externalagent",
  "bytespider",
  "omgili",
  "diffbot",
  "imagesiftbot",
  "magpie-crawler",
  "amazonbot",
  "dataprovider",
  "netcraft"
)

#' Check whether a User-Agent string belongs to a known AI bot
#'
#' @param ua Character string — the raw User-Agent header value.
#' @return Logical TRUE if any pattern matches, FALSE otherwise.
is_ai_bot <- function(ua) {
  if (is.null(ua) || !nzchar(ua)) return(FALSE)
  ua_lower <- tolower(ua)
  # grepl(..., fixed = TRUE) — literal substring, faster than regex
  any(vapply(AI_BOT_PATTERNS, function(p) grepl(p, ua_lower, fixed = TRUE), logical(1L)))
}
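A quick standalone sanity check of the matcher. The User-Agent strings below are illustrative examples (not verbatim real bot signatures), and the pattern vector is condensed to a subset so the snippet is self-contained:

```r
# Standalone check — condensed copy of the helper with a pattern subset
AI_BOT_PATTERNS <- c("gptbot", "claudebot", "ccbot")

is_ai_bot <- function(ua) {
  if (is.null(ua) || !nzchar(ua)) return(FALSE)
  any(vapply(AI_BOT_PATTERNS, function(p) grepl(p, tolower(ua), fixed = TRUE),
             logical(1L)))
}

stopifnot(
  is_ai_bot("Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"),
  is_ai_bot("mozilla/5.0 (compatible; claudebot/1.0)"),  # case-insensitive
  !is_ai_bot("Mozilla/5.0 (X11; Linux x86_64) Firefox/126.0"),
  !is_ai_bot(NULL),  # absent header must not error
  !is_ai_bot("")     # empty header
)
```

Because matching is literal-substring on the lowercased header, version suffixes and crawler URLs in the UA string do not break detection.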

2. pr_filter() — bot blocking filter

Register the filter with pr_filter(name, function). The function receives req, res, and can call forward() to continue. To short-circuit, set res$status, optionally set headers via res$setHeader(), and return a body string without calling forward().

# R/plumber.R
library(plumber)

source("R/ai_bots.R")

# ── Build the router ──────────────────────────────────────────────────────────
pr <- plumber::pr()

# ── AI-bot filter ─────────────────────────────────────────────────────────────
# Filters run in registration order before any route handler.
# Returning a value WITHOUT calling forward() short-circuits the pipeline.
pr <- pr |>
  pr_filter("ai_bot_blocker", function(req, res) {
    # Always let AI crawlers fetch robots.txt so they learn they're blocked
    if (identical(req$PATH_INFO, "/robots.txt")) {
      return(forward())
    }

    ua <- req$HTTP_USER_AGENT  # Rook-spec: uppercase, HTTP_ prefix, _ for -

    if (is_ai_bot(ua)) {
      res$status <- 403L
      res$setHeader("X-Robots-Tag", "noai, noimageai")
      res$setHeader("Content-Type", "text/plain; charset=utf-8")
      # Return WITHOUT calling forward() — pipeline stops here
      return("Forbidden")
    }

    # Pass-through: add header then continue
    res$setHeader("X-Robots-Tag", "noai, noimageai")
    forward()
  })

# ── Routes ────────────────────────────────────────────────────────────────────
# Annotation-style routes (#* @get) are only parsed by plumb(); on a
# programmatic router, register handlers with pr_get() so they attach to `pr`.
pr <- pr |>
  pr_get("/", function() {
    list(message = "Hello")
  }) |>
  pr_get("/robots.txt", function(res) {
    res$setHeader("Content-Type", "text/plain; charset=utf-8")
    paste(readLines("public/robots.txt", warn = FALSE), collapse = "\n")
  }, serializer = serializer_text())

3. Inline pipeline — single-file variant

For smaller APIs, chain pr_filter() directly onto the plumb() call. The pipe operator (|> in R 4.1+ or %>% from magrittr) keeps registration declarative and easy to read.

# Inline pipeline style — no separate source file
library(plumber)

AI_BOT_PATTERNS <- c(
  "gptbot", "chatgpt-user", "claudebot", "anthropic-ai",
  "ccbot", "google-extended", "cohere-ai", "meta-externalagent",
  "bytespider", "omgili", "diffbot", "imagesiftbot",
  "magpie-crawler", "amazonbot", "dataprovider", "netcraft"
)

is_ai_bot <- function(ua) {
  if (is.null(ua) || !nzchar(ua)) return(FALSE)
  ua_lower <- tolower(ua)
  any(vapply(AI_BOT_PATTERNS, function(p) grepl(p, ua_lower, fixed = TRUE), logical(1L)))
}

plumb("R/plumber.R") |>
  pr_filter("ai_bot_blocker", function(req, res) {
    if (identical(req$PATH_INFO, "/robots.txt")) {
      return(forward())
    }
    if (is_ai_bot(req$HTTP_USER_AGENT)) {
      res$status <- 403L
      res$setHeader("X-Robots-Tag", "noai, noimageai")
      return("Forbidden")
    }
    res$setHeader("X-Robots-Tag", "noai, noimageai")
    forward()
  }) |>
  pr_run(host = "0.0.0.0", port = 8000)

4. Nested router — protect /api/* only

Use pr_mount() to attach a sub-router under a path prefix. Filters on the sub-router only apply to requests routed through that prefix — public health-check or webhook endpoints on the root router are unaffected.

# Nested router — apply bot blocker only to /api/* routes
library(plumber)

source("R/ai_bots.R")

# Protected sub-router
api_router <- pr() |>
  pr_filter("ai_bot_blocker", function(req, res) {
    if (is_ai_bot(req$HTTP_USER_AGENT)) {
      res$status <- 403L
      res$setHeader("X-Robots-Tag", "noai, noimageai")
      return("Forbidden")
    }
    res$setHeader("X-Robots-Tag", "noai, noimageai")
    forward()
  }) |>
  pr_get("/data", function() list(data = "sensitive"))

# Root router — public endpoints bypass the filter
root_router <- pr() |>
  pr_get("/health", function() list(status = "ok")) |>
  pr_mount("/api", api_router)

pr_run(root_router, host = "0.0.0.0", port = 8000)

5. public/robots.txt

Plumber does not serve static files automatically — robots.txt must be an explicit route or served via a static mount. The req$PATH_INFO == "/robots.txt" guard in the filter ensures AI crawlers can still fetch the file and discover they are disallowed.

# public/robots.txt
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
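If the bot list grows, the file can be generated from a vector of canonical User-agent tokens instead of being maintained by hand. A sketch — the token spellings are assumptions you should match against each crawler's documented User-Agent:

```r
# Build robots.txt text from a vector of bot User-agent tokens
bot_tokens <- c("GPTBot", "ClaudeBot", "CCBot", "Google-Extended")

robots_txt <- paste(
  c("User-agent: *\nAllow: /",
    sprintf("User-agent: %s\nDisallow: /", bot_tokens)),
  collapse = "\n\n"
)

# writeLines(robots_txt, "public/robots.txt")  # write the generated file
cat(robots_txt)
```

Keeping the canonical-token vector next to AI_BOT_PATTERNS makes it harder for the filter and robots.txt to drift apart.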

6. Production entrypoint

A clean entrypoint sources helpers, wires the filter, and reads the port from the environment — the standard shape for Posit Connect or Docker-based deployments.

# entrypoint.R — production startup
library(plumber)

source("R/ai_bots.R")

pr("R/plumber.R") |>
  pr_filter("ai_bot_blocker", function(req, res) {
    if (identical(req$PATH_INFO, "/robots.txt")) {
      return(forward())
    }
    if (is_ai_bot(req$HTTP_USER_AGENT)) {
      res$status <- 403L
      res$setHeader("X-Robots-Tag", "noai, noimageai")
      return("Forbidden")
    }
    res$setHeader("X-Robots-Tag", "noai, noimageai")
    forward()
  }) |>
  pr_run(host = "0.0.0.0", port = as.integer(Sys.getenv("PORT", 8000)))

Key points

Framework comparison — data-science API frameworks

Framework        Middleware hook           Short-circuit                    UA header
R Plumber        pr_filter()               return value, no forward()       req$HTTP_USER_AGENT
Python FastAPI   @app.middleware("http")   return Response(403)             request.headers["user-agent"]
Python Flask     @app.before_request       return make_response("", 403)    request.headers.get("User-Agent")
Julia Genie.jl   HTTP.jl middleware        return HTTP.Response(403, ...)   HTTP.header(req, "User-Agent")

Plumber's filter system is unusual in signaling short-circuit vs pass-through by whether the filter called forward(), rather than by returning an explicit response object — closer in spirit to Rack (Ruby) or WAI (Haskell) than to Python's WSGI/ASGI middleware patterns. The Rook-spec header naming (HTTP_*) is the sharpest onboarding edge for developers coming from any other framework.
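The Rook naming rule is mechanical, so it can be mirrored in a one-line helper (illustrative only — not part of plumber's API):

```r
# Derive the Rook-spec request key from an HTTP header name:
# uppercase, hyphens -> underscores, HTTP_ prefix
rook_header_key <- function(header_name) {
  paste0("HTTP_", gsub("-", "_", toupper(header_name), fixed = TRUE))
}

rook_header_key("User-Agent")       # "HTTP_USER_AGENT"
rook_header_key("X-Forwarded-For")  # "HTTP_X_FORWARDED_FOR"
```

This is why the filters above read req$HTTP_USER_AGENT rather than anything resembling "User-Agent".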

Dependencies

The filter logic itself is base R; the API as a whole needs only the plumber package — no additional dependencies required.

# Install from CRAN
install.packages("plumber")

# Or using renv for reproducible environments
renv::install("plumber")

# Minimum version: plumber >= 1.0.0 (introduced pr_filter / pr_run / pr_mount)
# R >= 4.1.0 for native |> pipe operator
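To enforce those minimums at startup rather than discover them at runtime, a small guard can sit at the top of entrypoint.R (a sketch; the plumber check is left commented so the snippet runs even before the package is installed):

```r
# Fail fast if the R runtime is too old for the native |> pipe used above
if (getRversion() < "4.1.0") {
  stop("R >= 4.1.0 is required for the native |> pipe operator")
}

# After installing plumber, enforce the package minimum the same way:
# stopifnot(utils::packageVersion("plumber") >= "1.0.0")
```

getRversion() returns a comparable version object, so string thresholds like "4.1.0" compare correctly component by component rather than lexically.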