
Blocking Bytespider: Why robots.txt Isn't Enough

ByteDance's Bytespider crawler has been documented ignoring robots.txt Disallow rules. Here's how to block it at the infrastructure level — nginx, Cloudflare, Vercel, Apache, and Next.js.

🛡️ 9 min read · Updated March 2026 · Open Shadow

⚡ TL;DR

  • Bytespider is ByteDance's web crawler, used to feed TikTok, Douyin, and Toutiao.
  • Despite a Disallow in robots.txt, independent researchers have documented it continuing to crawl blocked paths.
  • The fix: block it at the server layer (nginx, Cloudflare WAF, Vercel headers, Apache) so the 403 is served before any content is touched.
  • Blocking Bytespider has zero impact on Google, Bing, or any traditional SEO ranking.

What is Bytespider?

Bytespider is the web crawling infrastructure operated by ByteDance — the parent company of TikTok, Douyin, CapCut, and the Toutiao news platform. It crawls the public web to build the content databases that power recommendation algorithms, AI features, and training datasets across ByteDance's product suite.

Its primary user-agent string is:

Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; spider-feedback@bytedance.com)

Variations include bytespider (lowercase), Bytedance, and strings referencing zhanzhang.toutiao.com. When implementing blocks, always use case-insensitive matching.
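A single case-insensitive pattern catches every variant above. A quick sketch with `grep -i` against a hypothetical sample file (the variant strings are the ones listed in this guide; the set in the wild may differ):

```shell
# Hypothetical sample of user-agent strings, including the variants above
cat > /tmp/ua_samples.txt <<'EOF'
Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; spider-feedback@bytedance.com)
bytespider
Bytedance
SomeOtherBot/1.0
EOF

# One case-insensitive pattern (-i) matches all three ByteDance variants
# but not the unrelated bot; -c counts matching lines
grep -icE "bytespider|bytedance" /tmp/ua_samples.txt
# → 3
```

The same `bytespider|bytedance` pattern is what the nginx, Cloudflare, and Apache rules below use.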

Why robots.txt Isn't Reliable Against Bytespider

The Robots Exclusion Protocol is a convention — not a technical barrier. It works because crawlers choose to read and respect it. Reputable operators like OpenAI (GPTBot) and Anthropic (ClaudeBot) have publicly committed to honoring robots.txt, and independent log analyses have largely borne that out.

Bytespider is different. Multiple independent researchers have published access log analyses showing Bytespider crawling URLs that were explicitly listed under Disallow: /. The pattern is consistent: robots.txt is fetched, but the disallow directives are not enforced on subsequent crawl requests.
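You can audit your own access logs for the same pattern: Bytespider fetching robots.txt and then requesting paths you disallowed. A minimal sketch, assuming nginx's default combined log format; the sample log below is hypothetical, so substitute your real /var/log/nginx/access.log when auditing a live server:

```shell
# Hypothetical sample in nginx "combined" format
cat > /tmp/access.log <<'EOF'
1.2.3.4 - - [01/Mar/2026:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "Mozilla/5.0 (compatible; Bytespider; spider-feedback@bytedance.com)"
1.2.3.4 - - [01/Mar/2026:10:00:05 +0000] "GET /private/report HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Bytespider; spider-feedback@bytedance.com)"
5.6.7.8 - - [01/Mar/2026:10:01:00 +0000] "GET / HTTP/1.1" 200 900 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"
EOF

# Bytespider requests to anything other than robots.txt; if those paths
# fall under "Disallow: /", the directive was not honored
# ($7 is the request path in the combined log format)
grep -i "bytespider" /tmp/access.log | grep -v "robots.txt" | awk '{print $7}'
# → /private/report
```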

robots.txt — what you write
User-agent: Bytespider
Disallow: /

# ⚠️ Bytespider may read this and continue crawling anyway.
# This alone is not a reliable block.

The practical consequence: if you're relying exclusively on robots.txt to protect proprietary content, training data, or gated pages from Bytespider, you may have a false sense of security. Server-level enforcement is the only technically guaranteed method.

robots.txt vs Server-Level Blocking

| Method | How it works | Reliable vs Bytespider? |
|---|---|---|
| robots.txt Disallow | Crawler reads the file and voluntarily skips paths | ⚠️ Not guaranteed |
| noai meta tag | Per-page HTML signal; requires page to be fetched first | ⚠️ Page still served |
| nginx User-Agent block | Returns 403 before content is served | ✅ Technically enforced |
| Cloudflare WAF rule | Blocks at edge, before reaching your server | ✅ Technically enforced |
| Apache .htaccess block | Returns 403 at web server level | ✅ Technically enforced |
| Vercel headers config | Returns 403 via Vercel edge network | ✅ Technically enforced |
| Next.js middleware | Returns 403 at the application edge | ✅ Technically enforced |

Block Bytespider: nginx

Add this to your nginx.conf or your site's server block. The ~* makes the match case-insensitive.

nginx.conf
server {
    # ... your existing server config ...

    # Block Bytespider (case-insensitive user-agent match)
    if ($http_user_agent ~* "bytespider|bytedance") {
        return 403;
    }
}

After editing, reload nginx without downtime:

sudo nginx -t && sudo nginx -s reload

Block Bytespider: Cloudflare WAF

If your site is behind Cloudflare (free or paid), you can block Bytespider at the edge — before it ever reaches your server. No server access required.

Option A: Custom WAF Rule (Recommended)

  1. Go to your Cloudflare dashboard → Security → WAF → Custom rules
  2. Click Create rule
  3. Set the rule name: Block Bytespider
  4. Under Field, choose User Agent
  5. Set Operator to contains (case insensitive)
  6. Set Value to bytespider
  7. Action: Block (returns 403)

Option B: Cloudflare Firewall Expression

Cloudflare Firewall Expression (edit as expression)
(lower(http.user_agent) contains "bytespider") or 
(lower(http.user_agent) contains "bytedance")

✅ Cloudflare blocks are applied at the edge CDN — zero server load. Blocked requests don't count against your origin bandwidth or compute. This is the most performant blocking method.

Block Bytespider: Vercel

Vercel doesn't expose server-level nginx config, but you can block Bytespider using either vercel.json headers or Next.js middleware (preferred for full control).

Method 1: Next.js Middleware (Recommended)

Add or update your middleware.ts at the project root:

middleware.ts
import { NextResponse } from 'next/server';
import type { NextRequest } from 'next/server';

const BLOCKED_BOTS = [
  'bytespider',
  'bytedance',
];

export function middleware(request: NextRequest) {
  const ua = (request.headers.get('user-agent') || '').toLowerCase();

  if (BLOCKED_BOTS.some(bot => ua.includes(bot))) {
    return new NextResponse('Forbidden', { 
      status: 403,
      headers: { 'Content-Type': 'text/plain' },
    });
  }

  return NextResponse.next();
}

export const config = {
  // Run on all routes
  matcher: ['/((?!_next/static|_next/image|favicon.ico).*)'],
};

Method 2: vercel.json headers + rewrite

Zero code required, but limited: vercel.json headers cannot branch on the user-agent, so this method only adds an advisory X-Robots-Tag signal rather than an enforced block:

vercel.json
{
  "headers": [
    {
      "source": "/(.*)",
      "headers": [
        { "key": "X-Robots-Tag", "value": "noai, noimageai" }
      ]
    }
  ]
}

// Note: For actual 403 enforcement on Vercel, use
// Next.js middleware (Method 1 above). vercel.json headers
// alone cannot perform conditional UA-based blocking.

Block Bytespider: Apache

Add the following to your .htaccess or your site's Apache virtualhost config. Requires mod_rewrite (enabled by default on most hosts).

.htaccess
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} bytespider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} bytedance [NC]
RewriteRule .* - [F,L]

# [NC] = case-insensitive
# [F]  = Forbidden (403)
# [L]  = Last rule (stop processing)

Still Include robots.txt — for the Right Reasons

Server-level blocking is the primary defense. But you should still include Bytespider in your robots.txt — not because Bytespider will respect it, but because:

  • Audit trail: Your robots.txt is a public record of intent. If you ever need to demonstrate you explicitly denied access, having the Disallow directive documented matters.
  • Future compliance: ByteDance may improve compliance over time (or face regulatory pressure to do so). A robots.txt entry costs nothing.
  • Defense in depth: Other ByteDance-adjacent crawlers may use different UA strings but respect robots.txt. Belt and braces.
  • Open Shadow scanner reads it: The AI Readiness Scanner at Open Shadow checks for Bytespider in your robots.txt as one signal. Keep it there.

robots.txt (keep this, alongside server-level blocking)
# Bytespider (ByteDance / TikTok)
# Note: blocked at server level — this is belt-and-braces
User-agent: Bytespider
Disallow: /

Verify Your Block is Working

After implementing your server-level block, verify with one of these methods:

1. curl simulation

curl -s -o /dev/null -w "%{http_code}" \
  -H "User-Agent: Bytespider" \
  https://yourdomain.com

# Expected output: 403
# If you see 200, your block is not active yet.
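
To exercise all the user-agent variants in one pass, a small shell loop works too. This is a sketch: `status_for_ua` is a helper defined here (not a standard tool), and `https://yourdomain.com` is a placeholder to replace with your own site:

```shell
# Return the HTTP status code for a given User-Agent against a URL
status_for_ua() {
  curl -s --max-time 5 -o /dev/null -w "%{http_code}" \
    -H "User-Agent: $1" "$2"
}

TARGET="https://yourdomain.com"  # ← replace with your domain

for ua in "Bytespider" "bytespider" "Bytedance" \
  "Mozilla/5.0 (compatible; Bytespider; spider-feedback@bytedance.com)"; do
  printf '%s  %s\n' "$(status_for_ua "$ua" "$TARGET")" "$ua"
done
# Every line should start with 403 once the block is active.
```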

2. nginx access log grep

# Check for Bytespider hits and tally their status codes
# ($9 is the status code in nginx's default combined log format;
#  adjust the field number if you use a custom log_format)
grep -i "bytespider" /var/log/nginx/access.log | \
  awk '{print $9}' | sort | uniq -c

# 403 = blocked correctly
# 200 = not blocked

3. Cloudflare Security Events

In your Cloudflare dashboard: Security → Events → filter by your "Block Bytespider" rule name. Matched + blocked requests will appear here with user-agent details.

Frequently Asked Questions

Does Bytespider respect robots.txt?

Officially yes — ByteDance states it honors robots.txt. In practice, independent researchers have documented continued crawling of disallowed paths. Don't rely on it as your only protection.

What user agents does Bytespider use?

The primary string contains "Bytespider". Variations include lowercase "bytespider", "Bytedance", and strings referencing spider-feedback@bytedance.com or zhanzhang.toutiao.com. Use case-insensitive matching to catch all variants.

Will blocking Bytespider affect my SEO or organic traffic?

No. Bytespider has no connection to Google Search, Bing, or any traditional search index. Blocking it will not affect your Google rankings, Search Console data, or any mainstream traffic source.

What's the difference between robots.txt blocking and server-level blocking?

robots.txt is a polite convention — it relies on the crawler choosing to obey it. Server-level blocking (nginx, Cloudflare, Apache) returns a 403 before any content is served, regardless of what the bot does with robots.txt. Server-level is technically guaranteed; robots.txt is not.

Are there other crawlers that ignore robots.txt?

Yes. Bytespider is the most well-documented case among named crawlers. Many undisclosed scrapers rotate user agents and ignore robots.txt entirely. Major AI crawlers such as GPTBot and ClaudeBot publicly commit to respecting robots.txt and have generally been observed doing so.

How do I verify Bytespider is actually blocked?

Run: curl -s -o /dev/null -w "%{http_code}" -H "User-Agent: Bytespider" https://yourdomain.com — you should get 403. In Cloudflare, check Security > Events for blocked requests matching your rule.

Is your site protected from AI bots?

Run a free scan to check your robots.txt, meta tags, and overall AI readiness score.
