
Blocking Bytespider: Why robots.txt Isn't Enough

ByteDance's Bytespider crawler has been documented ignoring robots.txt Disallow rules. Here's how to block it at the infrastructure level — nginx, Cloudflare, Vercel, Apache, and Next.js.

🛡️ 9 min read · Updated March 2026 · Open Shadow

⚡ TL;DR

  • Bytespider is ByteDance's web crawler, used to feed TikTok, Douyin, and Toutiao.
  • Despite a Disallow in robots.txt, independent researchers have documented it continuing to crawl blocked paths.
  • The fix: block it at the server layer (nginx, Cloudflare WAF, Vercel headers, Apache) so the 403 is served before any content is touched.
  • Blocking Bytespider has zero impact on Google, Bing, or any traditional SEO ranking.

What is Bytespider?

Bytespider is the web crawling infrastructure operated by ByteDance — the parent company of TikTok, Douyin, CapCut, and the Toutiao news platform. It crawls the public web to build the content databases that power recommendation algorithms, AI features, and training datasets across ByteDance's product suite.

Its primary user-agent string is:

Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; spider-feedback@bytedance.com)

Variations include bytespider (lowercase), Bytedance, and strings referencing zhanzhang.toutiao.com. When implementing blocks, always use case-insensitive matching.
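A single case-insensitive pattern catches every variant above. A quick sketch with `grep -i` against a hypothetical sample file (the variant strings are the ones listed in this guide; the set in the wild may differ):

```shell
# Hypothetical sample of user-agent strings, including the variants above
cat > /tmp/ua_samples.txt <<'EOF'
Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; spider-feedback@bytedance.com)
bytespider
Bytedance
SomeOtherBot/1.0
EOF

# One case-insensitive pattern (-i) matches all three ByteDance variants
# but not the unrelated bot; -c counts matching lines
grep -icE "bytespider|bytedance" /tmp/ua_samples.txt
# → 3
```

The same `bytespider|bytedance` pattern is what the nginx, Cloudflare, and Apache rules below use.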

Why robots.txt Isn't Reliable Against Bytespider

The Robots Exclusion Protocol is a convention — not a technical barrier. It works because crawlers choose to read and respect it. Reputable operators like OpenAI (GPTBot) and Anthropic (ClaudeBot) have publicly committed to honoring robots.txt, and independent log analyses have largely borne that out.

Bytespider is different. Multiple independent researchers have published access log analyses showing Bytespider crawling URLs that were explicitly listed under Disallow: /. The pattern is consistent: robots.txt is fetched, but the disallow directives are not enforced on subsequent crawl requests.
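You can audit your own access logs for the same pattern: Bytespider fetching robots.txt and then requesting paths you disallowed. A minimal sketch, assuming nginx's default combined log format; the sample log below is hypothetical, so substitute your real /var/log/nginx/access.log when auditing a live server:

```shell
# Hypothetical sample in nginx "combined" format
cat > /tmp/access.log <<'EOF'
1.2.3.4 - - [01/Mar/2026:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "Mozilla/5.0 (compatible; Bytespider; spider-feedback@bytedance.com)"
1.2.3.4 - - [01/Mar/2026:10:00:05 +0000] "GET /private/report HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Bytespider; spider-feedback@bytedance.com)"
5.6.7.8 - - [01/Mar/2026:10:01:00 +0000] "GET / HTTP/1.1" 200 900 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"
EOF

# Bytespider requests to anything other than robots.txt; if those paths
# fall under "Disallow: /", the directive was not honored
# ($7 is the request path in the combined log format)
grep -i "bytespider" /tmp/access.log | grep -v "robots.txt" | awk '{print $7}'
# → /private/report
```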

robots.txt — what you write
User-agent: Bytespider
Disallow: /

# ⚠️ Bytespider may read this and continue crawling anyway.
# This alone is not a reliable block.

The practical consequence: if you're relying exclusively on robots.txt to protect proprietary content, training data, or gated pages from Bytespider, you may have a false sense of security. Server-level enforcement is the only technically guaranteed method.

robots.txt vs Server-Level Blocking

| Method | How it works | Reliable vs Bytespider? |
|---|---|---|
| robots.txt Disallow | Crawler reads the file and voluntarily skips paths | ⚠️ Not guaranteed |
| noai meta tag | Per-page HTML signal; requires page to be fetched first | ⚠️ Page still served |
| nginx User-Agent block | Returns 403 before content is served | ✅ Technically enforced |
| Cloudflare WAF rule | Blocks at edge, before reaching your server | ✅ Technically enforced |
| Apache .htaccess block | Returns 403 at web server level | ✅ Technically enforced |
| Vercel headers config | Returns 403 via Vercel edge network | ✅ Technically enforced |
| Next.js middleware | Returns 403 at the application edge | ✅ Technically enforced |

Block Bytespider: nginx

Add this to your nginx.conf or your site's server block. The ~* makes the match case-insensitive.

nginx.conf
server {
    # ... your existing server config ...

    # Block Bytespider (case-insensitive user-agent match)
    if ($http_user_agent ~* "bytespider|bytedance") {
        return 403;
    }
}

After editing, reload nginx without downtime:

sudo nginx -t && sudo nginx -s reload

Block Bytespider: Cloudflare WAF

If your site is behind Cloudflare (free or paid), you can block Bytespider at the edge — before it ever reaches your server. No server access required.

Option A: Custom WAF Rule (Recommended)

  1. Go to your Cloudflare dashboard → Security → WAF → Custom rules
  2. Click Create rule
  3. Set the rule name: Block Bytespider
  4. Under Field, choose User Agent
  5. Set Operator to contains (case insensitive)
  6. Set Value to bytespider
  7. Action: Block (returns 403)

Option B: Cloudflare Firewall Expression

Cloudflare Firewall Expression (edit as expression)
(lower(http.user_agent) contains "bytespider") or 
(lower(http.user_agent) contains "bytedance")

✅ Cloudflare blocks are applied at the edge CDN — zero server load. Blocked requests don't count against your origin bandwidth or compute. This is the most performant blocking method.

Block Bytespider: Vercel

Vercel doesn't expose server-level nginx config, but you can block Bytespider using either vercel.json headers or Next.js middleware (preferred for full control).

Method 1: Next.js Middleware (Recommended)

Add or update your middleware.ts at the project root:

middleware.ts
import { NextResponse } from 'next/server';
import type { NextRequest } from 'next/server';

const BLOCKED_BOTS = [
  'bytespider',
  'bytedance',
];

export function middleware(request: NextRequest) {
  const ua = (request.headers.get('user-agent') || '').toLowerCase();

  if (BLOCKED_BOTS.some(bot => ua.includes(bot))) {
    return new NextResponse('Forbidden', { 
      status: 403,
      headers: { 'Content-Type': 'text/plain' },
    });
  }

  return NextResponse.next();
}

export const config = {
  // Run on all routes
  matcher: ['/((?!_next/static|_next/image|favicon.ico).*)'],
};

Method 2: vercel.json headers + rewrite

Zero code required, but limited: vercel.json headers cannot branch on the user-agent, so this method only adds an advisory X-Robots-Tag signal rather than an enforced block:

vercel.json
{
  "headers": [
    {
      "source": "/(.*)",
      "headers": [
        { "key": "X-Robots-Tag", "value": "noai, noimageai" }
      ]
    }
  ]
}

// Note: For actual 403 enforcement on Vercel, use
// Next.js middleware (Method 1 above). vercel.json headers
// alone cannot perform conditional UA-based blocking.

Block Bytespider: Apache

Add the following to your .htaccess or your site's Apache virtualhost config. Requires mod_rewrite (enabled by default on most hosts).

.htaccess
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} bytespider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} bytedance [NC]
RewriteRule .* - [F,L]

# [NC] = case-insensitive
# [F]  = Forbidden (403)
# [L]  = Last rule (stop processing)

Still Include robots.txt — for the Right Reasons

Server-level blocking is the primary defense. But you should still include Bytespider in your robots.txt — not because Bytespider will respect it, but because:

  • Audit trail: Your robots.txt is a public record of intent. If you ever need to demonstrate you explicitly denied access, having the Disallow directive documented matters.
  • Future compliance: ByteDance may improve compliance over time (or face regulatory pressure to do so). A robots.txt entry costs nothing.
  • Defense in depth: Other ByteDance-adjacent crawlers may use different UA strings but respect robots.txt. Belt and braces.
  • Open Shadow scanner reads it: The AI Readiness Scanner at Open Shadow checks for Bytespider in your robots.txt as one signal. Keep it there.

robots.txt (keep this, alongside server-level blocking)
# Bytespider (ByteDance / TikTok)
# Note: blocked at server level — this is belt-and-braces
User-agent: Bytespider
Disallow: /

Verify Your Block is Working

After implementing your server-level block, verify with one of these methods:

1. curl simulation

curl -s -o /dev/null -w "%{http_code}" \
  -H "User-Agent: Bytespider" \
  https://yourdomain.com

# Expected output: 403
# If you see 200, your block is not active yet.
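
To exercise all the user-agent variants in one pass, a small shell loop works too. This is a sketch: `status_for_ua` is a helper defined here (not a standard tool), and `https://yourdomain.com` is a placeholder to replace with your own site:

```shell
# Return the HTTP status code for a given User-Agent against a URL
status_for_ua() {
  curl -s --max-time 5 -o /dev/null -w "%{http_code}" \
    -H "User-Agent: $1" "$2"
}

TARGET="https://yourdomain.com"  # ← replace with your domain

for ua in "Bytespider" "bytespider" "Bytedance" \
  "Mozilla/5.0 (compatible; Bytespider; spider-feedback@bytedance.com)"; do
  printf '%s  %s\n' "$(status_for_ua "$ua" "$TARGET")" "$ua"
done
# Every line should start with 403 once the block is active.
```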

2. nginx access log grep

# Check for Bytespider hits and tally their status codes
# ($9 is the status code in nginx's default combined log format;
#  adjust the field number if you use a custom log_format)
grep -i "bytespider" /var/log/nginx/access.log | \
  awk '{print $9}' | sort | uniq -c

# 403 = blocked correctly
# 200 = not blocked

3. Cloudflare Security Events

In your Cloudflare dashboard: Security → Events → filter by your "Block Bytespider" rule name. Matched + blocked requests will appear here with user-agent details.

Frequently Asked Questions

Does Bytespider respect robots.txt?

Officially yes — ByteDance states it honors robots.txt. In practice, independent researchers have documented continued crawling of disallowed paths. Don't rely on it as your only protection.

What user agents does Bytespider use?

The primary string contains "Bytespider". Variations include lowercase "bytespider", "Bytedance", and strings referencing spider-feedback@bytedance.com or zhanzhang.toutiao.com. Use case-insensitive matching to catch all variants.

Will blocking Bytespider affect my SEO or organic traffic?

No. Bytespider has no connection to Google Search, Bing, or any traditional search index. Blocking it will not affect your Google rankings, Search Console data, or any mainstream traffic source.

What's the difference between robots.txt blocking and server-level blocking?

robots.txt is a polite convention — it relies on the crawler choosing to obey it. Server-level blocking (nginx, Cloudflare, Apache) returns a 403 before any content is served, regardless of what the bot does with robots.txt. Server-level is technically guaranteed; robots.txt is not.

Are there other crawlers that ignore robots.txt?

Yes. Bytespider is the most well-documented case among named crawlers. Many undisclosed scrapers rotate user agents and ignore robots.txt entirely. Major AI crawlers such as GPTBot and ClaudeBot publicly commit to respecting robots.txt and have generally been observed doing so.

How do I verify Bytespider is actually blocked?

Run: curl -s -o /dev/null -w "%{http_code}" -H "User-Agent: Bytespider" https://yourdomain.com — you should get 403. In Cloudflare, check Security > Events for blocked requests matching your rule.

Is your site protected from AI bots?

Run a free scan to check your robots.txt, meta tags, and overall AI readiness score.
