
Is AI Using My Website Content?

Almost certainly, yes — if your site has been public for more than a few months without blocking AI bots. Here's how to confirm it, check which bots have visited, and stop future crawls in under 10 minutes.

Updated April 2026

The short answer: almost certainly yes

Common Crawl has been archiving the public web continuously since 2008. Its datasets — petabytes of crawled web content released for free — are the default training data for most major AI models: GPT (OpenAI), Gemini (Google), Llama (Meta), Mistral, Falcon, and hundreds of open-source models.

If your site has been publicly accessible and you haven't blocked CCBot in your robots.txt, your content is almost certainly in Common Crawl's archive — and therefore in the training data of dozens of AI models.

Step 1: Find Out Which AI Bots Have Visited

Option A: Free scan (fastest)

Run Open Shadow's free scan — it checks your robots.txt configuration and tells you which AI bots are currently allowed vs blocked on your site. This shows your current exposure, not historical visits.

Option B: Server log analysis

Your access logs record every visitor — including AI bots. Search for known AI user agents:

nginx / Apache
grep -iE "CCBot|GPTBot|ClaudeBot|anthropic-ai|PerplexityBot|Google-Extended|meta-externalagent|MistralBot|Bytespider|AI2Bot" /var/log/nginx/access.log | tail -50

Each line in the results is a page request from that AI bot, including the URL it fetched and the timestamp.
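To see totals rather than individual requests, the grep above can be extended into a small tally. This is a sketch: the function name is a placeholder, the log path in the usage comment is an example, and the bot list mirrors the user agent table in the next section.

```shell
# count_ai_bots LOGFILE: count requests per AI crawler in an access log.
# Extend the pattern as new crawlers appear.
count_ai_bots() {
  grep -ioE "CCBot|GPTBot|ClaudeBot|anthropic-ai|PerplexityBot|Google-Extended|meta-externalagent|MistralBot|Bytespider|AI2Bot" "$1" \
    | sort | uniq -c | sort -rn
}

# Usage (path is an example; point it at your real access log):
# count_ai_bots /var/log/nginx/access.log
```

The output is one line per crawler, most active first, e.g. a count followed by the matched user agent string.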

Option C: Cloudflare Analytics

Cloudflare's Firewall Events log captures bot activity with user agent details. In the Cloudflare dashboard: Security → Firewall → Firewall Events → filter by user agent. Known AI bots are also identified in Cloudflare's Bot Analytics report under "Verified Bots."

The AI Bots That Train on Web Content

These are the user agent strings to search your logs for. Any of them means that AI company has fetched content from your site:

User Agent           Company
------------------   ---------------
CCBot                Common Crawl
GPTBot               OpenAI
ClaudeBot            Anthropic
Google-Extended      Google
meta-externalagent   Meta
MistralBot           Mistral AI
Bytespider           ByteDance
AI2Bot               Allen Institute
PerplexityBot        Perplexity
anthropic-ai         Anthropic

Step 2: Stop Future Crawls (10 Minutes)

Add this to your robots.txt file (in the root of your domain):

robots.txt — block all major AI training crawlers
User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: meta-externalagent
Disallow: /

User-agent: MistralBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: AI2Bot
Disallow: /

✓ Safe for SEO

These rules don't affect Googlebot, Bingbot, or any search engine crawler. Your SEO is completely unaffected.

⚠ Prospective only

This stops future crawls. Content already in AI training datasets remains there — you cannot retroactively remove it.
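Once the file is deployed, you can confirm a given crawler is actually covered by fetching the live robots.txt and checking for a matching group. The helper below is a simplified sketch, not a full robots.txt parser: it assumes each User-agent line is followed directly by its own rules, so shared groups (several stacked User-agent lines before one Disallow) are not handled.

```shell
# is_blocked BOT: read a robots.txt on stdin; exit 0 if BOT has a "Disallow: /" rule.
# Simplified parsing: each User-agent line is assumed to carry its own rules.
is_blocked() {
  awk -v bot="$1" '
    tolower($0) ~ "^user-agent:[ \t]*" tolower(bot) "[ \t]*$" { in_block = 1; next }
    tolower($0) ~ /^user-agent:/                              { in_block = 0 }
    in_block && tolower($0) ~ /^disallow:[ \t]*\/[ \t]*$/     { found = 1 }
    END { exit found ? 0 : 1 }
  '
}

# Usage against your live site (domain is a placeholder):
# curl -s https://example.com/robots.txt | is_blocked GPTBot && echo "GPTBot is blocked"
```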

What About Content Already in AI Models?

If an AI bot crawled your site before you added these blocks, that content may already be in a training dataset. Here's the honest picture:

You cannot "unlearn" content from deployed models. Model weights don't store training examples as discrete, removable records, so there's no technical mechanism to surgically remove your content from GPT-4 or Llama 3.


Removal request forms exist but have limited impact. Anthropic (privacy.anthropic.com), OpenAI, and Common Crawl (commoncrawl.org) offer forms to request content removal. These affect future training runs, not deployed models.

Blocking works for future models. AI labs retrain models every 6–18 months. Block now, and your content won't be in GPT-5, Llama 4, Gemini Next, or whatever comes after. The effect compounds over time.

Where to Add robots.txt for Your Platform

Next.js
Create public/robots.txt in your project, or generate it with the robots() function in app/robots.ts (App Router, Next.js 13+).
WordPress
WordPress serves a virtual robots.txt by default. Edit it with an SEO plugin that includes a robots.txt editor (Yoast and Rank Math both do), or FTP in and create a physical /public_html/robots.txt, which takes precedence over the virtual file. Note that Settings → Reading → "Search engine visibility" is not a robots.txt editor: it discourages all crawlers, search engines included, and will hurt your SEO.
Shopify
Online Store → Preferences → robots.txt. Shopify auto-generates it — use the robots.txt.liquid template to add custom rules.
Squarespace / Wix
Limited robots.txt control. Squarespace: Settings → Advanced → External Services. Wix: Marketing & SEO → SEO Settings → robots.txt.
Static sites (Netlify, Vercel)
Create robots.txt in the root of your /public directory. It deploys automatically with your site.
Any server
Place robots.txt at the root of your web server (e.g., /var/www/html/robots.txt). It must be accessible at yourdomain.com/robots.txt.
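As a sketch of that last row, a helper that writes a minimal blocklist into a given document root. The function name and webroot path are examples, and the two-bot ruleset is abbreviated; in practice, write the full ruleset from Step 2.

```shell
# deploy_robots WEBROOT: write a minimal AI-crawler blocklist to WEBROOT/robots.txt.
# WEBROOT must be writable by the current user (e.g. /var/www/html).
deploy_robots() {
  cat > "$1/robots.txt" <<'EOF'
User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /
EOF
}

# Usage (path is an example; run as a user with write access to the web root):
# deploy_robots /var/www/html
```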

Frequently Asked Questions

How do I know if my content is in ChatGPT's knowledge?

You can test this directly: ask ChatGPT to tell you about your website or business. If it returns accurate, specific information about your site's content, your material is likely in its training data. This isn't definitive proof (ChatGPT may also be drawing on search results via ChatGPT-User), but accurate factual recall often indicates training data inclusion.

A competitor's AI product is clearly using my content. What can I do?

First, block their crawler in robots.txt (this prevents future use). Then submit a removal request if they offer one. If you believe they violated your terms of service or copyright, document the evidence and consult a lawyer. Several publishers have filed lawsuits against AI companies over unauthorized content use; The New York Times v. OpenAI is the highest-profile example.

Does blocking AI bots mean my content won't appear in AI search results?

It depends on which bots you block. Blocking training crawlers (GPTBot, CCBot, ClaudeBot) prevents your content from being used to train AI models. But AI search products (Perplexity, ChatGPT Search, Google AI Overviews) use separate crawlers (PerplexityBot, OAI-SearchBot, Googlebot). If you want to appear in AI search results, allow those while blocking training crawlers.
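As an illustration of that split, a robots.txt fragment that blocks training crawlers while leaving the search-focused crawlers able to fetch pages. The crawler names here follow each vendor's public documentation; verify them against the vendors' current docs before relying on this.

```
# Block training crawlers
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Allow AI search crawlers
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
```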

I'm a small blog — does this really matter?

It matters if your content has commercial value, if you rely on traffic from search (AI search is cannibalizing some traditional search traffic), or if you write about topics where being used without attribution or credit concerns you. For small, purely hobbyist sites, the practical impact is lower — but the principle of consent applies regardless of site size.

Check your site right now

Run a free scan to see which AI bots your robots.txt currently allows — and get a full AI readiness score.

Scan My Site Free →

Next Steps