About
CCBot is the crawler behind Common Crawl, a nonprofit that maintains a large open repository of web crawl data. Many AI companies use this dataset to train large language models.
Purpose
Open web dataset for AI training and research
User Agent String
CCBot/2.0 (https://commoncrawl.org/faq/)
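As a rough illustration, a server-side check for this user agent might look like the Python sketch below. The helper name `is_ccbot` is hypothetical, and matching on the `CCBot` token rather than the full string is an assumption made so that other version numbers would also match. User-agent headers can be spoofed, so a match is a hint rather than proof that a request comes from Common Crawl.

```python
import re

# Hypothetical helper, not part of any library.
# Assumption: matching the "CCBot" product token (case-insensitively) is enough,
# so versions other than 2.0 are also recognised.
_CCBOT_PATTERN = re.compile(r"\bCCBot\b", re.IGNORECASE)


def is_ccbot(user_agent: str) -> bool:
    """Return True if the User-Agent header looks like Common Crawl's crawler."""
    return bool(_CCBOT_PATTERN.search(user_agent or ""))


if __name__ == "__main__":
    # The first string is the user agent listed above; the second is an
    # arbitrary browser-style string used only for contrast.
    print(is_ccbot("CCBot/2.0 (https://commoncrawl.org/faq/)"))
    print(is_ccbot("Mozilla/5.0 (X11; Linux x86_64)"))
```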
How to Control in robots.txt
🚫 Block CCBot
User-agent: CCBot
Disallow: /
✅ Allow CCBot
User-agent: CCBot
Allow: /
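One way to sanity-check rules like these before deploying them is to parse them with Python's standard `urllib.robotparser` module. The sketch below is only an illustration: the helper `can_fetch`, the constants `BLOCK_RULES` and `ALLOW_RULES`, and the `https://example.com/page` URL are all assumptions made for the example. It also assumes each directive sits on its own line, as robots.txt requires.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt content and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The two rule sets from this section, one directive per line.
BLOCK_RULES = "User-agent: CCBot\nDisallow: /\n"
ALLOW_RULES = "User-agent: CCBot\nAllow: /\n"

if __name__ == "__main__":
    url = "https://example.com/page"  # placeholder URL, an assumption
    print("block rules, CCBot:", can_fetch(BLOCK_RULES, "CCBot", url))
    print("allow rules, CCBot:", can_fetch(ALLOW_RULES, "CCBot", url))
    # Crawlers not named in the file are unaffected by a CCBot-only rule.
    print("block rules, other bot:", can_fetch(BLOCK_RULES, "Googlebot", url))
```

Note that robots.txt is advisory: this check shows how a compliant parser reads the rules, not whether a given crawler honours them.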
Complete Guide: How to Block CCBot
Server-level blocking, nginx configs, Cloudflare rules, Next.js middleware, and more →