About
archive.org_bot is the Internet Archive's Heritrix-based crawler, which collects pages for the Wayback Machine's web preservation project. It is not an AI training crawler itself, but archived web data is widely used as a source for AI training datasets; Common Crawl, for example, is a separate web archive that many LLMs are trained on. The 'ia_archiver' user-agent string is an older alias associated with the same crawler. Blocking this bot keeps new captures of your content out of the Wayback Machine, provided the crawler honors your robots.txt rules.
Purpose
Web preservation and Wayback Machine archiving
User Agent String
Mozilla/5.0 (compatible; archive.org_bot +http://www.archive.org/details/archive.org_bot)
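As a rough illustration of how this string might be recognized in request logs, the sketch below checks a User-Agent header for the `archive.org_bot` token. The function name `is_archive_org_bot` and the case-insensitive match are assumptions made for this example, not part of any Internet Archive tooling. A User-Agent header can be spoofed, so a match suggests the crawler but does not prove it.

```python
import re

# Token taken from the user-agent string above. Case-insensitive matching is
# an assumption, in case a log pipeline has normalised the header's case.
ARCHIVE_BOT_PATTERN = re.compile(r"archive\.org_bot", re.IGNORECASE)


def is_archive_org_bot(user_agent: str) -> bool:
    """Return True if the User-Agent header contains the archive.org_bot token."""
    return bool(ARCHIVE_BOT_PATTERN.search(user_agent))


crawler_ua = (
    "Mozilla/5.0 (compatible; archive.org_bot "
    "+http://www.archive.org/details/archive.org_bot)"
)
browser_ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

print(is_archive_org_bot(crawler_ua))  # True
print(is_archive_org_bot(browser_ua))  # False
```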
How to Control in robots.txt
🚫 Block archive.org_bot
User-agent: archive.org_bot
Disallow: /
✅ Allow archive.org_bot
User-agent: archive.org_bot
Allow: /
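One way to sanity-check the two rule sets above before deploying them is to run them through Python's standard-library `urllib.robotparser`. This is a minimal sketch: the helper name `can_fetch` and the URL `https://example.com/page.html` are placeholders chosen for illustration, and Python's parser is only one interpretation of robots.txt, so the crawler's own handling may differ in edge cases.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Each directive sits on its own line, as robots.txt requires.
BLOCK_RULES = "User-agent: archive.org_bot\nDisallow: /\n"
ALLOW_RULES = "User-agent: archive.org_bot\nAllow: /\n"

# Hypothetical URL, used only for this example.
URL = "https://example.com/page.html"

# Pass the short product token, not the full "Mozilla/5.0 (...)" header:
# robotparser compares only the text before the first "/" in the agent name.
print(can_fetch(BLOCK_RULES, "archive.org_bot", URL))  # False
print(can_fetch(ALLOW_RULES, "archive.org_bot", URL))  # True

# Agents not named in the file are unaffected by these rules.
print(can_fetch(BLOCK_RULES, "Googlebot", URL))  # True
```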