AI Training · Mistral AI (France)

How to Block MistralBot

MistralBot is Mistral AI's training crawler — the French lab behind Mistral Large, Mixtral, and Le Chat. Active since early 2024, it crawls web content to train Europe's most prominent AI model family.

✓ Respects robots.txt
Mistral reliably honors Disallow directives — robots.txt block is sufficient
EU / GDPR subject
As a French company, Mistral faces stricter data obligations than US-based AI labs
Block CCBot too
Mistral also trains on Common Crawl data — block CCBot for full coverage

What Does MistralBot Collect?

MistralBot crawls publicly available web content to build training datasets for Mistral AI's model families — including Mistral Large (their flagship proprietary model), Mistral Small, and the open-weight Mixtral series. It also feeds Le Chat, Mistral's consumer AI assistant.

Mistral AI is the standout European AI lab — founded in Paris in 2023 and backed by Andreessen Horowitz, Nvidia, and others. Its models are widely used in enterprise AI applications and embedded into platforms like Slack, Microsoft Azure, and Google Cloud. MistralBot's crawl activity has grown in step with each new model release.

Like most AI labs, Mistral draws from two data channels: its own direct web crawling via MistralBot, and licensed/open datasets including Common Crawl (collected by CCBot). Blocking MistralBot stops the direct crawl pipeline; blocking CCBot cuts the Common Crawl supply that feeds Mistral and 50+ other AI models simultaneously.

MistralBot user agent
Mozilla/5.0 (compatible; MistralBot/1.0; +https://mistral.ai/bot)

In robots.txt, use the token MistralBot. It is the only user agent token known for this crawler.

Option 1: Block via robots.txt (Recommended)

Block entire site (Recommended)
robots.txt
User-agent: MistralBot
Disallow: /

One rule — MistralBot uses a single user agent token with no known alternates.

Block MistralBot + CCBot for full Mistral coverage (Recommended for publishers)
robots.txt
# Block Mistral's direct crawler
User-agent: MistralBot
Disallow: /

# Block CCBot — Common Crawl feeds Mistral, GPT, Llama, Gemini, and 50+ others
User-agent: CCBot
Disallow: /

CCBot is the single highest-leverage AI training opt-out — one block affects 50+ models.

Block specific paths only
robots.txt
# Protect premium/original content
User-agent: MistralBot
Disallow: /articles/
Disallow: /research/
Disallow: /premium/
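
To see which URLs the path rules above would cover, here is a minimal TypeScript sketch of how a compliant crawler evaluates Disallow prefixes. The helper name isPathDisallowed and the sample paths are assumptions made for illustration. It handles plain prefixes only (no * or $ wildcards), and it is not Mistral's implementation.

```typescript
// Hypothetical helper: plain-prefix matching only, no wildcard support.
function isPathDisallowed(path: string, disallowRules: string[]): boolean {
  // An empty Disallow value blocks nothing, so empty rules are skipped.
  return disallowRules.some((rule) => rule.length > 0 && path.startsWith(rule));
}

// The three prefixes from the robots.txt example above.
const mistralRules: string[] = ['/articles/', '/research/', '/premium/'];

console.log(isPathDisallowed('/articles/2024/launch', mistralRules)); // true
console.log(isPathDisallowed('/about', mistralRules)); // false
console.log(isPathDisallowed('/articles', mistralRules)); // false
```

The last call shows a common gap: a rule ending in a slash does not cover the same path without the slash, so an index page served at /articles would need its own rule.
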

Block all major AI training crawlers at once
robots.txt
# Block all major AI training crawlers
User-agent: MistralBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: meta-externalagent
Disallow: /

User-agent: xAI-Bot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

# Search engines — unaffected
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

Comprehensive training opt-out. No effect on search rankings.
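
If the blocklist is maintained in code rather than by hand, the file above can be generated from a list of tokens. The sketch below is one way to do that in TypeScript. The helper name buildRobotsTxt is an assumption, and it emits only site-wide Disallow groups.

```typescript
// Hypothetical helper: builds a robots.txt body that blocks each listed token site-wide.
function buildRobotsTxt(blockedAgents: string[]): string {
  const groups = blockedAgents.map(
    (agent) => `User-agent: ${agent}\nDisallow: /`,
  );
  // Blank line between groups, trailing newline at the end of the file.
  return groups.join('\n\n') + '\n';
}

// A subset of the tokens from the example above.
const aiCrawlers: string[] = ['MistralBot', 'CCBot', 'GPTBot'];
console.log(buildRobotsTxt(aiCrawlers));
```

Crawlers that match no group are allowed by default, so explicit Allow groups for search engines are optional.
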

Option 2: Next.js App Router

app/robots.ts
import type { MetadataRoute } from 'next';

// Next.js serves this function's return value as /robots.txt.
export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      // AI training crawlers: blocked site-wide
      { userAgent: 'MistralBot', disallow: ['/'] },
      { userAgent: 'CCBot', disallow: ['/'] },
      { userAgent: 'GPTBot', disallow: ['/'] },
      { userAgent: 'ClaudeBot', disallow: ['/'] },
      { userAgent: 'anthropic-ai', disallow: ['/'] },
      { userAgent: 'Google-Extended', disallow: ['/'] },
      { userAgent: 'PerplexityBot', disallow: ['/'] },
      { userAgent: 'xAI-Bot', disallow: ['/'] },
      { userAgent: 'Bytespider', disallow: ['/'] },
      { userAgent: 'meta-externalagent', disallow: ['/'] },
      { userAgent: 'Applebot-Extended', disallow: ['/'] },
      // Search engines and all other crawlers: allowed
      { userAgent: 'Googlebot', allow: ['/'] },
      { userAgent: '*', allow: ['/'] },
    ],
    // Replace with your own sitemap URL
    sitemap: 'https://yoursite.com/sitemap.xml',
  };
}

Option 3: nginx — Hard 403 Block

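Mistral reliably respects robots.txt, so a server-level block is optional. Use it if you want hard enforcement, want to keep crawler requests from reaching your application, or prefer not to rely on Mistral's compliance. Note that nginx still records each blocked request in the access log, which is what makes the 403 check later in this guide possible.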

nginx.conf
# In your server {} block
if ($http_user_agent ~* "MistralBot") {
    return 403;
}
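
Where nginx is not in front of the application, the same case-insensitive substring match can be made in application code. The TypeScript sketch below shows only the decision logic. The names BLOCKED_TOKENS and statusForUserAgent are assumptions for illustration, and wiring the check into a particular framework's middleware is left out.

```typescript
// Tokens to refuse; 'MistralBot' comes from the user agent string shown earlier.
const BLOCKED_TOKENS: string[] = ['MistralBot'];

// Hypothetical helper: returns 403 for a blocked crawler, 200 otherwise.
function statusForUserAgent(userAgent: string | undefined): number {
  // A missing User-Agent header is treated as an empty string.
  const ua = (userAgent ?? '').toLowerCase();
  const blocked = BLOCKED_TOKENS.some((token) => ua.includes(token.toLowerCase()));
  return blocked ? 403 : 200;
}

console.log(statusForUserAgent('Mozilla/5.0 (compatible; MistralBot/1.0; +https://mistral.ai/bot)')); // 403
console.log(statusForUserAgent('Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)')); // 200
console.log(statusForUserAgent(undefined)); // 200
```

Like the nginx rule, this trusts the User-Agent header, so it stops only crawlers that identify themselves honestly.
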

Option 4: Cloudflare WAF Rule

Cloudflare WAF → Custom Rules → Expression
(http.user_agent contains "MistralBot")

Set the action to Block. Blocks at the edge — zero load on your server.

Cloudflare Dashboard → Security → WAF → Custom Rules → Create rule

The EU / GDPR Angle

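Mistral AI is a French company, which places it squarely within EU jurisdiction for GDPR and the EU AI Act. Both laws can also apply to non-EU companies that serve the EU market, but enforcement against an EU-established company is more direct. This matters for publishers in a few ways: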

Article 21 objection rights

EU-based publishers may have grounds to object under GDPR Article 21 when content used for AI training contains their personal data, since Article 21 covers personal data rather than published content in general. This legal avenue is still being tested, but Mistral's EU incorporation means it faces real legal exposure, more so than a US company operating from outside the EU.

EU AI Act training data requirements

Under the EU AI Act (in force since August 2024, with obligations for general-purpose AI models applying from August 2025), providers of general-purpose AI models must publish a summary of the content used for training and maintain a policy to respect copyright opt-outs. Mistral, as an EU-incorporated company, has compliance obligations here that incentivize it to honor opt-out requests.

Practical upshot

For most publishers, robots.txt is still the fastest and most reliable opt-out. The GDPR/EU AI Act angle provides additional leverage if you want to send a formal opt-out request beyond robots.txt — contact Mistral at legal@mistral.ai with a description of your content and opt-out request.

Verify Your Block

bash
# Check nginx access logs for MistralBot
grep "MistralBot" /var/log/nginx/access.log | tail -20

# Confirm it's fetching robots.txt
grep "MistralBot" /var/log/nginx/access.log | grep "robots.txt"

# If server-level blocked — confirm 403s
grep "MistralBot" /var/log/nginx/access.log | grep " 403 "

Seeing MistralBot fetch /robots.txt and then stop making content requests means the block is working correctly.
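
For a quick summary instead of raw grep output, the status codes can be tallied per crawler. The TypeScript sketch below assumes nginx's default combined log format. The helper name countStatusesForBot and the sample log lines are made up for illustration and are not real traffic.

```typescript
// Hypothetical helper: tallies response status codes for log lines that mention a crawler token.
// Assumes nginx's default "combined" format, where the status follows the quoted request.
function countStatusesForBot(logLines: string[], token: string): Record<string, number> {
  const counts: Record<string, number> = {};
  const statusPattern = /^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) /;
  for (const line of logLines) {
    if (!line.toLowerCase().includes(token.toLowerCase())) continue;
    const match = statusPattern.exec(line);
    if (match === null) continue; // skip lines that are not in the expected format
    const status = match[1];
    counts[status] = (counts[status] ?? 0) + 1;
  }
  return counts;
}

// Made-up sample lines for illustration only.
const sampleLog: string[] = [
  '203.0.113.7 - - [01/Mar/2025:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "Mozilla/5.0 (compatible; MistralBot/1.0; +https://mistral.ai/bot)"',
  '203.0.113.7 - - [01/Mar/2025:10:00:05 +0000] "GET /articles/a HTTP/1.1" 403 153 "-" "Mozilla/5.0 (compatible; MistralBot/1.0; +https://mistral.ai/bot)"',
  '198.51.100.4 - - [01/Mar/2025:10:00:09 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
];

// Counts one 200 (the robots.txt fetch) and one 403 for MistralBot.
console.log(countStatusesForBot(sampleLog, 'MistralBot'));
```

With a robots.txt-only block, the expected pattern is 200s for /robots.txt and little else. With a server-level block, the counts should be dominated by 403s.
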

Frequently Asked Questions

Does MistralBot respect robots.txt?
Yes. Mistral AI has committed to honoring robots.txt Disallow directives. As a European company subject to GDPR and the EU AI Act, Mistral has legal obligations that reinforce this — beyond just reputational incentives. A robots.txt block is sufficient for most publishers.
Which AI models does blocking MistralBot protect against?
Blocking MistralBot stops future direct crawls for Mistral's model family (Mistral Large, Mistral Small, Mistral 7B, Mixtral 8x7B, and Mixtral 8x22B) and for the Le Chat assistant built on those models. It does not remove content that was collected before the block. To also block Common Crawl data that feeds Mistral, add CCBot to your robots.txt as well.
Is blocking MistralBot enough, or do I need to block CCBot too?
For full coverage against Mistral's training pipeline, block both. MistralBot is Mistral's direct crawler; CCBot collects data for Common Crawl, an openly available dataset that Mistral (and many other AI labs) use as a training data source. CCBot is the single highest-leverage opt-out you can make: one Disallow rule covers 50+ AI models at once.
Will blocking MistralBot affect my search rankings?
No. MistralBot is a training crawler only. Mistral does not operate a public web search product that indexes your site for public queries. Blocking it has zero effect on your Google, Bing, or any other search ranking.
Does Mistral have a content removal form?
Mistral does not currently operate a widely documented public removal request form. For formal opt-out or removal requests beyond robots.txt, contact legal@mistral.ai. Given Mistral's EU legal obligations, formal written requests carry more weight than with some US-based AI companies.
What is the difference between MistralBot and Mixtral?
MistralBot is Mistral AI's web crawler — it collects training data. Mixtral is the name of Mistral's open-weight mixture-of-experts model family (Mixtral 8x7B, 8x22B, etc.). They're related only in that MistralBot feeds data to the systems that train Mixtral and other Mistral models.
