How to Block Webz.io & Omgili: The AI Data Broker Behind Three Crawlers
Webz.io operates under three different identities — omgili, omgilibot, and webzio-extended — selling web content to AI companies. One Disallow rule isn't enough. Here's how to stop all three.
Quick Block (60 seconds)
Add all three user agents to your robots.txt. Blocking only one leaves two active crawlers still harvesting your site.
```
# Block Webz.io AI training crawler (current)
User-agent: webzio-extended
Disallow: /

# Block Webzio general crawler (optional — see note below)
User-agent: webzio
Disallow: /

# Block legacy Omgili user agents
User-agent: omgili
Disallow: /

User-agent: omgilibot
Disallow: /
```
Note on webzio vs webzio-extended: If you only want to block AI training, add webzio-extended only. The standard webzio bot feeds search indexes used by third-party tools — blocking it may reduce referral traffic from platforms that license Webz.io's search data. When in doubt, block both.
Who Is Webz.io (and What Is Omgili)?
Webz.io is an Israeli content intelligence company that crawls the open web and sells structured data to businesses — including AI companies building training datasets. Omgili was an earlier brand and crawler they operated; Webz.io acquired and absorbed it, and the legacy omgili and omgilibot user agents remain active in server logs across the web.
In 2024, Webz.io introduced the “Webzio Duo” — a two-crawler system designed to give content owners more granular control:
| Crawler | Purpose | Block to stop AI training? |
|---|---|---|
| webzio | General web index; powers search tools | Optional |
| webzio-extended | AI training data collection | Yes — block this one |
| omgilibot | Legacy crawler (pre-2024) | Yes — still appears in logs |
| omgili | Older legacy variant | Yes — belt-and-suspenders |
Unlike first-party AI crawlers like GPTBot (OpenAI) or ClaudeBot (Anthropic) — which crawl for their own models — Webz.io is a third-party data broker. It sells structured web content to any paying customer, which may include multiple AI companies simultaneously. One block stops a pipeline that could feed many models at once.
What Does Webz.io Collect?
Webz.io markets itself as a “web content intelligence” platform. Its products include:
- **News and article text:** Full text, titles, publish dates, and author metadata from news sites and blogs.
- **Forum and review content:** Discussion threads, product reviews, and social commentary, structured for sentiment analysis.
- **Dark web and open web intelligence:** Webz.io sells a "dark web monitor" product alongside its open web index.
- **AI training datasets:** Structured text data labeled and sold specifically for LLM and ML training pipelines.
Verify the Block
After updating your robots.txt, verify:
1. Validate your robots.txt rules
Google retired the Search Console robots.txt Tester in 2023, and it only ever simulated Google's own crawlers, so it could not test omgilibot anyway. Use a standalone robots.txt validator or parser to confirm your rules parse correctly and that each Webz.io user agent group carries a Disallow: / rule.
2. Check server logs
Grep your access logs for existing Webz.io traffic:
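The log check can be a one-liner. A sketch assuming the common "combined" log format, where the user agent is the sixth quote-delimited field; the sample log line and the log path in the comment are illustrative, not Webz.io's exact user agent strings:

```shell
# Count requests from Webz.io-family user agents.
# In practice, point grep at your real log, e.g.:
#   grep -iE 'omgili|webzio' /var/log/nginx/access.log | ...
# Demonstrated here on a sample "combined"-format log line:
sample='203.0.113.7 - - [01/Jan/2025:12:00:00 +0000] "GET /post HTTP/1.1" 200 5120 "-" "omgilibot/0.3"'
printf '%s\n' "$sample" \
  | grep -iE 'omgili|webzio' \
  | awk -F'"' '{print $6}' \
  | sort | uniq -c | sort -rn
```

A drop to zero hits in the days after you deploy the block is a good sign the Disallow rules are being honored.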
3. Fetch your robots.txt
Confirm Disallow: / appears under each user agent block.
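This check can also be scripted. A sketch: the inline printf stands in for your live robots.txt, and yourdomain.com in the comment is a placeholder:

```shell
# Show each Webz.io-family User-agent group plus the line after it.
# Against your live site, run:
#   curl -s https://yourdomain.com/robots.txt | grep -i -E -A 1 'omgili|webzio'
# Demonstrated here on an inline copy of the rules:
printf 'User-agent: webzio-extended\nDisallow: /\n\nUser-agent: omgilibot\nDisallow: /\n' \
  | grep -i -E -A 1 'omgili|webzio'
```

Each matched User-agent line should be followed by its Disallow: / rule.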
Server-Level Blocking
For stricter enforcement — especially on high-value or paywalled content — add server-level rules that return 403 before the request is processed.
nginx
```nginx
# In your server {} block. The match is case-insensitive (~*);
# "omgili" also matches "omgilibot", and "webzio" also matches
# "webzio-extended".
if ($http_user_agent ~* "(omgili|omgilibot|webzio)") {
    return 403;
}
```
Apache (.htaccess)
```apache
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (omgili|omgilibot|webzio) [NC]
RewriteRule .* - [F,L]
```
Cloudflare WAF Rule
Dashboard → Security → WAF → Custom Rules → Create rule:
Field: User Agent
Operator: contains (apply for each)
Values: omgili · omgilibot · webzio-extended
Action: Block
Or, if your plan includes the regex "matches" operator (Business and Enterprise), use a single regex rule: (omgili|omgilibot|webzio-extended)
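Equivalently, a single custom rule can be written in Cloudflare's expression editor. A sketch in the rules language (note that contains is case-sensitive, hence the lower() wrapper, and that "omgili" as a substring already matches "omgilibot"):

```
(lower(http.user_agent) contains "omgili") or (lower(http.user_agent) contains "webzio-extended")
```

Set the rule's action to Block.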
Next.js Middleware
// middleware.ts
import { NextRequest, NextResponse } from 'next/server';
const BLOCKED = /omgili|omgilibot|webzio/i;
export function middleware(req: NextRequest) {
const ua = req.headers.get('user-agent') ?? '';
if (BLOCKED.test(ua)) return new NextResponse(null, { status: 403 });
return NextResponse.next();
}
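Whichever server-level method you deploy, you can confirm it from the command line by spoofing the crawler's user agent; yourdomain.com and the omgilibot/0.3 string are placeholders:

```shell
# Send a request identifying as omgilibot; a 403 means the block is live.
curl -s -o /dev/null -w '%{http_code}\n' -A 'omgilibot/0.3' https://yourdomain.com/
```

Compare against a request without the -A flag, which should still return your normal status code.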
Frequently Asked Questions
Does Webz.io respect robots.txt?
Yes — Webz.io officially states that both Webzio and Webzio-Extended respect robots.txt Disallow directives. The legacy Omgilibot also documented robots.txt compliance. That said, Webz.io is a commercial data broker with paying clients, so for high-value or paywalled content, combining robots.txt with server-level blocking is advisable.
What's the difference between webzio and webzio-extended?
Webz.io explicitly designed 'webzio-extended' as its AI training data crawler, separate from its general search index crawler ('webzio'). If you only want to prevent AI training use, block webzio-extended. Blocking 'webzio' may additionally reduce referral traffic from search tools that license Webz.io's data. For full protection, block both.
Should I block omgili, omgilibot, and webzio-extended — or just one?
Block all three. Omgilibot is the legacy user agent (pre-2024) still appearing in crawl logs. Omgili is an even older variant. Webzio-extended is the current AI-specific crawler. Because Webz.io rotated user agent identities over time, you need all three Disallow directives for comprehensive coverage.
Is Webz.io the same as Omgili?
Yes. Webz.io is the company behind Omgili. Omgili was an independent web content intelligence service that Webz.io acquired. The Omgilibot user agent is now legacy — replaced by the 'Webzio Duo' in 2024 — but legacy infrastructure still uses the old user agents.
Does blocking Webz.io affect Google or my SEO?
No. Webz.io operates entirely separately from Google, Bing, and other search engines. Blocking omgili, omgilibot, and webzio-extended in robots.txt has zero effect on Googlebot or your search rankings.
Related Guides
See Which AI Bots Are Hitting Your Site
Open Shadow scans your site and shows you exactly which AI crawlers are active — so you can block the right ones, not just guess.