How to Block Webz.io & Omgili: The AI Data Broker Behind Three Crawlers

Webz.io operates under three different identities — omgili, omgilibot, and webzio-extended — selling web content to AI companies. One Disallow rule isn't enough. Here's how to stop all three.

Quick Block (60 seconds)

Add all three AI-related user agents (webzio-extended, omgili, and omgilibot) to your robots.txt; the general webzio crawler is optional. Blocking only one leaves the other two free to keep harvesting your site.

# Block Webz.io AI training crawler (current)
User-agent: webzio-extended
Disallow: /

# Block Webzio general crawler (optional; see note below)
User-agent: webzio
Disallow: /

# Block legacy Omgili user agents
User-agent: omgili
Disallow: /

User-agent: omgilibot
Disallow: /

Note on webzio vs webzio-extended: If you only want to block AI training, add webzio-extended only. The standard webzio bot feeds search indexes used by third-party tools — blocking it may reduce referral traffic from platforms that license Webz.io's search data. When in doubt, block both.
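If your robots.txt is generated at build time, the choice described in the note can be expressed as a single flag. The following is a minimal sketch, not an official Webz.io tool; `buildWebzioRules` and its `blockGeneralCrawler` parameter are hypothetical names introduced here for illustration.

```typescript
// Hypothetical helper: emits robots.txt groups for the Webz.io user agents named above.
const AI_TRAINING_AGENTS: string[] = ["webzio-extended", "omgili", "omgilibot"];
const GENERAL_AGENT = "webzio";

function buildWebzioRules(blockGeneralCrawler: boolean): string {
  const agents = blockGeneralCrawler
    ? [...AI_TRAINING_AGENTS, GENERAL_AGENT]
    : AI_TRAINING_AGENTS;

  // One group per agent: a User-agent line immediately followed by its rule,
  // with a blank line separating the groups.
  return agents
    .map((agent) => `User-agent: ${agent}\nDisallow: /`)
    .join("\n\n") + "\n";
}

// Block AI training only; pass true to block the general webzio crawler as well.
console.log(buildWebzioRules(false));
```

Append the returned text to the rest of your robots.txt rather than replacing the file, so existing rules for other crawlers stay in place.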

Who Is Webz.io (and What Is Omgili)?

Webz.io is an Israeli content intelligence company that crawls the open web and sells structured data to businesses, including AI companies building training datasets. Omgili is an earlier brand and crawler that is now part of Webz.io, and the legacy omgili and omgilibot user agents remain active in server logs across the web.

In 2024, Webz.io introduced the “Webzio Duo” — a two-crawler system designed to give content owners more granular control:

| Crawler | Purpose | Block to stop AI training? |
|---|---|---|
| webzio | General web index; powers search tools | Optional |
| webzio-extended | AI training data collection | Yes, block this one |
| omgilibot | Legacy crawler (pre-2024) | Yes, still appears in logs |
| omgili | Older legacy variant | Yes, belt-and-suspenders |
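Because the four names overlap ("webzio" is a prefix of "webzio-extended", and "omgili" of "omgilibot"), any code that tells them apart has to test the longer name first. A minimal sketch of that ordering follows; `classifyUserAgent` is a hypothetical helper, and the user-agent strings in the usage lines are made-up examples rather than the exact headers Webz.io sends.

```typescript
type WebzioCrawler = "webzio-extended" | "webzio" | "omgilibot" | "omgili";

// Longer tokens come first so "webzio-extended" is not reported as "webzio"
// and "omgilibot" is not reported as "omgili".
const CRAWLER_TOKENS: WebzioCrawler[] = ["webzio-extended", "webzio", "omgilibot", "omgili"];

function classifyUserAgent(userAgent: string): WebzioCrawler | null {
  const ua = userAgent.toLowerCase();
  // Returns the first token contained in the header, or null for unrelated agents.
  return CRAWLER_TOKENS.find((token) => ua.includes(token)) ?? null;
}

// Illustrative (assumed) header values:
console.log(classifyUserAgent("Mozilla/5.0 (compatible; webzio-extended/1.0)"));
console.log(classifyUserAgent("omgilibot/0.4"));
```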

Unlike first-party AI crawlers such as GPTBot (OpenAI) or ClaudeBot (Anthropic), which crawl for their own models, Webz.io is a third-party data broker. It sells structured web content to any paying customer, which may include multiple AI companies simultaneously. Blocking it cuts off a pipeline that could feed many models at once.

What Does Webz.io Collect?

Webz.io markets itself as a “web content intelligence” platform.

Verify the Block

After updating your robots.txt, verify:

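1. Robots.txt testing tool

Use a robots.txt tester that accepts a custom user agent. (Google retired the Search Console Robots.txt Tester in late 2023, and its replacement report does not test third-party user agents.) Enter omgilibot as the user agent and test your key URLs to confirm your rules parse correctly.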

2. Check server logs

Grep your access logs for existing Webz.io traffic:

grep -iE 'omgili|omgilibot|webzio' /var/log/nginx/access.log | tail -20

3. Fetch your robots.txt

curl -s https://yourdomain.com/robots.txt | grep -iE -A2 'webzio|omgili'

Confirm Disallow: / appears under each user agent block.
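The same confirmation can be scripted so it runs after every deploy. Below is a minimal sketch under stated assumptions: `findUnblockedAgents` is a hypothetical helper, it only recognizes an exact `Disallow: /` inside a group that names the agent, and it ignores `User-agent: *` groups and `Allow` overrides, so treat a clean result as a sanity check rather than a full robots.txt evaluation.

```typescript
// Returns the required agents that have no "Disallow: /" rule in a group naming them.
function findUnblockedAgents(robotsTxt: string, required: string[]): string[] {
  const blocked = new Set<string>();
  let groupAgents: string[] = [];
  let readingAgents = false; // true while consecutive User-agent lines are being collected

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim(); // drop comments and padding
    const sep = line.indexOf(":");
    if (sep === -1) continue; // blank or malformed line

    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === "user-agent") {
      // A User-agent line that follows a rule line starts a new group.
      if (!readingAgents) groupAgents = [];
      groupAgents.push(value.toLowerCase());
      readingAgents = true;
    } else {
      readingAgents = false;
      if (field === "disallow" && value === "/") {
        groupAgents.forEach((agent) => blocked.add(agent));
      }
    }
  }
  return required.filter((agent) => !blocked.has(agent.toLowerCase()));
}

// Example input: webzio is only partially disallowed, so it is reported as unblocked.
const sampleRobots = [
  "User-agent: webzio-extended",
  "Disallow: /",
  "",
  "User-agent: omgili",
  "User-agent: omgilibot",
  "Disallow: /",
  "",
  "User-agent: webzio",
  "Disallow: /private",
].join("\n");

console.log(findUnblockedAgents(sampleRobots, ["webzio-extended", "webzio", "omgili", "omgilibot"]));
```

In a runtime with a global `fetch` (Node 18 or later, for example), the live file's text could be passed in instead of the sample string.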

Server-Level Blocking

For stricter enforcement — especially on high-value or paywalled content — add server-level rules that return 403 before the request is processed.

nginx

# In your server {} block
# Case-insensitive match; "webzio" also matches webzio-extended
if ($http_user_agent ~* "(omgili|omgilibot|webzio)") {
    return 403;
}

Apache (.htaccess)

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (omgili|omgilibot|webzio) [NC]
RewriteRule .* - [F,L]

Cloudflare WAF Rule

Dashboard → Security → WAF → Custom Rules → Create rule:

Field: User Agent

Operator: contains (apply for each)

Values: omgili · omgilibot · webzio-extended

Action: Block

Or, on plans that include the regex "matches" operator, use a single rule: (omgili|omgilibot|webzio-extended)

Next.js Middleware

// middleware.ts
import { NextRequest, NextResponse } from 'next/server';

const BLOCKED = /omgili|omgilibot|webzio/i;

export function middleware(req: NextRequest) {
  const ua = req.headers.get('user-agent') ?? '';
  if (BLOCKED.test(ua)) return new NextResponse(null, { status: 403 });
  return NextResponse.next();
}
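Note that the pattern above matches any user agent containing "webzio", so it blocks the general crawler along with webzio-extended. If you want the server-level rule to mirror the robots.txt choice between the two, the decision can be pulled into a small pure function. This is a sketch only; `shouldBlock` and its `blockGeneralCrawler` flag are hypothetical names, and the function deliberately has no Next.js dependency so it can be tested on its own.

```typescript
// "omgili" also matches "omgilibot", so one alternative covers both legacy agents.
const TRAINING_ONLY_PATTERN = /omgili|webzio-extended/i;
// "webzio" also matches "webzio-extended", so this blocks both current crawlers.
const ALL_CRAWLERS_PATTERN = /omgili|webzio/i;

function shouldBlock(userAgent: string | null, blockGeneralCrawler: boolean): boolean {
  const ua = userAgent ?? ""; // a missing header is treated as an empty string
  const pattern = blockGeneralCrawler ? ALL_CRAWLERS_PATTERN : TRAINING_ONLY_PATTERN;
  return pattern.test(ua);
}

// Illustrative (assumed) header values:
console.log(shouldBlock("webzio/1.0", false));
console.log(shouldBlock("webzio-extended/1.0", false));
```

Inside the middleware, the `BLOCKED.test(ua)` check could then become `shouldBlock(req.headers.get('user-agent'), false)`.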

Frequently Asked Questions

Does Webz.io respect robots.txt?

Yes — Webz.io officially states that both Webzio and Webzio-Extended respect robots.txt Disallow directives. The legacy Omgilibot also documented robots.txt compliance. That said, Webz.io is a commercial data broker with paying clients, so for high-value or paywalled content, combining robots.txt with server-level blocking is advisable.

What's the difference between webzio and webzio-extended?

Webz.io explicitly designed 'webzio-extended' as its AI training data crawler, separate from its general search index crawler ('webzio'). If you only want to prevent AI training use, block webzio-extended. Blocking 'webzio' may additionally reduce referral traffic from search tools that license Webz.io's data. For full protection, block both.

Should I block omgili, omgilibot, and webzio-extended, or just one?

Block all three. Omgilibot is the legacy user agent (pre-2024) still appearing in crawl logs. Omgili is an even older variant. Webzio-extended is the current AI-specific crawler. Because Webz.io rotated user agent identities over time, you need all three Disallow directives for comprehensive coverage.

Is Webz.io the same as Omgili?

Yes. Webz.io is the company behind Omgili. Omgili was an independent web content intelligence service that Webz.io acquired. The Omgilibot user agent is now legacy — replaced by the 'Webzio Duo' in 2024 — but legacy infrastructure still uses the old user agents.

Does blocking Webz.io affect Google or my SEO?

No. Webz.io operates entirely separately from Google, Bing, and other search engines. Blocking omgili, omgilibot, and webzio-extended in robots.txt has zero effect on Googlebot or your search rankings.
