Parsing robots.txt for 10 AI Crawlers: Wildcards, Partial Blocks, Line Numbers

robots.txt parsing looks like a weekend job. It is a flat text file. Each line is a directive. You split on the colon, match the user agent, check whether a path is disallowed. How hard can it be?

Then you start feeding it real files. You hit a group that opens with three User-agent lines and one rule block. You hit a Disallow: /*? that means more than its author thought. You hit a file that 404s over HTTPS but loads over HTTP. You hit comments mid-line, mixed casing, and a Disallow: with nothing after it. The weekend job grows teeth.
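To make a few of those edge cases concrete, here is a minimal sketch in Python. It is not the AI Crawler Checker's code; the names `parse_groups`, `pattern_to_regex`, and `is_allowed` are hypothetical. It assumes the longest-match rule from RFC 9309 (the most specific pattern wins, and Allow wins a tie), and it simplifies user-agent matching to an exact, case-insensitive comparison. It handles shared groups, mid-line comments, mixed casing, an empty Disallow, and `*` and `$` in paths, and it keeps the line number of each rule so a verdict can point back at the file.

```python
import re

def parse_groups(text):
    """Return a list of (agents, rules); each rule is (kind, path, line_no)."""
    groups, agents, rules = [], [], []
    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()      # drop mid-line comments
        if ":" not in line:
            continue                             # blank or malformed line
        field, value = line.split(":", 1)        # split on the first colon only
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if rules:                            # rules already seen: a new group starts here
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())         # consecutive User-agent lines share one group
        elif field in ("allow", "disallow") and agents:
            rules.append((field, value, line_no))
    if agents:
        groups.append((agents, rules))
    return groups

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern: '*' is any run of characters, trailing '$' anchors the end."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_allowed(groups, agent, path):
    """Return (allowed, line_no of the deciding rule, or None if no rule matched)."""
    agent = agent.lower()
    matched = [rs for a, rs in groups if agent in a]
    if not matched:                              # no named group: fall back to '*'
        matched = [rs for a, rs in groups if "*" in a]
    best = ("allow", "", None)                   # default when nothing matches
    for kind, pattern, line_no in (r for rs in matched for r in rs):
        if not pattern:
            continue                             # an empty Disallow blocks nothing
        if pattern_to_regex(pattern).match(path):
            longer = len(pattern) > len(best[1])
            tie_for_allow = len(pattern) == len(best[1]) and kind == "allow"
            if longer or tie_for_allow:
                best = (kind, pattern, line_no)
    return best[0] == "allow", best[2]

SAMPLE = """\
User-agent: GPTBot
user-agent: ChatGPT-User   # mixed casing, shared group
Disallow: /*?
Allow: /

User-agent: *
Disallow:
"""

groups = parse_groups(SAMPLE)
print(is_allowed(groups, "GPTBot", "/search?q=1"))   # blocked by the rule on line 3
print(is_allowed(groups, "GPTBot", "/about"))        # allowed by the rule on line 4
print(is_allowed(groups, "ClaudeBot", "/anything"))  # falls back to '*', nothing blocks it
```

Note what `Disallow: /*?` does here: it blocks every URL that contains a query string, not just one path. That is the kind of partial block that reads as harmless in the file and turns out to cover half the site.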

We built the AI Crawler Checker to answer one narrow question well: for a given domain, which of the major AI crawlers can read it, and which cannot. We grade against ten specific user agents:

GPTBot: ChatGPT and OpenAI, training and search

ChatGPT-User: ChatGPT live browsing

Related reading

The New Information Borders

Your robots.txt says GPTBot is welcome. Your server says 403.

llms.txt for AI Discoverability: Should You Add It?

What AI Crawlers Actually Do to a Small Blog: 9 Days of Logs

Your recurring scraper is re-downloading data that didn't change. Here's the…

I spent 3 days writing regexes. Then I asked an AI to do it in 10 minutes.
