# Robots Exclusion Protocol (robots.txt)

The IETF-standard file at /robots.txt that tells crawlers which paths they may fetch — and where AEO sites get the training-vs-retrieval split wrong.

By AgentSite · 3 min read · Updated 2026-05-23

The Robots Exclusion Protocol is the IETF-standard file at `/robots.txt` that tells crawlers which paths they may fetch. RFC 9309 formalized the protocol in September 2022, codifying the convention Martijn Koster proposed in 1994. For AEO, the file is where the training-bot-versus-retrieval-bot split happens — get the split wrong and you disappear from retrieval engines.

## What the spec says

RFC 9309 ("Robots Exclusion Protocol") describes itself as a specification that "extends the 'Robots Exclusion Protocol' method originally defined by Martijn Koster in 1994 for service owners to control how content served by their services may be accessed, if at all, by automatic clients known as crawlers" ([RFC 9309, IETF 2022](https://datatracker.ietf.org/doc/html/rfc9309)). The standardization added formal definitions, error-handling procedures, and caching guidelines to the original convention.

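The file is plain text at the root of the site. Each group names one or more `User-agent:` lines and lists `Allow:` and `Disallow:` rules. A crawler obeys the group that matches its product token and falls back to the `*` group if none does. Within that group, the most specific matching rule (the one with the longest path) wins, and an `Allow` beats an equivalent `Disallow`; if no rule matches, the path is allowed by default.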
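
The precedence rule can be sketched in a few lines of Python. This is a minimal illustration, assuming plain path prefixes only: the `*` and `$` special characters that RFC 9309 also defines are not handled, and the `is_allowed` helper and its rule tuples are hypothetical names, not part of any library.

```python
from typing import Iterable, Tuple

Rule = Tuple[str, str]  # ("allow" | "disallow", path prefix)

def is_allowed(rules: Iterable[Rule], path: str) -> bool:
    """Longest-match precedence for one user-agent group (prefix rules only)."""
    best_len = -1
    allowed = True  # no matching rule means the path is allowed
    for kind, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # this sketch treats an empty rule value as matching nothing
        is_allow = kind.lower() == "allow"
        # Longer prefix wins; on an equal-length tie, Allow wins.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed

rules = [("disallow", "/docs/"), ("allow", "/docs/public/")]
print(is_allowed(rules, "/docs/public/intro"))  # True
print(is_allowed(rules, "/docs/internal"))      # False
print(is_allowed(rules, "/blog/post"))          # True
```

The rule order in the list does not matter here: only the length of the matching prefix and the Allow-on-tie preference decide the outcome.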

## Training bot vs. retrieval bot

Major AI vendors run multiple bots with distinct purposes:

- **Training crawlers** (GPTBot, ClaudeBot, Google-Extended) read pages to build the model's parametric knowledge. Opt-outs against these reduce what the next model release knows about you.
- **Retrieval crawlers** (OAI-SearchBot, Claude-SearchBot, PerplexityBot) read pages at user-query time and produce live citations. Opt-outs against these directly cost you visibility in answer-engine responses.
- **User-initiated fetches** (ChatGPT-User, Claude-User) run when a human pastes a URL into the chat surface.

The naive "block all AI" robots.txt that became popular in 2023-2024 catches all three populations. The result is a site invisible to live retrieval — the answer-engine citation surface that's most directly tied to actual referral traffic and brand mentions.
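
One way to see the damage is to run a blanket-block file through a parser and report which bot populations it shuts out. The sketch below uses Python's standard `urllib.robotparser`; the sample file, the `BOT_ROLES` grouping, and the `blocked_roles` helper are illustrative assumptions, with the bot names taken from the list above.

```python
from urllib.robotparser import RobotFileParser

# Grouping taken from the list above; extend as vendors add bots.
BOT_ROLES = {
    "training": ["GPTBot", "ClaudeBot", "Google-Extended"],
    "retrieval": ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"],
    "user-initiated": ["ChatGPT-User", "Claude-User"],
}

# Hypothetical "block all AI" file: one Disallow-everything group per bot.
NAIVE_ROBOTS_TXT = "\n\n".join(
    f"User-agent: {bot}\nDisallow: /"
    for bots in BOT_ROLES.values()
    for bot in bots
)

def blocked_roles(robots_txt: str, url: str = "https://example.com/") -> dict:
    """Map each role to the bots that may not fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        role: [bot for bot in bots if not parser.can_fetch(bot, url)]
        for role, bots in BOT_ROLES.items()
    }

for role, bots in blocked_roles(NAIVE_ROBOTS_TXT).items():
    print(f"{role}: {len(bots)} blocked -> {', '.join(bots)}")
```

Running the same helper against a site's real robots.txt text gives a quick read on whether the retrieval row is empty, which is the state an AEO site usually wants.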

## The default-block problem

Cloudflare made blanket blocking easier in July 2024 by launching a one-click toggle that blocks AI bots at the edge, available to every customer including those on the free tier ([Cloudflare, July 2024](https://blog.cloudflare.com/declaring-your-aindependence-block-ai-bots-scrapers-and-crawlers-with-a-single-click/)). Because the block is enforced at the edge rather than in robots.txt, it is easy to overlook: many operators flipped it on, forgot, and stopped showing up in citation pools they meant to be in. The blocked traffic is substantial. Vercel measured 569 million GPTBot fetches across its network in a single month ([Vercel, Dec 2024](https://vercel.com/blog/the-rise-of-the-ai-crawler)).

## A working pattern

A robots.txt that fits AEO:

```
# Block training crawlers (no referral traffic)
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Allow retrieval crawlers (these produce live citations)
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
```

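This configuration forgoes training-data inclusion in exchange for full retrieval-bot reachability. The user-initiated fetchers (ChatGPT-User, Claude-User) are not listed, so they remain allowed by default. The opposite trade (allow training, block retrieval) is almost always a mistake.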
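
If the file is generated rather than hand-edited, the same trade can be expressed as a small policy table. The following Python sketch is one possible approach, not a prescribed tool: `POLICY` and `render_robots_txt` are hypothetical names, `https://example.com/pricing` is a placeholder URL, and the output is checked with the standard library's `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy table mirroring the pattern above: True = allow, False = block.
POLICY = {
    "GPTBot": False,
    "ClaudeBot": False,
    "Google-Extended": False,
    "OAI-SearchBot": True,
    "Claude-SearchBot": True,
    "PerplexityBot": True,
}

def render_robots_txt(policy: dict) -> str:
    """Emit one User-agent group per bot, separated by blank lines."""
    groups = [
        f"User-agent: {bot}\n{'Allow' if allowed else 'Disallow'}: /"
        for bot, allowed in policy.items()
    ]
    return "\n\n".join(groups) + "\n"

robots_txt = render_robots_txt(POLICY)

# Round-trip check: the parsed file should agree with the policy table.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
for bot, allowed in POLICY.items():
    assert parser.can_fetch(bot, "https://example.com/pricing") == allowed

print(robots_txt)
```

Keeping the policy in one table makes the training-versus-retrieval decision explicit and reviewable, and the round-trip check catches a group that was flipped by accident.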

## Where this fits

robots.txt is Layer 1 infrastructure in the [five-layer AEO model](/five-layer-aeo). It pairs with the other Layer-1 failure modes catalogued in [SSR-junk and bot walls](/ssr-junk-bot-wall). The longer thesis is in [agent readability](/agent-readability).