# What llms.txt is and what it's for

A curated markdown index of the site, shaped to fit in an agent's context window — not a sitemap, not a robots.txt, not a training opt-out.

By AgentSite · 6 min read · Updated 2026-05-23

llms.txt is a markdown file at the root of a website that gives AI agents a curated index of the site's most quotable content. Jeremy Howard proposed it on September 3, 2024. Unlike sitemap.xml — a complete machine index for search crawlers — llms.txt is a hand-edited overview shaped to fit inside an agent's context window.

The file lives at `/llms.txt`, at the root of the site, following the same fixed-location convention as `robots.txt`. It is intended for _retrieval-time_ use: when an agent is answering a user's query and reaches for the site, the file gives it a fast, structured pointer into the parts of the site worth quoting from.

## The format

The spec at [llmstxt.org](https://llmstxt.org/) defines the file as plain markdown with a small set of conventions, paraphrased from the project's own example:

```markdown
# Site title

> One-sentence summary of what the site is.

## Docs

- [Page Title](https://example.com/path/page.md): One-line description of what this page covers.
- [Another Page](https://example.com/path/other.md): One-line description.

## Examples

- [Example A](https://example.com/examples/a.md): What the example demonstrates.

## Optional

- [Resource Z](https://example.com/resource-z): Less critical pointer.
```

Each entry is a markdown link, optionally followed by a colon and a brief description. The `## Optional` section signals "load these only if the agent has token budget left." Sections above it are considered load-bearing.
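To make the conventions concrete, here is a minimal parsing sketch in Python. It is not a reference implementation: the `Entry` and `LlmsTxt` types and the `parse_llms_txt` function are hypothetical names, and the regular expression only handles the simple one-line entry shape shown in the example above.

```python
import re
from dataclasses import dataclass, field

# Matches "- [Title](url)" with an optional ": description" tail.
ENTRY_RE = re.compile(
    r"^-\s+\[(?P<title>[^\]]+)\]\((?P<url>[^)\s]+)\)(?::\s*(?P<desc>.*))?$"
)


@dataclass
class Entry:
    title: str
    url: str
    description: str = ""


@dataclass
class LlmsTxt:
    title: str = ""
    summary: str = ""
    sections: dict[str, list[Entry]] = field(default_factory=dict)


def parse_llms_txt(text: str) -> LlmsTxt:
    """Parse the title, summary, and H2 sections of an llms.txt document."""
    doc = LlmsTxt()
    current = None  # name of the H2 section currently being filled
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# ") and not doc.title:
            doc.title = line[2:].strip()
        elif line.startswith("> ") and not doc.summary:
            doc.summary = line[2:].strip()
        elif line.startswith("## "):
            current = line[3:].strip()
            doc.sections[current] = []
        elif current is not None:
            match = ENTRY_RE.match(line)
            if match:
                description = (match["desc"] or "").strip()
                doc.sections[current].append(
                    Entry(match["title"], match["url"], description)
                )
    return doc


SAMPLE = """# Site title

> One-sentence summary of what the site is.

## Docs

- [Page Title](https://example.com/path/page.md): One-line description.

## Optional

- [Resource Z](https://example.com/resource-z)
"""

doc = parse_llms_txt(SAMPLE)
# Everything outside "Optional" is treated as load-bearing.
required = [e for name, items in doc.sections.items() if name != "Optional" for e in items]
```

A consumer that is short on tokens can read `required` first and only touch `doc.sections["Optional"]` if budget remains.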

Pointing the links at `/path/page.md` rather than `/path/page` is the convention. Agents read markdown more reliably than HTML, and the spec proposes that the site also ship per-page markdown mirrors, so that every page is available at both `/path` and `/path.md`.
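The mirror convention can be sketched as a small URL helper. The function name `markdown_mirror` is hypothetical. The trailing-slash branch follows my reading of the llmstxt.org proposal (append `index.html.md` when the URL has no file name); treat that branch as an assumption to check against the spec and against how your own site serves mirrors.

```python
from urllib.parse import urlsplit, urlunsplit


def markdown_mirror(url: str) -> str:
    """Return the assumed markdown-mirror URL for a page URL."""
    parts = urlsplit(url)
    path = parts.path
    if path.endswith(".md"):
        return url  # already points at a markdown mirror
    if path == "" or path.endswith("/"):
        # Assumption: URLs without a file name mirror at index.html.md.
        path = (path or "/") + "index.html.md"
    else:
        path = path + ".md"
    # Query string and fragment are carried over unchanged.
    return urlunsplit(parts._replace(path=path))
```

For example, `markdown_mirror("https://example.com/path/page")` yields `https://example.com/path/page.md`.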

## How it differs from sitemap.xml

The [sitemaps.org](http://sitemaps.org) protocol predates `llms.txt` by nearly two decades. The Sitemaps spec describes itself as enabling "you to provide details about your pages to search engines… you can provide additional information about site pages beyond just the URLs" ([Sitemaps.org protocol, last updated 2016](https://www.sitemaps.org/protocol.html)). It is XML, comprehensive, and aimed at indexers.

The two files target different consumers:

|  | `sitemap.xml` | `llms.txt` |
| --- | --- | --- |
| Format | XML | Markdown |
| Audience | Search crawlers (Googlebot, Bingbot) | Retrieval-time AI agents (ChatGPT, Claude, Perplexity) |
| Completeness | Every indexable URL | Curated subset of most-quotable pages |
| Size goal | Bounded by indexer limits (~50MB / 50K URLs per file) | Bounded by an LLM context window (a few KB to tens of KB) |
| Editing | Usually auto-generated | Usually hand-edited or templated |
| Granularity | URL plus optional metadata | URL plus a one-line description |

The two coexist. A site that ships both gives search crawlers the complete index they expect and gives agents the curated index they prefer.

## Why agents prefer it

Three forces converge here.

First, agents don't execute JavaScript. Vercel measured 569 million GPTBot requests and 370 million Claude requests across its network in a single month and reported that "none of the major AI crawlers currently render JavaScript" ([Vercel, "The Rise of the AI Crawler," Dec 2024](https://vercel.com/blog/the-rise-of-the-ai-crawler)). The same agents that can't render an SPA also can't crawl a multi-thousand-URL sitemap and form an opinion within a user's query latency. A curated index of, say, fifty links with descriptions parses in milliseconds.

Second, markdown is the native medium of the LLM stack. Training corpora contain a great deal of markdown, and tool calls commonly return it. Markdown also carries the same content in fewer tokens than HTML because it has no tag and attribute overhead, so more of it fits in a context window. A markdown index is the same shape the agent's other tools produce.

Third, the descriptions are signal, not noise. A sitemap URL plus `<lastmod>` tells a crawler what to fetch. A markdown link plus a one-line description tells an agent whether the page is relevant _before_ fetching. For a retrieval-time agent paying token cost per fetch, that filter is the value.
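The pre-fetch filter can be illustrated with a deliberately naive sketch: score each entry by word overlap between the user's query and the entry's title and description, then fetch only the top few. The function names and the overlap heuristic are assumptions made for illustration; a real agent would more likely let the model itself judge relevance from the descriptions.

```python
import re


def words(text: str) -> set[str]:
    """Lowercase alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_entries(
    query: str,
    entries: list[tuple[str, str, str]],  # (title, url, description)
    limit: int = 3,
) -> list[str]:
    """Return up to `limit` URLs whose title or description shares words with the query."""
    query_words = words(query)
    scored = []
    for title, url, description in entries:
        overlap = len(query_words & words(title + " " + description))
        if overlap:
            scored.append((overlap, url))
    # sort() is stable, so entries with equal scores keep their file order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [url for _, url in scored[:limit]]


ENTRIES = [
    ("Pricing", "https://example.com/pricing.md", "Plans, limits, and billing cycles."),
    ("Webhooks", "https://example.com/webhooks.md", "How to verify webhook signatures."),
    ("Changelog", "https://example.com/changelog.md", "Release notes by version."),
]

to_fetch = rank_entries("how do I verify a webhook signature", ENTRIES)
```

Only the webhooks page is selected here, so the agent pays for one fetch instead of three. The heuristic does no stop-word removal or stemming, so common words can produce spurious matches.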

## When to ship one

Two practical observations from running content engines that depend on this.

A site with fewer than ~20 pages probably gains more from making each page well-shaped (direct answer in the lede, FAQ schema, etc.) than from a curated index. The index optimization compounds with size.

A site over ~100 pages should treat llms.txt as a real editorial product, not a sitemap dump. The point of the file is to _say what's worth reading_ — listing every page defeats the curation that makes the file valuable to the consumer.

The natural pairing is per-page markdown mirrors: every linked `/path.md` returns a clean markdown version of `/path`, hydration scripts stripped, content preserved. The mirror is what the agent actually reads after the index narrows the candidate set.
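One way to keep the file curated yet templated is to render it from a small, hand-maintained list of entries. The sketch below assumes a hypothetical `render_llms_txt` helper and an in-code data structure; nothing here is part of the spec beyond the output shape shown in the format section.

```python
def render_llms_txt(
    title: str,
    summary: str,
    sections: dict[str, list[tuple[str, str, str]]],  # heading -> (text, url, description)
) -> str:
    """Render a curated set of links as llms.txt markdown."""
    lines = [f"# {title}", "", f"> {summary}", ""]
    for heading, entries in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for link_text, url, description in entries:
            # The description is optional; omit the colon when it is empty.
            suffix = f": {description}" if description else ""
            lines.append(f"- [{link_text}]({url}){suffix}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


text = render_llms_txt(
    "Example Site",
    "Docs for the Example API.",
    {
        "Docs": [("Guide", "https://example.com/guide.md", "Setup steps.")],
        "Optional": [("Changelog", "https://example.com/changelog.md", "")],
    },
)
size_bytes = len(text.encode("utf-8"))  # worth watching against a context-window budget
```

Because the input is a curated list rather than a crawl of every route, the editorial decision about what is worth reading stays with a person.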

## What llms.txt is not

A handful of common confusions worth naming:

-   It is **not** robots.txt. Robots.txt is access policy — what a bot may fetch. llms.txt is a content index — what an agent should read first.
-   It is **not** sitemap.xml. See the table above.
-   It is **not** a training opt-out signal. Training-bot opt-outs go in robots.txt via Disallow directives against `GPTBot`, `ClaudeBot`, and similar. llms.txt has nothing to say about training data.
-   It is **not** automatic indexing. Search engines do not currently advertise that they index llms.txt for ranking. The file targets retrieval-time agents, not crawl-time indexers.
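The robots.txt distinction can be checked with Python's standard-library `urllib.robotparser`. The robots.txt content below is an illustrative example, not a recommendation, and the user-agent tokens should be verified against each vendor's current documentation.

```python
from urllib.robotparser import RobotFileParser

# Example policy: opt out of two training crawlers, allow everything else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Access policy lives in robots.txt; llms.txt plays no part in this decision.
gptbot_allowed = parser.can_fetch("GPTBot", "https://example.com/docs/page")
claudebot_allowed = parser.can_fetch("ClaudeBot", "https://example.com/docs/page")
other_allowed = parser.can_fetch("Googlebot", "https://example.com/docs/page")
```

With this policy the two named bots are refused and every other user agent falls through to the wildcard rule.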

## Where this fits

llms.txt is the canonical Layer 2 artifact in the [five-layer AEO model](/five-layer-aeo): the agent-readable map of the site. Layer 1 ([rendered HTML the bot can read](/ssr-junk-bot-wall)) makes the pages reachable; Layer 2 makes the inventory navigable. Without Layer 1 the index points at empty shells; without Layer 2 the agent crawls page by page and runs out of time or tokens.

The longer argument for why agent readability is the foundation of AEO is in [agent readability](/agent-readability). The detailed API and convention notes for the AgentSite middleware, which emits `llms.txt` (and its longer companion `llms-full.txt`) automatically from cached page content, live in the [docs](/docs).