# Stale llms.txt

Your /llms.txt was auto-generated from every URL on the site and never re-edited. The file became a sitemap dump and stopped doing the one thing the format exists for — curation.

By AgentSite · 4 min read · Updated 2026-05-24

A stale `llms.txt` exists at the right URL and parses as markdown but doesn't do the editorial job the format is for. A build script dumped every URL on the site into it once, nobody edits it, and it reads like a sitemap rendered in markdown. The agent loses the curation signal that's the point of shipping the file.

## What it looks like

Four typical variants:

- **Sitemap-in-markdown.** Every URL from `sitemap.xml` rewritten as a bullet list under one or two headings. No descriptions, or descriptions auto-generated from the page title and stripped of meaning ("About — About page").
- **One-shot generation that never re-ran.** The file was generated when the site had 40 pages. The site now has 400. The other 360 don't appear; the 40 that do are partly stale URLs that 404.
- **Auto-generated from frontmatter with no editorial filter.** Every blog post, every tag page, every author page, every paginated archive — all land in the file with equal weight. The 200-word tag page sits next to the 3,000-word canonical with no visible difference.
- **Two-section files where `## Optional` is empty.** The spec's `Optional` section signals "load these only if you have token budget." With nothing listed under it, every link is implicitly marked essential, and when everything is load-bearing, nothing is.

In each case the file passes a "does `/llms.txt` exist?" check while failing the purpose check.

## How to detect it

Three quick measurements against your own file:

```bash
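URL='https://example.com/llms.txt'
FILE="$(mktemp)"
curl -fsS "$URL" -o "$FILE"   # fetch once, then measure the local copy
wc -c < "$FILE"               # total bytes
grep -c '^- \[' "$FILE"       # link count
# description lengths (the text after "](url)"), most common first
sed -n 's/^- \[[^]]*]([^)]*)[: ]*//p' "$FILE" | awk '{print length($0)}' | sort -n | uniq -c | sort -rn | head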
```

Signal interpretation:

- **Bytes greater than ~50 KB.** The file is probably past the budget agents will spend on it. The spec frames the file as small enough to load eagerly; a sitemap dump isn't.
- **Link count over a few hundred.** You probably emitted every URL on the site rather than curating.
- **Descriptions all the same length or all missing.** The descriptions are autogenerated, not editorial.
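The three signals can be folded into one check that runs against a local copy of the file. This is a minimal sketch, not a reference implementation: the function name `check_llms_txt` is hypothetical, the 50 KB and 300-link cutoffs are assumed stand-ins for "~50 KB" and "a few hundred", and descriptions are assumed to follow each link after a `: ` separator.

```bash
# Illustrative thresholds (assumptions, not values from the spec).
MAX_BYTES=51200   # roughly 50 KB
MAX_LINKS=300     # "a few hundred"

# Usage: check_llms_txt path/to/llms.txt
# Prints one WARN line per signal that fires; prints nothing otherwise.
check_llms_txt() {
  local file="$1" bytes links descs described distinct
  bytes=$(wc -c < "$file" | tr -d ' ')
  links=$(grep -c '^- \[' "$file" || true)
  # A description is whatever follows "](url)", minus the ": " separator.
  descs=$(sed -n 's/^- \[[^]]*]([^)]*)[: ]*//p' "$file")
  # Number of links that have a non-empty description.
  described=$(printf '%s\n' "$descs" | grep -c . || true)
  # Number of distinct description lengths among those.
  distinct=$(printf '%s\n' "$descs" | awk 'NF {print length($0)}' | sort -u | wc -l | tr -d ' ')

  [ "$bytes" -gt "$MAX_BYTES" ] && echo "WARN size: $bytes bytes"
  [ "$links" -gt "$MAX_LINKS" ] && echo "WARN links: $links"
  [ "$links" -gt 0 ] && [ "$described" -eq 0 ] && echo "WARN descriptions: all missing"
  [ "$described" -gt 1 ] && [ "$distinct" -eq 1 ] && echo "WARN descriptions: all the same length"
  return 0
}
```

Download the file first (for example `curl -fsS "$URL" -o llms.txt`), then run `check_llms_txt llms.txt`. No output means none of the three signals fired.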

The qualitative check is harder but more diagnostic: read your own `llms.txt` as if you were an agent answering a user query in your category. If the file does not tell you which three or five pages to read first, it is not doing its job.

## Why it costs you citations

The format itself defines the goal. The [llms.txt specification](https://llmstxt.org/) describes the file as "information that can help LLMs use a website at inference time" and notes that it is "designed to coexist with current web standards," explicitly distinct from `sitemap.xml`. The [sitemaps.org protocol](https://www.sitemaps.org/protocol.html) targets a different consumer: search-engine crawlers that want every indexable URL with last-modification metadata. The two files are not substitutes — a sitemap rendered in markdown is the worst of both.

The agent reading your file has a fixed token budget for that fetch. Vercel measured 569 million GPTBot fetches and 370 million Claude fetches in a single month and reported that none of the major AI crawlers render JavaScript ([Vercel, Dec 2024](https://vercel.com/blog/the-rise-of-the-ai-crawler)). The same fetch-economy logic applies here: a few-KB curated index burns a small slice of context and points the agent at the three pages worth reading next. A 200-KB sitemap dump consumes far more of that context window and carries no editorial signal, so the agent either truncates it and reads an arbitrary prefix, or skips the file entirely and falls back to the sitemap it was already going to crawl.
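To put rough numbers on that budget, the arithmetic below converts file size to an approximate token count. It assumes about four bytes per token, which is a common rule of thumb for English text and not a figure from the Vercel report or the spec. Real tokenizers vary, and URL-heavy files usually tokenize less efficiently. `estimate_tokens` and `BYTES_PER_TOKEN` are names made up for this sketch.

```bash
# Assumption: roughly 4 bytes per token (heuristic only).
BYTES_PER_TOKEN=4

# Usage: estimate_tokens <size-in-bytes>
estimate_tokens() {
  echo $(( $1 / BYTES_PER_TOKEN ))
}

estimate_tokens 4096     # a 4 KB curated index: prints 1024
estimate_tokens 204800   # a 200 KB dump: prints 51200
```

Under that assumption, the dump costs about fifty times as many tokens as the curated index before the agent has read a single content page.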

Per the [May 2026 adoption probe](/llms-txt-field-report-2026-05) of ten major AI and developer-infrastructure sites, the four that ship a real file all ship it on the documentation surface, in the small-curated shape the spec describes — not as a generated dump of every URL. Vercel's docs file is 167 KB but it's a curated docs index with descriptions, not a sitemap clone; Next.js ships a 7.7 KB file. Both are editorial.

The downstream consequence for AEO is that the curation signal becomes a negative signal. An agent that fetches `/llms.txt` and finds 8,000 undifferentiated links learns the same thing it would learn from a missing file — that nobody on your side has done the work of saying which pages are worth quoting. The fetch was free; the answer was empty.

## How to fix it

The architectural fix is to treat `llms.txt` as an editorial product, not a build artifact:

1. **Start from a hand-picked allowlist.** A page belongs in `llms.txt` only if you would proactively recommend it as the answer to a question in your category. Most sites have somewhere between ten and two hundred such pages. The rest belong in `sitemap.xml`, not here.
2. **Write real descriptions.** One line per link, written by a human (or generated by a model and reviewed by a human). The description is what tells the agent whether to fetch the linked page. "About the team — About page" is not a description.
3. **Use the `## Optional` section.** The spec defines it for "less critical pointers." Anything you'd hate to lose from the citation graph goes above the section header. Anything you'd defend if pressed but isn't load-bearing goes below.
4. **Re-edit when the catalog grows.** Track a last-edited date for the file itself (a `dateModified` value) and surface it in your build as a forcing function. If the file hasn't been edited in six months and you've shipped 30 pages since, the file is stale.
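Steps 2 to 4 can be partly enforced at build time. The sketch below is one possible lint, assuming links are written as `- [title](url): description` and that file modification time is an acceptable proxy for "last edited". All function names are hypothetical, and the 180-day window is an assumed stand-in for "six months". Step 1 is an editorial decision and is not something a script can check.

```bash
MAX_AGE_DAYS=180   # assumed stand-in for "six months"

# Step 2: count links that have no text after "](url)".
count_undescribed() {
  sed -n 's/^- \[[^]]*]([^)]*)[: ]*//p' "$1" | grep -c '^$' || true
}

# Step 3: count links listed under the "## Optional" heading.
count_optional() {
  awk '/^## / { in_opt = ($0 ~ /^## Optional *$/) }
       in_opt && /^- \[/ { n++ }
       END { print n + 0 }' "$1"
}

# Step 4: succeed when the file's mtime is more than MAX_AGE_DAYS days old.
is_stale() {
  [ -n "$(find "$1" -mtime +"$MAX_AGE_DAYS" -print)" ]
}

# Print one line per problem found; print nothing for a clean file.
lint_llms_txt() {
  local file="$1" undescribed
  undescribed=$(count_undescribed "$file")
  [ "$undescribed" -gt 0 ] && echo "links without descriptions: $undescribed"
  [ "$(count_optional "$file")" -eq 0 ] && echo "no links under ## Optional"
  is_stale "$file" && echo "not edited in over $MAX_AGE_DAYS days"
  return 0
}
```

One caveat on the age check: a fresh checkout in CI usually resets modification times, so a build pipeline would probably want the last commit date instead (for example from `git log -1 --format=%ct -- llms.txt`).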

Per-page markdown mirrors (`/path` also served at `/path.md`) are the natural companion. The index narrows the agent's candidate set; the mirror is what it actually reads.
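A quick way to see whether the mirrors exist is to derive a mirror URL for each link in the index and request it. The sketch below assumes the `/path` to `/path.md` convention described above. For URLs that end in a slash it appends `index.html.md`, which is how the llms.txt proposal is understood to handle URLs without a file name; treat that as an assumption and adjust it to your own routing. The function names are hypothetical, and query strings and fragments are not handled.

```bash
# Print the URL of every "- [title](url)" link read from stdin.
list_links() {
  sed -n 's/^- \[[^]]*](\([^)]*\)).*/\1/p'
}

# Derive the assumed markdown-mirror URL for a page URL.
mirror_url() {
  case "$1" in
    *.md) echo "$1" ;;                # already a markdown URL
    */)   echo "${1}index.html.md" ;; # no file name (assumed convention)
    *)    echo "$1.md" ;;
  esac
}

# Print "<http status> <mirror url>" for each link. Needs network access.
# Usage: check_mirrors path/to/llms.txt
check_mirrors() {
  local url mirror
  list_links < "$1" | while read -r url; do
    mirror=$(mirror_url "$url")
    printf '%s %s\n' "$(curl -sS -o /dev/null -w '%{http_code}' "$mirror")" "$mirror"
  done
}
```

Any line of `check_mirrors llms.txt` output that does not start with `200` points at a page the index recommends but the agent cannot read as markdown.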

## Where AgentSite fits

The AgentSite render service generates `llms.txt` and `llms-full.txt` from cached page content with size targets, deduplication, and a configurable inclusion filter. The default emits a curated index sized to fit comfortably in a single retrieval-time fetch, not a dump of every cached URL. For sites that hand-author the file, AgentSite leaves it alone — the generator runs only when no file exists at the path. Customers who want a hybrid (auto-generated body + hand-curated header) configure that through the rules engine. The detection above is what the AEO Report uses to flag the dump-shaped file as a problem.

## Related problems

- [What llms.txt is and what it's for](/llms-txt): the canonical definition of the format whose failure mode this page describes.
- [llms.txt adoption across 10 sites](/llms-txt-field-report-2026-05): what files in the wild look like.
- [The five layers of AEO](/five-layer-aeo): the structural map this problem sits in (Layer 2).
- The full catalog: [AEO problems](/aeo-problems).