# The AEO problems catalog

Fourteen ways a site silently stops getting cited by AI engines, grouped by which layer of the AEO stack breaks. Most sites have at least three of these right now and don't know it.

By AgentSite · 5 min read · Updated 2026-05-24

This is the catalog of ways your site can fail to be cited by AI engines. The failures stay silent because failed bot reads don't surface in analytics, which is why most sites carry several without knowing. The catalog groups them by the layer of [the five-layer AEO model](/five-layer-aeo) they break.

Each linked page is the definitive treatment of that problem. Entries marked _(on the roadmap)_ have catalog rows here today and full pages on the way. If you're trying to figure out which one is yours, [run your AEO score](/score): it checks for most of these in 90 seconds.

## Layer 1 — Crawler can't reach your content

-   **SPA empty shell.** Your single-page app renders client-side; AI crawlers see `<div id="app"></div>` and leave. Vercel measured 569 million GPTBot fetches and 370 million Claude fetches in a single month, none of them executing JavaScript ([Vercel, Dec 2024](https://vercel.com/blog/the-rise-of-the-ai-crawler)). Treated in depth in [SSR-junk and bot walls](/ssr-junk-bot-wall).
-   **Bot wall.** Your CDN returns 403 to identified AI crawlers (GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot) while serving 200 to browsers. Usually a default that nobody turned on intentionally. See [SSR-junk and bot walls](/ssr-junk-bot-wall).
-   **SSR-junk.** Your server-rendered HTML is 60-90% script, 1-5% visible text. The crawler reads the page and finds almost no prose. See [SSR-junk and bot walls](/ssr-junk-bot-wall).
-   **Block-all-AI robots.txt.** Your robots.txt blocks training crawlers (GPTBot, ClaudeBot, Google-Extended) and retrieval crawlers (OAI-SearchBot, Claude-SearchBot, PerplexityBot) in one sweep. The training block is policy. The retrieval block costs you the citation. See [Robots Exclusion Protocol](/robots-exclusion-protocol).
-   **Soft 404.** Your SPA returns HTTP 200 for missing pages instead of a real 404. Domain quality drops; crawlers cache empty shells as if they were content. _(On the roadmap.)_
-   **Hash routing.** Your SPA uses `#/path` URLs; bots see one URL (`/`) regardless of how many routes you ship. _(On the roadmap.)_
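
Two of the Layer 1 checks can be approximated offline once you have a page's raw HTML and your robots.txt as strings. The sketch below is an illustration, not the scoring method behind the AEO score: `visible_text_ratio`, `robots_verdicts`, and `SAMPLE_ROBOTS` are hypothetical names, and Python's standard-library robots parser stands in for each crawler's own matching rules, which may differ.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class VisibleTextParser(HTMLParser):
    """Collects text that sits outside script, style, noscript, and template."""

    SKIPPED = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def visible_text_ratio(html):
    """Share of the raw HTML's characters that are visible text (0.0 to 1.0)."""
    if not html:
        return 0.0
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    # Collapse whitespace so indentation does not count as prose.
    text = " ".join(" ".join(parser.chunks).split())
    return len(text) / len(html)


def robots_verdicts(robots_txt, agents, path="/"):
    """Map each user-agent token to True (may fetch path) or False (blocked)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, path) for agent in agents}


# Illustrative policy: block the training tokens, leave retrieval crawlers alone.
SAMPLE_ROBOTS = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""
```

A ratio of 0.0 on the HTML a crawler receives points at an empty shell, and a ratio inside the 1-5% range named above points at SSR-junk. Calling `robots_verdicts(SAMPLE_ROBOTS, ["GPTBot", "OAI-SearchBot"])` shows the split the robots.txt entry describes: the training crawler is blocked and the retrieval crawler is not.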

## Layer 2 — Bot reaches pages but can't navigate the site

-   **No `/llms.txt`.** Your site has no curated markdown index for AI agents. Sitemap.xml exists but is too large to fit a context window. See [What llms.txt is and what it's for](/llms-txt).
-   **Stale `/llms.txt`.** Your `/llms.txt` is auto-generated from every URL on the site, so it has become a sitemap dump. The whole point of the file is curation, and a dump defeats it. See [stale llms.txt](/stale-llms-txt).
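
To make "curated" concrete, here is a minimal sketch that renders a hand-picked set of pages into a markdown index. It assumes the commonly proposed llms.txt layout (an H1 title, a blockquote summary, then H2 sections of annotated links); `build_llms_txt` and its arguments are hypothetical, and the sample entries in the usage note are placeholders.

```python
def build_llms_txt(title, summary, sections):
    """Render a curated index as markdown.

    sections maps a section heading to a list of (name, url, note) tuples.
    The caller chooses which pages go in; nothing is crawled or auto-added.
    """
    lines = [f"# {title}", "", f"> {summary}"]
    for heading, links in sections.items():
        lines += ["", f"## {heading}", ""]
        for name, url, note in links:
            lines.append(f"- [{name}]({url}): {note}")
    return "\n".join(lines) + "\n"
```

For example, `build_llms_txt("Example Site", "What the site covers.", {"Guides": [("Five-layer AEO", "/five-layer-aeo", "The model the catalog is grouped by")]})` returns a file with a single annotated link. Because the input is an explicit list rather than a sitemap, the output can only grow when someone decides a page belongs in it.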

## Layer 3 — Schema doesn't say what kind of thing each page is

-   **No FAQPage schema on Q&A content.** You have FAQ sections in the body but no `FAQPage` JSON-LD. Agents that lift verbatim Q&A pairs walk past you. Cloudflare reported that AI bots accessed about 39% of the top one million Internet properties using Cloudflare in a single month ([Cloudflare, July 2024](https://blog.cloudflare.com/declaring-your-aindependence-block-ai-bots-scrapers-and-crawlers-with-a-single-click/)), and every one of those bots is reading whatever schema you ship. See [FAQ schema](/faq-schema).
-   **Schema-content mismatch.** Your JSON-LD says one thing (Article with author "Jane Doe") and the page body says another (no author byline). Engines that parse both flag the mismatch and distrust both. _(On the roadmap.)_
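
As a rough illustration of both entries, the sketch below builds `FAQPage` JSON-LD from question and answer pairs and flags questions that the schema declares but the page body does not contain. The `@type` and property names come from schema.org; the helper names (`faq_jsonld`, `as_script_tag`, `questions_missing_from_body`) are hypothetical, and the substring comparison is a deliberately crude assumption about what counts as a match.

```python
import json


def faq_jsonld(pairs):
    """Build a schema.org FAQPage object from (question, answer) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }


def as_script_tag(data):
    """Serialize JSON-LD for embedding in the page's HTML."""
    # Escape "</" so answer text cannot close the script element early.
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


def questions_missing_from_body(pairs, body_text):
    """Return questions present in the schema but absent from the visible body."""
    body = " ".join(body_text.split()).lower()
    return [q for q, _ in pairs if " ".join(q.split()).lower() not in body]
```

Generating the JSON-LD from the same pairs that render the visible FAQ is one way to keep the two from drifting apart; `questions_missing_from_body` is a cheap check for cases where they are maintained separately.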

## Layer 4 — Content reaches the bot but isn't worth quoting

-   **Buried answer.** Your first paragraph isn't an answer; it's an introduction. The agent extracts from the top of the body and finds setup, not substance. See [Direct answer](/direct-answer).
-   **Vague attribution.** Your body says "studies show" and "experts say" without naming the study or the expert. The Princeton GEO paper found "Cite Sources" to be one of the top three tactics, with a measured lift of 30-40% ([Aggarwal et al., KDD 2024](https://arxiv.org/abs/2311.09735)). Vague attribution gives you the cost of citation prose with none of the lift. See [Statistics and citations](/statistics-citations).
-   **Date inflation.** You bumped `dateModified` without changing the content. Major engines detect the no-content-change case and discount future date signals from your domain. See [Content recency](/content-recency).
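
One way to avoid date inflation is to tie `dateModified` to a fingerprint of the content, so the date can only move when the text does. This is a minimal sketch under that assumption; `content_fingerprint` and `next_date_modified` are hypothetical helpers, and where you store the previous fingerprint (CMS field, build cache) is left open.

```python
import hashlib
from datetime import date


def content_fingerprint(body):
    """SHA-256 of the body with whitespace collapsed, so reflowing is not a change."""
    normalized = " ".join(body.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def next_date_modified(previous_fingerprint, previous_date, body, today):
    """Return (dateModified, fingerprint) to store for the current body.

    The date advances to `today` only when the fingerprint differs from the
    one recorded at the last publish.
    """
    fingerprint = content_fingerprint(body)
    if fingerprint == previous_fingerprint:
        return previous_date, fingerprint
    return today, fingerprint
```

A republish with identical text keeps the old date, and an edit that changes the wording moves it to `today`. A hash treats any wording change as a real one, so a team that wants to ignore trivial typo fixes would need a looser comparison than this.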

## Layer 5 — Off-site signals aren't there

-   **Unmeasured citation.** You don't run a mention probe against the engines that matter for your category. You're working on AEO blind — you don't know which pages get cited, by which engine, for which prompts, or which competitors come up instead. Without measurement, every other layer is faith-based. _(On the roadmap.)_
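
Running the probe itself depends on each engine's interface and is out of scope here, but the bookkeeping afterwards is simple. The sketch below assumes you have already collected results as dictionaries with `engine`, `prompt`, and `cited_urls` keys (a hypothetical record shape, not any engine's output format) and counts, per engine, how many prompts cited each domain.

```python
from collections import Counter
from urllib.parse import urlparse


def tally_citations(probe_results):
    """Count, per engine, the number of prompts whose answer cited each domain."""
    by_engine = {}
    for result in probe_results:
        counts = by_engine.setdefault(result["engine"], Counter())
        domains = set()
        for url in result["cited_urls"]:
            host = (urlparse(url).hostname or "").removeprefix("www.")
            if host:
                domains.add(host)
        # A set means a domain counts once per prompt, however many URLs it had.
        counts.update(domains)
    return by_engine
```

Comparing your own domain's count with the other domains in the same `Counter` answers the "which competitors come up instead" question for that engine and prompt set.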

## What this catalog is for

It's a reference. If you're searching for "why isn't ChatGPT citing my site" or "how do I tell if my SPA is invisible to AI crawlers," this is the page that names what's actually wrong. The deeper pages explain how to detect each problem, why it costs you citations, and how to fix it.

The audience has two halves: developers and vibe-coders who can read a curl probe, and marketers and content teams who need to know which problem belongs to which team. Each linked page speaks to both.

## What to do next

[Run your AEO score](/score) and the report will name which of these problems your site has. Eight dimensions, five layers, plus a live citation probe across the engines that matter for your category.

The argument for why agent readability is the foundation of AEO is in [agent readability](/agent-readability).