# SSR-junk and bot walls

Two ways a server-rendered site silently disappears from AI citations — and the curl recipe that finds either one in under a minute.

By AgentSite · 5 min read · Updated 2026-05-23

Server-side rendering does not guarantee an AI crawler can read your page. Two failure modes look identical to a browser and silently break Layer 1 of AEO: **SSR-junk**, where the rendered HTML is mostly inline script and almost no readable text; and a **bot wall**, where AI crawlers get 403 while browsers get 200. Both are common.

Neither failure surfaces in the site's analytics, in its monitoring, or in any AI-visibility dashboard. The crawler reaches the page, finds nothing it can use, and leaves. There is no error to investigate; the page simply stops appearing in citations.

## SSR-junk

A page is SSR-junk when it looks correct under inspection in a browser (the headings, the paragraphs, and the schema all appear in DevTools after the JS boots) but the _bytes on the wire_, as returned to a non-JavaScript client, are 60-90% script, hydration JSON, or framework runtime, and the visible-text portion is a few percent of the total payload.

The byte composition matters because major AI crawlers do not execute JavaScript. Vercel measured 569 million GPTBot requests and 370 million Claude requests in a single month and reported that "none of the major AI crawlers currently render JavaScript" ([Vercel, "The Rise of the AI Crawler," Dec 2024](https://vercel.com/blog/the-rise-of-the-ai-crawler)). A bot fetching an SSR-junk page receives the HTML, scans for extractable content, and finds a small fraction of the bytes are actually prose. Some extractors recover those words; many do not. The page passes a "view source" smoke test and still loses citations.

The diagnostic is byte composition, not headless rendering. Open the raw response with a non-JS client and count:

```bash
URL='https://example.com/'

# Composition: total bytes, script bytes, visible-text bytes, word count.
curl -sS -o /tmp/x.html \
  -A 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36' \
  "$URL"

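python3 - <<'PY'
import re

# Decode leniently so one bad byte does not abort the count. Sizes are
# character counts, which track bytes closely for mostly-ASCII markup.
src = open('/tmp/x.html', encoding='utf-8', errors='replace').read()
total = max(len(src), 1)  # guard against an empty response

# Script share: every <script>...</script> element, including inline JSON.
scripts = re.findall(r'<script\b[^>]*>.*?</script>', src, re.S | re.I)
ss = sum(len(s) for s in scripts)

# Visible text: drop script/style/noscript, strip tags, collapse whitespace.
stripped = re.sub(
    r'<script\b[^>]*>.*?</script>|<style\b[^>]*>.*?</style>|<noscript\b[^>]*>.*?</noscript>',
    '', src, flags=re.S | re.I)
text = re.sub(r'\s+', ' ', re.sub(r'<[^>]+>', ' ', stripped)).strip()

print(f"total={len(src):,}  scripts={ss:,} ({ss * 100 / total:.1f}%)  "
      f"text={len(text):,} ({len(text) * 100 / total:.1f}%)  "
      f"words={len(text.split()):,}")
PY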
```

A healthy page on a marketing route returns a single-digit script percentage and a visible-text percentage in the high tens or higher, with word counts in the hundreds to thousands. An SSR-junk page returns a script percentage in the 50-70 range and a visible-text percentage in the low single digits.
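The ranges above can be turned into a rough triage rule. The sketch below is an illustration only: the function name `classify_composition` and the exact cutoffs are assumptions picked to sit inside the ranges described here, not calibrated thresholds.

```python
def classify_composition(total: int, script: int, text: int) -> str:
    """Rough triage of one page from the counts printed by the recipe above.

    The cutoffs are illustrative assumptions, not measured thresholds.
    """
    if total <= 0:
        return "empty"
    script_pct = script * 100 / total
    text_pct = text * 100 / total
    # Mostly script with almost no prose: the SSR-junk signature.
    if script_pct >= 50 and text_pct < 5:
        return "ssr-junk"
    # Little script and a substantial share of readable text.
    if script_pct < 10 and text_pct >= 10:
        return "healthy"
    return "borderline"


# Hypothetical counts: a script-heavy page, then a prose-heavy page.
print(classify_composition(total=400_000, script=260_000, text=9_000))
print(classify_composition(total=40_000, script=2_000, text=12_000))
```

Anything that lands in "borderline" is worth reading by hand; the percentages alone do not say whether the prose that is present is the prose that matters.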

## Bot walls

A bot wall is the same URL returning different responses to different user-agents. Browsers get 200. Googlebot usually gets 200. Identified AI crawlers (GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot) get 403, a challenge page, or a redirect to an "are you a robot" check. The wall is usually configured at the CDN edge (Cloudflare's Super Bot Fight Mode, AWS WAF rules, custom nginx blocks), and the operator may not even know it's on.

The robots.txt convention is the public side of bot identification: a well-behaved crawler sends a user-agent string that names the bot, usually with a reference URL, and the Robots Exclusion Protocol formalized in [RFC 9309](https://datatracker.ietf.org/doc/html/rfc9309) (IETF, 2022) matches robots.txt rules against that name. The same identification that lets a site _opt out_ via `Disallow:` rules also lets a CDN _block_ the bot outright before any robots.txt logic runs.

Bot walls are widespread because they were turned on as anti-scraping defaults during the LLM-training-data panic of 2024. Cloudflare reported in July 2024 that AI bots had accessed roughly 39% of the top one million Internet properties in a single month, with GPTBot reaching 35.46% of them ([Cloudflare, "Declaring Your AIndependence," July 2024](https://blog.cloudflare.com/declaring-your-aindependence-block-ai-bots-scrapers-and-crawlers-with-a-single-click/)). The same announcement introduced a one-click toggle for blocking those crawlers. Many operators turned it on, never revisited the decision, and still run it in 2026, including operators whose own marketing promises AI citation.

The diagnostic is user-agent rotation. Request the same URL with a browser user-agent, a Googlebot user-agent, and four AI crawler user-agents, then compare status codes and response sizes:

```bash
URL='https://example.com/'

for ua in \
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36' \
  'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)' \
  'Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)' \
  'Mozilla/5.0 (compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot)' \
  'Mozilla/5.0 (compatible; ClaudeBot/1.0; +claudebot@anthropic.com)' \
  'Mozilla/5.0 (compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)'
do
  curl -sS -A "$ua" -o /dev/null \
    -w "%{http_code}  %{size_download}B  $ua\n" "$URL"
done
```

A site without a bot wall returns the same status code across all six user-agents, with byte counts equal or close. A site with a bot wall returns 200 to the first two and 403 (or a much smaller response — a challenge page) to the rest.
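The comparison described above can be automated once the curl output is collected. The following is a minimal sketch, assuming the status code and body size for each user-agent have already been parsed into records; the `Probe` type, the `walled_agents` function, the `min_size_ratio` cutoff, and the sample numbers are all hypothetical.

```python
from typing import NamedTuple


class Probe(NamedTuple):
    label: str   # short name for the user-agent, e.g. "browser" or "GPTBot"
    status: int  # HTTP status code, from curl's %{http_code}
    size: int    # body size in bytes, from curl's %{size_download}


def walled_agents(probes: list[Probe], baseline: str = "browser",
                  min_size_ratio: float = 0.5) -> list[str]:
    """Return labels whose response differs from the baseline probe.

    A probe is flagged when its status code differs from the baseline's, or
    when its body is smaller than min_size_ratio times the baseline body
    (a likely challenge page). min_size_ratio is an illustrative assumption.
    """
    base = next(p for p in probes if p.label == baseline)
    flagged = []
    for p in probes:
        if p.label == baseline:
            continue
        if p.status != base.status or p.size < base.size * min_size_ratio:
            flagged.append(p.label)
    return flagged


# Made-up results for illustration only.
results = [
    Probe("browser", 200, 48_210),
    Probe("Googlebot", 200, 48_198),
    Probe("GPTBot", 403, 5_512),
    Probe("ClaudeBot", 200, 6_044),  # 200, but a much smaller body
]
print(walled_agents(results))  # ['GPTBot', 'ClaudeBot']
```

Checking body size as well as status matters because a challenge page is often served with a 200.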

## Why both fail Layer 1

The [five-layer AEO model](/five-layer-aeo) treats crawler access as the gate that determines whether the next four layers exist. SSR-junk fails Layer 1 by giving the crawler bytes that contain almost no extractable content. A bot wall fails Layer 1 by giving the crawler no bytes at all. In both cases, the schema graph, the `llms.txt`, the [direct-answer paragraph](/direct-answer), and the content quality work above all evaluate to zero — the bot never sees them.

The shared symptom is silence. Citation rates go to zero on the specific engine whose crawler is walled or starved of content, and the site operator has no built-in signal that anything is wrong. The two diagnostics above are the way to find either pattern from outside the request path.

## Fixing each

For SSR-junk: serve a pre-extracted markdown twin of each page in the first few kilobytes of the HTML body, before the hydration scripts. The extractable content is then in front of the script payload regardless of where the framework places it. Bot-side, this is identical to the page a browser renders post-hydration; the difference is bytes-on-the-wire ordering.
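One way to confirm the reordering took effect is to count how many readable words sit in the opening kilobytes of the body. This is a rough sketch, not a definitive check: the function name `words_in_body_prefix` and the 8 KB window are assumptions standing in for "the first few kilobytes", and the two sample pages are synthetic.

```python
import re


def words_in_body_prefix(html: str, prefix_chars: int = 8192) -> int:
    """Count visible words in the first prefix_chars characters of <body>.

    prefix_chars is an illustrative assumption for "the first few kilobytes".
    """
    m = re.search(r'<body\b[^>]*>', html, re.I)
    start = m.end() if m else 0
    prefix = html[start:start + prefix_chars]
    # Drop complete script/style elements inside the window.
    prefix = re.sub(r'<(script|style)\b[^>]*>.*?</\1>', ' ', prefix,
                    flags=re.S | re.I)
    # Drop a script/style element left unclosed by the cut-off.
    prefix = re.sub(r'<(script|style)\b.*', ' ', prefix, flags=re.S | re.I)
    # Strip remaining tags, including one truncated at the window edge.
    text = re.sub(r'<[^>]*>?', ' ', prefix)
    return len(text.split())


# Synthetic pages: prose after a large script vs. prose ahead of it.
junk = ("<html><body><div id='root'></div><script>" + "x" * 20000 +
        "</script><p>Late prose here</p></body></html>")
twin = ("<html><body><article>Early prose comes first here</article><script>" +
        "x" * 20000 + "</script></body></html>")
print(words_in_body_prefix(junk), words_in_body_prefix(twin))  # 0 5
```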

For bot walls: adjust the CDN's anti-bot rules to allowlist verified AI crawlers. Anthropic, OpenAI, and Perplexity publish IP-range JSON files for crawler verification; Cloudflare's "Verified bots" category covers most of them once toggled. The deeper fix is to treat training-bot blocking and citation-bot blocking as separate policy decisions. A robots.txt that disallows `GPTBot` for training data can still allow `OAI-SearchBot` for live-query retrieval, and getting that nuance right is the difference between protecting training-data optionality and ceding the citation.
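The split between training and citation crawlers can be checked locally before deploying a robots.txt. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt body and the `/pricing` path are hypothetical examples, not a recommended policy.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: block the training crawler, allow the search crawler.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# can_fetch() answers: may this user-agent request this URL under the rules?
for agent in ("GPTBot", "OAI-SearchBot", "ClaudeBot"):
    print(agent, parser.can_fetch(agent, "https://example.com/pricing"))
```

This only validates the robots.txt policy. A CDN rule that blocks the crawler at the edge still overrides it, so the user-agent rotation check remains the test of what the bot actually receives.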

The longer treatment of why citation is its own optimization target sits in the [AEO essay](/aeo). The two failure modes above are the most common reasons a technically-correct site stops appearing in answers it should be the canonical source for.