# Soft 404

Your server says "200 OK" while the body says "page not found." Crawlers cache the missing page as if it were real content, and your domain's quality signal drops as the cache accumulates.

By AgentSite · 4 min read · Updated 2026-05-24

A soft 404 is an HTTP 200 response for a URL that has no real content behind it: the status line reports success while the body tells the reader the page is missing. Search engines and AI crawlers that trust the status code can store the error shell as if it were content, so missing URLs end up indexed as real pages and the domain's quality signal drops.

## What it looks like

A user mistypes a URL: `/produtcs` instead of `/products`. The server's catch-all route returns the SPA's `index.html` with HTTP 200. The React app mounts, no route matches, and its 404 component renders. The visible body says "Page Not Found." The HTTP status code says 200.

Common variants:

- **SPA with a catch-all.** Server returns `index.html` for any unknown path with status 200; the SPA's client-side router shows a 404 component.
- **Framework default.** Some server-rendered setups render a 404 template but leave the status code at the default 200.
- **CDN fallback.** A misconfigured CDN serves the homepage for any path that 404s at origin.
- **Empty-state page.** A search-results or filter page that returns "no results found" at HTTP 200 for combinations that don't match anything.

In each case, the page presents an error to a human reader and "success" to anything reading the status code.

## How to detect it

Pick a URL you're sure doesn't exist and check the status:

```bash
URL='https://example.com/this-definitely-does-not-exist-abc123'
curl -sS -o /dev/null -w "%{http_code}\n" "$URL"
```

A real 404 returns 404. A soft 404 returns 200. If you get 200 on a URL you just invented, you have soft 404s.

Then check what the body actually shows:

```bash
curl -sS "$URL" | grep -ic 'not found\|404\|does not exist\|page does not'
```

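If that prints a count greater than zero (the body contains "not found" while the status is 200), that's the soft-404 signature. Two caveats: `grep -c` counts matching lines, and a bare `404` can also match unrelated markup, so read the body before concluding. A count of zero does not clear the page either. For a client-rendered SPA the raw HTML is often just the app shell, and the error text only appears after JavaScript runs; a 200 on a URL you invented is still the problem.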
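The two checks above can be folded into one script. This is a minimal sketch using only the Python standard library; the function names `classify` and `probe` and the labels they return are illustrative, and the phrase list simply mirrors the `grep` pattern above.

```python
import re
import urllib.error
import urllib.request

# Same phrases as the grep pattern above; extend with your own error copy.
ERROR_PHRASES = re.compile(
    r"not found|404|does not exist|page does not", re.IGNORECASE
)


def classify(status: int, body: str) -> str:
    """Label a response (labels are illustrative, not a standard)."""
    if status in (404, 410):
        return "hard-404"
    if 200 <= status < 300:
        # Success status plus error wording in the body: the soft-404 signature.
        return "soft-404" if ERROR_PHRASES.search(body) else "ok"
    return "other"


def probe(url: str, timeout: float = 10.0) -> str:
    """Fetch a URL and classify it.

    urllib raises HTTPError for 4xx/5xx responses, so the error branch
    recovers the status code and body from the exception.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
            body = resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        status = err.code
        body = err.read().decode("utf-8", errors="replace")
    return classify(status, body)
```

Call `probe()` with a URL you are sure does not exist. The body check only sees raw HTML, so for an invented URL treat any result other than `hard-404` as suspect.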

## Why it costs you citations

Google flags this directly: "Search Console will show a `soft 404` error" when a 2xx status code returns content suggesting an error or empty page ([Google, HTTP errors documentation](https://developers.google.com/search/docs/crawling-indexing/http-network-errors)).

Three downstream costs:

1. **Domain quality dilution.** Crawlers index the "page" as real content. When hundreds of soft 404s accumulate in the index, the site can read as low-quality to any engine that aggregates page count and quality signals.
2. **Stale URLs in the citation pool.** A page that should 404 but returns 200 stays eligible for citation. An AI engine that cites a soft-404 URL sends the user to a "page not found" message — a bad first impression that's hard to attribute back to the citation engine.
3. **Crawl budget waste.** Vercel measured 569 million GPTBot fetches and 370 million Claude fetches across their network in a single month ([Vercel, Dec 2024](https://vercel.com/blog/the-rise-of-the-ai-crawler)). Earlier that year, Cloudflare reported AI bots accessing around 39% of the top one million Internet properties on its network ([Cloudflare, July 2024](https://blog.cloudflare.com/declaring-your-aindependence-block-ai-bots-scrapers-and-crawlers-with-a-single-click/)). Every soft 404 a bot fetches is a request it did not spend on real content.

## How to fix it

For server-rendered sites (Next.js SSR, Nuxt, Rails, Django, Express):

- Detect unknown routes server-side and return HTTP 404 with the error body.
- Set `Cache-Control: no-store` on 404 responses so downstream caches don't pin them.
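As a framework-neutral illustration of those two bullets, here is a minimal WSGI sketch in Python's standard library. The `PAGES` table and its contents are hypothetical stand-ins for your real routes; in Next.js, Rails, Django, or Express the same idea maps onto that framework's own not-found handling.

```python
from wsgiref.simple_server import make_server

# Hypothetical route table: path -> HTML body.
PAGES = {
    "/": "<h1>Home</h1>",
    "/products": "<h1>Products</h1>",
}

NOT_FOUND_BODY = "<h1>Page Not Found</h1>"


def app(environ, start_response):
    """Serve known pages with 200; everything else gets a real 404."""
    path = environ.get("PATH_INFO", "/")
    body = PAGES.get(path)

    if body is None:
        payload = NOT_FOUND_BODY.encode("utf-8")
        # The status line and the body now agree: both say "not found".
        start_response("404 Not Found", [
            ("Content-Type", "text/html; charset=utf-8"),
            ("Content-Length", str(len(payload))),
            ("Cache-Control", "no-store"),
        ])
        return [payload]

    payload = body.encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(payload))),
    ])
    return [payload]


def main():
    # Local demo only; the port is arbitrary.
    with make_server("127.0.0.1", 8000, app) as server:
        server.serve_forever()
```

The point of the sketch is that the error body and the 404 status are produced in the same branch, so they cannot drift apart.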

For SPAs that serve their shell from disk:

- The server's catch-all is the trap. Replace it with a route table that returns 404 for unknown paths and reaches for `index.html` only on known routes.
- If using a CDN, configure custom error pages that return real 404 status codes, not the SPA shell with 200.
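One way to picture the route table is as a server-side mirror of the client router's patterns. The sketch below is an assumption-heavy illustration: the `SPA_ROUTES` list, the `:id` segment syntax, and the `404.html` filename are all hypothetical, and a real setup would take these from your router configuration.

```python
import re

# Hypothetical client-side routes, mirrored on the server.
# A segment starting with ":" is a dynamic placeholder.
SPA_ROUTES = ["/", "/products", "/products/:id", "/about"]


def compile_route(pattern: str) -> re.Pattern:
    """Turn '/products/:id' into a regex matching one path segment per ':name'."""
    parts = []
    for segment in (s for s in pattern.split("/") if s):
        if segment.startswith(":"):
            parts.append(r"[^/]+")
        else:
            parts.append(re.escape(segment))
    # Allow an optional trailing slash on non-root routes.
    tail = "/?" if parts else ""
    return re.compile("^/" + "/".join(parts) + tail + "$")


ROUTE_REGEXES = [compile_route(p) for p in SPA_ROUTES]


def resolve(path: str) -> tuple[int, str]:
    """Return (status code, file to serve) for a request path."""
    if any(rx.match(path) for rx in ROUTE_REGEXES):
        return 200, "index.html"
    # Unknown shape: send the error page with a real 404, not the shell.
    return 404, "404.html"
```

A table like this only knows the shape of a URL. `/products/:id` still matches an id that does not exist, so dynamic routes need a data lookup before the server commits to a 200.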

For static-site generators (Astro, SvelteKit-static, Jekyll):

- Pre-generate a `404.html` and configure the host to serve it with HTTP 404. Most static hosts (Netlify, Vercel, Cloudflare Pages) ship a setting for this; the default sometimes returns 200.

## Where AgentSite fits

The AgentSite render service detects soft-404 patterns at render time via a two-signal heuristic (title or first line contains "404"/"not found" AND body under ~2 KB of text). When detected, the bundle is marked `notFound: true` and the response carries `X-Agentsite-Status: not_found` so downstream callers don't cache the page as real content. That catches the symptom for cached-render reads; the structural fix at your origin is what stops the bot from being told "200 OK" in the first place — see "How to fix it" above.
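To make the two-signal idea concrete, here is a rough sketch of such a heuristic. It is not AgentSite's implementation: the function name, the marker regex, and the exact 2048-byte cutoff are assumptions chosen to match the description above.

```python
import re

# Signal 1: error wording in the title or the first non-empty line.
MARKERS = re.compile(r"404|not found", re.IGNORECASE)

# Signal 2: very little text. 2048 stands in for "~2 KB"; the exact
# cutoff is an assumption.
MAX_TEXT_BYTES = 2048


def looks_like_soft_404(title: str, body_text: str) -> bool:
    """Return True only when both signals fire."""
    lines = [ln for ln in body_text.splitlines() if ln.strip()]
    first_line = lines[0] if lines else ""

    has_marker = bool(MARKERS.search(title) or MARKERS.search(first_line))
    is_thin = len(body_text.encode("utf-8")) < MAX_TEXT_BYTES

    return has_marker and is_thin
```

Requiring both signals is what keeps a long article titled "404 errors explained" from being flagged: it has the marker but not the thin body.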

## Related problems

- [SSR-junk and bot walls](/ssr-junk-bot-wall) — the other Layer-1 patterns that look fine to humans and fail for bots.
- [The five layers of AEO](/five-layer-aeo) — the structural map this problem sits inside.
- The full catalog: [AEO problems](/aeo-problems).