# Phase 23 — SEO Crawl Directives (sitemap.xml, robots.txt, CMS noindex)
Product spec. Status: **design / framing — implementation-ready pending Daniel's open-question calls.**
Author: product-designer. Date: 2026-06-23. **This doc ships no code.**
Phase 23 is the **endpoint/file-shaped follow-on** to Phase 22's per-page `SeoHead` component. Phase 22 flagged
these three as "adjacent but separate concerns" (`product-notes/phase-22-seo-metadata-component.md §7`): they
are a different *unit of work* — server-side endpoints and static files that tell crawlers **which** pages exist
and **whether** to crawl them at all, as opposed to the per-page head surface that tells crawlers **what each
page is**. Phase 22 is the *content* of discoverability; Phase 23 is the *directives* layer above it.
Three items, each independently shippable:
1. **`sitemap.xml`** on the public host — a generated sitemap enumerating every indexable public URL.
2. **`robots.txt`** on the public host — allow + sitemap pointer in Production, `Disallow: /` everywhere else.
3. **CMS `noindex`** on `DeepDrftManager` — the admin app must never be indexed. The **one** item touching the CMS.
---
## 1. The environment gate is the through-line (read this first)
Phase 22 established the rule that **every non-production environment must be uncrawlable** — the beta/staging
host must not appear in search results, and a stray crawl of staging must not dilute or duplicate the production
site. Phase 22 expressed this for *page-level robots meta* via `SeoEnvironment` (a `[PersistentState]` bridge
seeded from `IWebHostEnvironment.IsProduction()`, because `SeoHead` renders in the **WASM** component graph and
WASM has no `IWebHostEnvironment`).
**Phase 23's three items all run server-side only** (endpoints and static files, never the WASM render tree), so
they read the gate the simplest possible way: **`IWebHostEnvironment.IsProduction()` injected directly.** They do
**not** need the `SeoEnvironment` PersistentState bridge — that bridge exists *solely* to ferry the flag across
the server→WASM seam, which these never cross. This is the correct reuse: same source of truth
(`IWebHostEnvironment.IsProduction()`, the exact predicate `App.razor` already seeds `SeoEnvironment` from), no
parallel gate invented, and no PersistentState plumbing where it isn't needed.
| Concern | Renders where | Gate mechanism |
|---|---|---|
| Phase 22 `SeoHead` robots meta | WASM component graph | `SeoEnvironment` `[PersistentState]` bridge (server seed → WASM read) |
| Phase 23 sitemap / robots / CMS | server-side endpoint or static file | `IWebHostEnvironment.IsProduction()` injected directly |
**Invariant E1 (the non-negotiable):** in any non-production environment, `robots.txt` is `Disallow: /` and the
sitemap is either not served or empty. A crawler must see a closed door on beta before it sees a single URL.
The fail-safe default (matching Phase 22's `SeoEnvironment` fail-safe-to-`noindex`) is **closed**: if environment
resolution is ever ambiguous, behave as non-production (disallow).
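A minimal sketch of the fail-closed gate, written in Python for brevity rather than the host's C#. The name `is_production` is hypothetical. It mirrors the case-insensitive name match that ASP.NET Core's `IsProduction()` performs and treats anything missing or unrecognized as non-production. One thing to verify at deploy time: ASP.NET Core itself falls back to `Production` when no environment name is configured, so the closed default only holds if the beta host sets its environment explicitly.

```python
from typing import Optional


def is_production(environment_name: Optional[str]) -> bool:
    """Return True only for an unambiguous 'Production' environment name.

    Missing, empty, or unrecognized names all return False, so every
    caller that branches on this behaves as non-production (closed).
    """
    if not environment_name:
        return False
    # Case-insensitive comparison, as ASP.NET Core's IsProduction() does.
    return environment_name.lower() == "production"


# Anything other than an explicit Production name keeps the door closed.
for name in ("Production", "Staging", "Beta", "", None):
    print(repr(name), "->", "open" if is_production(name) else "closed")
```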

---
## 2. The architecture seam (where this code lives, and what it must not become)
Per the project convention (root `CLAUDE.md`; `DeepDrftPublic/CLAUDE.md`): **the public host owns thin HTTP
boundaries; domain logic lives in `*.Services` libraries or `DeepDrftAPI`.** Generated XML/text is a *rendering*
of data the host already has access to — it belongs in a **thin endpoint on `DeepDrftPublic`**, and any list
logic it needs must **reuse the existing release read**, not re-implement enumeration.
- **`sitemap.xml`** is *not* a pass-through proxy like `ReleaseProxyController` (which relays JSON verbatim). It
**enumerates** releases and **transforms** them into a different media type (XML). So it is a new endpoint that
*calls* the upstream `GET api/release` paged read (server-to-server via the existing `"DeepDrft.API"` named
`HttpClient`, the same client SSR prerender already uses — no proxy hop, no new data-layer code, no schema
change) and walks the pages to build the URL set. **C5 from Phase 22 holds:** no new API endpoint on
`DeepDrftAPI`, no schema change — the existing `PagedResult<ReleaseDto>` read is sufficient (it carries
`EntryKey`, `Medium`, and `ReleaseDate` — everything a `<url>` entry needs).
- **The URL composition reuses Phase 22's seams, not new ones:** absolute origin from `SeoOptions.BaseUrl`
(`https://deepdrft.com` — config, because the origin can't be derived behind the nginx proxy), and per-release
detail paths from `ReleaseRoutes.DetailHref(entryKey, medium)` (the single source of truth the Cut/Session/Mix
pages, the player bar, and `SharePopover` all already use). The sitemap thereby lists the *exact* canonical
URLs `SeoHead` emits as `<link rel="canonical">` — by construction, not by coincidence.
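The composition described above can be sketched as follows, in Python for illustration only; the real seams are the C# `SeoOptions.BaseUrl` and `ReleaseRoutes.DetailHref`. The medium names `Cut`/`Session`/`Mix`, the lookup table, and both helper names are assumptions made for the example.

```python
from urllib.parse import quote

# Hypothetical medium-to-prefix table; the real mapping lives in ReleaseRoutes.DetailHref.
DETAIL_PREFIX = {"Cut": "/cuts", "Session": "/sessions", "Mix": "/mixes"}


def detail_href(entry_key: str, medium: str) -> str:
    """Site-relative detail path for one release (stand-in for ReleaseRoutes.DetailHref)."""
    # Percent-encode the key so it is safe as a single path segment.
    return f"{DETAIL_PREFIX[medium]}/{quote(entry_key, safe='')}"


def absolute_url(base_url: str, path: str) -> str:
    """Join the configured origin and a site-relative path with exactly one slash."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


print(absolute_url("https://deepdrft.com", detail_href("abc123", "Mix")))
```

Normalizing the slash at the join means `BaseUrl` can be configured with or without a trailing `/` and still produce the same `<loc>` as the canonical link.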
> **Seam note for staff-engineer.** `SeoOptions` and `ReleaseRoutes` currently live in `DeepDrftPublic.Client`
> (`Common/`). A server-side endpoint on `DeepDrftPublic` (the host) references the client assembly already (it
> loads `DeepDrftPublic.Client._Imports` as an additional WASM assembly and shares the static `Startup`), so the
> host can read these types. Confirm the reference direction at implementation; if `SeoOptions.BaseUrl` is not
> cleanly reachable from a host controller, the minimal move is to source `BaseUrl` from the same config the
> client `SeoOptions` is seeded from (it is a non-secret brand constant — `appsettings.json`, per Phase 22 §4.1),
> **not** to duplicate the constant. This is a wiring detail, not a design fork.
---
## 3. Item 1 — `sitemap.xml`
### 3.1 Mechanism and location
A new thin endpoint on `DeepDrftPublic` serving `GET /sitemap.xml` with content-type `application/xml`. It is an
endpoint (not a static file and not a Razor component) because the URL set is **dynamic** — it must include every
release detail URL, which changes as releases are added. A static file would go stale the moment a release lands.
Recommended placement: a small `SitemapController` (or a minimal-API endpoint in `Program.cs`) alongside the
existing proxy controllers in `DeepDrftPublic/Controllers/`. It is a host concern (HTTP surface + rendering),
exactly the layer the proxy controllers occupy. It injects `IWebHostEnvironment` (the gate) and
`IHttpClientFactory` (to call `"DeepDrft.API"`), mirroring `ReleaseProxyController`'s constructor shape.
### 3.2 What it enumerates
The indexable public URL set, all absolutized against `SeoOptions.BaseUrl`:
- **Static roots:** `/` (home), `/about`, and the four browse surfaces `/cuts`, `/sessions`, `/mixes`,
`/archive`. These are a fixed list (a small in-endpoint constant array, or — cleaner — derived from the same
nav index the site already maintains; see OQ-S3).
- **Every release detail URL:** walk `GET api/release?page=N&pageSize=…` until `PageNumber * PageSize >=
TotalCount`, and for each `ReleaseDto` emit `BaseUrl + ReleaseRoutes.DetailHref(dto.EntryKey, dto.Medium)` —
i.e. `/cuts/{key}`, `/sessions/{key}`, `/mixes/{key}`. No `medium` filter on the query (we want all media in
one pass); a generous `pageSize` (e.g. 100-200) keeps the walk to a handful of round-trips even for a large
catalogue.
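The page walk can be sketched like this (Python for illustration; the endpoint itself would be C#). The response field names `Items`, `TotalCount`, `PageNumber`, and `PageSize` come from this spec. The `fetch_page` callable, the 1-based page numbering, and the empty-page guard are assumptions of the sketch.

```python
from typing import Callable, Dict, Iterator


def walk_releases(fetch_page: Callable[[int, int], Dict], page_size: int = 100) -> Iterator[Dict]:
    """Yield every release by walking pages until the catalogue is exhausted."""
    page = 1  # assumption: the upstream read is 1-based
    while True:
        result = fetch_page(page, page_size)
        yield from result["Items"]
        exhausted = result["PageNumber"] * result["PageSize"] >= result["TotalCount"]
        # An empty page also stops the walk, so a miscounted TotalCount cannot loop forever.
        if exhausted or not result["Items"]:
            return
        page += 1


# Usage with an in-memory stand-in for GET api/release?page=N&pageSize=...
catalogue = [{"EntryKey": f"k{i}", "Medium": "Cut"} for i in range(250)]
calls = []


def fake_fetch(page: int, page_size: int) -> Dict:
    calls.append(page)
    start = (page - 1) * page_size
    return {
        "Items": catalogue[start:start + page_size],
        "TotalCount": len(catalogue),
        "PageNumber": page,
        "PageSize": page_size,
    }


releases = list(walk_releases(fake_fetch, page_size=100))
print(len(releases), calls)
```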
### 3.3 XML shape
Standard sitemaps.org `urlset`:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://deepdrft.com/</loc></url>
  <url><loc>https://deepdrft.com/about</loc></url>
  <url><loc>https://deepdrft.com/cuts</loc></url>
  <!-- … browse roots … -->
  <url>
    <loc>https://deepdrft.com/mixes/3f2a9c…</loc>
    <lastmod>2026-05-12</lastmod> <!-- optional; from ReleaseDate; see OQ-S2 -->
  </url>
  <!-- … one <url> per release … -->
</urlset>
```
- `<loc>` is required and must be a fully-qualified absolute URL (the reason `BaseUrl` is mandatory).
- `<lastmod>` is **optional** and recommended from `ReleaseDto.ReleaseDate` (W3C Datetime format, date-only form `YYYY-MM-DD`) **for
release URLs only** — static roots have no natural lastmod and omit it. See **OQ-S2** (ReleaseDate is the
*release* date, not a content-modified date — it is a reasonable proxy but not strictly correct; the safe call
is to include it, as a stale-but-plausible lastmod is better than none and crawlers treat it as a hint).
- **No** `<changefreq>` / `<priority>`: Google ignores both, so they only add noise. Omit them.
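A small sketch of emitting this shape with a real XML serializer rather than string concatenation, so that escaping and the declaration are handled for free. It is Python for illustration; the function name `build_sitemap` and the `(loc, lastmod)` pair shape are assumptions.

```python
import datetime
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries) -> bytes:
    """Serialize (loc, lastmod) pairs to a urlset; lastmod is a date or None."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc  # text is XML-escaped on output
        if lastmod is not None:
            # date.isoformat() yields the YYYY-MM-DD form.
            ET.SubElement(url, "lastmod").text = lastmod.isoformat()
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)


xml_bytes = build_sitemap([
    ("https://deepdrft.com/", None),  # static root: no lastmod
    ("https://deepdrft.com/mixes/abc", datetime.date(2026, 5, 12)),
])
print(xml_bytes.decode("utf-8"))
```

Static roots pass `None` for the date and so carry no `<lastmod>`, matching the rule above.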
### 3.4 Failure posture
The endpoint must degrade gracefully — a sitemap that 500s trains crawlers to stop fetching it. If the upstream
`api/release` walk fails partway, **emit what was gathered** (static roots are always available; partial release
set is better than none) and log the failure. Never 500 the sitemap. (Mirrors `ReleaseProxyController`'s
philosophy of not collapsing valid-but-partial states, adapted to "always return a well-formed document.")
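The "emit what was gathered" posture can be sketched as below (Python for illustration). The names `gather_urls` and `flaky_walk` are hypothetical; the point is only that the static roots are collected before the upstream walk starts and that a mid-walk exception is logged rather than propagated.

```python
import logging
from typing import Callable, Iterable, List

log = logging.getLogger("sitemap")


def gather_urls(static_roots: List[str], release_urls: Callable[[], Iterable[str]]) -> List[str]:
    """Return static roots plus as many release URLs as the walk produced."""
    urls = list(static_roots)  # always available, independent of the upstream
    try:
        for url in release_urls():
            urls.append(url)
    except Exception:
        # Keep the partial set and log; the caller still renders a well-formed document.
        log.exception("release walk failed; serving partial sitemap with %d URLs", len(urls))
    return urls


def flaky_walk():
    """Stand-in for an upstream walk that dies after two releases."""
    yield "https://deepdrft.com/cuts/a"
    yield "https://deepdrft.com/mixes/b"
    raise ConnectionError("upstream api/release unavailable")


partial = gather_urls(["https://deepdrft.com/"], flaky_walk)
print(partial)
```

Catching broadly is deliberate in this sketch: for the sitemap, any upstream failure has the same outcome (serve what exists, log the rest).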
### 3.5 Acceptance criteria (sitemap)
- **AC-S1 — Valid + complete.** `GET /sitemap.xml` (in Production) returns well-formed `urlset` XML that
validates against the sitemaps.org schema and contains: the 6 static roots **and** exactly one `<url>` per
non-deleted release, addressed by `ReleaseRoutes.DetailHref` (so every `<loc>` equals the page's canonical).
- **AC-S2 — Absolute URLs.** Every `<loc>` is `https://deepdrft.com/…` (config origin, not a relative path, not
a proxy-derived host).
- **AC-S3 — Pagination walk is exhaustive.** A catalogue larger than one page is fully enumerated (no releases
dropped at a page boundary); a catalogue of zero releases yields a valid sitemap of just the static roots.
- **AC-S4 — Environment-gated.** In a non-production environment, `/sitemap.xml` is either not served (404) or
served as an empty `urlset`; either way it must never advertise beta release URLs to a crawler (E1). Recommend
**404 in non-production** (simplest; nothing references it because the non-prod `robots.txt` carries no
`Sitemap:` line — see Item 2).
- **AC-S5 — Resilient.** An upstream `api/release` failure yields a well-formed sitemap of the static roots (and
any releases gathered before the failure), logged — never a 500.
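A sketch of how the recommended AC-S4 behavior could wrap the rendering step (Python for illustration; `sitemap_response` and its tuple return shape are assumptions, not the host's actual controller signature). The gate is checked first, so a non-production host never even starts the release walk.

```python
from typing import Callable, Tuple


def sitemap_response(is_production: bool, render: Callable[[], bytes]) -> Tuple[int, str, bytes]:
    """Return (status, content type, body) for GET /sitemap.xml."""
    if not is_production:
        # Closed door: no walk, no URLs, plain 404.
        return 404, "text/plain", b""
    return 200, "application/xml", render()


render_calls = []


def fake_render() -> bytes:
    render_calls.append(1)
    return b"<urlset/>"


beta = sitemap_response(False, fake_render)
prod = sitemap_response(True, fake_render)
print(beta[0], prod[0], len(render_calls))
```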
---
## 4. Item 2 — `robots.txt`
### 4.1 Mechanism and location — the static-vs-endpoint tradeoff (flagged)
`robots.txt` must express the environment gate (`Disallow: /` on beta, allow + sitemap pointer in Production). A
**static file** in `wwwroot/` **cannot** do this — it serves identical bytes in every environment. So the
content is environment-dependent and wants a **tiny endpoint** (`GET /robots.txt`, content-type `text/plain`),
injecting `IWebHostEnvironment` for the gate.
Three options, with the recommendation:
- **(a) Endpoint `GET /robots.txt` [RECOMMENDED].** A few lines of code in the same place as the sitemap
endpoint; reads `IWebHostEnvironment.IsProduction()`; emits the production or non-production body. Single source
of truth for the gate, co-located with the sitemap, no infra dependency. The body is trivial.
- **(b) Static file + reverse-proxy rule.** Ship a production `robots.txt` in `wwwroot/` and have nginx serve a
`Disallow: /` variant (or block the file) on the beta host. **Cons:** splits the gate across app + nginx config
(two places to reason about, two places to get wrong); the beta protection lives in infra the app can't test;
Daniel would maintain an nginx rule per environment. Rejected unless Daniel specifically wants robots managed at
the proxy layer.
- **(c) Static file only.** Cannot express the gate at all — would either crawl-allow beta (violates E1) or
disallow production. **Rejected outright.**
The endpoint (a) is the natural sibling to the sitemap endpoint and keeps E1 in one testable place. Note the
ordering subtlety from `DeepDrftPublic/CLAUDE.md`: static-file middleware runs before component/controller
mapping, so **if** a literal `wwwroot/robots.txt` ever exists it would shadow the endpoint — the endpoint
approach requires that no static `robots.txt` is shipped (a one-line thing to verify, called out so it isn't
tripped over).
### 4.2 Content
**Production:**
```
User-agent: *
Allow: /
Sitemap: https://deepdrft.com/sitemap.xml
```
**Every non-production environment (beta/staging):**
```
User-agent: *
Disallow: /
```
- The `Sitemap:` line uses the absolute `SeoOptions.BaseUrl` origin (same config source as the sitemap's
`<loc>`s) — it is the one documented way to point crawlers at the sitemap without submitting it manually.
- The non-production body carries **no** `Sitemap:` line (consistent with AC-S4's "don't advertise beta URLs").
- Consider whether to additionally `Disallow: /FramePlayer` and the `api/*` proxy paths in Production (OQ-R2) —
the embed iframe and the JSON/stream proxy endpoints are not pages worth crawling.
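The two bodies above reduce to one small pure function, sketched here in Python for illustration (the name `robots_body` is hypothetical; the real endpoint would read `IWebHostEnvironment.IsProduction()` and `SeoOptions.BaseUrl`). The optional OQ-R2 disallows are left out because that call is still open.

```python
def robots_body(is_production: bool, base_url: str) -> str:
    """Return the robots.txt body for the given environment."""
    if not is_production:
        # Non-production: closed door, and no Sitemap line to follow.
        return "User-agent: *\nDisallow: /\n"
    sitemap_url = base_url.rstrip("/") + "/sitemap.xml"
    return f"User-agent: *\nAllow: /\nSitemap: {sitemap_url}\n"


print(robots_body(True, "https://deepdrft.com"))
print(robots_body(False, "https://deepdrft.com"))
```

Keeping the body in a pure function makes AC-R1 and AC-R2 checkable without standing up a host.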
### 4.3 Acceptance criteria (robots)
- **AC-R1 — Production allows + points.** `GET /robots.txt` on the production host returns `Allow: /` and a
`Sitemap: https://deepdrft.com/sitemap.xml` line.
- **AC-R2 — Beta disallows everything.** `GET /robots.txt` on any non-production host returns `User-agent: *` +
`Disallow: /` and **no** `Sitemap:` line (E1).
- **AC-R3 — Single gate.** The Production-vs-beta distinction is driven by `IWebHostEnvironment.IsProduction()` —
the same predicate as the sitemap and as Phase 22's `SeoEnvironment` seed — not a second config flag.
- **AC-R4 — `text/plain`.** Correct content-type; no BOM/HTML wrapper.
---
## 5. Item 3 — CMS `noindex` (the one CMS-touching item)
**This is the only Phase 23 item that touches `DeepDrftManager`.** Scoped, minimal, admin-chrome-only — **no
functional change** to any CMS page, no service/API/data change. `DeepDrftManager` is an authenticated admin app
that must never appear in any search index, in any environment (it has no "production is fine to index" case —
the CMS is *always* `noindex`, unlike the public site whose gate flips per environment).
### 5.1 Mechanism — defense in depth, cheapest-robust
Two layers; recommend **both** because they fail independently and the cost is trivial:
- **(a) `robots.txt` on the CMS host [primary].** A `Disallow: /` `robots.txt` served at the CMS root. Because the
CMS is *always* uncrawlable (no environment gate), this can be the **simplest possible static file** in the CMS
`wwwroot/` — no endpoint, no environment logic:
```
User-agent: *
Disallow: /
```
This is the cleanest single move and differs from the public `robots.txt` precisely because there is no
per-environment branch to express.
- **(b) Blanket `<meta name="robots" content="noindex,nofollow">` in the CMS layout `<head>` [belt-and-braces].**
A static meta tag in the CMS app's root `App.razor`/host `<head>` (the CMS's analogue of the public
`App.razor`'s static head block). This covers what `robots.txt` alone cannot: a disallow prevents *crawling*, but a
disallowed URL linked from elsewhere can still be *indexed* without being crawled, and only an on-page `noindex`
tells a crawler to keep a page out of the index. One caveat: a crawler that honors the `Disallow` never fetches
the page and so never sees the tag, so the meta protects against crawlers that fetch the page anyway and against
any window in which the CMS `robots.txt` is missing or misserved. It is a single static line in the CMS host
head: no per-page wiring, no component, no `SeoHead` port (the CMS does **not** get Phase 22's component; this
is one blanket tag).
Layer (a) is the floor; layer (b) is the robust ceiling. Together they cost a static file plus one `<head>` line.
### 5.2 Why the CMS does *not* reuse Phase 22's `SeoHead` / `SeoEnvironment`
Phase 22 C1/C9 explicitly kept the CMS out of scope ("Zero changes to `DeepDrftManager`"). Phase 23 makes the
**one** deliberate, minimal exception — but it does **not** drag the public component graph into the CMS. The CMS
need is a single constant directive ("never index"), not a parameterized per-page head surface; porting `SeoHead`
(a `DeepDrftPublic.Client` WASM component) into the server-rendered CMS would be wildly disproportionate. The
blanket meta + static robots is the right-sized answer. (And `SeoEnvironment`'s per-environment flip is
irrelevant here — the CMS is `noindex` in *all* environments, including production.)
### 5.3 Acceptance criteria (CMS noindex)
- **AC-C1 — CMS robots disallows.** `GET /robots.txt` on the CMS host returns `User-agent: *` + `Disallow: /`.
- **AC-C2 — Every CMS page carries `noindex`.** Any CMS page's prerendered `<head>` contains
`<meta name="robots" content="noindex,nofollow">` (the blanket layout tag), including the public-facing
`/account/login` and `/account/register` routes (which render in the lean `CmsHomeLayout`) and the home splash.
Confirm the meta lands in whichever head block both layouts inherit (the CMS host `App.razor`), so a
layout-specific head doesn't leave a route uncovered.
- **AC-C3 — No functional change.** No CMS page's behavior, auth gate, layout, or data path changes — the diff is
a static `robots.txt` and a static `<meta>` line. (Aligns with Phase 22 AC9's spirit, now scoped as the
intentional CMS exception.)
- **AC-C4 — Always-on (no env gate).** The CMS `noindex` holds in production too — it is unconditional, unlike the
public site.
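AC-C2 lends itself to a mechanical check over fetched page HTML. The sketch below is Python for illustration, with hypothetical names `RobotsMetaFinder` and `has_noindex`; it scans the whole document for a `robots` meta tag rather than restricting itself to `<head>`, which is enough for a smoke test.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # "noindex,nofollow" -> {"noindex", "nofollow"}
        self.directives.update(p.strip().lower() for p in content.split(",") if p.strip())


def has_noindex(html: str) -> bool:
    """True if the page carries a robots meta tag that includes noindex."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in finder.directives


login_page = '<html><head><meta name="robots" content="noindex,nofollow"></head><body></body></html>'
print(has_noindex(login_page))
```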
---
## 6. Wave decomposition
These are **largely independent** — three separate surfaces with one shared concept (the env gate) and one shared
config value (`BaseUrl`). The dependency graph is shallow.
- **23.1 — Public env-gate primitives + `robots.txt` endpoint (cold-start, shared seam).** Stand up the
server-side `IWebHostEnvironment`-gated endpoint pattern on `DeepDrftPublic` and ship `GET /robots.txt`
(Production allow+sitemap-pointer / non-prod `Disallow: /`). This is the smallest item and it establishes the
**shared gate + BaseUrl wiring** that 23.2 also uses, so doing it first de-risks the seam. Resolves the
static-vs-endpoint call (OQ-R1). **Cold-start; nothing depends on it being done first except that 23.2 reuses
the same gate wiring.**
- **23.2 — `sitemap.xml` endpoint.** The release-enumeration walk over `GET api/release` + XML emission +
`ReleaseRoutes`/`BaseUrl` absolutization + the env gate (404 in non-prod). The largest item. **Shares the gate
+ BaseUrl wiring with 23.1** (do 23.1 first or co-develop; they touch the same controller area). The
`Sitemap:` line in 23.1's production `robots.txt` points at this — so 23.1's production body assumes 23.2 exists
(harmless if 23.2 lands slightly later: a `Sitemap:` pointer to a not-yet-built URL just 404s until it does).
- **23.3 — CMS `noindex` (the CMS-side item).** Static `robots.txt` (`Disallow: /`) in the `DeepDrftManager`
`wwwroot/` + blanket `<meta name="robots" content="noindex,nofollow">` in the CMS host `<head>`. **Fully
independent — touches only `DeepDrftManager`, shares nothing with 23.1/23.2, can run in parallel from day one.**
**Dependency shape:** `23.1 → 23.2` (shared gate/BaseUrl wiring + the `Sitemap:` pointer relationship); **23.3 ∥**
(parallel, independent, different app). The cold-start item is **23.1** (it proves the gate seam the public side
leans on); **23.3** can run start-to-finish alongside either.
**Validation (folded into each wave's ACs, not a separate wave):** the items are small enough that a dedicated
validation wave is overkill — each wave carries its own ACs (S/R/C above). A single end-of-phase check that
exercises the production-vs-beta matrix for all three (Google Search Console / a `curl` against both hosts, plus
the sitemaps.org validator) is worth doing once 23.1-23.3 land.

---
## 7. Open questions for Daniel (product/infra calls, not implementation detail)
### Sitemap
- **OQ-S1 — Browse variants vs. canonical roots.** The sitemap lists the **canonical** browse roots (`/cuts`,
`/sessions`, `/mixes`, `/archive`). Phase 11 put Archive filters in the URL (`/archive?q=&medium=&genre=`).
**Recommend: do NOT enumerate filtered/paginated variants** — they are filtered *views* of the same release set,
not distinct content, and listing them invites duplicate-content dilution. The per-release detail URLs carry the
indexable content; the browse roots are navigational. `[Daniel decision — recommendation: canonical roots only]`
- **OQ-S2 — `lastmod` source.** Use `ReleaseDto.ReleaseDate` as the release URLs' `<lastmod>`? It is the *release*
date, not a content-last-modified date (a re-edited description or replaced cover would not bump it). **Recommend:
include it** — a plausible-but-imperfect lastmod is a useful crawl hint and strictly better than omitting it; the
alternative (a true content-modified timestamp) would need a schema column that doesn't exist (would violate
C5/no-schema-change). Static roots omit `lastmod`. `[Daniel decision — recommendation: ReleaseDate, accept the
imprecision]`
- **OQ-S3 — Static-root list source.** Hardcode the 6 static roots in the endpoint, or derive from the site's nav
index (`DeepDrftPublic.Client/Layout/Pages.cs` `AllPages`)? **Recommend: hardcode for v1** (the indexable-roots
set is *not* the same as the nav set — e.g. `/FramePlayer` is a nav-absent route that must stay out, and a new
nav entry isn't automatically sitemap-worthy), with a code comment to revisit if the set grows. Deriving couples
the sitemap to nav decisions in a way that can silently leak or drop URLs. `[Daniel decision — recommendation:
explicit list]`
### robots
- **OQ-R1 — Endpoint vs. static + nginx (§4.1).** **Recommend the endpoint** (single testable gate, co-located
with the sitemap). Confirm, or — if Daniel prefers robots managed at the reverse-proxy layer — the static +
nginx-rule variant (b), accepting the split gate. `[Daniel decision — recommendation: endpoint]`
- **OQ-R2 — Disallow non-page routes in Production?** Should the production `robots.txt` additionally
`Disallow: /FramePlayer` (the embed iframe) and/or `Disallow: /api/` (the proxy JSON/stream paths)? **Recommend:
yes for `/FramePlayer`** (an embed shell is not a destination page and would be thin/duplicate content if
crawled), **optional for `/api/`** (proxy paths return JSON/bytes, not HTML — crawlers mostly self-skip, but an
explicit disallow is tidy). `[Daniel decision — low stakes]`
### CMS
- **OQ-C1 — Both layers or just robots? (§5.1)** **Recommend both** (static `Disallow: /` robots **and** the
blanket `noindex` meta) — they fail independently and the combined cost is a file + one line; robots-disallow
alone does not de-index a URL discovered via an external link, which is exactly what the on-page `noindex`
closes. Confirm, or accept robots-only if the meta line is judged not worth the one CMS `<head>` touch. `[Daniel
decision — recommendation: both]`
### Cross-cutting
- **OQ-X1 — Is `https://deepdrft.com` the confirmed canonical origin?** This is Phase 22's OQ1, still load-bearing
here: every `<loc>` and the `Sitemap:` line assume `SeoOptions.BaseUrl = https://deepdrft.com`. If that value
was confirmed when Phase 22 landed (COMPLETED.md §22 shows it shipped as `https://deepdrft.com`), this is
closed — flagged only so the dependency is explicit. `[Likely closed — confirm BaseUrl is final]`
---
## 8. Cross-references (read before implementing)
- `product-notes/phase-22-seo-metadata-component.md` — the parent spec; §7 "Adjacent but separate concerns"
flagged all three Phase 23 items; the `SeoOptions.BaseUrl` / `ReleaseRoutes` / `SeoEnvironment` seams Phase 23
reuses are defined here.
- `COMPLETED.md §22` — what Phase 22 actually landed (the `SeoEnvironment` env gate, `SeoOptions.BaseUrl =
https://deepdrft.com`, the `ReleaseRoutes`-based canonical the sitemap must match).
- `DeepDrftPublic/Controllers/ReleaseProxyController.cs` — the thin-proxy shape and the `"DeepDrft.API"` named
client the sitemap endpoint reuses to walk releases (server-to-server, no proxy hop). **Note the distinction:**
the sitemap endpoint *enumerates + transforms*, it does not relay verbatim like this proxy.
- `DeepDrftPublic/CLAUDE.md` — the host's "thin HTTP boundary, no domain logic" contract; the middleware ordering
(static files before controller mapping — relevant to the robots endpoint-vs-static-file shadowing note); the
`IWebHostEnvironment` availability server-side.
- `DeepDrftPublic.Client/Common/ReleaseRoutes.cs` — `DetailHref(entryKey, medium)`, the single source of truth for
per-release detail URLs; every sitemap `<loc>` for a release goes through it.
- `DeepDrftPublic/Components/App.razor` — where `SeoEnvironment.IsProduction` is seeded from
`IWebHostEnvironment.IsProduction()` (lines 38-48); the Phase 23 endpoints read the **same** predicate directly.
- `DeepDrftAPI/Controllers/ReleaseController.cs` `GET api/release` — the paged `PagedResult<ReleaseDto>` read the
sitemap walks (returns `Items`, `TotalCount`, `PageNumber`, `PageSize`; `ReleaseDto` carries `EntryKey`,
`Medium`, `ReleaseDate`). No change to this endpoint (C5).
- `DeepDrftManager` host `App.razor` / `wwwroot/` — where Item 3's CMS robots file and blanket `noindex` meta land
(the one CMS-touching surface).
- sitemaps.org `0.9` schema + Google's "Manage your sitemaps" / robots.txt docs — the validation targets (AC-S1,
AC-R*).