docs: spec Phase 23 — SEO crawl directives (sitemap.xml, robots.txt, CMS noindex)

daniel-c-harvey
2026-06-23 07:10:20 -04:00
parent 33383cd675
commit 9a4b79d377
2 changed files with 390 additions and 0 deletions
@@ -653,6 +653,26 @@ convention.** None block 21.1.
---
## Phase 23 — SEO Crawl Directives (sitemap.xml, robots.txt, CMS noindex)
The endpoint/file-shaped follow-on to Phase 22's per-page `SeoHead` component (landed 2026-06-23, `COMPLETED.md §22`). Phase 22 flagged these three as "adjacent but separate concerns" (`product-notes/phase-22-seo-metadata-component.md §7`): they are a different *unit of work* — server-side endpoints and static files that tell crawlers **which** pages exist and **whether** to crawl at all, vs. the per-page head surface that says **what each page is**. Phase 22 is the *content* of discoverability; Phase 23 is the *directives* layer above it. Full design, contracts, acceptance criteria, and open questions: `product-notes/phase-23-seo-crawl-directives.md`.
**The environment gate is the through-line.** Phase 22 established the rule that **every non-production environment must be uncrawlable** (beta/staging must not be indexed). Phase 22 expressed this for WASM-rendered page robots-meta via the `SeoEnvironment` `[PersistentState]` bridge. **Phase 23's three items all run server-side only** (endpoints + static files, never the WASM render tree), so they read the gate the simplest way: **`IWebHostEnvironment.IsProduction()` injected directly** — the same predicate `App.razor` seeds `SeoEnvironment` from, no PersistentState bridge needed because nothing crosses the server→WASM seam. Invariant E1 (fail-safe closed): in any non-production environment, `robots.txt` is `Disallow: /` and the sitemap is not served (or empty).
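The gate for the two public items can be sketched in a few lines. This is a language-neutral illustration in Python, not the project's C# endpoint code: `CrawlPolicy`, `crawl_policy`, the `User-agent: *` line, and the `/sitemap.xml` path are assumptions made for the example, and the boolean argument stands in for `IWebHostEnvironment.IsProduction()`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlPolicy:
    """What the two public endpoints do in one environment (hypothetical type)."""
    robots_body: str
    serve_sitemap: bool


def crawl_policy(is_production: bool, base_url: str = "https://deepdrft.com") -> CrawlPolicy:
    """Map the single production predicate to both crawl directives."""
    if not is_production:
        # Invariant E1, fail-safe closed: block everything and withhold the sitemap.
        return CrawlPolicy(robots_body="User-agent: *\nDisallow: /\n", serve_sitemap=False)
    # Production: allow crawling and point at the sitemap (the path is an assumption).
    sitemap_url = base_url.rstrip("/") + "/sitemap.xml"
    body = f"User-agent: *\nAllow: /\nSitemap: {sitemap_url}\n"
    return CrawlPolicy(robots_body=body, serve_sitemap=True)
```

Keeping both decisions behind one function mirrors the "single testable gate" idea: the non-production branch can be asserted in a unit test without standing up a host.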
**Architecture seam (per project convention).** Generated XML/text belongs in a **thin endpoint on `DeepDrftPublic`**, with list logic **reusing the existing release read** — no new `DeepDrftAPI` endpoint, no schema change (Phase 22 C5 holds). The sitemap endpoint *enumerates + transforms* (it is NOT a verbatim proxy like `ReleaseProxyController`): it walks `GET api/release` paged (server-to-server via the existing `"DeepDrft.API"` named client) and emits XML, absolutizing each `<loc>` via `SeoOptions.BaseUrl` (`https://deepdrft.com`) + `ReleaseRoutes.DetailHref(entryKey, medium)` — so every sitemap URL equals the page's `SeoHead` canonical by construction. The CMS item is the **one** deliberate, minimal exception to Phase 22 C1 ("zero CMS changes"): admin-chrome-only, no functional/service/API/data change.
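The absolutization step can be sketched as below, in Python for illustration only. `detail_href` is a hypothetical stand-in for `ReleaseRoutes.DetailHref(entryKey, medium)`, and the medium-to-segment mapping is an assumption inferred from the `/cuts|sessions|mixes/{key}` route shape, not the project's actual enum or routing code.

```python
from urllib.parse import quote

BASE_URL = "https://deepdrft.com"  # SeoOptions.BaseUrl in the spec

# Hypothetical medium names; only the resulting path segments come from the spec.
MEDIUM_SEGMENTS = {"cut": "cuts", "session": "sessions", "mix": "mixes"}


def detail_href(entry_key: str, medium: str) -> str:
    """Relative detail path for a release, e.g. /mixes/{key}."""
    # Percent-encode the key so it cannot introduce extra path segments.
    return f"/{MEDIUM_SEGMENTS[medium]}/{quote(entry_key, safe='')}"


def absolutize(base_url: str, path: str) -> str:
    """Join origin and path with exactly one slash between them."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def sitemap_loc(entry_key: str, medium: str, base_url: str = BASE_URL) -> str:
    """Absolute URL to emit as <loc>."""
    return absolutize(base_url, detail_href(entry_key, medium))
```

If the page's canonical URL is built from the same two inputs, the sitemap `<loc>` and the canonical cannot drift apart, which is the "equal by construction" property described above.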
The work is sequenced as three largely independent waves; the only coupling is the shared env-gate and `BaseUrl` wiring between the two public items.
- **23.1 — Public env-gate primitives + `robots.txt` endpoint (cold-start, shared seam).** Stand up the `IWebHostEnvironment`-gated server-side endpoint pattern on `DeepDrftPublic` and ship `GET /robots.txt` (Production: `Allow: /` + `Sitemap:` pointer; non-prod: `Disallow: /`). Smallest item; establishes the **shared gate + BaseUrl wiring** 23.2 reuses, so it de-risks the seam. Resolves the static-vs-endpoint call (recommend **endpoint** — single testable gate; a static file can't express the per-environment branch). **Cold-start.**
- **23.2 — `sitemap.xml` endpoint.** The release-enumeration walk over `GET api/release` (paginate until `PageNumber * PageSize >= TotalCount`) + sitemaps.org `urlset` emission + `ReleaseRoutes`/`BaseUrl` absolutization + the env gate (404 in non-prod). Static roots: `/`, `/about`, `/cuts`, `/sessions`, `/mixes`, `/archive`; plus one `<url>` per release (`/cuts|sessions|mixes/{key}`), optional `<lastmod>` from `ReleaseDate`. Resilient — a partial/empty release set yields a well-formed doc, never a 500. **Shares the gate + BaseUrl wiring with 23.1** (do 23.1 first or co-develop; same controller area); the production `robots.txt`'s `Sitemap:` line points here (harmless if 23.2 lands slightly later).
- **23.3 — CMS `noindex` (the one CMS-touching item; fully parallel).** Static `robots.txt` (`Disallow: /` — no env branch; the CMS is *always* uncrawlable, including in production) in the `DeepDrftManager` `wwwroot/`, **plus** a blanket `<meta name="robots" content="noindex,nofollow">` in the CMS host `<head>` (defense in depth: robots-disallow prevents crawling but on-page `noindex` is what de-indexes a URL discovered via an external link). The CMS does **not** get Phase 22's `SeoHead` — one blanket directive, not a parameterized component. **Fully independent — touches only `DeepDrftManager`, can run start-to-finish from day one.**
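The 23.2 walk and `urlset` emission can be sketched as follows. This is an illustrative Python outline under stated assumptions, not the planned C# controller: `fetch_page` is a hypothetical callable standing in for the server-to-server `GET api/release` call, and the `items`, `total_count`, `href`, and `release_date` keys are invented for the example.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
STATIC_ROOTS = ["/", "/about", "/cuts", "/sessions", "/mixes", "/archive"]


def walk_releases(fetch_page, page_size=50, max_pages=1000):
    """Collect releases page by page; on failure, keep whatever was already read."""
    releases = []
    page_number = 1
    while page_number <= max_pages:  # hard cap guards against a bad TotalCount
        try:
            page = fetch_page(page_number, page_size)
        except Exception:
            break  # resilient: a partial set still yields a well-formed document
        releases.extend(page["items"])
        # Stop on an empty page or once PageNumber * PageSize >= TotalCount.
        if not page["items"] or page_number * page_size >= page["total_count"]:
            break
        page_number += 1
    return releases


def build_sitemap(base_url, releases):
    """Emit a sitemaps.org urlset: static roots first, then one <url> per release."""
    origin = base_url.rstrip("/")
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for path in STATIC_ROOTS:
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = origin + path
    for release in releases:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = origin + release["href"]
        if release.get("release_date"):  # optional <lastmod>, e.g. "2026-01-15"
            ET.SubElement(url, "lastmod").text = release["release_date"]
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body
```

A caller would gate on the environment first (404 in non-production), then return `build_sitemap(base_url, walk_releases(fetch_page))` encoded as UTF-8. With an empty release list the output still contains the six static roots.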
**Dependency shape:** `23.1 → 23.2` (shared gate/BaseUrl wiring + the `Sitemap:` pointer); **23.3 ∥** (parallel, independent, different app). Cold-start is **23.1**. A single end-of-phase production-vs-beta matrix check (Search Console, `curl` against both hosts, and a sitemaps.org validator) is folded into the waves' ACs rather than run as a separate validation wave.
**Open questions for Daniel (spec §7); recommendations stated, none block 23.1:**
- **OQ-S1:** sitemap lists canonical browse roots only, **not** filtered/paginated variants (recommend: roots only, since variants are views, not content).
- **OQ-S2:** `<lastmod>` from `ReleaseDate` (recommend: include it, accepting that it is the release date, not a content-modified date; a true modified timestamp would need a schema column, violating C5).
- **OQ-S3:** static-root list hardcoded vs. derived from nav (recommend: explicit list, because indexable-roots ≠ nav set, e.g. `/FramePlayer` must stay out).
- **OQ-R1:** robots endpoint vs. static+nginx (recommend: endpoint).
- **OQ-R2:** also `Disallow: /FramePlayer` (recommend: yes) and `/api/` (optional) in Production.
- **OQ-C1:** CMS both layers vs. robots-only (recommend: both).
- **OQ-X1:** confirm `https://deepdrft.com` is the final canonical origin (likely closed; shipped with Phase 22).
---
## Working with this file