Phase 23 — SEO Crawl Directives (sitemap.xml, robots.txt, CMS noindex)

Product spec. Status: design / framing — implementation-ready pending Daniel's open-question calls. Author: product-designer. Date: 2026-06-23. No code has been written by this doc.

Phase 23 is the endpoint/file-shaped follow-on to Phase 22's per-page SeoHead component. Phase 22 flagged these three as "adjacent but separate concerns" (product-notes/phase-22-seo-metadata-component.md §7): they are a different unit of work — server-side endpoints and static files that tell crawlers which pages exist and whether to crawl them at all, as opposed to the per-page head surface that tells crawlers what each page is. Phase 22 is the content of discoverability; Phase 23 is the directives layer above it.

Three items, each independently shippable:

  1. sitemap.xml on the public host — a generated sitemap enumerating every indexable public URL.
  2. robots.txt on the public host — allow + sitemap pointer in Production, Disallow: / everywhere else.
  3. CMS noindex on DeepDrftManager — the admin app must never be indexed. The one item touching the CMS.

1. The environment gate is the through-line (read this first)

Phase 22 established the rule that every non-production environment must be uncrawlable — the beta/staging host must not appear in search results, and a stray crawl of staging must not dilute or duplicate the production site. Phase 22 expressed this for page-level robots meta via SeoEnvironment (a [PersistentState] bridge seeded from IWebHostEnvironment.IsProduction(), because SeoHead renders in the WASM component graph and WASM has no IWebHostEnvironment).

Phase 23's three items all run server-side only (endpoints and static files, never the WASM render tree), so the two gated items (sitemap.xml and robots.txt on the public host) read the gate the simplest possible way: IWebHostEnvironment.IsProduction() injected directly. The CMS item reads no gate at all, because it is noindex in every environment (§5). None of them needs the SeoEnvironment PersistentState bridge — that bridge exists solely to ferry the flag across the server→WASM seam, which these never cross. This is the correct reuse: same source of truth (IWebHostEnvironment.IsProduction(), the exact predicate App.razor already seeds SeoEnvironment from), no parallel gate invented, and no PersistentState plumbing where it isn't needed.

| Concern | Renders where | Gate mechanism |
| --- | --- | --- |
| Phase 22 SeoHead robots meta | WASM component graph | SeoEnvironment [PersistentState] bridge (server seed → WASM read) |
| Phase 23 sitemap / robots / CMS | server-side endpoint or static file | IWebHostEnvironment.IsProduction() injected directly (the CMS item is ungated: always noindex, see §5) |

Invariant E1 (the non-negotiable): in any non-production environment, robots.txt is Disallow: / and the sitemap is either not served or empty. A crawler must see a closed door on beta before it sees a single URL. The fail-safe default (matching Phase 22's SeoEnvironment fail-safe-to-noindex) is closed: if environment resolution is ever ambiguous, behave as non-production (disallow).
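A minimal sketch of the fail-closed reading of E1, assuming the gate is reduced to a pure function over the environment name so it can be tested without a running host. CrawlGate and IsCrawlable are hypothetical names, not existing project code; in the host, the input would simply be IWebHostEnvironment.IsProduction(), which compares the environment name to "Production" ignoring case.

```csharp
#nullable enable
using System;

// Hypothetical helper (not part of the codebase): models the E1 gate as a pure
// function. Anything other than an unambiguous "Production" is treated as closed.
public static class CrawlGate
{
    // Returns true only when the environment name is exactly "Production"
    // (case-insensitive). Null, empty, or any other value fails closed.
    public static bool IsCrawlable(string? environmentName) =>
        !string.IsNullOrWhiteSpace(environmentName) &&
        string.Equals(environmentName.Trim(), "Production", StringComparison.OrdinalIgnoreCase);
}
```

One thing worth checking at implementation: ASP.NET Core falls back to Production when no environment is configured, so a beta host that forgets to set its environment would pass IsProduction(). The closed default in E1 therefore also depends on every non-production host setting its environment explicitly.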


2. The architecture seam (where this code lives, and what it must not become)

Per the project convention (root CLAUDE.md; DeepDrftPublic/CLAUDE.md): the public host owns thin HTTP boundaries; domain logic lives in *.Services libraries or DeepDrftAPI. Generated XML/text is a rendering of data the host already has access to — it belongs in a thin endpoint on DeepDrftPublic, and any list logic it needs must reuse the existing release read, not re-implement enumeration.

  • sitemap.xml is not a pass-through proxy like ReleaseProxyController (which relays JSON verbatim). It enumerates releases and transforms them into a different media type (XML). So it is a new endpoint that calls the upstream GET api/release paged read (server-to-server via the existing "DeepDrft.API" named HttpClient, the same client SSR prerender already uses — no proxy hop, no new data-layer code, no schema change) and walks the pages to build the URL set. C5 from Phase 22 holds: no new API endpoint on DeepDrftAPI, no schema change — the existing PagedResult<ReleaseDto> read is sufficient (it carries EntryKey, Medium, and ReleaseDate — everything a <url> entry needs).
  • The URL composition reuses Phase 22's seams, not new ones: absolute origin from SeoOptions.BaseUrl (https://deepdrft.com — config, because the origin can't be derived behind the nginx proxy), and per-release detail paths from ReleaseRoutes.DetailHref(entryKey, medium) (the single source of truth the Cut/Session/Mix pages, the player bar, and SharePopover all already use). The sitemap thereby lists the exact canonical URLs SeoHead emits as <link rel="canonical"> — by construction, not by coincidence.
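The URL composition above can be sketched as follows. This is illustrative only: ReleaseRoutesSketch is a hypothetical stand-in for the real ReleaseRoutes.DetailHref (whose signature and Medium type are not shown in this spec; a string medium is assumed here), and SitemapUrls.Absolutize is a hypothetical helper name.

```csharp
using System;

// Hypothetical stand-in for the project's ReleaseRoutes.DetailHref. The path
// shapes (/cuts, /sessions, /mixes) come from this spec; the string-typed
// medium is an assumption.
public static class ReleaseRoutesSketch
{
    public static string DetailHref(string entryKey, string medium) => medium switch
    {
        "Cut" => $"/cuts/{entryKey}",
        "Session" => $"/sessions/{entryKey}",
        "Mix" => $"/mixes/{entryKey}",
        _ => throw new ArgumentOutOfRangeException(nameof(medium), medium, "Unknown medium")
    };
}

// Hypothetical helper: joins the configured origin (SeoOptions.BaseUrl) and a
// root-relative path with exactly one slash between them.
public static class SitemapUrls
{
    public static string Absolutize(string baseUrl, string path) =>
        baseUrl.TrimEnd('/') + "/" + path.TrimStart('/');
}
```

Trimming on both sides means a trailing slash in the configured BaseUrl cannot produce a double slash, which would otherwise make a sitemap <loc> differ from the canonical URL.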

Seam note for staff-engineer. SeoOptions and ReleaseRoutes currently live in DeepDrftPublic.Client (Common/). A server-side endpoint on DeepDrftPublic (the host) references the client assembly already (it loads DeepDrftPublic.Client._Imports as an additional WASM assembly and shares the static Startup), so the host can read these types. Confirm the reference direction at implementation; if SeoOptions.BaseUrl is not cleanly reachable from a host controller, the minimal move is to source BaseUrl from the same config the client SeoOptions is seeded from (it is a non-secret brand constant — appsettings.json, per Phase 22 §4.1), not to duplicate the constant. This is a wiring detail, not a design fork.


3. Item 1 — sitemap.xml

3.1 Mechanism and location

A new thin endpoint on DeepDrftPublic serving GET /sitemap.xml with content-type application/xml. It is an endpoint (not a static file and not a Razor component) because the URL set is dynamic — it must include every release detail URL, which changes as releases are added. A static file would go stale the moment a release lands.

Recommended placement: a small SitemapController (or a minimal-API endpoint in Program.cs) alongside the existing proxy controllers in DeepDrftPublic/Controllers/. It is a host concern (HTTP surface + rendering), exactly the layer the proxy controllers occupy. It injects IWebHostEnvironment (the gate) and IHttpClientFactory (to call "DeepDrft.API"), mirroring ReleaseProxyController's constructor shape.

3.2 What it enumerates

The indexable public URL set, all absolutized against SeoOptions.BaseUrl:

  • Static roots: / (home), /about, and the four browse surfaces /cuts, /sessions, /mixes, /archive. These are a fixed list (a small in-endpoint constant array, or — cleaner — derived from the same nav index the site already maintains; see OQ-S3).
  • Every release detail URL: walk GET api/release?page=N&pageSize=… until PageNumber * PageSize >= TotalCount, and for each ReleaseDto emit BaseUrl + ReleaseRoutes.DetailHref(dto.EntryKey, dto.Medium) — i.e. /cuts/{key}, /sessions/{key}, /mixes/{key}. No medium filter on the query (we want all media in one pass); a generous pageSize (e.g. 100–200) keeps the walk to a handful of round-trips even for a large catalogue.
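A sketch of the page walk, under stated assumptions: page numbers are 1-based, the record shapes below mirror only the fields this spec names (the real ReleaseDto and PagedResult live in the project and may differ), and fetchPage is a hypothetical seam standing in for the call through the "DeepDrft.API" named HttpClient.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Assumed shapes, limited to the fields the spec mentions.
public sealed record ReleaseDto(string EntryKey, string Medium, DateTime ReleaseDate);
public sealed record PagedResult<T>(IReadOnlyList<T> Items, int TotalCount, int PageNumber, int PageSize);

public static class ReleaseWalker
{
    // fetchPage: (pageNumber, pageSize) -> one page of the upstream paged read.
    public static async Task<List<ReleaseDto>> WalkAllAsync(
        Func<int, int, Task<PagedResult<ReleaseDto>>> fetchPage, int pageSize = 100)
    {
        var all = new List<ReleaseDto>();
        for (var page = 1; ; page++)
        {
            var result = await fetchPage(page, pageSize);
            all.AddRange(result.Items);

            // Stop once the pages seen cover the reported total. The empty-page
            // check guards against looping forever if TotalCount is inconsistent.
            if (result.Items.Count == 0 ||
                (long)result.PageNumber * result.PageSize >= result.TotalCount)
            {
                break;
            }
        }
        return all;
    }
}
```

A zero-release catalogue returns an empty first page and ends the loop immediately, which matches the AC-S3 case of a sitemap containing only the static roots.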

3.3 XML shape

Standard sitemaps.org urlset:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://deepdrft.com/</loc></url>
  <url><loc>https://deepdrft.com/about</loc></url>
  <url><loc>https://deepdrft.com/cuts</loc></url>
  <!-- … browse roots … -->
  <url>
    <loc>https://deepdrft.com/mixes/3f2a9c…</loc>
    <lastmod>2026-05-12</lastmod>   <!-- optional; from ReleaseDate — see OQ-S2 -->
  </url>
  <!-- … one <url> per release … -->
</urlset>
```

  • <loc> is required and must be a fully-qualified absolute URL (the reason BaseUrl is mandatory).
  • <lastmod> is optional and recommended from ReleaseDto.ReleaseDate (W3C date format YYYY-MM-DD) for release URLs only — static roots have no natural lastmod and omit it. See OQ-S2 (ReleaseDate is the release date, not a content-modified date — it is a reasonable proxy but not strictly correct; the safe call is to include it, as a stale-but-plausible lastmod is better than none and crawlers treat it as a hint).
  • No <changefreq> / <priority> — both are widely ignored by Google and add noise. Omit them.
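The shape above could be emitted with System.Xml.Linq, as in this sketch. SitemapXml.Build is a hypothetical name, and the tuple input (absolute loc plus an optional lastmod) is an assumed intermediate shape, not something the spec prescribes.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Xml.Linq;

public static class SitemapXml
{
    private static readonly XNamespace Ns = "http://www.sitemaps.org/schemas/sitemap/0.9";

    // entries: an absolute <loc> plus an optional lastmod (null for static roots).
    public static string Build(IEnumerable<(string Loc, DateTime? LastMod)> entries)
    {
        var urlset = new XElement(Ns + "urlset",
            entries.Select(e => new XElement(Ns + "url",
                new XElement(Ns + "loc", e.Loc),
                // Null content is skipped by XElement, so roots get no <lastmod>.
                e.LastMod is DateTime d
                    ? new XElement(Ns + "lastmod", d.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture))
                    : null)));

        var doc = new XDocument(new XDeclaration("1.0", "UTF-8", null), urlset);

        // XDocument.ToString() omits the XML declaration, so prepend it explicitly.
        return doc.Declaration + Environment.NewLine + doc.ToString();
    }
}
```

Building elements rather than concatenating strings means characters such as & in a URL are escaped automatically. Because the declaration advertises UTF-8, the response would need to be written with UTF-8 encoding as well.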

3.4 Failure posture

The endpoint must degrade gracefully — a sitemap that 500s trains crawlers to stop fetching it. If the upstream api/release walk fails partway, emit what was gathered (static roots are always available; partial release set is better than none) and log the failure. Never 500 the sitemap. (Mirrors ReleaseProxyController's philosophy of not collapsing valid-but-partial states, adapted to "always return a well-formed document.")
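The "emit what was gathered" posture can be sketched as a walk that swallows an upstream failure after handing it to a logger. FailSoft.GatherAsync, fetchPage, and onError are hypothetical names, and the stop condition is simplified to "an empty page ends the walk" to keep the example short.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class FailSoft
{
    // Pulls pages until fetchPage returns an empty page or throws. On failure,
    // the items gathered so far are kept and the exception goes to onError
    // (the logging hook), so the caller can still render a well-formed document.
    public static async Task<List<T>> GatherAsync<T>(
        Func<int, Task<IReadOnlyList<T>>> fetchPage, Action<Exception> onError)
    {
        var gathered = new List<T>();
        try
        {
            for (var page = 1; ; page++)
            {
                var items = await fetchPage(page);
                if (items.Count == 0) break;
                gathered.AddRange(items);
            }
        }
        catch (Exception ex)
        {
            onError(ex); // log, then fall through with the partial set
        }
        return gathered;
    }
}
```

The endpoint would then append whatever this returns to the static roots and serialize the result, so the response is a valid urlset whether the walk completed, failed halfway, or failed on the first page.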

3.5 Acceptance criteria (sitemap)

  • AC-S1 — Valid + complete. GET /sitemap.xml (in Production) returns well-formed urlset XML that validates against the sitemaps.org schema and contains: the 6 static roots and exactly one <url> per non-deleted release, addressed by ReleaseRoutes.DetailHref (so every <loc> equals the page's canonical).
  • AC-S2 — Absolute URLs. Every <loc> is https://deepdrft.com/… (config origin, not a relative path, not a proxy-derived host).
  • AC-S3 — Pagination walk is exhaustive. A catalogue larger than one page is fully enumerated (no releases dropped at a page boundary); a catalogue of zero releases yields a valid sitemap of just the static roots.
  • AC-S4 — Environment-gated. In a non-production environment, /sitemap.xml is either not served (404) or served empty/Disallow-consistent — it must never advertise beta release URLs to a crawler (E1). Recommend 404 in non-production (simplest; nothing references it because the non-prod robots.txt carries no Sitemap: line — see Item 2).
  • AC-S5 — Resilient. An upstream api/release failure yields a well-formed sitemap of the static roots (and any releases gathered before the failure), logged — never a 500.

4. Item 2 — robots.txt

4.1 Mechanism and location — the static-vs-endpoint tradeoff (flagged)

robots.txt must express the environment gate (Disallow: / on beta, allow + sitemap pointer in Production). A static file in wwwroot/ cannot do this — it serves identical bytes in every environment. So the content is environment-dependent and wants a tiny endpoint (GET /robots.txt, content-type text/plain), injecting IWebHostEnvironment for the gate.

Three options, with the recommendation:

  • (a) Endpoint GET /robots.txt [RECOMMENDED]. A few lines of code in the same place as the sitemap endpoint; reads IWebHostEnvironment.IsProduction(); emits the production or non-production body. Single source of truth for the gate, co-located with the sitemap, no infra dependency. The body is trivial.
  • (b) Static file + reverse-proxy rule. Ship a production robots.txt in wwwroot/ and have nginx serve a Disallow: / variant (or block the file) on the beta host. Cons: splits the gate across app + nginx config (two places to reason about, two places to get wrong); the beta protection lives in infra the app can't test; Daniel would maintain an nginx rule per environment. Rejected unless Daniel specifically wants robots managed at the proxy layer.
  • (c) Static file only. Cannot express the gate at all — would either crawl-allow beta (violates E1) or disallow production. Rejected outright.

The endpoint (a) is the natural sibling to the sitemap endpoint and keeps E1 in one testable place. Note the ordering subtlety from DeepDrftPublic/CLAUDE.md: static-file middleware runs before component/controller mapping, so if a literal wwwroot/robots.txt ever exists it would shadow the endpoint — the endpoint approach requires that no static robots.txt is shipped (a one-line thing to verify, called out so it isn't tripped over).

4.2 Content

Production:

```text
User-agent: *
Allow: /
Sitemap: https://deepdrft.com/sitemap.xml
```

Every non-production environment (beta/staging):

```text
User-agent: *
Disallow: /
```

  • The Sitemap: line uses the absolute SeoOptions.BaseUrl origin (same config source as the sitemap's <loc>s) — it is the one documented way to point crawlers at the sitemap without submitting it manually.
  • The non-production body carries no Sitemap: line (consistent with AC-S4's "don't advertise beta URLs").
  • Consider whether to additionally Disallow: /FramePlayer and the api/* proxy paths in Production (OQ-R2) — the embed iframe and the JSON/stream proxy endpoints are not pages worth crawling.
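The two bodies above reduce to a single branch on the gate. In this sketch, RobotsTxt.Build is a hypothetical name; isProduction is assumed to come from IWebHostEnvironment.IsProduction() and baseUrl from SeoOptions.BaseUrl, both passed in so the function stays testable without a host.

```csharp
using System;

public static class RobotsTxt
{
    // Returns the robots.txt body for the given environment.
    // Non-production is the closed door: Disallow everything, no Sitemap line.
    public static string Build(bool isProduction, string baseUrl)
    {
        if (!isProduction)
        {
            return "User-agent: *\nDisallow: /\n";
        }

        return "User-agent: *\n" +
               "Allow: /\n" +
               $"Sitemap: {baseUrl.TrimEnd('/')}/sitemap.xml\n";
    }
}
```

Putting the non-production branch first makes the closed body the path taken whenever the flag is false, which lines up with E1 and AC-R2. Any extra production Disallow lines from OQ-R2 would be added to the second branch only.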

4.3 Acceptance criteria (robots)

  • AC-R1 — Production allows + points. GET /robots.txt on the production host returns Allow: / and a Sitemap: https://deepdrft.com/sitemap.xml line.
  • AC-R2 — Beta disallows everything. GET /robots.txt on any non-production host returns User-agent: * + Disallow: / and no Sitemap: line (E1).
  • AC-R3 — Single gate. The Production-vs-beta distinction is driven by IWebHostEnvironment.IsProduction() — the same predicate as the sitemap and as Phase 22's SeoEnvironment seed — not a second config flag.
  • AC-R4 — text/plain. Correct content-type; no BOM/HTML wrapper.

5. Item 3 — CMS noindex (the one CMS-touching item)

This is the only Phase 23 item that touches DeepDrftManager. Scoped, minimal, admin-chrome-only — no functional change to any CMS page, no service/API/data change. DeepDrftManager is an authenticated admin app that must never appear in any search index, in any environment (it has no "production is fine to index" case — the CMS is always noindex, unlike the public site whose gate flips per environment).

5.1 Mechanism — defense in depth, cheapest-robust

Two layers; recommend both because they fail independently and the cost is trivial:

  • (a) robots.txt on the CMS host [primary]. A Disallow: / robots.txt served at the CMS root. Because the CMS is always uncrawlable (no environment gate), this can be the simplest possible static file in the CMS wwwroot/ — no endpoint, no environment logic:
    User-agent: *
    Disallow: /
    
    This is the cleanest single move and differs from the public robots.txt precisely because there is no per-environment branch to express.
  • (b) Blanket <meta name="robots" content="noindex,nofollow"> in the CMS layout <head> [belt-and-braces]. A static meta tag in the CMS app's root App.razor/host <head> (the CMS's analogue of the public App.razor's static head block). This protects against the case where a crawler reaches a deep CMS URL that robots.txt disallow doesn't de-index (robots disallow prevents crawling, but a URL linked from elsewhere can still be indexed without crawling; an on-page noindex is what actually keeps it out of the index). It is a single static line in the CMS host head — no per-page wiring, no component, no SeoHead port (the CMS does not get Phase 22's component; this is one blanket tag).

Layer (a) is the floor; layer (b) is the robust ceiling. Together they cost a static file plus one <head> line.

5.2 Why the CMS does not reuse Phase 22's SeoHead / SeoEnvironment

Phase 22 C1/C9 explicitly kept the CMS out of scope ("Zero changes to DeepDrftManager"). Phase 23 makes the one deliberate, minimal exception — but it does not drag the public component graph into the CMS. The CMS need is a single constant directive ("never index"), not a parameterized per-page head surface; porting SeoHead (a DeepDrftPublic.Client WASM component) into the server-rendered CMS would be wildly disproportionate. The blanket meta + static robots is the right-sized answer. (And SeoEnvironment's per-environment flip is irrelevant here — the CMS is noindex in all environments, including production.)

5.3 Acceptance criteria (CMS noindex)

  • AC-C1 — CMS robots disallows. GET /robots.txt on the CMS host returns User-agent: * + Disallow: /.
  • AC-C2 — Every CMS page carries noindex. Any CMS page's prerendered <head> contains <meta name="robots" content="noindex,nofollow"> (the blanket layout tag), including the public-facing /account/login and /account/register routes (which render in the lean CmsHomeLayout) and the home splash. Confirm the meta lands in whichever head block both layouts inherit (the CMS host App.razor), so a layout-specific head doesn't leave a route uncovered.
  • AC-C3 — No functional change. No CMS page's behavior, auth gate, layout, or data path changes — the diff is a static robots.txt and a static <meta> line. (Aligns with Phase 22 AC9's spirit, now scoped as the intentional CMS exception.)
  • AC-C4 — Always-on (no env gate). The CMS noindex holds in production too — it is unconditional, unlike the public site.

6. Wave decomposition

These are largely independent — three separate surfaces with one shared concept (the env gate) and one shared config value (BaseUrl). The dependency graph is shallow.

  • 23.1 — Public env-gate primitives + robots.txt endpoint (cold-start, shared seam). Stand up the server-side IWebHostEnvironment-gated endpoint pattern on DeepDrftPublic and ship GET /robots.txt (Production allow+sitemap-pointer / non-prod Disallow: /). This is the smallest item and it establishes the shared gate + BaseUrl wiring that 23.2 also uses, so doing it first de-risks the seam. Resolves the static-vs-endpoint call (OQ-R1). Cold-start; nothing depends on it being done first except that 23.2 reuses the same gate wiring.
  • 23.2 — sitemap.xml endpoint. The release-enumeration walk over GET api/release + XML emission + ReleaseRoutes/BaseUrl absolutization + the env gate (404 in non-prod). The largest item. Shares the gate + BaseUrl wiring with 23.1 (do 23.1 first or co-develop; they touch the same controller area). The Sitemap: line in 23.1's production robots.txt points at this — so 23.1's production body assumes 23.2 exists (harmless if 23.2 lands slightly later: a Sitemap: pointer to a not-yet-built URL just 404s until it does).
  • 23.3 — CMS noindex (the CMS-side item). Static robots.txt (Disallow: /) in the DeepDrftManager wwwroot/ + blanket <meta name="robots" content="noindex,nofollow"> in the CMS host <head>. Fully independent — touches only DeepDrftManager, shares nothing with 23.1/23.2, can run in parallel from day one.

Dependency shape: 23.1 → 23.2 (shared gate/BaseUrl wiring + the Sitemap: pointer relationship); 23.3 ∥ (parallel, independent, different app). The cold-start item is 23.1 (it proves the gate seam the public side leans on); 23.3 can run start-to-finish alongside either.

Validation (folded into each wave's ACs, not a separate wave): the items are small enough that a dedicated validation wave is overkill — each wave carries its own ACs (S/R/C above). A single end-of-phase check that exercises the production-vs-beta matrix for all three (Google Search Console / a curl against both hosts, plus the sitemaps.org validator) is worth doing once 23.1–23.3 land.


7. Open questions for Daniel (product/infra calls, not implementation detail)

Sitemap

  • OQ-S1 — Browse variants vs. canonical roots. The sitemap lists the canonical browse roots (/cuts, /sessions, /mixes, /archive). Phase 11 put Archive filters in the URL (/archive?q=&medium=&genre=). Recommend: do NOT enumerate filtered/paginated variants — they are filtered views of the same release set, not distinct content, and listing them invites duplicate-content dilution. The per-release detail URLs carry the indexable content; the browse roots are navigational. [Daniel decision — recommendation: canonical roots only]
  • OQ-S2 — lastmod source. Use ReleaseDto.ReleaseDate as the release URLs' <lastmod>? It is the release date, not a content-last-modified date (a re-edited description or replaced cover would not bump it). Recommend: include it — a plausible-but-imperfect lastmod is a useful crawl hint and strictly better than omitting it; the alternative (a true content-modified timestamp) would need a schema column that doesn't exist (would violate C5/no-schema-change). Static roots omit lastmod. [Daniel decision — recommendation: ReleaseDate, accept the imprecision]
  • OQ-S3 — Static-root list source. Hardcode the 6 static roots in the endpoint, or derive from the site's nav index (DeepDrftPublic.Client/Layout/Pages.cs AllPages)? Recommend: hardcode for v1 (the indexable-roots set is not the same as the nav set — e.g. /FramePlayer is a nav-absent route that must stay out, and a new nav entry isn't automatically sitemap-worthy), with a code comment to revisit if the set grows. Deriving couples the sitemap to nav decisions in a way that can silently leak or drop URLs. [Daniel decision — recommendation: explicit list]

robots

  • OQ-R1 — Endpoint vs. static + nginx (§4.1). Recommend the endpoint (single testable gate, co-located with the sitemap). Confirm, or — if Daniel prefers robots managed at the reverse-proxy layer — the static + nginx-rule variant (b), accepting the split gate. [Daniel decision — recommendation: endpoint]
  • OQ-R2 — Disallow non-page routes in Production? Should the production robots.txt additionally Disallow: /FramePlayer (the embed iframe) and/or Disallow: /api/ (the proxy JSON/stream paths)? Recommend: yes for /FramePlayer (an embed shell is not a destination page and would be thin/duplicate content if crawled), optional for /api/ (proxy paths return JSON/bytes, not HTML — crawlers mostly self-skip, but an explicit disallow is tidy). [Daniel decision — low stakes]

CMS

  • OQ-C1 — Both layers or just robots? (§5.1) Recommend both (static Disallow: / robots and the blanket noindex meta) — they fail independently and the combined cost is a file + one line; robots-disallow alone does not de-index a URL discovered via an external link, which is exactly what the on-page noindex closes. Confirm, or accept robots-only if the meta line is judged not worth the one CMS <head> touch. [Daniel decision — recommendation: both]

Cross-cutting

  • OQ-X1 — Is https://deepdrft.com the confirmed canonical origin? This is Phase 22's OQ1, still load-bearing here: every <loc> and the Sitemap: line assume SeoOptions.BaseUrl = https://deepdrft.com. If that value was confirmed when Phase 22 landed (COMPLETED.md §22 shows it shipped as https://deepdrft.com), this is closed — flagged only so the dependency is explicit. [Likely closed — confirm BaseUrl is final]

8. Cross-references (read before implementing)

  • product-notes/phase-22-seo-metadata-component.md — the parent spec; §7 "Adjacent but separate concerns" flagged all three Phase 23 items; the SeoOptions.BaseUrl / ReleaseRoutes / SeoEnvironment seams Phase 23 reuses are defined here.
  • COMPLETED.md §22 — what Phase 22 actually landed (the SeoEnvironment env gate, SeoOptions.BaseUrl = https://deepdrft.com, the ReleaseRoutes-based canonical the sitemap must match).
  • DeepDrftPublic/Controllers/ReleaseProxyController.cs — the thin-proxy shape and the "DeepDrft.API" named client the sitemap endpoint reuses to walk releases (server-to-server, no proxy hop). Note the distinction: the sitemap endpoint enumerates + transforms, it does not relay verbatim like this proxy.
  • DeepDrftPublic/CLAUDE.md — the host's "thin HTTP boundary, no domain logic" contract; the middleware ordering (static files before controller mapping — relevant to the robots endpoint-vs-static-file shadowing note); the IWebHostEnvironment availability server-side.
  • DeepDrftPublic.Client/Common/ReleaseRoutes.cs: DetailHref(entryKey, medium), the single source of truth for per-release detail URLs; every sitemap <loc> for a release goes through it.
  • DeepDrftPublic/Components/App.razor — where SeoEnvironment.IsProduction is seeded from IWebHostEnvironment.IsProduction() (lines 38–48); the Phase 23 endpoints read the same predicate directly.
  • DeepDrftAPI/Controllers/ReleaseController.cs GET api/release — the paged PagedResult<ReleaseDto> read the sitemap walks (returns Items, TotalCount, PageNumber, PageSize; ReleaseDto carries EntryKey, Medium, ReleaseDate). No change to this endpoint (C5).
  • DeepDrftManager host App.razor / wwwroot/ — where Item 3's CMS robots file and blanket noindex meta land (the one CMS-touching surface).
  • sitemaps.org 0.9 schema + Google's "Manage your sitemaps" / robots.txt docs — the validation targets (AC-S1, AC-R*).