Phase 23 — SEO Crawl Directives (sitemap.xml, robots.txt, CMS noindex)
Product spec. Status: design / framing — implementation-ready pending Daniel's open-question calls. Author: product-designer. Date: 2026-06-23. No code has been written by this doc.
Phase 23 is the endpoint/file-shaped follow-on to Phase 22's per-page SeoHead component. Phase 22 flagged
these three as "adjacent but separate concerns" (product-notes/phase-22-seo-metadata-component.md §7): they
are a different unit of work — server-side endpoints and static files that tell crawlers which pages exist
and whether to crawl them at all, as opposed to the per-page head surface that tells crawlers what each
page is. Phase 22 is the content of discoverability; Phase 23 is the directives layer above it.
Three items, each independently shippable:
- sitemap.xml on the public host: a generated sitemap enumerating every indexable public URL.
- robots.txt on the public host: allow + sitemap pointer in Production, Disallow: / everywhere else.
- CMS noindex on DeepDrftManager: the admin app must never be indexed. The one item touching the CMS.
1. The environment gate is the through-line (read this first)
Phase 22 established the rule that every non-production environment must be uncrawlable — the beta/staging
host must not appear in search results, and a stray crawl of staging must not dilute or duplicate the production
site. Phase 22 expressed this for page-level robots meta via SeoEnvironment (a [PersistentState] bridge
seeded from IWebHostEnvironment.IsProduction(), because SeoHead renders in the WASM component graph and
WASM has no IWebHostEnvironment).
Phase 23's three items all run server-side only (endpoints and static files, never the WASM render tree), so
they read the gate the simplest possible way: IWebHostEnvironment.IsProduction() injected directly. They do
not need the SeoEnvironment PersistentState bridge — that bridge exists solely to ferry the flag across
the server→WASM seam, which these never cross. This is the correct reuse: same source of truth
(IWebHostEnvironment.IsProduction(), the exact predicate App.razor already seeds SeoEnvironment from), no
parallel gate invented, and no PersistentState plumbing where it isn't needed.
| Concern | Renders where | Gate mechanism |
|---|---|---|
| Phase 22 SeoHead robots meta | WASM component graph | SeoEnvironment [PersistentState] bridge (server seed → WASM read) |
| Phase 23 sitemap / robots / CMS | server-side endpoint or static file | IWebHostEnvironment.IsProduction() injected directly |
Invariant E1 (the non-negotiable): in any non-production environment, robots.txt is Disallow: / and the
sitemap is either not served or empty. A crawler must see a closed door on beta before it sees a single URL.
The fail-safe default (matching Phase 22's SeoEnvironment fail-safe-to-noindex) is closed: if environment
resolution is ever ambiguous, behave as non-production (disallow).
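The fail-closed rule in E1 can be sketched as a tiny predicate. This is an illustrative Python sketch, not the host's code: the real endpoints would call IWebHostEnvironment.IsProduction() in C#, and the function names here are hypothetical. The case-insensitive comparison mirrors how ASP.NET Core compares environment names.

```python
def is_production(environment_name):
    """Return True only for an explicit 'Production' environment name.

    Anything else (None, empty, 'Staging', a non-string) is treated as
    non-production, so an ambiguous environment keeps the door closed.
    """
    if not isinstance(environment_name, str):
        return False
    return environment_name.lower() == "production"


def crawl_allowed(environment_name):
    # Hypothetical helper: crawling is allowed only when the gate is open.
    return is_production(environment_name)
```

The design point is that the default branch is the closed one: the open state has to be proven, never assumed.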
2. The architecture seam (where this code lives, and what it must not become)
Per the project convention (root CLAUDE.md; DeepDrftPublic/CLAUDE.md): the public host owns thin HTTP
boundaries; domain logic lives in *.Services libraries or DeepDrftAPI. Generated XML/text is a rendering
of data the host already has access to — it belongs in a thin endpoint on DeepDrftPublic, and any list
logic it needs must reuse the existing release read, not re-implement enumeration.
- sitemap.xml is not a pass-through proxy like ReleaseProxyController (which relays JSON verbatim). It enumerates releases and transforms them into a different media type (XML). So it is a new endpoint that calls the upstream GET api/release paged read (server-to-server via the existing "DeepDrft.API" named HttpClient, the same client SSR prerender already uses; no proxy hop, no new data-layer code, no schema change) and walks the pages to build the URL set. C5 from Phase 22 holds: no new API endpoint on DeepDrftAPI, no schema change. The existing PagedResult<ReleaseDto> read is sufficient (it carries EntryKey, Medium, and ReleaseDate, everything a <url> entry needs).
- The URL composition reuses Phase 22's seams, not new ones: absolute origin from SeoOptions.BaseUrl (https://deepdrft.com, held in config because the origin can't be derived behind the nginx proxy), and per-release detail paths from ReleaseRoutes.DetailHref(entryKey, medium) (the single source of truth the Cut/Session/Mix pages, the player bar, and SharePopover all already use). The sitemap thereby lists the exact canonical URLs SeoHead emits as <link rel="canonical">, by construction rather than by coincidence.
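The URL composition described above can be sketched as two small functions. This is an illustrative Python sketch under stated assumptions: the real seam is the C# ReleaseRoutes.DetailHref, and the medium names ("Cut", "Session", "Mix") and the example key are hypothetical stand-ins inferred from the /cuts, /sessions, /mixes paths.

```python
BASE_URL = "https://deepdrft.com"  # stands in for SeoOptions.BaseUrl (config)

# Hypothetical medium names; the real values live in the C# Medium type.
_MEDIUM_SEGMENTS = {"Cut": "cuts", "Session": "sessions", "Mix": "mixes"}


def detail_href(entry_key, medium):
    """Relative detail path, in the spirit of ReleaseRoutes.DetailHref."""
    return "/" + _MEDIUM_SEGMENTS[medium] + "/" + entry_key


def absolute_url(path, base_url=BASE_URL):
    """Join the configured origin and a site-relative path with one slash."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")
```

Because both the canonical link and the sitemap entry would go through the same two functions, they cannot drift apart.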
Seam note for staff-engineer. SeoOptions and ReleaseRoutes currently live in DeepDrftPublic.Client (Common/). A server-side endpoint on DeepDrftPublic (the host) references the client assembly already (it loads DeepDrftPublic.Client._Imports as an additional WASM assembly and shares the static Startup), so the host can read these types. Confirm the reference direction at implementation; if SeoOptions.BaseUrl is not cleanly reachable from a host controller, the minimal move is to source BaseUrl from the same config the client SeoOptions is seeded from (it is a non-secret brand constant: appsettings.json, per Phase 22 §4.1), not to duplicate the constant. This is a wiring detail, not a design fork.
3. Item 1 — sitemap.xml
3.1 Mechanism and location
A new thin endpoint on DeepDrftPublic serving GET /sitemap.xml with content-type application/xml. It is an
endpoint (not a static file and not a Razor component) because the URL set is dynamic — it must include every
release detail URL, which changes as releases are added. A static file would go stale the moment a release lands.
Recommended placement: a small SitemapController (or a minimal-API endpoint in Program.cs) alongside the
existing proxy controllers in DeepDrftPublic/Controllers/. It is a host concern (HTTP surface + rendering),
exactly the layer the proxy controllers occupy. It injects IWebHostEnvironment (the gate) and
IHttpClientFactory (to call "DeepDrft.API"), mirroring ReleaseProxyController's constructor shape.
3.2 What it enumerates
The indexable public URL set, all absolutized against SeoOptions.BaseUrl:
- Static roots: / (home), /about, and the four browse surfaces /cuts, /sessions, /mixes, /archive. These are a fixed list (a small in-endpoint constant array, or, more cleanly, derived from the same nav index the site already maintains; see OQ-S3).
- Every release detail URL: walk GET api/release?page=N&pageSize=… until PageNumber * PageSize >= TotalCount, and for each ReleaseDto emit BaseUrl + ReleaseRoutes.DetailHref(dto.EntryKey, dto.Medium), i.e. /cuts/{key}, /sessions/{key}, /mixes/{key}. No medium filter on the query (we want all media in one pass); a generous pageSize (e.g. 100–200) keeps the walk to a handful of round-trips even for a large catalogue.
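The pagination walk above can be sketched as a loop over a page-fetching callable. This is an illustrative Python sketch, not the C# endpoint: it assumes 1-based page numbers (which the stop condition implies but the spec does not state), camelCase JSON field names, and a hypothetical in-memory fake in place of the "DeepDrft.API" client.

```python
def walk_releases(fetch_page, page_size=100):
    """Collect every release by walking pages until the stop condition holds.

    fetch_page(page, page_size) must return a dict with the keys
    items, totalCount, pageNumber, pageSize (assumed JSON casing).
    """
    releases = []
    page = 1  # assumption: the upstream read is 1-based
    while True:
        result = fetch_page(page, page_size)
        releases.extend(result["items"])
        exhausted = result["pageNumber"] * result["pageSize"] >= result["totalCount"]
        # The empty-page guard prevents an endless loop if totalCount is wrong.
        if exhausted or not result["items"]:
            break
        page += 1
    return releases


def make_fake_fetch(all_items):
    """Hypothetical in-memory stand-in for the paged GET api/release read."""
    def fetch_page(page, page_size):
        start = (page - 1) * page_size
        return {
            "items": all_items[start:start + page_size],
            "totalCount": len(all_items),
            "pageNumber": page,
            "pageSize": page_size,
        }
    return fetch_page
```

With 250 fake releases and a page size of 100 the walk makes three requests; with an empty catalogue it makes one and returns an empty list, which is the AC-S3 boundary behaviour.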
3.3 XML shape
Standard sitemaps.org urlset:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>https://deepdrft.com/</loc></url>
<url><loc>https://deepdrft.com/about</loc></url>
<url><loc>https://deepdrft.com/cuts</loc></url>
<!-- … browse roots … -->
<url>
<loc>https://deepdrft.com/mixes/3f2a9c…</loc>
<lastmod>2026-05-12</lastmod> <!-- optional; from ReleaseDate — see OQ-S2 -->
</url>
<!-- … one <url> per release … -->
</urlset>
- <loc> is required and must be a fully-qualified absolute URL (the reason BaseUrl is mandatory).
- <lastmod> is optional and recommended from ReleaseDto.ReleaseDate (W3C date format YYYY-MM-DD) for release URLs only; static roots have no natural lastmod and omit it. See OQ-S2 (ReleaseDate is the release date, not a content-modified date. It is a reasonable proxy but not strictly correct; the safe call is to include it, as a stale-but-plausible lastmod is better than none and crawlers treat it as a hint).
- No <changefreq>/<priority>: both are widely ignored by Google and add noise. Omit them.
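The shape rules above can be sketched with Python's standard XML library. This is an illustrative sketch rather than the C# endpoint; the function name and the (loc, lastmod) tuple shape are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Render (loc, lastmod) pairs as a sitemaps.org urlset document.

    lastmod is a datetime.date or None; None omits the element, which is
    how static roots are emitted. No changefreq or priority is written.
    """
    ET.register_namespace("", SITEMAP_NS)  # emit xmlns="..." without a prefix
    urlset = ET.Element("{%s}urlset" % SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "{%s}url" % SITEMAP_NS)
        ET.SubElement(url, "{%s}loc" % SITEMAP_NS).text = loc
        if lastmod is not None:
            # date.isoformat() yields YYYY-MM-DD, the W3C date format.
            ET.SubElement(url, "{%s}lastmod" % SITEMAP_NS).text = lastmod.isoformat()
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)
```

Building the document through an XML API rather than string concatenation means characters such as & in a URL are escaped automatically, so the output stays well-formed.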
3.4 Failure posture
The endpoint must degrade gracefully — a sitemap that 500s trains crawlers to stop fetching it. If the upstream
api/release walk fails partway, emit what was gathered (static roots are always available; partial release
set is better than none) and log the failure. Never 500 the sitemap. (Mirrors ReleaseProxyController's
philosophy of not collapsing valid-but-partial states, adapted to "always return a well-formed document.")
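One way to read this posture is "gather into a list, and treat any upstream error as the end of the list". The following is an illustrative Python sketch, not the C# endpoint; the function and logger names are hypothetical, and the broad exception handler is a deliberate choice for this one boundary.

```python
import logging

log = logging.getLogger("sitemap")  # hypothetical logger name


def gather_entries(static_roots, iter_release_urls):
    """Return static roots plus whatever release URLs were gathered.

    iter_release_urls is a callable returning an iterator; if it raises
    partway through, the URLs yielded so far are kept and the failure is
    logged, so the caller can always render a well-formed document.
    """
    entries = list(static_roots)  # always available, no upstream call needed
    try:
        for url in iter_release_urls():
            entries.append(url)
    except Exception:
        # Deliberately broad: the sitemap must never surface a 500.
        log.exception("release walk failed; serving a partial sitemap")
    return entries
```

The caller then renders the returned list unconditionally, which is what keeps AC-S5 true even when the upstream read is down.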
3.5 Acceptance criteria (sitemap)
- AC-S1: Valid + complete. GET /sitemap.xml (in Production) returns well-formed urlset XML that validates against the sitemaps.org schema and contains: the 6 static roots and exactly one <url> per non-deleted release, addressed by ReleaseRoutes.DetailHref (so every <loc> equals the page's canonical).
- AC-S2: Absolute URLs. Every <loc> is https://deepdrft.com/… (config origin, not a relative path, not a proxy-derived host).
- AC-S3: Pagination walk is exhaustive. A catalogue larger than one page is fully enumerated (no releases dropped at a page boundary); a catalogue of zero releases yields a valid sitemap of just the static roots.
- AC-S4: Environment-gated. In a non-production environment, /sitemap.xml is either not served (404) or served empty/Disallow-consistent; it must never advertise beta release URLs to a crawler (E1). Recommend 404 in non-production (simplest; nothing references it because the non-prod robots.txt carries no Sitemap: line; see Item 2).
- AC-S5: Resilient. An upstream api/release failure yields a well-formed sitemap of the static roots (and any releases gathered before the failure), logged; never a 500.
4. Item 2 — robots.txt
4.1 Mechanism and location — the static-vs-endpoint tradeoff (flagged)
robots.txt must express the environment gate (Disallow: / on beta, allow + sitemap pointer in Production). A
static file in wwwroot/ cannot do this — it serves identical bytes in every environment. So the
content is environment-dependent and wants a tiny endpoint (GET /robots.txt, content-type text/plain),
injecting IWebHostEnvironment for the gate.
Three options, with the recommendation:
- (a) Endpoint GET /robots.txt [RECOMMENDED]. A few lines of code in the same place as the sitemap endpoint; reads IWebHostEnvironment.IsProduction(); emits the production or non-production body. Single source of truth for the gate, co-located with the sitemap, no infra dependency. The body is trivial.
- (b) Static file + reverse-proxy rule. Ship a production robots.txt in wwwroot/ and have nginx serve a Disallow: / variant (or block the file) on the beta host. Cons: splits the gate across app + nginx config (two places to reason about, two places to get wrong); the beta protection lives in infra the app can't test; Daniel would maintain an nginx rule per environment. Rejected unless Daniel specifically wants robots managed at the proxy layer.
- (c) Static file only. Cannot express the gate at all: it would either crawl-allow beta (violates E1) or disallow production. Rejected outright.
The endpoint (a) is the natural sibling to the sitemap endpoint and keeps E1 in one testable place. Note the
ordering subtlety from DeepDrftPublic/CLAUDE.md: static-file middleware runs before component/controller
mapping, so if a literal wwwroot/robots.txt ever exists it would shadow the endpoint — the endpoint
approach requires that no static robots.txt is shipped (a one-line thing to verify, called out so it isn't
tripped over).
4.2 Content
Production:
User-agent: *
Allow: /
Sitemap: https://deepdrft.com/sitemap.xml
Every non-production environment (beta/staging):
User-agent: *
Disallow: /
- The Sitemap: line uses the absolute SeoOptions.BaseUrl origin (same config source as the sitemap's <loc>s). It is the one documented way to point crawlers at the sitemap without submitting it manually.
- The non-production body carries no Sitemap: line (consistent with AC-S4's "don't advertise beta URLs").
- Consider whether to additionally Disallow: /FramePlayer and the api/* proxy paths in Production (OQ-R2): the embed iframe and the JSON/stream proxy endpoints are not pages worth crawling.
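The two bodies above reduce to one branch on the gate. This is an illustrative Python sketch of that branch, not the C# endpoint; the function name is hypothetical, and the optional OQ-R2 disallow lines are left out because that question is still open.

```python
def robots_body(is_production, base_url="https://deepdrft.com"):
    """Return the robots.txt text for the given gate value.

    Only an explicit True opens the site; any other value (False, None)
    falls through to the closed body, which carries no Sitemap line.
    """
    if is_production is True:
        lines = [
            "User-agent: *",
            "Allow: /",
            "Sitemap: " + base_url.rstrip("/") + "/sitemap.xml",
        ]
    else:
        lines = ["User-agent: *", "Disallow: /"]
    return "\n".join(lines) + "\n"
```

Keeping the body behind a pure function makes AC-R1 and AC-R2 checkable without standing up a host.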
4.3 Acceptance criteria (robots)
- AC-R1: Production allows + points. GET /robots.txt on the production host returns Allow: / and a Sitemap: https://deepdrft.com/sitemap.xml line.
- AC-R2: Beta disallows everything. GET /robots.txt on any non-production host returns User-agent: * + Disallow: / and no Sitemap: line (E1).
- AC-R3: Single gate. The Production-vs-beta distinction is driven by IWebHostEnvironment.IsProduction() (the same predicate as the sitemap and as Phase 22's SeoEnvironment seed), not a second config flag.
- AC-R4: text/plain. Correct content-type; no BOM/HTML wrapper.
5. Item 3 — CMS noindex (the one CMS-touching item)
This is the only Phase 23 item that touches DeepDrftManager. Scoped, minimal, admin-chrome-only — no
functional change to any CMS page, no service/API/data change. DeepDrftManager is an authenticated admin app
that must never appear in any search index, in any environment (it has no "production is fine to index" case —
the CMS is always noindex, unlike the public site whose gate flips per environment).
5.1 Mechanism — defense in depth, cheapest-robust
Two layers; recommend both because they fail independently and the cost is trivial:
- (a) robots.txt on the CMS host [primary]. A Disallow: / robots.txt served at the CMS root. Because the CMS is always uncrawlable (no environment gate), this can be the simplest possible static file in the CMS wwwroot/, with no endpoint and no environment logic:

      User-agent: *
      Disallow: /

  This is the cleanest single move and differs from the public robots.txt precisely because there is no per-environment branch to express.
- (b) Blanket <meta name="robots" content="noindex,nofollow"> in the CMS layout <head> [belt-and-braces]. A static meta tag in the CMS app's root App.razor / host <head> (the CMS's analogue of the public App.razor's static head block). This protects against the case where a crawler reaches a deep CMS URL that robots.txt disallow doesn't de-index (robots disallow prevents crawling, but a URL linked from elsewhere can still be indexed without crawling; an on-page noindex is what actually keeps it out of the index). It is a single static line in the CMS host head: no per-page wiring, no component, no SeoHead port (the CMS does not get Phase 22's component; this is one blanket tag).
Layer (a) is the floor; layer (b) is the robust ceiling. Together they cost a static file plus one <head> line.
5.2 Why the CMS does not reuse Phase 22's SeoHead / SeoEnvironment
Phase 22 C1/C9 explicitly kept the CMS out of scope ("Zero changes to DeepDrftManager"). Phase 23 makes the
one deliberate, minimal exception — but it does not drag the public component graph into the CMS. The CMS
need is a single constant directive ("never index"), not a parameterized per-page head surface; porting SeoHead
(a DeepDrftPublic.Client WASM component) into the server-rendered CMS would be wildly disproportionate. The
blanket meta + static robots is the right-sized answer. (And SeoEnvironment's per-environment flip is
irrelevant here — the CMS is noindex in all environments, including production.)
5.3 Acceptance criteria (CMS noindex)
- AC-C1: CMS robots disallows. GET /robots.txt on the CMS host returns User-agent: * + Disallow: /.
- AC-C2: Every CMS page carries noindex. Any CMS page's prerendered <head> contains <meta name="robots" content="noindex,nofollow"> (the blanket layout tag), including the public-facing /account/login and /account/register routes (which render in the lean CmsHomeLayout) and the home splash. Confirm the meta lands in whichever head block both layouts inherit (the CMS host App.razor), so a layout-specific head doesn't leave a route uncovered.
- AC-C3: No functional change. No CMS page's behavior, auth gate, layout, or data path changes; the diff is a static robots.txt and a static <meta> line. (Aligns with Phase 22 AC9's spirit, now scoped as the intentional CMS exception.)
- AC-C4: Always-on (no env gate). The CMS noindex holds in production too; it is unconditional, unlike the public site.
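AC-C2 lends itself to a mechanical check against prerendered HTML. The following is an illustrative Python sketch of such a check using the standard library's HTML parser; the class and function names are hypothetical, and fetching the page from the CMS host is left out.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collect robots directives from <meta name="robots"> tags inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            attributes = dict(attrs)
            if (attributes.get("name") or "").lower() == "robots":
                content = attributes.get("content") or ""
                self.directives.extend(
                    token.strip().lower() for token in content.split(",")
                )

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def head_has_noindex(html):
    """True when the document head carries a robots meta with noindex."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in finder.directives
```

Running the same check against a page from each CMS layout (including the login and register routes) would confirm the tag sits in the shared head block rather than in one layout only.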
6. Wave decomposition
These are largely independent — three separate surfaces with one shared concept (the env gate) and one shared
config value (BaseUrl). The dependency graph is shallow.
- 23.1: Public env-gate primitives + robots.txt endpoint (cold-start, shared seam). Stand up the server-side IWebHostEnvironment-gated endpoint pattern on DeepDrftPublic and ship GET /robots.txt (Production allow+sitemap-pointer / non-prod Disallow: /). This is the smallest item and it establishes the shared gate + BaseUrl wiring that 23.2 also uses, so doing it first de-risks the seam. Resolves the static-vs-endpoint call (OQ-R1). Cold-start; nothing depends on it being done first except that 23.2 reuses the same gate wiring.
- 23.2: sitemap.xml endpoint. The release-enumeration walk over GET api/release + XML emission + ReleaseRoutes/BaseUrl absolutization + the env gate (404 in non-prod). The largest item. **Shares the gate + BaseUrl wiring with 23.1** (do 23.1 first or co-develop; they touch the same controller area). The Sitemap: line in 23.1's production robots.txt points at this, so 23.1's production body assumes 23.2 exists (harmless if 23.2 lands slightly later: a Sitemap: pointer to a not-yet-built URL just 404s until it does).
- 23.3: CMS noindex (the CMS-side item). Static robots.txt (Disallow: /) in the DeepDrftManager wwwroot/ + blanket <meta name="robots" content="noindex,nofollow"> in the CMS host <head>. Fully independent: touches only DeepDrftManager, shares nothing with 23.1/23.2, can run in parallel from day one.
Dependency shape: 23.1 → 23.2 (shared gate/BaseUrl wiring + the Sitemap: pointer relationship); 23.3 ∥
(parallel, independent, different app). The cold-start item is 23.1 (it proves the gate seam the public side
leans on); 23.3 can run start-to-finish alongside either.
Validation (folded into each wave's ACs, not a separate wave): the items are small enough that a dedicated
validation wave is overkill — each wave carries its own ACs (S/R/C above). A single end-of-phase check that
exercises the production-vs-beta matrix for all three (Google Search Console / a curl against both hosts, plus
the sitemaps.org validator) is worth doing once 23.1–23.3 land.
7. Open questions for Daniel (product/infra calls, not implementation detail)
Sitemap
- OQ-S1: Browse variants vs. canonical roots. The sitemap lists the canonical browse roots (/cuts, /sessions, /mixes, /archive). Phase 11 put Archive filters in the URL (/archive?q=&medium=&genre=). Recommend: do NOT enumerate filtered/paginated variants. They are filtered views of the same release set, not distinct content, and listing them invites duplicate-content dilution. The per-release detail URLs carry the indexable content; the browse roots are navigational. [Daniel decision; recommendation: canonical roots only]
- OQ-S2: lastmod source. Use ReleaseDto.ReleaseDate as the release URLs' <lastmod>? It is the release date, not a content-last-modified date (a re-edited description or replaced cover would not bump it). Recommend: include it. A plausible-but-imperfect lastmod is a useful crawl hint and strictly better than omitting it; the alternative (a true content-modified timestamp) would need a schema column that doesn't exist (would violate C5/no-schema-change). Static roots omit lastmod. [Daniel decision; recommendation: ReleaseDate, accept the imprecision]
- OQ-S3: Static-root list source. Hardcode the 6 static roots in the endpoint, or derive from the site's nav index (DeepDrftPublic.Client/Layout/Pages.cs AllPages)? Recommend: hardcode for v1 (the indexable-roots set is not the same as the nav set: e.g. /FramePlayer is a nav-absent route that must stay out, and a new nav entry isn't automatically sitemap-worthy), with a code comment to revisit if the set grows. Deriving couples the sitemap to nav decisions in a way that can silently leak or drop URLs. [Daniel decision; recommendation: explicit list]
robots
- OQ-R1: Endpoint vs. static + nginx (§4.1). Recommend the endpoint (single testable gate, co-located with the sitemap). Confirm, or, if Daniel prefers robots managed at the reverse-proxy layer, take the static + nginx-rule variant (b), accepting the split gate. [Daniel decision; recommendation: endpoint]
- OQ-R2: Disallow non-page routes in Production? Should the production robots.txt additionally Disallow: /FramePlayer (the embed iframe) and/or Disallow: /api/ (the proxy JSON/stream paths)? Recommend: yes for /FramePlayer (an embed shell is not a destination page and would be thin/duplicate content if crawled), optional for /api/ (proxy paths return JSON/bytes, not HTML; crawlers mostly self-skip, but an explicit disallow is tidy). [Daniel decision; low stakes]
CMS
- OQ-C1: Both layers or just robots? (§5.1) Recommend both (static Disallow: / robots and the blanket noindex meta). They fail independently and the combined cost is a file + one line; robots-disallow alone does not de-index a URL discovered via an external link, which is exactly what the on-page noindex closes. Confirm, or accept robots-only if the meta line is judged not worth the one CMS <head> touch. [Daniel decision; recommendation: both]
Cross-cutting
- OQ-X1: Is https://deepdrft.com the confirmed canonical origin? This is Phase 22's OQ1, still load-bearing here: every <loc> and the Sitemap: line assume SeoOptions.BaseUrl = https://deepdrft.com. If that value was confirmed when Phase 22 landed (COMPLETED.md §22 shows it shipped as https://deepdrft.com), this is closed; it is flagged only so the dependency is explicit. [Likely closed; confirm BaseUrl is final]
8. Cross-references (read before implementing)
- product-notes/phase-22-seo-metadata-component.md: the parent spec; §7 "Adjacent but separate concerns" flagged all three Phase 23 items; the SeoOptions.BaseUrl / ReleaseRoutes / SeoEnvironment seams Phase 23 reuses are defined here.
- COMPLETED.md §22: what Phase 22 actually landed (the SeoEnvironment env gate, SeoOptions.BaseUrl = https://deepdrft.com, the ReleaseRoutes-based canonical the sitemap must match).
- DeepDrftPublic/Controllers/ReleaseProxyController.cs: the thin-proxy shape and the "DeepDrft.API" named client the sitemap endpoint reuses to walk releases (server-to-server, no proxy hop). Note the distinction: the sitemap endpoint enumerates + transforms, it does not relay verbatim like this proxy.
- DeepDrftPublic/CLAUDE.md: the host's "thin HTTP boundary, no domain logic" contract; the middleware ordering (static files before controller mapping, relevant to the robots endpoint-vs-static-file shadowing note); the IWebHostEnvironment availability server-side.
- DeepDrftPublic.Client/Common/ReleaseRoutes.cs: DetailHref(entryKey, medium), the single source of truth for per-release detail URLs; every sitemap <loc> for a release goes through it.
- DeepDrftPublic/Components/App.razor: where SeoEnvironment.IsProduction is seeded from IWebHostEnvironment.IsProduction() (lines 38–48); the Phase 23 endpoints read the same predicate directly.
- DeepDrftAPI/Controllers/ReleaseController.cs GET api/release: the paged PagedResult<ReleaseDto> read the sitemap walks (returns Items, TotalCount, PageNumber, PageSize; ReleaseDto carries EntryKey, Medium, ReleaseDate). No change to this endpoint (C5).
- DeepDrftManager host App.razor / wwwroot/: where Item 3's CMS robots file and blanket noindex meta land (the one CMS-touching surface).
- sitemaps.org 0.9 schema + Google's "Manage your sitemaps" / robots.txt docs: the validation targets (AC-S1, AC-R*).