Phase 21 — Windowed Streaming Buffer (bounded client memory for long streams)

Product spec. Status: design / framing — implementation-ready pending Daniel's open-question calls. Author: product-designer. Date: 2026-06-23. No code has been written by this doc. Surface: public listener site only (DeepDrftPublic.Client player stack + DeepDrftPublic TypeScript audio interop). No CMS (DeepDrftManager) change. No data-model or schema change. The one server touch is reuse, not new surface: the existing DeepDrftAPI HTTP Range: bytes=X- partial-content primitive (Phase 4, landed) is the load-bearing dependency; this phase adds no new API endpoint.

Sequencing dependency (Daniel, 2026-06-23): Phase 18 (Opus Low-Data Streaming) comes BEFORE this phase. Format support — specifically the derived Ogg Opus fullband 320 low-data delivery path (product-notes/phase-18-opus-low-data-streaming.md) — is a prerequisite that sequences ahead of windowing. Phase 21's windowing must work across both delivery formats (lossless WAV and Opus). Its C5 invariant below already anticipated this ("must not foreclose MP3/FLAC"); Opus is now the concrete VBR/containerized driver of C5. Windowing an Opus stream uses the decoder's approximate byte↔time mapping (OpusFormatDecoder.calculateByteOffset — Ogg-page interpolation), exactly the C5 case — not the exact CBR-WAV byteRate math. Build the window machinery format-agnostically (§2 C3/C5) so it inherits Opus for free.


1. Goal

Bound the client memory a playing track consumes to a small, configurable forward window — independent of total stream length — so a 1 GB+ DJ MIX (Phase 9 Mix medium: a single long track) plays without the whole decoded PCM accumulating in the browser.

The defect, stated precisely. The network path already streams in adaptive 16–64 KB chunks (StreamingAudioPlayerService.StreamAudioWithEarlyPlayback) — that part is fine. The accumulation is on the decode side: PlaybackScheduler holds private buffers: AudioBuffer[] and never evicts ("Supports pause/resume/seek by retaining all buffers" — its own doc comment). Every 64 KB segment the StreamDecoder decodes is pushed via addBuffer() and kept for the life of the track. Decoded PCM is larger than the compressed-or-raw source in memory (Web Audio AudioBuffer is 32-bit float per sample per channel — a 16-bit stereo WAV roughly doubles in size once decoded), so a 1 GB WAV becomes ~2 GB of retained AudioBuffer float data. That is the OOM.
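
The doubling described above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the source is uncompressed PCM and ignoring the WAV header and per-object overhead; the function name is hypothetical and not part of the codebase.

```typescript
// Web Audio holds decoded audio as 32-bit float samples: 4 bytes each.
const FLOAT32_BYTES = 4;

/**
 * Approximate size of PCM data once decoded to Float32.
 * pcmDataBytes counts samples across all channels, so the channel
 * count cancels out of the ratio.
 */
function decodedFloat32Bytes(pcmDataBytes: number, sourceBitsPerSample: number): number {
  const sourceBytesPerSample = sourceBitsPerSample / 8;
  const sampleCount = pcmDataBytes / sourceBytesPerSample;
  return sampleCount * FLOAT32_BYTES;
}

// A 1 GB 16-bit WAV: half a billion samples, 4 bytes each once decoded.
console.log(decodedFloat32Bytes(1_000_000_000, 16)); // 2000000000
```

A 24-bit source would grow by 4/3 rather than 2x under the same arithmetic, so the "roughly doubles" figure is specific to 16-bit input.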

One-line framing: today the player decodes the whole track into memory and keeps it; Phase 21 makes it keep only a sliding forward window and discard what has already played, refilling on demand from the Range primitive it already uses for seek.


2. Constraints / invariants (the contract that must hold)

These are non-negotiable. The §3.5 streaming seam (root CLAUDE.md "Streaming-first audio playback"; CONTEXT.md §3.5) is called the most architecturally load-bearing part of the playback path by both docs. This phase modifies that seam — so the contract it must preserve is spelled out here.

  • C1 — The seek-beyond-buffer Range path is the substrate, kept intact. Phase 4 landed HTTP Range: bytes={offset}- → 206 Partial Content end to end (client TrackMediaClient → DeepDrftPublic proxy → DeepDrftAPI), and StreamDecoder.reinitializeForRangeContinuation retains the parsed format header on a continuation body (no re-parse). Windowed refill is a generalization of this exact path (§3.1) — it must not require a second, divergent fetch mechanism.
  • C2 — Playback start latency unchanged. Today playback starts as soon as a configurable minimum buffer count is queued (header-derived duration, not full-file). The window model must keep first-audio latency at parity — bounding memory must not reintroduce a fetch-then-play stall.
  • C3 — The format-decoder abstraction is untouched. IFormatDecoder owns all format-specific byte math; AudioPlayer.createFormatDecoder already dispatches on Content-Type (WAV/MP3/FLAC decoders all wired today — verified 2026-06-23; an OpusFormatDecoder joins them in Phase 18). Windowing lives in the format-agnostic layer (PlaybackScheduler eviction + StreamDecoder/player refill orchestration); it must add no format-specific branches. Every decoder behind that seam (the MP3/FLAC decoders already wired, and OpusFormatDecoder once Phase 18 lands) inherits windowing for free.
  • C4 — Read-only playback only. This is a memory-management change, not a UX change. No new user-visible control, no change to seek/transport semantics beyond what the listener already experiences. Seek must still feel identical.
  • C5 — Must window both delivery formats (WAV lossless AND Opus low-data). Byte↔time mapping for refill is exact and cheap for WAV (CBR: byteRate from the header). For VBR/containerized formats it is approximate (the decoders carry TOC/SEEKTABLE/Ogg-page seek math). Phase 18 (Opus) is sequenced before this phase and is the concrete driver here: an Ogg Opus 320 stream is VBR and page-paged, so its calculateByteOffset is an approximate page-interpolation, not exact-offset. The window machinery must express refill purely in terms of the decoder's existing calculateByteOffset, so the same code windows WAV exactly and Opus approximately — no WAV-special-cased offset math in the window layer. (MP3/FLAC decoders are already wired in the registry too — the registry dispatches on content-type today; an OpusFormatDecoder joins them in Phase 18.)
  • C6 — No regression to the single-instance JS decoder concurrency guarantees. The current code is careful that only one streaming loop touches the single JS StreamDecoder at a time (DrainActiveStreamingTaskAsync, the _streamingCancellation identity dance). Windowed refill introduces more mid-stream fetches; it must route through the same drain/cancellation discipline, not around it.
  • C7 — The Mix visualizer's data source is independent and must stay that way. The Phase 10/12 WebGL2 lava visualizer renders from a preprocessed high-res waveform datum fetched per-track (GET api/track/{entryKey}/waveform/high-res), not from live decoded PCM. Confirmed: evicting played AudioBuffers cannot starve the visualizer — it never read them. The window model is invisible to the visualizer. (This is the canonical 1 GB case and the case that proves the eviction is safe.)
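
To make C3 and C5 concrete, here is a minimal sketch of a window layer that asks only for a byte offset and never branches on format. The interface and class names are hypothetical, and the real IFormatDecoder.calculateByteOffset signature may differ. The interpolating mapper is a generic stand-in for an approximate VBR mapping, not the actual Ogg-page algorithm.

```typescript
// Hypothetical shape of the seam the window layer depends on.
interface ByteOffsetMapper {
  calculateByteOffset(positionSeconds: number): number;
}

// Exact mapping for CBR PCM: whole sample frames past the header.
class WavOffsetMapper implements ByteOffsetMapper {
  private readonly dataStartByte: number;
  private readonly sampleRate: number;
  private readonly blockAlign: number; // bytes per sample frame

  constructor(dataStartByte: number, sampleRate: number, blockAlign: number) {
    this.dataStartByte = dataStartByte;
    this.sampleRate = sampleRate;
    this.blockAlign = blockAlign;
  }

  calculateByteOffset(positionSeconds: number): number {
    const frame = Math.floor(positionSeconds * this.sampleRate);
    return this.dataStartByte + frame * this.blockAlign;
  }
}

// Approximate mapping: linear interpolation between known (time, byte) points.
class InterpolatedOffsetMapper implements ByteOffsetMapper {
  private readonly points: ReadonlyArray<{ seconds: number; byte: number }>;

  // points must be non-empty and sorted by ascending seconds.
  constructor(points: ReadonlyArray<{ seconds: number; byte: number }>) {
    this.points = points;
  }

  calculateByteOffset(positionSeconds: number): number {
    const pts = this.points;
    if (positionSeconds <= pts[0].seconds) return pts[0].byte;
    for (let i = 1; i < pts.length; i++) {
      const a = pts[i - 1];
      const b = pts[i];
      if (positionSeconds <= b.seconds) {
        const ratio = (positionSeconds - a.seconds) / (b.seconds - a.seconds);
        return Math.floor(a.byte + ratio * (b.byte - a.byte));
      }
    }
    return pts[pts.length - 1].byte;
  }
}

// The window layer: identical code path for exact and approximate formats.
function refillOffset(mapper: ByteOffsetMapper, refillFromSeconds: number): number {
  return mapper.calculateByteOffset(refillFromSeconds);
}
```

The point of the sketch is the last function: the caller cannot tell which mapper it was handed, which is the property C5 asks the window machinery to preserve.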

3. Architectural shape

3.0 The mental model

A track's audio is a byte range [0, fileLength) on disk. At any moment the listener is at playback position P (seconds → byte offset via the format decoder). The player should hold decoded AudioBuffers only for a bounded window roughly [P - back, P + ahead]:

  • forward fill (ahead) — enough decoded lookahead that playback never starves (covers the existing 500 ms scheduler lookahead plus network jitter headroom);
  • back-retain (back) — a small amount of already-played audio kept so a short seek-back does not trigger a network refetch;
  • evict — anything older than P - back is dropped (AudioBuffer references released → GC reclaims the float data);
  • refill — when forward decoded lookahead drops below a low-water mark, fetch+decode more from the current byte position; when the window's tail is evicted and the listener seeks back past it, refetch that region via the Range primitive (the seek-beyond-buffer path, run backwards).

This is a ring/sliding-window buffer keyed on playback position, driven by high/low-water marks — the standard bounded-producer/bounded-consumer pattern, transplanted onto the decode→schedule seam.
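
As an illustration of the model above, the window can be evaluated from three numbers: the playhead, how far decoding has reached, and the policy. This is a sketch only; the type and field names are hypothetical, and the real policy values are an open question (OQ1).

```typescript
interface WindowPolicy {
  backSeconds: number;     // back-retain behind the playhead
  aheadSeconds: number;    // forward fill target
  lowWaterSeconds: number; // refill when lookahead drops below this
}

interface WindowEvaluation {
  evictBeforeSeconds: number; // anything ending before this can be dropped
  fillUntilSeconds: number;   // decode no further ahead than this
  needsRefill: boolean;
}

function evaluateWindow(
  positionSeconds: number,
  decodedUntilSeconds: number,
  policy: WindowPolicy,
): WindowEvaluation {
  const lookahead = decodedUntilSeconds - positionSeconds;
  return {
    // Clamp at 0 so the window never extends before the track start.
    evictBeforeSeconds: Math.max(0, positionSeconds - policy.backSeconds),
    fillUntilSeconds: positionSeconds + policy.aheadSeconds,
    needsRefill: lookahead < policy.lowWaterSeconds,
  };
}
```

With a policy of 10 s back, 30 s ahead, and a 5 s low-water mark, a playhead at 100 s with audio decoded to 103 s yields an eviction cutoff of 90 s, a fill target of 130 s, and a refill request.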

3.1 Why this is a generalization of seek-beyond-buffer, not a new mechanism

The seek-beyond-buffer path already does every primitive the window needs, just triggered manually and one-shot:

| Window operation | Existing seek-beyond-buffer machinery it reuses |
| --- | --- |
| Discard buffers, keep offset | PlaybackScheduler.clearForSeek() + setPlaybackOffset() (clears buffers, retains the absolute-time anchor) |
| Fetch from a byte offset | TrackMediaClient.GetTrackMedia(key, byteOffset) → Range: bytes=X- → 206 |
| Decode a header-less body | StreamDecoder.reinitializeForRangeContinuation(remainingByteLength) |
| Map time → byte offset | StreamDecoder.calculateByteOffset() → IFormatDecoder.calculateByteOffset() |
| Single-loop safety on refetch | _streamingCancellation swap + DrainActiveStreamingTaskAsync() |

The difference is eviction does not exist yet (the scheduler only ever clear()s wholesale) and refill is one-shot (a seek, not a continuous low-water-triggered loop). So the new work is two seams: a partial-evict on the scheduler, and a position-driven refill controller on the player. The fetch/decode/offset plumbing is reused verbatim.
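
For reference, the wire format behind the fetch step is small. The sketch below builds the open-ended Range value and reads the standard Content-Range header of a 206 response. The helper names are hypothetical, and deriving the remaining byte length from Content-Range is an assumption here; the existing client may obtain it another way (for example from Content-Length).

```typescript
/** Open-ended byte range: "from this offset to the end of the file". */
function openEndedRange(byteOffset: number): string {
  if (!Number.isInteger(byteOffset) || byteOffset < 0) {
    throw new RangeError("byteOffset must be a non-negative integer");
  }
  return `bytes=${byteOffset}-`;
}

interface ContentRange {
  start: number;
  end: number;          // inclusive
  total: number | null; // null when the server sends "*"
}

/** Parse a header value such as "bytes 1000-1999/2000". */
function parseContentRange(header: string): ContentRange | null {
  const match = /^bytes (\d+)-(\d+)\/(\d+|\*)$/.exec(header.trim());
  if (!match) return null;
  return {
    start: Number(match[1]),
    end: Number(match[2]),
    total: match[3] === "*" ? null : Number(match[3]),
  };
}

/** Bytes carried by the partial body; the end offset is inclusive. */
function partialBodyLength(range: ContentRange): number {
  return range.end - range.start + 1;
}
```

A continuous refill loop would issue the same request shape a seek does, which is why the table above lists no new fetch primitive.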

3.2 The three candidate directions

Per file convention the alternatives are recorded; the recommendation follows.

Direction A — Sliding window on the existing single forward stream (recommended). Keep the current model where the C# loop reads one forward HTTP stream and pumps chunks into the JS decoder. Add two things: (1) PlaybackScheduler gains partial eviction — drop buffers whose absolute-time end is older than P - back, adjusting its index bookkeeping so getCurrentPosition() and scheduling stay correct against a buffer array that no longer starts at index 0; (2) a back-pressure signal — when forward decoded lookahead exceeds the high-water mark, the C# loop pauses reading the HTTP stream (stops calling ReadAsync) until playback drains it below low-water, then resumes. Memory is bounded by high-water + back-retain. Seek-back beyond the retained window falls through to the existing seek-beyond-buffer path unchanged. Why recommended: smallest change to the load-bearing seam; reuses the live forward stream (no extra connections in the common case); eviction and back-pressure are the only genuinely new mechanisms, and both are local (one to the scheduler, one to the read loop). Back-pressure via "stop reading the socket" is exactly how TCP flow control already wants to behave — pausing ReadAsync lets the kernel window close; we are not fighting the transport.
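
The back-pressure signal in Direction A is a hysteresis gate: reading stops above the high-water mark and restarts only below the low-water mark, so the loop does not flap around a single threshold. A minimal sketch, written in TypeScript for consistency with the interop layer even though the real read loop is C#; the class name and method are hypothetical.

```typescript
class ReadGate {
  private reading = true;
  private readonly lowWaterSeconds: number;
  private readonly highWaterSeconds: number;

  constructor(lowWaterSeconds: number, highWaterSeconds: number) {
    if (lowWaterSeconds >= highWaterSeconds) {
      throw new RangeError("low-water mark must be below high-water mark");
    }
    this.lowWaterSeconds = lowWaterSeconds;
    this.highWaterSeconds = highWaterSeconds;
  }

  /** Feed the current decoded lookahead; returns whether to keep reading. */
  update(lookaheadSeconds: number): boolean {
    if (this.reading && lookaheadSeconds > this.highWaterSeconds) {
      this.reading = false; // enough buffered: stop pulling from the stream
    } else if (!this.reading && lookaheadSeconds < this.lowWaterSeconds) {
      this.reading = true; // playback drained the buffer: resume
    }
    return this.reading;
  }
}
```

Between the two marks the gate keeps whatever state it was in, which is the property that bounds the unplayed side of the window to roughly the high-water mark plus one in-flight chunk.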

Direction B — Discrete window segments, each its own Range fetch. Treat the file as fixed-size byte segments (e.g. 4 MB). Hold N decoded segments around P; fetch the next/previous segment via a fresh Range request as the window slides; discard the far segment. No live long-lived forward stream — every window is an independent 206. Why not (default): turns one connection into many short Range requests (more proxy hops through DeepDrftPublic, more server-side WavOffsetService-style header synthesis, more places a fetch can fail mid-stream — worsening the §1.6 error surface), and the byte↔time segment math must be exact at every boundary. It is the cleaner model for true random-access (and the better base if seeking-heavy usage dominates), so keep it as the fallback if Direction A's back-pressure proves leaky in practice. Borrowed prior art: HLS/DASH segment windows and the MSE SourceBuffer.remove() eviction model — this is how every production HTML5 adaptive player bounds memory. We are doing the hand-rolled equivalent because the stack is a bespoke Web Audio graph, not <media> + MSE.
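
For comparison, the segment arithmetic Direction B would need looks like the sketch below. It covers only the byte-range bookkeeping, under the assumption of fixed-size segments; the function names are hypothetical, and it does not address the header synthesis or the byte↔time boundary math noted above.

```typescript
const MIB = 1024 * 1024;

/** Index of the fixed-size segment containing a byte offset. */
function segmentIndexFor(byteOffset: number, segmentBytes: number): number {
  return Math.floor(byteOffset / segmentBytes);
}

/** Closed Range value for one segment, clamped to the file length. */
function segmentRange(index: number, segmentBytes: number, fileLength: number): string {
  const start = index * segmentBytes;
  const end = Math.min(start + segmentBytes, fileLength) - 1; // inclusive end
  return `bytes=${start}-${end}`;
}

/** Segment indexes to hold decoded around the playhead. */
function segmentsAround(
  playheadByte: number,
  segmentBytes: number,
  fileLength: number,
  behind: number,
  ahead: number,
): number[] {
  const current = segmentIndexFor(playheadByte, segmentBytes);
  const last = Math.ceil(fileLength / segmentBytes) - 1;
  const from = Math.max(0, current - behind);
  const to = Math.min(last, current + ahead);
  const indexes: number[] = [];
  for (let i = from; i <= to; i++) indexes.push(i);
  return indexes;
}

console.log(segmentsAround(9 * MIB, 4 * MIB, 18 * MIB, 1, 1)); // [ 1, 2, 3 ]
console.log(segmentRange(4, 4 * MIB, 18 * MIB)); // bytes=16777216-18874367
```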

Direction C — Adopt MediaSource Extensions (MSE) and let the browser manage the buffer. Stop hand-rolling the decode→schedule graph for long tracks; feed the Range stream into a SourceBuffer and let the browser evict via its built-in quota + remove(). Memory management becomes the platform's problem. Why not — RESOLVED, rejected (Daniel, 2026-06-23; see OQ5): MSE does not accept raw WAV/PCM — it wants containerized formats (fragmented MP4/WebM, or MP3/AAC elementary streams). The entire bespoke visualizer/spectrum graph is wired to the Web Audio AudioContext, not a <media> element. Adopting MSE is a rewrite of the playback substrate, not a windowing change. It looked like the real long-term answer once compressed delivery arrived — but Daniel has decided compressed delivery (Phase 18 Opus) will feed the same bespoke graph via the IFormatDecoder seam, so the compressed-delivery move that would have justified MSE happens without surrendering the graph. The bespoke graph is a deliberate long-term commitment; MSE is rejected. Direction A is therefore the permanent destination, not a stopgap that MSE will retire. Recorded as considered-and-declined.

3.3 Recommendation

Direction A is the smallest coherent change that hits the headline (bounded memory under a 1 GB stream) while honoring C1–C7. It keeps the live forward stream, reuses the seek-beyond-buffer path for the only genuinely random-access case (seek-back past the retained tail), and isolates the two new mechanisms. The final architecture and the exact eviction/back-pressure API are staff-engineer's call at implementation (per file convention); this spec fixes the shape and the invariants, not the method signatures.

3.4 SOLID / road-not-taken rationale

  • SRP, preserved. Eviction is a PlaybackScheduler concern (it already owns buffer storage); refill orchestration is a player-service/StreamDecoder concern (they already own the fetch loop); byte↔time math stays in IFormatDecoder. No responsibility crosses a boundary it does not already own.
  • OCP, via C3/C5. Because windowing is added in the format-agnostic layer, adding or changing a format decoder (Opus in Phase 18, for example) changes zero window code. The window expresses refill through calculateByteOffset — the one seam the decoders already implement.
  • The seam stays single-writer (C6). Every new refetch routes through the existing cancellation/drain discipline, so "only one loop touches the JS decoder" remains true. This is the rule most likely to be violated by a naive implementation and is called out as a hard invariant.
  • Road not taken — eager full decode with a memory cap that just stops decoding. Tempting (decode until you hit a byte budget, then stop) but it breaks playback of long tracks past the cap entirely — it bounds memory by refusing to play the rest, not by sliding. Rejected: it is a degradation, not a feature.

4. Use cases

  • UC1 — Play a 1 GB+ DJ MIX start to finish (the headline). Memory stays bounded throughout; the listener experiences continuous playback identical to a short track.
  • UC2 — Seek forward within a long track. Already handled by seek-beyond-buffer; under windowing the forward seek clears the window and refills at the target — no behavior change, now with eviction so the pre-seek region does not linger.
  • UC3 — Seek back a few seconds. Served from the back-retain window with no network refetch (the reason back exists).
  • UC4 — Seek back far, past the evicted tail. Falls through to the existing seek-beyond-buffer Range fetch, run toward an earlier offset. (Open question OQ2 — see §6.)
  • UC5 — Pause a long track for a long time. Memory stays at the bounded window size while paused (no continued decode). On resume, forward fill restarts from the low-water trigger.
  • UC6 — Mix detail page with the lava visualizer running. Visualizer reads its preprocessed datum (C7); windowing is invisible to it. Confirmed non-interaction.
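
The seek cases UC2 to UC4 reduce to one question: does the target fall inside the audio currently retained? A minimal sketch of that decision follows. The names are hypothetical, and it assumes that a target inside the decoded span is served locally while anything outside it takes the existing Range path.

```typescript
type SeekPath = "served-from-window" | "range-refetch";

/**
 * retainedFromSeconds: start of the oldest buffer still held (the evicted tail ends here).
 * decodedUntilSeconds: end of the newest decoded buffer.
 */
function classifySeek(
  targetSeconds: number,
  retainedFromSeconds: number,
  decodedUntilSeconds: number,
): SeekPath {
  const insideWindow =
    targetSeconds >= retainedFromSeconds && targetSeconds < decodedUntilSeconds;
  return insideWindow ? "served-from-window" : "range-refetch";
}
```

With audio retained from 90 s to 130 s, a seek to 95 s is UC3 (no network), while seeks to 20 s (UC4) and 400 s (UC2) both resolve through the same refetch branch.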

5. Interaction with the deferred Phase 1 streaming features

This phase touches the same decoder/scheduler seam as the deferred Phase 1.3/1.4/1.5 items and the 1.6/1.7 robustness items. The interactions, explicitly:

  • 1.3 Preload / prefetch (deferred; preload half). Shares machinery, does not conflict — and should be sequenced after. Preload stages the next track into a second decoder instance during the current track's tail; windowing bounds the current track's forward buffer. They are orthogonal axes (next-track vs. current-track-window), but they compound the memory question: a naive preload of a second 1 GB mix would reintroduce the OOM this phase fixes. Recommendation: land windowing first, so that when preload arrives, the staged next-track decoder is also windowed by construction (it inherits the bounded scheduler). Windowing makes preload safe for long tracks; without it, preload of mixes is a memory hazard.
  • 1.4 Crossfade (deferred). Needs two simultaneous PlaybackScheduler instances briefly overlapping. Both would be windowed instances — the overlap doubles the window size momentarily, not the whole track. Windowing makes crossfade between two long mixes affordable. No reordering needed; 1.4 still gates on 1.3.
  • 1.5 Gapless (deferred). Sample-accurate hand-off of the next track's first buffer at the current track's last buffer. Windowing changes which buffers are retained but not the hand-off mechanism; the only care point is that the current track's final window must not be evicted before the gapless boundary is scheduled. A minor invariant for whoever builds 1.5, not a blocker. Note 1.5's existing WAV-only caveat stands.
  • 1.6 Track-skip on error (deferred). Windowing enlarges the error surface — call this out. Today a fetch failure happens at load (one fetch) or at a user seek (one fetch). Windowed refill issues mid-stream fetches the listener did not initiate; one of those can fail at byte 700 M of a 1 GB mix. So Phase 21 should ship with at least the cheap half of 1.6: a mid-stream refill failure must surface a clear error and not wedge the player (it must not leave playback "running" with a starved scheduler — mirror the playFromPosition end-of-buffer recovery already in PlaybackScheduler). The rich half (byte-scan to next valid frame) stays deferred. Recommendation: fold the minimal refill-failure handling into Phase 21's acceptance criteria (AC6) rather than leaving it entirely to 1.6 — it is created by this phase.
  • 1.7 Safari compatibility (deferred). Windowing adds no new Safari-specific surface beyond what the streaming path already has. The one adjacency: more frequent AudioContext activity during refill should be checked against the older-Safari webkitAudioContext quirks when 1.7 is addressed — note it, do not block on it.
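
The 1.6 requirement above (a failed refill must not leave the player "playing" with nothing to schedule) can be stated as a small state transition. This is an illustrative sketch only: the state shape, event names, and message text are all hypothetical, and the real player state lives in the C# service.

```typescript
type PlayerStatus = "playing" | "paused" | "error";

interface PlayerSnapshot {
  status: PlayerStatus;
  errorMessage: string | null;
}

type RefillEvent =
  | { kind: "refill-succeeded" }
  | { kind: "refill-failed"; reason: string };

function onRefillEvent(state: PlayerSnapshot, event: RefillEvent): PlayerSnapshot {
  if (event.kind === "refill-failed") {
    // Whatever the prior status, a failed refill ends in a visible error
    // state rather than a "playing" state with a starved scheduler.
    return { status: "error", errorMessage: `Playback interrupted: ${event.reason}` };
  }
  // A successful refill changes nothing about the transport state.
  return state;
}
```

Modeling the failure as an explicit transition makes the wedge condition testable: no input event can produce a "playing" snapshot after a refill failure.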

6. Open questions for Daniel (genuine product decisions, not implementation detail)

These are policy calls with user-visible or resource trade-offs — flagged rather than decided here.

  • OQ1 — Window size policy. What bounds the window — a fixed byte/time budget (e.g. "hold at most ~30 s decoded ahead + ~10 s behind"), or a configurable memory budget (e.g. "≤ N MB of decoded PCM") that derives the time window from the stream's byte rate? Recommend a time-based forward window + small time-based back-retain as the primary knob (intuitive, format-portable), with a hard memory ceiling as a secondary guard. The exact numbers are tunable post-landing; Daniel picks the policy axis. [Daniel decision]
  • OQ2 — Seek-back past the evicted window. When the listener seeks back earlier than the retained tail, we must refetch (the audio is gone). Acceptable to take the same brief re-buffer the forward seek-beyond-buffer takes today? (Recommend yes — it is the symmetric case and listeners already accept it forward.) Or should back-retain be generous enough that this is rare? [Daniel decision]
  • OQ3 — Configurable total in-flight memory cap. Should there be a single hard byte ceiling on total decoded audio held by the player (a safety net independent of the window-size policy), exposed as a config value? Recommend yes, as a guard rail even if the window policy is time-based — it is the backstop that makes "1 GB stream never OOMs" a guarantee rather than a tuning hope. [Daniel decision]
  • OQ4 — Apply windowing to all tracks, or only long ones? A 3-minute Cut decoded whole is ~30–60 MB — harmless today. Windowing everything is simpler (one code path) but adds refill machinery to short tracks that never needed it. Recommend window everything (one path, C6-safe, and short tracks simply never hit a refill because they fit inside the forward window) — but Daniel may prefer a size threshold. [Daniel decision]
  • OQ5 — Is MSE (Direction C) the real destination? — RESOLVED: NO (Daniel, 2026-06-23). Do not adopt MSE. The bespoke Web Audio decode→schedule graph stays — it is bespoke by deliberate choice, a long-term commitment, not a stopgap. Daniel's rationale: the player is intentionally a custom graph, not an HTML <media> element; the compressed-delivery move that would have made MSE tempting is being met instead by Phase 18 (Opus low-data path) feeding the same bespoke graph through the IFormatDecoder seam — so compressed delivery arrives without surrendering the graph. Consequence for this phase: Direction A (the hand-rolled sliding window) is the destination, not a placeholder; invest in it as permanent machinery. It will window both the WAV and the Opus path (the sequencing note at the top). Direction C is recorded as considered and declined per file convention; kept visible so a future reader sees the road not taken and why. [RESOLVED — bespoke graph retained; MSE rejected]
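
To put numbers on OQ1, OQ3, and OQ4, the sketch below converts between a time window and a decoded-memory footprint. It assumes 44.1 kHz stereo decoded to Float32, which is an illustrative choice and not a figure from this spec; the function names are hypothetical.

```typescript
const BYTES_PER_FLOAT32_SAMPLE = 4;

/** Decoded bytes held per second of audio. */
function decodedBytesPerSecond(sampleRate: number, channels: number): number {
  return sampleRate * channels * BYTES_PER_FLOAT32_SAMPLE;
}

/** Footprint of a time-based window (OQ1). */
function windowFootprintBytes(
  aheadSeconds: number,
  backSeconds: number,
  sampleRate: number,
  channels: number,
): number {
  return (aheadSeconds + backSeconds) * decodedBytesPerSecond(sampleRate, channels);
}

/** Whole seconds of decoded audio that fit under a hard ceiling (OQ3). */
function secondsWithinCeiling(ceilingBytes: number, sampleRate: number, channels: number): number {
  return Math.floor(ceilingBytes / decodedBytesPerSecond(sampleRate, channels));
}

console.log(windowFootprintBytes(30, 10, 44_100, 2)); // 14112000 (about 13.5 MiB)
console.log(windowFootprintBytes(180, 0, 44_100, 2)); // 63504000 (a whole 3-minute track)
console.log(secondsWithinCeiling(64 * 1024 * 1024, 44_100, 2)); // 190
```

Under these assumptions a 30 s + 10 s window costs roughly 14 MB regardless of track length, and a 3-minute track fits entirely under a 64 MiB ceiling, which is consistent with the OQ4 observation that short tracks would never trigger a refill.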

7. Acceptance criteria

  • AC1 (headline) — Bounded memory under a 1 GB stream. Playing a 1 GB+ WAV mix start to finish, the browser tab's retained decoded-audio memory stays bounded to the configured window (not growing toward ~2 GB). Verifiable via browser memory tooling: peak decoded-audio footprint is independent of track length and tracks the window-size policy, not the file size.
  • AC2 — Playback-start latency at parity (C2). First-audio latency for a track is unchanged from pre-windowing (within noise). Windowing does not introduce a fetch-then-play stall.
  • AC3 — Continuous playback, no starvation. A long mix plays edge to edge with no audible gaps, underruns, or stalls under normal network conditions — the forward fill stays ahead of the playhead.
  • AC4 — Seek-back within the window is instant (UC3). A short backward seek into retained audio produces no network request.
  • AC5 — Seek (forward, and back past the window) still works (UC2/UC4). Both resolve via the existing Range path with the same behavior the listener sees today; the pre-seek region is evicted, not retained.
  • AC6 — A mid-stream refill failure degrades cleanly (the 1.6 adjacency). A failed refill fetch surfaces a clear user-visible error and leaves the player in a recoverable state (not a wedged "playing" with a starved scheduler). It must not silently hang.
  • AC7 — The Mix visualizer is unaffected (C7). With the lava visualizer running on a long mix, the visualizer renders identically (it reads the preprocessed datum, never the evicted buffers).
  • AC8 — Single-decoder concurrency invariant holds (C6). Under rapid seek + refill activity, no interleaved ProcessStreamingChunk calls corrupt the single JS decoder (the existing drain/cancel discipline still governs every fetch).
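
AC8 is easiest to reason about as a lease check: each streaming loop holds a lease, starting a new loop invalidates the old one, and the decoder refuses chunks from a stale lease. The sketch below models only that identity half of the discipline. It does not model the drain step (waiting for the superseded loop to finish), and the class is a hypothetical stand-in, not the existing _streamingCancellation code.

```typescript
class SingleWriterDecoder {
  private currentLease = 0;
  private readonly accepted: string[] = [];

  /** Start a new streaming loop; every earlier lease becomes stale. */
  beginLoop(): number {
    this.currentLease += 1;
    return this.currentLease;
  }

  /** Accept a chunk only from the loop that currently owns the decoder. */
  processChunk(lease: number, chunkLabel: string): boolean {
    if (lease !== this.currentLease) {
      return false; // stale loop: drop the chunk instead of interleaving it
    }
    this.accepted.push(chunkLabel);
    return true;
  }

  acceptedChunks(): readonly string[] {
    return this.accepted;
  }
}
```

A rapid-seek test for AC8 could follow the same pattern: start loop A, start loop B, then assert that a late chunk from A never reaches the decoder.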

8. Wave decomposition

Dependency shape: 21.1 → 21.2 → 21.3, with 21.4 validating the whole. 21.1 is the cold-start prerequisite and the load-bearing change; the rest layer on it.

  • 21.1 — Partial eviction in PlaybackScheduler (cold-start; the load-bearing change). Give the scheduler the ability to drop already-played buffers and keep its position/index bookkeeping correct against a buffer array that no longer begins at absolute time 0 (today getCurrentPosition, playFromPosition, and the scheduling loop all assume buffers[0] is the track start). This is the hardest correctness work in the phase — the time-anchor math must stay exact through eviction. No refill yet; with eviction alone and the forward read loop unchanged, this is provably memory-bounded for the played region. Independent of the §6 open questions — it can begin immediately; the window sizes (OQ1/OQ3) are parameters fed in later. Settled and cold-start.
  • 21.2 — Back-pressure on the forward read loop (the bound on the unplayed region). Make the C# StreamAudioWithEarlyPlayback loop stop calling ReadAsync when forward decoded lookahead exceeds the high-water mark, and resume below low-water. Together with 21.1, this bounds both the played and unplayed sides — the full memory guarantee (AC1). Must route resume/pause through the existing cancellation-safe single-loop discipline (C6). Depends on 21.1 (eviction must exist so the drained region is reclaimed, not merely un-read).
  • 21.3 — Seek-back-past-window refill (close the random-access case). Wire UC4: when a backward seek lands earlier than the retained tail, refetch via the existing seek-beyond-buffer Range path pointed at the earlier offset. Also add the minimal AC6 refill-failure handling. Mostly reuse of the landed seek path; the new work is the trigger (window-miss detection) and the clean-failure path. Depends on 21.1 + 21.2 (needs the window boundaries they define).
  • 21.4 — Validation pass against the 1 GB target (acceptance). Exercise AC1–AC8 against a real 1 GB+ mix: memory profiling (AC1), latency parity (AC2), edge-to-edge playback (AC3), the seek matrix (AC4/AC5), induced refill failure (AC6), visualizer-running (AC7), and rapid-seek concurrency (AC8). Largely test/measurement; any break is likely a tuning fix in the 21.1 anchor math or the 21.2 water-marks. Depends on 21.1–21.3.
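
The bookkeeping problem in 21.1 is that, after eviction, index 0 of the buffer array no longer means "track start". One way to keep the anchor exact is to track the absolute position of the first retained segment in integer sample frames, as in the sketch below. This is an assumption-laden illustration: the class and field names are hypothetical, DecodedSegment stands in for AudioBuffer, and the real PlaybackScheduler may anchor in seconds instead.

```typescript
interface DecodedSegment {
  frames: number; // length in sample frames; stand-in for an AudioBuffer
}

class WindowedSegmentList {
  private segments: DecodedSegment[] = [];
  private baseFrame = 0; // absolute frame at which segments[0] starts

  add(segment: DecodedSegment): void {
    this.segments.push(segment);
  }

  /** Drop whole segments that end at or before cutoffFrame; returns how many. */
  evictBefore(cutoffFrame: number): number {
    let dropped = 0;
    while (
      this.segments.length > 0 &&
      this.baseFrame + this.segments[0].frames <= cutoffFrame
    ) {
      this.baseFrame += this.segments[0].frames; // advance the anchor first
      this.segments.shift(); // release the reference so it can be collected
      dropped += 1;
    }
    return dropped;
  }

  /** Map an absolute frame to (segment index, offset), or null if not retained. */
  locate(frame: number): { index: number; frameOffset: number } | null {
    if (frame < this.baseFrame) return null; // already evicted
    let start = this.baseFrame;
    for (let i = 0; i < this.segments.length; i++) {
      const end = start + this.segments[i].frames;
      if (frame < end) return { index: i, frameOffset: frame - start };
      start = end;
    }
    return null; // beyond what has been decoded
  }

  get retainedFromFrame(): number {
    return this.baseFrame;
  }

  get count(): number {
    return this.segments.length;
  }
}
```

Integer frames are used here because repeatedly adding floating-point durations can drift over a multi-hour mix. A null result from locate on a backward seek is the window-miss trigger that 21.3 needs.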

9. Cross-references (read before implementing)

  • Root CLAUDE.md "Streaming-first audio playback" / CONTEXT.md §3.5 — the seam this phase modifies; the §2 invariants here restate its contract. Both flag it as the most load-bearing path.
  • PLAN.md Phase 4 (landed) / COMPLETED.md — the HTTP Range bytes=X- primitive this generalizes.
  • PLAN.md Phase 1.3 / 1.4 / 1.5 / 1.6 / 1.7 — the deferred decoder/scheduler-seam features; §5 above reconciles each.
  • PLAN.md Phase 9 — defines the Mix medium (single long track), the canonical 1 GB case.
  • PLAN.md Phase 10 / product-notes/phase-10-mix-visualizer-lava-reframe.md / product-notes/phase-12-waveform-visualizer-generalization.md — establishes the preprocessed per-track high-res waveform datum; the basis for C7 (visualizer does not read live PCM).
  • DeepDrftPublic/Interop/audio/PlaybackScheduler.ts — owns the unbounded buffers: AudioBuffer[]; 21.1 lives here.
  • DeepDrftPublic/Interop/audio/StreamDecoder.ts — reinitializeForRangeContinuation, calculateByteOffset; the refill substrate.
  • DeepDrftPublic.Client/Services/StreamingAudioPlayerService.cs — the C# forward read loop (StreamAudioWithEarlyPlayback), the seek-beyond-buffer path (SeekBeyondBuffer), and the cancellation/drain discipline (C6); 21.2/21.3 live here.
  • DeepDrftPublic.Client/Clients/TrackMediaClient.cs — the Range-capable media fetch reused by refill.