Phase 21 — Windowed Streaming Buffer (bounded client memory for long streams)

Product spec. Status: design / framing — implementation-ready pending Daniel's open-question calls. Author: product-designer. Date: 2026-06-23. No code has been written by this doc. Surface: public listener site only (DeepDrftPublic.Client player stack + DeepDrftPublic TypeScript audio interop). No CMS (DeepDrftManager) change. No data-model or schema change. The one server touch is reuse, not new surface: the existing DeepDrftAPI HTTP Range: bytes=X- partial-content primitive (Phase 4, landed) is the load-bearing dependency; this phase adds no new API endpoint.


1. Goal

Bound the client memory a playing track consumes to a small, configurable forward window — independent of total stream length — so a 1 GB+ DJ MIX (Phase 9 Mix medium: a single long track) plays without the whole decoded PCM accumulating in the browser.

The defect, stated precisely. The network path already streams in adaptive 16–64 KB chunks (StreamingAudioPlayerService.StreamAudioWithEarlyPlayback) — that part is fine. The accumulation is on the decode side: PlaybackScheduler holds private buffers: AudioBuffer[] and never evicts ("Supports pause/resume/seek by retaining all buffers" — its own doc comment). Every 64 KB segment the StreamDecoder decodes is pushed via addBuffer() and kept for the life of the track. Decoded PCM is larger than the compressed-or-raw source in memory (Web Audio AudioBuffer is 32-bit float per sample per channel — a 16-bit stereo WAV roughly doubles in size once decoded), so a 1 GB WAV becomes ~2 GB of retained AudioBuffer float data. That is the OOM.
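The doubling can be checked with a few lines of arithmetic. A minimal sketch, assuming integer PCM input and ignoring the WAV header (negligible at this scale); estimateDecodedBytes is an illustrative name, not a function in the codebase:

```typescript
// Illustrative helper, not part of the codebase.
// Web Audio holds decoded audio as 32-bit floats: 4 bytes per sample per channel.
const FLOAT32_BYTES_PER_SAMPLE = 4;

function estimateDecodedBytes(sourcePcmBytes: number, sourceBitsPerSample: number): number {
  const sourceBytesPerSample = sourceBitsPerSample / 8;
  // Sample count across all channels; the channel count cancels out of the ratio.
  const totalSamples = sourcePcmBytes / sourceBytesPerSample;
  return totalSamples * FLOAT32_BYTES_PER_SAMPLE;
}

const ONE_GIB = 1024 * 1024 * 1024;
const decodedFrom16Bit = estimateDecodedBytes(ONE_GIB, 16); // 2 GiB: the case described above
const decodedFrom24Bit = estimateDecodedBytes(ONE_GIB, 24); // about 1.33 GiB
```

The growth factor is 32 / sourceBitsPerSample, so a 16-bit source doubles and a 24-bit source grows by a third.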

One-line framing: today the player decodes the whole track into memory and keeps it; Phase 21 makes it keep only a sliding forward window and discard what has already played, refilling on demand from the Range primitive it already uses for seek.


2. Constraints / invariants (the contract that must hold)

These are non-negotiable. The §3.5 streaming seam (root CLAUDE.md "Streaming-first audio playback"; CONTEXT.md §3.5) is called the most architecturally load-bearing part of the playback path by both docs. This phase modifies that seam — so the contract it must preserve is spelled out here.

  • C1 — The seek-beyond-buffer Range path is the substrate, kept intact. Phase 4 landed HTTP Range: bytes={offset}- → 206 Partial Content end to end (client TrackMediaClient → DeepDrftPublic proxy → DeepDrftAPI), and StreamDecoder.reinitializeForRangeContinuation retains the parsed format header on a continuation body (no re-parse). Windowed refill is a generalization of this exact path (§3.1) — it must not require a second, divergent fetch mechanism.
  • C2 — Playback start latency unchanged. Today playback starts as soon as a configurable minimum buffer count is queued (header-derived duration, not full-file). The window model must keep first-audio latency at parity — bounding memory must not reintroduce a fetch-then-play stall.
  • C3 — The format-decoder abstraction is untouched. IFormatDecoder (WAV active; MP3/FLAC implemented, not yet wired) owns all format-specific byte math. Windowing lives in the format-agnostic layer (PlaybackScheduler eviction + StreamDecoder/player refill orchestration); it must add no format-specific branches. A future wired MP3/FLAC decoder inherits windowing for free.
  • C4 — Read-only playback only. This is a memory-management change, not a UX change. No new user-visible control, no change to seek/transport semantics beyond what the listener already experiences. Seek must still feel identical.
  • C5 — WAV-only is the shipping target; the design must not foreclose MP3/FLAC. Byte↔time mapping for refill is exact and cheap for WAV (CBR: byteRate from the header). For VBR formats the mapping is approximate (the decoders already carry TOC/SEEKTABLE seek math). The window machinery must express refill in terms of the decoder's existing calculateByteOffset, so the same code works when those formats are wired — no WAV-special-cased offset math in the window layer.
  • C6 — No regression to the single-instance JS decoder concurrency guarantees. The current code is careful that only one streaming loop touches the single JS StreamDecoder at a time (DrainActiveStreamingTaskAsync, the _streamingCancellation identity dance). Windowed refill introduces more mid-stream fetches; it must route through the same drain/cancellation discipline, not around it.
  • C7 — The Mix visualizer's data source is independent and must stay that way. The Phase 10/12 WebGL2 lava visualizer renders from a preprocessed high-res waveform datum fetched per-track (GET api/track/{entryKey}/waveform/high-res), not from live decoded PCM. Confirmed: evicting played AudioBuffers cannot starve the visualizer — it never read them. The window model is invisible to the visualizer. (This is the canonical 1 GB case and the case that proves the eviction is safe.)
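C5 calls the WAV byte↔time mapping exact and cheap. A minimal sketch of why, assuming a constant byte rate and a known data-chunk offset; WavFormatInfo and wavByteOffsetForTime are hypothetical names, and the real calculateByteOffset signature is not shown in this spec:

```typescript
// Hypothetical shape; the real IFormatDecoder interface is not shown in this spec.
interface WavFormatInfo {
  byteRate: number;   // bytes per second, from the WAV header
  blockAlign: number; // bytes per sample frame (channels * bytes per sample)
  dataOffset: number; // byte offset at which PCM data begins
  dataLength: number; // byte length of the PCM data
}

function wavByteOffsetForTime(fmt: WavFormatInfo, seconds: number): number {
  const raw = Math.floor(seconds * fmt.byteRate);
  // Snap down to a frame boundary so a refill never starts mid-sample.
  const aligned = raw - (raw % fmt.blockAlign);
  // Clamp into the data chunk: before-start maps to 0, past-end maps to the end.
  const clamped = Math.min(Math.max(aligned, 0), fmt.dataLength);
  return fmt.dataOffset + clamped;
}

// Example: 10 s of 44.1 kHz 16-bit stereo behind a 44-byte header (assumed values).
const cdStereo: WavFormatInfo = { byteRate: 176400, blockAlign: 4, dataOffset: 44, dataLength: 1764000 };
const offsetAtOneSecond = wavByteOffsetForTime(cdStereo, 1); // 44 + 176400
```

A VBR decoder would replace the multiplication with a TOC or SEEKTABLE lookup, which is why the window layer should only ever see the resulting offset.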

3. Architectural shape

3.0 The mental model

A track's audio is a byte range [0, fileLength) on disk. At any moment the listener is at playback position P (seconds → byte offset via the format decoder). The player should hold decoded AudioBuffers only for a bounded window roughly [P - back, P + ahead]:

  • forward fill (ahead) — enough decoded lookahead that playback never starves (covers the existing 500 ms scheduler lookahead plus network jitter headroom);
  • back-retain (back) — a small amount of already-played audio kept so a short seek-back does not trigger a network refetch;
  • evict — anything older than P - back is dropped (AudioBuffer references released → GC reclaims the float data);
  • refill — when forward decoded lookahead drops below a low-water mark, fetch+decode more from the current byte position; when the window's tail is evicted and the listener seeks back past it, refetch that region via the Range primitive (the seek-beyond-buffer path, run backwards).

This is a ring/sliding-window buffer keyed on playback position, driven by high/low-water marks — the standard bounded-buffer producer/consumer pattern, transplanted onto the decode→schedule seam.
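The four behaviours above reduce to a few comparisons against the playhead. A minimal sketch, with every type and function name invented for illustration (the spec leaves the real API to the implementer):

```typescript
// Illustrative types and names only.
interface WindowPolicy {
  backSeconds: number;      // already-played audio kept behind the playhead
  lowWaterSeconds: number;  // refill when decoded lookahead drops below this
  highWaterSeconds: number; // stop filling once decoded lookahead reaches this
}

interface WindowState {
  positionSeconds: number;      // playhead P
  bufferedStartSeconds: number; // absolute start of the oldest retained buffer
  bufferedEndSeconds: number;   // absolute end of the newest decoded buffer
}

interface WindowDecision {
  evictBeforeSeconds: number; // buffers ending at or before this time can be dropped
  shouldRefill: boolean;
  shouldPauseFill: boolean;
}

function decideWindow(policy: WindowPolicy, state: WindowState): WindowDecision {
  const lookahead = state.bufferedEndSeconds - state.positionSeconds;
  return {
    evictBeforeSeconds: Math.max(0, state.positionSeconds - policy.backSeconds),
    shouldRefill: lookahead < policy.lowWaterSeconds,
    shouldPauseFill: lookahead >= policy.highWaterSeconds,
  };
}

// A seek target inside the retained range needs no network fetch.
function isInsideWindow(state: WindowState, targetSeconds: number): boolean {
  return targetSeconds >= state.bufferedStartSeconds && targetSeconds < state.bufferedEndSeconds;
}
```

Between the two water marks neither flag is set, which is the steady state: nothing to fetch, nothing to pause.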

3.1 Why this is a generalization of seek-beyond-buffer, not a new mechanism

The seek-beyond-buffer path already does every primitive the window needs, just triggered manually and one-shot:

| Window operation | Existing seek-beyond-buffer machinery it reuses |
| --- | --- |
| Discard buffers, keep offset | PlaybackScheduler.clearForSeek() + setPlaybackOffset() (clears buffers, retains the absolute-time anchor) |
| Fetch from a byte offset | TrackMediaClient.GetTrackMedia(key, byteOffset) → Range: bytes=X- → 206 |
| Decode a header-less body | StreamDecoder.reinitializeForRangeContinuation(remainingByteLength) |
| Map time → byte offset | StreamDecoder.calculateByteOffset() → IFormatDecoder.calculateByteOffset() |
| Single-loop safety on refetch | _streamingCancellation swap + DrainActiveStreamingTaskAsync() |

The difference is eviction does not exist yet (the scheduler only ever clear()s wholesale) and refill is one-shot (a seek, not a continuous low-water-triggered loop). So the new work is two seams: a partial-evict on the scheduler, and a position-driven refill controller on the player. The fetch/decode/offset plumbing is reused verbatim.
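The fetch-from-a-byte-offset operation listed above rests on a small HTTP contract: an open-ended Range request, and a 206 response whose Content-Range header states which bytes came back. A minimal sketch of that contract in TypeScript, for illustration only; in this codebase the fetch is made by the C# TrackMediaClient, and both function names below are hypothetical:

```typescript
interface ContentRange {
  start: number;        // first byte returned (inclusive)
  end: number;          // last byte returned (inclusive)
  total: number | null; // full resource length, or null when the server sends "*"
}

// Builds the value for a "Range" request header: everything from byteOffset onward.
function openEndedRangeHeader(byteOffset: number): string {
  if (byteOffset < 0 || Math.floor(byteOffset) !== byteOffset) {
    throw new RangeError("byteOffset must be a non-negative integer");
  }
  return "bytes=" + byteOffset + "-";
}

// Parses a 206 response's "Content-Range" value, e.g. "bytes 1000-1999/2000".
// Returns null for anything else, including the "bytes */N" form sent with 416.
function parseContentRange(headerValue: string): ContentRange | null {
  const m = /^bytes (\d+)-(\d+)\/(\d+|\*)$/.exec(headerValue.trim());
  if (m === null) return null;
  const start = Number(m[1]);
  const end = Number(m[2]);
  if (end < start) return null;
  return { start: start, end: end, total: m[3] === "*" ? null : Number(m[3]) };
}

// Byte count of a continuation body. That this is the value handed to
// reinitializeForRangeContinuation is an assumption, not confirmed by this spec.
function continuationLength(range: ContentRange): number {
  return range.end - range.start + 1;
}
```

Checking that the returned start equals the requested offset is a cheap guard against a proxy that ignores Range and answers 200 with the whole file.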

3.2 The three candidate directions

Per file convention the alternatives are recorded; the recommendation follows.

Direction A — Sliding window on the existing single forward stream (recommended). Keep the current model where the C# loop reads one forward HTTP stream and pumps chunks into the JS decoder. Add two things: (1) PlaybackScheduler gains partial eviction — drop buffers whose absolute-time end is older than P - back, adjusting its index bookkeeping so getCurrentPosition() and scheduling stay correct against a buffer array that no longer starts at index 0; (2) a back-pressure signal — when forward decoded lookahead exceeds the high-water mark, the C# loop pauses reading the HTTP stream (stops calling ReadAsync) until playback drains it below low-water, then resumes. Memory is bounded by high-water + back-retain. Seek-back beyond the retained window falls through to the existing seek-beyond-buffer path unchanged. Why recommended: smallest change to the load-bearing seam; reuses the live forward stream (no extra connections in the common case); eviction and back-pressure are the only genuinely new mechanisms, and both are local (one to the scheduler, one to the read loop). Back-pressure via "stop reading the socket" is exactly how TCP flow control already wants to behave — once ReadAsync stops draining the socket, the receive buffer fills and the advertised TCP window closes, so the sender slows on its own; we are not fighting the transport.
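The back-pressure signal in (2) is a two-threshold gate with hysteresis. A minimal sketch in TypeScript for illustration; in the real player the decision would gate the C# ReadAsync loop, and BackPressureGate is an invented name:

```typescript
// Illustrative only: decides whether the read loop may pull the next chunk.
class BackPressureGate {
  private readonly lowWaterSeconds: number;
  private readonly highWaterSeconds: number;
  private holding = false; // true while the read loop should not read

  constructor(lowWaterSeconds: number, highWaterSeconds: number) {
    if (lowWaterSeconds >= highWaterSeconds) {
      throw new RangeError("low-water must be below high-water");
    }
    this.lowWaterSeconds = lowWaterSeconds;
    this.highWaterSeconds = highWaterSeconds;
  }

  // Feed the current decoded lookahead; returns true when reading may proceed.
  mayRead(lookaheadSeconds: number): boolean {
    if (this.holding) {
      // Resume only once playback has drained below the low-water mark.
      if (lookaheadSeconds < this.lowWaterSeconds) this.holding = false;
    } else if (lookaheadSeconds >= this.highWaterSeconds) {
      // Enough decoded audio is queued; stop pulling from the stream.
      this.holding = true;
    }
    return !this.holding;
  }
}
```

The gap between the two marks is what keeps the loop from toggling on every chunk: a lookahead between low and high keeps whichever state the gate was already in.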

Direction B — Discrete window segments, each its own Range fetch. Treat the file as fixed-size byte segments (e.g. 4 MB). Hold N decoded segments around P; fetch the next/previous segment via a fresh Range request as the window slides; discard the far segment. No live long-lived forward stream — every window is an independent 206. Why not (default): turns one connection into many short Range requests (more proxy hops through DeepDrftPublic, more server-side WavOffsetService-style header synthesis, more places a fetch can fail mid-stream — worsening the §1.6 error surface), and the byte↔time segment math must be exact at every boundary. It is the cleaner model for true random-access (and the better base if seeking-heavy usage dominates), so keep it as the fallback if Direction A's back-pressure proves leaky in practice. Borrowed prior art: HLS/DASH segment windows and the MSE SourceBuffer.remove() eviction model — this is how every production HTML5 adaptive player bounds memory. We are doing the hand-rolled equivalent because the stack is a bespoke Web Audio graph, not <media> + MSE.
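For comparison, the segment bookkeeping Direction B would need is small but must be exact. A minimal sketch, assuming fixed-size segments counted from byte 0 and inclusive byte ranges; the names are illustrative, and a real implementation would also have to keep segment boundaries on sample-frame boundaries and account for the header in the first segment:

```typescript
interface SegmentRange {
  index: number;
  firstByte: number; // inclusive
  lastByte: number;  // inclusive, matching HTTP Range semantics
}

// Lists the segments to hold around the playhead, clipped to the file.
function segmentsAround(
  playheadByte: number,
  segmentBytes: number,
  fileLength: number,
  segmentsBehind: number,
  segmentsAhead: number
): SegmentRange[] {
  const lastIndex = Math.ceil(fileLength / segmentBytes) - 1;
  const current = Math.min(Math.floor(playheadByte / segmentBytes), lastIndex);
  const from = Math.max(0, current - segmentsBehind);
  const to = Math.min(lastIndex, current + segmentsAhead);
  const result: SegmentRange[] = [];
  for (let i = from; i <= to; i++) {
    const firstByte = i * segmentBytes;
    // The final segment is usually shorter than segmentBytes.
    const lastByte = Math.min(firstByte + segmentBytes, fileLength) - 1;
    result.push({ index: i, firstByte: firstByte, lastByte: lastByte });
  }
  return result;
}

// Each held segment is one closed-range request.
function rangeHeaderFor(segment: SegmentRange): string {
  return "bytes=" + segment.firstByte + "-" + segment.lastByte;
}
```

Every slide of the window costs one new request per segment entering it, which is the connection overhead the paragraph above weighs against Direction A.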

Direction C — Adopt MediaSource Extensions (MSE) and let the browser manage the buffer. Stop hand-rolling the decode→schedule graph for long tracks; feed the Range stream into a SourceBuffer and let the browser evict via its built-in quota + remove(). Memory management becomes the platform's problem. Why not (now, but flag for Daniel): MSE does not accept raw WAV/PCM — it wants containerized formats (fragmented MP4/WebM, or MP3/AAC elementary streams). The current producer is WAV-only, and the entire bespoke visualizer/spectrum graph is wired to the Web Audio AudioContext, not a <media> element. Adopting MSE is a rewrite of the playback substrate, not a windowing change — out of scope for this phase. But it is the real long-term answer and is entangled with Phase 1.2 (non-WAV formats): if DeepDrft moves to a compressed delivery format, MSE becomes viable and could retire the hand-rolled decoder, the seek-beyond-buffer path, and this phase's window machinery in one move. Surfaced as open question OQ5 — not to decide now, but so this phase is built knowing it may be superseded.

Direction A is the smallest coherent change that hits the headline (bounded memory under a 1 GB stream) while honoring C1–C7. It keeps the live forward stream, reuses the seek-beyond-buffer path for the only genuinely random-access case (seek-back past the retained tail), and isolates the two new mechanisms. The final architecture and the exact eviction/back-pressure API are staff-engineer's call at implementation (per file convention); this spec fixes the shape and the invariants, not the method signatures.

3.4 SOLID / road-not-taken rationale

  • SRP, preserved. Eviction is a PlaybackScheduler concern (it already owns buffer storage); refill orchestration is a player-service/StreamDecoder concern (they already own the fetch loop); byte↔time math stays in IFormatDecoder. No responsibility crosses a boundary it does not already own.
  • OCP, via C3/C5. Windowing added in the format-agnostic layer means wiring MP3/FLAC later changes zero window code. The window expresses refill through calculateByteOffset — the one seam the decoders already implement.
  • The seam stays single-writer (C6). Every new refetch routes through the existing cancellation/drain discipline, so "only one loop touches the JS decoder" remains true. This is the rule most likely to be violated by a naive implementation and is called out as a hard invariant.
  • Road not taken — eager full decode with a memory cap that just stops decoding. Tempting (decode until you hit a byte budget, then stop) but it breaks playback of long tracks past the cap entirely — it bounds memory by refusing to play the rest, not by sliding. Rejected: it is a degradation, not a feature.

4. Use cases

  • UC1 — Play a 1 GB+ DJ MIX start to finish (the headline). Memory stays bounded throughout; the listener experiences continuous playback identical to a short track.
  • UC2 — Seek forward within a long track. Already handled by seek-beyond-buffer; under windowing the forward seek clears the window and refills at the target — no behavior change, now with eviction so the pre-seek region does not linger.
  • UC3 — Seek back a few seconds. Served from the back-retain window with no network refetch (the reason back exists).
  • UC4 — Seek back far, past the evicted tail. Falls through to the existing seek-beyond-buffer Range fetch, run toward an earlier offset. (Open question OQ2 — see §6.)
  • UC5 — Pause a long track for a long time. Memory stays at the bounded window size while paused (no continued decode). On resume, forward fill restarts from the low-water trigger.
  • UC6 — Mix detail page with the lava visualizer running. Visualizer reads its preprocessed datum (C7); windowing is invisible to it. Confirmed non-interaction.

5. Interaction with the deferred Phase 1 streaming features

This phase touches the same decoder/scheduler seam as the deferred Phase 1.3/1.4/1.5 items and the 1.6/1.7 robustness items. The interactions, explicitly:

  • 1.3 Preload / prefetch (deferred; preload half). Shares machinery, does not conflict — and should be sequenced after. Preload stages the next track into a second decoder instance during the current track's tail; windowing bounds the current track's forward buffer. They are orthogonal axes (next-track vs. current-track-window), but they compound the memory question: a naive preload of a second 1 GB mix would reintroduce the OOM this phase fixes. Recommendation: land windowing first, so that when preload arrives, the staged next-track decoder is also windowed by construction (it inherits the bounded scheduler). Windowing makes preload safe for long tracks; without it, preload of mixes is a memory hazard.
  • 1.4 Crossfade (deferred). Needs two simultaneous PlaybackScheduler instances briefly overlapping. Both would be windowed instances — the overlap doubles the window size momentarily, not the whole track. Windowing makes crossfade between two long mixes affordable. No reordering needed; 1.4 still gates on 1.3.
  • 1.5 Gapless (deferred). Sample-accurate hand-off of the next track's first buffer at the current track's last buffer. Windowing changes which buffers are retained but not the hand-off mechanism; the only care point is that the current track's final window must not be evicted before the gapless boundary is scheduled. A minor invariant for whoever builds 1.5, not a blocker. Note 1.5's existing WAV-only caveat stands.
  • 1.6 Track-skip on error (deferred). Windowing enlarges the error surface — call this out. Today a fetch failure happens at load (one fetch) or at a user seek (one fetch). Windowed refill issues mid-stream fetches the listener did not initiate; one of those can fail at byte 700 M of a 1 GB mix. So Phase 21 should ship with at least the cheap half of 1.6: a mid-stream refill failure must surface a clear error and not wedge the player (it must not leave playback "running" with a starved scheduler — mirror the playFromPosition end-of-buffer recovery already in PlaybackScheduler). The rich half (byte-scan to next valid frame) stays deferred. Recommendation: fold the minimal refill-failure handling into Phase 21's acceptance criteria (AC6) rather than leaving it entirely to 1.6 — it is created by this phase.
  • 1.7 Safari compatibility (deferred). Windowing adds no new Safari-specific surface beyond what the streaming path already has. The one adjacency: more frequent AudioContext activity during refill should be checked against the older-Safari webkitAudioContext quirks when 1.7 is addressed — note it, do not block on it.
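The "cheap half" of 1.6 described above amounts to making refill failure an explicit state rather than a silent stall. A minimal sketch of one possible policy, with all names invented for illustration; whether playback should first drain the audio it still holds is left open here:

```typescript
type PlayerPhase = "playing" | "paused" | "failed";

interface PlayerSnapshot {
  phase: PlayerPhase;
  positionSeconds: number;
  errorMessage: string | null;     // what the listener is shown
  retryFromSeconds: number | null; // where a retry would resume
}

// A failed refill never leaves the phase at "playing".
function onRefillFailed(snapshot: PlayerSnapshot, reason: string): PlayerSnapshot {
  return {
    phase: "failed",
    positionSeconds: snapshot.positionSeconds,
    errorMessage: "Streaming stopped: " + reason,
    retryFromSeconds: snapshot.positionSeconds,
  };
}

// Recovery is an ordinary seek to the remembered position.
function onRetryRequested(snapshot: PlayerSnapshot): PlayerSnapshot {
  if (snapshot.phase !== "failed" || snapshot.retryFromSeconds === null) return snapshot;
  return {
    phase: "playing",
    positionSeconds: snapshot.retryFromSeconds,
    errorMessage: null,
    retryFromSeconds: null,
  };
}
```

Because retry resumes from a position, it can reuse the seek-beyond-buffer path rather than needing its own fetch logic.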

6. Open questions for Daniel (genuine product decisions, not implementation detail)

These are policy calls with user-visible or resource trade-offs — flagged rather than decided here.

  • OQ1 — Window size policy. What bounds the window — a fixed byte/time budget (e.g. "hold at most ~30 s decoded ahead + ~10 s behind"), or a configurable memory budget (e.g. "≤ N MB of decoded PCM") that derives the time window from the stream's byte rate? Recommend a time-based forward window + small time-based back-retain as the primary knob (intuitive, format-portable), with a hard memory ceiling as a secondary guard. The exact numbers are tunable post-landing; Daniel picks the policy axis. [Daniel decision]
  • OQ2 — Seek-back past the evicted window. When the listener seeks back earlier than the retained tail, we must refetch (the audio is gone). Acceptable to take the same brief re-buffer the forward seek-beyond-buffer takes today? (Recommend yes — it is the symmetric case and listeners already accept it forward.) Or should back-retain be generous enough that this is rare? [Daniel decision]
  • OQ3 — Configurable total in-flight memory cap. Should there be a single hard byte ceiling on total decoded audio held by the player (a safety net independent of the window-size policy), exposed as a config value? Recommend yes, as a guard rail even if the window policy is time-based — it is the backstop that makes "1 GB stream never OOMs" a guarantee rather than a tuning hope. [Daniel decision]
  • OQ4 — Apply windowing to all tracks, or only long ones? A 3-minute Cut decoded whole is ~30–60 MB — harmless today. Windowing everything is simpler (one code path) but adds refill machinery to short tracks that never needed it. Recommend window everything (one path, C6-safe, and short tracks simply never hit a refill because they fit inside the forward window) — but Daniel may prefer a size threshold. [Daniel decision]
  • OQ5 — Is MSE (Direction C) the real destination? Not for this phase, but it bears on how much to invest here. If DeepDrft will move to compressed delivery (Phase 1.2) and MSE within ~a year, Phase 21 should be the minimal Direction-A change (don't gold-plate machinery MSE would retire). If WAV + bespoke graph is the long-term commitment, a more thorough windowing investment is justified. [Daniel steer — informs scope, not a blocker]
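OQ1 and OQ3 are two views of the same quantity, since decoded bytes per second is fixed by sample rate and channel count. A minimal sketch of the conversion; the 44.1 kHz stereo stream and the 64 MiB ceiling are assumed example values, not proposals:

```typescript
// Decoded audio is 32-bit float: 4 bytes per sample per channel.
const BYTES_PER_FLOAT_SAMPLE = 4;

function decodedBytesPerSecond(sampleRate: number, channels: number): number {
  return sampleRate * channels * BYTES_PER_FLOAT_SAMPLE;
}

// Memory held by a time-based window (OQ1 expressed in bytes).
function windowFootprintBytes(
  aheadSeconds: number,
  backSeconds: number,
  sampleRate: number,
  channels: number
): number {
  return (aheadSeconds + backSeconds) * decodedBytesPerSecond(sampleRate, channels);
}

// Longest total window a hard byte ceiling allows (OQ3 expressed in seconds).
function maxWindowSeconds(ceilingBytes: number, sampleRate: number, channels: number): number {
  return ceilingBytes / decodedBytesPerSecond(sampleRate, channels);
}

// The "~30 s ahead + ~10 s behind" example at 44.1 kHz stereo.
const exampleFootprint = windowFootprintBytes(30, 10, 44100, 2); // 14,112,000 bytes, about 13.5 MiB
// A hypothetical 64 MiB ceiling at the same rate.
const exampleCeilingSeconds = maxWindowSeconds(64 * 1024 * 1024, 44100, 2); // about 190 s
```

At these example rates the time-based window sits well under the ceiling, so the ceiling would only bind for high sample rates or many channels.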

7. Acceptance criteria

  • AC1 (headline) — Bounded memory under a 1 GB stream. Playing a 1 GB+ WAV mix start to finish, the browser tab's retained decoded-audio memory stays bounded to the configured window (not growing toward ~2 GB). Verifiable via browser memory tooling: peak decoded-audio footprint is independent of track length and tracks the window-size policy, not the file size.
  • AC2 — Playback-start latency at parity (C2). First-audio latency for a track is unchanged from pre-windowing (within noise). Windowing does not introduce a fetch-then-play stall.
  • AC3 — Continuous playback, no starvation. A long mix plays edge to edge with no audible gaps, underruns, or stalls under normal network conditions — the forward fill stays ahead of the playhead.
  • AC4 — Seek-back within the window is instant (UC3). A short backward seek into retained audio produces no network request.
  • AC5 — Seek (forward, and back past the window) still works (UC2/UC4). Both resolve via the existing Range path with the same behavior the listener sees today; the pre-seek region is evicted, not retained.
  • AC6 — A mid-stream refill failure degrades cleanly (the 1.6 adjacency). A failed refill fetch surfaces a clear user-visible error and leaves the player in a recoverable state (not a wedged "playing" with a starved scheduler). It must not silently hang.
  • AC7 — The Mix visualizer is unaffected (C7). With the lava visualizer running on a long mix, the visualizer renders identically (it reads the preprocessed datum, never the evicted buffers).
  • AC8 — Single-decoder concurrency invariant holds (C6). Under rapid seek + refill activity, no interleaved ProcessStreamingChunk calls corrupt the single JS decoder (the existing drain/cancel discipline still governs every fetch).

8. Wave decomposition

Dependency shape: 21.1 → 21.2 → 21.3, with 21.4 validating the whole. 21.1 is the cold-start prerequisite and the load-bearing change; the rest layer on it.

  • 21.1 — Partial eviction in PlaybackScheduler (cold-start; the load-bearing change). Give the scheduler the ability to drop already-played buffers and keep its position/index bookkeeping correct against a buffer array that no longer begins at absolute time 0 (today getCurrentPosition, playFromPosition, and the scheduling loop all assume buffers[0] is the track start). This is the hardest correctness work in the phase — the time-anchor math must stay exact through eviction. No refill yet; with eviction alone and the forward read loop unchanged, this is provably memory-bounded for the played region. Independent of the §6 open questions — it can begin immediately; the window sizes (OQ1/OQ3) are parameters fed in later. Settled and cold-start.
  • 21.2 — Back-pressure on the forward read loop (the bound on the unplayed region). Make the C# StreamAudioWithEarlyPlayback loop stop calling ReadAsync when forward decoded lookahead exceeds the high-water mark, and resume below low-water. Together with 21.1, this bounds both the played and unplayed sides — the full memory guarantee (AC1). Must route resume/pause through the existing cancellation-safe single-loop discipline (C6). Depends on 21.1 (eviction must exist so the drained region is reclaimed, not merely un-read).
  • 21.3 — Seek-back-past-window refill (close the random-access case). Wire UC4: when a backward seek lands earlier than the retained tail, refetch via the existing seek-beyond-buffer Range path pointed at the earlier offset. Also add the minimal AC6 refill-failure handling. Mostly reuse of the landed seek path; the new work is the trigger (window-miss detection) and the clean-failure path. Depends on 21.1 + 21.2 (needs the window boundaries they define).
  • 21.4 — Validation pass against the 1 GB target (acceptance). Exercise AC1–AC8 against a real 1 GB+ mix: memory profiling (AC1), latency parity (AC2), edge-to-edge playback (AC3), the seek matrix (AC4/AC5), induced refill failure (AC6), visualizer-running (AC7), and rapid-seek concurrency (AC8). Largely test/measurement; any break is likely a tuning fix in the 21.1 anchor math or the 21.2 water-marks. Depends on 21.1–21.3.
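The anchor bookkeeping that 21.1 calls the hardest correctness work can be sketched in isolation. This is a minimal illustration, not the PlaybackScheduler API: the class and method names are invented, FrameBuffer stands in for AudioBuffer, and positions are tracked in integer sample frames (an assumption made here so the anchor cannot drift from floating-point accumulation):

```typescript
// Stand-in for AudioBuffer: only its length in sample frames matters here.
interface FrameBuffer {
  length: number; // sample frames, like AudioBuffer.length
}

class EvictingBufferList {
  private buffers: FrameBuffer[] = [];
  private baseFrame = 0; // absolute track frame at which buffers[0] starts

  append(buffer: FrameBuffer): void {
    this.buffers.push(buffer);
  }

  startFrame(): number {
    return this.baseFrame;
  }

  endFrame(): number {
    let end = this.baseFrame;
    for (let i = 0; i < this.buffers.length; i++) {
      const b = this.buffers[i];
      if (b !== undefined) end += b.length;
    }
    return end;
  }

  // Drops every leading buffer that ends at or before cutoffFrame.
  // Returns how many buffers were released.
  evictBefore(cutoffFrame: number): number {
    let released = 0;
    while (this.buffers.length > 0) {
      const first = this.buffers[0];
      if (first === undefined || this.baseFrame + first.length > cutoffFrame) break;
      this.buffers.shift();
      this.baseFrame += first.length; // the anchor advances with the evicted audio
      released++;
    }
    return released;
  }

  // Maps an absolute frame to (index, offset) in the shortened array,
  // or null when the frame lies outside the retained window.
  locate(absoluteFrame: number): { index: number; offsetFrames: number } | null {
    if (absoluteFrame < this.baseFrame) return null;
    let cursor = this.baseFrame;
    for (let i = 0; i < this.buffers.length; i++) {
      const b = this.buffers[i];
      if (b === undefined) break;
      if (absoluteFrame < cursor + b.length) {
        return { index: i, offsetFrames: absoluteFrame - cursor };
      }
      cursor += b.length;
    }
    return null;
  }
}
```

A null from locate is the window-miss signal 21.3 needs: the caller falls through to a Range refetch instead of indexing into the array.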

9. Cross-references (read before implementing)

  • Root CLAUDE.md "Streaming-first audio playback" / CONTEXT.md §3.5 — the seam this phase modifies; the §2 invariants here restate its contract. Both flag it as the most load-bearing path.
  • PLAN.md Phase 4 (landed) / COMPLETED.md — the HTTP Range: bytes=X- primitive this generalizes.
  • PLAN.md Phase 1.3 / 1.4 / 1.5 / 1.6 / 1.7 — the deferred decoder/scheduler-seam features; §5 above reconciles each.
  • PLAN.md Phase 9 — defines the Mix medium (single long track), the canonical 1 GB case.
  • PLAN.md Phase 10 / product-notes/phase-10-mix-visualizer-lava-reframe.md / product-notes/phase-12-waveform-visualizer-generalization.md — establishes the preprocessed per-track high-res waveform datum; the basis for C7 (visualizer does not read live PCM).
  • DeepDrftPublic/Interop/audio/PlaybackScheduler.ts — owns the unbounded buffers: AudioBuffer[]; 21.1 lives here.
  • DeepDrftPublic/Interop/audio/StreamDecoder.ts — reinitializeForRangeContinuation, calculateByteOffset; the refill substrate.
  • DeepDrftPublic.Client/Services/StreamingAudioPlayerService.cs — the C# forward read loop (StreamAudioWithEarlyPlayback), the seek-beyond-buffer path (SeekBeyondBuffer), and the cancellation/drain discipline (C6); 21.2/21.3 live here.
  • DeepDrftPublic.Client/Clients/TrackMediaClient.cs — the Range-capable media fetch reused by refill.