Mina Emadi

Jam Track Studio β€” Technical Report

1. Overview

Jam Track Studio is a web application that lets musicians upload individual instrument stems, detect the song's BPM and musical key, then shift the pitch and tempo in real time β€” either for the whole song or a selected region. The app runs entirely in the browser for playback and mixing (via the Web Audio API), with a Python backend handling the computationally expensive audio analysis and processing.


2. Architecture

```
                     ┌──────────────────────────────────────────┐
                     │  Browser (React + Web Audio API)         │
                     │                                          │
                     │  FileUpload → AnalysisDisplay            │
                     │  ControlPanel → TransportBar             │
                     │  StemMixer → Waveform                    │
                     │                                          │
                     │  useSession (REST client)                │
                     │  useAudioEngine (Web Audio graph)        │
                     │  useProcessingProgress (WS listener)     │
                     └───────────────────┬──────────────────────┘
                                         │  REST / WebSocket
                     ┌───────────────────┴──────────────────────┐
                     │  Backend (FastAPI + Uvicorn)             │
                     │                                          │
                     │  /api/upload          POST               │
                     │  /api/detect/:id      POST               │
                     │  /api/process/:id     POST + WS          │
                     │  /api/stem/:id/:name  GET (streaming)    │
                     │                                          │
                     │  Services: bpm_detector, key_detector,   │
                     │            audio_processor, midi_analyzer│
                     │  In-memory session store                 │
                     └──────────────────────────────────────────┘
```

Why this split?

  • CPU-intensive work stays on the server. Pitch shifting and time stretching via Rubber Band are heavy operations that benefit from native C++ performance and multi-core parallelism (ProcessPoolExecutor with up to 6 workers). Running these in the browser would be impractically slow.
  • Playback and mixing stay in the browser. Volume, pan, reverb, solo, and mute changes are instant because they manipulate Web Audio API GainNode and StereoPannerNode parameters β€” no network round-trip, no re-encoding.
  • WebSocket for progress. Processing multiple stems takes seconds. Rather than polling, the backend pushes per-stem progress events over a WebSocket so the UI can show a live progress overlay.

3. Backend

3.1 Framework: FastAPI

FastAPI was chosen for:

  • Async I/O β€” WebSocket support and non-blocking file uploads come built-in
  • Pydantic validation β€” request/response schemas with automatic type checking
  • OpenAPI docs β€” auto-generated at /docs for debugging
  • Performance β€” Uvicorn ASGI server handles concurrent requests efficiently

The app runs on port 7860 (Hugging Face Spaces requirement) and serves the built React frontend as static files at the root path.

3.2 Session Model (In-Memory)

Each upload creates a Session object stored in a module-level Python dict:

```
Session
├── id: UUID
├── stems: {name → StemData(audio, sample_rate)}
├── processed_stems: {name → StemData}          # full-song processed
├── region_processed_stems: {name → StemData}   # region-only processed
├── detected_bpm, detected_key, detected_mode
├── detection_confidence
├── midi_data
├── wav_cache: {cache_key → encoded bytes}
└── created_at (auto-cleaned after 1 hour)
```

Why in-memory? This is an MVP. There's no user authentication, no persistence requirement, and sessions are short-lived. A background task runs every 10 minutes to delete sessions older than 1 hour.

Why separate processed_stems and region_processed_stems? When the user processes the full song, then selects a region and processes just that portion, we need both versions available: the full-song version for "Play Full Song" and the region slice for looped region playback. Storing them separately avoids one overwriting the other.

3.3 Upload Pipeline

POST /api/upload accepts multipart form data with named stem files (guitar, drums, bass, synth, click_record) and optional MIDI files.

Processing steps:

  1. Validate file types (.wav, .mid/.midi only) and size (max 120 MB per stem)
  2. Read audio via soundfile.read() β†’ numpy float32 arrays
  3. Convert stereo to mono (halves memory, simplifies processing)
  4. Validate that all stems share the same sample rate and that their durations match within 1 second
  5. Generate a mix by summing all stems and normalizing to peak 0.95
  6. Parse MIDI files (if provided) via mido.MidiFile
  7. Pre-encode all stems as WAV bytes and cache them for fast first playback
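Steps 3 and 5 reduce to a few numpy operations. A sketch, assuming all stems have already been validated to the same length (step 4 handles mismatches):

```python
import numpy as np

def to_mono(audio: np.ndarray) -> np.ndarray:
    """Average stereo channels to mono (halves memory, simplifies processing)."""
    return audio.mean(axis=1) if audio.ndim == 2 else audio

def make_mix(stems: dict[str, np.ndarray], peak: float = 0.95) -> np.ndarray:
    """Sum all stems and normalize the result to the target peak."""
    mix = np.sum([to_mono(s) for s in stems.values()], axis=0)
    max_abs = np.max(np.abs(mix))
    if max_abs > 0:
        mix = mix * (peak / max_abs)
    return mix.astype(np.float32)
```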

3.4 Stem Serving

```
GET /api/stem/{session_id}/{stem_name}?processed=true&region=false
```

The endpoint resolves which version of a stem to serve using a priority chain:

  1. If region=true and processed=true β†’ check region_processed_stems
  2. If processed=true β†’ check processed_stems
  3. Fallback β†’ stems (originals)

Encoded WAV bytes are cached in session.wav_cache keyed by "{stem}_{region|processed|original}" so repeated downloads skip the encoding step.
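The priority chain and the cache can be sketched together (helper names here are illustrative; the real endpoint streams the encoded bytes back to the client):

```python
def resolve_stem(session, name: str, processed: bool, region: bool):
    """Return (variant_tag, stem) following the priority chain."""
    if region and processed and name in session.region_processed_stems:
        return "region", session.region_processed_stems[name]
    if processed and name in session.processed_stems:
        return "processed", session.processed_stems[name]
    return "original", session.stems[name]            # fallback: originals

def get_wav(session, name: str, processed: bool, region: bool, encode) -> bytes:
    """Serve encoded WAV bytes, cached per stem+variant so repeats skip encoding."""
    tag, stem = resolve_stem(session, name, processed, region)
    key = f"{name}_{tag}"
    if key not in session.wav_cache:
        session.wav_cache[key] = encode(stem)
    return session.wav_cache[key]
```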


4. Audio Analysis

4.1 BPM Detection

Primary: Essentia RhythmExtractor2013 (multifeature method)

Essentia is a C++ library with Python bindings purpose-built for music information retrieval. The multifeature method combines multiple rhythm analysis approaches internally for robust BPM estimation.

Steps:

  1. Resample audio to 44100 Hz (Essentia's expected rate)
  2. Run RhythmExtractor2013(method="multifeature") β†’ returns BPM, beat ticks, and confidence
  3. If confidence < 0.5 and a drums stem is available, re-run on the drums stem alone (drums carry the strongest rhythmic signal)
  4. Apply octave error correction: constrain BPM to 50–200 range by doubling or halving. This handles the common case where the algorithm returns 60 BPM for a 120 BPM song (half-time detection)
  5. Clamp confidence to [0, 1]
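Step 4 can be sketched as folding the estimate by octaves until it lands in the plausible range:

```python
def correct_octave(bpm: float, low: float = 50.0, high: float = 200.0) -> float:
    """Fold half-time / double-time detections into the [low, high] BPM range."""
    while bpm < low:
        bpm *= 2    # half-time detection: double it
    while bpm > high:
        bpm /= 2    # double-time detection: halve it
    return bpm
```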

Fallback: librosa beat tracking

If Essentia is unavailable:

```python
# librosa fallback: estimate tempo, then derive a rough confidence heuristic
tempo, beat_frames = librosa.beat.beat_track(y=audio, sr=sr)
onset_env = librosa.onset.onset_strength(y=audio, sr=sr)
# onset-strength variance squashed into [0, 1): stronger, more varied onsets
# suggest a clearer beat and therefore a more trustworthy estimate
confidence = np.std(onset_env) / (np.std(onset_env) + 1)
```

MIDI BPM (highest priority when available): Extracted directly from set_tempo messages: BPM = 60,000,000 / tempo_microseconds. This is exact β€” confidence is 1.0.
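Applied to mido's `set_tempo` value (microseconds per quarter note), the conversion is one line:

```python
def midi_bpm(tempo_microseconds: int) -> float:
    """A set_tempo message stores microseconds per quarter note."""
    return 60_000_000 / tempo_microseconds

# e.g. the MIDI default of 500,000 µs per beat is exactly 120 BPM
```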

4.2 Key Detection

Primary: Essentia Ensemble Voting with 4 Key Profiles

Key detection is inherently ambiguous (relative major/minor, modal ambiguity), so we use an ensemble approach:

Profiles: temperley, krumhansl, edma, bgate

Each profile represents a different statistical model of how pitch classes distribute in tonal music:

  • Temperley β€” optimized for pop/rock
  • Krumhansl β€” classic music cognition research profile
  • EDMA β€” Electronic Dance Music Analysis profile
  • Bgate β€” alternative weighting

For each profile, Essentia's KeyExtractor returns a (key, mode, strength) tuple. All votes are accumulated into a weighted tally. The key with the highest total strength wins.

Bass weighting: If a bass stem is available and its key detection confidence exceeds 0.3, its votes are added at 0.5x weight. Bass notes strongly indicate the harmonic root.
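A sketch of the weighted tally (the input shapes are assumptions; Essentia's KeyExtractor supplies the per-profile tuples, and the real code gates bass votes on the bass stem's overall confidence):

```python
from collections import defaultdict

def vote_key(profile_results, bass_results=None,
             bass_weight=0.5, bass_min_conf=0.3):
    """profile_results: list of (key, mode, strength) tuples, one per profile.
    bass_results: same shape, computed on the bass stem alone (optional)."""
    tally = defaultdict(float)
    for key, mode, strength in profile_results:
        tally[(key, mode)] += strength
    if bass_results:
        for key, mode, strength in bass_results:
            if strength > bass_min_conf:              # only confident bass votes count
                tally[(key, mode)] += bass_weight * strength
    return max(tally, key=tally.get)                  # highest total strength wins
```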

Fallback: librosa chroma-based correlation

Computes a Constant-Q chromagram, averages across time to get a 12-element pitch class profile, then correlates against rotated Temperley major/minor profiles for all 12 keys. The key with highest Pearson correlation wins.
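The core of the fallback is rotating a reference profile through all 12 keys and scoring each rotation. A sketch with an illustrative major profile (the real code uses Temperley's major and minor profiles):

```python
import numpy as np

NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def best_key(chroma: np.ndarray, profile: np.ndarray) -> str:
    """chroma: 12-element time-averaged pitch-class profile.
    Correlate against the profile rotated to each of the 12 tonics."""
    scores = [np.corrcoef(chroma, np.roll(profile, k))[0, 1] for k in range(12)]
    return NOTES[int(np.argmax(scores))]
```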

MIDI Key Detection: Builds a pitch class histogram weighted by note_duration * velocity, then runs the same ensemble voting against Temperley profiles.


5. Audio Processing

5.1 Pitch Shifting & Time Stretching

Algorithm: Rubber Band Library

Rubber Band is a high-quality C++ library for audio time-stretching and pitch-shifting. It uses a phase vocoder approach with sophisticated transient detection and phase-locking to minimize artifacts.

Three code paths depending on what's needed:

| Change | Method |
| --- | --- |
| Pitch only | `pyrubberband.pitch_shift(audio, sr, n_steps)` |
| Tempo only | `pyrubberband.time_stretch(audio, sr, rate)` |
| Both | Rubber Band CLI single pass (avoids two-pass quality loss) |

Stem-specific optimization via --crisp flag:

The --crisp parameter controls how aggressively Rubber Band preserves transients:

| Stem type | `--crisp` | Rationale |
| --- | --- | --- |
| Drums/percussion | 6 (maximum) | Drum attacks must be razor-sharp; smeared transients sound unnatural |
| Bass | 3 + `--fine` | Low frequencies need precise handling; `--fine` uses a higher-resolution filter |
| Default (guitar, synth, keys) | 4 | Balanced: preserves attacks without over-sharpening sustained sounds |

Fallback: If the rubberband CLI binary isn't installed, the code falls back to two-pass pyrubberband (pitch shift first, then time stretch). This produces slightly lower quality but always works.
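The CLI path plus per-stem flags can be sketched as a small argument builder (the flag spellings follow the Rubber Band CLI, where `--pitch` is in semitones and `--tempo` is a stretch ratio, but the exact invocation in the real code may differ; file names here are placeholders):

```python
CRISP = {"drums": "6", "percussion": "6", "bass": "3"}

def rubberband_args(stem_name: str, semitones: float, tempo_ratio: float) -> list[str]:
    """Build a rubberband CLI argument list for a single-pass pitch+tempo change."""
    args = ["rubberband", "--pitch", str(semitones), "--tempo", str(tempo_ratio)]
    args += ["--crisp", CRISP.get(stem_name, "4")]   # default 4 for guitar/synth/keys
    if stem_name == "bass":
        args.append("--fine")                        # higher-resolution filter for lows
    return args + ["in.wav", "out.wav"]
```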

5.2 Parallel Processing

All stems are processed simultaneously using Python's ProcessPoolExecutor with up to 6 workers. Each worker runs in a separate process (true parallelism, no GIL limitation) and handles one stem. Progress is reported per-stem via WebSocket as each worker completes.
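The fan-out can be sketched with ProcessPoolExecutor (the worker here is a stand-in; the real one runs Rubber Band, and `on_progress` would push events over the WebSocket):

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

def process_stem(name, audio, gain=0.5):
    """Stand-in worker: the real one pitch-shifts/time-stretches one stem."""
    return name, [x * gain for x in audio]

def process_all(stems, on_progress, max_workers=6):
    """Process all stems in parallel; report each stem as its worker finishes."""
    results = {}
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_stem, n, a): n for n, a in stems.items()}
        for done, future in enumerate(as_completed(futures), start=1):
            name, out = future.result()
            results[name] = out
            on_progress(name, done / len(futures))   # relayed over the WebSocket
    return results
```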

5.3 Region Processing

When region_start and region_end are provided in the process request:

  1. Each stem's numpy array is sliced: audio[int(start * sr) : int(end * sr)]
  2. Only the slice goes through Rubber Band
  3. Results are stored in session.region_processed_stems (not processed_stems)
  4. The WAV cache is cleared to avoid serving stale data

This means region processing time scales with the region length: a 5-second region processes much faster than a 3-minute song.
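Step 1's slice is plain numpy indexing:

```python
import numpy as np

def slice_region(audio: np.ndarray, sr: int, start_s: float, end_s: float) -> np.ndarray:
    """Extract [start_s, end_s) seconds of a mono stem; only this slice
    goes through Rubber Band."""
    return audio[int(start_s * sr):int(end_s * sr)]
```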

5.4 Mastering Chain

After stems are processed and summed, a mastering chain is applied using Spotify's Pedalboard library (Python bindings to JUCE audio plugins):

Compressor β†’ Limiter β†’ Output
Parameter Value Purpose
Compressor threshold -10 dB Gentle compression; tames peaks without squashing dynamics
Compressor ratio 3:1 Moderate β€” enough to control, not enough to pump
Compressor attack 10 ms Fast enough to catch transients
Compressor release 150 ms Smooth recovery, avoids pumping artifacts
Limiter threshold -1 dB Hard ceiling prevents clipping
Limiter release 100 ms Transparent limiting

6. Frontend

6.1 Framework & Build

  • React 18 β€” component-based UI with hooks for state management
  • Vite β€” near-instant hot module replacement during development, optimized production builds
  • Tailwind CSS β€” utility-first styling with custom blue/purple theme and glassmorphism effects

No state management library (Redux, Zustand) β€” the app's state is simple enough to manage with React's built-in useState and useCallback hooks distributed across three custom hooks.

6.2 Web Audio API Signal Chain

The entire mixer runs in the browser. For each stem:

```
AudioBufferSourceNode (decoded PCM data)
  → GainNode (per-stem volume, 0–1)
    → DynamicsCompressorNode (per-stem dynamics control)
      → StereoPannerNode (L/R positioning, -1 to +1)
        ├→ MasterGainNode (direct/dry signal)
        └→ GainNode (reverb send amount)
            → ConvolverNode (synthetic reverb impulse response)
              → MasterGainNode
                → AnalyserNode (64-bar FFT for visualization)
                  → AudioContext.destination (speakers)
```

Why this graph?

  • Per-stem compressor: Tames dynamics before mixing, prevents one loud stem from dominating. Settings: threshold -24 dB, ratio 4:1, 3ms attack, 250ms release.
  • Stereo panner with defaults: Instruments are pre-panned to a natural stereo image (drums/bass center, guitar slightly left, synth slightly right). Users can override.
  • Convolver reverb: A synthetic impulse response (2-second exponential decay with random noise) creates a natural room reverb. Each stem has its own send amount (default 15%), routed to a shared ConvolverNode.
  • AnalyserNode for visualization: Provides 64-bin frequency data at 60fps, rendered on a canvas as animated gradient bars.

6.3 AudioBuffer Caching

Problem: Switching between full-song and region playback previously re-fetched and re-decoded all stems β€” the same expensive operation as the initial load.

Solution: A persistent bufferCacheRef (React ref) maps cache keys like "drums_full" and "drums_region" to decoded AudioBuffer objects. These survive across loadStems() calls.

  • First load: cache miss β†’ fetch WAV from server β†’ decode β†’ store in cache
  • Subsequent loads: cache hit β†’ skip network + decode, just rebuild the audio graph nodes (instant)
  • Cache invalidation: clearBufferCache('region') is called before loading newly processed region stems; clearBufferCache('full') before loading newly processed full stems. The other tag's cache entries remain valid.

Audio graph nodes (GainNode, CompressorNode, etc.) cannot be reused β€” the Web Audio API requires fresh nodes to be created and wired up each time. But this is cheap (1ms) compared to fetch+decode (500ms–2s per stem).

6.4 Region Selection UI

The TransportBar component implements drag-to-select:

  1. Create region: mousedown on the progress bar starts tracking; mousemove updates regionStart/regionEnd; mouseup finalizes. If the resulting region is < 0.1s, it's treated as a click-to-seek instead.
  2. Resize region: If mousedown lands near (within 8px of) a handle, only that edge is dragged.
  3. Numeric inputs: Clicking the displayed start/end times opens an editable M:SS.T text field for precise entry.
  4. Visual: A yellow semi-transparent overlay spans the region. Two yellow handle bars mark the edges.

Coordinate systems: When in region playback mode, duration reflects the sliced clip length (e.g., 15 seconds), but the region handles must remain positioned relative to the full song (e.g., at 30s and 45s of a 3-minute song). A barDuration variable resolves this: it uses fullSongDuration when in region mode, duration otherwise. All percentage calculations for region positioning use barDuration.

The playback progress indicator is also mapped correctly: in region mode, currentTime (0 to regionLength) is mapped into the region's position on the full-song bar, so the playhead moves within the highlighted band.
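That mapping is simple arithmetic (sketched in Python for consistency with the rest of this report; the names mirror the variables described above):

```python
def playhead_percent(current_time: float, region_start: float,
                     full_song_duration: float, region_mode: bool) -> float:
    """Map the engine's currentTime onto the full-song progress bar (0-100%)."""
    if region_mode:
        # currentTime runs from 0 to the region length; offset it so the
        # playhead sits inside the highlighted band
        position = region_start + current_time
    else:
        position = current_time
    return 100.0 * position / full_song_duration
```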

6.5 Playback Modes

| Mode | What plays | Loop | Progress bar |
| --- | --- | --- | --- |
| full | Full song stems (original or processed) | No | 0% → 100% of song |
| region | Processed region slice | Yes | Playhead moves within yellow region band |

Switching modes:

  • "Apply to Selection" β†’ process region β†’ load region stems β†’ set region mode + loop on
  • "Play Full Song" β†’ stop β†’ load full stems (from cache) β†’ set full mode + loop off
  • "Clear Selection" β†’ clear region state β†’ if was in region mode, switch to full

6.6 Components

| Component | Responsibility |
| --- | --- |
| FileUpload | Drag-and-drop or click-to-upload for stems (.wav) and MIDI (.mid) files |
| AnalysisDisplay | Shows detected BPM (with confidence), key, and mode after upload |
| ControlPanel | Key selector (24 keys), BPM slider (50%–200%), quick-shift buttons, quality badges, "Apply Changes" / "Apply to Selection" button |
| StemMixer | Per-stem volume slider, pan knob, reverb amount, solo/mute toggles, reset button |
| TransportBar | Play/pause/stop, progress bar with region selection, numeric time inputs, "Play Full Song" button, loop indicator |
| Waveform | Canvas-based 64-bar FFT frequency visualizer with gradient coloring and glow effects |
| ProcessingOverlay | Full-screen overlay during processing showing per-stem progress from WebSocket |

7. Deployment

Docker

```dockerfile
FROM python:3.11-slim
# Install: rubberband-cli, libsndfile1, ffmpeg, Node.js 20
# Build frontend → dist/
# Run: uvicorn backend.main:app --host 0.0.0.0 --port 7860
```

The container includes the Rubber Band CLI binary for optimal pitch/tempo processing. The built React frontend is served as static files by FastAPI, so a single container handles both the API and the UI.


8. API Summary

| Endpoint | Method | Description |
| --- | --- | --- |
| /api/upload | POST | Upload stems + MIDI, creates session |
| /api/detect/{session_id} | POST | Run BPM & key detection |
| /api/process/{session_id} | POST | Pitch shift + time stretch (full or region) |
| /api/stems/{session_id} | GET | List available stems |
| /api/stem/{session_id}/{stem_name} | GET | Download a stem as WAV |
| /api/ws/{session_id} | WebSocket | Processing progress events |
| /api/health | GET | Health check |

9. Quality Indicators

The UI provides visual feedback on expected quality based on how far the user shifts from the original:

Pitch:

  • Green (Recommended): 0–4 semitones
  • Yellow (Some quality loss): 5–7 semitones
  • Red (Significant quality loss): 8+ semitones

Tempo:

  • Green (Recommended): 0–20% change
  • Yellow (Some quality loss): 21–40% change
  • Red (Significant quality loss): >40% change

These reflect the inherent limitations of pitch shifting and time stretching: larger shifts introduce more phase vocoder artifacts.
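The two threshold tables reduce to a small lookup; a sketch:

```python
def pitch_badge(semitones: int) -> str:
    """Green 0-4, yellow 5-7, red 8+ (absolute semitone distance)."""
    steps = abs(semitones)
    return "green" if steps <= 4 else "yellow" if steps <= 7 else "red"

def tempo_badge(percent_change: float) -> str:
    """Green 0-20%, yellow 21-40%, red above 40% (absolute change)."""
    change = abs(percent_change)
    return "green" if change <= 20 else "yellow" if change <= 40 else "red"
```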


10. Key Technology Choices β€” Rationale Summary

| Choice | Why |
| --- | --- |
| FastAPI | Async, WebSocket-native, Pydantic validation, auto-docs |
| In-memory sessions | MVP simplicity; no database overhead for transient data |
| Rubber Band | Industry-standard pitch/tempo library, stem-type-specific tuning via `--crisp` |
| Essentia | Purpose-built for MIR; multifeature BPM detection outperforms librosa alone |
| Ensemble key voting | 4 profiles reduce single-profile bias; bass weighting improves harmonic accuracy |
| Pedalboard mastering | Spotify's library wraps JUCE plugins; simple API, professional sound |
| Web Audio API | Zero-latency mixing; no server round-trip for volume/pan/reverb changes |
| React + Vite | Fast development, fast builds, component isolation |
| Tailwind | Rapid UI iteration without writing CSS files |
| ProcessPoolExecutor | True multi-core parallelism for CPU-bound audio processing (bypasses the Python GIL) |
| AudioBuffer cache | Eliminates redundant fetch+decode when switching between full/region playback |
AudioBuffer cache Eliminates redundant fetch+decode when switching between full/region playback