Jam Track Studio – Technical Report
1. Overview
Jam Track Studio is a web application that lets musicians upload individual instrument stems, detect the song's BPM and musical key, then shift the pitch and tempo in real time – either for the whole song or a selected region. The app runs entirely in the browser for playback and mixing (via the Web Audio API), with a Python backend handling the computationally expensive audio analysis and processing.
2. Architecture
```
Browser (React + Web Audio API)
┌────────────────────────────────────────┐
│ FileUpload          AnalysisDisplay    │
│ ControlPanel        TransportBar       │
│ StemMixer           Waveform           │
│                                        │
│ useSession ◄─────── REST / WebSocket ──┼──┐
│ useAudioEngine (Web Audio graph)       │  │
│ useProcessingProgress (WS listener)    │  │
└────────────────────────────────────────┘  │
                                            │
Backend (FastAPI + Uvicorn)                 │
┌────────────────────────────────────────┐  │
│ /api/upload            POST ◄──────────┼──┘
│ /api/detect/:id        POST            │
│ /api/process/:id       POST + WS       │
│ /api/stem/:id/:name    GET (streaming) │
│                                        │
│ Services: bpm_detector, key_detector,  │
│           audio_processor,             │
│           midi_analyzer                │
│ In-memory session store                │
└────────────────────────────────────────┘
```
Why this split?
- CPU-intensive work stays on the server. Pitch shifting and time stretching via Rubber Band are heavy operations that benefit from native C++ performance and multi-core parallelism (`ProcessPoolExecutor` with up to 6 workers). Running these in the browser would be impractically slow.
- Playback and mixing stay in the browser. Volume, pan, reverb, solo, and mute changes are instant because they manipulate Web Audio API `GainNode` and `StereoPannerNode` parameters – no network round-trip, no re-encoding.
- WebSocket for progress. Processing multiple stems takes seconds. Rather than polling, the backend pushes per-stem progress events over a WebSocket so the UI can show a live progress overlay.
3. Backend
3.1 Framework: FastAPI
FastAPI was chosen for:
- Async I/O – WebSocket support and non-blocking file uploads come built-in
- Pydantic validation – request/response schemas with automatic type checking
- OpenAPI docs – auto-generated at `/docs` for debugging
- Performance – Uvicorn ASGI server handles concurrent requests efficiently
The app runs on port 7860 (Hugging Face Spaces requirement) and serves the built React frontend as static files at the root path.
3.2 Session Model (In-Memory)
Each upload creates a Session object stored in a module-level Python dict:
```
Session
├── id: UUID
├── stems: {name → StemData(audio, sample_rate)}
├── processed_stems: {name → StemData}          # full-song processed
├── region_processed_stems: {name → StemData}   # region-only processed
├── detected_bpm, detected_key, detected_mode
├── detection_confidence
├── midi_data
├── wav_cache: {cache_key → encoded bytes}
└── created_at (auto-cleaned after 1 hour)
```
Why in-memory? This is an MVP. There's no user authentication, no persistence requirement, and sessions are short-lived. A background task runs every 10 minutes to delete sessions older than 1 hour.
Why separate processed_stems and region_processed_stems? When the user processes the full song, then selects a region and processes just that portion, we need both versions available: the full-song version for "Play Full Song" and the region slice for looped region playback. Storing them separately avoids one overwriting the other.
3.3 Upload Pipeline
POST /api/upload accepts multipart form data with named stem files (guitar, drums, bass, synth, click_record) and optional MIDI files.
Processing steps:
- Validate file types (.wav, .mid/.midi only) and size (max 120 MB per stem)
- Read audio via `soundfile.read()` → numpy float32 arrays
- Convert stereo to mono (halves memory, simplifies processing)
- Validate that all stems share the same sample rate and that durations match within 1 second
- Generate a mix by summing all stems and normalizing to peak 0.95
- Parse MIDI files (if provided) via `mido.MidiFile`
- Pre-encode all stems as WAV bytes and cache them for fast first playback
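The mix-generation step (sum all stems, normalize to peak 0.95) can be sketched as follows; `make_mix` is a hypothetical helper name:

```python
import numpy as np

def make_mix(stems: dict[str, np.ndarray], peak: float = 0.95) -> np.ndarray:
    """Sum mono stems sample-by-sample and normalize the result to `peak`."""
    length = max(len(a) for a in stems.values())
    mix = np.zeros(length, dtype=np.float32)
    for audio in stems.values():
        mix[: len(audio)] += audio  # stems may differ in length by up to 1 s
    max_abs = np.max(np.abs(mix))
    if max_abs > 0:
        mix *= peak / max_abs       # peak-normalize; leaves silence untouched
    return mix
```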
3.4 Stem Serving
GET /api/stem/{session_id}/{stem_name}?processed=true&region=false
The endpoint resolves which version of a stem to serve using a priority chain:
- If `region=true` and `processed=true` → check `region_processed_stems`
- If `processed=true` → check `processed_stems`
- Fallback → `stems` (originals)
Encoded WAV bytes are cached in session.wav_cache keyed by "{stem}_{region|processed|original}" so repeated downloads skip the encoding step.
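A sketch of the priority chain, assuming a session object with the three stem dicts described above (`resolve_stem` is an illustrative name; the second return value is the tag that would go into the cache key):

```python
def resolve_stem(session, name: str, processed: bool, region: bool):
    """Pick which stored version of a stem to serve, mirroring the priority chain."""
    if region and processed and name in session.region_processed_stems:
        return session.region_processed_stems[name], "region"
    if processed and name in session.processed_stems:
        return session.processed_stems[name], "processed"
    return session.stems[name], "original"  # fallback: original upload

def cache_key(name: str, tag: str) -> str:
    """Key used for the per-session WAV byte cache, e.g. 'drums_processed'."""
    return f"{name}_{tag}"
```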
4. Audio Analysis
4.1 BPM Detection
Primary: Essentia RhythmExtractor2013 (multifeature method)
Essentia is a C++ library with Python bindings purpose-built for music information retrieval. The multifeature method combines multiple rhythm analysis approaches internally for robust BPM estimation.
Steps:
- Resample audio to 44100 Hz (Essentia's expected rate)
- Run `RhythmExtractor2013(method="multifeature")` → returns BPM, beat ticks, and confidence
- If confidence < 0.5 and a drums stem is available, re-run on the drums stem alone (drums carry the strongest rhythmic signal)
- Apply octave error correction: constrain BPM to the 50–200 range by doubling or halving. This handles the common case where the algorithm returns half- or double-time estimates (e.g., 60 BPM for a 120 BPM song)
- Clamp confidence to [0, 1]
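The octave-error correction step can be sketched as a simple fold into the 50–200 BPM range (`correct_octave_error` is an illustrative name):

```python
def correct_octave_error(bpm: float, low: float = 50.0, high: float = 200.0) -> float:
    """Fold a BPM estimate into [low, high] by doubling or halving.

    Fixes half-time / double-time detections that land outside the plausible range.
    """
    while bpm < low:
        bpm *= 2.0
    while bpm > high:
        bpm /= 2.0
    return bpm
```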
Fallback: librosa beat tracking
If Essentia is unavailable:
```python
tempo, beat_frames = librosa.beat.beat_track(y=audio, sr=sr)
onset_env = librosa.onset.onset_strength(y=audio, sr=sr)
confidence = np.std(onset_env) / (np.std(onset_env) + 1)
```
MIDI BPM (highest priority when available):
Extracted directly from set_tempo messages: BPM = 60,000,000 / tempo_microseconds. This is exact – confidence is 1.0.
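The conversion, as a one-liner (`midi_tempo_to_bpm` is an illustrative name):

```python
def midi_tempo_to_bpm(tempo_us_per_beat: int) -> float:
    """Convert a MIDI set_tempo value (microseconds per quarter note) to BPM."""
    return 60_000_000 / tempo_us_per_beat
```

For example, the default MIDI tempo of 500,000 µs/beat corresponds to 120 BPM.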
4.2 Key Detection
Primary: Essentia Ensemble Voting with 4 Key Profiles
Key detection is inherently ambiguous (relative major/minor, modal ambiguity), so we use an ensemble approach:
Profiles: temperley, krumhansl, edma, bgate
Each profile represents a different statistical model of how pitch classes distribute in tonal music:
- Temperley – optimized for pop/rock
- Krumhansl – classic music cognition research profile
- EDMA – Electronic Dance Music Analysis profile
- Bgate – alternative weighting
For each profile, Essentia's KeyExtractor returns a (key, mode, strength) tuple. All votes are accumulated into a weighted tally. The key with the highest total strength wins.
Bass weighting: If a bass stem is available and its key detection confidence exceeds 0.3, its votes are added at 0.5x weight. Bass notes strongly indicate the harmonic root.
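A sketch of the weighted tally described above, with illustrative names and data shapes (the app's actual accumulation may differ):

```python
from collections import defaultdict

def vote_key(profile_results, bass_results=None,
             bass_confidence=0.0, bass_weight=0.5, bass_threshold=0.3):
    """Weighted tally over (key, mode, strength) votes.

    `profile_results` / `bass_results` are lists of (key, mode, strength)
    tuples, one per key profile. Bass votes are added at reduced weight
    when the bass detection is confident enough.
    """
    tally = defaultdict(float)
    for key, mode, strength in profile_results:
        tally[(key, mode)] += strength
    if bass_results and bass_confidence > bass_threshold:
        for key, mode, strength in bass_results:
            tally[(key, mode)] += bass_weight * strength
    return max(tally, key=tally.get)  # (key, mode) with the highest total strength
```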
Fallback: librosa chroma-based correlation
Computes a Constant-Q chromagram, averages across time to get a 12-element pitch class profile, then correlates against rotated Temperley major/minor profiles for all 12 keys. The key with highest Pearson correlation wins.
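A numpy-only sketch of the rotate-and-correlate idea for major keys; the profile weights here are illustrative stand-ins, not Temperley's published values, and the real fallback also tests minor profiles:

```python
import numpy as np

# Illustrative major-key pitch-class weights (tonic-heavy, like Temperley-style profiles)
MAJOR_PROFILE = np.array([5.0, 2.0, 3.5, 2.0, 4.5, 4.0, 2.0, 4.5, 2.0, 3.5, 1.5, 4.0])
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def best_key(chroma: np.ndarray) -> str:
    """Correlate a 12-bin pitch-class profile against all 12 rotations of the template."""
    scores = [np.corrcoef(chroma, np.roll(MAJOR_PROFILE, k))[0, 1] for k in range(12)]
    return PITCH_CLASSES[int(np.argmax(scores))]
```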
MIDI Key Detection:
Builds a pitch class histogram weighted by note_duration * velocity, then runs the same ensemble voting against Temperley profiles.
5. Audio Processing
5.1 Pitch Shifting & Time Stretching
Algorithm: Rubber Band Library
Rubber Band is a high-quality C++ library for audio time-stretching and pitch-shifting. It uses a phase vocoder approach with sophisticated transient detection and phase-locking to minimize artifacts.
Three code paths depending on what's needed:
| Change | Method |
|---|---|
| Pitch only | pyrubberband.pitch_shift(audio, sr, n_steps) |
| Tempo only | pyrubberband.time_stretch(audio, sr, rate) |
| Both | Rubber Band CLI single-pass (avoids two-pass quality loss) |
Stem-specific optimization via --crisp flag:
The --crisp parameter controls how aggressively Rubber Band preserves transients:
| Stem Type | `--crisp` | Rationale |
|---|---|---|
| Drums/percussion | 6 (maximum) | Drum attacks must be razor-sharp; smeared transients sound unnatural |
| Bass | 3 + `--fine` | Low frequencies need precise handling; `--fine` uses a higher-resolution filter |
| Default (guitar, synth, keys) | 4 | Balanced: preserves attacks without over-sharpening sustained sounds |
Fallback: If the rubberband CLI binary isn't installed, the code falls back to two-pass pyrubberband (pitch shift first, then time stretch). This produces slightly lower quality but always works.
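The stem-specific dispatch can be sketched as a CLI argument builder (`rubberband_cli_args` is an illustrative name, the filenames are placeholders, and the long-option flags follow the Rubber Band CLI):

```python
def rubberband_cli_args(semitones: float, tempo_ratio: float, stem_type: str) -> list[str]:
    """Build a single-pass rubberband CLI invocation with stem-specific crispness.

    `tempo_ratio` is the duration stretch factor passed to --time;
    `semitones` is the pitch shift passed to --pitch.
    """
    crisp = {"drums": 6, "bass": 3}.get(stem_type, 4)  # per the table above
    args = ["rubberband",
            "--pitch", str(semitones),
            "--time", str(tempo_ratio),
            "--crisp", str(crisp)]
    if stem_type == "bass":
        args.append("--fine")  # higher-resolution processing for low frequencies
    return args + ["in.wav", "out.wav"]  # placeholder filenames
```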
5.2 Parallel Processing
All stems are processed simultaneously using Python's ProcessPoolExecutor with up to 6 workers. Each worker runs in a separate process (true parallelism, no GIL limitation) and handles one stem. Progress is reported per-stem via WebSocket as each worker completes.
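A sketch of the fan-out, with a trivial stand-in for the Rubber Band work (`process_stem` and `process_all` are illustrative names):

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

def process_stem(name: str, samples: list[float]) -> tuple[str, list[float]]:
    """Stand-in for the per-stem Rubber Band work (here: a trivial gain change)."""
    return name, [s * 0.5 for s in samples]

def process_all(stems: dict[str, list[float]], max_workers: int = 6) -> dict:
    """Run one worker process per stem; collect results as workers finish."""
    results = {}
    with ProcessPoolExecutor(max_workers=min(max_workers, len(stems))) as pool:
        futures = {pool.submit(process_stem, n, a): n for n, a in stems.items()}
        for fut in as_completed(futures):  # progress could be reported here per stem
            name, processed = fut.result()
            results[name] = processed
    return results
```

Because each stem runs in its own process, the workers sidestep the GIL and use separate cores.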
5.3 Region Processing
When region_start and region_end are provided in the process request:
- Each stem's numpy array is sliced: `audio[int(start * sr) : int(end * sr)]`
- Only the slice goes through Rubber Band
- Results are stored in `session.region_processed_stems` (not `processed_stems`)
- The WAV cache is cleared to avoid serving stale data
Region processing time therefore scales with the region length – a 5-second region processes much faster than a 3-minute song.
5.4 Mastering Chain
After stems are processed and summed, a mastering chain is applied using Spotify's Pedalboard library (Python bindings to JUCE audio plugins):
Compressor → Limiter → Output
| Parameter | Value | Purpose |
|---|---|---|
| Compressor threshold | -10 dB | Gentle compression; tames peaks without squashing dynamics |
| Compressor ratio | 3:1 | Moderate β enough to control, not enough to pump |
| Compressor attack | 10 ms | Fast enough to catch transients |
| Compressor release | 150 ms | Smooth recovery, avoids pumping artifacts |
| Limiter threshold | -1 dB | Hard ceiling prevents clipping |
| Limiter release | 100 ms | Transparent limiting |
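The chain can be sketched with Pedalboard's `Compressor` and `Limiter` (parameter names follow Pedalboard's API; the import is deferred so the sketch degrades gracefully if the library is absent):

```python
# Mastering parameters from the table above
MASTERING = {
    "compressor": {"threshold_db": -10.0, "ratio": 3.0,
                   "attack_ms": 10.0, "release_ms": 150.0},
    "limiter": {"threshold_db": -1.0, "release_ms": 100.0},
}

def build_mastering_chain():
    """Compressor -> Limiter, using Spotify's Pedalboard (imported lazily)."""
    from pedalboard import Pedalboard, Compressor, Limiter
    return Pedalboard([Compressor(**MASTERING["compressor"]),
                       Limiter(**MASTERING["limiter"])])

# Usage (assuming `mix` is a float32 numpy array at `sample_rate`):
#   mastered = build_mastering_chain()(mix, sample_rate)
```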
6. Frontend
6.1 Framework & Build
- React 18 – component-based UI with hooks for state management
- Vite – near-instant hot module replacement during development, optimized production builds
- Tailwind CSS – utility-first styling with custom blue/purple theme and glassmorphism effects
No state management library (Redux, Zustand) β the app's state is simple enough to manage with React's built-in useState and useCallback hooks distributed across three custom hooks.
6.2 Web Audio API Signal Chain
The entire mixer runs in the browser. For each stem:
```
AudioBufferSourceNode (decoded PCM data)
  → GainNode (per-stem volume, 0–1)
  → DynamicsCompressorNode (per-stem dynamics control)
  → StereoPannerNode (L/R positioning, -1 to +1)
      ├─→ MasterGainNode (direct/dry signal)
      └─→ GainNode (reverb send amount)
            → ConvolverNode (synthetic reverb impulse response)
            → MasterGainNode

MasterGainNode
  → AnalyserNode (64-bar FFT for visualization)
  → AudioContext.destination (speakers)
```
Why this graph?
- Per-stem compressor: Tames dynamics before mixing, prevents one loud stem from dominating. Settings: threshold -24 dB, ratio 4:1, 3ms attack, 250ms release.
- Stereo panner with defaults: Instruments are pre-panned to a natural stereo image (drums/bass center, guitar slightly left, synth slightly right). Users can override.
- Convolver reverb: A synthetic impulse response (2-second exponential decay with random noise) creates a natural room reverb. Each stem has its own send amount (default 15%), routed to a shared ConvolverNode.
- AnalyserNode for visualization: Provides 64-bin frequency data at 60fps, rendered on a canvas as animated gradient bars.
6.3 AudioBuffer Caching
Problem: Switching between full-song and region playback previously re-fetched and re-decoded all stems – the same expensive operation as the initial load.
Solution: A persistent bufferCacheRef (React ref) maps cache keys like "drums_full" and "drums_region" to decoded AudioBuffer objects. These survive across loadStems() calls.
- First load: cache miss → fetch WAV from server → decode → store in cache
- Subsequent loads: cache hit → skip network + decode; just rebuild the audio graph nodes (instant)
- Cache invalidation: `clearBufferCache('region')` is called before loading newly processed region stems; `clearBufferCache('full')` before loading newly processed full stems. The other tag's cache entries remain valid.
Audio graph nodes (`GainNode`, `DynamicsCompressorNode`, etc.) are not reused – an `AudioBufferSourceNode` can only be started once, so the graph is rebuilt with fresh nodes on each playback. But this is cheap (~1 ms) compared to fetch + decode (500 ms–2 s per stem).
6.4 Region Selection UI
The TransportBar component implements drag-to-select:
- Create region: mousedown on the progress bar starts tracking; mousemove updates `regionStart`/`regionEnd`; mouseup finalizes. If the resulting region is < 0.1 s, it's treated as a click-to-seek instead.
- Resize region: If mousedown lands near a handle (within 8 px), only that edge is dragged.
- Numeric inputs: Clicking the displayed start/end times opens an editable `M:SS.T` text field for precise entry.
- Visual: A yellow semi-transparent overlay spans the region. Two yellow handle bars mark the edges.
Coordinate systems: When in region playback mode, duration reflects the sliced clip length (e.g., 15 seconds), but the region handles must remain positioned relative to the full song (e.g., at 30s and 45s of a 3-minute song). A barDuration variable resolves this: it uses fullSongDuration when in region mode, duration otherwise. All percentage calculations for region positioning use barDuration.
The playback progress indicator is also mapped correctly: in region mode, currentTime (0 to regionLength) is mapped into the region's position on the full-song bar, so the playhead moves within the highlighted band.
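The mapping is simple arithmetic; here it is as a language-agnostic sketch in Python (`playhead_percent` is an illustrative name; the frontend's `barDuration` corresponds to `full_song_duration`, which is what `barDuration` resolves to in region mode):

```python
def playhead_percent(current_time: float, mode: str,
                     region_start: float, full_song_duration: float) -> float:
    """Map playback time to a position (0-100%) on the full-song progress bar.

    In region mode, current_time runs from 0 to the region length, so it is
    offset by region_start to land inside the highlighted band.
    """
    if mode == "region":
        absolute = region_start + current_time  # offset into the full song
    else:
        absolute = current_time
    return 100.0 * absolute / full_song_duration
```

For example, 5 s into a region starting at 30 s of a 180 s song puts the playhead at 35/180 of the bar.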
6.5 Playback Modes
| Mode | What plays | Loop | Progress bar |
|---|---|---|---|
| `full` | Full song stems (original or processed) | No | 0% → 100% of song |
| `region` | Processed region slice | Yes | Playhead moves within yellow region band |
Switching modes:
- "Apply to Selection" β process region β load region stems β set
regionmode + loop on - "Play Full Song" β stop β load full stems (from cache) β set
fullmode + loop off - "Clear Selection" β clear region state β if was in region mode, switch to full
6.6 Components
| Component | Responsibility |
|---|---|
| `FileUpload` | Drag-and-drop or click-to-upload for stems (.wav) and MIDI (.mid) files |
| `AnalysisDisplay` | Shows detected BPM (with confidence), key, and mode after upload |
| `ControlPanel` | Key selector (24 keys), BPM slider (50%–200%), quick-shift buttons, quality badges, "Apply Changes" / "Apply to Selection" button |
| `StemMixer` | Per-stem volume slider, pan knob, reverb amount, solo/mute toggles, reset button |
| `TransportBar` | Play/pause/stop, progress bar with region selection, numeric time inputs, "Play Full Song" button, loop indicator |
| `Waveform` | Canvas-based 64-bar FFT frequency visualizer with gradient coloring and glow effects |
| `ProcessingOverlay` | Full-screen overlay during processing showing per-stem progress from WebSocket |
7. Deployment
Docker
```dockerfile
FROM python:3.11-slim
# Install: rubberband-cli, libsndfile1, ffmpeg, Node.js 20
# Build frontend → dist/
# Run: uvicorn backend.main:app --host 0.0.0.0 --port 7860
```
The container includes the Rubber Band CLI binary for optimal pitch/tempo processing. The built React frontend is served as static files by FastAPI, so a single container handles both the API and the UI.
8. API Summary
| Endpoint | Method | Description |
|---|---|---|
| `/api/upload` | POST | Upload stems + MIDI, creates session |
| `/api/detect/{session_id}` | POST | Run BPM & key detection |
| `/api/process/{session_id}` | POST | Pitch shift + time stretch (full or region) |
| `/api/stems/{session_id}` | GET | List available stems |
| `/api/stem/{session_id}/{stem_name}` | GET | Download a stem as WAV |
| `/api/ws/{session_id}` | WebSocket | Processing progress events |
| `/api/health` | GET | Health check |
9. Quality Indicators
The UI provides visual feedback on expected quality based on how far the user shifts from the original:
Pitch:
- Green (Recommended): 0–4 semitones
- Yellow (Some quality loss): 5–7 semitones
- Red (Significant quality loss): 8+ semitones
Tempo:
- Green (Recommended): 0–20% change
- Yellow (Some quality loss): 21–40% change
- Red (Significant quality loss): over 40% change
These thresholds reflect the inherent limitations of pitch-shifting and time-stretching – larger shifts introduce more phase-vocoder artifacts.
10. Key Technology Choices β Rationale Summary
| Choice | Why |
|---|---|
| FastAPI | Async, WebSocket-native, Pydantic validation, auto-docs |
| In-memory sessions | MVP simplicity; no database overhead for transient data |
| Rubber Band | Industry-standard pitch/tempo library, stem-type-specific tuning via --crisp |
| Essentia | Purpose-built for MIR; multifeature BPM detection outperforms librosa alone |
| Ensemble key voting | 4 profiles reduce single-profile bias; bass weighting improves harmonic accuracy |
| Pedalboard mastering | Spotify's library wraps JUCE plugins; simple API, professional sound |
| Web Audio API | Zero-latency mixing; no server round-trip for volume/pan/reverb changes |
| React + Vite | Fast development, fast builds, component isolation |
| Tailwind | Rapid UI iteration without writing CSS files |
| ProcessPoolExecutor | True multi-core parallelism for CPU-bound audio processing (bypasses Python GIL) |
| AudioBuffer cache | Eliminates redundant fetch+decode when switching between full/region playback |