Mina Emadi

Jam Track Studio β€” Technical Report

1. Overview

Jam Track Studio is a web application that lets musicians upload individual instrument stems, detect the song's BPM and musical key, then shift the pitch and tempo in real time β€” either for the whole song or a selected region. The app runs entirely in the browser for playback and mixing (via the Web Audio API), with a Python backend handling the computationally expensive audio analysis and processing.


2. Architecture

```
                     ┌──────────────────────────────────────────┐
                     │  Browser (React + Web Audio API)         │
                     │                                          │
                     │  FileUpload → AnalysisDisplay            │
                     │  ControlPanel → TransportBar             │
                     │  StemMixer → Waveform                    │
                     │                                          │
                     │  useSession (REST client)                │
                     │  useAudioEngine (Web Audio graph)        │
                     │  useProcessingProgress (WS listener)     │
                     └───────────────────┬──────────────────────┘
                                         │  REST / WebSocket
                     ┌───────────────────┴──────────────────────┐
                     │  Backend (FastAPI + Uvicorn)             │
                     │                                          │
                     │  /api/upload          POST               │
                     │  /api/detect/:id      POST               │
                     │  /api/process/:id     POST + WS          │
                     │  /api/stem/:id/:name  GET (streaming)    │
                     │                                          │
                     │  Services: bpm_detector, key_detector,   │
                     │            audio_processor, midi_analyzer│
                     │  In-memory session store                 │
                     └──────────────────────────────────────────┘
```

Why this split?

  • CPU-intensive work stays on the server. Pitch shifting and time stretching via Rubber Band are heavy operations that benefit from native C++ performance and multi-core parallelism (ProcessPoolExecutor with up to 6 workers). Running these in the browser would be impractically slow.
  • Playback and mixing stay in the browser. Volume, pan, reverb, solo, and mute changes are instant because they manipulate Web Audio API GainNode and StereoPannerNode parameters β€” no network round-trip, no re-encoding.
  • WebSocket for progress. Processing multiple stems takes seconds. Rather than polling, the backend pushes per-stem progress events over a WebSocket so the UI can show a live progress overlay.

3. Backend

3.1 Framework: FastAPI

FastAPI was chosen for:

  • Async I/O β€” WebSocket support and non-blocking file uploads come built-in
  • Pydantic validation β€” request/response schemas with automatic type checking
  • OpenAPI docs β€” auto-generated at /docs for debugging
  • Performance β€” Uvicorn ASGI server handles concurrent requests efficiently

The app runs on port 7860 (Hugging Face Spaces requirement) and serves the built React frontend as static files at the root path.

3.2 Session Model (In-Memory)

Each upload creates a Session object stored in a module-level Python dict:

```
Session
├── id: UUID
├── stems: {name → StemData(audio, sample_rate)}
├── processed_stems: {name → StemData}          # full-song processed
├── region_processed_stems: {name → StemData}   # region-only processed
├── detected_bpm, detected_key, detected_mode
├── detection_confidence
├── midi_data
├── wav_cache: {cache_key → encoded bytes}
└── created_at (auto-cleaned after 1 hour)
```

Why in-memory? This is an MVP. There's no user authentication, no persistence requirement, and sessions are short-lived. A background task runs every 10 minutes to delete sessions older than 1 hour.

Why separate processed_stems and region_processed_stems? When the user processes the full song, then selects a region and processes just that portion, we need both versions available: the full-song version for "Play Full Song" and the region slice for looped region playback. Storing them separately avoids one overwriting the other.

3.3 Upload Pipeline

POST /api/upload accepts multipart form data with named stem files (guitar, drums, bass, synth, click_record) and optional MIDI files.

Processing steps:

  1. Validate file types (.wav, .mid/.midi only) and size (max 120 MB per stem)
  2. Read audio via soundfile.read() β†’ numpy float32 arrays
  3. Convert stereo to mono (halves memory, simplifies processing)
  4. Validate that all stems share the same sample rate and that their durations match within 1 second
  5. Generate a mix by summing all stems and normalizing to peak 0.95
  6. Parse MIDI files (if provided) via mido.MidiFile
  7. Pre-encode all stems as WAV bytes and cache them for fast first playback
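Steps 3 and 5 reduce to a few numpy operations. A sketch, assuming all stems have already been validated to the same length (step 4 handles mismatches):

```python
import numpy as np

def to_mono(audio: np.ndarray) -> np.ndarray:
    """Average stereo channels to mono (halves memory, simplifies processing)."""
    return audio.mean(axis=1) if audio.ndim == 2 else audio

def make_mix(stems: dict[str, np.ndarray], peak: float = 0.95) -> np.ndarray:
    """Sum all stems and normalize the result to the target peak."""
    mix = np.sum([to_mono(s) for s in stems.values()], axis=0)
    max_abs = np.max(np.abs(mix))
    if max_abs > 0:
        mix = mix * (peak / max_abs)
    return mix.astype(np.float32)
```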

3.4 Stem Serving

```
GET /api/stem/{session_id}/{stem_name}?processed=true&region=false
```

The endpoint resolves which version of a stem to serve using a priority chain:

  1. If region=true and processed=true β†’ check region_processed_stems
  2. If processed=true β†’ check processed_stems
  3. Fallback β†’ stems (originals)

Encoded WAV bytes are cached in session.wav_cache keyed by "{stem}_{region|processed|original}" so repeated downloads skip the encoding step.
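The priority chain and the cache can be sketched together (helper names here are illustrative; the real endpoint streams the encoded bytes back to the client):

```python
def resolve_stem(session, name: str, processed: bool, region: bool):
    """Return (variant_tag, stem) following the priority chain."""
    if region and processed and name in session.region_processed_stems:
        return "region", session.region_processed_stems[name]
    if processed and name in session.processed_stems:
        return "processed", session.processed_stems[name]
    return "original", session.stems[name]            # fallback: originals

def get_wav(session, name: str, processed: bool, region: bool, encode) -> bytes:
    """Serve encoded WAV bytes, cached per stem+variant so repeats skip encoding."""
    tag, stem = resolve_stem(session, name, processed, region)
    key = f"{name}_{tag}"
    if key not in session.wav_cache:
        session.wav_cache[key] = encode(stem)
    return session.wav_cache[key]
```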


4. Audio Analysis

4.1 BPM Detection

Primary: Essentia RhythmExtractor2013 (multifeature method)

Essentia is a C++ library with Python bindings purpose-built for music information retrieval. The multifeature method combines multiple rhythm analysis approaches internally for robust BPM estimation.

Steps:

  1. Resample audio to 44100 Hz (Essentia's expected rate)
  2. Run RhythmExtractor2013(method="multifeature") β†’ returns BPM, beat ticks, and confidence
  3. If confidence < 0.5 and a drums stem is available, re-run on the drums stem alone (drums carry the strongest rhythmic signal)
  4. Apply octave error correction: constrain BPM to 50–200 range by doubling or halving. This handles the common case where the algorithm returns 60 BPM for a 120 BPM song (half-time detection)
  5. Clamp confidence to [0, 1]
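Step 4 can be sketched as folding the estimate by octaves until it lands in the plausible range:

```python
def correct_octave(bpm: float, low: float = 50.0, high: float = 200.0) -> float:
    """Fold half-time / double-time detections into the [low, high] BPM range."""
    while bpm < low:
        bpm *= 2    # half-time detection: double it
    while bpm > high:
        bpm /= 2    # double-time detection: halve it
    return bpm
```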

Fallback: librosa beat tracking

If Essentia is unavailable:

```python
# librosa fallback: estimate tempo, then derive a rough confidence heuristic
tempo, beat_frames = librosa.beat.beat_track(y=audio, sr=sr)
onset_env = librosa.onset.onset_strength(y=audio, sr=sr)
# onset-strength variance squashed into [0, 1): stronger, more varied onsets
# suggest a clearer beat and therefore a more trustworthy estimate
confidence = np.std(onset_env) / (np.std(onset_env) + 1)
```

MIDI BPM (highest priority when available): Extracted directly from set_tempo messages: BPM = 60,000,000 / tempo_microseconds. This is exact β€” confidence is 1.0.
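Applied to mido's `set_tempo` value (microseconds per quarter note), the conversion is one line:

```python
def midi_bpm(tempo_microseconds: int) -> float:
    """A set_tempo message stores microseconds per quarter note."""
    return 60_000_000 / tempo_microseconds

# e.g. the MIDI default of 500,000 µs per beat is exactly 120 BPM
```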

4.2 Key Detection

Primary: Essentia Ensemble Voting with 4 Key Profiles

Key detection is inherently ambiguous (relative major/minor, modal ambiguity), so we use an ensemble approach:

Profiles: temperley, krumhansl, edma, bgate

Each profile represents a different statistical model of how pitch classes distribute in tonal music:

  • Temperley β€” optimized for pop/rock
  • Krumhansl β€” classic music cognition research profile
  • EDMA β€” Electronic Dance Music Analysis profile
  • Bgate β€” alternative weighting

For each profile, Essentia's KeyExtractor returns a (key, mode, strength) tuple. All votes are accumulated into a weighted tally. The key with the highest total strength wins.

Bass weighting: If a bass stem is available and its key detection confidence exceeds 0.3, its votes are added at 0.5x weight. Bass notes strongly indicate the harmonic root.
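A sketch of the weighted tally (the input shapes are assumptions; Essentia's KeyExtractor supplies the per-profile tuples, and the real code gates bass votes on the bass stem's overall confidence):

```python
from collections import defaultdict

def vote_key(profile_results, bass_results=None,
             bass_weight=0.5, bass_min_conf=0.3):
    """profile_results: list of (key, mode, strength) tuples, one per profile.
    bass_results: same shape, computed on the bass stem alone (optional)."""
    tally = defaultdict(float)
    for key, mode, strength in profile_results:
        tally[(key, mode)] += strength
    if bass_results:
        for key, mode, strength in bass_results:
            if strength > bass_min_conf:              # only confident bass votes count
                tally[(key, mode)] += bass_weight * strength
    return max(tally, key=tally.get)                  # highest total strength wins
```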

Fallback: librosa chroma-based correlation

Computes a Constant-Q chromagram, averages across time to get a 12-element pitch class profile, then correlates against rotated Temperley major/minor profiles for all 12 keys. The key with highest Pearson correlation wins.
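The core of the fallback is rotating a reference profile through all 12 keys and scoring each rotation. A sketch with an illustrative major profile (the real code uses Temperley's major and minor profiles):

```python
import numpy as np

NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def best_key(chroma: np.ndarray, profile: np.ndarray) -> str:
    """chroma: 12-element time-averaged pitch-class profile.
    Correlate against the profile rotated to each of the 12 tonics."""
    scores = [np.corrcoef(chroma, np.roll(profile, k))[0, 1] for k in range(12)]
    return NOTES[int(np.argmax(scores))]
```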

MIDI Key Detection: Builds a pitch class histogram weighted by note_duration * velocity, then runs the same ensemble voting against Temperley profiles.


5. Audio Processing

5.1 Pitch Shifting & Time Stretching

Algorithm: Rubber Band Library

Rubber Band is a high-quality C++ library for audio time-stretching and pitch-shifting. It uses a phase vocoder approach with sophisticated transient detection and phase-locking to minimize artifacts.

Three code paths depending on what's needed:

| Change | Method |
| --- | --- |
| Pitch only | `pyrubberband.pitch_shift(audio, sr, n_steps)` |
| Tempo only | `pyrubberband.time_stretch(audio, sr, rate)` |
| Both | Rubber Band CLI single pass (avoids two-pass quality loss) |

Stem-specific optimization via --crisp flag:

The --crisp parameter controls how aggressively Rubber Band preserves transients:

| Stem type | `--crisp` | Rationale |
| --- | --- | --- |
| Drums/percussion | 6 (maximum) | Drum attacks must be razor-sharp; smeared transients sound unnatural |
| Bass | 3 + `--fine` | Low frequencies need precise handling; `--fine` uses a higher-resolution filter |
| Default (guitar, synth, keys) | 4 | Balanced: preserves attacks without over-sharpening sustained sounds |

Fallback: If the rubberband CLI binary isn't installed, the code falls back to two-pass pyrubberband (pitch shift first, then time stretch). This produces slightly lower quality but always works.
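The CLI path plus per-stem flags can be sketched as a small argument builder (the flag spellings follow the Rubber Band CLI, where `--pitch` is in semitones and `--tempo` is a stretch ratio, but the exact invocation in the real code may differ; file names here are placeholders):

```python
CRISP = {"drums": "6", "percussion": "6", "bass": "3"}

def rubberband_args(stem_name: str, semitones: float, tempo_ratio: float) -> list[str]:
    """Build a rubberband CLI argument list for a single-pass pitch+tempo change."""
    args = ["rubberband", "--pitch", str(semitones), "--tempo", str(tempo_ratio)]
    args += ["--crisp", CRISP.get(stem_name, "4")]   # default 4 for guitar/synth/keys
    if stem_name == "bass":
        args.append("--fine")                        # higher-resolution filter for lows
    return args + ["in.wav", "out.wav"]
```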

5.2 Parallel Processing

All stems are processed simultaneously using Python's ProcessPoolExecutor with up to 6 workers. Each worker runs in a separate process (true parallelism, no GIL limitation) and handles one stem. Progress is reported per-stem via WebSocket as each worker completes.
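The fan-out can be sketched with ProcessPoolExecutor (the worker here is a stand-in; the real one runs Rubber Band, and `on_progress` would push events over the WebSocket):

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

def process_stem(name, audio, gain=0.5):
    """Stand-in worker: the real one pitch-shifts/time-stretches one stem."""
    return name, [x * gain for x in audio]

def process_all(stems, on_progress, max_workers=6):
    """Process all stems in parallel; report each stem as its worker finishes."""
    results = {}
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_stem, n, a): n for n, a in stems.items()}
        for done, future in enumerate(as_completed(futures), start=1):
            name, out = future.result()
            results[name] = out
            on_progress(name, done / len(futures))   # relayed over the WebSocket
    return results
```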

5.3 Region Processing

When region_start and region_end are provided in the process request:

  1. Each stem's numpy array is sliced: audio[int(start * sr) : int(end * sr)]
  2. Only the slice goes through Rubber Band
  3. Results are stored in session.region_processed_stems (not processed_stems)
  4. The WAV cache is cleared to avoid serving stale data

This means region processing time scales with the region length: a 5-second region processes much faster than a 3-minute song.
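Step 1's slice is plain numpy indexing:

```python
import numpy as np

def slice_region(audio: np.ndarray, sr: int, start_s: float, end_s: float) -> np.ndarray:
    """Extract [start_s, end_s) seconds of a mono stem; only this slice
    goes through Rubber Band."""
    return audio[int(start_s * sr):int(end_s * sr)]
```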

5.4 Mastering Chain

After stems are processed and summed, a mastering chain is applied using Spotify's Pedalboard library (Python bindings to JUCE audio plugins):

Compressor β†’ Limiter β†’ Output
Parameter Value Purpose
Compressor threshold -10 dB Gentle compression; tames peaks without squashing dynamics
Compressor ratio 3:1 Moderate β€” enough to control, not enough to pump
Compressor attack 10 ms Fast enough to catch transients
Compressor release 150 ms Smooth recovery, avoids pumping artifacts
Limiter threshold -1 dB Hard ceiling prevents clipping
Limiter release 100 ms Transparent limiting

6. Frontend

6.1 Framework & Build

  • React 18 β€” component-based UI with hooks for state management
  • Vite β€” near-instant hot module replacement during development, optimized production builds
  • Tailwind CSS β€” utility-first styling with custom blue/purple theme and glassmorphism effects

No state management library (Redux, Zustand) β€” the app's state is simple enough to manage with React's built-in useState and useCallback hooks distributed across three custom hooks.

6.2 Web Audio API Signal Chain

The entire mixer runs in the browser. For each stem:

```
AudioBufferSourceNode (decoded PCM data)
  → GainNode (per-stem volume, 0–1)
    → DynamicsCompressorNode (per-stem dynamics control)
      → StereoPannerNode (L/R positioning, -1 to +1)
        ├→ MasterGainNode (direct/dry signal)
        └→ GainNode (reverb send amount)
            → ConvolverNode (synthetic reverb impulse response)
              → MasterGainNode
                → AnalyserNode (64-bar FFT for visualization)
                  → AudioContext.destination (speakers)
```

Why this graph?

  • Per-stem compressor: Tames dynamics before mixing, prevents one loud stem from dominating. Settings: threshold -24 dB, ratio 4:1, 3ms attack, 250ms release.
  • Stereo panner with defaults: Instruments are pre-panned to a natural stereo image (drums/bass center, guitar slightly left, synth slightly right). Users can override.
  • Convolver reverb: A synthetic impulse response (2-second exponential decay with random noise) creates a natural room reverb. Each stem has its own send amount (default 15%), routed to a shared ConvolverNode.
  • AnalyserNode for visualization: Provides 64-bin frequency data at 60fps, rendered on a canvas as animated gradient bars.

6.3 AudioBuffer Caching

Problem: Switching between full-song and region playback previously re-fetched and re-decoded all stems β€” the same expensive operation as the initial load.

Solution: A persistent bufferCacheRef (React ref) maps cache keys like "drums_full" and "drums_region" to decoded AudioBuffer objects. These survive across loadStems() calls.

  • First load: cache miss β†’ fetch WAV from server β†’ decode β†’ store in cache
  • Subsequent loads: cache hit β†’ skip network + decode, just rebuild the audio graph nodes (instant)
  • Cache invalidation: clearBufferCache('region') is called before loading newly processed region stems; clearBufferCache('full') before loading newly processed full stems. The other tag's cache entries remain valid.

Audio graph nodes (GainNode, CompressorNode, etc.) cannot be reused β€” the Web Audio API requires fresh nodes to be created and wired up each time. But this is cheap (1ms) compared to fetch+decode (500ms–2s per stem).

6.4 Region Selection UI

The TransportBar component implements drag-to-select:

  1. Create region: mousedown on the progress bar starts tracking; mousemove updates regionStart/regionEnd; mouseup finalizes. If the resulting region is < 0.1s, it's treated as a click-to-seek instead.
  2. Resize region: If mousedown lands near (within 8px of) a handle, only that edge is dragged.
  3. Numeric inputs: Clicking the displayed start/end times opens an editable M:SS.T text field for precise entry.
  4. Visual: A yellow semi-transparent overlay spans the region. Two yellow handle bars mark the edges.

Coordinate systems: When in region playback mode, duration reflects the sliced clip length (e.g., 15 seconds), but the region handles must remain positioned relative to the full song (e.g., at 30s and 45s of a 3-minute song). A barDuration variable resolves this: it uses fullSongDuration when in region mode, duration otherwise. All percentage calculations for region positioning use barDuration.

The playback progress indicator is also mapped correctly: in region mode, currentTime (0 to regionLength) is mapped into the region's position on the full-song bar, so the playhead moves within the highlighted band.
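That mapping is simple arithmetic (sketched in Python for consistency with the rest of this report; the names mirror the variables described above):

```python
def playhead_percent(current_time: float, region_start: float,
                     full_song_duration: float, region_mode: bool) -> float:
    """Map the engine's currentTime onto the full-song progress bar (0-100%)."""
    if region_mode:
        # currentTime runs from 0 to the region length; offset it so the
        # playhead sits inside the highlighted band
        position = region_start + current_time
    else:
        position = current_time
    return 100.0 * position / full_song_duration
```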

6.5 Playback Modes

| Mode | What plays | Loop | Progress bar |
| --- | --- | --- | --- |
| full | Full song stems (original or processed) | No | 0% → 100% of song |
| region | Processed region slice | Yes | Playhead moves within yellow region band |

Switching modes:

  • "Apply to Selection" β†’ process region β†’ load region stems β†’ set region mode + loop on
  • "Play Full Song" β†’ stop β†’ load full stems (from cache) β†’ set full mode + loop off
  • "Clear Selection" β†’ clear region state β†’ if was in region mode, switch to full

6.6 Components

| Component | Responsibility |
| --- | --- |
| FileUpload | Drag-and-drop or click-to-upload for stems (.wav) and MIDI (.mid) files |
| AnalysisDisplay | Shows detected BPM (with confidence), key, and mode after upload |
| ControlPanel | Key selector (24 keys), BPM slider (50%–200%), quick-shift buttons, quality badges, "Apply Changes" / "Apply to Selection" button |
| StemMixer | Per-stem volume slider, pan knob, reverb amount, solo/mute toggles, reset button |
| TransportBar | Play/pause/stop, progress bar with region selection, numeric time inputs, "Play Full Song" button, loop indicator |
| Waveform | Canvas-based 64-bar FFT frequency visualizer with gradient coloring and glow effects |
| ProcessingOverlay | Full-screen overlay during processing showing per-stem progress from WebSocket |

7. Deployment

Docker

```dockerfile
FROM python:3.11-slim
# Install: rubberband-cli, libsndfile1, ffmpeg, Node.js 20
# Build frontend → dist/
# Run: uvicorn backend.main:app --host 0.0.0.0 --port 7860
```

The container includes the Rubber Band CLI binary for optimal pitch/tempo processing. The built React frontend is served as static files by FastAPI, so a single container handles both the API and the UI.


8. API Summary

| Endpoint | Method | Description |
| --- | --- | --- |
| /api/upload | POST | Upload stems + MIDI, creates session |
| /api/detect/{session_id} | POST | Run BPM & key detection |
| /api/process/{session_id} | POST | Pitch shift + time stretch (full or region) |
| /api/stems/{session_id} | GET | List available stems |
| /api/stem/{session_id}/{stem_name} | GET | Download a stem as WAV |
| /api/ws/{session_id} | WebSocket | Processing progress events |
| /api/health | GET | Health check |

9. Quality Indicators

The UI provides visual feedback on expected quality based on how far the user shifts from the original:

Pitch:

  • Green (Recommended): 0–4 semitones
  • Yellow (Some quality loss): 5–7 semitones
  • Red (Significant quality loss): 8+ semitones

Tempo:

  • Green (Recommended): 0–20% change
  • Yellow (Some quality loss): 21–40% change
  • Red (Significant quality loss): >40% change

These reflect the inherent limitations of pitch shifting and time stretching: larger shifts introduce more phase vocoder artifacts.
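The two threshold tables reduce to a small lookup; a sketch:

```python
def pitch_badge(semitones: int) -> str:
    """Green 0-4, yellow 5-7, red 8+ (absolute semitone distance)."""
    steps = abs(semitones)
    return "green" if steps <= 4 else "yellow" if steps <= 7 else "red"

def tempo_badge(percent_change: float) -> str:
    """Green 0-20%, yellow 21-40%, red above 40% (absolute change)."""
    change = abs(percent_change)
    return "green" if change <= 20 else "yellow" if change <= 40 else "red"
```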


10. Key Technology Choices β€” Rationale Summary

| Choice | Why |
| --- | --- |
| FastAPI | Async, WebSocket-native, Pydantic validation, auto-docs |
| In-memory sessions | MVP simplicity; no database overhead for transient data |
| Rubber Band | Industry-standard pitch/tempo library, stem-type-specific tuning via `--crisp` |
| Essentia | Purpose-built for MIR; multifeature BPM detection outperforms librosa alone |
| Ensemble key voting | 4 profiles reduce single-profile bias; bass weighting improves harmonic accuracy |
| Pedalboard mastering | Spotify's library wraps JUCE plugins; simple API, professional sound |
| Web Audio API | Zero-latency mixing; no server round-trip for volume/pan/reverb changes |
| React + Vite | Fast development, fast builds, component isolation |
| Tailwind | Rapid UI iteration without writing CSS files |
| ProcessPoolExecutor | True multi-core parallelism for CPU-bound audio processing (bypasses the Python GIL) |
| AudioBuffer cache | Eliminates redundant fetch+decode when switching between full/region playback |
AudioBuffer cache Eliminates redundant fetch+decode when switching between full/region playback