ML Model Selection & Research
Source Separation Models
Demucs (Chosen)
Developer: Meta AI Research; License: MIT; Model Size: ~350MB; Performance: State-of-the-art (MDX Challenge winner 2021)
Variants:
- htdemucs (4-stem): drums, bass, vocals, other
- htdemucs_6s (6-stem): drums, bass, vocals, guitar, piano, other
- htdemucs_ft (fine-tuned): Better quality, slightly slower
Pros:
- Best quality among open-source models
- Active development
- GPU-accelerated (PyTorch)
- Good documentation
Cons:
- Large model size
- Slower than Spleeter (~30-60s per song)
- Requires ~4GB VRAM
When to Use: MVP and production (quality > speed)
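To make variant selection concrete, here is a minimal sketch using the demucs Python API (filenames are placeholders; the real CLI additionally normalizes the mix and supports shifted/chunked inference):

```python
import torch
from demucs.pretrained import get_model
from demucs.apply import apply_model
from demucs.audio import AudioFile, save_audio

# Load the chosen variant; swap in "htdemucs_6s" or "htdemucs_ft" as needed.
model = get_model("htdemucs")
device = "cuda" if torch.cuda.is_available() else "cpu"

# Decode the mix at the model's expected sample rate and channel count.
wav = AudioFile("song.mp3").read(
    streams=0, samplerate=model.samplerate, channels=model.audio_channels
)

# apply_model expects (batch, channels, time); drop the batch dim afterwards.
sources = apply_model(model, wav[None], device=device)[0]

# model.sources lists the stem names, e.g. ["drums", "bass", "other", "vocals"].
for name, audio in zip(model.sources, sources):
    save_audio(audio, f"{name}.wav", samplerate=model.samplerate)
```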
Spleeter (Alternative)
Developer: Deezer; License: MIT; Model Size: ~200MB; Performance: Good, but surpassed by Demucs
Pros:
- Faster than Demucs (~10-20s per song)
- Smaller model
- TensorFlow-based
- 2-stem, 4-stem, and 5-stem models available
Cons:
- Lower-quality separation
- No longer actively maintained (last update 2020)
When to Use: If speed is critical and the lower separation quality is acceptable
X-UMX (Alternative)
Developer: Sony (CrossNet-Open-Unmix, an extension of Open-Unmix); License: MIT; Performance: Comparable to early Demucs versions
Pros:
- Open-source
- Good quality
Cons:
- Slower than both Demucs and Spleeter
- Less documentation
When to Use: Research purposes only
Comparison
| Model | Quality (SDR) | Speed (GPU, per song) | Model Size | Maintenance |
|---|---|---|---|---|
| Demucs v4 | 9.0 dB | 30-60s | 350MB | Active |
| Spleeter | 6.5 dB | 10-20s | 200MB | Abandoned |
| X-UMX | 7.0 dB | 60-90s | 180MB | Low |
SDR = Signal-to-Distortion Ratio (higher is better)
Decision: Use Demucs htdemucs for MVP, consider htdemucs_6s for multi-instrument in Phase 2.
Transcription Models
YourMT3+ (Primary)
Developer: KAIST (Korea Advanced Institute of Science and Technology); License: Apache 2.0; Model Size: ~536MB (YPTF.MoE+Multi checkpoint); Performance: State-of-the-art multi-instrument transcription
Architecture:
- Perceiver-TF encoder with Rotary Position Embeddings (RoPE)
- Mixture of Experts (MoE) feedforward layers (8 experts, top-2 routing; see the sketch after this list)
- Multi-channel T5 decoder for 13 instrument classes
- Float16 precision for GPU optimization
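To make the MoE bullet above concrete, here is a minimal top-2 routing layer in PyTorch. It mirrors the hyperparameters quoted (8 experts, top-2) but is a generic illustration of the technique, not the actual YourMT3+ code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoEFeedForward(nn.Module):
    """Illustrative mixture-of-experts feed-forward layer with top-2 routing."""

    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):
        # x: (batch, seq, d_model). Each token is sent to its top-k experts,
        # and the expert outputs are combined with softmax-renormalized gates.
        logits = self.router(x)                            # (B, S, E)
        gates, indices = logits.topk(self.top_k, dim=-1)   # (B, S, k)
        gates = F.softmax(gates, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[..., k] == e                # tokens routed to expert e
                if mask.any():
                    out[mask] += gates[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

layer = Top2MoEFeedForward()
y = layer(torch.randn(2, 16, 512))  # (batch, seq, d_model) -> same shape
```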
Pros:
- 80-85% note accuracy (vs 70% for basic-pitch)
- Multi-instrument aware (13 instrument classes)
- Handles complex polyphony
- Active development (2024)
- Open-source, well-documented
- Optimized for Apple Silicon MPS (14x speedup with float16; see the device sketch at the end of this entry)
- Good rhythm and onset detection
Cons:
- Large model size (~536MB download)
- Requires additional setup (model checkpoint download)
- Slower than basic-pitch (~30-40s per song on GPU)
- Higher memory requirements (~1.1GB VRAM)
When to Use: Production (primary transcriber) - Best quality for self-hosted solution
Current Status: Integrated into main backend, enabled by default with automatic fallback
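The MPS/float16 point above reduces, in outline, to standard PyTorch device and dtype selection. A minimal sketch, where `model` and `spectrogram` stand in for the loaded checkpoint and its input features:

```python
import torch

# Prefer CUDA, then Apple Silicon MPS, then CPU.
if torch.cuda.is_available():
    device, dtype = torch.device("cuda"), torch.float16
elif torch.backends.mps.is_available():
    # float16 on MPS is the source of the ~14x speedup quoted above.
    device, dtype = torch.device("mps"), torch.float16
else:
    device, dtype = torch.device("cpu"), torch.float32  # fp16 is slow on CPU

model = model.to(device=device, dtype=dtype).eval()
with torch.no_grad():
    output = model(spectrogram.to(device=device, dtype=dtype))
```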
basic-pitch (Fallback)
Developer: Spotify; License: Apache 2.0; Model Size: ~30MB; Performance: Good polyphonic transcription (~70% note accuracy)
Pros:
- Handles polyphonic music (multiple simultaneous notes)
- Trained on diverse dataset (30k+ songs)
- Outputs MIDI with velocities
- Fast (~5-10s per stem)
- Active maintenance
- Lightweight, no setup required
Cons:
- Lower accuracy than YourMT3+ (~70% vs 80-85%)
- Rhythm quantization can be off
- Struggles with very dense polyphony
When to Use: Automatic fallback when YourMT3+ unavailable or disabled
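For reference, the fallback path is only a few lines with basic-pitch's public API (the stem filename here is a placeholder):

```python
from basic_pitch import ICASSP_2022_MODEL_PATH
from basic_pitch.inference import predict

# predict() returns the raw model activations, a pretty_midi.PrettyMIDI
# object, and a list of (start, end, pitch, amplitude, pitch_bends) events.
model_output, midi_data, note_events = predict("piano_stem.wav", ICASSP_2022_MODEL_PATH)
midi_data.write("piano_stem.mid")
```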
MT3 (Multi-Task Multitrack Music Transcription) - Not Used
Developer: Google Magenta; License: Apache 2.0; Model Size: ~500MB; Performance: Good, but surpassed by YourMT3+
Why Not Chosen:
- YourMT3+ offers better accuracy
- Similar computational requirements
- YourMT3+ has better documentation and setup
Omnizart - Removed
Developer: MCTLab (Taiwan); License: MIT; Status: Removed from codebase (replaced by YourMT3+)
Why Removed:
- Lower accuracy than YourMT3+ (75-80% vs 80-85%)
- More complex setup with multiple models
- Less active development
- Dual-transcription merging added complexity without accuracy gains
Tony (pYIN) - Alternative
Developer: Sonic Visualiser team; Performance: Excellent for monophonic (single-note) melody
Pros:
- Very accurate for monophonic transcription
- Fast
- Lightweight
Cons:
- Monophonic only; cannot handle chords or polyphony
- Not suitable for piano or guitar
When to Use: Vocal melody extraction only
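Tony itself is a GUI application, but the underlying pYIN algorithm is available programmatically, e.g. via librosa. A minimal sketch for vocal melody extraction (the filename is a placeholder):

```python
import librosa
import numpy as np

# Load the isolated vocal stem as mono and run probabilistic YIN.
y, sr = librosa.load("vocals.wav", mono=True)
f0, voiced_flag, voiced_prob = librosa.pyin(
    y,
    fmin=librosa.note_to_hz("C2"),  # ~65 Hz, below typical vocal range
    fmax=librosa.note_to_hz("C6"),  # ~1047 Hz, above it
    sr=sr,
)

# f0 is NaN on unvoiced frames; convert voiced frames to MIDI pitch numbers.
midi_pitches = librosa.hz_to_midi(f0[voiced_flag])
print(f"{voiced_flag.sum()} voiced frames, median pitch {np.median(midi_pitches):.1f}")
```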
Comparison
| Model | Polyphonic | Speed (GPU) | Accuracy | Status |
|---|---|---|---|---|
| YourMT3+ | Yes | 30-40s | 80-85% | Primary (Production) |
| basic-pitch | Yes | 5-10s | 70% | Fallback |
| MT3 | Yes | 30-60s | 75-80% | Not used |
| Omnizart | Yes | 15-30s | 75-80% | Removed |
| Tony | No | 2-5s | 90%+ | Vocals only |
Decision: YourMT3+ as primary transcriber with automatic fallback to basic-pitch for reliability.
Model Accuracy Expectations
Realistic Transcription Accuracy (with YourMT3+)
Simple Piano Melody (Twinkle Twinkle):
- Note accuracy: 90-95% (YourMT3+) / 85-90% (basic-pitch)
- Rhythm accuracy: 85-90% (YourMT3+) / 75-80% (basic-pitch)
Classical Piano (Chopin Nocturne):
- Note accuracy: 75-85% (YourMT3+) / 65-75% (basic-pitch)
- Rhythm accuracy: 70-75% (YourMT3+) / 55-65% (basic-pitch)
Jazz Piano (Bill Evans):
- Note accuracy: 70-75% (YourMT3+) / 55-65% (basic-pitch)
- Rhythm accuracy: 60-70% (YourMT3+) / 45-55% (basic-pitch)
Rock/Pop with Band:
- Piano separation: 70-80% (depends on Demucs quality)
- Note accuracy: 70-75% (YourMT3+) / 55-65% (basic-pitch)
Key Insight: YourMT3+ scores 10-15 percentage points higher than basic-pitch, but transcription still won't be perfect. The editor is critical so users can fix the remaining errors.
Future Model Improvements
Fine-Tuning YourMT3+
Train on a piano-specific dataset:
- Collect 1000+ piano YouTube videos with ground truth
- Fine-tune YourMT3+ checkpoint on piano-only data
- Expected improvement: +3-5% accuracy for piano
- Cost: GPU compute for training
Ensemble Models (Not Currently Used)
Previously attempted basic-pitch + omnizart merging:
- Result: Removed due to complexity without significant accuracy gain
- Learning: YourMT3+ alone provides better results than merged basic-pitch + omnizart
- Future: Could revisit with YourMT3+ + MT3 ensemble if needed
Post-Processing
Improve rhythm with music theory rules:
- Quantize to nearest 16th note
- Enforce measure boundaries
- Detect time signature from patterns
- Expected improvement: +10-15% rhythm accuracy
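A minimal sketch of the first rule (16th-note quantization), assuming a known, constant tempo; `quantize_to_grid` is a hypothetical helper, and real post-processing would also need to handle tempo drift and triplets:

```python
def quantize_to_grid(onset_seconds: float, tempo_bpm: float, division: int = 4) -> float:
    """Snap a note onset to the nearest grid line.

    division=4 means four grid steps per beat, i.e. 16th notes in 4/4.
    """
    step = (60.0 / tempo_bpm) / division  # seconds per 16th note
    return round(onset_seconds / step) * step

# At 120 BPM the grid step is 0.125s, so an onset detected at 1.04s snaps to 1.0s.
print(quantize_to_grid(1.04, 120))  # -> 1.0
```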
Benchmarks to Track
When testing models, measure:
- Note Onset Accuracy: % of notes detected at correct time
- Pitch Accuracy: % of notes with correct pitch
- Duration Accuracy: % of notes with correct duration
- Harmonic Accuracy: % of chords correctly identified
Tools: mir_eval library (Python)
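For example, note-level precision/recall/F1 can be computed with mir_eval's transcription module; the interval and pitch arrays below are dummy stand-ins for real reference and estimated notes:

```python
import numpy as np
import mir_eval

# Notes as (onset, offset) intervals in seconds plus pitches in Hz.
ref_intervals = np.array([[0.00, 0.50], [0.50, 1.00]])
ref_pitches = np.array([440.00, 493.88])  # A4, B4
est_intervals = np.array([[0.02, 0.48], [0.55, 1.02]])
est_pitches = np.array([440.00, 493.88])

# Defaults: 50 ms onset tolerance, 50 cents pitch tolerance.
precision, recall, f1, avg_overlap = mir_eval.transcription.precision_recall_f1_overlap(
    ref_intervals, ref_pitches, est_intervals, est_pitches
)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```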
Next Steps
- Test Demucs + basic-pitch on sample videos
- Measure accuracy and processing time
- Identify failure modes
- Document in Challenges
See Backend Pipeline for implementation details.