Known Technical Challenges & Limitations
Transcription Accuracy
Challenge: Imperfect Note Detection
Problem: ML models make mistakes:
- Missed notes (false negatives)
- Ghost notes (false positives)
- Wrong pitches (especially in dense chords)
- Timing errors (notes start/end at wrong time)
Impact: Users will need to edit ~20-40% of notes for complex music
Ghost Notes - Sustained Note Decay Artifacts: When sustained piano notes fade in the original audio, the ML transcription model can incorrectly detect the fading tail as new note onsets. This creates "ghost notes" appearing in later measures that are actually just the fading remnants of earlier sustained notes.
Solution Implemented (as of recent update):
Velocity Envelope Analysis: Detects and merges false onsets from sustained note decay patterns
- Identifies decreasing velocity sequences (e.g., 80 → 50 → 35) as likely sustain artifacts
- Preserves intentional repeated notes (staccato) with similar velocities
- Configurable via velocity_decay_threshold, sustain_artifact_gap_ms, and min_velocity_similarity
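The decay-merging pass can be sketched roughly as follows. This is an illustrative implementation, not the actual pipeline code: the function name and note representation are hypothetical, and the default values stand in for velocity_decay_threshold, sustain_artifact_gap_ms, and min_velocity_similarity.

```python
def merge_sustain_artifacts(notes, decay_ratio=0.75, max_gap_s=0.25, similarity=0.85):
    """notes: list of (onset_s, midi_pitch, velocity), sorted by onset.
    Drops onsets that look like the fading tail of the previous same-pitch
    note (close in time, markedly quieter). Repeated notes at similar
    velocity (staccato) are preserved."""
    kept = []
    last = {}  # pitch -> (onset, velocity) of the previous onset, kept or not
    for onset, pitch, vel in notes:
        artifact = False
        if pitch in last:
            prev_onset, prev_vel = last[pitch]
            close = onset - prev_onset <= max_gap_s
            decaying = vel <= prev_vel * decay_ratio      # e.g. 80 -> 50
            similar = vel >= prev_vel * similarity        # staccato repeat
            artifact = close and decaying and not similar
        if not artifact:
            kept.append((onset, pitch, vel))
        last[pitch] = (onset, vel)  # track every onset so decay chains merge
    return kept
```

Tracking every onset (not just kept ones) lets a whole decay chain like 80 → 50 → 35 collapse into the first note, while a repeat at 78 after an 80 survives as intentional.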
Tempo-Adaptive Thresholds: Adjusts filtering strictness based on detected tempo
- Fast music (>140 BPM): Stricter onset/velocity thresholds to reduce false positives
- Slow music (<80 BPM): More permissive to catch soft dynamics
- Medium tempos: Standard thresholds
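The tempo-adaptive strategy amounts to a small lookup; a sketch with illustrative scaling factors (the actual thresholds and BPM cutoffs beyond 80/140 are assumptions):

```python
def thresholds_for_tempo(bpm, fast_bpm=140, slow_bpm=80,
                         base_onset=0.5, base_velocity=20):
    """Scale onset-confidence and minimum-velocity thresholds with tempo:
    stricter for fast music (fewer false positives), more permissive
    for slow music (keep soft dynamics)."""
    if bpm > fast_bpm:
        return {"onset_threshold": base_onset * 1.2, "min_velocity": base_velocity * 1.25}
    if bpm < slow_bpm:
        return {"onset_threshold": base_onset * 0.8, "min_velocity": base_velocity * 0.75}
    return {"onset_threshold": base_onset, "min_velocity": base_velocity}
```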
Proper MusicXML Ties: Adds tie notation for sustained notes across measure boundaries
- Uses music21's tie.Tie class ('start' and 'stop' markers)
- Improves notation readability and editing experience
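For reference, a tied note in the generated MusicXML looks roughly like the fragment below (exact durations depend on the file's divisions setting). The tie element affects playback; the tied element inside notations produces the visible arc, and the matching note in the next measure carries the "stop" versions:

```xml
<note>
  <pitch><step>C</step><octave>4</octave></pitch>
  <duration>4</duration>
  <tie type="start"/>
  <type>quarter</type>
  <notations><tied type="start"/></notations>
</note>
```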
Expected Results: 70-90% reduction in ghost notes from sustained note decay, while preserving intentional repeated notes (staccato passages).
General Mitigation:
- Good editor: Make editing fast and intuitive
- Visual feedback: Highlight low-confidence notes
- Set expectations: Tell users transcription is a starting point, not final output
- Pre-processing: Clean audio (denoise, normalize)
- Post-processing: Quantize rhythms, apply music theory rules
Challenge: Polyphonic Complexity
Problem: Piano can play 10+ notes simultaneously. ML models struggle with dense chords.
Examples:
- Rachmaninoff: many notes, complex voicing → accuracy drops to ~60%
- Stride piano: bass + chord + melody → bass line often lost
- Fast passages: lots of notes → rhythm becomes mushy
Mitigation:
- Start with simpler music for MVP testing
- Consider reducing polyphony (keep top 3-4 notes per chord)
- Warn users that complex music will require more editing
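One way to reduce polyphony is a salience heuristic: always keep the outer voices (bass and melody), then fill remaining slots by velocity. A sketch under those assumptions, not the transcriber's actual voicing logic:

```python
def thin_chord(chord_notes, max_voices=4):
    """chord_notes: list of (midi_pitch, velocity) sounding together.
    Returns at most max_voices notes, sorted by pitch."""
    if len(chord_notes) <= max_voices:
        return sorted(chord_notes)
    by_pitch = sorted(chord_notes)
    keep = {by_pitch[0], by_pitch[-1]}            # bass and top voice
    rest = [n for n in by_pitch if n not in keep]
    rest.sort(key=lambda n: n[1], reverse=True)   # loudest inner voices next
    keep.update(rest[:max_voices - len(keep)])
    return sorted(keep)
```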
Challenge: Rhythm Quantization
Problem: ML models output exact timings (e.g., note starts at 1.237s), but sheet music uses discrete rhythms (quarter, eighth, etc.).
Quantization Errors:
- Swing feel gets forced into straight or triplet notation
- Rubato (tempo variation) → odd note durations
- Grace notes detected as full notes
Mitigation:
- Quantize to 16th note grid (standard)
- Detect tempo changes before quantizing
- Let users adjust quantization (8th, 16th, 32nd note grid)
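Grid quantization itself is a one-liner once the tempo is known; a minimal sketch assuming a constant tempo (real rubato needs a tempo map first):

```python
def quantize(onset_s, bpm, grid=16):
    """Snap an onset time (seconds) to the nearest grid subdivision.
    grid=16 means a 16th-note grid; pass 8 or 32 for coarser/finer grids."""
    beat = 60.0 / bpm            # seconds per quarter note
    step = beat * 4.0 / grid     # seconds per grid cell
    return round(onset_s / step) * step
```

At 120 BPM a 16th-note cell is 0.125 s, so the 1.237 s onset from the example above snaps to 1.25 s.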
Source Separation Quality
Challenge: Instrument Bleed
Problem: Demucs doesn't perfectly isolate instruments. Piano often leaks into "vocals" or "other" stem.
Impact:
- Transcribing "other" stem may include drums/bass artifacts
- Extra notes detected from bleed
Mitigation:
- Only transcribe "other" stem for MVP (assume it's mostly piano)
- Add confidence threshold filtering (ignore low-amplitude notes)
- In Phase 2, use 6-stem model (htdemucs_6s) with dedicated piano stem
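Because the Demucs CLI selects its model with the -n flag, the Phase 2 switch to the 6-stem model should be a one-string change. A sketch of the invocation (file paths and output directory are illustrative):

```python
import subprocess

def build_demucs_cmd(audio_path, model="htdemucs", out_dir="separated"):
    """Build the Demucs CLI invocation; -n selects the model checkpoint."""
    return ["demucs", "-n", model, "-o", out_dir, audio_path]

# MVP: 4-stem model, then transcribe the "other" stem
cmd = build_demucs_cmd("song.mp3")
# Phase 2: 6-stem model with a dedicated piano stem
cmd_6s = build_demucs_cmd("song.mp3", model="htdemucs_6s")
# subprocess.run(cmd_6s, check=True)  # uncomment when demucs is installed
```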
Challenge: Mix-Dependent Quality
Problem: Separation quality depends on recording:
- Studio recordings (dry, well-separated) → excellent separation
- Live recordings (reverb, crowd noise) → poor separation
- YouTube rips (compressed, low quality) → degraded separation
Mitigation:
- Test with diverse YouTube videos to set realistic expectations
- Warn users if audio quality is low (detect bitrate < 128kbps)
- Offer "best effort" disclaimer
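The bitrate check could be done with ffprobe on the downloaded file; a sketch (the warning threshold mirrors the 128 kbps figure above, and ffprobe must be on PATH):

```python
import json
import subprocess

LOW_BITRATE_BPS = 128_000

def probe_bitrate(path):
    """Read the container bitrate in bits/s using ffprobe's JSON output."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=bit_rate",
         "-of", "json", path],
        capture_output=True, text=True, check=True).stdout
    return int(json.loads(out)["format"]["bit_rate"])

def is_low_quality(bitrate_bps, threshold=LOW_BITRATE_BPS):
    """True if the audio is below the warning threshold."""
    return bitrate_bps < threshold
```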
Processing Time vs. Quality Trade-Off
Challenge: Slow Processing
Current Pipeline:
- Download: 5-15 seconds
- Demucs: 30-60 seconds (GPU) or 8-15 minutes (CPU)
- basic-pitch: 5-10 seconds
- MusicXML: 2-5 seconds
- Total: 1-2 minutes (GPU) or 10-15 minutes (CPU)
User Expectation: "Instant" results (< 10 seconds)
Mitigation:
- Set expectations: Show estimated time (1-2 min) upfront
- Progress updates: WebSocket keeps user engaged
- Optimize: Use GPU, pre-warm workers, batch processing
- Future: Faster models (trade quality for speed)
Challenge: GPU Availability
Problem: GPUs are expensive and scarce.
Costs:
- Modal A10G GPU: ~$0.60/hour
- 1000 jobs/month × 1 min/job ≈ 17 GPU-hours/month ≈ $10/month
- 10k jobs/month = $100/month
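The cost model is linear in job volume, which makes it easy to plug into pricing experiments. A sketch using the A10G rate assumed above:

```python
def monthly_gpu_cost(jobs_per_month, minutes_per_job=1.0, usd_per_hour=0.60):
    """GPU cost scales linearly with job volume (A10G pricing assumed)."""
    hours = jobs_per_month * minutes_per_job / 60.0
    return hours * usd_per_hour

# 1000 jobs at ~1 min each -> ~16.7 GPU-hours -> ~$10/month
```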
Mitigation:
- MVP: Run on local GPU (free)
- Production: Use serverless GPU (pay-per-use)
- Optimization: Keep GPU warm during peak hours, cold-start otherwise
- Pricing: Charge users for processing (e.g., $0.10/song) or subscription
Copyright & Legal Issues
Challenge: YouTube Content Rights
Problem: Users may transcribe copyrighted music without permission.
Legal Risk: DMCA takedown notices, copyright infringement lawsuits
Mitigation:
- Terms of Service: Users responsible for copyright compliance
- DMCA Safe Harbor: safe-harbor provisions (US law) can shield the platform from liability for user misuse, provided takedown notices are honored
- No storage: Don't store transcriptions long-term (users download immediately)
- Rate limiting: Prevent mass scraping
- Block known copyright: Detect and block known copyrighted URLs (future)
Note: This is similar to yt-dlp's position: the tool itself is content-neutral, and responsibility for how it is used rests with users (though tool authors have still faced takedown attempts).
File Format Edge Cases
Challenge: MusicXML Complexity
Problem: MusicXML supports many features that VexFlow doesn't render well:
- Multiple voices per staff
- Cross-staff beaming (piano)
- Complex time signatures (5/8, 7/4)
- Ornaments, trills
Impact: Some MusicXML files won't render correctly
Mitigation:
- MVP: Only generate simple MusicXML (single voice, common time sigs)
- Validation: Warn if MusicXML contains unsupported features
- Fallback: Offer MIDI export if MusicXML fails
Browser Performance
Challenge: Large Scores Slow Down Frontend
Problem: VexFlow renders SVG. Large scores (100+ measures) make DOM huge and slow.
Impact:
- Lag when editing
- Slow scrolling
- High memory usage
Mitigation:
- Pagination: Render one page at a time
- Virtualization: Only render visible measures (like react-window)
- Canvas backend: Use Canvas instead of SVG for better performance
- Simplify: Reduce polyphony (fewer notes per chord)
Metadata Detection
Challenge: Key Signature Detection
Problem: music21's key detection isn't always accurate, especially for:
- Atonal music
- Modal music (Dorian, Phrygian)
- Key changes mid-song
Impact: Wrong key signature displayed, notes have wrong accidentals
Mitigation:
- Use most common key as default (C major, A minor)
- Let users override key in editor
- In Phase 2, train ML model to detect key from audio
Challenge: Time Signature Detection
Problem: basic-pitch outputs MIDI without time signature info. music21 guesses from note patterns, but often wrong.
Impact: Measures have wrong number of beats, bar lines in wrong places
Mitigation:
- Default to 4/4 (most common)
- Let users change time signature
- Detect from beat tracking (librosa) in future
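As a placeholder until proper beat tracking (e.g. librosa.beat.beat_track) is wired in, tempo can be roughly estimated from the transcribed onsets themselves; a minimal sketch, noting that the median inter-onset interval tracks note rate rather than the true beat, and meter still defaults to 4/4:

```python
from statistics import median

def estimate_tempo(onsets_s):
    """Rough BPM estimate: treat the median inter-onset interval as the
    beat period. Returns None if there are fewer than two onsets."""
    iois = [b - a for a, b in zip(onsets_s, onsets_s[1:]) if b > a]
    if not iois:
        return None
    return 60.0 / median(iois)
```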
User Experience Challenges
Challenge: Setting Realistic Expectations
Problem: Users expect perfect transcription ("just like a human musician")
Reality: 70-80% accuracy at best, requires editing
Mitigation:
- Onboarding: Show sample before/after (raw vs. edited)
- Marketing: Position as "transcription assistant," not "perfect transcription"
- Tutorial: Teach users how to edit efficiently
Challenge: Complex Editor Learning Curve
Problem: Music notation editing is complex. Users need to learn:
- How to add/delete notes
- How to change durations
- Music theory basics (what is a quarter note?)
Mitigation:
- Tooltips: Show keyboard shortcuts and help text
- Tutorial: Interactive walkthrough on first use
- Presets: Common editing tasks as buttons (fix rhythm, transpose, etc.)
Infrastructure Challenges
Challenge: Cold Starts
Problem: Serverless GPU workers take 10-20 seconds to cold-start (load model into memory)
Impact: First job takes longer, bad UX
Mitigation:
- Pre-warm: Keep 1-2 workers hot during peak hours
- Progress messages: "Starting worker..." so user knows why it's slow
- Model caching: Use volumes to cache models (Modal, RunPod)
Challenge: Scaling Costs
Problem: GPU costs scale linearly with usage. 10k jobs/month = $100/month in GPU costs.
Break-Even Analysis (at ~$0.01/job GPU cost, per the figures above):
- Free tier: lose money on every job
- $5/month subscription: profitable up to ~500 jobs per user per month
- Pay-per-job ($0.10/song): ~10× margin over GPU cost from the first job
Mitigation:
- Freemium: Free tier with limits (5 songs/month), paid for more
- Optimize: Reduce processing time to cut costs
- Sponsors: Ads or sponsors for free users
Next Steps
- Test MVP with diverse YouTube videos to identify which challenges are most critical
- Prioritize fixes based on user feedback
- Document workarounds in user guide
See MVP Scope for what to build first despite these challenges.