# System Architecture Overview

## High-Level Architecture
```mermaid
graph TB
    subgraph Browser["USER BROWSER"]
        subgraph Frontend["React Frontend Application"]
            VexFlow["VexFlow<br/>(Notation Rendering)"]
            Tone["Tone.js<br/>(Playback)"]
            State["State Mgmt<br/>(Zustand)"]
        end
    end
    subgraph Backend["BACKEND API SERVER"]
        subgraph FastAPI["FastAPI Application"]
            REST["REST API<br/>Endpoints"]
            WS["WebSocket<br/>Handler"]
            FileStorage["File Storage<br/>Manager"]
        end
    end
    subgraph Workers["BACKGROUND WORKERS"]
        subgraph Celery["Celery Workers (GPU-enabled)"]
            ytdlp["yt-dlp<br/>(Audio Download)"]
            Demucs["Demucs<br/>(Source Separation)"]
            BasicPitch["basic-pitch<br/>(MIDI Transcribe)"]
            MusicXMLGen["MusicXML<br/>Generator"]
        end
    end
    subgraph Storage["STORAGE LAYER"]
        TempAudio["Temp Audio<br/>Files (.wav)"]
        JobMeta["Job Metadata<br/>(Redis)"]
        OutputFiles["Output Files<br/>(.musicxml)"]
    end
    Frontend <-->|"HTTP/WebSocket"| FastAPI
    FastAPI <-->|"Job Queue"| Redis[(Redis)]
    Redis <-->|"Pull Tasks"| Celery
    ytdlp --> Demucs
    Demucs --> BasicPitch
    BasicPitch --> MusicXMLGen
    Celery -.->|"Store"| Storage
```
## Component Breakdown

### 1. Frontend Application (React)
Responsibility: User interface for submitting URLs, viewing/editing notation, and playback
Key Libraries:
- VexFlow: Renders music notation as interactive SVG
- Tone.js: Browser-based MIDI playback and synthesis
- React: UI framework (with Zustand for state management)
Data Flow:
- User submits YouTube URL
- Receives job ID, polls/streams progress via WebSocket
- Fetches MusicXML result when complete
- Renders notation, enables editing
- Exports to PDF/MIDI/MusicXML
### 2. Backend API Server (FastAPI)
Responsibility: HTTP API, WebSocket connections, job orchestration
Endpoints:
- POST /api/transcribe - Submit URL, queue processing job
- GET /api/jobs/{id} - Get job status
- WS /api/jobs/{id}/stream - Real-time progress updates
- GET /api/scores/{id} - Download MusicXML
- GET /api/scores/{id}/midi - Download MIDI
Key Functions:
- YouTube URL validation
- Job creation and status tracking
- WebSocket broadcast to connected clients
- File serving (MusicXML, MIDI, PDF)
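The URL-validation step can be sketched with the standard library alone. The host allowlist and function name below are illustrative, not the project's actual code:

```python
from urllib.parse import urlparse

# Hosts we accept; everything else is rejected, which limits the SSRF surface.
ALLOWED_HOSTS = {"www.youtube.com", "youtube.com", "m.youtube.com", "youtu.be"}

def is_valid_youtube_url(url: str) -> bool:
    """Return True only for http(s) URLs pointing at a known YouTube host."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and parsed.hostname in ALLOWED_HOSTS
```

Checking `parsed.hostname` against an allowlist (rather than substring-matching the raw string) is what blocks tricks like `https://evil.example/?x=youtube.com`.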
### 3. Background Workers (Celery)

Responsibility: Asynchronous heavy audio/ML processing
Processing Pipeline:
1. Download (yt-dlp): Extract audio from YouTube
2. Separate (Demucs): Split into instrument stems (drums, bass, vocals, other)
3. Transcribe (basic-pitch): Convert audio → MIDI notes per stem
4. Generate (music21/custom): MIDI → MusicXML with metadata
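The four stages form a linear chain: each stage's output feeds the next, and each stage owns a slice of the progress bar. A minimal orchestration sketch (stage names, the `stage_fns` mapping, and the `report` callback are illustrative; the ceilings match the progress ranges used in the sequence diagram below):

```python
from typing import Any, Callable, Dict

# Stage name -> overall progress ceiling reached when that stage completes.
STAGES = [
    ("download", 20),     # yt-dlp
    ("separate", 50),     # Demucs
    ("transcribe", 90),   # basic-pitch
    ("generate", 100),    # MusicXML generation
]

def run_pipeline(stage_fns: Dict[str, Callable[[Any], Any]],
                 report: Callable[[str, int], None]) -> Any:
    """Run each stage in order; the output of one stage is the input of the next."""
    artifact = None
    for name, ceiling in STAGES:
        artifact = stage_fns[name](artifact)
        report(name, ceiling)  # e.g. publish to Redis for WebSocket broadcast
    return artifact
```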
Worker Configuration:
- GPU access required for Demucs and basic-pitch
- Task timeout: 10 minutes per job
- Retry logic for transient failures
- Progress updates sent to Redis, broadcasted via WebSocket
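The retry logic for transient failures could look roughly like this exponential-backoff helper (a stdlib sketch; Celery tasks would normally use the framework's own `retry` mechanism, and the `TransientError` type is illustrative):

```python
import time

class TransientError(Exception):
    """A failure worth retrying: network blip, temporary YouTube block, etc."""

def with_retries(task, attempts: int = 3, base_delay: float = 1.0):
    """Run task(); on TransientError, back off exponentially and retry."""
    for attempt in range(attempts):
        try:
            return task()
        except TransientError:
            if attempt == attempts - 1:
                raise  # budget exhausted; let the job be marked failed
            time.sleep(base_delay * 2 ** attempt)
```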
### 4. Storage Layer
Temporary Storage (Local filesystem or S3):
- Downloaded audio files (.m4a, .wav)
- Separated stems (.wav per instrument)
- Intermediate MIDI files
Persistent Storage (Redis):
- Job queue and status
- Progress percentage and current stage
- Error messages and logs
Output Storage (Filesystem/S3):
- Generated MusicXML files
- MIDI exports
- PDF renders (future)
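The per-job record kept in Redis might be shaped like the dataclass below (field names and stage values are illustrative; a flat dict like this maps directly onto a Redis hash):

```python
from dataclasses import asdict, dataclass
from typing import Optional

@dataclass
class JobStatus:
    """Per-job metadata as it would be stored in Redis."""
    job_id: str
    stage: str = "queued"      # queued | download | separate | transcribe | generate | done | failed
    progress: int = 0          # 0-100
    error: Optional[str] = None

job = JobStatus(job_id="abc123")
record = asdict(job)           # flat dict, ready for e.g. a Redis HSET
```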
## Data Flow: URL → Notation

### Step-by-Step User Flow
```mermaid
sequenceDiagram
    participant User
    participant Frontend
    participant API
    participant Redis
    participant Worker
    participant Storage
    User->>Frontend: 1. Paste YouTube URL
    Frontend->>API: 2. POST /api/transcribe
    API->>API: 3. Validate URL, create job ID
    API->>Redis: Queue job
    API->>Frontend: Return job ID
    Frontend->>API: 4. Open WebSocket /api/jobs/{id}/stream
    Redis->>Worker: 5. Worker picks up job
    Worker->>Worker: Stage 1: Download audio (0-20%)
    Worker->>Storage: Save audio.wav
    Worker->>Redis: Publish progress: 20%
    Redis->>API: Read progress
    API->>Frontend: WebSocket update: 20%
    Worker->>Worker: Stage 2: Source separation (20-50%)
    Worker->>Storage: Save stems (drums, bass, vocals, other)
    Worker->>Redis: Publish progress: 50%
    Redis->>API: Read progress
    API->>Frontend: WebSocket update: 50%
    Worker->>Worker: Stage 3: Transcription (50-90%)
    Worker->>Worker: basic-pitch converts to MIDI
    Worker->>Redis: Publish progress: 90%
    Redis->>API: Read progress
    API->>Frontend: WebSocket update: 90%
    Worker->>Worker: Stage 4: MusicXML generation (90-100%)
    Worker->>Storage: Save output.musicxml
    Worker->>Redis: Mark job complete
    Redis->>API: Job complete notification
    API->>Frontend: 6. WebSocket: Job complete
    Frontend->>API: 7. GET /api/scores/{id}
    API->>Storage: Fetch MusicXML
    Storage->>API: Return file
    API->>Frontend: Return MusicXML
    Frontend->>Frontend: 8. Parse MusicXML
    Frontend->>Frontend: VexFlow renders SVG
    Frontend->>User: Display editable notation
```
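Each stage reports progress on its own 0.0-1.0 scale, which the worker then maps onto the overall 0-100 bar using the stage ranges from the diagram above. A small sketch (function name is illustrative):

```python
# (start, end) slice of the overall progress bar owned by each stage.
STAGE_RANGES = {
    "download": (0, 20),
    "separate": (20, 50),
    "transcribe": (50, 90),
    "generate": (90, 100),
}

def overall_progress(stage: str, fraction: float) -> int:
    """Map a stage-local fraction (0.0-1.0) onto the overall 0-100 scale."""
    start, end = STAGE_RANGES[stage]
    return round(start + fraction * (end - start))
```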
## Technology Stack Summary
| Layer | Technology | Why? |
|---|---|---|
| Frontend Framework | React | Component-based UI, large ecosystem |
| Notation Rendering | VexFlow | Best browser-based music engraving library |
| Audio Playback | Tone.js | WebAudio wrapper with scheduling and synthesis |
| Backend Framework | FastAPI | Async Python, auto-generated API docs, WebSocket support |
| Task Queue | Celery + Redis | Industry standard for async Python jobs |
| Source Separation | Demucs | State-of-the-art audio separation (Meta Research) |
| Transcription | basic-pitch | Spotify's open-source polyphonic transcription |
| Music Notation Format | MusicXML | Universal interchange format for notation software |
See Technology Stack for detailed trade-off analysis.
## Key Architectural Decisions

### Why Async Processing?
Audio processing is slow (1-2 minutes for a 3-minute song):
- Demucs source separation: ~30-60 seconds (GPU)
- basic-pitch transcription: ~15-30 seconds per stem
- Total: 1-2 minutes minimum
Blocking HTTP requests for 2 minutes is unacceptable UX. Solution:
- Job queue decouples API from processing
- WebSocket provides real-time updates without polling overhead
- Users can close tab and return later
### Why Server-Side Processing?
ML models are too large for browsers:
- Demucs model: ~350MB
- basic-pitch model: ~30MB
- GPU acceleration critical for performance
- User devices lack compute power for real-time processing
### Why MusicXML Over MIDI?
MIDI is great for playback but lacks notation-specific metadata:
- No staff layout, clef, or articulation marks
- No measure boundaries or bar lines
- No lyrics or dynamics
MusicXML is the standard for music notation interchange:
- Used by Finale, Sibelius, MuseScore, Dorico
- Supports full notation semantics
- Can be converted to MIDI for playback
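To make the difference concrete, here is the notation metadata a single whole note carries in MusicXML: clef, divisions, note type, and measure structure, none of which exist in MIDI. A stdlib sketch of building such a fragment (in the real pipeline music21 or a custom generator would do this):

```python
import xml.etree.ElementTree as ET

# Build a minimal one-part, one-measure, one-note score-partwise document.
root = ET.Element("score-partwise", version="4.0")
part_list = ET.SubElement(root, "part-list")
score_part = ET.SubElement(part_list, "score-part", id="P1")
ET.SubElement(score_part, "part-name").text = "Piano"

part = ET.SubElement(root, "part", id="P1")
measure = ET.SubElement(part, "measure", number="1")

attrs = ET.SubElement(measure, "attributes")
ET.SubElement(attrs, "divisions").text = "1"   # quarter-note resolution
clef = ET.SubElement(attrs, "clef")
ET.SubElement(clef, "sign").text = "G"         # treble clef on line 2
ET.SubElement(clef, "line").text = "2"

note = ET.SubElement(measure, "note")
pitch = ET.SubElement(note, "pitch")
ET.SubElement(pitch, "step").text = "C"
ET.SubElement(pitch, "octave").text = "4"
ET.SubElement(note, "duration").text = "4"
ET.SubElement(note, "type").text = "whole"     # notation-level, not in MIDI

xml_bytes = ET.tostring(root)
```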
### Why VexFlow Over Alternatives?
VexFlow (chosen):
- Pure JavaScript, no dependencies
- Excellent rendering quality
- Programmatic SVG generation (good for editing)
- Active maintenance
OpenSheetMusicDisplay (alternative):
- More complete MusicXML support
- Less control over rendering
- Harder to build interactive editing on top
## Scaling Considerations

### MVP (Local Development)
- Single machine with GPU
- Processes one job at a time
- Redis on localhost
- Temp storage on local disk
### Production (Future)
- Frontend: Static hosting (Vercel, Netlify, S3+CloudFront)
- API: Containerized FastAPI (AWS ECS, GCP Cloud Run)
- Workers: Serverless GPU (Modal, RunPod) or K8s with GPU nodes
- Storage: S3 for audio/outputs, Redis Cloud for queue
- Scaling: Horizontal scaling of workers based on queue depth
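Queue-depth-based scaling reduces to a small decision function the autoscaler evaluates periodically (all parameters here are illustrative defaults, not measured capacity):

```python
import math

def desired_workers(queue_depth: int, jobs_per_worker: int = 1,
                    min_workers: int = 1, max_workers: int = 8) -> int:
    """Scale GPU worker count with queue depth, clamped to [min, max]."""
    wanted = math.ceil(queue_depth / jobs_per_worker) if queue_depth else min_workers
    return max(min_workers, min(max_workers, wanted))
```

With serverless GPU providers `min_workers` could be 0 to avoid paying for idle GPUs, at the cost of cold-start latency on the first job.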
See Deployment Strategy for details.
## Dependencies Between Components
```mermaid
graph LR
    Frontend -->|HTTP/WebSocket| API[API Server]
    API -->|Job queue, status| Redis
    API -->|Output files| Storage
    Workers -->|Queue, status updates| Redis
    Workers -->|Temp files, outputs| Storage
    Workers -->|ML Processing| GPU[GPU Resources]
```
Critical Path: Workers depend on GPU availability. If GPU is busy, jobs queue up.
## Security Considerations
- YouTube URL validation: Prevent SSRF attacks, validate domain
- Rate limiting: Prevent abuse (max 10 jobs per IP per hour)
- Job isolation: Sandboxed worker processes
- Temp file cleanup: Delete audio files after processing
- Copyright: User-generated content, platform not liable (DMCA safe harbor)
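The per-IP rate limit can be enforced with a sliding-window counter; in production the timestamps would live in Redis rather than process memory, and the class below is a stdlib sketch under that assumption:

```python
import time
from collections import defaultdict, deque
from typing import Optional

class SlidingWindowLimiter:
    """Allow at most `limit` jobs per client key within `window` seconds."""

    def __init__(self, limit: int = 10, window: float = 3600.0):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)  # key -> timestamps of accepted jobs

    def allow(self, key: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        q = self.hits[key]
        while q and now - q[0] >= self.window:
            q.popleft()                 # evict hits that fell out of the window
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True
```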
## Error Handling Strategy
- Retryable errors: Network failures, temporary YouTube blocks → retry 3x
- Permanent errors: Invalid URL, age-restricted video, copyright block → fail immediately
- Partial failures: Some stems fail transcription → return partial results
- Timeout: Jobs exceeding 10 minutes killed, status marked failed
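The retry-vs-fail decision reduces to classifying errors against the retry budget. A sketch with illustrative error codes (the actual pipeline would key off exception types from yt-dlp and the ML stages):

```python
# Illustrative error-code taxonomy mirroring the strategy above.
RETRYABLE = {"network_error", "youtube_temp_block"}
PERMANENT = {"invalid_url", "age_restricted", "copyright_block"}

def handle_error(code: str, attempt: int, max_retries: int = 3) -> str:
    """Return 'retry' for transient errors within the budget, else 'fail'."""
    if code in RETRYABLE and attempt < max_retries:
        return "retry"
    return "fail"  # permanent or unknown errors fail fast rather than loop
```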
## Monitoring & Observability (Future)
- Metrics: Job success rate, processing time per stage, queue depth
- Logging: Structured logs per job ID
- Tracing: End-to-end latency from URL submit to MusicXML ready
- Alerts: Worker failures, high queue depth, high error rate
## Next Steps
- Read Technology Stack for detailed tech analysis
- Review Audio Processing Pipeline for implementation details
- See MVP Scope for what to build first