
	# Deployment Strategy

	## Current Target: Local Development

	The MVP focuses on getting the system working locally before worrying about production infrastructure.

	### Local Development Architecture

	```mermaid
	graph TB
	subgraph DevMachine["Developer Machine (macOS/Linux)"]
	Frontend["Frontend (React + Vite)<br/>http://localhost:5173"]
	API["Backend API (FastAPI)<br/>http://localhost:8000"]
	Redis["Redis (Queue + Cache)<br/>localhost:6379"]
	Worker["Celery Worker (GPU-enabled)<br/>- Demucs model (~350MB)<br/>- basic-pitch model (~30MB)"]
	Storage["Local Storage<br/>- /tmp/rescored/audio/ (temp)<br/>- /tmp/rescored/outputs/ (MusicXML)"]

	Frontend -->|HTTP/WS| API
	API --> Redis
	Redis --> Worker
	Worker -.-> Storage
	end
	```

	### Setup Requirements

	Hardware:
	- GPU: Apple Silicon (M1/M2/M3/M4 with MPS) OR NVIDIA GPU with 4GB+ VRAM
	- Alternative: Run on CPU (10-15x slower, acceptable for development)
	- RAM: 16GB+ recommended
	- Disk: 10GB for models and temp files

	Software:
	- Python 3.10 (required for madmom compatibility)
	- Node.js 18+
	- Redis 7+
	- FFmpeg
	- YouTube cookies (required as of December 2024)
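
A quick preflight check can catch missing prerequisites before the services are started. The sketch below is one possible approach, not part of the project: the executable names (`node`, `redis-server`, `ffmpeg`) are assumptions about how the tools are installed, and it only checks that they are present, not that Node.js is 18+ or Redis is 7+.

```python
import shutil
import sys
from typing import Callable, Iterable, List, Optional

# Executable names are assumptions; adjust them to the local installation.
REQUIRED_TOOLS = ("node", "redis-server", "ffmpeg")


def missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def python_is_310(version_info=sys.version_info) -> bool:
    """True when the interpreter is Python 3.10.x, as the list above requires."""
    return tuple(version_info[:2]) == (3, 10)


if __name__ == "__main__":
    for tool in missing_tools():
        print(f"missing: {tool}")
    if not python_is_310():
        print(f"expected Python 3.10, found {sys.version.split()[0]}")
```

Passing `which` as a parameter keeps the check easy to exercise without touching the real PATH.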

	### Docker Compose Setup (Recommended)

	```yaml
	# docker-compose.yml
	version: '3.8'

	services:
	  redis:
	    image: redis:7-alpine
	    ports:
	      - "6379:6379"
	    volumes:
	      - redis_data:/data

	  api:
	    build: ./backend
	    ports:
	      - "8000:8000"
	    environment:
	      - REDIS_URL=redis://redis:6379
	      - STORAGE_PATH=/app/storage
	    volumes:
	      - ./backend:/app
	      - storage:/app/storage
	    depends_on:
	      - redis

	  worker:
	    build: ./backend
	    command: celery -A tasks worker --loglevel=info
	    environment:
	      - REDIS_URL=redis://redis:6379
	      - STORAGE_PATH=/app/storage
	    volumes:
	      - ./backend:/app
	      - storage:/app/storage
	    deploy:
	      resources:
	        reservations:
	          devices:
	            - driver: nvidia
	              count: 1
	              capabilities: [gpu]
	    depends_on:
	      - redis

	  frontend:
	    build: ./frontend
	    ports:
	      - "5173:5173"
	    volumes:
	      - ./frontend:/app
	    environment:
	      - VITE_API_URL=http://localhost:8000

	volumes:
	  redis_data:
	  storage:
	```

	Benefits:
	- One command to start everything: `docker-compose up`
	- Consistent environment across developers
	- NVIDIA GPU passthrough declared in the compose file (the host needs the NVIDIA Container Toolkit)
	- Easy cleanup

	Limitations:
	- Slower hot reload than native
	- No Apple Silicon GPU inside containers: Docker Desktop on Mac does not expose MPS, so the worker runs on CPU there
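
The compose file passes `REDIS_URL` and `STORAGE_PATH` to both the API and the worker. A minimal sketch of how backend code might read them is shown below; the `Settings` class, the `load_settings` helper, and the fallback values for running outside Docker are assumptions for illustration, not taken from the project.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    redis_url: str
    storage_path: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the variables set in docker-compose.yml, with local-run fallbacks."""
    return Settings(
        # Fallbacks are assumptions for running the services natively.
        redis_url=env.get("REDIS_URL", "redis://localhost:6379"),
        storage_path=env.get("STORAGE_PATH", "/tmp/rescored"),
    )
```

Reading configuration in one place means the same code can run under Docker Compose and in the native setup without edits.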

	### Quick Start (Recommended)

	Use the provided shell scripts to start/stop all services:

	```bash
	# From project root
	./start.sh

	# View logs
	tail -f logs/api.log # Backend API
	tail -f logs/worker.log # Celery worker
	tail -f logs/frontend.log # Frontend

	# Stop all services
	./stop.sh
	```

	What `start.sh` does:
	1. Starts Redis via Homebrew (if it is not already running)
	2. Activates Python 3.10 venv
	3. Starts Backend API (uvicorn) in background
	4. Starts Celery Worker (--pool=solo for macOS) in background
	5. Starts Frontend (npm run dev) in background
	6. Writes all logs to `logs/` directory

	Services available at:
	- Frontend: http://localhost:5173
	- Backend API: http://localhost:8000
	- API Docs: http://localhost:8000/docs
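
To confirm the services actually came up, a small port probe can be run after `start.sh`. This is a sketch using only the standard library; the port numbers come from the list above and the architecture diagram, while the function names are illustrative.

```python
import socket
from typing import Dict

# Ports from the service list above; Redis uses 6379 per the architecture diagram.
SERVICES: Dict[str, int] = {"frontend": 5173, "api": 8000, "redis": 6379}


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def service_status(host: str = "localhost") -> Dict[str, bool]:
    """Probe every known service port on `host`."""
    return {name: port_open(host, port) for name, port in SERVICES.items()}


if __name__ == "__main__":
    for name, up in service_status().items():
        print(f"{name:10s} {'up' if up else 'DOWN'}")
```

An open port only shows that something is listening; it does not verify that the API or worker is healthy.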

	### Manual Setup (Alternative)

	If you prefer to run services manually in separate terminals:

	Terminal 1 - Redis (macOS with Homebrew):
	```bash
	brew services start redis
	redis-cli ping # Should return PONG
	```

	Terminal 2 - Backend API:
	```bash
	cd backend
	source .venv/bin/activate
	uvicorn main:app --host 0.0.0.0 --port 8000 --reload
	```

	Terminal 3 - Celery Worker:
	```bash
	cd backend
	source .venv/bin/activate
	# Use --pool=solo on macOS to avoid fork() crashes with ML libraries
	celery -A tasks worker --loglevel=info --pool=solo
	```

	Terminal 4 - Frontend:
	```bash
	cd frontend
	npm run dev
	```

	Benefits:
	- Easier debugging (separate terminal per service)
	- More control over each service
	- See output in real-time

	Limitations:
	- Managing 4 terminals
	- Need to manually stop each service

	---

	## Future: Production Deployment

	### Phase 2 - Proof of Concept Deployment

	Goal: Share with friends, small beta test (< 100 users)

	Architecture:
	```mermaid
	graph TB
	Users[Users]
	Users --> Vercel["Vercel<br/>Frontend (React SPA)"]
	Vercel -->|HTTPS| Render["Render/Railway<br/>Backend API (FastAPI)"]
	Render --> Upstash["Upstash Redis<br/>Job queue"]
	Upstash --> Modal["Modal<br/>GPU workers (Demucs + basic-pitch)"]
	Modal --> R2["Cloudflare R2<br/>Audio/MusicXML storage"]
	```

	Components:

	| Service | Provider | Why | Cost (est.) |
	|---------|----------|-----|-------------|
	| Frontend | Vercel | Free tier, great DX | $0 |
	| Backend API | Render/Railway | Easy deploy, free tier | $0-7/month |
	| Redis | Upstash | Serverless, free tier | $0 |
	| GPU Workers | Modal | Pay-per-use GPU | $0.50/hour GPU time |
	| Storage | Cloudflare R2 | Cheap, S3-compatible | $0.015/GB |

	Estimated Monthly Cost: $10-50 for 100 users doing ~5 transcriptions/month each
	- GPU: 500 jobs/month × 2 min/job ≈ 16.7 GPU hours/month ≈ $8 at $0.50/hour
	- Storage: 100GB = ~$1.50
	- Backend: Free tier (Render) or $7/month
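
The arithmetic behind this estimate can be reproduced with a few lines of Python. The sketch below uses the rates from the table above as defaults; the function name and parameter names are illustrative, and actual provider pricing may differ.

```python
def phase2_monthly_cost(
    jobs_per_month: int = 500,           # 100 users x ~5 transcriptions each
    gpu_minutes_per_job: float = 2.0,
    gpu_rate_per_hour: float = 0.50,     # GPU Workers row in the table
    storage_gb: float = 100.0,
    storage_rate_per_gb: float = 0.015,  # Storage row in the table
    backend_fee: float = 0.0,            # 0 on the free tier, 7 on the paid plan
) -> dict:
    """Break the monthly estimate down into GPU, storage, and backend cost."""
    gpu_hours = jobs_per_month * gpu_minutes_per_job / 60
    gpu_cost = gpu_hours * gpu_rate_per_hour
    storage_cost = storage_gb * storage_rate_per_gb
    return {
        "gpu_hours": round(gpu_hours, 1),
        "gpu_cost": round(gpu_cost, 2),
        "storage_cost": round(storage_cost, 2),
        "total": round(gpu_cost + storage_cost + backend_fee, 2),
    }


if __name__ == "__main__":
    print(phase2_monthly_cost())                 # free backend tier
    print(phase2_monthly_cost(backend_fee=7.0))  # paid backend plan
```

With these inputs the total lands at roughly $10 on the free backend tier and about $17 on the paid plan, which is the low end of the $10-50 range.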

	Deployment Flow:
	1. Push to `main` branch
	2. Vercel auto-deploys frontend
	3. Render auto-deploys API from Dockerfile
	4. Modal workers pull latest image on invocation

	Limitations:
	- Cold starts (workers take 10-20s to start)
	- No auto-scaling of API (single instance)
	- Limited monitoring

	---

	### Phase 3 - Production Scale

	Goal: Support 1000+ users, high availability

	Architecture:
	```mermaid
	graph TB
	Users[Users]
	Users --> CDN["Cloudflare CDN<br/>Cached static assets"]
	CDN --> ALB["AWS ALB<br/>Load balancer"]
	ALB --> ECS["ECS Fargate<br/>API servers (auto-scaling 2-10)"]
	ECS --> ElastiCache["ElastiCache Redis<br/>Job queue (HA, multi-AZ)"]
	ElastiCache --> GPU["Modal/Runpod<br/>GPU workers (auto-scaling 1-20)"]
	GPU --> S3["S3<br/>Audio/MusicXML storage"]
	ECS --> RDS["RDS PostgreSQL<br/>User accounts, job history"]
	```

	Infrastructure:

	| Component | Service | Scaling | Cost |
	|-----------|---------|---------|------|
	| CDN | Cloudflare | Global edge caching | $20/month |
	| API | ECS Fargate | 2-10 instances, CPU-based autoscaling | $50-200/month |
	| Redis | ElastiCache | Multi-AZ, 2 nodes | $50/month |
	| Workers | Modal | 1-20 GPU instances, queue-depth scaling | $500-2000/month |
	| Storage | S3 | Lifecycle policies (delete after 30 days) | $50-100/month |
	| DB | RDS PostgreSQL | Multi-AZ, auto-scaling storage | $50-100/month |
	| Monitoring | Datadog/Sentry | Error tracking, metrics | $50/month |

	Estimated Monthly Cost: $800-2500 for 10k transcriptions/month
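
As a sanity check, the per-component figures in the table can be summed directly. The snippet below copies the table's low and high values and divides by the 10k transcriptions assumed in the estimate; the variable names are illustrative.

```python
# (low, high) monthly cost in USD, copied from the infrastructure table above.
COSTS = {
    "CDN": (20, 20),
    "API": (50, 200),
    "Redis": (50, 50),
    "Workers": (500, 2000),
    "Storage": (50, 100),
    "DB": (50, 100),
    "Monitoring": (50, 50),
}
TRANSCRIPTIONS_PER_MONTH = 10_000

low = sum(lo for lo, _ in COSTS.values())
high = sum(hi for _, hi in COSTS.values())

print(f"total: ${low}-{high}/month")
print(f"per transcription: ${low / TRANSCRIPTIONS_PER_MONTH:.3f}-{high / TRANSCRIPTIONS_PER_MONTH:.3f}")
```

The sum comes to $770-2520 per month, in line with the rounded $800-2500 estimate, or roughly $0.08 to $0.25 per transcription. GPU workers dominate both ends of the range.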

	Features:
	- Auto-scaling: API scales on CPU, workers scale on queue depth
	- High availability: Multi-AZ for DB and Redis
	- Monitoring: Full observability (logs, metrics, traces)
	- Security: VPC, encryption at rest, HTTPS everywhere
	- CI/CD: GitHub Actions, blue-green deployments
	- Rate limiting: Per-user quotas, IP-based throttling
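
Scaling workers on queue depth can be sketched as a simple clamp. The 1-20 bounds come from the infrastructure table; `jobs_per_worker` (how many queued jobs one worker is expected to absorb) is an assumed tuning knob, and the function name is illustrative.

```python
import math


def desired_workers(
    queue_depth: int,
    jobs_per_worker: int = 5,  # assumption: tune from measured processing time
    min_workers: int = 1,      # lower bound from the infrastructure table
    max_workers: int = 20,     # upper bound from the infrastructure table
) -> int:
    """Return the GPU worker count for the current queue depth, clamped to bounds."""
    if jobs_per_worker <= 0:
        raise ValueError("jobs_per_worker must be positive")
    wanted = math.ceil(queue_depth / jobs_per_worker)
    return max(min_workers, min(max_workers, wanted))
```

For example, `desired_workers(12)` asks for 3 workers, while an empty queue still keeps the minimum of 1.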

	Deployment Pipeline:
	1. PR opened → GitHub Actions runs tests
	2. Merge to `main` → Docker images built and pushed to ECR
	3. ECS updates task definitions (rolling update)
	4. Modal pulls new worker image on next invocation
	5. Cloudflare cache invalidated for frontend assets

	---

	## GPU Infrastructure Deep Dive

	### Local GPU (Development)

	Supported:
	- NVIDIA GPUs with CUDA 11.8+ support
	- Apple Silicon (MPS backend) - experimental, slower

	Setup:
	```bash
	# Check GPU
	nvidia-smi

	# Install PyTorch with CUDA
	pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
	```

	Fallback to CPU:
	```python
	# In worker code: use CUDA when available, otherwise fall back to CPU
	import torch

	device = "cuda" if torch.cuda.is_available() else "cpu"
	```
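
Since Apple Silicon is listed as supported through the MPS backend, the device choice could be extended to try MPS before falling back to CPU. The following is a sketch under that assumption, not the project's worker code: `pick_device` and `detect_device` are illustrative names, and whether MPS is actually faster for Demucs and basic-pitch should be measured.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple's MPS backend, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Query PyTorch for available backends; report CPU if PyTorch is absent."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # torch.backends.mps is missing on older PyTorch builds, so look it up defensively.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps_backend is not None and mps_backend.is_available())
    return pick_device(torch.cuda.is_available(), mps_ok)
```

Keeping the preference order in a pure function makes it testable on machines without a GPU.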

	Processing Time:
	- GPU (RTX 3080): ~45 seconds per 3-minute song
	- CPU (M1 Max): ~8 minutes per 3-minute song

	---

	### Serverless GPU (Production)

	Option 1: Modal (Recommended)

	Pros:
	- Fast cold starts (10-20 seconds)
	- Per-second billing
	- No idle GPU cost
	- Great Python support
	- Volumes for model caching

	Cons:
	- Newer platform (less proven)
	- US-only regions currently

	Example Worker:
	```python
	import modal

	# Modal app that groups the worker functions
	app = modal.App("rescored")

	@app.function(
	    gpu="A10G",  # NVIDIA A10G (24GB VRAM)
	    timeout=600,
	    volumes={"/models": modal.Volume.from_name("model-cache")},
	)
	def process_audio(job_id: str, audio_url: str):
	    # Demucs + basic-pitch processing
	    pass
	```

	Cost: ~$0.60/hour for A10G GPU

	---

	Option 2: RunPod Serverless

	Pros:
	- Cheaper than Modal ($0.30-0.50/hour)
	- More GPU options
	- Global regions

	Cons:
	- Slower cold starts (30-60 seconds)
	- More manual setup

	---

	Option 3: AWS SageMaker/Lambda

	Pros:
	- AWS ecosystem integration
	- Well-documented

	Cons:
	- Expensive for small workloads
	- Slow cold starts
	- More complex setup

	---

	Decision for Production: Start with Modal, evaluate RunPod if cost becomes an issue.

	---

	## Storage Strategy

	### MVP (Local)
	- Temp files: `/tmp/rescored/`
	- Cleanup: Manual or cron job

	### Production
	- Temp audio: S3 with 1-day lifecycle policy (delete after processing)
	- Output files: S3 Standard for 30 days, then delete; or S3 Intelligent-Tiering if keeping long-term
	- Model files: Baked into Docker image or cached volume (Modal)

	S3 Bucket Structure:
	```
	s3://rescored-prod/
	  temp-audio/
	    {job_id}.wav          # Delete after 1 day
	  separated-stems/
	    {job_id}/
	      drums.wav
	      bass.wav
	      ...                 # Delete after 1 day
	  outputs/
	    {job_id}.musicxml     # Keep for 30 days
	    {job_id}.midi         # Keep for 30 days
	```
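
The layout and retention periods above map naturally onto key-building helpers and prefix-based lifecycle rules. The sketch below builds a plain dictionary in the shape S3's lifecycle configuration expects (for example as the `LifecycleConfiguration` argument to boto3's `put_bucket_lifecycle_configuration`); the helper names and rule IDs are assumptions, and nothing here calls AWS.

```python
from typing import Dict


def temp_audio_key(job_id: str) -> str:
    return f"temp-audio/{job_id}.wav"


def stem_key(job_id: str, stem: str) -> str:
    return f"separated-stems/{job_id}/{stem}.wav"


def output_key(job_id: str, extension: str) -> str:
    return f"outputs/{job_id}.{extension}"


# Prefix -> days before expiry, following the bucket structure above.
RETENTION_DAYS: Dict[str, int] = {
    "temp-audio/": 1,
    "separated-stems/": 1,
    "outputs/": 30,
}


def lifecycle_configuration(retention: Dict[str, int] = RETENTION_DAYS) -> dict:
    """Build one expiration rule per prefix."""
    return {
        "Rules": [
            {
                "ID": f"expire-{prefix.rstrip('/')}",  # rule IDs are illustrative
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
            for prefix, days in retention.items()
        ]
    }
```

Letting the bucket expire objects by prefix avoids a separate cleanup job for production storage.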

	---

	## Scaling Bottlenecks

	### What Scales Easily
	- Frontend (static assets, CDN)
	- API servers (stateless, horizontal scaling)
	- Redis (managed service auto-scaling)

	### What Doesn't Scale Easily
	- GPU workers: Expensive, limited availability
	- Source separation: CPU/GPU bound, can't optimize much
	- Model loading: Large models (350MB) slow cold starts

	### Mitigation Strategies
	1. Pre-warm workers: Keep 1-2 GPU workers hot during peak hours
	2. Model caching: Use Modal volumes or Docker layers
	3. Queue prioritization: Premium users get faster processing
	4. Job batching: Process multiple songs on same GPU instance (future)
	5. Progressive results: Return piano transcription first, other instruments later
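
Queue prioritization (strategy 3) can be illustrated with an in-memory priority queue. This is only a sketch of the ordering rule, premium jobs first and first-in-first-out within a tier; a Celery and Redis deployment would need its own mechanism such as separate queues, and the class and tier names are assumptions.

```python
import heapq
import itertools
from typing import List, Tuple

PREMIUM, STANDARD = 0, 1  # lower value is served first


class JobQueue:
    """Priority queue: premium before standard, FIFO within each tier."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._counter = itertools.count()  # tie-breaker that preserves arrival order

    def push(self, job_id: str, tier: int = STANDARD) -> None:
        heapq.heappush(self._heap, (tier, next(self._counter), job_id))

    def pop(self) -> str:
        """Remove and return the next job id; raises IndexError when empty."""
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)
```

The monotonically increasing counter keeps jobs of the same tier in submission order, which a bare `(tier, job_id)` tuple would not.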

	---

	## Cost Optimization

	### Development
	- Use CPU for small tests (slower but free)
	- Limit worker parallelism to 1

	### Production
	- Lifecycle policies: Delete temp files after 1 day, outputs after 30 days
	- Reserved capacity: If consistent load, reserve GPU instances (50% cheaper)
	- Spot instances: Use for non-urgent jobs (70% cheaper, can be interrupted)
	- CDN caching: Aggressive caching for static assets (frontend, model files)
	- Compression: Gzip API responses, compress audio files before storage

	---

	## Monitoring & Observability

	### Metrics to Track
	- API: Request rate, latency, error rate
	- Workers: Queue depth, processing time per stage, GPU utilization
	- Costs: GPU hours used, storage size, API requests

	### Logging
	- Structured logs: JSON format with job_id, user_id, stage
	- Centralized: CloudWatch, Datadog, or Loki
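
Structured logs with `job_id`, `user_id`, and `stage` can be produced with the standard `logging` module alone. The formatter below is a minimal sketch; the logger name `rescored.worker` and the extra `level`, `logger`, and `message` fields are assumptions.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    CONTEXT_FIELDS = ("job_id", "user_id", "stage")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Context arrives through logging's `extra=` argument; absent fields become null.
        for field in self.CONTEXT_FIELDS:
            payload[field] = getattr(record, field, None)
        return json.dumps(payload)


def get_logger(name: str = "rescored.worker") -> logging.Logger:
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger
```

A worker would then log with `get_logger().info("separation finished", extra={"job_id": "abc", "user_id": "u1", "stage": "separation"})`, and any of the centralized backends listed above can index the JSON fields.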

	### Alerting
	- Worker failures exceeding 5% of jobs
	- Queue depth over 100 jobs (need more workers)
	- GPU utilization below 50% (over-provisioned)
	- API error rate over 1%
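
The four alert conditions can be written down as a single check over a metrics snapshot. This sketch uses the thresholds listed above; the `Metrics` fields and alert names are illustrative, and a real setup would express these rules in the monitoring tool itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Metrics:
    worker_failure_rate: float  # fraction of jobs that failed, 0.0-1.0
    queue_depth: int            # jobs waiting in the queue
    gpu_utilization: float      # fraction, 0.0-1.0
    api_error_rate: float       # fraction of API requests that errored, 0.0-1.0


def fired_alerts(m: Metrics) -> List[str]:
    """Return the names of the alert rules that the snapshot violates."""
    alerts = []
    if m.worker_failure_rate > 0.05:
        alerts.append("worker_failures")
    if m.queue_depth > 100:
        alerts.append("queue_backlog")       # more workers needed
    if m.gpu_utilization < 0.50:
        alerts.append("gpu_underutilized")   # over-provisioned
    if m.api_error_rate > 0.01:
        alerts.append("api_errors")
    return alerts
```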

	---

	## Security Considerations

	### Local Development
	- No auth needed
	- Redis on localhost only
	- CORS enabled for `localhost:5173`

	### Production
	- HTTPS only: Enforce TLS for API and WebSocket
	- API authentication: JWT tokens for user sessions
	- Rate limiting: 10 jobs per user per hour
	- Input validation: Check YouTube URL format, max video length
	- Secrets management: Use environment variables or AWS Secrets Manager
	- VPC: API and workers in private subnets
	- File scanning: Check uploaded files for malware (if allowing file uploads)
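
Two of these controls, URL format validation and the 10-jobs-per-hour limit, are small enough to sketch with the standard library. The accepted host list, the function and class names, and the in-memory sliding window are assumptions; a multi-instance API would need shared state (for example in Redis), and checking video length requires fetching metadata, which is not shown.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional
from urllib.parse import parse_qs, urlparse

YOUTUBE_HOSTS = ("youtube.com", "www.youtube.com", "m.youtube.com")  # assumed allow-list


def youtube_video_id(url: str) -> Optional[str]:
    """Return the video id for common YouTube URL shapes, else None."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return None
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        video_id = parsed.path.lstrip("/")
    elif host in YOUTUBE_HOSTS and parsed.path == "/watch":
        video_id = parse_qs(parsed.query).get("v", [""])[0]
    else:
        return None
    return video_id or None


class RateLimiter:
    """Sliding-window limiter: at most `limit` jobs per user per window."""

    def __init__(self, limit: int = 10, window_seconds: float = 3600.0) -> None:
        self.limit = limit
        self.window = window_seconds
        self._events: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, user_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        events = self._events[user_id]
        # Drop submissions that have aged out of the window.
        while events and now - events[0] >= self.window:
            events.popleft()
        if len(events) >= self.limit:
            return False
        events.append(now)
        return True
```

The `now` parameter exists so the limiter can be tested with fixed timestamps instead of real waiting.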

	---

	## Disaster Recovery

	### Backups
	- Redis: Daily snapshots to S3
	- PostgreSQL: Automated daily backups (RDS), 7-day retention
	- Code: GitHub (already version controlled)
	- Models: Re-downloadable, no backup needed

	### Incident Response
	- Worker failure: Job retried automatically by Celery (requires task retries to be configured)
	- API crash: ECS restarts container, ALB routes to healthy instance
	- Redis failure: ElastiCache auto-failover to standby
	- Complete outage: Deploy from last known good commit, restore DB from backup

	---

	## Next Steps

	1. Get local development working with Docker Compose
	2. Test full pipeline end-to-end with sample YouTube videos
	3. Deploy proof-of-concept to Vercel + Modal for beta testing
	4. Collect metrics on processing time, costs, user feedback
	5. Scale to production architecture if product gains traction

	See [Audio Processing Pipeline](../backend/pipeline.md) for implementation details.