# VoiceForge Architecture

## Overview

VoiceForge is a production-grade speech-to-text (STT) and text-to-speech (TTS) application built in Python with Streamlit, FastAPI, and Celery. This document describes the system architecture and the key design decisions behind it.

## System Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                          Load Balancer                          │
│                       (Nginx / Cloud LB)                        │
└───────────────────────────┬─────────────────────────────────────┘
                            │
            ┌───────────────┼───────────────┐
            │               │               │
     ┌──────▼──────┐  ┌─────▼──────┐  ┌─────▼──────┐
     │  Frontend   │  │  Backend   │  │  Worker    │
     │  Streamlit  │  │  FastAPI   │  │  Celery    │
     │    :8501    │  │   :8000    │  │            │
     └──────┬──────┘  └─────┬──────┘  └─────┬──────┘
            │               │               │
     ┌──────▼───────────────▼───────────────▼──────┐
     │                Service Layer                │
     │ ┌─────────┐ ┌─────────┐ ┌─────────────────┐ │
     │ │   STT   │ │   TTS   │ │  File Service   │ │
     │ │ Service │ │ Service │ │                 │ │
     │ └────┬────┘ └────┬────┘ └────────┬────────┘ │
     │ ┌─────────┐ ┌─────────┐          │          │
     │ │   NLP   │ │ Export  │          │          │
     │ │ Service │ │ Service │          │          │
     │ └────┬────┘ └────┬────┘          │          │
     └──────┼───────────┼───────────────┼──────────┘
            │           │               │
     ┌──────▼───────────▼───────────────▼──────────┐
     │                 Data Layer                  │
     │ ┌──────────┐  ┌───────┐  ┌───────────────┐  │
     │ │PostgreSQL│  │ Redis │  │ File Storage  │  │
     │ │  :5432   │  │ :6379 │  │   /uploads    │  │
     │ └──────────┘  └───────┘  └───────────────┘  │
     └─────────────────────────────────────────────┘
            │
     ┌──────▼──────────────────────────────────────┐
     │                External APIs                │
     │  ┌─────────────────┐  ┌──────────────────┐  │
     │  │  Google Cloud   │  │   Google Cloud   │  │
     │  │ Speech-to-Text  │  │  Text-to-Speech  │  │
     │  └─────────────────┘  └──────────────────┘  │
     └─────────────────────────────────────────────┘
```

## Components

### Frontend (Streamlit)

- **Purpose**: Web interface for users
- **Technology**: Streamlit 1.31+
- **Key Features**:
  - Real-time microphone recording (WebRTC)
  - File upload with drag-and-drop
  - Audio waveform visualization
  - Transcript editing and export
  - Voice selection and preview

### Backend (FastAPI)

- **Purpose**: REST API server
- **Technology**: FastAPI 0.109+
- **Key Features**:
  - OpenAPI documentation
  - CORS middleware
  - JWT authentication (Phase 3)
  - Request validation
  - Error handling

### Worker (Celery)

- **Purpose**: Background task processing
- **Technology**: Celery 5.3+ with Redis broker
- **Key Features**:
  - Long audio file processing
  - Batch transcription
  - NLP analysis tasks

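
The split between request-time work and background work can be sketched as a small routing decision. This is only an illustration: the 60-second cutoff, the `TranscriptionJob` type, and the function names are assumptions, not taken from the VoiceForge code. In a real deployment the "background" branch would enqueue a Celery task rather than return a label.

```python
from dataclasses import dataclass

# Assumed cutoff: audio longer than this is handed to the background worker.
SYNC_MAX_SECONDS = 60.0


@dataclass(frozen=True)
class TranscriptionJob:
    """Hypothetical description of one transcription request."""
    audio_id: str
    duration_seconds: float


def choose_route(job: TranscriptionJob, sync_max_seconds: float = SYNC_MAX_SECONDS) -> str:
    """Return "inline" for short audio and "background" for long audio."""
    if job.duration_seconds > sync_max_seconds:
        return "background"
    return "inline"


def plan_batch(jobs: list[TranscriptionJob]) -> dict[str, list[str]]:
    """Group a batch of jobs by the route each one would take."""
    plan: dict[str, list[str]] = {"inline": [], "background": []}
    for job in jobs:
        plan[choose_route(job)].append(job.audio_id)
    return plan
```

Keeping the decision in one pure function makes the cutoff easy to test and to change without touching the API handlers.
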
### Database (PostgreSQL)

- **Purpose**: Persistent data storage
- **Technology**: PostgreSQL 15+
- **Tables**:
  - `users` - User accounts
  - `audio_files` - Uploaded audio metadata
  - `transcripts` - Transcription results
  - `user_preferences` - User settings
  - `usage_events` - Analytics data
  - `api_keys` - Enterprise API keys

### Cache (Redis)

- **Purpose**: Caching and Celery message broker
- **Technology**: Redis 7+
- **Use Cases**:
  - Voice list caching
  - Transcription result caching
  - Celery task queue
  - Session storage

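
As a rough illustration of transcription result caching, the sketch below derives a cache key from the audio content and stores values with a time-to-live. It uses an in-memory dictionary as a stand-in for Redis so that it runs on its own. The key layout (`stt:<language>:<sha256>`), the `TTLCache` class, and the function names are assumptions rather than the project's actual scheme.

```python
import hashlib
import time
from typing import Callable, Optional


def transcript_cache_key(audio_bytes: bytes, language: str) -> str:
    """Build a key from the audio content, so identical uploads share one entry."""
    digest = hashlib.sha256(audio_bytes).hexdigest()
    return f"stt:{language}:{digest}"


class TTLCache:
    """In-memory stand-in for a Redis cache with per-entry expiry."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: dict[str, tuple[float, str]] = {}

    def set(self, key: str, value: str) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Entry has expired: drop it and report a miss.
            del self._entries[key]
            return None
        return value
```

Hashing the content rather than the filename means a re-upload of the same audio under a different name still hits the cache.
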
### Observability (Prometheus)

- **Purpose**: Application monitoring
- **Technology**: prometheus-fastapi-instrumentator
- **Key Metrics**:
  - Request latency and throughput
  - Error rates
  - Endpoint usage statistics

## Data Flow

### Speech-to-Text Flow

```
1. User uploads an audio file
2. Frontend sends the file to /api/v1/stt/upload
3. Backend validates the file format and size
4. File is saved to storage
5. STT Service calls the Google Cloud Speech-to-Text API
6. Results are processed (words, segments, timestamps)
7. Transcript is saved to the database
8. Response is returned to the frontend
```

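
The steps above can be sketched as a single function with the external call injected, which keeps the example runnable without Google Cloud credentials. The allowed extensions, the size limit, and all names here (`validate_upload`, `transcribe_upload`, `Transcript`) are illustrative assumptions. The `recognize` callable stands in for the STT Service, and a plain dictionary stands in for the database.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable

# Assumed limits; the real values are not stated in this document.
ALLOWED_EXTENSIONS = {".wav", ".mp3", ".flac", ".ogg"}
MAX_UPLOAD_BYTES = 25 * 1024 * 1024


class UploadError(ValueError):
    """Raised when an upload fails validation (step 3)."""


@dataclass
class Transcript:
    filename: str
    text: str


def validate_upload(filename: str, size_bytes: int) -> None:
    """Reject files with an unsupported extension or an invalid size."""
    extension = Path(filename).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        raise UploadError(f"unsupported format: {extension or 'none'}")
    if size_bytes <= 0:
        raise UploadError("file is empty")
    if size_bytes > MAX_UPLOAD_BYTES:
        raise UploadError("file is too large")


def transcribe_upload(
    filename: str,
    data: bytes,
    recognize: Callable[[bytes], str],
    store: dict[str, Transcript],
) -> Transcript:
    """Run steps 3 to 8 of the flow for one uploaded file."""
    validate_upload(filename, len(data))      # step 3: validate format and size
    # Step 4 (saving the file to storage) is omitted from this sketch.
    text = recognize(data)                    # step 5: call the speech API
    transcript = Transcript(filename, text)   # step 6: process the result
    store[filename] = transcript              # step 7: persist the transcript
    return transcript                         # step 8: return to the caller
```

Validating before any storage or API call means a rejected upload costs nothing beyond the request itself.
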
### Text-to-Speech Flow

```
1. User enters text
2. Frontend sends the text to /api/v1/tts/synthesize
3. Backend validates the text and the selected voice
4. TTS Service calls the Google Cloud Text-to-Speech API
5. Audio is returned as base64
6. Frontend plays or downloads the audio
```

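
Step 5 returns binary audio inside a JSON response, which is why it is base64-encoded. A minimal sketch of both sides of that exchange follows. The field names `voice` and `audio_base64` and the function names are assumptions, not the actual response schema.

```python
import base64
import json


def build_tts_response(audio_bytes: bytes, voice: str) -> str:
    """Backend side: wrap synthesized audio in a JSON body."""
    payload = {
        "voice": voice,
        "audio_base64": base64.b64encode(audio_bytes).decode("ascii"),
    }
    return json.dumps(payload)


def decode_tts_response(body: str) -> bytes:
    """Frontend side: recover the raw audio bytes for playback or download."""
    payload = json.loads(body)
    # validate=True rejects characters outside the base64 alphabet.
    return base64.b64decode(payload["audio_base64"], validate=True)
```

Base64 encodes every 3 bytes as 4 characters, so the response body is roughly a third larger than the raw audio. That overhead is usually acceptable for short clips.
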
## Design Decisions

### Why PostgreSQL with JSONB?

- Single database simplifies deployment
- JSONB supports flexible document storage for segments
- SQL for relational queries (users, files)
- Full-text search capability

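
To make the "flexible document storage for segments" point concrete, the sketch below serializes a list of transcript segments into the kind of JSON document that could be stored in a JSONB column next to the relational fields. The `Segment` fields (`start`, `end`, `text`) are hypothetical. The actual segment shape is not specified in this document.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Segment:
    """Hypothetical transcript segment with timestamps in seconds."""
    start: float
    end: float
    text: str


def segments_to_json(segments: list[Segment]) -> str:
    """Serialize segments into one JSON array, suitable for a JSONB column."""
    return json.dumps([asdict(segment) for segment in segments])


def segments_from_json(document: str) -> list[Segment]:
    """Rebuild Segment objects from the stored JSON array."""
    return [Segment(**item) for item in json.loads(document)]
```

Because the segments travel as one document, adding a field such as a confidence score later would not require a schema migration on the `transcripts` table.
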
### Why Streamlit?

- Rapid development for data apps
- Built-in components for audio
- Easy deployment
- Python-native (no JS required)

### Why Google Cloud APIs?

- Industry-leading accuracy
- 100+ languages supported (Speech-to-Text)
- 200+ voice options (Text-to-Speech)
- Generous free tier

## Security Considerations

- Secrets via environment variables
- HTTPS in production
- JWT for authentication (planned for Phase 3)
- Per-user data isolation
- Temporary file cleanup

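
Two of these points, secrets from environment variables and temporary file cleanup, lend themselves to a short sketch. The variable names `DATABASE_URL` and `JWT_SECRET` and both function names are assumptions chosen for illustration.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator, Mapping

# Hypothetical setting names; the project's real variable names may differ.
REQUIRED_SETTINGS = ("DATABASE_URL", "JWT_SECRET")


def load_settings(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Read required secrets from the environment and fail fast if any are missing."""
    missing = [name for name in REQUIRED_SETTINGS if not env.get(name)]
    if missing:
        raise RuntimeError("missing required settings: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_SETTINGS}


@contextmanager
def temporary_upload(data: bytes, suffix: str = ".wav") -> Iterator[str]:
    """Write data to a temporary file and delete it when the block exits."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        yield path
    finally:
        # Runs even if processing raised, so no audio is left on disk.
        if os.path.exists(path):
            os.remove(path)
```

Failing at startup on a missing secret is usually preferable to discovering it on the first authenticated request.
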
## Deployment Options

### Local Development

```bash
# Run each service in its own terminal, starting from the repository root.

# Backend (serves on port 8000)
cd backend
uvicorn app.main:app --reload

# Frontend (serves on port 8501)
cd frontend
streamlit run streamlit_app.py
```

### Docker Compose

```bash
docker-compose -f deploy/docker/docker-compose.dev.yml up
```

### Production

- Deploy to any container orchestrator
- Use managed PostgreSQL (Cloud SQL, RDS)
- Use managed Redis (Memorystore, ElastiCache)
- Load balance with Nginx/Cloud LB