# VoiceForge Architecture
## Overview
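VoiceForge is a production-grade Speech-to-Text (STT) and Text-to-Speech (TTS) application written in Python. It combines a Streamlit frontend, a FastAPI backend, and Celery workers, with PostgreSQL and Redis for storage and the Google Cloud speech APIs for recognition and synthesis. This document describes the system architecture and the key design decisions behind it.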
## System Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│                          Load Balancer                          │
│                       (Nginx / Cloud LB)                        │
└───────────────────────────┬─────────────────────────────────────┘
                            │
            ┌───────────────┼───────────────┐
            │               │               │
     ┌──────▼──────┐  ┌─────▼──────┐  ┌─────▼──────┐
     │  Frontend   │  │  Backend   │  │   Worker   │
     │  Streamlit  │  │  FastAPI   │  │   Celery   │
     │    :8501    │  │   :8000    │  │            │
     └──────┬──────┘  └─────┬──────┘  └─────┬──────┘
            │               │               │
     ┌──────▼───────────────▼───────────────▼──────┐
     │                Service Layer                │
     │ ┌─────────┐ ┌─────────┐ ┌─────────────────┐ │
     │ │   STT   │ │   TTS   │ │  File Service   │ │
     │ │ Service │ │ Service │ │                 │ │
     │ └────┬────┘ └────┬────┘ └────────┬────────┘ │
     │ ┌─────────┐ ┌─────────┐          │          │
     │ │   NLP   │ │ Export  │          │          │
     │ │ Service │ │ Service │          │          │
     │ └────┬────┘ └────┬────┘          │          │
     └──────┼───────────┼───────────────┼──────────┘
            │           │               │
     ┌──────▼───────────▼───────────────▼──────────┐
     │                 Data Layer                  │
     │ ┌──────────┐ ┌───────┐ ┌───────────────┐    │
     │ │PostgreSQL│ │ Redis │ │ File Storage  │    │
     │ │  :5432   │ │ :6379 │ │   /uploads    │    │
     │ └──────────┘ └───────┘ └───────────────┘    │
     └─────────────────────────────────────────────┘
            │
     ┌──────▼──────────────────────────────────────┐
     │                External APIs                │
     │  ┌─────────────────┐  ┌──────────────────┐  │
     │  │  Google Cloud   │  │   Google Cloud   │  │
     │  │ Speech-to-Text  │  │  Text-to-Speech  │  │
     │  └─────────────────┘  └──────────────────┘  │
     └─────────────────────────────────────────────┘
```
## Components
### Frontend (Streamlit)
- **Purpose**: Web interface for users
- **Technology**: Streamlit 1.31+
- **Key Features**:
  - Real-time microphone recording (WebRTC)
  - File upload with drag-and-drop
  - Audio waveform visualization
  - Transcript editing and export
  - Voice selection and preview
### Backend (FastAPI)
- **Purpose**: REST API server
- **Technology**: FastAPI 0.109+
- **Key Features**:
  - OpenAPI documentation
  - CORS middleware
  - JWT authentication (Phase 3)
  - Request validation
  - Error handling
### Worker (Celery)
- **Purpose**: Background task processing
- **Technology**: Celery 5.3+ with Redis broker
- **Key Features**:
  - Long audio file processing
  - Batch transcription
  - NLP analysis tasks
### Database (PostgreSQL)
- **Purpose**: Persistent data storage
- **Technology**: PostgreSQL 15+
- **Tables**:
  - `users` - User accounts
  - `audio_files` - Uploaded audio metadata
  - `transcripts` - Transcription results
  - `user_preferences` - User settings
  - `usage_events` - Analytics data
  - `api_keys` - Enterprise API keys
### Cache (Redis)
- **Purpose**: Caching and task queue
- **Technology**: Redis 7+
- **Use Cases**:
  - Voice list caching
  - Transcription result caching
  - Celery task queue
  - Session storage
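
To make the transcription result caching use case concrete, here is a minimal sketch of one way to derive a cache key from the audio content and look results up with an expiry. The key format, the `transcript_cache_key` helper, and the `TTLCache` class are assumptions for illustration; `TTLCache` is an in-memory stand-in for Redis, not the project's actual cache client.

```python
import hashlib
import time
from typing import Dict, Optional, Tuple


def transcript_cache_key(audio_bytes: bytes, language: str) -> str:
    """Build a deterministic key from the audio content and language code.

    The "stt:<language>:<sha256>" layout is a hypothetical naming scheme.
    """
    digest = hashlib.sha256(audio_bytes).hexdigest()
    return f"stt:{language}:{digest}"


class TTLCache:
    """In-memory stand-in for a Redis-style set-with-expiry / get pair."""

    def __init__(self) -> None:
        # Maps key -> (absolute expiry time, cached value).
        self._store: Dict[str, Tuple[float, str]] = {}

    def set(self, key: str, value: str, ttl_seconds: float) -> None:
        self._store[key] = (time.monotonic() + ttl_seconds, value)

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            # Expired entries are dropped lazily on read.
            del self._store[key]
            return None
        return value
```

Hashing the audio bytes means that re-uploading the same file in the same language maps to the same key, so a repeated request could be served from the cache instead of calling the external API again.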
### Observability (Prometheus)
- **Purpose**: Application monitoring
- **Technology**: prometheus-fastapi-instrumentator
- **Key Metrics**:
  - Request latency and throughput
  - Error rates
  - Endpoint usage statistics
## Data Flow
### Speech-to-Text Flow
```
1. User uploads audio file
2. Frontend sends to /api/v1/stt/upload
3. Backend validates file format and size
4. File saved to storage
5. STT Service calls Google Cloud Speech API
6. Results processed (words, segments, timestamps)
7. Transcript saved to database
8. Response returned to frontend
```
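
Step 3 of the flow above (validating file format and size) could look roughly like the following sketch. The allowed extensions, the 25 MB limit, and the names `validate_upload` and `UploadValidationError` are assumptions chosen for illustration; the real backend may enforce different rules.

```python
from pathlib import Path

# Hypothetical limits: the actual accepted formats and size cap are not
# specified in this document.
ALLOWED_EXTENSIONS = {".wav", ".mp3", ".flac", ".ogg"}
MAX_UPLOAD_BYTES = 25 * 1024 * 1024


class UploadValidationError(ValueError):
    """Raised when an uploaded file fails format or size checks."""


def validate_upload(filename: str, size_bytes: int) -> str:
    """Return the normalized extension, or raise UploadValidationError."""
    extension = Path(filename).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        raise UploadValidationError(
            f"unsupported format: {extension or '(no extension)'}"
        )
    if size_bytes <= 0:
        raise UploadValidationError("file is empty")
    if size_bytes > MAX_UPLOAD_BYTES:
        raise UploadValidationError(
            f"file exceeds {MAX_UPLOAD_BYTES} bytes: {size_bytes}"
        )
    return extension
```

Rejecting bad input before the file is written to storage keeps steps 4 and 5 from doing work (and incurring API cost) on files that cannot be transcribed.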
### Text-to-Speech Flow
```
1. User enters text
2. Frontend sends to /api/v1/tts/synthesize
3. Backend validates text and voice
4. TTS Service calls Google Cloud TTS API
5. Audio returned as base64
6. Frontend plays/downloads audio
```
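
Steps 5 and 6 of the flow above imply a base64 round trip between backend and frontend. The sketch below shows that round trip with the standard library only. The JSON field names (`audio_content`, `format`) and both helper functions are hypothetical and not taken from the actual API schema.

```python
import base64
import json


def build_tts_response(audio_bytes: bytes, audio_format: str = "mp3") -> str:
    """Backend side: wrap synthesized audio in a JSON body as base64 text."""
    payload = {
        "audio_content": base64.b64encode(audio_bytes).decode("ascii"),
        "format": audio_format,
    }
    return json.dumps(payload)


def decode_tts_response(body: str) -> bytes:
    """Frontend side: recover the raw audio bytes for playback or download."""
    payload = json.loads(body)
    return base64.b64decode(payload["audio_content"])
```

Base64 lets binary audio travel inside a JSON response at the cost of roughly one third more bytes, which is usually acceptable for short synthesized clips.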
## Design Decisions
### Why PostgreSQL with JSONB?
- Single database simplifies deployment
- JSONB supports flexible document storage for segments
- SQL for relational queries (users, files)
- Full-text search capability
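
As an illustration of the flexible segment storage mentioned above, the following sketch shows what a segments document might look like before it is written to a JSONB column. The field names (`start`, `end`, `text`, `words`) and both helpers are assumptions, not the project's actual schema.

```python
import json
from typing import Any, Dict, List

# Hypothetical segment shape: nested word timings live inside each segment,
# which is awkward to model as flat relational rows.
SEGMENTS: List[Dict[str, Any]] = [
    {
        "start": 0.0,
        "end": 2.4,
        "text": "Hello and welcome.",
        "words": [
            {"word": "Hello", "start": 0.0, "end": 0.5},
            {"word": "and", "start": 0.6, "end": 0.8},
            {"word": "welcome.", "start": 0.9, "end": 2.4},
        ],
    },
    {"start": 2.4, "end": 5.1, "text": "Let's get started.", "words": []},
]


def to_jsonb_param(segments: List[Dict[str, Any]]) -> str:
    """Serialize segments to compact JSON text, ready to bind as a parameter."""
    return json.dumps(segments, separators=(",", ":"))


def segments_between(
    segments: List[Dict[str, Any]], start: float, end: float
) -> List[Dict[str, Any]]:
    """Return segments whose time span overlaps the interval [start, end]."""
    return [s for s in segments if s["start"] < end and s["end"] > start]
```

Storing the whole list in one column keeps a transcript and its timing data together in a single row, while user and file records stay in ordinary relational tables.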
### Why Streamlit?
- Rapid development for data apps
- Built-in components for audio
- Easy deployment
- Python-native (no JS required)
### Why Google Cloud APIs?
- Industry-leading accuracy
- 100+ languages supported
- 200+ voice options
- Generous free tier
## Security Considerations
- Secrets via environment variables
- HTTPS in production
- JWT for authentication
- Per-user data isolation
- Temporary file cleanup
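
The first point above (secrets via environment variables) can be sketched as a small settings loader that fails fast when a required value is missing. The variable names `DATABASE_URL`, `REDIS_URL`, and `JWT_SECRET_KEY`, the `Settings` class, and `load_settings` are assumptions for illustration, not the project's actual configuration.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Hypothetical variable names; the real deployment may use different ones.
REQUIRED_VARS = ("DATABASE_URL", "REDIS_URL", "JWT_SECRET_KEY")


@dataclass(frozen=True)
class Settings:
    database_url: str
    redis_url: str
    jwt_secret_key: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read required secrets from the environment, failing fast if any is unset."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        # Report only the variable names, never their values.
        raise RuntimeError(f"missing required environment variables: {', '.join(missing)}")
    return Settings(
        database_url=env["DATABASE_URL"],
        redis_url=env["REDIS_URL"],
        jwt_secret_key=env["JWT_SECRET_KEY"],
    )
```

Passing the environment in as a mapping keeps the loader easy to test, and checking everything at startup surfaces a misconfigured deployment before the first request arrives.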
## Deployment Options
### Local Development
```bash
# Backend
cd backend
uvicorn app.main:app --reload
# Frontend
cd frontend
streamlit run streamlit_app.py
```
### Docker Compose
```bash
docker-compose -f deploy/docker/docker-compose.dev.yml up
```
### Production
- Deploy to any container orchestrator
- Use managed PostgreSQL (Cloud SQL, RDS)
- Use managed Redis (Memorystore, ElastiCache)
- Load balance with Nginx/Cloud LB