# VoiceForge Architecture

## Overview

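VoiceForge is a Speech-to-Text and Text-to-Speech application built on a Python stack: a Streamlit frontend, a FastAPI backend, and Celery workers, backed by PostgreSQL and Redis and using the Google Cloud speech APIs. This document describes the system architecture and key design decisions.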

## System Architecture

```
┌───────────────────────────────────────────────────────────────────┐
│                         Load Balancer                             │
│                      (Nginx / Cloud LB)                           │
└───────────────────────────┬───────────────────────────────────────┘
                            │
            ┌───────────────┼───────────────┐
            │               │               │
     ┌──────▼──────┐ ┌──────▼─────┐ ┌───────▼────┐
     │   Frontend  │ │   Backend  │ │   Worker   │
     │  Streamlit  │ │   FastAPI  │ │   Celery   │
     │   :8501     │ │   :8000    │ │            │
     └──────┬──────┘ └──────┬─────┘ └───────┬────┘
            │               │               │
     ┌──────▼───────────────▼───────────────▼───────┐
     │              Service Layer                   │
     │  ┌─────────┐ ┌─────────┐ ┌─────────────────┐ │
     │  │   STT   │ │   TTS   │ │   File Service  │ │
     │  │ Service │ │ Service │ │                 │ │
     │  └────┬────┘ └────┬────┘ └────────┬────────┘ │
     │  ┌─────────┐ ┌─────────┐          │          │
     │  │   NLP   │ │  Export │          │          │
     │  │ Service │ │ Service │          │          │
     │  └────┬────┘ └────┬────┘          │          │
     └───────┼───────────┼───────────────┼──────────┘
             │           │               │
     ┌───────▼───────────▼───────────────▼──────────┐
     │              Data Layer                      │
     │  ┌──────────┐  ┌───────┐  ┌───────────────┐  │
     │  │PostgreSQL│  │ Redis │  │  File Storage │  │
     │  │  :5432   │  │ :6379 │  │    /uploads   │  │
     │  └──────────┘  └───────┘  └───────────────┘  │
     └──────────────────────────────────────────────┘
             │
     ┌───────▼──────────────────────────────────────┐
     │           External APIs                      │
     │  ┌─────────────────┐  ┌──────────────────┐   │
     │  │ Google Cloud    │  │  Google Cloud    │   │
     │  │ Speech-to-Text  │  │  Text-to-Speech  │   │
     │  └─────────────────┘  └──────────────────┘   │
     └──────────────────────────────────────────────┘
```

## Components

### Frontend (Streamlit)

- **Purpose**: Web interface for users
- **Technology**: Streamlit 1.31+
- **Key Features**:
  - Real-time microphone recording (WebRTC)
  - File upload with drag-and-drop
  - Audio waveform visualization
  - Transcript editing and export
  - Voice selection and preview

### Backend (FastAPI)

- **Purpose**: REST API server
- **Technology**: FastAPI 0.109+
- **Key Features**:
  - OpenAPI documentation
  - CORS middleware
  - JWT authentication (Phase 3)
  - Request validation
  - Error handling

### Worker (Celery)

- **Purpose**: Background task processing
- **Technology**: Celery 5.3+ with Redis broker
- **Key Features**:
  - Long audio file processing
  - Batch transcription
  - NLP analysis tasks

### Database (PostgreSQL)

- **Purpose**: Persistent data storage
- **Technology**: PostgreSQL 15+
- **Tables**:
  - `users` - User accounts
  - `audio_files` - Uploaded audio metadata
  - `transcripts` - Transcription results
  - `user_preferences` - User settings
  - `usage_events` - Analytics data
  - `api_keys` - Enterprise API keys

### Cache (Redis)

- **Purpose**: Caching and task queue
- **Technology**: Redis 7+
- **Use Cases**:
  - Voice list caching
  - Transcription result caching
  - Celery task queue
  - Session storage
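
As an illustration of transcription result caching, the sketch below derives a cache key from the audio bytes and language code, then checks the cache before calling the recognizer. The key layout, the `transcribe_cached` helper, and the dict-backed `InMemoryCache` (a stand-in for Redis so the example runs without a server) are assumptions for illustration, not VoiceForge's actual code.

```python
import hashlib
from typing import Callable, Dict, Optional


class InMemoryCache:
    """Dict-backed stand-in for Redis (hypothetical; exposes only get/set)."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._store.get(key)

    def set(self, key: str, value: str) -> None:
        self._store[key] = value


def transcript_cache_key(audio: bytes, language: str) -> str:
    """Build a key from a content hash so identical uploads share one entry."""
    digest = hashlib.sha256(audio).hexdigest()
    return f"stt:{language}:{digest}"  # key layout is an assumption


def transcribe_cached(
    audio: bytes,
    language: str,
    cache: InMemoryCache,
    recognize: Callable[[bytes, str], str],
) -> str:
    """Return a cached transcript if present, otherwise recognize and store it."""
    key = transcript_cache_key(audio, language)
    cached = cache.get(key)
    if cached is not None:
        return cached
    text = recognize(audio, language)
    cache.set(key, text)
    return text
```

Hashing the content rather than the filename means a re-upload of the same audio under a different name can still hit the cache. With a real Redis client, an expiry on each entry would keep stale transcripts from accumulating.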

### Observability (Prometheus)

- **Purpose**: Application monitoring
- **Technology**: prometheus-fastapi-instrumentator
- **Key Metrics**:
  - Request latency and throughput
  - Error rates
  - Endpoint usage statistics

## Data Flow

### Speech-to-Text Flow

```
1. User uploads audio file
2. Frontend sends to /api/v1/stt/upload
3. Backend validates file format and size
4. File saved to storage
5. STT Service calls Google Cloud Speech API
6. Results processed (words, segments, timestamps)
7. Transcript saved to database
8. Response returned to frontend
```
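
Step 3 can be sketched as a small validation function. The allowed extensions and the 25 MB limit are placeholder values chosen for illustration; this document does not state VoiceForge's actual limits, and `validate_upload` and `UploadError` are hypothetical names.

```python
from pathlib import Path

# Hypothetical limits; the real values are not given in this document.
ALLOWED_EXTENSIONS = {".wav", ".mp3", ".flac", ".ogg"}
MAX_UPLOAD_BYTES = 25 * 1024 * 1024


class UploadError(ValueError):
    """Raised when an uploaded file fails validation."""


def validate_upload(filename: str, size_bytes: int) -> str:
    """Return the normalized extension, or raise UploadError."""
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise UploadError(f"unsupported format: {ext or '(none)'}")
    if size_bytes <= 0:
        raise UploadError("file is empty")
    if size_bytes > MAX_UPLOAD_BYTES:
        raise UploadError(f"file exceeds {MAX_UPLOAD_BYTES} bytes")
    return ext
```

An extension check alone is easy to spoof; inspecting the file header would be a stricter variant of the same step.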

### Text-to-Speech Flow

```
1. User enters text
2. Frontend sends to /api/v1/tts/synthesize
3. Backend validates text and voice
4. TTS Service calls Google Cloud TTS API
5. Audio returned as base64
6. Frontend plays/downloads audio
```
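
Steps 5 and 6 amount to a base64 round trip. The sketch below shows one way the backend might wrap synthesized bytes in a JSON body and how a client would recover them. The `audio_base64` and `content_type` field names and both function names are assumptions, not the documented response schema.

```python
import base64
import json


def build_tts_response(audio: bytes, content_type: str = "audio/mpeg") -> str:
    """Encode raw audio bytes into a JSON body (field names are hypothetical)."""
    payload = {
        "content_type": content_type,
        "audio_base64": base64.b64encode(audio).decode("ascii"),
    }
    return json.dumps(payload)


def decode_tts_response(body: str) -> bytes:
    """Recover the raw audio bytes on the client side."""
    payload = json.loads(body)
    return base64.b64decode(payload["audio_base64"])
```

Base64 emits four output characters for every three input bytes, so the encoded audio is roughly a third larger than the raw bytes. That overhead is the price of carrying binary data inside a JSON response.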

## Design Decisions

### Why PostgreSQL with JSONB?

- Single database simplifies deployment
- JSONB supports flexible document storage for segments
- SQL for relational queries (users, files)
- Full-text search capability
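
To make the JSONB point concrete, the sketch below serializes transcript segments into the kind of JSON document that could be stored in a JSONB column. The `Segment` fields, the `segments` and `id` column names, and the query are assumptions; this document only says that segments are stored as flexible documents in the `transcripts` table.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Segment:
    """One transcript segment (field names are hypothetical)."""

    start: float  # seconds from the beginning of the audio
    end: float  # seconds from the beginning of the audio
    text: str


def segments_to_json(segments: List[Segment]) -> str:
    """Serialize segments to a JSON array suitable for a JSONB parameter."""
    return json.dumps([asdict(s) for s in segments])


# Example query against an assumed `segments` JSONB column. The @> operator
# tests JSONB containment; %s is a psycopg-style placeholder that would be
# bound to a JSON string such as '[{"text": "hello"}]'.
FIND_BY_SEGMENT_TEXT = """
SELECT id FROM transcripts
WHERE segments @> %s::jsonb
"""
```

Keeping segments in one JSONB column avoids a separate row per segment while still allowing containment queries on the document.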

### Why Streamlit?

- Rapid development for data apps
- Built-in components for audio
- Easy deployment
- Python-native (no JS required)

### Why Google Cloud APIs?

- Industry-leading accuracy
- 100+ languages supported
- 200+ voice options
- Generous free tier

## Security Considerations

- Secrets via environment variables
- HTTPS in production
- JWT for authentication
- Per-user data isolation
- Temporary file cleanup
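
Two of these points lend themselves to a short sketch: reading secrets from the environment and cleaning up temporary files. Both helper functions are hypothetical, and the variable name used in the usage note is an example rather than VoiceForge's actual configuration.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator


def require_env(name: str) -> str:
    """Read a required secret from the environment, failing fast if unset."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"missing required environment variable: {name}")
    return value


@contextmanager
def temporary_audio_file(data: bytes, suffix: str = ".wav") -> Iterator[str]:
    """Write data to a temp file and delete it when the block exits."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        yield path
    finally:
        # Runs on success and on error, so no audio is left behind on disk.
        if os.path.exists(path):
            os.remove(path)
```

A call such as `require_env("JWT_SECRET")` at startup surfaces a missing secret immediately instead of at the first authenticated request.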

## Deployment Options

### Local Development

```bash
# Backend
cd backend
uvicorn app.main:app --reload

# Frontend
cd frontend
streamlit run streamlit_app.py
```

### Docker Compose

```bash
docker-compose -f deploy/docker/docker-compose.dev.yml up
```

### Production

- Deploy to any container orchestrator
- Use managed PostgreSQL (Cloud SQL, RDS)
- Use managed Redis (Memorystore, ElastiCache)
- Load balance with Nginx/Cloud LB