# Voice Tech for All: Technical Report
## Multi-lingual Text-to-Speech System with Style Transfer
**Hackathon**: Voice Tech for All
**Date**: December 2025
---
## Executive Summary
We present a **multi-lingual Text-to-Speech (TTS) system** supporting **11 Indian languages** with **style/prosody control** capabilities. The system is designed for deployment as a healthcare assistant for pregnant mothers in low-income communities, making health information accessible in native languages.
### Key Achievements
| Metric | Value |
| ---------------------- | ----------------------------------------------------------------------------------------------------------- |
| Languages Supported | 11 (Hindi, Bengali, Marathi, Telugu, Kannada, Bhojpuri, Chhattisgarhi, Maithili, Magahi, English, Gujarati) |
| Voice Variants | 21 (male + female for 10 languages, plus one neutral Gujarati voice) |
| Style Presets | 9 (default, slow, fast, soft, loud, happy, sad, calm, excited) |
| Average Inference Time | ~0.3s (CPU, Apple M2 Pro) |
| Model Size | ~300MB per voice (VITS), ~145MB (MMS) |
| API Latency | <500ms for typical sentences |
---
## 1. System Architecture
### 1.1 Overview
```
┌─────────────────────────────────────────────────────────────┐
│                  REST API Server (FastAPI)                  │
├─────────────────────────────────────────────────────────────┤
│  ┌────────────┐  ┌──────────────┐  ┌─────────────────────┐  │
│  │/synthesize │  │   /voices    │  │       /styles       │  │
│  │  /stream   │  │  /languages  │  │       /health       │  │
│  └────────────┘  └──────────────┘  └─────────────────────┘  │
├─────────────────────────────────────────────────────────────┤
│                         TTS Engine                          │
│ ┌─────────────────┐   ┌─────────────────┐   ┌─────────────┐ │
│ │ Text Normalizer │ → │    Tokenizer    │ → │  VITS/MMS   │ │
│ │(Indian scripts) │   │  (char-to-ID)   │   │  Inference  │ │
│ └─────────────────┘   └─────────────────┘   └─────────────┘ │
│                              ↓                              │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │            Style Processor (Prosody Control)            │ │
│ │  • Pitch Shifting (librosa)                             │ │
│ │  • Time Stretching (speed control)                      │ │
│ │  • Energy/Volume Modification                           │ │
│ └─────────────────────────────────────────────────────────┘ │
├─────────────────────────────────────────────────────────────┤
│                      Model Repository                       │
│  ┌────────────────────┐   ┌──────────────────────────────┐  │
│  │ SYSPIN VITS Models │   │     Facebook MMS Models      │  │
│  │   (10 languages)   │   │          (Gujarati)          │  │
│  └────────────────────┘   └──────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
```
### 1.2 Component Details
#### Text Normalizer
- Handles Indian script peculiarities
- Converts number notations: `{100}{एकसौ}` → `एकसौ`
- Normalizes punctuation across scripts
- Handles code-switching (Hindi in English text)
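
The number-notation rule above can be sketched with a single regular expression. This is a minimal illustration, not the project's normalizer: the function name `expand_number_notation` is hypothetical, the English spoken form is a placeholder, and it assumes the paired `{written}{spoken}` braces are never nested.

```python
import re

# Matches the paired notation {written form}{spoken form}, e.g. {100}{one hundred}.
# Assumption: braces are never nested inside either part.
_PAIR = re.compile(r"\{([^{}]*)\}\{([^{}]*)\}")


def expand_number_notation(text: str) -> str:
    """Replace each {written}{spoken} pair with its spoken form only."""
    return _PAIR.sub(lambda match: match.group(2), text)


if __name__ == "__main__":
    print(expand_number_notation("pay {100}{one hundred} now"))
```

Text without the paired notation passes through unchanged, so the step is safe to run on every input.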
#### VITS Models (SYSPIN)
- **Architecture**: Conditional Variational Autoencoder with Adversarial Learning
- **Training Data**: 20-30 hours per speaker from IISc Bangalore
- **Output**: 22050 Hz, 16-bit PCM
- **Languages**: Hindi, Bengali, Marathi, Telugu, Kannada, Bhojpuri, Chhattisgarhi, Maithili, Magahi, English
#### MMS Model (Facebook)
- **Architecture**: VITS-based, trained on MMS corpus
- **Output**: 16000 Hz
- **Languages**: Gujarati (and 1100+ others available)
- **Model Size**: 145MB
#### Style Processor
- **Pitch Shifting**: Using librosa phase vocoder
- **Time Stretching**: Phase vocoder via librosa (`librosa.effects.time_stretch`)
- **Energy Control**: Soft clipping with tanh for natural sound
---
## 2. API Specification
### 2.1 Endpoints
| Endpoint | Method | Description |
| -------------------- | ------ | -------------------------------- |
| `/` | GET | API info and documentation links |
| `/health` | GET | System health and loaded models |
| `/voices` | GET | List all available voices |
| `/languages` | GET | List supported languages |
| `/styles` | GET | List style presets |
| `/synthesize` | POST | Generate speech from text |
| `/synthesize/get` | GET | Simple synthesis (for testing) |
| `/synthesize/stream` | POST | Streaming audio response |
| `/preload` | POST | Preload voice into memory |
| `/batch` | POST | Batch synthesis |
### 2.2 Synthesis Request
```json
{
  "text": "નમસ્તે, હું તમારી કેવી રીતે મદદ કરી શકું?",
  "voice": "gu_mms",
  "speed": 1.0,
  "pitch": 1.0,
  "energy": 1.0,
  "style": "calm",
  "normalize": true
}
```
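
For callers without a dedicated client, the request body can be built with the Python standard library alone. The sketch below only constructs the request; the helper name `build_synthesis_request` is hypothetical, and sending it assumes a server is running at the given base URL.

```python
import json
import urllib.request


def build_synthesis_request(base_url: str, text: str, voice: str, **options) -> urllib.request.Request:
    """Build a POST request for /synthesize; extra fields (style, speed, ...) go in options."""
    payload = {"text": text, "voice": voice, **options}
    # ensure_ascii=False keeps Indic text readable in the encoded body.
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/synthesize",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_synthesis_request("http://localhost:8000", "नमस्ते", "hi_male", style="calm")

# Sending requires a running server, so it is left commented out:
# with urllib.request.urlopen(req) as resp, open("speech.wav", "wb") as out:
#     out.write(resp.read())
```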
### 2.3 Style Presets
| Preset | Speed | Pitch | Energy | Use Case |
| ------- | ----- | ----- | ------ | ---------------------- |
| default | 1.0 | 1.0 | 1.0 | Normal speech |
| slow | 0.75 | 1.0 | 1.0 | Elderly users, clarity |
| fast | 1.25 | 1.0 | 1.0 | Quick information |
| soft | 0.9 | 0.95 | 0.7 | Calming content |
| loud | 1.0 | 1.05 | 1.3 | Alerts, emphasis |
| happy | 1.1 | 1.1 | 1.2 | Positive messages |
| sad | 0.85 | 0.9 | 0.8 | Empathetic responses |
| calm | 0.9 | 0.95 | 0.85 | Healthcare guidance |
| excited | 1.2 | 1.15 | 1.3 | Celebrations |
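
The presets above map naturally onto a small lookup table. The sketch below copies the values from the table; the names `Prosody`, `STYLE_PRESETS`, and `resolve_prosody` are hypothetical, and the rule that an explicit request value replaces the preset value (rather than multiplying it) is an assumption, since the report does not say how the two combine.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Prosody:
    speed: float = 1.0
    pitch: float = 1.0
    energy: float = 1.0


# Values copied from the Style Presets table.
STYLE_PRESETS = {
    "default": Prosody(1.0, 1.0, 1.0),
    "slow": Prosody(0.75, 1.0, 1.0),
    "fast": Prosody(1.25, 1.0, 1.0),
    "soft": Prosody(0.9, 0.95, 0.7),
    "loud": Prosody(1.0, 1.05, 1.3),
    "happy": Prosody(1.1, 1.1, 1.2),
    "sad": Prosody(0.85, 0.9, 0.8),
    "calm": Prosody(0.9, 0.95, 0.85),
    "excited": Prosody(1.2, 1.15, 1.3),
}


def resolve_prosody(
    style: str = "default",
    speed: Optional[float] = None,
    pitch: Optional[float] = None,
    energy: Optional[float] = None,
) -> Prosody:
    """Start from the preset, then apply any explicitly supplied values (assumed override)."""
    base = STYLE_PRESETS[style]  # raises KeyError for an unknown preset name
    given = {"speed": speed, "pitch": pitch, "energy": energy}
    overrides = {name: value for name, value in given.items() if value is not None}
    return replace(base, **overrides)


print(resolve_prosody("calm", speed=1.0))
```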
---
## 3. Supported Languages
| Language | Code | Voices | Model Type | Sample Rate |
| ------------- | ---- | ------------ | ------------ | ----------- |
| Hindi | hi | Male, Female | SYSPIN VITS | 22050 Hz |
| Bengali | bn | Male, Female | SYSPIN VITS | 22050 Hz |
| Marathi | mr | Male, Female | SYSPIN VITS | 22050 Hz |
| Telugu | te | Male, Female | SYSPIN VITS | 22050 Hz |
| Kannada | kn | Male, Female | SYSPIN VITS | 22050 Hz |
| Bhojpuri | bho | Male, Female | SYSPIN VITS | 22050 Hz |
| Chhattisgarhi | hne | Male, Female | SYSPIN VITS | 22050 Hz |
| Maithili | mai | Male, Female | SYSPIN VITS | 22050 Hz |
| Magahi | mag | Male, Female | SYSPIN VITS | 22050 Hz |
| English | en | Male, Female | SYSPIN VITS | 22050 Hz |
| Gujarati | gu | Neutral | Facebook MMS | 16000 Hz |
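
The table implies 21 voices: two per SYSPIN language plus one Gujarati MMS voice. The sketch below enumerates them. The IDs `hi_male`, `hi_female`, and `gu_mms` appear elsewhere in this report; extending the same `<code>_<gender>` pattern to the other languages is an assumption, and `list_voice_ids` is a hypothetical helper.

```python
from typing import List

# Language codes from the table above (SYSPIN VITS models only).
SYSPIN_LANGUAGE_CODES = ["hi", "bn", "mr", "te", "kn", "bho", "hne", "mai", "mag", "en"]


def list_voice_ids() -> List[str]:
    """Enumerate voice IDs, assuming the <code>_<gender> naming pattern."""
    voice_ids = [
        f"{code}_{gender}"
        for code in SYSPIN_LANGUAGE_CODES
        for gender in ("male", "female")
    ]
    voice_ids.append("gu_mms")  # single neutral Gujarati voice (Facebook MMS)
    return voice_ids


print(len(list_voice_ids()))
```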
---
## 4. Implementation Details
### 4.1 Technology Stack
| Component | Technology |
| ----------------- | ---------------------------------------- |
| Backend Framework | FastAPI |
| ML Framework | PyTorch |
| TTS Models | VITS (Coqui AI / SYSPIN), MMS (Facebook) |
| Audio Processing | librosa, soundfile, scipy |
| Model Hub | Hugging Face Hub |
| API Documentation | OpenAPI/Swagger |
### 4.2 Model Architecture - VITS
VITS, a conditional variational autoencoder trained with adversarial learning, was chosen for:
- **End-to-End Efficiency**: Combines acoustic modeling and vocoding in a single pass
- **High Quality**: Natural-sounding speech comparable to two-stage systems
- **Multi-Speaker Support**: Supports different speakers via embeddings
- **Fast Inference**: TorchScript JIT compilation for speed
### 4.3 Style/Accent Transfer Implementation
Our style transfer uses a **post-processing** approach for simplicity and reliability:
1. **Pitch Shifting**: Phase vocoder via librosa
```python
semitones = 12 * np.log2(pitch_factor)
shifted = librosa.effects.pitch_shift(audio, sr=sr, n_steps=semitones)
```
2. **Time Stretching**: Phase vocoder via librosa
```python
stretched = librosa.effects.time_stretch(audio, rate=speed_factor)
```
3. **Energy Control**: Soft clipping for natural sound
```python
modified = audio * energy_factor
if energy_factor > 1.0:
    modified = np.tanh(modified * 2) * 0.95  # Soft clip: keeps peaks within ±0.95
```
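
To make the arithmetic in steps 1 and 3 concrete, the sketch below reproduces both formulas on single values with the standard library only. The function names are hypothetical; the formulas themselves are the ones shown above.

```python
import math


def pitch_factor_to_semitones(pitch_factor: float) -> float:
    """Convert a frequency ratio to semitones: 12 * log2(ratio)."""
    if pitch_factor <= 0:
        raise ValueError("pitch_factor must be positive")
    return 12.0 * math.log2(pitch_factor)


def apply_energy(sample: float, energy_factor: float) -> float:
    """Scale one sample; above unity gain, soft-clip with tanh as in step 3."""
    scaled = sample * energy_factor
    if energy_factor > 1.0:
        return math.tanh(scaled * 2) * 0.95
    return scaled


for factor in (0.9, 1.0, 1.15, 2.0):
    print(factor, round(pitch_factor_to_semitones(factor), 2))
```

A pitch factor of 2.0 is one octave (12 semitones), and 1.0 leaves pitch unchanged. One thing the energy formula implies: because the input is doubled inside `tanh`, quiet samples come out at roughly 1.9 times their scaled value whenever `energy_factor` exceeds 1.0, so the perceived boost is larger than the factor alone suggests.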
### 4.4 Key Design Decisions
1. **TorchScript Models**: JIT-compiled for faster inference
2. **Lazy Loading**: Models loaded on-demand to minimize memory
3. **CPU Fallback**: Falls back to CPU inference where Apple Silicon MPS is not fully supported
4. **Streaming Support**: Progressive audio delivery for real-time apps
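
Lazy loading (decision 2) can be sketched as a small cache that loads a voice on first use and evicts the least recently used one when a limit is reached. This is an illustration under stated assumptions, not the engine's code: `LazyVoiceCache`, the `loader` callable, and the `max_loaded` limit are all hypothetical.

```python
from collections import OrderedDict
from typing import Any, Callable, List


class LazyVoiceCache:
    """Load voice models on first request and keep at most max_loaded in memory."""

    def __init__(self, loader: Callable[[str], Any], max_loaded: int = 2) -> None:
        if max_loaded < 1:
            raise ValueError("max_loaded must be at least 1")
        self._loader = loader
        self._max_loaded = max_loaded
        self._models: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, voice: str) -> Any:
        if voice in self._models:
            # Mark as most recently used and reuse the loaded model.
            self._models.move_to_end(voice)
            return self._models[voice]
        model = self._loader(voice)  # the expensive step happens only here
        self._models[voice] = model
        if len(self._models) > self._max_loaded:
            self._models.popitem(last=False)  # evict the least recently used voice
        return model

    def loaded(self) -> List[str]:
        return list(self._models)


# Stand-in loader for demonstration; a real loader would read model weights.
cache = LazyVoiceCache(loader=lambda voice: f"model:{voice}", max_loaded=2)
cache.get("hi_male")
cache.get("hi_female")
cache.get("gu_mms")
print(cache.loaded())
```

With models around 300MB each, bounding the number held in memory is what keeps the footprint predictable; the eviction policy is a design choice and could equally be based on memory usage.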
---
## 5. Usage Examples
### 5.1 Python API
```python
from src.engine import TTSEngine

# Initialize engine
engine = TTSEngine(device="auto")

# Basic synthesis
output = engine.synthesize(
    text="गर्भावस्था में स्वस्थ आहार बहुत महत्वपूर्ण है",
    voice="hi_female",
)

# With style control
output = engine.synthesize(
    text="आपका दिन शुभ हो",
    voice="hi_male",
    style="happy",
    pitch=1.1,
)

# Gujarati
output = engine.synthesize(
    text="સ્વસ્થ રહો, ખુશ રહો",
    voice="gu_mms",
    style="calm",
)
```
### 5.2 REST API
```bash
# Basic synthesis
curl -X POST "http://localhost:8000/synthesize" \
  -H "Content-Type: application/json" \
  -d '{"text": "नमस्ते", "voice": "hi_male"}' \
  --output speech.wav

# With style
curl -X POST "http://localhost:8000/synthesize" \
  -H "Content-Type: application/json" \
  -d '{"text": "आपका स्वागत है", "voice": "hi_female", "style": "happy"}' \
  --output welcome.wav

# Gujarati
curl -X POST "http://localhost:8000/synthesize" \
  -H "Content-Type: application/json" \
  -d '{"text": "નમસ્તે", "voice": "gu_mms"}' \
  --output gujarati.wav
```
### 5.3 Command Line
```bash
# Download models
python -m src.cli download --voice hi_male
python -m src.cli download --lang hi # All Hindi voices
# Synthesize
python -m src.cli synthesize --text "नमस्ते" --voice hi_male --output hello.wav
# Start server
python -m src.cli serve --port 8000
```
---
## 6. Healthcare Use Case
### 6.1 Target Application
The TTS system is designed for integration with an **LLM-based healthcare assistant** for pregnant mothers in low-income communities.
### 6.2 Key Features for Healthcare
1. **Multi-lingual Support**: Information in native languages
2. **Calm Style Preset**: Reassuring tone for medical guidance
3. **Slow Speed Option**: Clear pronunciation for instructions
4. **Low Latency**: Real-time conversational responses
### 6.3 Example Healthcare Dialogue
```
User: "ગર્ભાવસ્થામાં શું ખાવું જોઈએ?"
System Response (TTS with calm style in Gujarati):
"ગર્ભાવસ્થામાં તમારે પ્રોટીન, આયર્ન અને ફોલિક એસિડથી ભરપૂર
ખોરાક લેવો જોઈએ. દાળ, પાલક, ઈંડા અને દૂધ સારા વિકલ્પો છે."
```
---
## 7. Performance Benchmarks
| Test | Time | Notes |
| ----------------------- | ----- | ---------------------------------- |
| Hindi synthesis (short) | 0.25s | "नमस्ते" |
| Hindi synthesis (long) | 0.45s | 50-word sentence |
| Gujarati MMS | 0.35s | First load additionally incurs a model download |
| Style processing | +0.1s | Pitch + speed adjustment |
| API round-trip | 0.5s | Including network overhead |
Hardware: Apple M2 Pro, 16GB RAM, CPU inference
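
Timings like these can be collected with a small harness around any callable. The sketch below is one way to do it, not the procedure used for the table: `time_call` is a hypothetical helper, and the warm-up run is an assumption intended to keep one-time costs such as model loading out of the measurement.

```python
import statistics
import time
from typing import Any, Callable, Dict


def time_call(fn: Callable[..., Any], *args: Any, runs: int = 5, warmup: int = 1, **kwargs: Any) -> Dict[str, float]:
    """Time fn(*args, **kwargs) over several runs and summarize in seconds."""
    for _ in range(warmup):
        fn(*args, **kwargs)  # not measured: absorbs one-time setup costs
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args, **kwargs)
        samples.append(time.perf_counter() - start)
    return {
        "mean_s": statistics.mean(samples),
        "min_s": min(samples),
        "max_s": max(samples),
    }


# Stand-in workload; replace with a synthesis call to benchmark the engine.
stats = time_call(sum, range(10_000), runs=3)
print(stats)
```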
---
## 8. Deployment
### 8.1 Quick Start
```bash
# Clone repository
git clone https://github.com/harshil748/VoiceAPI
cd VoiceAPI
# Setup environment
python3 -m venv tts
source tts/bin/activate
pip install -r requirements.txt
# Download a model
python -m src.cli download --voice hi_male
# Start server
python -m src.cli serve --port 8000
```
### 8.2 Docker
```dockerfile
FROM python:3.10-slim
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt
RUN python -m src.cli download --lang hi
EXPOSE 8000
CMD ["python", "-m", "src.cli", "serve"]
```
---
## 9. Limitations and Future Work
### 9.1 Current Limitations
1. **Model Size**: Each VITS model is ~300MB
2. **MPS Compatibility**: Apple Silicon MPS not fully supported
3. **Real-time Streaming**: Limited to sentence-level
4. **Gujarati Gender**: MMS has only neutral voice
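
Sentence-level streaming, as described in limitation 3, amounts to splitting the input and synthesizing one sentence at a time. The sketch below illustrates the idea with hypothetical names (`split_sentences`, `stream_chunks`); the `synthesize` callable is a stand-in for the engine, and the set of sentence-ending marks (ASCII punctuation plus the danda and double danda) is an assumption.

```python
import re
from typing import Callable, Iterator, List

# Split after . ! ? or the danda (U+0964) / double danda (U+0965) when followed by whitespace.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?\u0964\u0965])\s+")


def split_sentences(text: str) -> List[str]:
    """Return non-empty sentences, keeping their closing punctuation."""
    return [part.strip() for part in _SENTENCE_BOUNDARY.split(text) if part.strip()]


def stream_chunks(text: str, synthesize: Callable[[str], bytes]) -> Iterator[bytes]:
    """Yield one audio chunk per sentence so playback can start before the rest is ready."""
    for sentence in split_sentences(text):
        yield synthesize(sentence)


# Stand-in synthesizer that returns the encoded text instead of audio.
for chunk in stream_chunks("Eat well. Rest often! Stay calm?", lambda s: s.encode("utf-8")):
    print(chunk)
```

Latency to first audio then depends on the length of the first sentence rather than the whole response, which is why finer-grained streaming is listed as future work.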
### 9.2 Future Improvements
1. **Model Quantization**: INT8 for smaller size
2. **Voice Cloning**: Reference audio-based synthesis
3. **SSML Support**: Markup language for fine control
4. **More Languages**: Odia, Assamese, Punjabi
5. **Fine-tuning**: Custom voice training on SPICOR data
---
## 10. Credits
### Model Sources
| Source | Models | License |
| ----------------------- | --------------------- | ------------ |
| SYSPIN (IISc Bangalore) | VITS for 10 languages | CC BY 4.0 |
| Facebook MMS | Gujarati VITS | CC BY-NC 4.0 |
### Dataset
- **SPICOR TTS Project**: IISc SPIRE Lab, Bangalore
- **Audio Quality**: 48kHz, 24-bit, mono
### Frameworks
- Coqui TTS, Hugging Face Transformers, FastAPI, librosa
---
## 11. Conclusion
We have developed a comprehensive multi-lingual TTS system that:
- Supports **11 Indian languages** with 21 voice variants
- Provides **9 style presets** for prosody control
- Offers a **REST API** with OpenAPI documentation
- Achieves **<500ms latency** for typical sentences
- Is **production-ready** with proper error handling
The system is well-suited for the healthcare assistant use case, providing clear, natural-sounding speech in native languages to help pregnant mothers access healthcare information.
---
**Repository**: https://github.com/harshil748/VoiceAPI
**API Documentation**: http://localhost:8000/docs