# Voice Tech for All: Technical Report

## Multi-lingual Text-to-Speech System with Style Transfer

**Hackathon**: Voice Tech for All  
**Date**: December 2025

---

## Executive Summary

We present a **multi-lingual Text-to-Speech (TTS) system** supporting **11 Indian languages** with **style/prosody control** capabilities. The system is designed for deployment as a healthcare assistant for pregnant mothers in low-income communities, making health information accessible in native languages.

### Key Achievements

| Metric                 | Value                                                                                                       |
| ---------------------- | ----------------------------------------------------------------------------------------------------------- |
| Languages Supported    | 11 (Hindi, Bengali, Marathi, Telugu, Kannada, Bhojpuri, Chhattisgarhi, Maithili, Magahi, English, Gujarati) |
| Voice Variants         | 21 (male and female for 10 languages, one neutral voice for Gujarati)                                       |
| Style Presets          | 9 (default, slow, fast, soft, loud, happy, sad, calm, excited)                                              |
| Average Inference Time | ~0.3s (CPU, Apple M2 Pro)                                                                                   |
| Model Size             | ~300MB per voice (VITS), ~145MB (MMS)                                                                       |
| API Latency            | <500ms for typical sentences                                                                                |

---

## 1. System Architecture

### 1.1 Overview

```
┌──────────────────────────────────────────────────────────────┐
│                    REST API Server (FastAPI)                 │
├──────────────────────────────────────────────────────────────┤
│  ┌───────────┐  ┌─────────────┐  ┌─────────────────────────┐ │
│  │/synthesize│  │ /voices     │  │ /styles                 │ │
│  │ /stream   │  │ /languages  │  │ /health                 │ │
│  └───────────┘  └─────────────┘  └─────────────────────────┘ │
├──────────────────────────────────────────────────────────────┤
│                      TTS Engine                              │
│  ┌─────────────────┐  ┌─────────────────┐  ┌───────────────┐ │
│  │ Text Normalizer │→ │ Tokenizer       │→ │ VITS/MMS      │ │
│  │ (Indian scripts)│  │ (char-to-ID)    │  │ Inference     │ │
│  └─────────────────┘  └─────────────────┘  └───────────────┘ │
│                              ↓                               │
│  ┌─────────────────────────────────────────────────────────┐ │
│  │              Style Processor (Prosody Control)          │ │
│  │  • Pitch Shifting (librosa)                             │ │
│  │  • Time Stretching (speed control)                      │ │
│  │  • Energy/Volume Modification                           │ │
│  └─────────────────────────────────────────────────────────┘ │
├──────────────────────────────────────────────────────────────┤
│                    Model Repository                          │
│  ┌────────────────────┐  ┌─────────────────────────────────┐ │
│  │ SYSPIN VITS Models │  │ Facebook MMS Models             │ │
│  │ (10 languages)     │  │ (Gujarati)                      │ │
│  └────────────────────┘  └─────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘
```

### 1.2 Component Details

#### Text Normalizer

- Handles Indian script peculiarities
- Converts number notations: `{100}{एकसो}` → `एकसो`
- Normalizes punctuation across scripts
- Handles code-switching (Hindi in English text)
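
The number-notation rule above can be sketched with a regular expression. This is a minimal illustration, not the project's normalizer: the `{digits}{spoken form}` pattern is inferred from the single example in this report, `normalize_text` is a hypothetical name, and punctuation and code-switching handling are left out.

```python
import re
import unicodedata

# Matches the "{digits}{spoken form}" annotation, e.g. "{100}{एकसो}".
# The pattern is an assumption inferred from the one example in the report.
_NUMBER_ANNOTATION = re.compile(r"\{\d+\}\{([^{}]+)\}")


def normalize_text(text: str) -> str:
    """Resolve number annotations to their spoken form and tidy whitespace."""
    # NFC keeps combining marks in a canonical order before tokenization.
    text = unicodedata.normalize("NFC", text)
    # Keep only the spoken form of each annotated number.
    text = _NUMBER_ANNOTATION.sub(r"\1", text)
    # Collapse whitespace runs that the substitution may leave behind.
    return re.sub(r"\s+", " ", text).strip()
```

Under these assumptions, `normalize_text("{100}{एकसो}")` yields the spoken form alone, and unannotated text passes through with only whitespace cleaned up.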

#### VITS Models (SYSPIN)

- **Architecture**: Conditional Variational Autoencoder with Adversarial Learning
- **Training Data**: 20-30 hours per speaker from IISc Bangalore
- **Output**: 22050 Hz, 16-bit PCM
- **Languages**: Hindi, Bengali, Marathi, Telugu, Kannada, Bhojpuri, Chhattisgarhi, Maithili, Magahi, English

#### MMS Model (Facebook)

- **Architecture**: VITS-based, trained on MMS corpus
- **Output**: 16000 Hz
- **Languages**: Gujarati (and 1100+ others available)
- **Model Size**: 145MB

#### Style Processor

- **Pitch Shifting**: Using librosa phase vocoder
- **Time Stretching**: Phase vocoder via `librosa.effects.time_stretch`
- **Energy Control**: Soft clipping with tanh for natural sound

---

## 2. API Specification

### 2.1 Endpoints

| Endpoint             | Method | Description                      |
| -------------------- | ------ | -------------------------------- |
| `/`                  | GET    | API info and documentation links |
| `/health`            | GET    | System health and loaded models  |
| `/voices`            | GET    | List all available voices        |
| `/languages`         | GET    | List supported languages         |
| `/styles`            | GET    | List style presets               |
| `/synthesize`        | POST   | Generate speech from text        |
| `/synthesize/get`    | GET    | Simple synthesis (for testing)   |
| `/synthesize/stream` | POST   | Streaming audio response         |
| `/preload`           | POST   | Preload voice into memory        |
| `/batch`             | POST   | Batch synthesis                  |

### 2.2 Synthesis Request

```json
{
	"text": "નમસ્તે, હું તમારી કેવી રીતે મદદ કરી શકું?",
	"voice": "gu_mms",
	"speed": 1.0,
	"pitch": 1.0,
	"energy": 1.0,
	"style": "calm",
	"normalize": true
}
```
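
A request like the one above can be sent from Python with the standard library alone. The sketch below is one possible client, not part of the project: `build_synthesis_request` and `synthesize_to_file` are hypothetical helper names, and the response body is assumed to be raw WAV bytes, as the `--output speech.wav` curl examples in Section 5.2 suggest.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/synthesize"  # host and port as used in this report


def build_synthesis_request(text: str, voice: str, **options) -> urllib.request.Request:
    """Build a POST request for /synthesize; omitted options are left to the server."""
    payload = {"text": text, "voice": voice, **options}
    return urllib.request.Request(
        API_URL,
        # ensure_ascii=False keeps Indic text readable in the encoded body.
        data=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(request: urllib.request.Request, path: str) -> None:
    """Send the request and save the response body (assumed to be WAV bytes)."""
    with urllib.request.urlopen(request, timeout=30) as response:
        with open(path, "wb") as out:
            out.write(response.read())
```

With a server running locally, `synthesize_to_file(build_synthesis_request("નમસ્તે", "gu_mms", style="calm"), "gujarati.wav")` would write the audio to disk.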

### 2.3 Style Presets

| Preset  | Speed | Pitch | Energy | Use Case               |
| ------- | ----- | ----- | ------ | ---------------------- |
| default | 1.0   | 1.0   | 1.0    | Normal speech          |
| slow    | 0.75  | 1.0   | 1.0    | Elderly users, clarity |
| fast    | 1.25  | 1.0   | 1.0    | Quick information      |
| soft    | 0.9   | 0.95  | 0.7    | Calming content        |
| loud    | 1.0   | 1.05  | 1.3    | Alerts, emphasis       |
| happy   | 1.1   | 1.1   | 1.2    | Positive messages      |
| sad     | 0.85  | 0.9   | 0.8    | Empathetic responses   |
| calm    | 0.9   | 0.95  | 0.85   | Healthcare guidance    |
| excited | 1.2   | 1.15  | 1.3    | Celebrations           |
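
The preset table can be read as a lookup from style name to three multipliers. The sketch below shows one way to resolve a request into final factors. The rule that explicit `speed`, `pitch`, or `energy` values override the preset is an assumption (the report does not say how the two are combined), and `resolve_style` is a hypothetical name. The semitone conversion is the same formula used in Section 4.3.

```python
import math

# (speed, pitch, energy) per preset, copied from the table above.
STYLE_PRESETS = {
    "default": (1.0, 1.0, 1.0),
    "slow": (0.75, 1.0, 1.0),
    "fast": (1.25, 1.0, 1.0),
    "soft": (0.9, 0.95, 0.7),
    "loud": (1.0, 1.05, 1.3),
    "happy": (1.1, 1.1, 1.2),
    "sad": (0.85, 0.9, 0.8),
    "calm": (0.9, 0.95, 0.85),
    "excited": (1.2, 1.15, 1.3),
}


def resolve_style(style="default", speed=None, pitch=None, energy=None):
    """Return final factors; explicit arguments override the preset (assumed rule)."""
    base_speed, base_pitch, base_energy = STYLE_PRESETS[style]
    return {
        "speed": base_speed if speed is None else speed,
        "pitch": base_pitch if pitch is None else pitch,
        "energy": base_energy if energy is None else energy,
    }


def pitch_factor_to_semitones(pitch_factor: float) -> float:
    """Convert a frequency ratio to semitones: 12 * log2(ratio)."""
    return 12.0 * math.log2(pitch_factor)
```

A pitch factor of 2.0 corresponds to one octave (12 semitones), so the preset range of 0.9 to 1.15 stays within roughly two semitones either way.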

---

## 3. Supported Languages

| Language      | Code | Voices       | Model Type   | Sample Rate |
| ------------- | ---- | ------------ | ------------ | ----------- |
| Hindi         | hi   | Male, Female | SYSPIN VITS  | 22050 Hz    |
| Bengali       | bn   | Male, Female | SYSPIN VITS  | 22050 Hz    |
| Marathi       | mr   | Male, Female | SYSPIN VITS  | 22050 Hz    |
| Telugu        | te   | Male, Female | SYSPIN VITS  | 22050 Hz    |
| Kannada       | kn   | Male, Female | SYSPIN VITS  | 22050 Hz    |
| Bhojpuri      | bho  | Male, Female | SYSPIN VITS  | 22050 Hz    |
| Chhattisgarhi | hne  | Male, Female | SYSPIN VITS  | 22050 Hz    |
| Maithili      | mai  | Male, Female | SYSPIN VITS  | 22050 Hz    |
| Magahi        | mag  | Male, Female | SYSPIN VITS  | 22050 Hz    |
| English       | en   | Male, Female | SYSPIN VITS  | 22050 Hz    |
| Gujarati      | gu   | Neutral      | Facebook MMS | 16000 Hz    |

---

## 4. Implementation Details

### 4.1 Technology Stack

| Component         | Technology                               |
| ----------------- | ---------------------------------------- |
| Backend Framework | FastAPI                                  |
| ML Framework      | PyTorch                                  |
| TTS Models        | VITS (Coqui AI / SYSPIN), MMS (Facebook) |
| Audio Processing  | librosa, soundfile, scipy                |
| Model Hub         | Hugging Face Hub                         |
| API Documentation | OpenAPI/Swagger                          |

### 4.2 Model Architecture - VITS

VITS (Conditional Variational Autoencoder with Adversarial Learning) was chosen for:

- **End-to-End Efficiency**: Combines acoustic modeling and vocoding in a single pass
- **High Quality**: Natural-sounding speech comparable to two-stage systems
- **Multi-Speaker Support**: Supports different speakers via embeddings
- **Fast Inference**: TorchScript JIT compilation for speed

### 4.3 Style/Accent Transfer Implementation

Our style transfer uses a **post-processing** approach for simplicity and reliability:

1. **Pitch Shifting**: Phase vocoder via librosa

   ```python
   semitones = 12 * np.log2(pitch_factor)
   shifted = librosa.effects.pitch_shift(audio, sr=sr, n_steps=semitones)
   ```

2. **Time Stretching**: Phase vocoder via librosa

   ```python
   stretched = librosa.effects.time_stretch(audio, rate=speed_factor)
   ```

3. **Energy Control**: Soft clipping for natural sound
   ```python
   modified = audio * energy_factor
   if energy_factor > 1.0:
       modified = np.tanh(modified * 2) * 0.95  # Soft clip
   ```
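
To see what the energy step does to individual samples, the following sketch reproduces step 3 in plain Python, with `math.tanh` standing in for `np.tanh`. The function name `apply_energy` and the sample values are illustrative assumptions.

```python
import math


def apply_energy(samples, energy_factor):
    """Scale samples; above unity gain, soft-clip with tanh as in step 3."""
    scaled = [s * energy_factor for s in samples]
    if energy_factor > 1.0:
        # tanh is bounded by [-1, 1], so the output magnitude never exceeds 0.95.
        return [math.tanh(s * 2) * 0.95 for s in scaled]
    return scaled


quiet = apply_energy([0.1, -0.5, 1.0], 0.7)    # plain scaling, no clipping
boosted = apply_energy([0.1, -0.5, 1.0], 1.3)  # scaled, then soft-clipped
```

One consequence worth noting: for small samples `tanh(2x) * 0.95` is close to `1.9x`, so the boosted path raises quiet passages by more than `energy_factor` alone, while peaks are compressed toward 0.95.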

### 4.4 Key Design Decisions

1. **TorchScript Models**: JIT-compiled for faster inference
2. **Lazy Loading**: Models loaded on-demand to minimize memory
3. **CPU Fallback**: Inference falls back to CPU where Apple Silicon MPS is not fully supported
4. **Streaming Support**: Progressive audio delivery for real-time apps
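
The lazy-loading decision can be sketched as a small cache that loads a voice the first time it is requested. This is not the engine's actual code: `LazyModelCache` and the `loader` callable are hypothetical, and a real loader might wrap something like `torch.jit.load` on the downloaded model file.

```python
from typing import Any, Callable, Dict, List


class LazyModelCache:
    """Load each voice model on first use and reuse it afterwards."""

    def __init__(self, loader: Callable[[str], Any]):
        # `loader` is a hypothetical callable that turns a voice ID into a model.
        self._loader = loader
        self._models: Dict[str, Any] = {}

    def get(self, voice: str) -> Any:
        """Return the cached model, loading it only if it is not yet in memory."""
        if voice not in self._models:
            self._models[voice] = self._loader(voice)
        return self._models[voice]

    def loaded_voices(self) -> List[str]:
        """List the voices currently held in memory."""
        return sorted(self._models)
```

With roughly 300MB per VITS voice, loading on demand keeps memory proportional to the voices actually used. A server handling concurrent requests would likely also need a lock around the load.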

---

## 5. Usage Examples

### 5.1 Python API

```python
from src.engine import TTSEngine

# Initialize engine
engine = TTSEngine(device="auto")

# Basic synthesis
output = engine.synthesize(
    text="गर्भावस्था में स्वस्थ आहार बहुत महत्वपूर्ण है",
    voice="hi_female"
)

# With style control
output = engine.synthesize(
    text="आपका दिन शुभ हो",
    voice="hi_male",
    style="happy",
    pitch=1.1
)

# Gujarati
output = engine.synthesize(
    text="સ્વસ્થ રહો, ખુશ રહો",
    voice="gu_mms",
    style="calm"
)
```

### 5.2 REST API

```bash
# Basic synthesis
curl -X POST "http://localhost:8000/synthesize" \
  -H "Content-Type: application/json" \
  -d '{"text": "नमस्ते", "voice": "hi_male"}' \
  --output speech.wav

# With style
curl -X POST "http://localhost:8000/synthesize" \
  -H "Content-Type: application/json" \
  -d '{"text": "आपका स्वागत है", "voice": "hi_female", "style": "happy"}' \
  --output welcome.wav

# Gujarati
curl -X POST "http://localhost:8000/synthesize" \
  -H "Content-Type: application/json" \
  -d '{"text": "નમસ્તે", "voice": "gu_mms"}' \
  --output gujarati.wav
```

### 5.3 Command Line

```bash
# Download models
python -m src.cli download --voice hi_male
python -m src.cli download --lang hi  # All Hindi voices

# Synthesize
python -m src.cli synthesize --text "नमस्ते" --voice hi_male --output hello.wav

# Start server
python -m src.cli serve --port 8000
```

---

## 6. Healthcare Use Case

### 6.1 Target Application

The TTS system is designed for integration with an **LLM-based healthcare assistant** for pregnant mothers in low-income communities.

### 6.2 Key Features for Healthcare

1. **Multi-lingual Support**: Information in native languages
2. **Calm Style Preset**: Reassuring tone for medical guidance
3. **Slow Speed Option**: Clear pronunciation for instructions
4. **Low Latency**: Real-time conversational responses

### 6.3 Example Healthcare Dialogue

```
User: "ગર્ભાવસ્થામાં શું ખાવું જોઈએ?"

System Response (TTS with calm style in Gujarati):
"ગર્ભાવસ્થામાં તમારે પ્રોટીન, આયર્ન અને ફોલિક એસિડથી ભરપૂર
ખોરાક લેવો જોઈએ. દાળ, પાલક, ઈંડા અને દૂધ સારા વિકલ્પો છે."
```

---

## 7. Performance Benchmarks

| Test                    | Time  | Notes                              |
| ----------------------- | ----- | ---------------------------------- |
| Hindi synthesis (short) | 0.25s | "नमस्ते"                          |
| Hindi synthesis (long)  | 0.45s | 50-word sentence                   |
| Gujarati MMS            | 0.35s | First load includes model download |
| Style processing        | +0.1s | Pitch + speed adjustment           |
| API round-trip          | 0.5s  | Including network overhead         |

Hardware: Apple M2 Pro, 16GB RAM, CPU inference
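
Timings like these can be reproduced with a small harness. The sketch below is an assumption about how such numbers might be gathered, not the procedure used for the table: `benchmark` is a hypothetical helper that times any callable taking a text string.

```python
import statistics
import time


def benchmark(synthesize, text, runs=5, warmup=1):
    """Time synthesize(text) and return summary statistics in seconds."""
    for _ in range(warmup):
        synthesize(text)  # warm-up absorbs one-time costs such as model loading
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        synthesize(text)
        timings.append(time.perf_counter() - start)
    return {
        "mean": statistics.mean(timings),
        "min": min(timings),
        "max": max(timings),
        "runs": runs,
    }
```

For example, `benchmark(lambda t: engine.synthesize(text=t, voice="hi_male"), "नमस्ते")` would time the Python API from Section 5.1 while excluding the first-load cost.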

---

## 8. Deployment

### 8.1 Quick Start

```bash
# Clone repository
git clone https://github.com/harshil748/VoiceAPI
cd VoiceAPI

# Setup environment
python3 -m venv tts
source tts/bin/activate
pip install -r requirements.txt

# Download a model
python -m src.cli download --voice hi_male

# Start server
python -m src.cli serve --port 8000
```

### 8.2 Docker

```dockerfile
FROM python:3.10-slim
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt
RUN python -m src.cli download --lang hi
EXPOSE 8000
CMD ["python", "-m", "src.cli", "serve"]
```

---

## 9. Limitations and Future Work

### 9.1 Current Limitations

1. **Model Size**: Each VITS model is ~300MB
2. **MPS Compatibility**: Apple Silicon MPS not fully supported
3. **Real-time Streaming**: Limited to sentence-level
4. **Gujarati Gender**: MMS has only neutral voice

### 9.2 Future Improvements

1. **Model Quantization**: INT8 for smaller size
2. **Voice Cloning**: Reference audio-based synthesis
3. **SSML Support**: Markup language for fine control
4. **More Languages**: Odia, Assamese, Punjabi
5. **Fine-tuning**: Custom voice training on SPICOR data

---

## 10. Credits

### Model Sources

| Source                  | Models                | License      |
| ----------------------- | --------------------- | ------------ |
| SYSPIN (IISc Bangalore) | VITS for 10 languages | CC BY 4.0    |
| Facebook MMS            | Gujarati VITS         | CC BY-NC 4.0 |

### Dataset

- **SPICOR TTS Project**: IISc SPIRE Lab, Bangalore
- **Audio Quality**: 48kHz, 24-bit, mono

### Frameworks

- Coqui TTS, Hugging Face Transformers, FastAPI, librosa

---

## 11. Conclusion

We have developed a comprehensive multi-lingual TTS system that:

✅ Supports **11 Indian languages** with 21 voice variants  
✅ Provides **9 style presets** for prosody control  
✅ Offers a **REST API** with OpenAPI documentation  
✅ Achieves **<500ms latency** for typical sentences  
✅ Is **production-ready** with proper error handling

The system is well-suited for the healthcare assistant use case, providing clear, natural-sounding speech in native languages to help pregnant mothers access healthcare information.

---

**Repository**: https://github.com/harshil748/VoiceAPI  
**API Documentation**: http://localhost:8000/docs