# Voice Tech for All: Technical Report
## Multi-lingual Text-to-Speech System with Style Transfer
**Hackathon**: Voice Tech for All
**Date**: December 2025
---
## Executive Summary
We present a **multi-lingual Text-to-Speech (TTS) system** supporting **11 Indian languages** with **style/prosody control**. The system is designed as the speech component of a healthcare assistant for pregnant mothers in low-income communities, making health information accessible in native languages.
### Key Achievements
| Metric | Value |
| ---------------------- | ----------------------------------------------------------------------------------------------------------- |
| Languages Supported | 11 (Hindi, Bengali, Marathi, Telugu, Kannada, Bhojpuri, Chhattisgarhi, Maithili, Magahi, English, Gujarati) |
| Voice Variants | 21 (male + female for 10 languages, plus one neutral Gujarati voice) |
| Style Presets | 9 (default, slow, fast, soft, loud, happy, sad, calm, excited) |
| Average Inference Time | ~0.3s (CPU, Apple M2) |
| Model Size | ~300MB per voice (VITS), ~145MB (MMS) |
| API Latency | <500ms for typical sentences |
---
## 1. System Architecture
### 1.1 Overview
```
┌─────────────────────────────────────────────────────────────┐
│                  REST API Server (FastAPI)                  │
├─────────────────────────────────────────────────────────────┤
│ ┌────────────┐  ┌──────────────┐  ┌───────────────────────┐ │
│ │/synthesize │  │   /voices    │  │        /styles        │ │
│ │  /stream   │  │  /languages  │  │        /health        │ │
│ └────────────┘  └──────────────┘  └───────────────────────┘ │
├─────────────────────────────────────────────────────────────┤
│                         TTS Engine                          │
│ ┌──────────────────┐   ┌───────────────┐   ┌──────────────┐ │
│ │ Text Normalizer  │ → │   Tokenizer   │ → │   VITS/MMS   │ │
│ │ (Indian scripts) │   │ (char-to-ID)  │   │  Inference   │ │
│ └──────────────────┘   └───────────────┘   └──────────────┘ │
│                              ↓                              │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │            Style Processor (Prosody Control)            │ │
│ │  • Pitch Shifting (librosa)                             │ │
│ │  • Time Stretching (speed control)                      │ │
│ │  • Energy/Volume Modification                           │ │
│ └─────────────────────────────────────────────────────────┘ │
├─────────────────────────────────────────────────────────────┤
│                      Model Repository                       │
│ ┌──────────────────────────┐  ┌───────────────────────────┐ │
│ │    SYSPIN VITS Models    │  │    Facebook MMS Models    │ │
│ │      (10 languages)      │  │        (Gujarati)         │ │
│ └──────────────────────────┘  └───────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
```
### 1.2 Component Details
#### Text Normalizer
- Handles Indian script peculiarities
- Converts number notations: `{100}{एकसो}` → `एकसो`
- Normalizes punctuation across scripts
- Handles code-switching (Hindi in English text)
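The number-notation rule in the list above can be sketched with a regular expression. This is a minimal illustration, assuming every annotated number appears as a `{written}{spoken}` pair; `expand_number_notation` is a hypothetical helper name and not necessarily how the project's normalizer is implemented.

```python
import re

# Assumed markup: "{written form}{spoken form}", e.g. "{100}{एकसो}".
_PAIR = re.compile(r"\{([^{}]*)\}\{([^{}]*)\}")


def expand_number_notation(text: str) -> str:
    """Keep only the spoken form of each {written}{spoken} pair."""
    return _PAIR.sub(lambda m: m.group(2), text)


print(expand_number_notation("{100}{एकसो} रुपये"))
```

Text without the markup passes through unchanged, so the step is safe to run on every input.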
#### VITS Models (SYSPIN)
- **Architecture**: Conditional Variational Autoencoder with Adversarial Learning
- **Training Data**: 20-30 hours per speaker from IISc Bangalore
- **Output**: 22050 Hz, 16-bit PCM
- **Languages**: Hindi, Bengali, Marathi, Telugu, Kannada, Bhojpuri, Chhattisgarhi, Maithili, Magahi, English
#### MMS Model (Facebook)
- **Architecture**: VITS-based, trained on MMS corpus
- **Output**: 16000 Hz
- **Languages**: Gujarati (and 1100+ others available)
- **Model Size**: 145MB
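Because the two model families emit audio at different sample rates, whatever writes the output file has to carry the rate along with the samples. The sketch below encodes float samples as a mono 16-bit PCM WAV using only the standard library. It assumes samples in the range [-1.0, 1.0] and 16-bit output for both families (the report states 16-bit PCM only for the SYSPIN models); `floats_to_wav_bytes` is a hypothetical helper, not part of the project.

```python
import io
import math
import struct
import wave


def floats_to_wav_bytes(samples, sample_rate):
    """Encode float samples in [-1.0, 1.0] as a mono 16-bit PCM WAV."""
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    pcm = b"".join(struct.pack("<h", int(round(s * 32767))) for s in clipped)
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)  # 22050 for SYSPIN VITS, 16000 for MMS
        wav.writeframes(pcm)
    return buffer.getvalue()


# One second of a 440 Hz test tone at the SYSPIN VITS rate.
sr = 22050
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sr) for n in range(sr)]
wav_bytes = floats_to_wav_bytes(tone, sr)
```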
#### Style Processor
- **Pitch Shifting**: librosa `pitch_shift` (phase-vocoder time stretch followed by resampling)
- **Time Stretching**: Phase vocoder via librosa `time_stretch`
- **Energy Control**: Soft clipping with tanh for natural sound
---
## 2. API Specification
### 2.1 Endpoints
| Endpoint | Method | Description |
| -------------------- | ------ | -------------------------------- |
| `/` | GET | API info and documentation links |
| `/health` | GET | System health and loaded models |
| `/voices` | GET | List all available voices |
| `/languages` | GET | List supported languages |
| `/styles` | GET | List style presets |
| `/synthesize` | POST | Generate speech from text |
| `/synthesize/get` | GET | Simple synthesis (for testing) |
| `/synthesize/stream` | POST | Streaming audio response |
| `/preload` | POST | Preload voice into memory |
| `/batch` | POST | Batch synthesis |
### 2.2 Synthesis Request
```json
{
  "text": "નમસ્તે, હું તમારી કેવી રીતે મદદ કરી શકું?",
  "voice": "gu_mms",
  "speed": 1.0,
  "pitch": 1.0,
  "energy": 1.0,
  "style": "calm",
  "normalize": true
}
```
### 2.3 Style Presets
| Preset | Speed | Pitch | Energy | Use Case |
| ------- | ----- | ----- | ------ | ---------------------- |
| default | 1.0 | 1.0 | 1.0 | Normal speech |
| slow | 0.75 | 1.0 | 1.0 | Elderly users, clarity |
| fast | 1.25 | 1.0 | 1.0 | Quick information |
| soft | 0.9 | 0.95 | 0.7 | Calming content |
| loud | 1.0 | 1.05 | 1.3 | Alerts, emphasis |
| happy | 1.1 | 1.1 | 1.2 | Positive messages |
| sad | 0.85 | 0.9 | 0.8 | Empathetic responses |
| calm | 0.9 | 0.95 | 0.85 | Healthcare guidance |
| excited | 1.2 | 1.15 | 1.3 | Celebrations |
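The preset table maps directly onto a lookup structure. The sketch below copies the values from the table and shows one possible way to combine a preset with explicit `speed`, `pitch`, and `energy` values from the request. The multiplication rule is an assumption made for illustration; the report does not say whether explicit values multiply or override the preset. `resolve_style` is a hypothetical name.

```python
# (speed, pitch, energy) per preset, copied from the table above.
STYLE_PRESETS = {
    "default": (1.0, 1.0, 1.0),
    "slow": (0.75, 1.0, 1.0),
    "fast": (1.25, 1.0, 1.0),
    "soft": (0.9, 0.95, 0.7),
    "loud": (1.0, 1.05, 1.3),
    "happy": (1.1, 1.1, 1.2),
    "sad": (0.85, 0.9, 0.8),
    "calm": (0.9, 0.95, 0.85),
    "excited": (1.2, 1.15, 1.3),
}


def resolve_style(style="default", speed=1.0, pitch=1.0, energy=1.0):
    """Combine a preset with explicit factors (assumed rule: multiply)."""
    if style not in STYLE_PRESETS:
        raise ValueError(f"unknown style preset: {style!r}")
    base_speed, base_pitch, base_energy = STYLE_PRESETS[style]
    return base_speed * speed, base_pitch * pitch, base_energy * energy


print(resolve_style("calm"))
print(resolve_style("happy", pitch=1.1))
```

Under this rule a request with only `style` set reproduces the table row exactly, and explicit factors act as fine adjustments on top of it.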
---
## 3. Supported Languages
| Language | Code | Voices | Model Type | Sample Rate |
| ------------- | ---- | ------------ | ------------ | ----------- |
| Hindi | hi | Male, Female | SYSPIN VITS | 22050 Hz |
| Bengali | bn | Male, Female | SYSPIN VITS | 22050 Hz |
| Marathi | mr | Male, Female | SYSPIN VITS | 22050 Hz |
| Telugu | te | Male, Female | SYSPIN VITS | 22050 Hz |
| Kannada | kn | Male, Female | SYSPIN VITS | 22050 Hz |
| Bhojpuri | bho | Male, Female | SYSPIN VITS | 22050 Hz |
| Chhattisgarhi | hne | Male, Female | SYSPIN VITS | 22050 Hz |
| Maithili | mai | Male, Female | SYSPIN VITS | 22050 Hz |
| Magahi | mag | Male, Female | SYSPIN VITS | 22050 Hz |
| English | en | Male, Female | SYSPIN VITS | 22050 Hz |
| Gujarati | gu | Neutral | Facebook MMS | 16000 Hz |
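As a cross-check of the voice count, the table can be expanded into voice identifiers. This sketch assumes every SYSPIN voice follows the `<code>_<gender>` pattern seen in `hi_male` and `hi_female` elsewhere in this report; only those two IDs and `gu_mms` are confirmed by the examples, and `list_voice_ids` is a hypothetical helper.

```python
# Language codes from the table; each SYSPIN language has a male and a female voice.
SYSPIN_LANGUAGES = ["hi", "bn", "mr", "te", "kn", "bho", "hne", "mai", "mag", "en"]


def list_voice_ids():
    """Build voice IDs, assuming the <code>_<gender> naming seen in hi_male."""
    voices = [
        f"{code}_{gender}"
        for code in SYSPIN_LANGUAGES
        for gender in ("male", "female")
    ]
    voices.append("gu_mms")  # single neutral Gujarati voice (Facebook MMS)
    return voices


print(len(list_voice_ids()))
```

Ten languages with two voices each plus the single Gujarati voice gives the 21 variants reported in the summary.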
---
## 4. Implementation Details
### 4.1 Technology Stack
| Component | Technology |
| ----------------- | ---------------------------------------- |
| Backend Framework | FastAPI |
| ML Framework | PyTorch |
| TTS Models | VITS (Coqui AI / SYSPIN), MMS (Facebook) |
| Audio Processing | librosa, soundfile, scipy |
| Model Hub | Hugging Face Hub |
| API Documentation | OpenAPI/Swagger |
### 4.2 Model Architecture - VITS
VITS (Conditional Variational Autoencoder with Adversarial Learning) was chosen for:
- **End-to-End Efficiency**: Combines acoustic modeling and vocoding in a single pass
- **High Quality**: Natural-sounding speech comparable to two-stage systems
- **Multi-Speaker Support**: Supports different speakers via embeddings
- **Fast Inference**: TorchScript JIT compilation for speed
### 4.3 Style/Accent Transfer Implementation
Our style transfer uses a **post-processing** approach for simplicity and reliability:
1. **Pitch Shifting**: Phase vocoder via librosa
```python
semitones = 12 * np.log2(pitch_factor)
shifted = librosa.effects.pitch_shift(audio, sr=sr, n_steps=semitones)
```
2. **Time Stretching**: Phase vocoder via librosa `time_stretch`
```python
stretched = librosa.effects.time_stretch(audio, rate=speed_factor)
```
3. **Energy Control**: Soft clipping for natural sound
```python
modified = audio * energy_factor
if energy_factor > 1.0:
    modified = np.tanh(modified * 2) * 0.95  # Soft clip
```
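To make the numbers in the three steps concrete, here is a small worked sketch in plain Python. It applies the same formulas to single values with `math` instead of NumPy arrays, so it runs without librosa; the function names are illustrative only.

```python
import math


def pitch_factor_to_semitones(pitch_factor):
    """Convert a frequency ratio to semitones (12 semitones per octave)."""
    return 12 * math.log2(pitch_factor)


def apply_energy(sample, energy_factor):
    """Scale one sample; soft-clip with tanh only when boosting."""
    modified = sample * energy_factor
    if energy_factor > 1.0:
        modified = math.tanh(modified * 2) * 0.95
    return modified


print(round(pitch_factor_to_semitones(1.1), 2))  # pitch of the "happy" preset
print(round(pitch_factor_to_semitones(2.0), 2))  # a factor of 2 is one octave
print(apply_energy(0.5, 0.7))                    # attenuation stays linear
print(abs(apply_energy(1.0, 1.3)) < 0.95)        # boosted peaks stay below 0.95
```

One property worth noting: because of the `* 2` inside `tanh`, quiet samples are amplified by roughly an extra factor of two whenever `energy_factor > 1.0`, so the boost path changes loudness more than the factor alone suggests.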
### 4.4 Key Design Decisions
1. **TorchScript Models**: JIT-compiled for faster inference
2. **Lazy Loading**: Models loaded on-demand to minimize memory
3. **CPU Fallback**: Inference falls back to CPU because Apple Silicon MPS is not fully supported
4. **Streaming Support**: Progressive audio delivery for real-time apps
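The lazy-loading decision can be illustrated with a small cache that loads a voice only on first use. This is a sketch under assumptions: `LazyModelCache` and the loader callable are hypothetical, and a real loader would read a TorchScript file rather than return a string.

```python
from typing import Callable, Dict


class LazyModelCache:
    """Load each voice model the first time it is requested, then reuse it."""

    def __init__(self, loader: Callable[[str], object]):
        self._loader = loader  # hypothetical: maps a voice ID to a loaded model
        self._models: Dict[str, object] = {}

    def get(self, voice: str) -> object:
        if voice not in self._models:  # only the first request pays the load cost
            self._models[voice] = self._loader(voice)
        return self._models[voice]

    def loaded_voices(self):
        return sorted(self._models)


# Stand-in loader so the sketch runs without any model files.
load_calls = []


def fake_loader(voice):
    load_calls.append(voice)
    return f"model:{voice}"


cache = LazyModelCache(fake_loader)
cache.get("hi_male")
cache.get("hi_male")  # served from memory; the loader is not called again
print(load_calls)
```

An endpoint such as `/preload` could call `get` ahead of time so that the first synthesis request does not include the load.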
---
## 5. Usage Examples
### 5.1 Python API
```python
from src.engine import TTSEngine

# Initialize engine
engine = TTSEngine(device="auto")

# Basic synthesis
output = engine.synthesize(
    text="गर्भावस्था में स्वस्थ आहार बहुत महत्वपूर्ण है",
    voice="hi_female"
)

# With style control
output = engine.synthesize(
    text="आपका दिन शुभ हो",
    voice="hi_male",
    style="happy",
    pitch=1.1
)

# Gujarati
output = engine.synthesize(
    text="સ્વસ્થ રહો, ખુશ રહો",
    voice="gu_mms",
    style="calm"
)
```
### 5.2 REST API
```bash
# Basic synthesis
curl -X POST "http://localhost:8000/synthesize" \
  -H "Content-Type: application/json" \
  -d '{"text": "नमस्ते", "voice": "hi_male"}' \
  --output speech.wav

# With style
curl -X POST "http://localhost:8000/synthesize" \
  -H "Content-Type: application/json" \
  -d '{"text": "आपका स्वागत है", "voice": "hi_female", "style": "happy"}' \
  --output welcome.wav

# Gujarati
curl -X POST "http://localhost:8000/synthesize" \
  -H "Content-Type: application/json" \
  -d '{"text": "નમસ્તે", "voice": "gu_mms"}' \
  --output gujarati.wav
```
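The same calls can be made from Python with only the standard library. The sketch below mirrors the `curl` commands above: it builds a JSON `POST` to `/synthesize` and saves the response body. It assumes the endpoint returns raw WAV bytes, as the `--output` flags suggest; `build_synthesize_request` and `synthesize_to_file` are hypothetical helper names.

```python
import json
import urllib.request


def build_synthesize_request(text, voice, base_url="http://localhost:8000", **options):
    """Build a POST /synthesize request; options may include style, speed, etc."""
    payload = {"text": text, "voice": voice, **options}
    return urllib.request.Request(
        url=f"{base_url}/synthesize",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(path, text, voice, **options):
    """Send the request and save the response body (assumed to be WAV audio)."""
    request = build_synthesize_request(text, voice, **options)
    with urllib.request.urlopen(request, timeout=30) as response:
        audio = response.read()
    with open(path, "wb") as out:
        out.write(audio)


# Requires a running server:
# synthesize_to_file("welcome.wav", "आपका स्वागत है", "hi_female", style="happy")
```

Separating request construction from sending keeps the payload easy to inspect without a live server.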
### 5.3 Command Line
```bash
# Download models
python -m src.cli download --voice hi_male
python -m src.cli download --lang hi # All Hindi voices
# Synthesize
python -m src.cli synthesize --text "नमस्ते" --voice hi_male --output hello.wav
# Start server
python -m src.cli serve --port 8000
```
---
## 6. Healthcare Use Case
### 6.1 Target Application
The TTS system is designed for integration with an **LLM-based healthcare assistant** for pregnant mothers in low-income communities.
### 6.2 Key Features for Healthcare
1. **Multi-lingual Support**: Information in native languages
2. **Calm Style Preset**: Reassuring tone for medical guidance
3. **Slow Speed Option**: Clear pronunciation for instructions
4. **Low Latency**: Real-time conversational responses
### 6.3 Example Healthcare Dialogue
```
User: "ગર્ભાવસ્થામાં શું ખાવું જોઈએ?"
System Response (TTS with calm style in Gujarati):
"ગર્ભાવસ્થામાં તમારે પ્રોટીન, આયર્ન અને ફોલિક એસિડથી ભરપૂર
ખોરાક લેવો જોઈએ. દાળ, પાલક, ઈંડા અને દૂધ સારા વિકલ્પો છે."
```
---
## 7. Performance Benchmarks
| Test | Time | Notes |
| ----------------------- | ----- | ---------------------------------- |
| Hindi synthesis (short) | 0.25s | "नमस्ते"                           |
| Hindi synthesis (long) | 0.45s | 50-word sentence |
| Gujarati MMS | 0.35s | First load also includes the model download |
| Style processing | +0.1s | Pitch + speed adjustment |
| API round-trip | 0.5s | Including network overhead |
Hardware: Apple M2 Pro, 16GB RAM, CPU inference
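The report does not describe how these timings were collected. One reasonable way to reproduce such measurements is a small harness that discards a warm-up call (so model loading is excluded) and reports the median of several runs. The sketch below is an assumption about method, not the authors' procedure; `benchmark` is a hypothetical helper and `fake_synthesize` stands in for `engine.synthesize`.

```python
import statistics
import time


def benchmark(synthesize, text, runs=5, warmup=1):
    """Return the median wall-clock time in seconds of synthesize(text)."""
    for _ in range(warmup):  # exclude one-off costs such as model loading
        synthesize(text)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        synthesize(text)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Stand-in for engine.synthesize so the sketch runs without models.
def fake_synthesize(text):
    time.sleep(0.01)


print(f"median: {benchmark(fake_synthesize, 'नमस्ते'):.3f}s")
```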
---
## 8. Deployment
### 8.1 Quick Start
```bash
# Clone repository
git clone https://github.com/harshil748/VoiceAPI
cd VoiceAPI
# Setup environment
python3 -m venv tts
source tts/bin/activate
pip install -r requirements.txt
# Download a model
python -m src.cli download --voice hi_male
# Start server
python -m src.cli serve --port 8000
```
### 8.2 Docker
```dockerfile
FROM python:3.10-slim
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt
RUN python -m src.cli download --lang hi
EXPOSE 8000
CMD ["python", "-m", "src.cli", "serve"]
```
---
## 9. Limitations and Future Work
### 9.1 Current Limitations
1. **Model Size**: Each VITS model is ~300MB
2. **MPS Compatibility**: Apple Silicon MPS not fully supported
3. **Real-time Streaming**: Limited to sentence-level
4. **Gujarati Gender**: MMS has only neutral voice
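Sentence-level streaming, as noted in the limitations above, amounts to splitting the input and synthesizing one sentence at a time. A minimal sketch of the splitting step follows. It assumes sentences end with `.`, `!`, `?`, or the Devanagari danda `।` followed by whitespace; `split_sentences` is a hypothetical helper, not the project's implementation.

```python
import re

# Split on whitespace that follows sentence-ending punctuation,
# including the Devanagari danda.
_SENTENCE_END = re.compile(r"(?<=[.!?।])\s+")


def split_sentences(text):
    """Yield non-empty sentences so each can be synthesized and sent in turn."""
    for sentence in _SENTENCE_END.split(text.strip()):
        if sentence:
            yield sentence


for chunk in split_sentences("नमस्ते। आप कैसी हैं? स्वस्थ रहो."):
    print(chunk)
```

Each yielded chunk could then be passed to the synthesizer and written to the response as soon as its audio is ready, which keeps time-to-first-audio close to the latency of a single short sentence.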
### 9.2 Future Improvements
1. **Model Quantization**: INT8 for smaller size
2. **Voice Cloning**: Reference audio-based synthesis
3. **SSML Support**: Markup language for fine control
4. **More Languages**: Odia, Assamese, Punjabi
5. **Fine-tuning**: Custom voice training on SPICOR data
---
## 10. Credits
### Model Sources
| Source | Models | License |
| ----------------------- | --------------------- | ------------ |
| SYSPIN (IISc Bangalore) | VITS for 10 languages | CC BY 4.0 |
| Facebook MMS | Gujarati VITS | CC BY-NC 4.0 |
### Dataset
- **SPICOR TTS Project**: IISc SPIRE Lab, Bangalore
- **Audio Quality**: 48kHz, 24-bit, mono
### Frameworks
- Coqui TTS, Hugging Face Transformers, FastAPI, librosa
---
## 11. Conclusion
We have developed a comprehensive multi-lingual TTS system that:
- ✅ Supports **11 Indian languages** with 21 voice variants
- ✅ Provides **9 style presets** for prosody control
- ✅ Offers a **REST API** with OpenAPI documentation
- ✅ Achieves **<500ms latency** for typical sentences
- ✅ Is **production-ready** with proper error handling
The system is well-suited for the healthcare assistant use case, providing clear, natural-sounding speech in native languages to help pregnant mothers access healthcare information.
---
**Repository**: https://github.com/harshil748/VoiceAPI
**API Documentation**: http://localhost:8000/docs (available while the server is running locally)