# Voice Tech for All: Technical Report

## Multi-lingual Text-to-Speech System with Style Transfer

**Hackathon**: Voice Tech for All  
**Date**: December 2025

---

## Executive Summary

We present a **multi-lingual Text-to-Speech (TTS) system** supporting **11 Indian languages** with **style/prosody control** capabilities. The system is designed for deployment as a healthcare assistant for pregnant mothers in low-income communities, making health information accessible in native languages.

### Key Achievements

| Metric                 | Value                                                                                                       |
| ---------------------- | ----------------------------------------------------------------------------------------------------------- |
| Languages Supported    | 11 (Hindi, Bengali, Marathi, Telugu, Kannada, Bhojpuri, Chhattisgarhi, Maithili, Magahi, English, Gujarati) |
| Voice Variants         | 21 (male and female for each of the 10 SYSPIN languages, plus one neutral Gujarati voice)                  |
| Style Presets          | 9 (default, slow, fast, soft, loud, happy, sad, calm, excited)                                              |
| Average Inference Time | ~0.3s (CPU, Apple M2 Pro)                                                                                   |
| Model Size             | ~300MB per voice (VITS), ~145MB (MMS)                                                                       |
| API Latency            | <500ms for typical sentences                                                                                |

---

## 1. System Architecture

### 1.1 Overview

```
┌─────────────────────────────────────────────────────────────┐
│                  REST API Server (FastAPI)                  │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌──────────────┐  ┌──────────────────────┐│
│  │ /synthesize │  │ /voices      │  │ /styles              ││
│  │ /stream     │  │ /languages   │  │ /health              ││
│  └─────────────┘  └──────────────┘  └──────────────────────┘│
├─────────────────────────────────────────────────────────────┤
│                         TTS Engine                          │
│  ┌─────────────────┐  ┌─────────────────┐  ┌──────────────┐ │
│  │ Text Normalizer │→ │ Tokenizer       │→ │ VITS/MMS     │ │
│  │ (Indian scripts)│  │ (char-to-ID)    │  │ Inference    │ │
│  └─────────────────┘  └─────────────────┘  └──────────────┘ │
│                              ↓                              │
│  ┌─────────────────────────────────────────────────────────┐│
│  │            Style Processor (Prosody Control)            ││
│  │  • Pitch Shifting (librosa)                             ││
│  │  • Time Stretching (speed control)                      ││
│  │  • Energy/Volume Modification                           ││
│  └─────────────────────────────────────────────────────────┘│
├─────────────────────────────────────────────────────────────┤
│                      Model Repository                       │
│  ┌────────────────────┐  ┌────────────────────────────────┐ │
│  │ SYSPIN VITS Models │  │ Facebook MMS Models            │ │
│  │ (10 languages)     │  │ (Gujarati)                     │ │
│  └────────────────────┘  └────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
```

### 1.2 Component Details

#### Text Normalizer

- Handles Indian script peculiarities
- Converts number notations: `{100}{एकसो}` → `एकसो`
- Normalizes punctuation across scripts
- Handles code-switching (e.g., Hindi words embedded in English text)
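
The number-notation rule in the list above can be sketched with a regular expression. This is a minimal illustration rather than the project's normalizer: the name `expand_number_notation` is hypothetical, and the sketch assumes the notation always appears as a `{digits}{spoken form}` pair.

```python
import re

# Matches a `{digits}{spoken form}` pair such as `{100}{एकसो}`.
# Assumption: the first brace group contains only digits.
_NUMBER_PAIR = re.compile(r"\{\d+\}\{([^{}]+)\}")


def expand_number_notation(text: str) -> str:
    """Replace each `{number}{spoken form}` pair with its spoken form."""
    return _NUMBER_PAIR.sub(r"\1", text)


print(expand_number_notation("कीमत {100}{एकसो} रुपये है"))
```

Text without the brace notation passes through unchanged, so the rule can run on every input.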

#### VITS Models (SYSPIN)

- **Architecture**: Conditional Variational Autoencoder with Adversarial Learning
- **Training Data**: 20-30 hours per speaker from IISc Bangalore
- **Output**: 22050 Hz, 16-bit PCM
- **Languages**: Hindi, Bengali, Marathi, Telugu, Kannada, Bhojpuri, Chhattisgarhi, Maithili, Magahi, English
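
For reference, 16-bit PCM at 22050 Hz means each sample is stored as a signed 16-bit integer. The sketch below shows one way to encode floating-point samples in [-1.0, 1.0] as a mono WAV using only the Python standard library. `float_to_wav_bytes` is a hypothetical helper for illustration; the project itself lists soundfile for audio I/O.

```python
import io
import math
import struct
import wave


def float_to_wav_bytes(samples, sample_rate=22050):
    """Encode float samples in [-1.0, 1.0] as mono 16-bit PCM WAV bytes."""
    # Clamp to the valid range, then scale to the signed 16-bit range.
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return buffer.getvalue()


# One second of a 440 Hz test tone at the SYSPIN output rate.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 22050) for n in range(22050)]
wav_bytes = float_to_wav_bytes(tone)
```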

#### MMS Model (Facebook)

- **Architecture**: VITS-based, trained on MMS corpus
- **Output**: 16000 Hz
- **Languages**: Gujarati (and 1100+ others available)
- **Model Size**: 145MB

#### Style Processor

- **Pitch Shifting**: `librosa.effects.pitch_shift` (phase-vocoder time stretch followed by resampling)
- **Time Stretching**: Phase vocoder via `librosa.effects.time_stretch`
- **Energy Control**: Soft clipping with tanh for natural sound

---

## 2. API Specification

### 2.1 Endpoints

| Endpoint             | Method | Description                      |
| -------------------- | ------ | -------------------------------- |
| `/`                  | GET    | API info and documentation links |
| `/health`            | GET    | System health and loaded models  |
| `/voices`            | GET    | List all available voices        |
| `/languages`         | GET    | List supported languages         |
| `/styles`            | GET    | List style presets               |
| `/synthesize`        | POST   | Generate speech from text        |
| `/synthesize/get`    | GET    | Simple synthesis (for testing)   |
| `/synthesize/stream` | POST   | Streaming audio response         |
| `/preload`           | POST   | Preload voice into memory        |
| `/batch`             | POST   | Batch synthesis                  |

### 2.2 Synthesis Request

```json
{
	"text": "નમસ્તે, હું તમારી કેવી રીતે મદદ કરી શકું?",
	"voice": "gu_mms",
	"speed": 1.0,
	"pitch": 1.0,
	"energy": 1.0,
	"style": "calm",
	"normalize": true
}
```
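
As an illustration, a client could assemble this request body in Python before sending it. The sketch below only builds the request and does not contact a server. `build_synthesis_request` is a hypothetical helper, and the base URL assumes the server is running locally on port 8000.

```python
import json
import urllib.request


def build_synthesis_request(text, voice, base_url="http://localhost:8000", **options):
    """Build (but do not send) a POST request for the /synthesize endpoint."""
    payload = {"text": text, "voice": voice, **options}
    # ensure_ascii=False keeps Indic scripts readable in the encoded body.
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/synthesize",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_synthesis_request(
    "નમસ્તે, હું તમારી કેવી રીતે મદદ કરી શકું?",
    voice="gu_mms",
    style="calm",
    normalize=True,
)
# To send it: urllib.request.urlopen(request).read() returns the response bytes.
```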

### 2.3 Style Presets

| Preset  | Speed | Pitch | Energy | Use Case               |
| ------- | ----- | ----- | ------ | ---------------------- |
| default | 1.0   | 1.0   | 1.0    | Normal speech          |
| slow    | 0.75  | 1.0   | 1.0    | Elderly users, clarity |
| fast    | 1.25  | 1.0   | 1.0    | Quick information      |
| soft    | 0.9   | 0.95  | 0.7    | Calming content        |
| loud    | 1.0   | 1.05  | 1.3    | Alerts, emphasis       |
| happy   | 1.1   | 1.1   | 1.2    | Positive messages      |
| sad     | 0.85  | 0.9   | 0.8    | Empathetic responses   |
| calm    | 0.9   | 0.95  | 0.85   | Healthcare guidance    |
| excited | 1.2   | 1.15  | 1.3    | Celebrations           |
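
The preset table can be represented as a simple lookup. The sketch below is illustrative only: `Prosody` and `resolve_prosody` are hypothetical names, and it assumes that explicit `speed`, `pitch`, and `energy` values multiply the preset values. The actual engine may combine them differently, for example by letting explicit values override the preset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prosody:
    speed: float = 1.0
    pitch: float = 1.0
    energy: float = 1.0


# Values copied from the preset table.
STYLE_PRESETS = {
    "default": Prosody(1.0, 1.0, 1.0),
    "slow": Prosody(0.75, 1.0, 1.0),
    "fast": Prosody(1.25, 1.0, 1.0),
    "soft": Prosody(0.9, 0.95, 0.7),
    "loud": Prosody(1.0, 1.05, 1.3),
    "happy": Prosody(1.1, 1.1, 1.2),
    "sad": Prosody(0.85, 0.9, 0.8),
    "calm": Prosody(0.9, 0.95, 0.85),
    "excited": Prosody(1.2, 1.15, 1.3),
}


def resolve_prosody(style="default", speed=1.0, pitch=1.0, energy=1.0):
    """Combine a preset with per-request factors.

    Assumption: explicit factors multiply the preset values, so 1.0 leaves
    the preset unchanged.
    """
    preset = STYLE_PRESETS[style]
    return Prosody(preset.speed * speed, preset.pitch * pitch, preset.energy * energy)


print(resolve_prosody("calm"))
print(resolve_prosody("happy", pitch=1.1))
```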

---

## 3. Supported Languages

| Language      | Code | Voices       | Model Type   | Sample Rate |
| ------------- | ---- | ------------ | ------------ | ----------- |
| Hindi         | hi   | Male, Female | SYSPIN VITS  | 22050 Hz    |
| Bengali       | bn   | Male, Female | SYSPIN VITS  | 22050 Hz    |
| Marathi       | mr   | Male, Female | SYSPIN VITS  | 22050 Hz    |
| Telugu        | te   | Male, Female | SYSPIN VITS  | 22050 Hz    |
| Kannada       | kn   | Male, Female | SYSPIN VITS  | 22050 Hz    |
| Bhojpuri      | bho  | Male, Female | SYSPIN VITS  | 22050 Hz    |
| Chhattisgarhi | hne  | Male, Female | SYSPIN VITS  | 22050 Hz    |
| Maithili      | mai  | Male, Female | SYSPIN VITS  | 22050 Hz    |
| Magahi        | mag  | Male, Female | SYSPIN VITS  | 22050 Hz    |
| English       | en   | Male, Female | SYSPIN VITS  | 22050 Hz    |
| Gujarati      | gu   | Neutral      | Facebook MMS | 16000 Hz    |
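
The language table maps onto voice identifiers of the form used elsewhere in this report (`hi_male`, `hi_female`, `gu_mms`). The sketch below enumerates them under the assumption that every SYSPIN language follows the same `<code>_<gender>` pattern; only the three IDs just mentioned are confirmed by the report, and the rest are extrapolated.

```python
# Language codes served by the SYSPIN VITS models (male and female voices each).
SYSPIN_LANGUAGES = ["hi", "bn", "mr", "te", "kn", "bho", "hne", "mai", "mag", "en"]


def build_voice_ids():
    """Return voice IDs following the assumed `<code>_<gender>` pattern."""
    voices = [
        f"{code}_{gender}"
        for code in SYSPIN_LANGUAGES
        for gender in ("male", "female")
    ]
    voices.append("gu_mms")  # single neutral Gujarati voice from Facebook MMS
    return voices


VOICE_IDS = build_voice_ids()
print(len(VOICE_IDS))  # 10 languages x 2 voices + 1 Gujarati voice
```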

---

## 4. Implementation Details

### 4.1 Technology Stack

| Component         | Technology                               |
| ----------------- | ---------------------------------------- |
| Backend Framework | FastAPI                                  |
| ML Framework      | PyTorch                                  |
| TTS Models        | VITS (Coqui AI / SYSPIN), MMS (Facebook) |
| Audio Processing  | librosa, soundfile, scipy                |
| Model Hub         | Hugging Face Hub                         |
| API Documentation | OpenAPI/Swagger                          |

### 4.2 Model Architecture - VITS

VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech), a conditional variational autoencoder trained with adversarial learning, was chosen for:

- **End-to-End Efficiency**: Combines acoustic modeling and vocoding in a single pass
- **High Quality**: Natural-sounding speech comparable to two-stage systems
- **Multi-Speaker Support**: Supports different speakers via embeddings
- **Fast Inference**: TorchScript JIT compilation for speed

### 4.3 Style/Accent Transfer Implementation

Our style transfer uses a **post-processing** approach for simplicity and reliability:

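1. **Pitch Shifting**: Phase-vocoder time stretch followed by resampling, via librosa

   ```python
   import librosa
   import numpy as np
   semitones = 12 * np.log2(pitch_factor)  # frequency ratio -> semitones
   shifted = librosa.effects.pitch_shift(audio, sr=sr, n_steps=semitones)
   ```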

2. **Time Stretching**: Phase vocoder via librosa

   ```python
   stretched = librosa.effects.time_stretch(audio, rate=speed_factor)
   ```

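3. **Energy Control**: Soft clipping for natural sound

   ```python
   import numpy as np

   modified = audio * energy_factor  # linear gain
   if energy_factor > 1.0:
       # Soft clip: tanh bounds peaks to ±0.95; the 2x drive adds gain before saturation
       modified = np.tanh(modified * 2) * 0.95
   ```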
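
To make the pitch conversion concrete, the short sketch below evaluates the semitone formula for the pitch factors used by the style presets. It uses only the standard library, and `pitch_factor_to_semitones` is an illustrative name rather than a function from the codebase.

```python
import math


def pitch_factor_to_semitones(pitch_factor: float) -> float:
    """Convert a frequency ratio to semitones (12 semitones per octave)."""
    return 12 * math.log2(pitch_factor)


for factor in (0.9, 0.95, 1.0, 1.05, 1.1, 1.15):
    print(f"{factor:.2f} -> {pitch_factor_to_semitones(factor):+.2f} semitones")
```

A factor of 1.1 corresponds to roughly +1.65 semitones, and a factor of 2.0 to exactly one octave (+12 semitones), so the preset range stays within about two semitones of the original pitch.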

### 4.4 Key Design Decisions

1. **TorchScript Models**: JIT-compiled for faster inference
2. **Lazy Loading**: Models loaded on-demand to minimize memory
3. **CPU Fallback**: Falls back to CPU inference where Apple Silicon MPS is not fully supported
4. **Streaming Support**: Progressive audio delivery for real-time apps
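
Lazy loading (decision 2) can be sketched with a small cache in front of the model loader. This is one possible shape, not the engine's actual code: `_load_model_from_disk` is a hypothetical stand-in for the expensive load, and the cache size of 4 is an arbitrary assumption.

```python
from functools import lru_cache

load_count = 0  # counts real loads, only to make the caching visible


def _load_model_from_disk(voice: str) -> dict:
    """Hypothetical stand-in for the expensive model load."""
    global load_count
    load_count += 1
    return {"voice": voice}


@lru_cache(maxsize=4)  # assumption: keep at most 4 voices resident
def get_model(voice: str) -> dict:
    """Load a voice on first use and reuse it on later requests."""
    return _load_model_from_disk(voice)


get_model("hi_male")
get_model("hi_male")  # served from the cache, no second load
get_model("gu_mms")
print(load_count)
```

Bounding the cache trades memory for reload time: with roughly 300MB per VITS voice, an unbounded cache holding every voice would need several gigabytes.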

---

## 5. Usage Examples

### 5.1 Python API

```python
from src.engine import TTSEngine

# Initialize engine
engine = TTSEngine(device="auto")

# Basic synthesis
output = engine.synthesize(
    text="गर्भावस्था में स्वस्थ आहार बहुत महत्वपूर्ण है",
    voice="hi_female"
)

# With style control
output = engine.synthesize(
    text="आपका दिन शुभ हो",
    voice="hi_male",
    style="happy",
    pitch=1.1
)

# Gujarati
output = engine.synthesize(
    text="સ્વસ્થ રહો, ખુશ રહો",
    voice="gu_mms",
    style="calm"
)
```

### 5.2 REST API

```bash
# Basic synthesis
curl -X POST "http://localhost:8000/synthesize" \
  -H "Content-Type: application/json" \
  -d '{"text": "नमस्ते", "voice": "hi_male"}' \
  --output speech.wav

# With style
curl -X POST "http://localhost:8000/synthesize" \
  -H "Content-Type: application/json" \
  -d '{"text": "आपका स्वागत है", "voice": "hi_female", "style": "happy"}' \
  --output welcome.wav

# Gujarati
curl -X POST "http://localhost:8000/synthesize" \
  -H "Content-Type: application/json" \
  -d '{"text": "નમસ્તે", "voice": "gu_mms"}' \
  --output gujarati.wav
```

### 5.3 Command Line

```bash
# Download models
python -m src.cli download --voice hi_male
python -m src.cli download --lang hi  # All Hindi voices

# Synthesize
python -m src.cli synthesize --text "नमस्ते" --voice hi_male --output hello.wav

# Start server
python -m src.cli serve --port 8000
```

---

## 6. Healthcare Use Case

### 6.1 Target Application

The TTS system is designed for integration with an **LLM-based healthcare assistant** for pregnant mothers in low-income communities.

### 6.2 Key Features for Healthcare

1. **Multi-lingual Support**: Information in native languages
2. **Calm Style Preset**: Reassuring tone for medical guidance
3. **Slow Speed Option**: Clear pronunciation for instructions
4. **Low Latency**: Real-time conversational responses

### 6.3 Example Healthcare Dialogue

```
User: "ગર્ભાવસ્થામાં શું ખાવું જોઈએ?"

System Response (TTS with calm style in Gujarati):
"ગર્ભાવસ્થામાં તમારે પ્રોટીન, આયર્ન અને ફોલિક એસિડથી ભરપૂર
ખોરાક લેવો જોઈએ. દાળ, પાલક, ઈંડા અને દૂધ સારા વિકલ્પો છે."
```

---

## 7. Performance Benchmarks

| Test                    | Time  | Notes                              |
| ----------------------- | ----- | ---------------------------------- |
| Hindi synthesis (short) | 0.25s | "नमस्ते"                           |
| Hindi synthesis (long)  | 0.45s | 50-word sentence                   |
| Gujarati MMS            | 0.35s | First load includes model download |
| Style processing        | +0.1s | Pitch + speed adjustment           |
| API round-trip          | 0.5s  | Including network overhead         |

Hardware: Apple M2 Pro, 16GB RAM, CPU inference
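
The report does not describe how these timings were collected. A minimal harness along the following lines could be used to reproduce them. `benchmark` is a hypothetical helper, and the placeholder workload stands in for a real synthesis call.

```python
import statistics
import time


def benchmark(fn, runs=5, warmup=1):
    """Return the median wall-clock time of `fn()` in seconds."""
    for _ in range(warmup):
        fn()  # warm-up run, e.g. to absorb one-time model loading
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Placeholder workload; in practice this would wrap a synthesis call such as
# lambda: engine.synthesize(text="नमस्ते", voice="hi_male").
elapsed = benchmark(lambda: sum(range(100_000)))
print(f"median: {elapsed:.4f}s")
```

Reporting the median after a warm-up run keeps one-off costs, such as the first model load, out of the steady-state figure.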

---

## 8. Deployment

### 8.1 Quick Start

```bash
# Clone repository
git clone https://github.com/harshil748/VoiceAPI
cd VoiceAPI

# Setup environment
python3 -m venv tts
source tts/bin/activate
pip install -r requirements.txt

# Download a model
python -m src.cli download --voice hi_male

# Start server
python -m src.cli serve --port 8000
```

### 8.2 Docker

```dockerfile
FROM python:3.10-slim
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt
RUN python -m src.cli download --lang hi
EXPOSE 8000
CMD ["python", "-m", "src.cli", "serve"]
```

---

## 9. Limitations and Future Work

### 9.1 Current Limitations

1. **Model Size**: Each VITS model is ~300MB
2. **MPS Compatibility**: Apple Silicon MPS not fully supported
3. **Real-time Streaming**: Limited to sentence-level
4. **Gujarati Gender**: MMS has only neutral voice

### 9.2 Future Improvements

1. **Model Quantization**: INT8 for smaller size
2. **Voice Cloning**: Reference audio-based synthesis
3. **SSML Support**: Markup language for fine control
4. **More Languages**: Odia, Assamese, Punjabi
5. **Fine-tuning**: Custom voice training on SPICOR data

---

## 10. Credits

### Model Sources

| Source                  | Models                | License      |
| ----------------------- | --------------------- | ------------ |
| SYSPIN (IISc Bangalore) | VITS for 10 languages | CC BY 4.0    |
| Facebook MMS            | Gujarati VITS         | CC BY-NC 4.0 |

### Dataset

- **SPICOR TTS Project**: IISc SPIRE Lab, Bangalore
- **Audio Quality**: 48kHz, 24-bit, mono

### Frameworks

- Coqui TTS, Hugging Face Transformers, FastAPI, librosa

---

## 11. Conclusion

We have developed a comprehensive multi-lingual TTS system that:

✅ Supports **11 Indian languages** with 21 voice variants  
✅ Provides **9 style presets** for prosody control  
✅ Offers a **REST API** with OpenAPI documentation  
✅ Achieves **<500ms latency** for typical sentences  
✅ Is **production-ready** with proper error handling

The system is well-suited for the healthcare assistant use case, providing clear, natural-sounding speech in native languages to help pregnant mothers access healthcare information.

---

**Repository**: https://github.com/harshil748/VoiceAPI  
**API Documentation**: http://localhost:8000/docs (available while the server is running locally)