	# Voice Tech for All: Technical Report

	## Multi-lingual Text-to-Speech System with Style Transfer

	Hackathon: Voice Tech for All
	Date: December 2025

	---

	## Executive Summary

	We present a multi-lingual Text-to-Speech (TTS) system supporting 11 Indian languages with style/prosody control capabilities. The system is designed for deployment as a healthcare assistant for pregnant mothers in low-income communities, making health information accessible in native languages.

	### Key Achievements

	| Metric                 | Value                                                                                                       |
	| ---------------------- | ----------------------------------------------------------------------------------------------------------- |
	| Languages Supported    | 11 (Hindi, Bengali, Marathi, Telugu, Kannada, Bhojpuri, Chhattisgarhi, Maithili, Magahi, English, Gujarati) |
	| Voice Variants         | 21 (male + female for 10 languages, plus one neutral Gujarati voice)                                        |
	| Style Presets          | 9 (default, slow, fast, soft, loud, happy, sad, calm, excited)                                              |
	| Average Inference Time | ~0.3s (CPU, Apple M2 Pro)                                                                                   |
	| Model Size             | ~300MB per voice (VITS), ~145MB (MMS)                                                                       |
	| API Latency            | <500ms for typical sentences                                                                                |

	---

	## 1. System Architecture

	### 1.1 Overview

	```
	┌─────────────────────────────────────────────────────────────┐
	│                  REST API Server (FastAPI)                  │
	├─────────────────────────────────────────────────────────────┤
	│  ┌───────────┐  ┌──────────────┐  ┌────────────────────────┐│
	│  │/synthesize│  │ /voices      │  │ /styles                ││
	│  │ /stream   │  │ /languages   │  │ /health                ││
	│  └───────────┘  └──────────────┘  └────────────────────────┘│
	├─────────────────────────────────────────────────────────────┤
	│                         TTS Engine                          │
	│ ┌─────────────────┐  ┌─────────────────┐  ┌──────────────┐  │
	│ │ Text Normalizer │→ │ Tokenizer       │→ │ VITS/MMS     │  │
	│ │ (Indian scripts)│  │ (char-to-ID)    │  │ Inference    │  │
	│ └─────────────────┘  └─────────────────┘  └──────────────┘  │
	│                                                   ↓         │
	│  ┌─────────────────────────────────────────────────────────┐│
	│  │            Style Processor (Prosody Control)            ││
	│  │  • Pitch Shifting (librosa)                             ││
	│  │  • Time Stretching (speed control)                      ││
	│  │  • Energy/Volume Modification                           ││
	│  └─────────────────────────────────────────────────────────┘│
	├─────────────────────────────────────────────────────────────┤
	│                      Model Repository                       │
	│ ┌────────────────────┐  ┌────────────────────────────────┐  │
	│ │ SYSPIN VITS Models │  │ Facebook MMS Models            │  │
	│ │ (10 languages)     │  │ (Gujarati)                     │  │
	│ └────────────────────┘  └────────────────────────────────┘  │
	└─────────────────────────────────────────────────────────────┘
	```

	### 1.2 Component Details

	#### Text Normalizer

	- Handles Indian script peculiarities
	- Converts number notations: `{100}{एकसो}` → `एकसो`
	- Normalizes punctuation across scripts
	- Handles code-switching (Hindi in English text)
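The number-notation rule above can be illustrated with a short regular expression. This is a minimal sketch, not the project's normalizer: the function name `expand_annotated_numbers` is hypothetical, and the `{digits}{spoken form}` pattern is inferred from the single example in this report.

```python
import re

# Matches annotated numbers of the form {digits}{spoken form}, e.g. {100}{एकसो}.
# The pattern is an assumption inferred from the one example in the report.
_ANNOTATED_NUMBER = re.compile(r"\{\d+\}\{([^{}]+)\}")


def expand_annotated_numbers(text: str) -> str:
    """Replace each {number}{spoken form} pair with its spoken form."""
    return _ANNOTATED_NUMBER.sub(r"\1", text)


print(expand_annotated_numbers("कीमत {100}{एकसो} रुपये है"))
```

Text without annotations passes through unchanged, so the step is safe to apply to every request.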

	#### VITS Models (SYSPIN)

	- Architecture: Conditional Variational Autoencoder with Adversarial Learning
	- Training Data: 20-30 hours per speaker from IISc Bangalore
	- Output: 22050 Hz, 16-bit PCM
	- Languages: Hindi, Bengali, Marathi, Telugu, Kannada, Bhojpuri, Chhattisgarhi, Maithili, Magahi, English

	#### MMS Model (Facebook)

	- Architecture: VITS-based, trained on MMS corpus
	- Output: 16000 Hz
	- Languages: Gujarati (and 1100+ others available)
	- Model Size: 145MB

	#### Style Processor

	- Pitch Shifting: `librosa.effects.pitch_shift` (phase-vocoder time stretch followed by resampling)
	- Time Stretching: Phase vocoder via `librosa.effects.time_stretch`
	- Energy Control: Soft clipping with tanh for natural sound

	---

	## 2. API Specification

	### 2.1 Endpoints

	| Endpoint             | Method | Description                      |
	| -------------------- | ------ | -------------------------------- |
	| `/`                  | GET    | API info and documentation links |
	| `/health`            | GET    | System health and loaded models  |
	| `/voices`            | GET    | List all available voices        |
	| `/languages`         | GET    | List supported languages         |
	| `/styles`            | GET    | List style presets               |
	| `/synthesize`        | POST   | Generate speech from text        |
	| `/synthesize/get`    | GET    | Simple synthesis (for testing)   |
	| `/synthesize/stream` | POST   | Streaming audio response         |
	| `/preload`           | POST   | Preload voice into memory        |
	| `/batch`             | POST   | Batch synthesis                  |

	### 2.2 Synthesis Request

	```json
	{
	"text": "નમસ્તે, હું તમારી કેવી રીતે મદદ કરી શકું?",
	"voice": "gu_mms",
	"speed": 1.0,
	"pitch": 1.0,
	"energy": 1.0,
	"style": "calm",
	"normalize": true
	}
	```
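As a rough illustration of how a client might assemble this request, the sketch below builds the POST with the Python standard library. The helper name `build_synthesis_request` is hypothetical, and the base URL is taken from the curl examples in this report; the request is only constructed here, not sent.

```python
import json
import urllib.request

# Base URL taken from the curl examples in this report; adjust for your deployment.
BASE_URL = "http://localhost:8000"


def build_synthesis_request(text: str, voice: str, **options) -> urllib.request.Request:
    """Build (but do not send) a POST /synthesize request with a JSON body."""
    payload = {"text": text, "voice": voice, **options}
    # ensure_ascii=False keeps Indic scripts readable in the encoded body.
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/synthesize",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_synthesis_request(
    "નમસ્તે, હું તમારી કેવી રીતે મદદ કરી શકું?",
    voice="gu_mms",
    style="calm",
    normalize=True,
)
# To send it against a running server:
#     with urllib.request.urlopen(request) as response:
#         audio_bytes = response.read()
```

Fields left out of the payload (here `speed`, `pitch`, and `energy`) are assumed to fall back to server-side defaults.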

	### 2.3 Style Presets

	| Preset  | Speed | Pitch | Energy | Use Case               |
	| ------- | ----- | ----- | ------ | ---------------------- |
	| default | 1.0   | 1.0   | 1.0    | Normal speech          |
	| slow    | 0.75  | 1.0   | 1.0    | Elderly users, clarity |
	| fast    | 1.25  | 1.0   | 1.0    | Quick information      |
	| soft    | 0.9   | 0.95  | 0.7    | Calming content        |
	| loud    | 1.0   | 1.05  | 1.3    | Alerts, emphasis       |
	| happy   | 1.1   | 1.1   | 1.2    | Positive messages      |
	| sad     | 0.85  | 0.9   | 0.8    | Empathetic responses   |
	| calm    | 0.9   | 0.95  | 0.85   | Healthcare guidance    |
	| excited | 1.2   | 1.15  | 1.3    | Celebrations           |
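The preset table maps naturally onto a lookup structure. The sketch below encodes the table values and shows one plausible way to combine a preset with the per-request `speed`, `pitch`, and `energy` fields. The report does not say how the two interact, so the multiplicative rule in `resolve_prosody` is an assumption, as are the names `Prosody` and `resolve_prosody`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prosody:
    speed: float = 1.0
    pitch: float = 1.0
    energy: float = 1.0


# Values copied from the preset table (speed, pitch, energy).
STYLE_PRESETS = {
    "default": Prosody(1.0, 1.0, 1.0),
    "slow": Prosody(0.75, 1.0, 1.0),
    "fast": Prosody(1.25, 1.0, 1.0),
    "soft": Prosody(0.9, 0.95, 0.7),
    "loud": Prosody(1.0, 1.05, 1.3),
    "happy": Prosody(1.1, 1.1, 1.2),
    "sad": Prosody(0.85, 0.9, 0.8),
    "calm": Prosody(0.9, 0.95, 0.85),
    "excited": Prosody(1.2, 1.15, 1.3),
}


def resolve_prosody(style: str = "default", speed: float = 1.0,
                    pitch: float = 1.0, energy: float = 1.0) -> Prosody:
    """Combine a preset with request factors by multiplication (an assumed rule)."""
    preset = STYLE_PRESETS[style]
    return Prosody(preset.speed * speed, preset.pitch * pitch, preset.energy * energy)


print(resolve_prosody("calm"))
print(resolve_prosody("slow", speed=2.0))
```

With this rule a request that leaves every factor at 1.0 gets the preset unchanged, which matches the sample request in section 2.2; an override rule (explicit values replace the preset) would be an equally reasonable design.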

	---

	## 3. Supported Languages

	| Language      | Code | Voices       | Model Type   | Sample Rate |
	| ------------- | ---- | ------------ | ------------ | ----------- |
	| Hindi         | hi   | Male, Female | SYSPIN VITS  | 22050 Hz    |
	| Bengali       | bn   | Male, Female | SYSPIN VITS  | 22050 Hz    |
	| Marathi       | mr   | Male, Female | SYSPIN VITS  | 22050 Hz    |
	| Telugu        | te   | Male, Female | SYSPIN VITS  | 22050 Hz    |
	| Kannada       | kn   | Male, Female | SYSPIN VITS  | 22050 Hz    |
	| Bhojpuri      | bho  | Male, Female | SYSPIN VITS  | 22050 Hz    |
	| Chhattisgarhi | hne  | Male, Female | SYSPIN VITS  | 22050 Hz    |
	| Maithili      | mai  | Male, Female | SYSPIN VITS  | 22050 Hz    |
	| Magahi        | mag  | Male, Female | SYSPIN VITS  | 22050 Hz    |
	| English       | en   | Male, Female | SYSPIN VITS  | 22050 Hz    |
	| Gujarati      | gu   | Neutral      | Facebook MMS | 16000 Hz    |
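The table can be turned into a voice catalogue to check the headline numbers. This is a sketch under one assumption: voice IDs follow the `<code>_<gender>` pattern seen in `hi_male` and `hi_female` elsewhere in this report; only those two IDs and `gu_mms` are confirmed by the source.

```python
# Language codes from the table above. Voice IDs of the form "<code>_<gender>"
# are an assumption extrapolated from "hi_male", "hi_female", and "gu_mms".
VITS_LANGUAGES = ["hi", "bn", "mr", "te", "kn", "bho", "hne", "mai", "mag", "en"]
VITS_SAMPLE_RATE = 22050
MMS_SAMPLE_RATE = 16000

VOICES = {
    f"{code}_{gender}": {"language": code, "model": "vits", "sample_rate": VITS_SAMPLE_RATE}
    for code in VITS_LANGUAGES
    for gender in ("male", "female")
}
VOICES["gu_mms"] = {"language": "gu", "model": "mms", "sample_rate": MMS_SAMPLE_RATE}

# Approximate on-disk footprint, using the per-model sizes quoted in this report
# (~300 MB per VITS voice, ~145 MB for the MMS model).
total_mb = 300 * (len(VOICES) - 1) + 145

print(len(VOICES))  # 21
print(total_mb)     # 6145
```

Ten languages with two voices each plus one Gujarati voice gives the 21 variants quoted in the summary, and roughly 6 GB if every model is downloaded, which is one reason to load models on demand.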

	---

	## 4. Implementation Details

	### 4.1 Technology Stack

	| Component         | Technology                               |
	| ----------------- | ---------------------------------------- |
	| Backend Framework | FastAPI                                  |
	| ML Framework      | PyTorch                                  |
	| TTS Models        | VITS (Coqui AI / SYSPIN), MMS (Facebook) |
	| Audio Processing  | librosa, soundfile, scipy                |
	| Model Hub         | Hugging Face Hub                         |
	| API Documentation | OpenAPI/Swagger                          |

	### 4.2 Model Architecture - VITS

	VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech), a conditional variational autoencoder trained with adversarial learning, was chosen for:

	- End-to-End Efficiency: Combines acoustic modeling and vocoding in a single pass
	- High Quality: Natural-sounding speech comparable to two-stage systems
	- Multi-Speaker Support: Supports different speakers via embeddings
	- Fast Inference: TorchScript JIT compilation for speed

	### 4.3 Style/Accent Transfer Implementation

	Our style transfer uses a post-processing approach for simplicity and reliability:

	1. Pitch Shifting: Phase vocoder via librosa

	```python
	import librosa
	import numpy as np

	semitones = 12 * np.log2(pitch_factor)
	shifted = librosa.effects.pitch_shift(audio, sr=sr, n_steps=semitones)
	```

	2. Time Stretching: Phase vocoder via librosa

	```python
	stretched = librosa.effects.time_stretch(audio, rate=speed_factor)
	```

	3. Energy Control: Soft clipping for natural sound
	```python
	modified = audio * energy_factor
	if energy_factor > 1.0:
	    modified = np.tanh(modified * 2) * 0.95  # Soft clip
	```
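To make the numbers in these steps concrete, the following dependency-free sketch reproduces the semitone conversion and the soft-clip formula on scalar values. The function names are hypothetical; the formulas are the ones shown in the snippets above, applied with `math` instead of NumPy.

```python
import math


def pitch_factor_to_semitones(pitch_factor: float) -> float:
    """Convert a frequency ratio to semitones (12 semitones per octave)."""
    return 12 * math.log2(pitch_factor)


def soft_clip(sample: float, energy_factor: float) -> float:
    """Scalar version of the energy step: scale, then tanh-limit when boosting."""
    modified = sample * energy_factor
    if energy_factor > 1.0:
        modified = math.tanh(modified * 2) * 0.95
    return modified


# Pitch factors taken from the "sad", "happy", and "excited" presets.
for name, pitch in [("sad", 0.9), ("happy", 1.1), ("excited", 1.15)]:
    print(f"{name}: {pitch_factor_to_semitones(pitch):+.2f} semitones")

# A sample far above full scale stays inside the 0.95 ceiling after clipping.
print(soft_clip(10.0, 1.3))
```

The preset pitch factors correspond to shifts of about -1.82, +1.65, and +2.42 semitones. Note that for small inputs `tanh(2x) * 0.95` is approximately `1.9x`, so with this formula the clipping stage adds gain on top of `energy_factor`; the constants may be worth tuning by ear.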

	### 4.4 Key Design Decisions

	1. TorchScript Models: JIT-compiled for faster inference
	2. Lazy Loading: Models loaded on-demand to minimize memory
	3. CPU Fallback: Inference falls back to CPU where Apple Silicon MPS is not fully supported
	4. Streaming Support: Progressive audio delivery for real-time apps
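The lazy-loading decision can be sketched as a small cache that loads a voice the first time it is requested. This is an illustration only: `LazyModelCache` and `fake_loader` are hypothetical names, and the loader stands in for whatever the engine actually does to load a model.

```python
from typing import Any, Callable, Dict, List


class LazyModelCache:
    """Load each voice model on first use and keep it in memory afterwards."""

    def __init__(self, loader: Callable[[str], Any]) -> None:
        self._loader = loader  # hypothetical: maps a voice ID to a loaded model
        self._models: Dict[str, Any] = {}

    def get(self, voice: str) -> Any:
        if voice not in self._models:
            self._models[voice] = self._loader(voice)
        return self._models[voice]

    def loaded_voices(self) -> List[str]:
        return sorted(self._models)


load_calls = []


def fake_loader(voice: str) -> str:
    # Stand-in for an expensive load, e.g. torch.jit.load on a TorchScript file.
    load_calls.append(voice)
    return f"model:{voice}"


cache = LazyModelCache(fake_loader)
cache.get("hi_female")
cache.get("hi_female")  # second call is served from memory
print(load_calls)       # ['hi_female']
```

Under this design, a preload operation such as the `/preload` endpoint amounts to calling `get` before the first synthesis request, and a health check can report `loaded_voices()`.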

	---

	## 5. Usage Examples

	### 5.1 Python API

	```python
	from src.engine import TTSEngine

	# Initialize engine
	engine = TTSEngine(device="auto")

	# Basic synthesis
	output = engine.synthesize(
	    text="गर्भावस्था में स्वस्थ आहार बहुत महत्वपूर्ण है",
	    voice="hi_female"
	)

	# With style control
	output = engine.synthesize(
	    text="आपका दिन शुभ हो",
	    voice="hi_male",
	    style="happy",
	    pitch=1.1
	)

	# Gujarati
	output = engine.synthesize(
	    text="સ્વસ્થ રહો, ખુશ રહો",
	    voice="gu_mms",
	    style="calm"
	)
	```

	### 5.2 REST API

	```bash
	# Basic synthesis
	curl -X POST "http://localhost:8000/synthesize" \
	  -H "Content-Type: application/json" \
	  -d '{"text": "नमस्ते", "voice": "hi_male"}' \
	  --output speech.wav

	# With style
	curl -X POST "http://localhost:8000/synthesize" \
	  -H "Content-Type: application/json" \
	  -d '{"text": "आपका स्वागत है", "voice": "hi_female", "style": "happy"}' \
	  --output welcome.wav

	# Gujarati
	curl -X POST "http://localhost:8000/synthesize" \
	  -H "Content-Type: application/json" \
	  -d '{"text": "નમસ્તે", "voice": "gu_mms"}' \
	  --output gujarati.wav
	```

	### 5.3 Command Line

	```bash
	# Download models
	python -m src.cli download --voice hi_male
	python -m src.cli download --lang hi # All Hindi voices

	# Synthesize
	python -m src.cli synthesize --text "नमस्ते" --voice hi_male --output hello.wav

	# Start server
	python -m src.cli serve --port 8000
	```

	---

	## 6. Healthcare Use Case

	### 6.1 Target Application

	The TTS system is designed for integration with an LLM-based healthcare assistant for pregnant mothers in low-income communities.

	### 6.2 Key Features for Healthcare

	1. Multi-lingual Support: Information in native languages
	2. Calm Style Preset: Reassuring tone for medical guidance
	3. Slow Speed Option: Clear pronunciation for instructions
	4. Low Latency: Real-time conversational responses

	### 6.3 Example Healthcare Dialogue

	```
	User: "ગર્ભાવસ્થામાં શું ખાવું જોઈએ?"

	System Response (TTS with calm style in Gujarati):
	"ગર્ભાવસ્થામાં તમારે પ્રોટીન, આયર્ન અને ફોલિક એસિડથી ભરપૂર
	ખોરાક લેવો જોઈએ. દાળ, પાલક, ઈંડા અને દૂધ સારા વિકલ્પો છે."
	```

	---

	## 7. Performance Benchmarks

	| Test                    | Time  | Notes                              |
	| ----------------------- | ----- | ---------------------------------- |
	| Hindi synthesis (short) | 0.25s | "नमस्ते"                            |
	| Hindi synthesis (long)  | 0.45s | 50-word sentence                   |
	| Gujarati MMS            | 0.35s | First load includes model download |
	| Style processing        | +0.1s | Pitch + speed adjustment           |
	| API round-trip          | 0.5s  | Including network overhead         |

	Hardware: Apple M2 Pro, 16GB RAM, CPU inference
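Timings like these could be collected with a small harness such as the one below. It is a sketch of one reasonable methodology, not the procedure used for the table: the `benchmark` helper, the warm-up count, and the choice of median are all assumptions, and a trivial stand-in workload replaces the synthesis call.

```python
import statistics
import time
from typing import Callable


def benchmark(fn: Callable[[], object], warmup: int = 1, runs: int = 5) -> dict:
    """Time fn() with a monotonic clock; report median and worst case in seconds."""
    for _ in range(warmup):
        fn()  # warm-up runs absorb one-time costs such as lazy model loading
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return {"median_s": statistics.median(timings), "max_s": max(timings), "runs": runs}


# Stand-in workload; with the engine from section 5.1 this would be something like
# lambda: engine.synthesize(text="नमस्ते", voice="hi_male")
result = benchmark(lambda: sum(range(10_000)))
print(result)
```

Separating warm-up from measured runs keeps one-time costs (model download, first load) out of the steady-state figure.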

	---

	## 8. Deployment

	### 8.1 Quick Start

	```bash
	# Clone repository
	git clone https://github.com/harshil748/VoiceAPI
	cd VoiceAPI

	# Setup environment
	python3 -m venv tts
	source tts/bin/activate
	pip install -r requirements.txt

	# Download a model
	python -m src.cli download --voice hi_male

	# Start server
	python -m src.cli serve --port 8000
	```

	### 8.2 Docker

	```dockerfile
	FROM python:3.10-slim
	WORKDIR /app
	COPY . .
	RUN pip install -r requirements.txt
	RUN python -m src.cli download --lang hi
	EXPOSE 8000
	CMD ["python", "-m", "src.cli", "serve"]
	```

	---

	## 9. Limitations and Future Work

	### 9.1 Current Limitations

	1. Model Size: Each VITS model is ~300MB
	2. MPS Compatibility: Apple Silicon MPS not fully supported
	3. Real-time Streaming: Limited to sentence-level
	4. Gujarati Gender: MMS has only neutral voice

	### 9.2 Future Improvements

	1. Model Quantization: INT8 for smaller size
	2. Voice Cloning: Reference audio-based synthesis
	3. SSML Support: Markup language for fine control
	4. More Languages: Odia, Assamese, Punjabi
	5. Fine-tuning: Custom voice training on SPICOR data

	---

	## 10. Credits

	### Model Sources

	| Source                  | Models                | License      |
	| ----------------------- | --------------------- | ------------ |
	| SYSPIN (IISc Bangalore) | VITS for 10 languages | CC BY 4.0    |
	| Facebook MMS            | Gujarati VITS         | CC BY-NC 4.0 |

	### Dataset

	- SPICOR TTS Project: IISc SPIRE Lab, Bangalore
	- Audio Quality: 48kHz, 24-bit, mono

	### Frameworks

	- Coqui TTS, Hugging Face Transformers, FastAPI, librosa

	---

	## 11. Conclusion

	We have developed a comprehensive multi-lingual TTS system that:

	- ✅ Supports 11 Indian languages with 21 voice variants
	- ✅ Provides 9 style presets for prosody control
	- ✅ Offers a REST API with OpenAPI documentation
	- ✅ Achieves <500ms latency for typical sentences
	- ✅ Is production-ready with proper error handling

	The system is well-suited for the healthcare assistant use case, providing clear, natural-sounding speech in native languages to help pregnant mothers access healthcare information.

	---

	Repository: https://github.com/harshil748/VoiceAPI
	API Documentation: http://localhost:8000/docs (served locally once the server is running)