Voice Tech for All: Technical Report

Multi-lingual Text-to-Speech System with Style Transfer

Hackathon: Voice Tech for All
Date: December 2025

Executive Summary

We present a multi-lingual Text-to-Speech (TTS) system supporting 11 languages (ten Indian languages plus English) with style and prosody control. The system is designed for deployment as a healthcare assistant for pregnant mothers in low-income communities, making health information accessible in their native languages.

Key Achievements

Metric	Value
Languages Supported	11 (Hindi, Bengali, Marathi, Telugu, Kannada, Bhojpuri, Chhattisgarhi, Maithili, Magahi, English, Gujarati)
Voice Variants	21 (male + female for each of the 10 SYSPIN languages, plus one neutral Gujarati voice)
Style Presets	9 (default, slow, fast, soft, loud, happy, sad, calm, excited)
Average Inference Time	~0.3s (CPU, Apple M2 Pro)
Model Size	~300MB per voice (VITS), ~145MB (MMS)
API Latency	~500ms round trip for typical sentences

1. System Architecture

1.1 Overview

┌─────────────────────────────────────────────────────────────┐
│                    REST API Server (FastAPI)                 │
├─────────────────────────────────────────────────────────────┤
│  ┌──────────┐  ┌──────────────┐  ┌─────────────────────────┐│
│  │/synthesize│  │ /voices     │  │ /styles               ││
│  │ /stream   │  │ /languages  │  │ /health               ││
│  └──────────┘  └──────────────┘  └─────────────────────────┘│
├─────────────────────────────────────────────────────────────┤
│                      TTS Engine                              │
│  ┌─────────────────┐  ┌─────────────────┐  ┌──────────────┐ │
│  │ Text Normalizer │→ │ Tokenizer       │→ │ VITS/MMS    │ │
│  │ (Indian scripts)│  │ (char-to-ID)    │  │ Inference   │ │
│  └─────────────────┘  └─────────────────┘  └──────────────┘ │
│                              ↓                               │
│  ┌─────────────────────────────────────────────────────────┐│
│  │              Style Processor (Prosody Control)          ││
│  │  • Pitch Shifting (librosa)                             ││
│  │  • Time Stretching (speed control)                      ││
│  │  • Energy/Volume Modification                           ││
│  └─────────────────────────────────────────────────────────┘│
├─────────────────────────────────────────────────────────────┤
│                    Model Repository                          │
│  ┌────────────────────┐  ┌────────────────────────────────┐ │
│  │ SYSPIN VITS Models │  │ Facebook MMS Models            │ │
│  │ (10 languages)     │  │ (Gujarati)                     │ │
│  └────────────────────┘  └────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘

1.2 Component Details

Text Normalizer

Handles Indian script peculiarities
Converts number notations: {100}{एकसो} → एकसो
Normalizes punctuation across scripts
Handles code-switching (Hindi in English text)
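
The number-notation rule above can be sketched with a single regular expression. This is a minimal illustration, assuming every number is marked as a `{written}{spoken}` pair exactly as in the example; the name `expand_number_notation` is hypothetical and not taken from the repository.

```python
import re

# Matches one "{written}{spoken}" pair, e.g. "{100}{एकसो}".
PAIR_PATTERN = re.compile(r"\{([^{}]*)\}\{([^{}]*)\}")


def expand_number_notation(text: str) -> str:
    """Keep only the spoken form of every {written}{spoken} pair."""
    return PAIR_PATTERN.sub(lambda match: match.group(2), text)


print(expand_number_notation("{100}{एकसो} रुपये"))  # prints: एकसो रुपये
```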

VITS Models (SYSPIN)

Architecture: Conditional Variational Autoencoder with Adversarial Learning
Training Data: 20-30 hours per speaker from IISc Bangalore
Output: 22050 Hz, 16-bit PCM
Languages: Hindi, Bengali, Marathi, Telugu, Kannada, Bhojpuri, Chhattisgarhi, Maithili, Magahi, English
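
To make the output format concrete, the sketch below packs floating-point samples into a 22050 Hz, 16-bit mono WAV using only the Python standard library. It assumes the model returns samples in the range [-1.0, 1.0]; `float_to_wav_bytes` is a hypothetical helper, not the repository's API.

```python
import io
import math
import struct
import wave


def float_to_wav_bytes(samples, sample_rate=22050):
    """Encode float samples in [-1.0, 1.0] as 16-bit mono PCM WAV bytes."""
    pcm = bytearray()
    for value in samples:
        clipped = max(-1.0, min(1.0, value))  # guard against overshoot
        pcm += struct.pack("<h", int(round(clipped * 32767)))
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(pcm))
    return buffer.getvalue()


# One second of a 440 Hz tone as stand-in audio.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 22050) for n in range(22050)]
wav_bytes = float_to_wav_bytes(tone)
```

The same helper would be called with `sample_rate=16000` for the Gujarati MMS voice.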

MMS Model (Facebook)

Architecture: VITS-based, trained on MMS corpus
Output: 16000 Hz
Languages: Gujarati (and 1100+ others available)
Model Size: 145MB

Style Processor

Pitch Shifting: librosa pitch_shift (phase-vocoder time stretch followed by resampling)
Time Stretching: Phase vocoder via librosa time_stretch
Energy Control: Soft clipping with tanh for natural sound

2. API Specification

2.1 Endpoints

Endpoint	Method	Description
`/`	GET	API info and documentation links
`/health`	GET	System health and loaded models
`/voices`	GET	List all available voices
`/languages`	GET	List supported languages
`/styles`	GET	List style presets
`/synthesize`	POST	Generate speech from text
`/synthesize/get`	GET	Simple synthesis (for testing)
`/synthesize/stream`	POST	Streaming audio response
`/preload`	POST	Preload voice into memory
`/batch`	POST	Batch synthesis

2.2 Synthesis Request

{
    "text": "નમસ્તે, હું તમારી કેવી રીતે મદદ કરી શકું?",
    "voice": "gu_mms",
    "speed": 1.0,
    "pitch": 1.0,
    "energy": 1.0,
    "style": "calm",
    "normalize": true
}
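
A client can assemble this request with the Python standard library alone. The sketch below only builds the request object and does not send it. The field names and URL come from this report, while `build_synthesis_request` is a hypothetical helper and the server is assumed to be running locally on port 8000.

```python
import json
import urllib.request


def build_synthesis_request(text, voice, base_url="http://localhost:8000", **options):
    """Build a POST /synthesize request with a JSON body (does not send it)."""
    payload = {"text": text, "voice": voice, **options}
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/synthesize",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_synthesis_request("નમસ્તે", "gu_mms", style="calm", speed=1.0)
# To send it against a running server:
#     with urllib.request.urlopen(request) as response:
#         audio_bytes = response.read()
```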

2.3 Style Presets

Preset	Speed	Pitch	Energy	Use Case
default	1.0	1.0	1.0	Normal speech
slow	0.75	1.0	1.0	Elderly users, clarity
fast	1.25	1.0	1.0	Quick information
soft	0.9	0.95	0.7	Calming content
loud	1.0	1.05	1.3	Alerts, emphasis
happy	1.1	1.1	1.2	Positive messages
sad	0.85	0.9	0.8	Empathetic responses
calm	0.9	0.95	0.85	Healthcare guidance
excited	1.2	1.15	1.3	Celebrations
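
The preset table can be read as a lookup from preset name to three multipliers. The sketch below encodes the table and, as an assumption this report does not confirm, combines a preset with explicit `speed`, `pitch`, and `energy` values by multiplication. `StyleParams` and `resolve_style` are hypothetical names.

```python
from typing import NamedTuple


class StyleParams(NamedTuple):
    speed: float
    pitch: float
    energy: float


# Values copied from the preset table above.
STYLE_PRESETS = {
    "default": StyleParams(1.0, 1.0, 1.0),
    "slow": StyleParams(0.75, 1.0, 1.0),
    "fast": StyleParams(1.25, 1.0, 1.0),
    "soft": StyleParams(0.9, 0.95, 0.7),
    "loud": StyleParams(1.0, 1.05, 1.3),
    "happy": StyleParams(1.1, 1.1, 1.2),
    "sad": StyleParams(0.85, 0.9, 0.8),
    "calm": StyleParams(0.9, 0.95, 0.85),
    "excited": StyleParams(1.2, 1.15, 1.3),
}


def resolve_style(style="default", speed=1.0, pitch=1.0, energy=1.0):
    """Combine a named preset with explicit factors (assumed multiplicative)."""
    if style not in STYLE_PRESETS:
        raise ValueError(f"unknown style preset: {style!r}")
    preset = STYLE_PRESETS[style]
    return StyleParams(preset.speed * speed, preset.pitch * pitch, preset.energy * energy)
```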

3. Supported Languages

Language	Code	Voices	Model Type	Sample Rate
Hindi	hi	Male, Female	SYSPIN VITS	22050 Hz
Bengali	bn	Male, Female	SYSPIN VITS	22050 Hz
Marathi	mr	Male, Female	SYSPIN VITS	22050 Hz
Telugu	te	Male, Female	SYSPIN VITS	22050 Hz
Kannada	kn	Male, Female	SYSPIN VITS	22050 Hz
Bhojpuri	bho	Male, Female	SYSPIN VITS	22050 Hz
Chhattisgarhi	hne	Male, Female	SYSPIN VITS	22050 Hz
Maithili	mai	Male, Female	SYSPIN VITS	22050 Hz
Magahi	mag	Male, Female	SYSPIN VITS	22050 Hz
English	en	Male, Female	SYSPIN VITS	22050 Hz
Gujarati	gu	Neutral	Facebook MMS	16000 Hz
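
The table implies 21 voices: two per SYSPIN language plus one neutral Gujarati voice. The sketch below enumerates them. The identifiers `hi_male`, `hi_female`, and `gu_mms` appear in this report; extending the `<code>_<gender>` pattern to the other languages is an assumption.

```python
SYSPIN_LANGUAGE_CODES = ["hi", "bn", "mr", "te", "kn", "bho", "hne", "mai", "mag", "en"]


def build_voice_table():
    """Map each voice ID to its model type and output sample rate in Hz."""
    voices = {}
    for code in SYSPIN_LANGUAGE_CODES:
        for gender in ("male", "female"):
            voices[f"{code}_{gender}"] = {"model": "SYSPIN VITS", "sample_rate": 22050}
    voices["gu_mms"] = {"model": "Facebook MMS", "sample_rate": 16000}
    return voices


VOICES = build_voice_table()
print(len(VOICES))  # 21
```

Because the two model families use different sample rates, a client should read the rate from the voice entry rather than assume 22050 Hz.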

4. Implementation Details

4.1 Technology Stack

Component	Technology
Backend Framework	FastAPI
ML Framework	PyTorch
TTS Models	VITS (Coqui AI / SYSPIN), MMS (Facebook)
Audio Processing	librosa, soundfile, scipy
Model Hub	Hugging Face Hub
API Documentation	OpenAPI/Swagger

4.2 Model Architecture - VITS

VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech), a conditional variational autoencoder trained with adversarial learning, was chosen for:

End-to-End Efficiency: Combines acoustic modeling and vocoding in a single pass
High Quality: Natural-sounding speech comparable to two-stage systems
Multi-Speaker Support: Supports different speakers via embeddings
Fast Inference: TorchScript JIT compilation for speed

4.3 Style/Accent Transfer Implementation

Our style transfer uses a post-processing approach for simplicity and reliability:

Pitch Shifting: Phase vocoder via librosa

import librosa
import numpy as np
semitones = 12 * np.log2(pitch_factor)  # frequency ratio -> semitones
shifted = librosa.effects.pitch_shift(audio, sr=sr, n_steps=semitones)

Time Stretching: Phase vocoder via librosa

stretched = librosa.effects.time_stretch(audio, rate=speed_factor)

Energy Control: Soft clipping for natural sound

modified = audio * energy_factor
if energy_factor > 1.0:
    modified = np.tanh(modified * 2) * 0.95  # Soft clip
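
Two numeric details of the steps above can be checked without any audio library. The sketch below uses plain Python in place of NumPy and applies the same formulas to individual values; the function names are hypothetical.

```python
import math


def pitch_factor_to_semitones(pitch_factor: float) -> float:
    """Convert a frequency ratio to semitones (12 semitones per octave)."""
    return 12 * math.log2(pitch_factor)


def apply_energy(sample: float, energy_factor: float) -> float:
    """Scale one sample; soft-clip with tanh when the gain exceeds 1.0."""
    scaled = sample * energy_factor
    if energy_factor > 1.0:
        return math.tanh(scaled * 2) * 0.95
    return scaled


print(round(pitch_factor_to_semitones(1.1), 2))  # 1.65
print(pitch_factor_to_semitones(2.0))            # 12.0
```

A pitch factor of 1.1, as in the `happy` preset, is therefore a shift of about 1.65 semitones. The tanh stage keeps every boosted sample within ±0.95, and because tanh(2x) is close to 2x for small x, quiet samples are amplified by roughly 1.9 times on top of `energy_factor`.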

4.4 Key Design Decisions

TorchScript Models: JIT-compiled for faster inference
Lazy Loading: Models loaded on-demand to minimize memory
CPU Fallback: Falls back to CPU inference when Apple Silicon MPS hits compatibility issues
Streaming Support: Progressive audio delivery for real-time apps
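
Lazy loading can be sketched as a cache that loads a voice the first time it is requested. This illustrates the idea and is not the repository's implementation: `LazyModelCache` and the `loader` callable are hypothetical, and a real loader would read a TorchScript file from disk.

```python
class LazyModelCache:
    """Load each voice model on first use and keep it in memory afterwards."""

    def __init__(self, loader):
        self._loader = loader      # callable: voice ID -> loaded model
        self._models = {}

    def get(self, voice):
        if voice not in self._models:
            # Slow path, taken only on the first request for this voice.
            self._models[voice] = self._loader(voice)
        return self._models[voice]

    def preload(self, voice):
        """Warm the cache ahead of time, as a /preload endpoint might."""
        self.get(voice)

    @property
    def loaded_voices(self):
        return sorted(self._models)


load_calls = []


def fake_loader(voice):
    load_calls.append(voice)       # stand-in for an expensive model load
    return f"model:{voice}"


cache = LazyModelCache(fake_loader)
cache.get("hi_male")
cache.get("hi_male")               # served from memory, no second load
```

With roughly 300MB per VITS voice, keeping only requested voices in memory avoids holding all 21 at once.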

5. Usage Examples

5.1 Python API

from src.engine import TTSEngine

# Initialize engine
engine = TTSEngine(device="auto")

# Basic synthesis
output = engine.synthesize(
    text="गर्भावस्था में स्वस्थ आहार बहुत महत्वपूर्ण है",
    voice="hi_female"
)

# With style control
output = engine.synthesize(
    text="आपका दिन शुभ हो",
    voice="hi_male",
    style="happy",
    pitch=1.1
)

# Gujarati
output = engine.synthesize(
    text="સ્વસ્થ રહો, ખુશ રહો",
    voice="gu_mms",
    style="calm"
)

5.2 REST API

# Basic synthesis
curl -X POST "http://localhost:8000/synthesize" \
  -H "Content-Type: application/json" \
  -d '{"text": "नमस्ते", "voice": "hi_male"}' \
  --output speech.wav

# With style
curl -X POST "http://localhost:8000/synthesize" \
  -H "Content-Type: application/json" \
  -d '{"text": "आपका स्वागत है", "voice": "hi_female", "style": "happy"}' \
  --output welcome.wav

# Gujarati
curl -X POST "http://localhost:8000/synthesize" \
  -H "Content-Type: application/json" \
  -d '{"text": "નમસ્તે", "voice": "gu_mms"}' \
  --output gujarati.wav

5.3 Command Line

# Download models
python -m src.cli download --voice hi_male
python -m src.cli download --lang hi  # All Hindi voices

# Synthesize
python -m src.cli synthesize --text "नमस्ते" --voice hi_male --output hello.wav

# Start server
python -m src.cli serve --port 8000

6. Healthcare Use Case

6.1 Target Application

The TTS system is designed for integration with an LLM-based healthcare assistant for pregnant mothers in low-income communities.

6.2 Key Features for Healthcare

Multi-lingual Support: Information in native languages
Calm Style Preset: Reassuring tone for medical guidance
Slow Speed Option: Clear pronunciation for instructions
Low Latency: Real-time conversational responses

6.3 Example Healthcare Dialogue

User: "ગર્ભાવસ્થામાં શું ખાવું જોઈએ?"

System Response (TTS with calm style in Gujarati):
"ગર્ભાવસ્થામાં તમારે પ્રોટીન, આયર્ન અને ફોલિક એસિડથી ભરપૂર
ખોરાક લેવો જોઈએ. દાળ, પાલક, ઈંડા અને દૂધ સારા વિકલ્પો છે."

7. Performance Benchmarks

Test	Time	Notes
Hindi synthesis (short)	0.25s	"नमस्ते"
Hindi synthesis (long)	0.45s	50-word sentence
Gujarati MMS	0.35s	Warm run; the first load additionally downloads the model
Style processing	+0.1s	Pitch + speed adjustment
API round-trip	0.5s	Including network overhead

Hardware: Apple M2 Pro, 16GB RAM, CPU inference
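
Timings like these can be collected with a small harness. The sketch below assumes a `synthesize` callable that returns a sequence of samples; `fake_synthesize` is a stand-in so the example runs without a model, and the real-time factor (synthesis time divided by audio duration) is an additional metric not reported above.

```python
import statistics
import time


def benchmark(synthesize, text, sample_rate, runs=5):
    """Return median wall-clock time and real-time factor over several runs."""
    timings = []
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        samples = synthesize(text)
        timings.append(time.perf_counter() - start)
    median_seconds = statistics.median(timings)
    audio_seconds = len(samples) / sample_rate
    return {
        "median_seconds": median_seconds,
        "audio_seconds": audio_seconds,
        "real_time_factor": median_seconds / audio_seconds,
    }


def fake_synthesize(text):
    """Stand-in model: one second of silence at 22050 Hz."""
    return [0.0] * 22050


result = benchmark(fake_synthesize, "नमस्ते", sample_rate=22050)
```

Reporting the median over several runs reduces the influence of a slow first call, such as one that includes model loading.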

8. Deployment

8.1 Quick Start

# Clone repository
git clone https://github.com/harshil748/VoiceAPI
cd VoiceAPI

# Setup environment
python3 -m venv tts
source tts/bin/activate
pip install -r requirements.txt

# Download a model
python -m src.cli download --voice hi_male

# Start server
python -m src.cli serve --port 8000

8.2 Docker

FROM python:3.10-slim
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt
RUN python -m src.cli download --lang hi
EXPOSE 8000
CMD ["python", "-m", "src.cli", "serve", "--port", "8000"]

9. Limitations and Future Work

9.1 Current Limitations

Model Size: Each VITS model is ~300MB
MPS Compatibility: Apple Silicon MPS not fully supported
Real-time Streaming: Limited to sentence-level
Gujarati Gender: MMS has only neutral voice

9.2 Future Improvements

Model Quantization: INT8 for smaller size
Voice Cloning: Reference audio-based synthesis
SSML Support: Markup language for fine control
More Languages: Odia, Assamese, Punjabi
Fine-tuning: Custom voice training on SPICOR data

10. Credits

Model Sources

Source	Models	License
SYSPIN (IISc Bangalore)	VITS for 10 languages	CC BY 4.0
Facebook MMS	Gujarati VITS	CC BY-NC 4.0

Dataset

SPICOR TTS Project: IISc SPIRE Lab, Bangalore
Audio Quality: 48kHz, 24-bit, mono

Frameworks

Coqui TTS, Hugging Face Transformers, FastAPI, librosa

11. Conclusion

We have developed a comprehensive multi-lingual TTS system that:

✅ Supports 11 languages (ten Indian languages plus English) with 21 voice variants
✅ Provides 9 style presets for prosody control
✅ Offers a REST API with OpenAPI documentation
✅ Achieves roughly 500ms round-trip latency for typical sentences
✅ Is production-ready with proper error handling

The system is well-suited for the healthcare assistant use case, providing clear, natural-sounding speech in native languages to help pregnant mothers access healthcare information.

Repository: https://github.com/harshil748/VoiceAPI
API Documentation: http://localhost:8000/docs (available once the server is running locally)