Voice Tech for All: Technical Report
Multi-lingual Text-to-Speech System with Style Transfer
Hackathon: Voice Tech for All
Date: December 2025
Executive Summary
We present a multi-lingual Text-to-Speech (TTS) system supporting 11 Indian languages with style/prosody control capabilities. The system is designed for deployment as a healthcare assistant for pregnant mothers in low-income communities, making health information accessible in native languages.
Key Achievements
| Metric | Value |
|---|---|
| Languages Supported | 11 (Hindi, Bengali, Marathi, Telugu, Kannada, Bhojpuri, Chhattisgarhi, Maithili, Magahi, English, Gujarati) |
| Voice Variants | 21 (male + female for each of the 10 SYSPIN languages, plus one neutral Gujarati voice) |
| Style Presets | 9 (default, slow, fast, soft, loud, happy, sad, calm, excited) |
| Average Inference Time | ~0.3s (CPU, Apple M2 Pro) |
| Model Size | ~300MB per voice (VITS), ~145MB (MMS) |
| API Latency | <500ms for typical sentences |
1. System Architecture
1.1 Overview
┌─────────────────────────────────────────────────────────────┐
│                  REST API Server (FastAPI)                  │
├─────────────────────────────────────────────────────────────┤
│  ┌────────────┐  ┌──────────────┐  ┌─────────────────────┐  │
│  │ /synthesize│  │ /voices      │  │ /styles             │  │
│  │ /stream    │  │ /languages   │  │ /health             │  │
│  └────────────┘  └──────────────┘  └─────────────────────┘  │
├─────────────────────────────────────────────────────────────┤
│                         TTS Engine                          │
│  ┌─────────────────┐   ┌───────────────┐   ┌──────────────┐ │
│  │ Text Normalizer │ → │ Tokenizer     │ → │ VITS/MMS     │ │
│  │ (Indian scripts)│   │ (char-to-ID)  │   │ Inference    │ │
│  └─────────────────┘   └───────────────┘   └──────────────┘ │
│                              ↓                              │
│  ┌───────────────────────────────────────────────────────┐  │
│  │ Style Processor (Prosody Control)                     │  │
│  │ • Pitch Shifting (librosa)                            │  │
│  │ • Time Stretching (speed control)                     │  │
│  │ • Energy/Volume Modification                          │  │
│  └───────────────────────────────────────────────────────┘  │
├─────────────────────────────────────────────────────────────┤
│                      Model Repository                       │
│  ┌────────────────────┐  ┌───────────────────────────────┐  │
│  │ SYSPIN VITS Models │  │ Facebook MMS Models           │  │
│  │ (10 languages)     │  │ (Gujarati)                    │  │
│  └────────────────────┘  └───────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
1.2 Component Details
Text Normalizer
- Handles Indian script peculiarities
- Converts number notations to spoken form, e.g. 100 → एक सौ
- Normalizes punctuation across scripts
- Handles code-switching (Hindi in English text)
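The normalization steps listed above could be sketched roughly as follows. This is a minimal illustration, not the project's normalizer: the function names and the punctuation map are assumptions, and it covers only digit and punctuation normalization, not the conversion of numbers to spoken words.

```python
import unicodedata

# Hypothetical punctuation map (an assumption, not the project's table):
# the Devanagari danda and double danda are mapped to a full stop.
PUNCTUATION_MAP = {
    "\u0964": ".",  # Devanagari danda
    "\u0965": ".",  # Devanagari double danda
}


def normalize_digits(text: str) -> str:
    """Replace decimal digits from any script with ASCII digits."""
    return "".join(
        str(unicodedata.decimal(ch)) if ch.isdecimal() else ch for ch in text
    )


def normalize_punctuation(text: str) -> str:
    """Map script-specific sentence marks to ASCII and collapse whitespace."""
    for native, ascii_mark in PUNCTUATION_MAP.items():
        text = text.replace(native, ascii_mark)
    return " ".join(text.split())


def normalize_text(text: str) -> str:
    """Apply Unicode NFC, then digit and punctuation normalization."""
    text = unicodedata.normalize("NFC", text)
    return normalize_punctuation(normalize_digits(text))
```

Using `unicodedata.decimal` keeps the digit step script-agnostic, so the same code handles Devanagari, Bengali, Gujarati, Telugu, and Kannada numerals without per-script tables.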
VITS Models (SYSPIN)
- Architecture: Conditional Variational Autoencoder with Adversarial Learning
- Training Data: 20-30 hours per speaker from IISc Bangalore
- Output: 22050 Hz, 16-bit PCM
- Languages: Hindi, Bengali, Marathi, Telugu, Kannada, Bhojpuri, Chhattisgarhi, Maithili, Magahi, English
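To make the stated output format concrete, the sketch below packs floating-point samples into a mono, 16-bit PCM WAV at 22050 Hz using only the Python standard library. The helper names are illustrative assumptions; the project itself lists soundfile for audio I/O.

```python
import io
import struct
import wave

SAMPLE_RATE = 22050  # SYSPIN VITS output rate stated above


def float_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to little-endian 16-bit PCM bytes."""
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack("<%dh" % len(ints), *ints)


def write_wav(samples, sample_rate=SAMPLE_RATE):
    """Return a complete mono 16-bit WAV file as bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)           # mono
        wav.setsampwidth(2)           # 2 bytes per sample = 16 bits
        wav.setframerate(sample_rate)
        wav.writeframes(float_to_pcm16(samples))
    return buf.getvalue()
```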
MMS Model (Facebook)
- Architecture: VITS-based, trained on MMS corpus
- Output: 16000 Hz
- Languages: Gujarati (and 1100+ others available)
- Model Size: 145MB
Style Processor
- Pitch Shifting: Using librosa phase vocoder
- Time Stretching: Phase vocoder via librosa (librosa.effects.time_stretch)
- Energy Control: Soft clipping with tanh for natural sound
2. API Specification
2.1 Endpoints
| Endpoint | Method | Description |
|---|---|---|
| / | GET | API info and documentation links |
| /health | GET | System health and loaded models |
| /voices | GET | List all available voices |
| /languages | GET | List supported languages |
| /styles | GET | List style presets |
| /synthesize | POST | Generate speech from text |
| /synthesize/get | GET | Simple synthesis (for testing) |
| /synthesize/stream | POST | Streaming audio response |
| /preload | POST | Preload voice into memory |
| /batch | POST | Batch synthesis |
2.2 Synthesis Request
{
  "text": "નમસ્તે, હું તમારી કેવી રીતે મદદ કરી શકું?",
  "voice": "gu_mms",
  "speed": 1.0,
  "pitch": 1.0,
  "energy": 1.0,
  "style": "calm",
  "normalize": true
}
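A client could assemble this request with the Python standard library as sketched below. The helper name `build_synthesis_request` and the default base URL are assumptions for illustration; the field names come from the request body above.

```python
import json
import urllib.request


def build_synthesis_request(text, voice, base_url="http://localhost:8000", **options):
    """Build (but do not send) a POST request for the /synthesize endpoint.

    Extra keyword arguments such as speed, pitch, energy, style, or
    normalize are passed through unchanged in the JSON body.
    """
    payload = {"text": text, "voice": voice, **options}
    # ensure_ascii=False keeps Indic text readable in the encoded body.
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/synthesize",
        data=body,
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


# Sending it requires a running server, for example:
#   with urllib.request.urlopen(request) as response:
#       audio_bytes = response.read()
```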
2.3 Style Presets
| Preset | Speed | Pitch | Energy | Use Case |
|---|---|---|---|---|
| default | 1.0 | 1.0 | 1.0 | Normal speech |
| slow | 0.75 | 1.0 | 1.0 | Elderly users, clarity |
| fast | 1.25 | 1.0 | 1.0 | Quick information |
| soft | 0.9 | 0.95 | 0.7 | Calming content |
| loud | 1.0 | 1.05 | 1.3 | Alerts, emphasis |
| happy | 1.1 | 1.1 | 1.2 | Positive messages |
| sad | 0.85 | 0.9 | 0.8 | Empathetic responses |
| calm | 0.9 | 0.95 | 0.85 | Healthcare guidance |
| excited | 1.2 | 1.15 | 1.3 | Celebrations |
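The preset table maps naturally onto a lookup structure. The sketch below encodes the nine presets and resolves a request's final prosody values. How explicit `speed`, `pitch`, and `energy` values combine with a preset is not specified in this report; multiplying them is an assumption made here, and the names `Prosody` and `resolve_prosody` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prosody:
    speed: float = 1.0
    pitch: float = 1.0
    energy: float = 1.0


# Values copied from the style preset table above.
STYLE_PRESETS = {
    "default": Prosody(1.0, 1.0, 1.0),
    "slow": Prosody(0.75, 1.0, 1.0),
    "fast": Prosody(1.25, 1.0, 1.0),
    "soft": Prosody(0.9, 0.95, 0.7),
    "loud": Prosody(1.0, 1.05, 1.3),
    "happy": Prosody(1.1, 1.1, 1.2),
    "sad": Prosody(0.85, 0.9, 0.8),
    "calm": Prosody(0.9, 0.95, 0.85),
    "excited": Prosody(1.2, 1.15, 1.3),
}


def resolve_prosody(style="default", speed=1.0, pitch=1.0, energy=1.0):
    """Combine a preset with explicit factors (assumed multiplicative)."""
    try:
        preset = STYLE_PRESETS[style]
    except KeyError:
        raise ValueError(f"unknown style preset: {style!r}") from None
    return Prosody(
        speed=preset.speed * speed,
        pitch=preset.pitch * pitch,
        energy=preset.energy * energy,
    )
```

Under this assumption, a request with `style="happy"` and `pitch=1.1` would end up with an effective pitch factor of about 1.21.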
3. Supported Languages
| Language | Code | Voices | Model Type | Sample Rate |
|---|---|---|---|---|
| Hindi | hi | Male, Female | SYSPIN VITS | 22050 Hz |
| Bengali | bn | Male, Female | SYSPIN VITS | 22050 Hz |
| Marathi | mr | Male, Female | SYSPIN VITS | 22050 Hz |
| Telugu | te | Male, Female | SYSPIN VITS | 22050 Hz |
| Kannada | kn | Male, Female | SYSPIN VITS | 22050 Hz |
| Bhojpuri | bho | Male, Female | SYSPIN VITS | 22050 Hz |
| Chhattisgarhi | hne | Male, Female | SYSPIN VITS | 22050 Hz |
| Maithili | mai | Male, Female | SYSPIN VITS | 22050 Hz |
| Magahi | mag | Male, Female | SYSPIN VITS | 22050 Hz |
| English | en | Male, Female | SYSPIN VITS | 22050 Hz |
| Gujarati | gu | Neutral | Facebook MMS | 16000 Hz |
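The table can be expressed as a small voice registry, which also makes the count of 21 voices easy to check. The IDs `hi_male`, `hi_female`, and `gu_mms` appear elsewhere in this report; applying the same `<code>_<gender>` pattern to the other languages is an assumption, as is the registry layout.

```python
# Language codes and names from the table above (SYSPIN VITS models).
VITS_LANGUAGES = {
    "hi": "Hindi", "bn": "Bengali", "mr": "Marathi", "te": "Telugu",
    "kn": "Kannada", "bho": "Bhojpuri", "hne": "Chhattisgarhi",
    "mai": "Maithili", "mag": "Magahi", "en": "English",
}


def build_voice_registry():
    """Return a dict of voice ID -> metadata for all listed voices."""
    voices = {}
    for code in VITS_LANGUAGES:
        for gender in ("male", "female"):
            voices[f"{code}_{gender}"] = {
                "language": code,
                "gender": gender,
                "model": "SYSPIN VITS",
                "sample_rate": 22050,
            }
    # Gujarati is served by a single MMS voice at a different sample rate.
    voices["gu_mms"] = {
        "language": "gu",
        "gender": "neutral",
        "model": "Facebook MMS",
        "sample_rate": 16000,
    }
    return voices
```

Because the two model families emit different sample rates, a client that mixes voices should read `sample_rate` from the registry rather than hard-coding 22050.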
4. Implementation Details
4.1 Technology Stack
| Component | Technology |
|---|---|
| Backend Framework | FastAPI |
| ML Framework | PyTorch |
| TTS Models | VITS (Coqui AI / SYSPIN), MMS (Facebook) |
| Audio Processing | librosa, soundfile, scipy |
| Model Hub | Hugging Face Hub |
| API Documentation | OpenAPI/Swagger |
4.2 Model Architecture - VITS
VITS (Conditional Variational Autoencoder with Adversarial Learning) was chosen for:
- End-to-End Efficiency: Combines acoustic modeling and vocoding in a single pass
- High Quality: Natural-sounding speech comparable to two-stage systems
- Multi-Speaker Support: Supports different speakers via embeddings
- Fast Inference: TorchScript JIT compilation for speed
4.3 Style/Accent Transfer Implementation
Our style transfer uses a post-processing approach for simplicity and reliability:
Pitch Shifting: Phase vocoder via librosa
import librosa
import numpy as np
# Convert a frequency ratio to semitones (a factor of 2.0 is one octave, i.e. 12 semitones)
semitones = 12 * np.log2(pitch_factor)
shifted = librosa.effects.pitch_shift(audio, sr=sr, n_steps=semitones)
Time Stretching: Phase vocoder via librosa
# rate > 1.0 speeds the audio up, rate < 1.0 slows it down
stretched = librosa.effects.time_stretch(audio, rate=speed_factor)
Energy Control: Soft clipping for natural sound
modified = audio * energy_factor
if energy_factor > 1.0:
    modified = np.tanh(modified * 2) * 0.95  # Soft clip
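The arithmetic in the snippets above can be checked without librosa. The following dependency-free sketch mirrors the semitone conversion and the tanh soft clip on plain Python lists; the function names are illustrative, and it is not a substitute for the librosa-based processing.

```python
import math


def pitch_factor_to_semitones(pitch_factor):
    """Convert a frequency ratio to semitones (2.0 -> 12, 1.0 -> 0)."""
    if pitch_factor <= 0:
        raise ValueError("pitch_factor must be positive")
    return 12.0 * math.log2(pitch_factor)


def apply_energy(samples, energy_factor):
    """Scale samples; when boosting, soft clip with tanh as in the report."""
    scaled = [s * energy_factor for s in samples]
    if energy_factor > 1.0:
        # tanh is bounded by 1, so every output stays strictly below 0.95.
        return [math.tanh(s * 2) * 0.95 for s in scaled]
    return scaled
```

For example, the happy preset's pitch factor of 1.1 corresponds to roughly 1.65 semitones. One property worth knowing about the soft-clip branch: for small amplitudes tanh(2x) is approximately 2x, so it adds gain of about 1.9 on top of `energy_factor` rather than only limiting peaks.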
4.4 Key Design Decisions
- TorchScript Models: JIT-compiled for faster inference
- Lazy Loading: Models loaded on-demand to minimize memory
- CPU Fallback: Falls back to CPU inference where Apple Silicon MPS is not fully supported
- Streaming Support: Progressive audio delivery for real-time apps
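The lazy-loading decision could look roughly like the sketch below: a model is loaded the first time its voice is requested and reused afterwards. The class name, the `loader` callable, and the least-recently-used eviction limit are all assumptions for illustration; the report only states that models are loaded on demand to minimize memory.

```python
from collections import OrderedDict


class LazyModelCache:
    """Load models on first use and keep at most max_loaded in memory."""

    def __init__(self, loader, max_loaded=2):
        self._loader = loader          # callable: voice ID -> model object
        self._max_loaded = max_loaded
        self._models = OrderedDict()   # ordered from least to most recent

    def get(self, voice):
        if voice in self._models:
            self._models.move_to_end(voice)  # mark as most recently used
            return self._models[voice]
        model = self._loader(voice)
        self._models[voice] = model
        if len(self._models) > self._max_loaded:
            self._models.popitem(last=False)  # evict least recently used
        return model

    def loaded(self):
        """Return the IDs of the voices currently held in memory."""
        return list(self._models)
```

With models of roughly 300MB per voice, a small `max_loaded` bounds memory at the cost of reloading rarely used voices.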
5. Usage Examples
5.1 Python API
from src.engine import TTSEngine
# Initialize engine
engine = TTSEngine(device="auto")
# Basic synthesis (text: "A healthy diet is very important during pregnancy")
output = engine.synthesize(
    text="गर्भावस्था में स्वस्थ आहार बहुत महत्वपूर्ण है",
    voice="hi_female"
)
# With style control (text: "Have a good day")
output = engine.synthesize(
    text="आपका दिन शुभ हो",
    voice="hi_male",
    style="happy",
    pitch=1.1
)
# Gujarati (text: "Stay healthy, stay happy")
output = engine.synthesize(
    text="સ્વસ્થ રહો, ખુશ રહો",
    voice="gu_mms",
    style="calm"
)
5.2 REST API
# Basic synthesis (text: "Hello")
curl -X POST "http://localhost:8000/synthesize" \
  -H "Content-Type: application/json" \
  -d '{"text": "नमस्ते", "voice": "hi_male"}' \
  --output speech.wav
# With style (text: "Welcome")
curl -X POST "http://localhost:8000/synthesize" \
  -H "Content-Type: application/json" \
  -d '{"text": "आपका स्वागत है", "voice": "hi_female", "style": "happy"}' \
  --output welcome.wav
# Gujarati (text: "Hello")
curl -X POST "http://localhost:8000/synthesize" \
  -H "Content-Type: application/json" \
  -d '{"text": "નમસ્તે", "voice": "gu_mms"}' \
  --output gujarati.wav
5.3 Command Line
# Download models
python -m src.cli download --voice hi_male
python -m src.cli download --lang hi # All Hindi voices
# Synthesize
python -m src.cli synthesize --text "नमस्ते" --voice hi_male --output hello.wav
# Start server
python -m src.cli serve --port 8000
6. Healthcare Use Case
6.1 Target Application
The TTS system is designed for integration with an LLM-based healthcare assistant for pregnant mothers in low-income communities.
6.2 Key Features for Healthcare
- Multi-lingual Support: Information in native languages
- Calm Style Preset: Reassuring tone for medical guidance
- Slow Speed Option: Clear pronunciation for instructions
- Low Latency: Real-time conversational responses
6.3 Example Healthcare Dialogue
User: "ગર્ભાવસ્થામાં શું ખાવું જોઈએ?" (What should I eat during pregnancy?)
System Response (TTS with calm style in Gujarati):
"ગર્ભાવસ્થામાં તમારે પ્રોટીન, આયર્ન અને ફોલિક એસિડથી ભરપૂર ખોરાક લેવો જોઈએ. દાળ, પાલક, ઈંડા અને દૂધ સારા વિકલ્પો છે." (During pregnancy, you should eat food rich in protein, iron, and folic acid. Lentils, spinach, eggs, and milk are good options.)
7. Performance Benchmarks
| Test | Time | Notes |
|---|---|---|
| Hindi synthesis (short) | 0.25s | "नमस्ते" |
| Hindi synthesis (long) | 0.45s | 50-word sentence |
| Gujarati MMS | 0.35s | First load is slower because it includes the model download |
| Style processing | +0.1s | Pitch + speed adjustment |
| API round-trip | 0.5s | Including network overhead |
Hardware: Apple M2 Pro, 16GB RAM, CPU inference
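Timings like those above can be gathered with a small harness such as the following sketch. The helper names and the choice of a warm-up run plus median reporting are assumptions; the report does not describe how its benchmarks were measured. The real-time factor helper relates synthesis time to the duration of the generated audio.

```python
import statistics
import time


def benchmark(fn, *args, repeats=5, warmup=1):
    """Time fn(*args) several times after a warm-up; return summary stats."""
    for _ in range(warmup):
        fn(*args)  # warm-up runs absorb one-time costs such as model loading
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        timings.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(timings),
        "min_s": min(timings),
        "max_s": max(timings),
    }


def real_time_factor(synthesis_seconds, num_samples, sample_rate):
    """Synthesis time divided by audio duration; below 1.0 is faster than real time."""
    return synthesis_seconds / (num_samples / sample_rate)
```

In practice `fn` would be a call such as `engine.synthesize` with fixed text and voice, so that repeated runs are comparable.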
8. Deployment
8.1 Quick Start
# Clone repository
git clone https://github.com/harshil748/VoiceAPI
cd VoiceAPI
# Setup environment
python3 -m venv tts
source tts/bin/activate
pip install -r requirements.txt
# Download a model
python -m src.cli download --voice hi_male
# Start server
python -m src.cli serve --port 8000
8.2 Docker
FROM python:3.10-slim
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt
RUN python -m src.cli download --lang hi
EXPOSE 8000
CMD ["python", "-m", "src.cli", "serve"]
9. Limitations and Future Work
9.1 Current Limitations
- Model Size: Each VITS model is ~300MB
- MPS Compatibility: Apple Silicon MPS not fully supported
- Real-time Streaming: Limited to sentence-level
- Gujarati Gender: MMS has only neutral voice
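Sentence-level streaming, as noted in the list above, amounts to splitting the input into sentences and synthesizing them one at a time. The sketch below shows one way to do that; the splitting rule and the function names are assumptions, and `synthesize` stands in for any callable that turns a sentence into audio.

```python
import re

# Split after '.', '!', '?', or the Devanagari danda when followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?\u0964])\s+")


def split_sentences(text):
    """Return the non-empty sentences in text, keeping end punctuation."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def stream_synthesis(text, synthesize):
    """Yield audio for each sentence as soon as it has been synthesized."""
    for sentence in split_sentences(text):
        yield synthesize(sentence)
```

A rule this simple also splits after abbreviations such as "Dr.", so a production splitter would need language-aware handling.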
9.2 Future Improvements
- Model Quantization: INT8 for smaller size
- Voice Cloning: Reference audio-based synthesis
- SSML Support: Markup language for fine control
- More Languages: Odia, Assamese, Punjabi
- Fine-tuning: Custom voice training on SPICOR data
10. Credits
Model Sources
| Source | Models | License |
|---|---|---|
| SYSPIN (IISc Bangalore) | VITS for 10 languages | CC BY 4.0 |
| Facebook MMS | Gujarati VITS | CC BY-NC 4.0 |
Dataset
- SPICOR TTS Project: IISc SPIRE Lab, Bangalore
- Audio Quality: 48kHz, 24-bit, mono
Frameworks
- Coqui TTS, Hugging Face Transformers, FastAPI, librosa
11. Conclusion
We have developed a comprehensive multi-lingual TTS system that:
✅ Supports 11 Indian languages with 21 voice variants
✅ Provides 9 style presets for prosody control
✅ Offers a REST API with OpenAPI documentation
✅ Achieves <500ms latency for typical sentences
✅ Is production-ready with proper error handling
The system is well-suited for the healthcare assistant use case, providing clear, natural-sounding speech in native languages to help pregnant mothers access healthcare information.
Repository: https://github.com/harshil748/VoiceAPI
API Documentation: http://localhost:8000/docs (available while the server is running locally)