🗣️ Sinhala TTS VITS 🇱🇰

Sinhala Text-to-Speech — A Coqui TTS VITS model that generates natural Sinhala speech from text, with 16 distinct voices to choose from.

🎯 Model Details

Attribute	Value
Architecture	VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech)
Language	🇱🇰 Sinhala (සිංහල)
Speakers	16 voices
Sample Rate	16 kHz
Parameters	~30M
Vocab	97 characters (74 Sinhala Unicode + 19 punctuation + 4 special tokens)
Framework	Coqui TTS 0.27.x
License	Apache 2.0
Model Format	SafeTensors (.safetensors)

🗣️ Available Speakers

ID	Speaker Name	Description
0	mettananda	Male voice 1
1	oshadi	Female voice 1
2	pn_sin_01	Voice 3
3	sin_01	Voice 4
4	sin_2241	Voice 5
5	sin_2282	Voice 6
6	sin_3531	Voice 7
7	sin_3688	Voice 8
8	sin_3976	Voice 9
9	sin_4191	Voice 10
10	sin_4499	Voice 11
11	sin_5681	Voice 12
12	sin_6314	Voice 13
13	sin_6897	Voice 14
14	sin_7183	Voice 15
15	sin_9228	Voice 16
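
Because speakers can be referred to by numeric ID or by name, a small lookup helper can keep calling code consistent. The sketch below copies the names from the table above; `resolve_speaker` is a hypothetical helper for illustration, not part of this repository or of Coqui TTS.

```python
# Speaker names in ID order, copied from the table above.
SPEAKERS = [
    "mettananda", "oshadi", "pn_sin_01", "sin_01",
    "sin_2241", "sin_2282", "sin_3531", "sin_3688",
    "sin_3976", "sin_4191", "sin_4499", "sin_5681",
    "sin_6314", "sin_6897", "sin_7183", "sin_9228",
]
NAME_TO_ID = {name: idx for idx, name in enumerate(SPEAKERS)}


def resolve_speaker(speaker):
    """Return the speaker name for a name or a numeric ID (hypothetical helper)."""
    if isinstance(speaker, int):
        if 0 <= speaker < len(SPEAKERS):
            return SPEAKERS[speaker]
        raise ValueError(f"speaker ID must be 0..{len(SPEAKERS) - 1}, got {speaker}")
    if speaker in NAME_TO_ID:
        return speaker
    raise ValueError(f"unknown speaker {speaker!r}")
```

The returned name can then be passed wherever a speaker name is expected, for example as the `speaker` argument in the usage examples.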

🚀 Usage

Option 1: Coqui TTS (Recommended)

import torch
import soundfile as sf
from safetensors.torch import load_file
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.models.vits import Vits
from TTS.tts.utils.text import TTSTokenizer
from TTS.tts.utils.speakers import SpeakerManager
from TTS.utils.audio import AudioProcessor

# Load the model configuration
config = VitsConfig()
config.load_json("config.json")

# Initialize the audio processor, tokenizer, and speaker manager
ap = AudioProcessor.init_from_config(config)
tokenizer, new_config = TTSTokenizer.init_from_config(config)
speaker_manager = SpeakerManager()
speaker_manager.load_ids_from_file("speakers.json")

# Build the model and load the SafeTensors weights
model = Vits(new_config, ap, tokenizer, speaker_manager)
state_dict = load_file("sinhala_tts_vits_model.safetensors")
# strict=False tolerates keys that are missing from or extra in the checkpoint
model.load_state_dict(state_dict, strict=False)
model.eval()

# Synthesize speech with one of the 16 speakers
text = "ආයුබෝවන්! ඔබට කොහොමද?"
outputs = model.synthesize(text, config=new_config, speaker="mettananda")

# Save the audio at the model's 16 kHz sample rate
sf.write("output.wav", outputs["wav"], 16000)
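
If `soundfile` is not available, the waveform can also be written with Python's standard library. This is a minimal sketch, assuming `outputs["wav"]` is a one-dimensional sequence of floats in the range [-1.0, 1.0]; that shape and range are assumptions, not something this README states. `write_wav_16bit` is a hypothetical helper name.

```python
import struct
import wave

SAMPLE_RATE = 16000  # 16 kHz, from the Model Details table


def write_wav_16bit(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono float samples (assumed in [-1.0, 1.0]) as 16-bit PCM WAV."""
    frames = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))  # clip so the int16 conversion cannot overflow
        frames += struct.pack("<h", int(round(s * 32767)))
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wf.setframerate(sample_rate)
        wf.writeframes(bytes(frames))
```

With the snippet above, `write_wav_16bit("output.wav", outputs["wav"])` would stand in for the `sf.write` call.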

Option 2: REST API (with included server.py)

# Start the server
python server.py

# Generate speech
curl -X POST http://localhost:8081/tts \
  -H "Content-Type: application/json" \
  -d '{
    "text": "ආයුබෝවන්!",
    "speaker": "mettananda",
    "emotion": "neutral"
  }' \
  --output output.wav

# Health check
curl http://localhost:8081/health

# List speakers
curl http://localhost:8081/speakers
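
The same `/tts` call can be made from Python using only the standard library. The sketch below mirrors the curl example (URL, port, and JSON fields are taken from it); the function names are hypothetical, and it assumes the response body is the raw WAV file, as the `--output output.wav` flag suggests.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8081"  # host and port from the curl examples


def build_tts_request(text, speaker="mettananda", emotion="neutral", base_url=BASE_URL):
    """Build a POST request for /tts with the same JSON fields as the curl call."""
    payload = json.dumps(
        {"text": text, "speaker": speaker, "emotion": emotion}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/tts",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text, out_path="output.wav", **kwargs):
    """Send the request and save the response body (assumed to be WAV bytes)."""
    request = build_tts_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return out_path
```

With the server running, `synthesize_to_file("ආයුබෝවන්!", speaker="oshadi")` would write the reply to `output.wav`.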

Option 3: HuggingFace Inference API

⚠️ This model uses Coqui TTS (not Transformers) and cannot be used via the standard HF Inference API. Use Coqui TTS directly or the included REST API server.

Option 4: Docker Deployment

docker build -t sinhala-tts-server .
docker run -p 8081:8081 sinhala-tts-server
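
A freshly started container may need some time before it can serve requests, for example while the weights load. One way to handle this is to poll the `/health` endpoint before sending synthesis requests. The sketch below is an illustration under assumptions: it treats any HTTP 200 reply as "ready", and the helper names, attempt count, and delay are arbitrary choices rather than values from this repository.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=2.0):
    """Return True if a GET request to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_healthy(url="http://localhost:8081/health", attempts=30,
                       delay=1.0, probe=http_ok, sleep=time.sleep):
    """Poll url up to `attempts` times; return True as soon as a probe succeeds."""
    for _ in range(attempts):
        if probe(url):
            return True
        sleep(delay)
    return False
```

The `probe` and `sleep` parameters exist so the loop can be exercised without a running server.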

🛠️ Development Platforms

GPU	Cost	Best For
P100/T4	Free (~30 hrs/week)	Quick experiments
T4/A100	Free / $10/mo Pro	Training runs
A100 80GB	$20 free credit	Full training
RTX 4090/A100	$0.34–$2.00/hr	Production

📦 Files

File	Description	Size
`sinhala_tts_vits_model.safetensors`	Model weights (SafeTensors)	316 MB
`config.json`	Model configuration	8 KB
`speakers.json`	Speaker ID mapping	300 B
`server.py`	FastAPI REST inference server	6 KB
`Dockerfile`	Docker build for production	2 KB
`DEVELOPER_GUIDE.md`	Training & development guide	15 KB

🎓 Training & Fine-Tuning

For detailed instructions, see DEVELOPER_GUIDE.md, which covers:

Setup: Environment configuration and dependency installation
Training from scratch: Full training pipeline with the Sinhala dataset
Fine-tuning: Adapting the model to new voices or domains
Dataset preparation: Preprocessing Sinhala audio data
Export to SafeTensors: Converting PyTorch checkpoints to SafeTensors format
Cloud GPU training: Step-by-step guides for Kaggle, Colab, and Modal

🌐 Deployment Options

Method	Description	Best For
HuggingFace Space	Gradio web UI (live demo)	Quick testing
FastAPI Server	REST API with Docker	Production APIs
Local Python	Direct model loading	Development
Kubernetes	Docker container in K8s	Scalable deployment

⚠️ Limitations

Audio quality: Trained on a limited dataset (~200 samples × 16 speakers) — quality may vary
Inference speed: CPU inference is slower than GPU inference; a GPU is recommended for production
Emotion control: Basic emotion prefixes are supported but effects are subtle
Proper nouns: May struggle with non-Sinhala words or names
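Out-of-vocabulary characters: Input is limited to the 93 text characters in the vocabulary (74 Sinhala Unicode + 19 punctuation; the remaining 4 of the 97 entries are special tokens)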
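
Given the vocabulary limit above, it can help to screen input text before synthesis. The following is a minimal sketch with hypothetical helper names. It assumes the character set is stored in `config.json` under a `characters` object with `characters` and `punctuations` string fields, which is a common Coqui TTS layout but is not confirmed by this README. The toy vocabulary is for illustration only.

```python
def vocab_from_config(cfg):
    """Collect the vocabulary from a parsed config dict (field names are assumed)."""
    chars = cfg.get("characters") or {}
    return set(chars.get("characters", "")) | set(chars.get("punctuations", ""))


def find_unsupported(text, vocab):
    """Return the distinct characters of text that are not in vocab, in first-seen order."""
    unsupported = []
    for ch in text:
        if ch not in vocab and ch not in unsupported:
            unsupported.append(ch)
    return unsupported


# Toy vocabulary for illustration only; the real character set lives in config.json.
toy_vocab = vocab_from_config(
    {"characters": {"characters": "ආයුබෝවන්", "punctuations": " !?"}}
)
```

Calling `find_unsupported("ආයුබෝවන් Hello!", toy_vocab)` returns the Latin letters, which a caller could then strip or transliterate before synthesis.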

📝 License

This model is released under the Apache 2.0 License.

🙏 Maintainer

Death Legion Team — 🤗 HuggingFace
