🗣️ Sinhala TTS VITS 🇱🇰
Sinhala Text-to-Speech: a Coqui TTS VITS model that generates natural Sinhala speech from text, with 16 distinct voices to choose from.
🎯 Model Details
| Attribute | Value |
|---|---|
| Architecture | VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) |
| Language | 🇱🇰 Sinhala (සිංහල) |
| Speakers | 16 voices |
| Sample Rate | 16 kHz |
| Parameters | ~30M |
| Vocab | 97 characters (74 Sinhala Unicode + 19 punctuation + 4 special tokens) |
| Framework | Coqui TTS 0.27.x |
| License | Apache 2.0 |
| Model Format | SafeTensors (.safetensors) |
🗣️ Available Speakers
| ID | Speaker Name | Description |
|---|---|---|
| 0 | mettananda | Male voice 1 |
| 1 | oshadi | Female voice 1 |
| 2 | pn_sin_01 | Voice 3 |
| 3 | sin_01 | Voice 4 |
| 4 | sin_2241 | Voice 5 |
| 5 | sin_2282 | Voice 6 |
| 6 | sin_3531 | Voice 7 |
| 7 | sin_3688 | Voice 8 |
| 8 | sin_3976 | Voice 9 |
| 9 | sin_4191 | Voice 10 |
| 10 | sin_4499 | Voice 11 |
| 11 | sin_5681 | Voice 12 |
| 12 | sin_6314 | Voice 13 |
| 13 | sin_6897 | Voice 14 |
| 14 | sin_7183 | Voice 15 |
| 15 | sin_9228 | Voice 16 |
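The table maps directly to a name-to-ID lookup. The sketch below is one way to validate a speaker before synthesis. It assumes `speakers.json` is a flat JSON object of speaker name to integer ID (a common Coqui layout, not confirmed for this repository); `load_speakers` and `resolve_speaker` are hypothetical helpers, not part of Coqui TTS.

```python
import json

# Mirrors the table above; in practice, load this from speakers.json instead.
SPEAKERS = {
    "mettananda": 0, "oshadi": 1, "pn_sin_01": 2, "sin_01": 3,
    "sin_2241": 4, "sin_2282": 5, "sin_3531": 6, "sin_3688": 7,
    "sin_3976": 8, "sin_4191": 9, "sin_4499": 10, "sin_5681": 11,
    "sin_6314": 12, "sin_6897": 13, "sin_7183": 14, "sin_9228": 15,
}


def load_speakers(path="speakers.json"):
    """Read a name -> ID mapping (assumed to be a flat JSON object)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def resolve_speaker(speaker, speakers=SPEAKERS):
    """Accept a speaker name or numeric ID and return the speaker name."""
    if isinstance(speaker, int):
        by_id = {idx: name for name, idx in speakers.items()}
        if speaker not in by_id:
            raise ValueError(f"Unknown speaker ID: {speaker}")
        return by_id[speaker]
    if speaker not in speakers:
        raise ValueError(f"Unknown speaker name: {speaker!r}")
    return speaker
```

Resolving to the name up front gives a clear error for a typo such as `"metananda"` before any model call is made.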
Usage
Option 1: Coqui TTS (Recommended)
import soundfile as sf
import torch
from safetensors.torch import load_file
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.models.vits import Vits
from TTS.tts.utils.speakers import SpeakerManager
from TTS.tts.utils.text import TTSTokenizer
from TTS.utils.audio import AudioProcessor

# Load config
config = VitsConfig()
config.load_json("config.json")

# Initialize the audio processor, tokenizer, and speaker manager
ap = AudioProcessor.init_from_config(config)
tokenizer, new_config = TTSTokenizer.init_from_config(config)
speaker_manager = SpeakerManager()
speaker_manager.load_ids_from_file("speakers.json")

# Create the model and load the SafeTensors weights
model = Vits(new_config, ap, tokenizer, speaker_manager)
state_dict = load_file("sinhala_tts_vits_model.safetensors")
# strict=False ignores mismatched keys silently, so report any missing weights
load_result = model.load_state_dict(state_dict, strict=False)
print("Missing keys:", load_result.missing_keys)
model.eval()

# Synthesize without tracking gradients
text = "ආයුබෝවන්! ඔබට කොහොමද?"
with torch.no_grad():
    outputs = model.synthesize(text, config=new_config, speaker="mettananda")

# Save audio at the model's 16 kHz sample rate
sf.write("output.wav", outputs["wav"], 16000)
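If `soundfile` is not available, the waveform can also be saved with Python's standard library. This is a minimal sketch, assuming `outputs["wav"]` is a one-dimensional sequence of floats in the range [-1.0, 1.0]; `write_wav_pcm16` is a hypothetical helper name.

```python
import struct
import wave

SAMPLE_RATE = 16000  # matches the model's 16 kHz output


def write_wav_pcm16(path, samples, sample_rate=SAMPLE_RATE):
    """Write float samples as 16-bit mono PCM and return the duration in seconds."""
    frames = bytearray()
    for sample in samples:
        # Clip to the valid range, then scale to signed 16-bit integers.
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))
    return (len(frames) // 2) / sample_rate
```

Calling `write_wav_pcm16("output.wav", outputs["wav"])` would then replace the `sf.write` call. The returned duration is simply the sample count divided by 16000.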
Option 2: REST API (with included server.py)
# Start the server
python server.py
# Generate speech
curl -X POST http://localhost:8081/tts \
  -H "Content-Type: application/json" \
  -d '{
    "text": "ආයුබෝවන්!",
    "speaker": "mettananda",
    "emotion": "neutral"
  }' \
  --output output.wav
# Health check
curl http://localhost:8081/health
# List speakers
curl http://localhost:8081/speakers
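The same `/tts` call can be made from Python with only the standard library. This sketch assumes the server returns raw WAV bytes in the response body (as the `--output output.wav` flag in the curl example suggests); `build_tts_request` and `synthesize_to_file` are hypothetical helper names, not part of `server.py`.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8081"  # address used in the curl examples


def build_tts_request(text, speaker="mettananda", emotion="neutral", base_url=BASE_URL):
    """Build the POST request for /tts without sending it."""
    payload = json.dumps(
        {"text": text, "speaker": speaker, "emotion": emotion}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/tts",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text, out_path="output.wav", **kwargs):
    """Send the request and save the response body (assumed to be WAV audio)."""
    request = build_tts_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        audio = response.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return out_path
```

With the server running, `synthesize_to_file("ආයුබෝවන්!", speaker="oshadi")` should write `output.wav` in the current directory.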
Option 3: HuggingFace Inference API
⚠️ This model uses Coqui TTS (not Transformers), so it cannot be used via the standard HF Inference API. Use Coqui TTS directly or the included REST API server.
Option 4: Docker Deployment
docker build -t sinhala-tts-server .
docker run -p 8081:8081 sinhala-tts-server
🛠️ Development Platforms
| Platform | GPU | Cost | Best For |
|---|---|---|---|
| | P100/T4 | Free (~30 hrs/week) | Quick experiments |
| | T4/A100 | Free / $10/mo Pro | Training runs |
| | A100 80GB | $20 free credit | Full training |
| | RTX 4090/A100 | $0.34 to $2.00/hr | Production |
📦 Files
| File | Description | Size |
|---|---|---|
| sinhala_tts_vits_model.safetensors | Model weights (SafeTensors) | 316 MB |
| config.json | Model configuration | 8 KB |
| speakers.json | Speaker ID mapping | 300 B |
| server.py | FastAPI REST inference server | 6 KB |
| Dockerfile | Docker build for production | 2 KB |
| DEVELOPER_GUIDE.md | Training & development guide | 15 KB |
Training & Fine-Tuning
For detailed instructions, see DEVELOPER_GUIDE.md, which covers:
- Setup: Environment configuration and dependency installation
- Training from scratch: Full training pipeline with the Sinhala dataset
- Fine-tuning: Adapting the model to new voices or domains
- Dataset preparation: Preprocessing Sinhala audio data
- Export to SafeTensors: Converting PyTorch checkpoints to SafeTensors format
- Cloud GPU training: Step-by-step guides for Kaggle, Colab, and Modal
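As an illustration of the SafeTensors export step, the sketch below converts a PyTorch checkpoint to a `.safetensors` file. It is not taken from DEVELOPER_GUIDE.md. It assumes a Coqui-style checkpoint that keeps the weights under a `"model"` key, and the function names are hypothetical.

```python
def extract_state_dict(checkpoint):
    """Return the weight mapping from a loaded checkpoint dict.

    Assumes Coqui-style checkpoints store weights under "model"; otherwise
    the checkpoint itself is treated as the state dict.
    """
    return dict(checkpoint.get("model", checkpoint))


def convert_checkpoint(pth_path, out_path):
    """Load a .pth checkpoint and write its tensors as SafeTensors."""
    # Imported lazily so extract_state_dict works without torch installed.
    import torch
    from safetensors.torch import save_file

    # Recent PyTorch versions may need weights_only=False for checkpoints
    # that contain non-tensor objects; only do that for trusted files.
    checkpoint = torch.load(pth_path, map_location="cpu")
    state_dict = extract_state_dict(checkpoint)
    # Keep tensors only and make them contiguous, as safetensors expects.
    tensors = {
        name: value.contiguous()
        for name, value in state_dict.items()
        if isinstance(value, torch.Tensor)
    }
    save_file(tensors, out_path)
```

A call such as `convert_checkpoint("best_model.pth", "sinhala_tts_vits_model.safetensors")` would then produce the weights file; the checkpoint filename here is only an example.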
Deployment Options
| Method | Description | Best For |
|---|---|---|
| HuggingFace Space | Gradio web UI (live demo) | Quick testing |
| FastAPI Server | REST API with Docker | Production APIs |
| Local Python | Direct model loading | Development |
| Kubernetes | Docker container in K8s | Scalable deployment |
⚠️ Limitations
- Audio quality: Trained on a limited dataset (~200 samples × 16 speakers), so quality may vary
- Inference speed: CPU inference is slower than GPU inference; a GPU is recommended for production
- Emotion control: Basic emotion prefixes are supported but effects are subtle
- Proper nouns: May struggle with non-Sinhala words or names
- Out-of-vocabulary characters: Input is limited to the 93 characters in the vocabulary (74 Sinhala + 19 punctuation)
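Because input is restricted to the model's vocabulary, it can help to check text before synthesis. This is a minimal sketch, assuming `config.json` follows the usual Coqui layout with a `characters` section containing `characters` and `punctuations` strings; `load_vocab` and `find_unsupported` are hypothetical helpers.

```python
import json


def load_vocab(config_path="config.json"):
    """Collect the allowed characters from a Coqui-style config (assumed layout)."""
    with open(config_path, encoding="utf-8") as f:
        chars = json.load(f)["characters"]
    return set(chars["characters"]) | set(chars["punctuations"])


def find_unsupported(text, vocab):
    """Return distinct characters of text missing from vocab, in order of first appearance."""
    missing = []
    for ch in text:
        if ch not in vocab and ch not in missing:
            missing.append(ch)
    return missing
```

A caller could reject or strip the reported characters, for example Latin letters in a mixed-language sentence, before passing the text to the model.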
License
This model is released under the Apache 2.0 License.
Maintainer
Death Legion Team (🤗 HuggingFace)
Try the Live Demo • Developer Guide • Death Legion Team