Vira-TTS
A Vietnamese text-to-speech model with voice cloning, fine-tuned from MiraTTS for the Vietnamese language.
Model Description
Vira-TTS is a neural TTS model that can synthesize natural Vietnamese speech from text while cloning the voice characteristics from a reference audio sample.
| Property | Value |
|---|---|
| Base Architecture | Qwen2-0.5B |
| Audio Codec | Fast-BiCodec |
| Sample Rate | 16kHz (native), 48kHz (with FlashSR) |
| Language | Vietnamese |
Features
- Zero-shot Voice Cloning: Clone any voice from a short reference audio (3-12 seconds recommended)
- Vietnamese Optimized: Finetuned specifically for Vietnamese pronunciation and prosody
- Text Normalization: Automatic conversion of numbers and abbreviations to spoken form
- High Quality Output: 48kHz audio with FlashSR upsampling
Usage
Installation
pip install git+https://github.com/iamdinhthuan/Vira-tts.git
Quick Start
import soundfile as sf
from mira.model import MiraTTS

# Load the model (weights are downloaded automatically)
mira_tts = MiraTTS('model_pretrained')

# Reference audio for voice cloning and the text to synthesize
reference_audio = "speaker.wav"
text = "Xin chào, đây là giọng nói tiếng Việt."  # "Hello, this is a Vietnamese voice."

# Encode the reference speaker, then generate speech in that voice
context_tokens = mira_tts.encode_audio(reference_audio)
audio = mira_tts.generate(text, context_tokens)

# Save the output at 48 kHz (the FlashSR-upsampled rate)
sf.write("output.wav", audio.float().cpu().numpy(), 48000)
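When several sentences should share one cloned voice, the reference audio presumably needs to be encoded only once. The sketch below assumes that the `context_tokens` returned by `encode_audio` can be reused across `generate` calls, which the Quick Start suggests but does not state. `synthesize_sentences` is a hypothetical helper, not part of the package.

```python
def synthesize_sentences(tts, reference_audio, sentences):
    """Generate one audio clip per sentence with a single cloned voice.

    `tts` is any object exposing encode_audio() and generate() as used in
    the Quick Start (for example a MiraTTS instance). Hypothetical helper.
    """
    # Encode the reference speaker once (assumption: the tokens are reusable).
    context_tokens = tts.encode_audio(reference_audio)
    # Skip empty or whitespace-only entries, generate the rest in order.
    return [tts.generate(s, context_tokens) for s in sentences if s.strip()]
```

Each returned clip can then be written with `sf.write` exactly as in the Quick Start.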
Web UI
python app.py
CLI
python infer.py --text "Xin chào" --reference speaker.wav --output output.wav
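To synthesize several texts from a script, the CLI call can be wrapped in a loop. This is a minimal sketch that uses only the three flags shown above and assumes it is run from the repository root, where `infer.py` lives. `build_infer_command` and `run_batch` are hypothetical names.

```python
import subprocess
import sys

def build_infer_command(text, reference, output):
    """Return the argv list for one infer.py call, using only the flags shown above."""
    return [sys.executable, "infer.py", "--text", text,
            "--reference", reference, "--output", output]

def run_batch(texts, reference, prefix="output"):
    """Synthesize each text to <prefix>_<index>.wav, one CLI call per text."""
    for i, text in enumerate(texts):
        cmd = build_infer_command(text, reference, f"{prefix}_{i}.wav")
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
```

Each call starts a new process and most likely reloads the model, so for many texts the Python API from the Quick Start is probably the faster route.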
Text Normalization
The model includes automatic Vietnamese text normalization:
| Input | Output |
|---|---|
| Năm 2024 | Năm hai nghìn không trăm hai mươi tư |
| 100.000 VNĐ | một trăm nghìn đồng |
| TP.HCM | thành phố Hồ Chí Minh |
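To illustrate the kind of rewriting involved, here is a toy sketch of two normalization steps: dictionary-based abbreviation expansion and parsing a number that uses `.` as the thousands separator. It is not the normalizer the model uses, and it leaves out the conversion of numbers to words. The lookup table and function names are hypothetical.

```python
# Toy lookup table (hypothetical); a real normalizer covers far more cases.
ABBREVIATIONS = {
    "TP.HCM": "thành phố Hồ Chí Minh",
    "VNĐ": "đồng",
}

def expand_abbreviations(text, table=ABBREVIATIONS):
    """Replace each known abbreviation with its spoken form."""
    # Replace longer keys first so a short key cannot split a longer one.
    for abbr in sorted(table, key=len, reverse=True):
        text = text.replace(abbr, table[abbr])
    return text

def parse_grouped_number(token):
    """Parse a number written with '.' as the thousands separator, e.g. '100.000'."""
    return int(token.replace(".", ""))
```

A plain string replacement like this has no notion of word boundaries or context, which is one reason a dedicated normalization library is the better choice in practice.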
Requirements
- Python >= 3.10
- CUDA-compatible GPU (recommended: 6GB+ VRAM)
- Dependencies: lmdeploy, fastaudiosr, ncodec, gradio
Limitations
- Best performance with reference audio 3-10 seconds long
- Reference audio should be clear without background noise
- Some rare Vietnamese words may not be pronounced correctly
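The length recommendation can be checked before cloning. The sketch below uses Python's standard `wave` module, so it only handles uncompressed PCM WAV files. The helper names are hypothetical, and the defaults follow the 3-10 second range above; since the Features section mentions 3-12 seconds, the upper bound is left as a parameter.

```python
import wave

def reference_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()

def is_recommended_length(path, min_s=3.0, max_s=10.0):
    """True if the clip falls inside the recommended duration range."""
    return min_s <= reference_duration_seconds(path) <= max_s
```

For example, `is_recommended_length("speaker.wav", max_s=12.0)` applies the looser upper bound.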
Citation
@misc{vira-tts,
author = {Dinh Thuan},
title = {Vira-TTS: Vietnamese Text-to-Speech with Voice Cloning},
year = {2025},
publisher = {HuggingFace},
url = {https://huggingface.co/dolly-vn/Vira-TTS}
}
Acknowledgements
- MiraTTS - Original model architecture
- Spark-TTS - Base TTS model
- FlashSR - Audio super-resolution
- LMDeploy - Inference optimization
- soe-vinorm - Vietnamese text normalization
License
MIT
Contact
- GitHub: @iamdinhthuan
- Repository: Vira-TTS