Vira-TTS

Vietnamese text-to-speech model with voice cloning, fine-tuned from MiraTTS for Vietnamese.

Model Description

Vira-TTS is a neural TTS model that can synthesize natural Vietnamese speech from text while cloning the voice characteristics from a reference audio sample.

Property	Value
Base Architecture	Qwen2-0.5B
Audio Codec	Fast-BiCodec
Sample Rate	16kHz (native), 48kHz (with FlashSR)
Language	Vietnamese

Features

Zero-shot Voice Cloning: Clone any voice from a short reference audio (3-12 seconds recommended)
Vietnamese Optimized: Finetuned specifically for Vietnamese pronunciation and prosody
Text Normalization: Automatic conversion of numbers and abbreviations to spoken form
High Quality Output: 48kHz audio with FlashSR upsampling

Usage

Installation

pip install git+https://github.com/iamdinhthuan/Vira-tts.git

Quick Start

import soundfile as sf

from mira.model import MiraTTS

# Model will be downloaded automatically
mira_tts = MiraTTS('model_pretrained')

# Provide reference audio for voice cloning
reference_audio = "speaker.wav"
text = "Xin chào, đây là giọng nói tiếng Việt."

# Encode the reference voice, then synthesize the text in that voice
context_tokens = mira_tts.encode_audio(reference_audio)
audio = mira_tts.generate(text, context_tokens)

# Save output at 48 kHz (the FlashSR-upsampled rate)
sf.write("output.wav", audio.float().cpu().numpy(), 48000)

Web UI

python app.py

CLI

python infer.py --text "Xin chào" --reference speaker.wav --output output.wav
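
To synthesize several sentences with the same reference voice, the CLI call above can be wrapped in a small script. The following is a minimal sketch, assuming `infer.py` sits in the current directory and accepts only the three flags shown above. The helper names `build_infer_command` and `synthesize_lines` and the output naming scheme are hypothetical and not part of the repository.

```python
import subprocess
import sys


def build_infer_command(text, reference, output, script="infer.py"):
    """Build the argument list for one CLI call (flags taken from the example above)."""
    return [
        sys.executable, script,
        "--text", text,
        "--reference", reference,
        "--output", output,
    ]


def synthesize_lines(lines, reference, prefix="line"):
    """Run the CLI once per sentence and return the output paths (hypothetical helper)."""
    outputs = []
    for index, text in enumerate(lines, start=1):
        output = f"{prefix}_{index:03d}.wav"
        # check=True raises CalledProcessError if infer.py exits with an error
        subprocess.run(build_infer_command(text, reference, output), check=True)
        outputs.append(output)
    return outputs
```

Each call starts a new process, so the model is presumably loaded again every time. For larger batches, the Python API from Quick Start is likely the better fit.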

Text Normalization

The model includes automatic Vietnamese text normalization:

Input	Output
`Năm 2024`	`Năm hai nghìn không trăm hai mươi tư`
`100.000 VNĐ`	`một trăm nghìn đồng`
`TP.HCM`	`thành phố Hồ Chí Minh`
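
To illustrate the kind of rule behind the number rows in the table, here is a minimal sketch of reading integers below one million aloud in Vietnamese. It is not the normalizer the model uses (the acknowledgements credit soe-vinorm for that). The function names are hypothetical, and the choice of "linh" for a zero tens digit is an assumption (a Southern reading would use "lẻ").

```python
DIGITS = ["không", "một", "hai", "ba", "bốn", "năm", "sáu", "bảy", "tám", "chín"]


def _read_pair(tens, ones):
    """Read a two-digit value with tens >= 1, e.g. 15 or 24."""
    head = "mười" if tens == 1 else DIGITS[tens] + " mươi"
    if ones == 0:
        return head
    if ones == 1 and tens >= 2:
        tail = "mốt"   # 21 -> hai mươi mốt
    elif ones == 4 and tens >= 2:
        tail = "tư"    # 24 -> hai mươi tư, as in the table above
    elif ones == 5:
        tail = "lăm"   # 15 -> mười lăm
    else:
        tail = DIGITS[ones]
    return head + " " + tail


def _read_group(n, pad):
    """Read 0..999; pad=True says the hundreds digit even when it is zero."""
    hundreds, rest = divmod(n, 100)
    tens, ones = divmod(rest, 10)
    words = []
    if hundreds or pad:
        words.append(DIGITS[hundreds] + " trăm")
    if tens:
        words.append(_read_pair(tens, ones))
    elif ones:
        # "linh" marks the skipped tens digit after a hundreds word
        words.append(("linh " if words else "") + DIGITS[ones])
    return " ".join(words)


def read_number(n):
    """Spell out an integer in the range 0..999999."""
    if not 0 <= n < 1_000_000:
        raise ValueError("this sketch only covers 0..999999")
    if n == 0:
        return DIGITS[0]
    thousands, rest = divmod(n, 1000)
    parts = []
    if thousands:
        parts.append(_read_group(thousands, pad=False) + " nghìn")
    if rest:
        parts.append(_read_group(rest, pad=bool(thousands)))
    return " ".join(parts)
```

With these rules, `read_number(2024)` and `read_number(100000)` produce the number readings shown in the first two table rows. Currency units and abbreviations such as `VNĐ` or `TP.HCM` would need a separate lookup step.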

Requirements

Python >= 3.10
CUDA-compatible GPU (recommended: 6GB+ VRAM)
Dependencies: lmdeploy, fastaudiosr, ncodec, gradio

Limitations

Best performance with reference audio 3-10 seconds long
Reference audio should be clear without background noise
Some rare Vietnamese words may not be pronounced correctly
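
The length recommendation can be checked before a clip is passed to the model. The following is a minimal sketch using only the Python standard library. It assumes the reference is an uncompressed PCM WAV file, and the function names are hypothetical. The default bounds follow the 3-10 second range stated above and can be adjusted.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def check_reference_length(path, min_seconds=3.0, max_seconds=10.0):
    """Return (is_within_range, duration) for a reference clip."""
    duration = wav_duration_seconds(path)
    return min_seconds <= duration <= max_seconds, duration
```

A typical use would be `ok, seconds = check_reference_length("speaker.wav")`, trimming or re-recording the clip when `ok` is false. The check covers duration only and says nothing about background noise.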

Citation

@misc{vira-tts,
  author = {Dinh Thuan},
  title = {Vira-TTS: Vietnamese Text-to-Speech with Voice Cloning},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/dolly-vn/Vira-TTS}
}

Acknowledgements

MiraTTS - Original model architecture
Spark-TTS - Base TTS model
FlashSR - Audio super-resolution
LMDeploy - Inference optimization
soe-vinorm - Vietnamese text normalization

License

MIT

Contact

GitHub: @iamdinhthuan
Repository: Vira-TTS
