---
license: apache-2.0
datasets:
- pnnbao-ump/VieNeu-TTS-140h
- pnnbao-ump/VieNeuCodec-dataset
language:
- vi
base_model:
- neuphonic/neutts-air
pipeline_tag: text-to-speech
---

# VieNeu-TTS

[![GitHub](https://img.shields.io/badge/GitHub-Repository-blue)](https://github.com/pnnbao97/VieNeu-TTS)
[![Model](https://img.shields.io/badge/Hugging%20Face-Model-yellow)](https://huggingface.co/pnnbao-ump/VieNeu-TTS)

![Untitled](https://cdn-uploads.huggingface.co/production/uploads/68b923a86c86c127a1975eda/vd7kW8h7ooSafcIhEQtyr.png)

> 📢 **Upcoming Release**
>
> **VieNeu-TTS-1000h** is currently in training, using ~1000 hours of high-quality Vietnamese speech **combined with English speech data**.
> This next version will support **bilingual voice synthesis (Vietnamese + English)** with consistent speaker identity.
>
> Expected improvements:
> - More accurate and stable Vietnamese pronunciation
> - Improved English pronunciation and code-switching
> - Higher voice cloning fidelity and speaker consistency
>
> A **GGUF version** is also planned for the earliest possible release.
>
> **Current release:** VieNeu-TTS-140h (stable & production-ready)

## Overview

**VieNeu-TTS** is an on-device Vietnamese Text-to-Speech (TTS) model with **instant voice cloning**. It is fine-tuned from **NeuTTS Air** and synthesizes natural **24 kHz speech** in real time on CPU or GPU.

## Support This Project

Training high-quality TTS models requires significant GPU resources and compute time. If you find this model useful, please consider supporting the development:

[![Buy Me a Coffee](https://img.shields.io/badge/Buy%20Me%20a%20Coffee-Support-orange?logo=buy-me-a-coffee)](https://buymeacoffee.com/pnnbao)

Your support helps maintain and improve VieNeu-TTS! 🙏

---

## Voice Cloning Inference

**Reference Voice (Speaker Example):**

**Input Text:**

> Trên bầu trời xanh thẳm, những đám mây trắng lửng lờ trôi như những chiếc thuyền nhỏ đang lướt nhẹ theo dòng gió.
> Dưới mặt đất, cánh đồng lúa vàng rực trải dài tới tận chân trời, những bông lúa nghiêng mình theo từng làn gió.

**Generated Output (Cloned Voice):**

## Long Text Inference

VieNeu-TTS supports long-form text synthesis (multiple sentences, paragraphs, or entire articles).
For efficient sentence splitting, text normalization, and streaming playback, please refer to the example script in the repository:

🔗 https://github.com/pnnbao97/VieNeu-TTS

Example file: `examples/infer_long_text.py`

**Long-form speech output example:**

---

## Model Architecture

| Component | Description |
|----------|-------------|
| Backbone | Qwen 0.5B (chat-format LM) |
| Codec | NeuCodec (supports ONNX + quantization) |
| Output | 24 kHz waveform synthesis |
| Context Window | 2048 tokens shared text + speech |
| Watermark | Enabled |
| Training Data | VieNeuCodec-dataset + Emilia dataset pretraining |

## Features

- High-quality Vietnamese speech
- Instant **voice cloning** (3–5 second reference audio)
- Fully **offline**
- Runs real-time or faster
- Multi-voice reference support
- Python API + CLI + Gradio

## Quick Usage (Python)

```python
from pathlib import Path

import soundfile as sf

from vieneu_tts import VieNeuTTS
from utils.normalize_text import VietnameseTTSNormalizer

# Reference clip and its transcript (the voice to clone).
ref_audio = "sample/id_0001.wav"
ref_text = Path("sample/id_0001.txt").read_text(encoding="utf-8")

# Normalize the reference transcript before inference.
normalizer = VietnameseTTSNormalizer()
ref_text_norm = normalizer.normalize(ref_text)

# Load the backbone LM and the codec.
tts = VieNeuTTS(
    backbone_repo="pnnbao-ump/VieNeu-TTS",
    backbone_device="cuda",
    codec_repo="neuphonic/neucodec",
    codec_device="cuda",
)

# Encode the reference audio once; the codes can be reused across calls.
ref_codes = tts.encode_reference(ref_audio)

# Normalize the input text the same way as the reference transcript.
text = "Công nghệ giọng nói đang phát triển rất nhanh."
text_norm = normalizer.normalize(text)

# Synthesize and save the result at 24 kHz.
wav = tts.infer(text_norm, ref_codes, ref_text_norm)
sf.write("output.wav", wav, 24000)
```

## Gradio Demo

```bash
python gradio_app.py
```

Open your browser at `http://127.0.0.1:7860`.
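
## Chunking Long Text (Sketch)

The long-text guidance in this README (split into sentences, keep each call to roughly 250 characters) can be illustrated with a small standalone helper. This is a minimal sketch, not the logic used in `examples/infer_long_text.py`: the function names `split_into_chunks` and `_split_words`, the constant `MAX_CHARS`, and the punctuation-based sentence splitting are all assumptions made for illustration.

```python
import re

MAX_CHARS = 250  # assumed per-call limit, taken from the Best Practices section


def _split_words(sentence: str, max_chars: int) -> list[str]:
    """Break one over-long sentence at word boundaries (hypothetical helper)."""
    parts, current = [], ""
    for word in sentence.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                parts.append(current)
            # Note: a single word longer than max_chars is kept whole.
            current = word
    if current:
        parts.append(current)
    return parts


def split_into_chunks(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters."""
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?…])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        if len(sentence) <= max_chars:
            pieces = [sentence]
        else:
            pieces = _split_words(sentence, max_chars)
        for piece in pieces:
            candidate = f"{current} {piece}".strip()
            if len(candidate) <= max_chars:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be normalized and passed to `tts.infer(...)` as in Quick Usage, reusing the same `ref_codes` and `ref_text_norm`. Joining the per-chunk waveforms (for example with `numpy.concatenate`, assuming `infer` returns a 1-D array) is left out of the sketch.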

## Reference Voices

| File | Gender | Accent |
|------|--------|--------|
| id_0001 | Male | South |
| id_0002 | Female | South |
| id_0003 | Male | South |
| id_0004 | Female | South |
| id_0005 | Male | South |
| id_0007 | Male | South |

Odd numbers = Male

Even numbers = Female

## Best Practices

- Keep input ≤ 250 characters per call
- Normalize both text and reference transcript
- Use clean reference audio (~3–5s)
- For long text, use chunked inference

## Troubleshooting

| Issue | Cause | Solution |
|------|-------|----------|
| Missing `libespeak` | System dependency | Install eSpeak NG |
| GPU OOM | VRAM too small | Use CPU or quantized model |
| Poor voice match | Bad reference sample | Try a clearer reference clip |

## License

Apache 2.0

## Citation

```bibtex
@misc{vieneutts2025,
  title        = {VieNeu-TTS: Vietnamese Text-to-Speech with Instant Voice Cloning},
  author       = {Pham Nguyen Ngoc Bao},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/pnnbao-ump/VieNeu-TTS}}
}
```

Please also cite the base model:

```bibtex
@misc{neuttsair2025,
  title        = {NeuTTS Air: On-Device Speech Language Model with Instant Voice Cloning},
  author       = {Neuphonic},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/neuphonic/neutts-air}}
}
```