🛑 Important Note ⚠️

This Text-to-Speech (TTS) model is provided solely for research, experimentation, and technology development purposes. Any audio content generated by the model does not represent the voice, identity, opinions, or endorsement of any real individual or organization. The authors and related parties assume no responsibility for any misuse, unlawful activities, violations of privacy, personality rights, intellectual property rights, or any direct or indirect damages arising from the use of this model.

Users bear full responsibility and legal liability for the deployment, distribution, and use of the model. The use of this model for impersonation, voice cloning of individuals without lawful consent, creating misleading content, fraud, manipulation of public opinion, or any purpose that violates applicable laws is strictly prohibited. When using or sharing generated audio, it is strongly recommended to clearly disclose that the content is AI-generated and to comply fully with all applicable legal regulations, platform policies, and ethical standards.

🎙️ F5-TTS-Vietnamese-1000h

A compact fine-tuned version of F5-TTS trained on 1000 hours of Vietnamese speech.

🔗 For more fine-tuning experiments, visit: https://github.com/nguyenthienhy/F5-TTS-Vietnamese.

📜 License: CC-BY-NC-SA-4.0 — Non-commercial research use only.


📌 Model Details

  • Datasets: Vi-Voice, VLSP 2021, VLSP 2022, VLSP 2023
  • Total dataset duration: 1000 hours
  • Data processing techniques:
    • Remove background music from all audio using Facebook's Demucs model: https://github.com/facebookresearch/demucs
    • Discard audio files shorter than 1 second or longer than 30 seconds.
    • Use the Chunk-Large-Former Speech2Text model by Zalo-AI to filter out audio with bad transcripts.
    • Keep the default punctuation marks unchanged.
    • Normalize text to lowercase.
  • Training Configuration:
    • Base Model: F5-TTS_Base
    • GPU: RTX 3090
    • Batch Size: 3200 frames
    • Training time: 1.5 months
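
The duration filter and lowercase normalization listed under data processing can be sketched as follows. This is a minimal illustration only, not the authors' actual pipeline: the `Clip` record and the function names are hypothetical, and the Demucs music removal and the speech-to-text transcript filter are not covered here.

```python
from dataclasses import dataclass

MIN_SECONDS = 1.0   # from the model card: clips shorter than 1 second are dropped
MAX_SECONDS = 30.0  # from the model card: clips longer than 30 seconds are dropped


@dataclass
class Clip:
    """Hypothetical record for one utterance; not a type from the F5-TTS codebase."""
    path: str
    duration_s: float
    transcript: str


def keep_clip(clip: Clip) -> bool:
    """Return True when the clip duration lies within 1 to 30 seconds (inclusive)."""
    return MIN_SECONDS <= clip.duration_s <= MAX_SECONDS


def normalize_transcript(text: str) -> str:
    """Lowercase the transcript and leave punctuation marks unchanged."""
    return text.lower()


def prepare(clips: list[Clip]) -> list[Clip]:
    """Apply the duration filter, then lowercase the surviving transcripts."""
    return [
        Clip(c.path, c.duration_s, normalize_transcript(c.transcript))
        for c in clips
        if keep_clip(c)
    ]
```

Because the model was trained on lowercase text with punctuation preserved, it is probably sensible to pass `--ref_text` and `--gen_text` in the same form at inference time, although the model card does not state this as a requirement.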

📝 Usage

To load and use the model, follow the example below:

# Clone the repository and install it in editable mode
git clone https://github.com/nguyenthienhy/F5-TTS-Vietnamese
cd F5-TTS-Vietnamese
python -m pip install -e .

# Rename config.json to vocab.txt, the vocabulary file passed to --vocab_file below
mv F5-TTS-Vietnamese-ViVoice/config.json F5-TTS-Vietnamese-ViVoice/vocab.txt

# Run inference with ref.wav as the reference audio
f5-tts_infer-cli \
--model "F5TTS_Base" \
--ref_audio ref.wav \
--ref_text "cả hai bên hãy cố gắng hiểu cho nhau" \
--gen_text "mình muốn ra nước ngoài để tiếp xúc nhiều công ty lớn, sau đó mang những gì học được về việt nam giúp xây dựng các công trình tốt hơn" \
--speed 1.0 \
--vocoder_name vocos \
--vocab_file F5-TTS-Vietnamese-ViVoice/vocab.txt \
--ckpt_file F5-TTS-Vietnamese-ViVoice/model_last.pt
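
For scripted use, the same command can be assembled and launched from Python. The sketch below only mirrors the flags shown in the command above. The helper names `build_infer_command`, `missing_files`, and `run_inference` are hypothetical and not part of F5-TTS, and the sketch assumes that `f5-tts_infer-cli` is on the `PATH` and that the model files sit in `F5-TTS-Vietnamese-ViVoice`.

```python
import subprocess
from pathlib import Path

MODEL_DIR = "F5-TTS-Vietnamese-ViVoice"  # directory name taken from the command above


def build_infer_command(ref_audio, ref_text, gen_text, model_dir=MODEL_DIR, speed=1.0):
    """Build the argument list for f5-tts_infer-cli using only the flags shown above."""
    model_dir = Path(model_dir)
    return [
        "f5-tts_infer-cli",
        "--model", "F5TTS_Base",
        "--ref_audio", str(ref_audio),
        "--ref_text", ref_text,
        "--gen_text", gen_text,
        "--speed", str(speed),
        "--vocoder_name", "vocos",
        "--vocab_file", str(model_dir / "vocab.txt"),
        "--ckpt_file", str(model_dir / "model_last.pt"),
    ]


def missing_files(ref_audio, model_dir=MODEL_DIR):
    """Return the input files that do not exist, so errors surface before launch."""
    model_dir = Path(model_dir)
    required = [Path(ref_audio), model_dir / "vocab.txt", model_dir / "model_last.pt"]
    return [p for p in required if not p.is_file()]


def run_inference(ref_audio, ref_text, gen_text, model_dir=MODEL_DIR, speed=1.0):
    """Check the inputs, then run the CLI; raises if a file is missing or the CLI fails."""
    missing = missing_files(ref_audio, model_dir)
    if missing:
        raise FileNotFoundError(f"missing input files: {[str(p) for p in missing]}")
    cmd = build_infer_command(ref_audio, ref_text, gen_text, model_dir, speed)
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with the Vietnamese text. A call such as `run_inference("ref.wav", "cả hai bên hãy cố gắng hiểu cho nhau", "xin chào")` would then be equivalent to the shell command above with a shorter `--gen_text`.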
