ticchien/F5-TTS-Vietnamese-Test

+---
+language:
+- vi
+license: cc-by-nc-4.0
+tags:
+- text-to-speech
+- vietnamese
+- f5-tts
+pipeline_tag: text-to-speech
+---
+# F5-TTS Vietnamese Model
+Vietnamese text-to-speech model based on the F5-TTS architecture.
+## Model Details
+- **Base Model**: F5-TTS
+- **Language**: Vietnamese
+- **Training Steps**: 71,000
+- **Sample Rate**: 24 kHz
+- **Mel Channels**: 100
+## Usage
+```python
+import torch
+import soundfile as sf
+from huggingface_hub import hf_hub_download
+from f5_tts.model import CFM, DiT
+from f5_tts.model.utils import get_tokenizer, convert_char_to_pinyin
+from vocos import Vocos
+# Select device
+device = "cuda" if torch.cuda.is_available() else "cpu"
+# Download checkpoint and vocabulary from the Hub
+checkpoint_path = hf_hub_download(repo_id="YOUR_USERNAME/F5-TTS-Vietnamese", filename="model_71000.pt")
+vocab_path = hf_hub_download(repo_id="YOUR_USERNAME/F5-TTS-Vietnamese", filename="vocab.txt")
+# Load vocab
+vocab_char_map, vocab_size = get_tokenizer(vocab_path, tokenizer="custom")
+# Initialize model
+model = CFM(
+    transformer=DiT(
+        dim=1024,
+        depth=22,
+        heads=16,
+        ff_mult=2,
+        text_dim=512,
+        conv_layers=4,
+        text_num_embeds=vocab_size,
+        mel_dim=100
+    ),
+    mel_spec_kwargs=dict(
+        n_fft=1024,
+        hop_length=256,
+        win_length=1024,
+        n_mel_channels=100,
+        target_sample_rate=24000,
+        mel_spec_type="vocos",
+    ),
+    odeint_kwargs=dict(method="euler"),
+    vocab_char_map=vocab_char_map,
+).to(device)
+# Load checkpoint: keep the EMA weights and strip their "ema_model." prefix
+checkpoint = torch.load(checkpoint_path, map_location=device)
+state_dict = {
+    k.replace("ema_model.", ""): v
+    for k, v in checkpoint["ema_model_state_dict"].items()
+    if k not in ["initted", "step"]  # EMA bookkeeping entries, not model weights
+}
+model.load_state_dict(state_dict)
+model.eval()
+# Load vocoder
+vocoder = Vocos.from_pretrained("charactr/vocos-mel-24khz").to(device)
+# Inference
+ref_audio = "reference.wav"  # Your reference audio
+ref_text = "Đây là văn bản tham chiếu"
+gen_text = "Đây là văn bản cần tạo giọng nói"
+# ... (see full example in repository)
+```
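The elided inference step has to tell the model how many mel frames to produce in total (reference plus newly generated speech). A minimal sketch of one common heuristic, scaling the reference frame count by the ratio of text lengths, is shown below. The function name `estimate_total_frames`, the `speed` parameter, and the use of UTF-8 byte length as the text-length measure are assumptions for illustration and are not specified by this model card; the default hop length of 256 matches `mel_spec_kwargs` above.

```python
def estimate_total_frames(ref_num_samples, ref_text, gen_text, hop_length=256, speed=1.0):
    """Estimate total mel frames (reference + generated) for one inference call.

    Hypothetical helper: assumes speech length grows linearly with text length.
    """
    if not ref_text:
        raise ValueError("ref_text must not be empty")
    if speed <= 0:
        raise ValueError("speed must be positive")

    # Number of mel frames covered by the reference audio.
    ref_frames = ref_num_samples // hop_length

    # Text length measured in UTF-8 bytes (an assumption; Vietnamese
    # diacritics therefore count as more than one unit).
    ref_len = len(ref_text.encode("utf-8"))
    gen_len = len(gen_text.encode("utf-8"))

    # Frames per text unit in the reference, applied to the new text.
    gen_frames = int(ref_frames / ref_len * gen_len / speed)
    return ref_frames + gen_frames
```

For example, `estimate_total_frames(num_samples, ref_text, gen_text)` could supply the target duration for the sampling call, with `num_samples` taken from the loaded 24 kHz reference waveform.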
+## Training Details
+- Dataset: Vietnamese speech dataset
+- Optimizer: AdamW
+- Scheduler: Linear warmup + decay
+- Batch size: Dynamic (frame-based)
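Frame-based dynamic batching caps each batch by its total number of mel frames rather than by a fixed number of utterances. The following is a minimal sketch of that idea; the function name `batch_by_frames`, the greedy length-sorted packing strategy, and the frame budget are assumptions for illustration and are not taken from this model's training configuration.

```python
def batch_by_frames(frame_counts, max_frames_per_batch):
    """Group sample indices so each batch holds at most max_frames_per_batch frames."""
    # Sort by length so utterances of similar length share a batch (less padding).
    order = sorted(range(len(frame_counts)), key=lambda i: frame_counts[i])

    batches, current, current_frames = [], [], 0
    for idx in order:
        n = frame_counts[idx]
        if n > max_frames_per_batch:
            continue  # an utterance longer than the budget can never fit; skip it
        if current and current_frames + n > max_frames_per_batch:
            # Adding this utterance would exceed the budget: close the batch.
            batches.append(current)
            current, current_frames = [], 0
        current.append(idx)
        current_frames += n

    if current:
        batches.append(current)
    return batches
```

With this scheme short clips form large batches and long clips form small ones, which keeps memory use per step roughly constant.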
+## Limitations
+- Best quality with reference audio of 3 to 15 seconds
+- Vietnamese language only
+- Requires good-quality reference audio
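Because output quality depends on the reference clip being roughly 3 to 15 seconds long, it can help to validate the clip before running inference. The sketch below uses Python's standard `wave` module, so it only handles uncompressed PCM WAV files; the function names and the choice to raise an error outside the range are assumptions for illustration.

```python
import wave


def reference_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_reference_duration(path, min_seconds=3.0, max_seconds=15.0):
    """Raise ValueError if the clip falls outside the recommended range."""
    duration = reference_duration_seconds(path)
    if not (min_seconds <= duration <= max_seconds):
        raise ValueError(
            f"Reference audio is {duration:.2f} s; "
            f"expected between {min_seconds} and {max_seconds} s"
        )
    return duration
```

Calling `check_reference_duration("reference.wav")` before inference surfaces an unsuitable clip early instead of producing degraded speech.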
+## Citation
+```bibtex
+@article{chen2024f5tts,
+  title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching},
+  author={Chen, Yushen and others},
+  journal={arXiv preprint arXiv:2410.06885},
+  year={2024}
+}
+```
+## License
+CC-BY-NC-4.0