ticchien committed on
Commit 50f2999 · verified · 1 Parent(s): aeb74ef

Upload README.md with huggingface_hub

  1. README.md +111 -0
README.md ADDED
@@ -0,0 +1,111 @@
+ ---
+ language:
+ - vi
+ license: cc-by-nc-4.0
+ tags:
+ - text-to-speech
+ - vietnamese
+ - f5-tts
+ pipeline_tag: text-to-speech
+ ---
+
+ # F5-TTS Vietnamese Model
+
+ A Vietnamese text-to-speech model based on the F5-TTS architecture.
+
+ ## Model Details
+
+ - **Base Model**: F5-TTS
+ - **Language**: Vietnamese
+ - **Training Steps**: 71,000
+ - **Sample Rate**: 24 kHz
+ - **Mel Channels**: 100
+
+ ## Usage
+
+ ```python
+ import torch
+ import soundfile as sf
+ from f5_tts.model import CFM, DiT
+ from f5_tts.model.utils import get_tokenizer, convert_char_to_pinyin
+ from huggingface_hub import hf_hub_download
+ from vocos import Vocos
+
+ # Select device
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+
+ # Download checkpoint and vocab (replace YOUR_USERNAME with the repo owner)
+ checkpoint_path = hf_hub_download(repo_id="YOUR_USERNAME/F5-TTS-Vietnamese", filename="model_71000.pt")
+ vocab_path = hf_hub_download(repo_id="YOUR_USERNAME/F5-TTS-Vietnamese", filename="vocab.txt")
+
+ # Load vocab
+ vocab_char_map, vocab_size = get_tokenizer(vocab_path, tokenizer="custom")
+
+ # Initialize model
+ model = CFM(
+     transformer=DiT(
+         dim=1024,
+         depth=22,
+         heads=16,
+         ff_mult=2,
+         text_dim=512,
+         conv_layers=4,
+         text_num_embeds=vocab_size,
+         mel_dim=100,
+     ),
+     mel_spec_kwargs=dict(
+         n_fft=1024,
+         hop_length=256,
+         win_length=1024,
+         n_mel_channels=100,
+         target_sample_rate=24000,
+         mel_spec_type="vocos",
+     ),
+     odeint_kwargs=dict(method="euler"),
+     vocab_char_map=vocab_char_map,
+ ).to(device)
+
+ # Load checkpoint: use the EMA weights, strip the "ema_model." prefix,
+ # and drop the EMA bookkeeping entries
+ checkpoint = torch.load(checkpoint_path, map_location=device)
+ state_dict = {
+     k.replace("ema_model.", ""): v
+     for k, v in checkpoint["ema_model_state_dict"].items()
+     if k not in ["initted", "step"]
+ }
+ model.load_state_dict(state_dict)
+ model.eval()
+
+ # Load vocoder
+ vocoder = Vocos.from_pretrained("charactr/vocos-mel-24khz").to(device)
+
+ # Inference
+ ref_audio = "reference.wav"  # Your reference audio
+ ref_text = "Đây là văn bản tham chiếu"
+ gen_text = "Đây là văn bản cần tạo giọng nói"
+
+ # ... (see full example in repository)
+ ```
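The inference step above is elided, but one piece of it can be illustrated in isolation: the sampler needs a target length in mel frames. The sketch below is a minimal proportional heuristic, not a confirmed part of this repository. It assumes the generated speech keeps the same frames-per-text-byte rate as the reference clip. The function name `estimate_total_frames` and the `speed` parameter are hypothetical; `HOP_LENGTH` and `SAMPLE_RATE` are copied from `mel_spec_kwargs`.

```python
HOP_LENGTH = 256      # hop_length from mel_spec_kwargs in the usage example
SAMPLE_RATE = 24_000  # target_sample_rate from mel_spec_kwargs


def estimate_total_frames(ref_samples: int, ref_text: str, gen_text: str, speed: float = 1.0) -> int:
    """Estimate the total mel frames (reference + generated) to sample.

    Hypothetical helper: assumes generated speech has the same
    frames-per-byte rate as the reference clip, scaled by `speed`.
    """
    if not ref_text:
        raise ValueError("ref_text must not be empty")
    ref_frames = ref_samples // HOP_LENGTH
    ref_len = len(ref_text.encode("utf-8"))
    gen_len = len(gen_text.encode("utf-8"))
    gen_frames = int(ref_frames / ref_len * gen_len / speed)
    return ref_frames + gen_frames


# Example: a 5-second reference clip at 24 kHz with the texts from the README.
total = estimate_total_frames(
    5 * SAMPLE_RATE,
    "Đây là văn bản tham chiếu",
    "Đây là văn bản cần tạo giọng nói",
)
print(total, "frames, about", round(total * HOP_LENGTH / SAMPLE_RATE, 2), "seconds")
```

At 24 kHz with a hop length of 256, one second of audio corresponds to 93.75 mel frames, so the estimate converts back to seconds by multiplying by `HOP_LENGTH / SAMPLE_RATE`.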
+
+ ## Training Details
+
+ - Dataset: Vietnamese speech dataset
+ - Optimizer: AdamW
+ - Scheduler: Linear warmup + decay
+ - Batch size: Dynamic (frame-based)
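The scheduler and batching entries above can be sketched in plain Python. This is an illustration under stated assumptions, not the training code used for this checkpoint: the README does not give the peak learning rate, warmup length, total steps, or frame budget, so every parameter is a placeholder, and the function names `lr_at_step` and `frame_batches` are hypothetical. The decay is assumed to be linear.

```python
from typing import Iterable, Iterator, List


def lr_at_step(step: int, peak_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup from 0 to peak_lr, then linear decay back to 0 (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / max(total_steps - warmup_steps, 1)


def frame_batches(frame_counts: Iterable[int], max_frames: int) -> Iterator[List[int]]:
    """Yield lists of sample indices whose summed mel-frame count fits max_frames."""
    batch: List[int] = []
    used = 0
    for idx, n_frames in enumerate(frame_counts):
        if n_frames > max_frames:
            continue  # an utterance longer than the budget can never fit; skip it
        if batch and used + n_frames > max_frames:
            yield batch  # current batch is full; start a new one
            batch, used = [], 0
        batch.append(idx)
        used += n_frames
    if batch:
        yield batch


# Example with made-up utterance lengths (in mel frames) and a 1000-frame budget.
print(list(frame_batches([300, 500, 400, 900, 1200, 100], max_frames=1000)))
```

With frame-based batching, the number of utterances per batch varies while the total audio per batch stays roughly constant, which keeps memory use steady when clip lengths differ widely.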
+
+ ## Limitations
+
+ - Works best with 3-15 seconds of reference audio
+ - Vietnamese language only
+ - Requires good-quality reference audio
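The 3-15 second guideline above can be checked before running inference. The sketch below uses only the standard-library `wave` module, so it assumes the reference clip is an uncompressed PCM WAV file. The helper names `wav_duration_seconds` and `is_usable_reference` are hypothetical and not part of F5-TTS.

```python
import wave

# Duration range stated in the limitations above.
MIN_SECONDS, MAX_SECONDS = 3.0, 15.0


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_reference(path: str) -> bool:
    """True if the clip length falls inside the recommended 3-15 second range."""
    return MIN_SECONDS <= wav_duration_seconds(path) <= MAX_SECONDS
```

For example, `is_usable_reference("reference.wav")` could be called before loading the model so that an unsuitable clip is rejected early. The check covers duration only; it says nothing about noise or recording quality.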
+
+ ## Citation
+
+ ```bibtex
+ @article{chen2024f5tts,
+   title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching},
+   author={Chen, Yushen and others},
+   journal={arXiv preprint arXiv:2410.06885},
+   year={2024}
+ }
+ ```
+
+ ## License
+
+ CC-BY-NC-4.0