Text-to-Speech
F5-TTS
Files changed (1)
  1. README.md +81 -1
README.md CHANGED
@@ -17,4 +17,84 @@ ckpts/
  model_1200000.safetensors
  ```
  Github: https://github.com/SWivid/F5-TTS
- Paper: [F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching](https://huggingface.co/papers/2410.06885)
+ Paper: [F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching](https://huggingface.co/papers/2410.06885)
+
+ ## Model Description
+
+ F5-TTS is a fully non-autoregressive text-to-speech model based on flow matching with a Diffusion Transformer (DiT). It needs no separate duration model, text encoder, or phoneme alignment: the input text is padded with filler tokens to the length of the target speech, and the model denoises the speech from there.
+
+ ### Key Features
+ - **Non-autoregressive generation**: All frames are generated in parallel, which keeps inference fast
+ - **Flow matching**: Audio is sampled in a fixed number of function evaluations (`nfe_step`)
+ - **Multi-speaker support**: Trained on the Emilia dataset; the target voice is taken from a reference clip at inference time
+ - **Flexible duration control**: Speaking rate is adjustable through the `speed` parameter
+
+ ## Usage
+
+ ### Installation
+
+ ```bash
+ pip install f5-tts
+ ```
+
+ ### Quick Start
+
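+ ```python
+ from f5_tts.api import F5TTS
+
+ # Initialize the model. `model` is the keyword in f5-tts 1.x; earlier releases
+ # used `model_type`. The 1.25M-step checkpoint belongs to F5TTS_v1_Base.
+ tts = F5TTS(model="F5TTS_v1_Base", ckpt_file="path/to/model_1250000.safetensors")
+
+ # Generate speech. infer() returns the waveform, its sample rate, and the
+ # spectrogram; a file is written only when file_wave is given.
+ wav, sr, spec = tts.infer(
+     gen_text="This is a sample text for speech synthesis.",
+     ref_file="reference_audio.wav",  # Reference audio for voice cloning
+     ref_text="Reference text spoken in the audio.",
+     file_wave="output.wav",
+ )
+
+ print("Generated audio saved to: output.wav")
+ ```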
+
+ ### Advanced Usage
+
+ ```python
+ # Custom generation parameters (reuses `tts` from Quick Start)
+ wav, sr, spec = tts.infer(
+     gen_text="Your text here",
+     ref_file="reference.wav",
+     ref_text="Reference transcript",
+     nfe_step=32,  # Number of function evaluations (ODE solver steps)
+     speed=1.0,    # Speech speed multiplier
+     file_wave="output.wav",  # Write the result to disk
+ )
+ ```
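
To give a feel for what `nfe_step` trades off, here is a toy sketch, not F5-TTS code. Flow-matching samplers produce their output by numerically integrating an ODE, and each solver step costs one evaluation of the network. The `velocity` function below is a hypothetical stand-in for the learned model (it simply returns `x`, so the exact value at t = 1 is e), and fixed-step Euler is assumed as the solver.

```python
import math


def velocity(x: float, t: float) -> float:
    """Hypothetical stand-in for the learned velocity field (here dx/dt = x)."""
    return x


def euler_sample(x0: float, nfe_step: int) -> float:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = x0
    dt = 1.0 / nfe_step
    for i in range(nfe_step):
        t = i * dt
        x = x + dt * velocity(x, t)  # one function evaluation per step
    return x


def integration_error(nfe_step: int) -> float:
    """Absolute error against the exact solution x(1) = e for x(0) = 1."""
    return abs(euler_sample(1.0, nfe_step) - math.e)


if __name__ == "__main__":
    for n in (8, 16, 32, 64):
        print(f"nfe_step={n:3d}  error={integration_error(n):.4f}")
```

More steps shrink the integration error at a proportional cost in compute. The real model's quality/speed curve will differ from this toy problem, so treat the numbers as illustrative only.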
+
+ ## Model Variants
+
+ - **F5TTS_Base**: Standard model (1.2M steps)
+ - **F5TTS_v1_Base**: Improved version (1.25M steps)
+ - **F5TTS_Base_bigvgan**: With BigVGAN vocoder
+
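
The step counts above line up with the checkpoint file names used in this README (`model_1200000.safetensors`, `model_1250000.safetensors`). A minimal sketch of that mapping follows; `VARIANT_STEPS` and `checkpoint_filename` are hypothetical helpers, not part of the `f5-tts` package, and the `model_<steps>.safetensors` pattern is assumed from those two file names only.

```python
# Step counts as listed in the Model Variants section. The BigVGAN variant is
# omitted because its step count is not stated there.
VARIANT_STEPS = {
    "F5TTS_Base": 1_200_000,
    "F5TTS_v1_Base": 1_250_000,
}


def checkpoint_filename(variant: str) -> str:
    """Return the assumed checkpoint file name for a known variant."""
    try:
        steps = VARIANT_STEPS[variant]
    except KeyError:
        raise ValueError(f"no step count known for variant {variant!r}") from None
    return f"model_{steps}.safetensors"


if __name__ == "__main__":
    for name in VARIANT_STEPS:
        print(name, "->", checkpoint_filename(name))
```

Keeping the variant name and checkpoint in one table makes it harder to pair a checkpoint with the wrong model configuration.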
+ ## Training Data
+
+ Trained on the [Emilia dataset](https://huggingface.co/datasets/amphion/Emilia-Dataset), a large-scale multilingual speech dataset.
+
+ ## Limitations
+
+ - Generation quality depends on the reference audio; clean, noise-free recordings work best
+ - May require fine-tuning for specific voices or accents
+
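
Since output quality tracks the reference clip, it can help to inspect that clip before running inference. The sketch below uses only Python's standard `wave` module and therefore handles uncompressed PCM WAV files only. `inspect_reference` is a hypothetical helper, and the 12-second default for `max_seconds` is an illustrative threshold chosen here, not a limit documented in this README.

```python
import wave


def inspect_reference(path: str, max_seconds: float = 12.0) -> dict:
    """Report basic properties of a PCM WAV file and flag overly long clips."""
    with wave.open(path, "rb") as wf:
        sample_rate = wf.getframerate()
        channels = wf.getnchannels()
        duration = wf.getnframes() / sample_rate
    return {
        "sample_rate": sample_rate,
        "channels": channels,
        "duration_s": duration,
        "too_long": duration > max_seconds,  # illustrative threshold only
    }


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        print(inspect_reference(sys.argv[1]))
```

This checks format and length only; it says nothing about background noise or clipping, which still need a listen.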
+ ## Citation
+
+ ```bibtex
+ @article{chen2024f5tts,
+   title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching},
+   author={Chen, Yushen and others},
+   journal={arXiv preprint arXiv:2410.06885},
+   year={2024}
+ }
+ ```
+
+ ## License
+
+ This model is released under the CC-BY-NC-4.0 license. See the [LICENSE](LICENSE) file for details.