@@ -17,4 +17,84 @@ ckpts/
         model_1200000.safetensors
 ```
 Github: https://github.com/SWivid/F5-TTS
-Paper: [F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching](https://huggingface.co/papers/2410.06885)

+Paper: [F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching](https://huggingface.co/papers/2410.06885)
+## Model Description
+F5-TTS is a fully non-autoregressive text-to-speech model based on flow matching with a Diffusion Transformer (DiT) backbone. Given a short reference recording and its transcript, it synthesizes new text in the reference speaker's voice, aiming for fluent and faithful speech.
+### Key Features
+- **Non-autoregressive generation**: Fast inference speed
+- **Flow matching**: High-quality audio generation
+- **Multi-speaker support**: Trained on the Emilia dataset
+- **Flexible duration control**: Natural speech rhythm
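To make the flow-matching idea above concrete, here is a minimal sketch of how a training pair can be built under the commonly used linear path, where a noise sample `x0` is interpolated toward a data sample `x1` and the regression target is the constant velocity `x1 - x0`. The function name `flow_matching_pair` and the toy vectors are illustrative assumptions, not code from the F5-TTS repository.

```python
import random


def flow_matching_pair(x0, x1, t):
    """Return (x_t, target_velocity) for the linear path x_t = (1 - t) * x0 + t * x1."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]
    return x_t, target


random.seed(0)
x1 = [0.5, -1.0, 2.0]                      # toy "data" vector (hypothetical stand-in for mel features)
x0 = [random.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample
t = random.random()                        # time drawn from [0, 1)
x_t, target = flow_matching_pair(x0, x1, t)

# Following the target velocity for the remaining time (1 - t) lands on x1.
reached = [p + (1.0 - t) * v for p, v in zip(x_t, target)]
print(all(abs(r - d) < 1e-9 for r, d in zip(reached, x1)))
```

In a real model, a network would be trained to predict `target` from `x_t`, `t`, and the text and audio conditioning; the sketch only shows the shape of that regression problem.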
+## Usage
+### Installation
+```bash
+pip install f5-tts
+```
+### Quick Start
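+```python
+from f5_tts.api import F5TTS
+
+# Initialize the model; the keyword names follow the original example and may differ between f5-tts releases
+tts = F5TTS(model_type="F5-TTS", ckpt_file="path/to/model_1250000.safetensors")
+
+# Generate speech. infer() returns the waveform, its sample rate, and the spectrogram,
+# and writes a WAV file only when file_wave is given.
+output_path = "output.wav"
+wav, sr, spec = tts.infer(
+    gen_text="This is a sample text for speech synthesis.",
+    ref_file="reference_audio.wav",  # Reference audio for voice cloning
+    ref_text="Reference text spoken in the audio.",
+    file_wave=output_path,
+)
+print(f"Generated audio saved to: {output_path}")
+```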
+### Advanced Usage
+```python
+# Custom generation parameters (reuses the `tts` object from Quick Start)
+wav, sr, spec = tts.infer(
+    gen_text="Your text here",
+    ref_file="reference.wav",
+    ref_text="Reference transcript",
+    nfe_step=32,  # Number of function evaluations used by the sampler
+    speed=1.0,    # Speech speed multiplier
+)
+```
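The `nfe_step` argument above counts how many times the sampler evaluates the velocity field while integrating from noise to speech. As a rough illustration of that trade-off, the sketch below runs a fixed-step Euler solver on a toy ODE with a known solution. The `toy_velocity` function is a hypothetical stand-in for the learned model, and the assumption that a plain Euler solver is used is ours, so treat this as an analogy rather than the library's sampler.

```python
import math


def euler_integrate(velocity, x0, nfe_step):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = x0
    dt = 1.0 / nfe_step
    evaluations = 0
    for k in range(nfe_step):
        t = k * dt
        x = x + dt * velocity(x, t)  # one function evaluation per step
        evaluations += 1
    return x, evaluations


def toy_velocity(x, t):
    # Hypothetical stand-in for the learned velocity field
    return -2.0 * x


exact = math.exp(-2.0)  # closed-form solution at t=1 for x0=1
for n in (4, 16, 32):
    approx, nfe = euler_integrate(toy_velocity, 1.0, n)
    print(f"nfe_step={n:>2}  evaluations={nfe}  abs_error={abs(approx - exact):.5f}")
```

More steps reduce the integration error at the cost of proportionally more model calls, which is why lowering `nfe_step` speeds up generation and may reduce quality.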
+## Model Variants
+- **F5TTS_Base**: Standard model (1.2M steps)
+- **F5TTS_v1_Base**: Improved version (1.25M steps)
+- **F5TTS_Base_bigvgan**: With BigVGAN vocoder
+## Training Data
+Trained on the [Emilia dataset](https://huggingface.co/datasets/amphion/Emilia-Dataset), a large-scale multilingual speech dataset.
+## Limitations
+- Output quality depends on the reference audio; clean, clear recordings work best
+- May require fine-tuning for specific voices or accents
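Since results depend on the reference audio, it can help to run a few rough sanity checks on a clip before using it. The sketch below uses only the Python standard library to read a 16-bit PCM WAV file and report its sample rate, duration, and peak level. The helper names and both thresholds are illustrative assumptions, and these checks are not a measure of how clear a recording is.

```python
import array
import sys
import wave


def describe_reference(path):
    """Return sample rate, duration, and peak level of a 16-bit PCM WAV file."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM audio")
        rate = wf.getframerate()
        n_frames = wf.getnframes()
        samples = array.array("h", wf.readframes(n_frames))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    peak = max((abs(s) for s in samples), default=0) / 32768.0
    return {"sample_rate": rate, "duration_s": n_frames / rate, "peak": peak}


def reference_warnings(stats, max_peak=0.99, min_seconds=1.0):
    """Collect human-readable warnings; both thresholds are illustrative assumptions."""
    warnings = []
    if stats["peak"] >= max_peak:
        warnings.append("possible clipping")
    if stats["duration_s"] < min_seconds:
        warnings.append("very short clip")
    return warnings
```

A call such as `reference_warnings(describe_reference("reference_audio.wav"))` returns an empty list when neither check fires.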
+## Citation
+```bibtex
+@article{chen2024f5tts,
+  title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching},
+  author={Chen, Yushen and others},
+  journal={arXiv preprint arXiv:2410.06885},
+  year={2024}
+}
+```
+## License
+This model is released under the CC-BY-NC-4.0 license. See the [LICENSE](LICENSE) file for details.