---
license: apache-2.0
language:
- zh
---
# Model Card for F5TTS_ft
F5TTS_ft is a **fine-tuned Chinese text-to-speech (TTS) model** based on the original F5-TTS architecture, optimized for improved naturalness, prosody, and stability in Mandarin Chinese speech synthesis.
## Model Details
### Model Description
- **Developed by:** Yougen Yuan
- **Funded by:** Personal research project
- **Shared by:** Yougen Yuan
- **Model type:** Text-to-Speech (TTS); non-autoregressive flow-matching model with a Diffusion Transformer (DiT) backbone
- **Language(s) (NLP):** Chinese (Mandarin, zh-CN)
- **License:** Apache-2.0
- **Finetuned from model:** Original F5-TTS base model
### Model Sources
- **Repository:** https://huggingface.co/Yougen/F5TTS_ft
- **Paper:** F5-TTS original research (https://github.com/SWivid/F5-TTS)
- **Demo:** Not publicly available
## Uses
### Direct Use
This model can be used directly for end-to-end Chinese text-to-speech synthesis:
- Convert clean Chinese text into natural-sounding speech audio
- Generate speech for audiobooks, voice assistants, and multimedia dubbing
- Run with minimal inference code compatible with the F5-TTS pipeline
### Downstream Use
- Integration into larger voice systems, such as TTS services and real-time voice generation pipelines
- Further fine-tuning on custom datasets for specific speakers, styles, or domains
- Use as a backbone for voice cloning, style transfer, or multilingual TTS extensions
### Out-of-Scope Use
- Not intended for malicious use, deepfake voice impersonation, or deceptive voice generation
- Not optimized for extremely noisy text, heavily code-mixed text or slang, or non-Chinese languages
- Not designed for real-time, low-latency use on embedded devices without further optimization
- Not suitable for high-stakes applications (legal, medical announcements) without human verification
## Bias, Risks, and Limitations
- Speech style and prosody are constrained by the fine-tuning data distribution; may lack diversity in emotional expression
- Pronunciation accuracy depends on text normalization; rare words, proper nouns, or ancient Chinese may be mispronounced
- Audio quality degrades with extremely long sentences or unformatted messy text
- Potential bias in voice characteristics reflects the training dataset’s speaker and accent distribution
- No built-in content safety; may synthesize harmful or inappropriate text if provided as input
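One common way to work around the long-sentence limitation above is to split the input at sentence punctuation and synthesize it chunk by chunk. The following is a minimal sketch of that idea, not part of this model or of F5-TTS; the function name `split_into_chunks` and the `max_chars` default are hypothetical and should be tuned by listening tests.

```python
import re

# Zero-width split points right after Chinese or ASCII sentence punctuation.
_SENTENCE_END = re.compile(r"(?<=[。！？；!?;])")

def split_into_chunks(text: str, max_chars: int = 60) -> list[str]:
    """Split text into sentences, then pack sentences into chunks.

    A chunk is closed once adding the next sentence would exceed max_chars.
    A single sentence longer than max_chars is kept whole.
    """
    sentences = [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current += sentence
    if current:
        chunks.append(current)
    return chunks

if __name__ == "__main__":
    for chunk in split_into_chunks("你好。今天天气很好！我们去公园吧？", max_chars=8):
        print(chunk)
```

Each chunk can then be synthesized separately and the resulting audio concatenated.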
### Recommendations
Users should:
- Clean and normalize input text (punctuation, proper nouns, numbers) for best results
- Avoid using the model for deceptive, harmful, or non-consensual voice generation
- Add content moderation layers when deploying in public or commercial systems
- Conduct further fine-tuning if domain-specific pronunciation or voice style is required
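As an illustration of the first recommendation, the sketch below performs a light cleanup pass on Chinese input before synthesis. It is an assumption about what "clean and normalize" could look like in practice, not the preprocessing used for this model; the helper name `normalize_text` is hypothetical, and it does not expand numbers, dates, or proper nouns into their spoken forms.

```python
import re

# Basic CJK Unified Ideographs range, used inside character classes below.
_CJK = r"\u4e00-\u9fff"

def normalize_text(text: str) -> str:
    """Light cleanup for Chinese TTS input (not a full normalization front end)."""
    # Drop zero-width characters and BOMs that often survive copy-paste.
    text = re.sub(r"[\u200b-\u200d\ufeff]", "", text)
    # Collapse any run of whitespace (including newlines) to one space.
    text = re.sub(r"\s+", " ", text).strip()
    # Remove a space that sits between CJK characters or CJK punctuation.
    text = re.sub(rf"(?<=[{_CJK}，。！？；：]) (?=[{_CJK}，。！？；：])", "", text)
    # Collapse repeated punctuation such as "！！！" to a single mark.
    text = re.sub(r"([，。！？；])\1+", r"\1", text)
    return text

if __name__ == "__main__":
    print(normalize_text("  今天  天气\n很好！！！ "))
```

Spaces between Latin words and Chinese characters are left in place, since removing them could merge tokens that should stay separate.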
## How to Get Started with the Model
This model follows the original F5-TTS inference framework. Example usage:
```python
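# Sketch based on the F5-TTS inference utilities. The checkpoint filename and
# the model config below are assumptions; check the repository files first.
from huggingface_hub import hf_hub_download
from f5_tts.model import DiT
from f5_tts.infer.utils_infer import load_model, load_vocoder

# Assumed filename: replace with the actual checkpoint name in the repo.
ckpt_path = hf_hub_download(repo_id="Yougen/F5TTS_ft", filename="model.safetensors")

# Assumed to match the F5-TTS base DiT configuration.
model_cfg = dict(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4)

model = load_model(DiT, model_cfg, ckpt_path, device="cuda")  # or "cpu"
vocoder = load_vocoder()

# Run TTS inference
# Refer to the official F5-TTS inference code for the full pipeline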
```
Full inference code is available in the original F5-TTS repository:
https://github.com/SWivid/F5-TTS
## Training Details
### Training Data
Fine-tuned on a **private Chinese Mandarin speech dataset** with:
- Clean, single-speaker or multi-speaker audio
- Aligned text transcripts
- Standard Mandarin pronunciation (Putonghua)
- Preprocessing: audio resampled to 24 kHz, silences trimmed, and volume normalized
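The exact preprocessing pipeline is not published. As a rough sketch of the silence trimming and volume normalization steps listed above, the NumPy code below trims leading and trailing low-amplitude samples and rescales the waveform to a target peak. The function names, the `threshold` value, and the `target_peak` value are illustrative assumptions; resampling to 24 kHz is left to an audio library and is not shown.

```python
import numpy as np

def trim_silence(wave: np.ndarray, threshold: float = 0.01) -> np.ndarray:
    """Drop leading and trailing samples whose absolute amplitude is below threshold."""
    loud = np.flatnonzero(np.abs(wave) >= threshold)
    if loud.size == 0:
        return wave[:0]  # the clip is entirely below the threshold
    return wave[loud[0] : loud[-1] + 1]

def peak_normalize(wave: np.ndarray, target_peak: float = 0.95) -> np.ndarray:
    """Scale the waveform so that its largest absolute sample equals target_peak."""
    peak = float(np.max(np.abs(wave))) if wave.size else 0.0
    if peak == 0.0:
        return wave  # nothing to scale
    return wave * (target_peak / peak)

if __name__ == "__main__":
    # Tiny synthetic waveform, for illustration only.
    wave = np.array([0.0, 0.001, 0.2, -0.5, 0.1, 0.0, 0.0])
    print(peak_normalize(trim_silence(wave)))
```

Amplitude thresholding is the simplest possible silence detector; energy-based or voice-activity-based trimming is usually more robust on noisy recordings.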
### Training Procedure
#### Preprocessing
- Text: Chinese tokenization, phoneme / prosody annotation
- Audio: Mel-spectrogram extraction, 24kHz sampling rate
- Data filtering: removed low-quality, truncated, or misaligned samples
#### Training Hyperparameters
- **Training regime:** fp16 mixed precision
- Optimizer: AdamW
- Learning rate: standard for diffusion TTS
- Batch size and steps adjusted for fine-tuning
#### Speeds, Sizes, Times
Training was performed on a single NVIDIA GPU with sufficient VRAM.
The checkpoint size matches the original F5-TTS architecture.
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
Internal held-out Chinese test set with diverse sentences and scenarios.
#### Factors
- Speech naturalness
- Pronunciation accuracy
- Intelligibility
- Prosodic consistency
#### Metrics
- Subjective MOS (Mean Opinion Score)
- Objective mel-spectrogram reconstruction loss
- Intelligibility validation
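For readers unfamiliar with MOS, the sketch below shows how a mean opinion score and a normal-approximation 95% confidence interval are typically computed from listener ratings. The ratings in the example are made up for illustration and are not evaluation results for this model; the function name `mos_with_ci` is hypothetical.

```python
import math
import statistics

def mos_with_ci(ratings: list[float], z: float = 1.96) -> tuple[float, float]:
    """Return (mean opinion score, half-width of the approximate 95% CI)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mean = statistics.fmean(ratings)
    # Standard error of the mean, using the sample standard deviation.
    stderr = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, z * stderr

if __name__ == "__main__":
    # Made-up ratings on a 1-5 scale, for illustration only.
    example_ratings = [4, 5, 4, 3, 4, 5, 4, 4]
    mos, half_width = mos_with_ci(example_ratings)
    print(f"MOS = {mos:.2f} ± {half_width:.2f}")
```

With only a handful of ratings the normal approximation is loose; a t-distribution quantile or a bootstrap interval is a safer choice for small listener panels.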
### Results
In the internal evaluation, the fine-tuned version shows improved stability and naturalness compared to the baseline on Chinese speech.
#### Summary
F5TTS_ft improves Mandarin TTS quality with better prosody, clearer pronunciation, and more consistent audio generation.
## Model Examination
No additional interpretability analysis provided beyond standard diffusion TTS behavior.
## Environmental Impact
- **Hardware Type:** NVIDIA GPU (CUDA-enabled)
- **Hours used:** Not precisely recorded
- **Cloud Provider:** None (local training)
- **Compute Region:** N/A
- **Carbon Emitted:** Not calculated
## Technical Specifications
### Model Architecture and Objective
- Architecture: Non-autoregressive TTS with a Diffusion Transformer (DiT) backbone, inherited from F5-TTS
- Objective: Generate mel-spectrograms from text tokens via flow matching, sampled over multiple iterative steps
- Vocoder: Compatible with the official F5-TTS vocoder
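For reference, the upstream F5-TTS work frames training as conditional flow matching. A common form of that loss is written below as a sketch, with notation chosen for this card rather than taken from the fine-tuning code: $x_0$ is Gaussian noise, $x_1$ is the target mel-spectrogram, $c$ is the conditioning (text plus masked audio context), and $v_\theta$ is the DiT network predicting a velocity field.

$$
\mathcal{L}_{\mathrm{CFM}} = \mathbb{E}_{t \sim \mathcal{U}[0,1],\; x_0 \sim \mathcal{N}(0, I),\; x_1}
\left\| v_\theta\big((1-t)\,x_0 + t\,x_1,\; t,\; c\big) - (x_1 - x_0) \right\|^2
$$

At inference time, a mel-spectrogram is obtained by integrating $\mathrm{d}x/\mathrm{d}t = v_\theta(x, t, c)$ from $t = 0$ (noise) to $t = 1$ with a numerical ODE solver, and the vocoder then converts it to a waveform.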
### Compute Infrastructure
#### Hardware
NVIDIA GPU with CUDA support (12 GB or more of VRAM recommended for inference)
#### Software
- PyTorch
- F5-TTS official codebase
- Hugging Face Hub library
## Citation
**BibTeX:**
```bibtex
@misc{F5TTS,
author = {SWivid},
title = {F5-TTS: A Non-Autoregressive Diffusion TTS Model},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/SWivid/F5-TTS}}
}
@misc{F5TTS_ft,
author = {Yougen Yuan},
title = {F5TTS_ft: Fine-tuned Chinese F5-TTS Model},
year = {2026},
publisher = {Hugging Face Hub},
howpublished = {\url{https://huggingface.co/Yougen/F5TTS_ft}}
}
```
**APA:**
SWivid. (2024). F5-TTS: A Non-Autoregressive Diffusion TTS Model. GitHub. https://github.com/SWivid/F5-TTS
Yuan, Y. (2026). F5TTS_ft: Fine-tuned Chinese F5-TTS Model. Hugging Face Hub. https://huggingface.co/Yougen/F5TTS_ft
## Glossary
- **TTS**: Text-to-Speech
- **F5-TTS**: Original non-autoregressive TTS architecture, based on flow matching with a Diffusion Transformer (DiT)
- **Mel-spectrogram**: Audio frequency representation used in TTS
- **Fine-tuned**: Model adapted from a pre-trained checkpoint on new data
## More Information
This model is a research-oriented fine-tune for Chinese speech synthesis and is not officially affiliated with the original F5-TTS authors.
## Model Card Authors
Yougen Yuan
## Model Card Contact
Yougen Yuan (via Hugging Face Hub)