---
license: apache-2.0
language:
- zh
---
# Model Card for F5TTS_ft

F5TTS_ft is a **fine-tuned Chinese text-to-speech (TTS) model** based on the original F5-TTS architecture, optimized for improved naturalness, prosody, and stability in Mandarin Chinese speech synthesis.

## Model Details

### Model Description

- **Developed by:** Yougen Yuan
- **Funded by:** Personal research project
- **Shared by:** Yougen Yuan
- **Model type:** Text-to-speech (TTS); flow-matching (diffusion-style) model with a Diffusion Transformer (DiT) backbone
- **Language(s) (NLP):** Chinese (Mandarin, zh-CN)
- **License:** Apache-2.0
- **Finetuned from model:** Original F5-TTS base model

### Model Sources

- **Repository:** https://huggingface.co/Yougen/F5TTS_ft
- **Paper:** "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching" (code: https://github.com/SWivid/F5-TTS)
- **Demo:** Not publicly available

## Uses

### Direct Use

This model can be used directly for end-to-end Chinese text-to-speech synthesis:
- Convert clean Chinese text input into natural-sounding speech audio
- Generate speech for audiobooks, voice assistants, and multimedia dubbing
- Run with minimal inference code compatible with the F5-TTS pipeline

### Downstream Use

- Integration into larger voice systems, such as TTS services and real-time voice generation pipelines
- Further fine-tuning on custom datasets for specific speakers, styles, or domains
- Use as a backbone for voice cloning, style transfer, or multilingual TTS extensions

### Out-of-Scope Use

- Not intended for malicious use, deepfake voice impersonation, or deceptive voice generation
- Not optimized for extremely noisy text, heavily code-mixed text or slang, or non-Chinese languages
- Not designed for real-time, low-latency use on embedded devices without further optimization
- Not suitable for high-stakes applications (e.g., legal or medical announcements) without human verification

## Bias, Risks, and Limitations

- Speech style and prosody are constrained by the fine-tuning data distribution, so emotional expression may lack diversity
- Pronunciation accuracy depends on text normalization; rare words, proper nouns, or classical Chinese may be mispronounced
- Audio quality degrades with extremely long sentences or unformatted, messy text
- Voice characteristics may be biased toward the training dataset’s speaker and accent distribution
- No built-in content safety; the model will synthesize harmful or inappropriate text if it is provided as input
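
One common way to work around the long-sentence limitation above is to split the input at sentence-ending punctuation and synthesize each chunk separately. The following is a minimal sketch, not part of this model's pipeline: the function name `split_sentences` and the 60-character default are assumptions chosen for illustration.

```python
import re
from typing import List


def split_sentences(text: str, max_chars: int = 60) -> List[str]:
    """Split after sentence-ending punctuation, then pack pieces into chunks.

    A chunk is closed as soon as adding the next piece would exceed max_chars.
    A single piece longer than max_chars is kept whole rather than cut mid-sentence.
    """
    # Zero-width split right after Chinese or ASCII sentence-ending punctuation
    pieces = [p for p in re.split(r"(?<=[。！？；!?;])", text) if p.strip()]
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        if current and len(current) + len(piece) > max_chars:
            chunks.append(current)
            current = piece
        else:
            current += piece
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be synthesized on its own and the resulting audio segments concatenated.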

### Recommendations

Users should:
- Clean and normalize input text (punctuation, proper nouns, numbers) for best results
- Avoid using the model for deceptive, harmful, or non-consensual voice generation
- Add content moderation layers when deploying in public or commercial systems
- Conduct further fine-tuning if domain-specific pronunciation or voice style is required
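
As an illustration of the first recommendation, a light normalization pass might look like the sketch below. It is an assumption-laden example rather than the preprocessing used for this model: the function name `normalize_text`, the punctuation subset, and the digit-by-digit reading rule are all chosen here for illustration, and a production system would need proper handling of quantities, dates, and times.

```python
import re

# Illustrative subset: ASCII punctuation mapped to full-width Chinese forms
_PUNCT_MAP = str.maketrans({",": "，", "?": "？", "!": "！", ":": "：", ";": "；"})
_CN_DIGITS = "零一二三四五六七八九"


def normalize_text(text: str) -> str:
    """Lightly normalize Chinese text before passing it to a TTS model."""
    # Collapse runs of whitespace into a single space
    text = re.sub(r"\s+", " ", text.strip())
    # Unify punctuation style
    text = text.translate(_PUNCT_MAP)
    # Remove spaces around Chinese punctuation
    text = re.sub(r"\s*([，。？！：；])\s*", r"\1", text)
    # Read digits one by one (fits IDs and phone numbers, not quantities)
    return re.sub(r"[0-9]", lambda m: _CN_DIGITS[int(m.group())], text)
```

For example, `normalize_text("请拨打10086, 谢谢!")` returns `"请拨打一零零八六，谢谢！"`.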

## How to Get Started with the Model

This model follows the original F5-TTS inference framework. Example usage:
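
```python
# Load the fine-tuned checkpoint with the official F5-TTS utilities.
# Assumptions to verify against the repository: the checkpoint and vocab
# filenames below are placeholders, and the DiT config is the F5-TTS base one.
from huggingface_hub import hf_hub_download
from f5_tts.model import DiT
from f5_tts.infer.utils_infer import load_model, load_vocoder

repo_id = "Yougen/F5TTS_ft"
ckpt_path = hf_hub_download(repo_id, "model.pt")    # placeholder filename
vocab_path = hf_hub_download(repo_id, "vocab.txt")  # placeholder filename

# F5-TTS base DiT configuration (assumed unchanged by fine-tuning)
model_cfg = dict(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4)

model = load_model(
    DiT,
    model_cfg,
    ckpt_path,
    vocab_file=vocab_path,
    device="cuda",  # or "cpu"
)
vocoder = load_vocoder()

# Run TTS inference
# Refer to the official F5-TTS inference code for the full pipeline
```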

Full inference code is available in the original F5-TTS repository:
https://github.com/SWivid/F5-TTS

## Training Details

### Training Data

Fine-tuned on a **private Mandarin Chinese speech dataset** with the following properties:
- Clean, single-speaker or multi-speaker audio
- Aligned text transcripts
- Standard Mandarin pronunciation (Putonghua)
- Audio resampled to 24 kHz, with silences trimmed and volume normalized
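
The silence trimming and volume normalization listed above can be sketched in plain Python as follows. The actual preprocessing scripts are not published with this card, so the amplitude threshold, the peak target, and both function names are assumptions; resampling to 24 kHz is left to an audio library.

```python
from typing import List


def trim_silence(samples: List[float], threshold: float = 0.01) -> List[float]:
    """Drop leading and trailing samples whose magnitude is below threshold."""
    voiced = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not voiced:
        return []
    return samples[voiced[0]: voiced[-1] + 1]


def peak_normalize(samples: List[float], target_peak: float = 0.95) -> List[float]:
    """Scale the waveform so its largest absolute sample equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silent or empty input: nothing to scale
    scale = target_peak / peak
    return [s * scale for s in samples]
```

Samples are assumed to be floats in the range [-1, 1]; real pipelines usually trim on short-window energy rather than single samples, which is more robust to clicks.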

### Training Procedure

#### Preprocessing

- Text: Chinese tokenization, phoneme/prosody annotation
- Audio: Mel-spectrogram extraction, 24 kHz sampling rate
- Data filtering: removed low-quality, truncated, or misaligned samples
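
To relate clip duration to mel-spectrogram length, the sketch below computes the number of mel frames for a 24 kHz clip. The hop length of 256 samples is the F5-TTS default and is assumed here to be unchanged by fine-tuning; the function name `mel_frames` is hypothetical.

```python
SAMPLE_RATE = 24_000  # Hz, as stated in this card
HOP_LENGTH = 256      # samples per mel frame (assumption: F5-TTS default)


def mel_frames(duration_s: float,
               sample_rate: int = SAMPLE_RATE,
               hop_length: int = HOP_LENGTH) -> int:
    """Number of mel frames for a clip, counting only complete hops."""
    return int(duration_s * sample_rate) // hop_length
```

Under these assumptions one second of audio corresponds to 24000 / 256 = 93.75 frames, so a 10-second clip yields 937 frames.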

#### Training Hyperparameters

- **Training regime:** fp16 mixed precision
- **Optimizer:** AdamW
- **Learning rate:** standard for diffusion TTS (exact value not reported)
- **Batch size and steps:** adjusted for fine-tuning (exact values not reported)

#### Speeds, Sizes, Times

Training was performed on a single NVIDIA GPU with sufficient VRAM.
The checkpoint size matches the original F5-TTS architecture.

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

An internal held-out Chinese test set covering diverse sentences and scenarios.

#### Factors

- Speech naturalness
- Pronunciation accuracy
- Intelligibility
- Prosodic consistency

#### Metrics

- Subjective MOS (Mean Opinion Score)
- Objective mel-spectrogram reconstruction loss
- Intelligibility validation
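
For readers who want to reproduce this kind of evaluation, the sketch below shows one way to aggregate MOS ratings and to score intelligibility. Both choices are assumptions, not the procedure used for this model: the card does not say how intelligibility was validated, and character error rate (CER) between the input text and an ASR transcript of the synthesized audio is simply a common proxy for Chinese TTS.

```python
import math
import statistics
from typing import Sequence, Tuple


def mos_with_ci(ratings: Sequence[float], z: float = 1.96) -> Tuple[float, float]:
    """Return (mean opinion score, half-width of an approximate 95% interval)."""
    mean = statistics.mean(ratings)
    if len(ratings) < 2:
        return mean, 0.0
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance divided by reference length."""
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        curr = [i]
        for j, h in enumerate(hypothesis, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1] / max(len(reference), 1)
```

Here `reference` would be the input text and `hypothesis` the ASR transcript; a lower CER indicates more intelligible speech.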

### Results

The fine-tuned model shows improved stability and naturalness on Chinese speech compared with the baseline; quantitative scores are not reported.

#### Summary

F5TTS_ft improves Mandarin TTS quality with better prosody, clearer pronunciation, and more consistent audio generation.

## Model Examination

No interpretability analysis is provided beyond standard diffusion TTS behavior.

## Environmental Impact

- **Hardware Type:** NVIDIA GPU (CUDA-enabled)
- **Hours used:** Not precisely recorded
- **Cloud Provider:** None (local training)
- **Compute Region:** N/A
- **Carbon Emitted:** Not calculated

## Technical Specifications

### Model Architecture and Objective

- Architecture: Diffusion Transformer (DiT) based, non-autoregressive TTS
- Objective: Generate mel-spectrograms from text tokens through iterative flow-matching (diffusion-style) steps
- Vocoder: Compatible with the official F5-TTS vocoder
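
For readers unfamiliar with flow matching, the objective used by the original F5-TTS can be summarized as below. This is a sketch under assumptions: the notation ($x_0$ for Gaussian noise, $x_1$ for the target mel-spectrogram, $c$ for the text and masked-audio condition, $v_\theta$ for the DiT) is chosen here for illustration, and the fine-tuning is assumed to keep the base objective unchanged.

$$
x_t = (1 - t)\,x_0 + t\,x_1, \qquad
\mathcal{L}(\theta) = \mathbb{E}_{t,\,x_0,\,x_1}\,\bigl\| v_\theta(x_t, t, c) - (x_1 - x_0) \bigr\|^2
$$

Here $t \sim \mathcal{U}[0, 1]$ and $x_0 \sim \mathcal{N}(0, I)$. At inference, an ODE solver integrates $\mathrm{d}x_t/\mathrm{d}t = v_\theta(x_t, t, c)$ from $t = 0$ to $t = 1$ to obtain the mel-spectrogram, which the vocoder then converts to a waveform.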

### Compute Infrastructure

#### Hardware

NVIDIA GPU with CUDA support (>= 12 GB VRAM recommended for inference)

#### Software

- PyTorch
- F5-TTS official codebase
- Hugging Face Hub library

## Citation

**BibTeX:**
```bibtex
@misc{F5TTS,
  author = {SWivid},
  title = {F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/SWivid/F5-TTS}}
}

@misc{F5TTS_ft,
  author = {Yougen Yuan},
  title = {F5TTS_ft: Fine-tuned Chinese F5-TTS Model},
  year = {2026},
  publisher = {Hugging Face Hub},
  howpublished = {\url{https://huggingface.co/Yougen/F5TTS_ft}}
}
```

**APA:**

SWivid. (2024). F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching. GitHub. https://github.com/SWivid/F5-TTS

Yuan, Y. (2026). F5TTS_ft: Fine-tuned Chinese F5-TTS Model. Hugging Face Hub. https://huggingface.co/Yougen/F5TTS_ft

## Glossary

- **TTS**: Text-to-Speech
- **F5-TTS**: The original flow-matching (diffusion-style) TTS architecture from which this model is fine-tuned
- **Mel-spectrogram**: Time-frequency representation of audio on the mel scale, used as the acoustic target in TTS
- **Fine-tuned**: Model adapted from a pre-trained checkpoint on new data

## More Information

This model is a research-oriented fine-tune for Chinese speech synthesis and is not officially affiliated with the original F5-TTS authors.

## Model Card Authors

Yougen Yuan

## Model Card Contact

Yougen Yuan (via Hugging Face Hub)