	---
	license: apache-2.0
	language:
	- zh
	---
	# Model Card for F5TTS_ft

	F5TTS_ft is a fine-tuned Chinese text-to-speech (TTS) model based on the original F5-TTS architecture, optimized for improved naturalness, prosody, and stability in Mandarin Chinese speech synthesis.

	## Model Details

	### Model Description

	- Developed by: Yougen Yuan
	- Funded by: Personal research project
	- Shared by: Yougen Yuan
	- Model type: Text-to-speech (TTS); non-autoregressive, flow-matching-based (diffusion-style) model with a Diffusion Transformer (DiT) backbone
	- Language(s) (NLP): Chinese (Mandarin, zh-CN)
	- License: Apache-2.0
	- Finetuned from model: Original F5-TTS base model

	### Model Sources

	- Repository: https://huggingface.co/Yougen/F5TTS_ft
	- Paper: F5-TTS original research (code repository: https://github.com/SWivid/F5-TTS)
	- Demo: Not publicly available

	## Uses

	### Direct Use

	This model can be used directly for end-to-end Chinese text-to-speech synthesis. It can:
	- Convert clean Chinese text input into natural-sounding speech audio
	- Support voice generation, audiobook creation, voice assistants, and multimedia dubbing
	- Run with minimal inference code compatible with the F5-TTS pipeline

	### Downstream Use

	The model can also be:
	- Integrated into larger voice systems, such as TTS services and real-time voice generation pipelines
	- Further fine-tuned on custom datasets for specific speakers, styles, or domains
	- Used as a backbone for voice cloning, style transfer, or multilingual TTS extensions

	### Out-of-Scope Use

	- Not intended for malicious use, deepfake voice impersonation, or deceptive voice generation
	- Not optimized for extremely noisy text, heavily code-mixed text or slang, or non-Chinese languages
	- Not designed for real-time, low-latency use on embedded devices without further optimization
	- Not suitable for high-stakes applications (legal, medical announcements) without human verification

	## Bias, Risks, and Limitations

	- Speech style and prosody are constrained by the fine-tuning data distribution; may lack diversity in emotional expression
	- Pronunciation accuracy depends on text normalization; rare words, proper nouns, or Classical Chinese text may be mispronounced
	- Audio quality degrades with extremely long sentences or unformatted messy text
	- Potential bias in voice characteristics reflects the training dataset’s speaker and accent distribution
	- No built-in content safety; may synthesize harmful or inappropriate text if provided as input
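Because quality can degrade on very long inputs, one common mitigation is to split the text at sentence-ending punctuation and synthesize it chunk by chunk. The sketch below is only an illustration of that idea: `split_sentences` and the `max_chars` limit are hypothetical and are not part of this model or of the F5-TTS codebase.

```python
import re

# Zero-width split point right after sentence-ending punctuation
# (full-width and ASCII), so the punctuation stays with its sentence.
_SENT_END = re.compile(r"(?<=[。！？!?；;])")


def split_sentences(text: str, max_chars: int = 60) -> list[str]:
    """Split text at sentence ends, then merge neighbours up to max_chars."""
    pieces = [p.strip() for p in _SENT_END.split(text) if p.strip()]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and len(current) + len(piece) > max_chars:
            chunks.append(current)
            current = piece
        else:
            current += piece
    if current:
        chunks.append(current)
    return chunks


chunks = split_sentences("今天天气很好。我们去公园吧！你觉得呢？", max_chars=10)
```

A single sentence longer than `max_chars` is kept whole in this sketch; a real pipeline would likely need a fallback split on commas.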

	### Recommendations

	Users should:
	- Clean and normalize input text (punctuation, proper nouns, numbers) for best results
	- Avoid using the model for deceptive, harmful, or non-consensual voice generation
	- Add content moderation layers when deploying in public or commercial systems
	- Conduct further fine-tuning if domain-specific pronunciation or voice style is required
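The first recommendation can be sketched as a small normalization pass. This is a minimal example under stated assumptions, not the preprocessing used for this model: `normalize_text` and `int_to_chinese` are hypothetical helpers, only integers from 0 to 9999 are converted, and numbers are always read as cardinals (years, phone numbers, and decimals would need their own rules).

```python
import re

DIGITS = "零一二三四五六七八九"
UNITS = ["", "十", "百", "千"]
# Map a few ASCII punctuation marks to their full-width Chinese forms.
PUNCT = str.maketrans({",": "，", "?": "？", "!": "！", ";": "；"})


def int_to_chinese(n: int) -> str:
    """Read an integer in 0..9999 as a Chinese cardinal number."""
    if not 0 <= n <= 9999:
        raise ValueError("this sketch only supports 0..9999")
    if n == 0:
        return DIGITS[0]
    out = ""
    pending_zero = False
    digits = str(n)
    for i, ch in enumerate(digits):
        d = int(ch)
        unit = UNITS[len(digits) - 1 - i]
        if d == 0:
            pending_zero = True  # emit a single 零 only if more digits follow
        else:
            if pending_zero and out:
                out += DIGITS[0]
            out += DIGITS[d] + unit
            pending_zero = False
    # 10..19 are conventionally read 十, 十一, ... rather than 一十, 一十一, ...
    if out.startswith("一十"):
        out = out[1:]
    return out


def normalize_text(text: str) -> str:
    """Collapse whitespace, unify punctuation, and spell out short integers."""
    text = re.sub(r"\s+", " ", text.strip())
    text = text.translate(PUNCT)

    def spell(match: re.Match) -> str:
        token = match.group()
        # Longer digit runs (IDs, phone numbers) are left untouched.
        return int_to_chinese(int(token)) if len(token) <= 4 else token

    return re.sub(r"\d+", spell, text)


print(normalize_text("我有15个苹果,你呢?"))  # 我有十五个苹果，你呢？
```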

	## How to Get Started with the Model

	This model follows the original F5-TTS inference framework. Example usage:

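	```python
	# Minimal loading sketch. Argument names follow f5_tts.infer.utils_infer and
	# may differ between F5-TTS releases; check the official repository.
	from huggingface_hub import hf_hub_download
	from f5_tts.model import DiT
	from f5_tts.infer.utils_infer import load_model, load_vocoder

	# Placeholder: replace with the checkpoint file name actually stored in this repo.
	CKPT_FILE = "<checkpoint-file>"
	ckpt_path = hf_hub_download(repo_id="Yougen/F5TTS_ft", filename=CKPT_FILE)

	# Assumption: the fine-tune keeps the F5-TTS Base DiT configuration.
	model_cfg = dict(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4)

	model = load_model(DiT, model_cfg, ckpt_path, device="cuda")  # or "cpu"
	vocoder = load_vocoder()

	# Run TTS inference
	# Refer to official F5-TTS inference code for full pipeline
	```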

	Full inference code is available in the original F5-TTS repository:
	https://github.com/SWivid/F5-TTS

	## Training Details

	### Training Data

	Fine-tuned on a private Chinese Mandarin speech dataset with:
	- Clean, single-speaker or multi-speaker audio
	- Aligned text transcripts
	- Standard Mandarin pronunciation (Putonghua)
	- Preprocessed to 24 kHz audio with silences trimmed and volume normalized
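The silence trimming and volume normalization mentioned above can be illustrated with a dependency-free sketch. The exact method used for this dataset is not documented, so the amplitude threshold, the choice of peak normalization (rather than loudness normalization), and the function names are all assumptions; in practice this would usually be done with NumPy, librosa, or torchaudio.

```python
def trim_silence(samples: list[float], threshold: float = 0.01) -> list[float]:
    """Drop leading and trailing samples whose amplitude is below threshold."""
    start, end = 0, len(samples)
    while start < end and abs(samples[start]) < threshold:
        start += 1
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end]


def peak_normalize(samples: list[float], peak: float = 0.95) -> list[float]:
    """Scale samples so the largest absolute amplitude equals peak."""
    max_amp = max((abs(s) for s in samples), default=0.0)
    if max_amp == 0.0:
        return list(samples)  # all-silent input: nothing to scale
    gain = peak / max_amp
    return [s * gain for s in samples]


audio = [0.0, 0.001, 0.2, -0.5, 0.1, 0.0]
cleaned = peak_normalize(trim_silence(audio))
```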

	### Training Procedure

	#### Preprocessing

	- Text: Chinese tokenization, phoneme / prosody annotation
	- Audio: Mel-spectrogram extraction, 24 kHz sampling rate
	- Data filtering: removed low-quality, truncated, or misaligned samples
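The filtering criteria are not specified in this card. As one plausible heuristic, truncated or misaligned pairs can be flagged by checking clip duration and the ratio of transcript characters to audio seconds. Every name and threshold below (`Sample`, `keep_sample`, the duration and characters-per-second bounds) is hypothetical and would need tuning on real data.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    text: str          # transcript
    duration_s: float  # audio length in seconds


def keep_sample(
    sample: Sample,
    min_s: float = 1.0,
    max_s: float = 30.0,
    min_cps: float = 1.0,
    max_cps: float = 8.0,
) -> bool:
    """Keep a sample only if its duration and speaking rate look plausible."""
    if not sample.text or not (min_s <= sample.duration_s <= max_s):
        return False
    chars_per_second = len(sample.text) / sample.duration_s
    # Very low or very high rates suggest a text/audio mismatch or truncation.
    return min_cps <= chars_per_second <= max_cps


dataset = [
    Sample("今天天气很好", 2.0),   # plausible rate
    Sample("好", 20.0),            # far too little text for the audio
    Sample("今天天气很好", 0.3),   # clip too short
]
filtered = [s for s in dataset if keep_sample(s)]
```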

	#### Training Hyperparameters

	- Training regime: fp16 mixed precision
	- Optimizer: AdamW
	- Learning rate: standard for diffusion TTS
	- Batch size and training steps: adjusted for fine-tuning

	#### Speeds, Sizes, Times

	Training was performed on a single NVIDIA GPU with sufficient VRAM.
	The checkpoint size matches that of the original F5-TTS model.

	## Evaluation

	### Testing Data, Factors & Metrics

	#### Testing Data

	Internal held-out Chinese test set with diverse sentences and scenarios.

	#### Factors

	- Speech naturalness
	- Pronunciation accuracy
	- Intelligibility
	- Prosodic consistency
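Pronunciation accuracy and intelligibility are often quantified by transcribing the synthesized audio with an ASR system and computing the character error rate (CER) against the input text. This card does not state that CER was used here, so the following is only a sketch of the metric itself; `edit_distance` and `cer` are illustrative names.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One substituted character out of six.
score = cer("今天天气很好", "今天天汽很好")
```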

	#### Metrics

	- Subjective MOS (Mean Opinion Score)
	- Objective mel-spectrogram reconstruction loss
	- Intelligibility validation
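A subjective MOS is usually reported as the mean of listener ratings together with a confidence interval. The sketch below shows that computation with a normal-approximation 95% interval; the ratings are made-up illustration values, not results for this model, and `mos_with_ci` is a hypothetical helper.

```python
from math import sqrt
from statistics import mean, stdev


def mos_with_ci(ratings: list[float], z: float = 1.96) -> tuple[float, float]:
    """Return (MOS, half-width of the approximate 95% confidence interval)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mos = mean(ratings)
    half_width = z * stdev(ratings) / sqrt(len(ratings))
    return mos, half_width


# Illustrative ratings on a 1-5 scale (not real evaluation data).
mos, half_width = mos_with_ci([4, 5, 4, 3, 4])
print(f"MOS = {mos:.2f} ± {half_width:.2f}")
```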

	### Results

	The fine-tuned version shows improved stability and naturalness over the baseline on Chinese speech. No quantitative scores are reported.

	#### Summary

	F5TTS_ft improves Mandarin TTS quality with better prosody, clearer pronunciation, and more consistent audio generation.

	## Model Examination

	No additional interpretability analysis is provided beyond standard diffusion TTS behavior.

	## Environmental Impact

	- Hardware Type: NVIDIA GPU (CUDA-enabled)
	- Hours used: Not precisely recorded
	- Cloud Provider: None (local training)
	- Compute Region: N/A
	- Carbon Emitted: Not calculated

	## Technical Specifications

	### Model Architecture and Objective

	- Architecture: Diffusion Transformer (DiT) based, non-autoregressive sequence-to-sequence TTS
	- Objective: Generate mel-spectrograms conditioned on text tokens through iterative flow-matching (diffusion-style) sampling steps
	- Vocoder: Compatible with the official F5-TTS vocoder
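To make the iterative sampling idea concrete, here is a toy sketch of fixed-step Euler integration from a noise vector toward a target. It is not F5-TTS code: `oracle_velocity` stands in for the trained network (which would predict the velocity from text and audio context), the vectors are tiny lists instead of mel-spectrograms, and the step count is an arbitrary choice.

```python
import random


def oracle_velocity(x: list[float], t: float, target: list[float]) -> list[float]:
    """Velocity along a straight path from x toward target at time t.

    Stand-in for the learned model; a real sampler never sees the target.
    """
    return [(tgt - xi) / (1.0 - t) for xi, tgt in zip(x, target)]


def euler_sample(x0: list[float], target: list[float], num_steps: int = 32) -> list[float]:
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = oracle_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(4)]
target = [0.5, -1.0, 2.0, 0.0]
result = euler_sample(noise, target)
```

With this oracle the path is a straight line, so Euler steps land on the target up to floating-point error; a learned velocity field is only approximate, which is why the number of steps trades speed against quality.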

	### Compute Infrastructure

	#### Hardware

	NVIDIA GPU with CUDA support (at least 12 GB of VRAM recommended for inference)

	#### Software

	- PyTorch
	- F5-TTS official codebase
	- Hugging Face Hub library

	## Citation

	BibTeX:
	```bibtex
	@misc{F5TTS,
	author = {SWivid},
	title = {F5-TTS: A Non-Autoregressive Diffusion TTS Model},
	year = {2024},
	publisher = {GitHub},
	journal = {GitHub repository},
	howpublished = {\url{https://github.com/SWivid/F5-TTS}}
	}

	@misc{F5TTS_ft,
	author = {Yougen Yuan},
	title = {F5TTS_ft: Fine-tuned Chinese F5-TTS Model},
	year = {2026},
	publisher = {Hugging Face Hub},
	howpublished = {\url{https://huggingface.co/Yougen/F5TTS_ft}}
	}
	```

	APA:

	SWivid. (2024). F5-TTS: A Non-Autoregressive Diffusion TTS Model. GitHub. https://github.com/SWivid/F5-TTS

	Yuan, Y. (2026). F5TTS_ft: Fine-tuned Chinese F5-TTS Model. Hugging Face Hub. https://huggingface.co/Yougen/F5TTS_ft

	## Glossary

	- TTS: Text-to-Speech
	- F5-TTS: The original flow-matching-based (diffusion-style) TTS model that this checkpoint is fine-tuned from
	- Mel-spectrogram: Time-frequency representation of audio on the mel scale, used as the acoustic target in TTS
	- Fine-tuned: Model adapted from a pre-trained checkpoint on new data

	## More Information

	This model is a research-oriented fine-tune for Chinese speech synthesis and is not officially affiliated with the original F5-TTS authors.

	## Model Card Authors

	Yougen Yuan

	## Model Card Contact

	Yougen Yuan (via Hugging Face Hub)