CosyVoice3-TalkingFlowerZH
CosyVoice3 SFT fine-tune for a Mandarin-speaking talking flower character. Outputs 24 kHz mono WAV. Single speaker (TalkingFlower), Mandarin only.
Quick start
This bundle is self-contained — no separate CosyVoice repository clone required.
Launching the Web UI
To easily launch the interactive web interface, use the included shell script:
./launch-web-app.sh
The launcher serves the memory-optimized fp16 path by default and uses the bundled *.fp16.pt checkpoints when available.
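The "use fp16 when available" behavior can be pictured as a file lookup with a fallback. The sketch below is an illustration only, not the launcher's actual code: the helper name `pick_checkpoint` and the assumption that a full-precision file such as llm.pt sits next to llm.fp16.pt are hypothetical.

```python
from pathlib import Path


def pick_checkpoint(model_dir: str, stem: str) -> Path:
    """Return <stem>.fp16.pt if it exists in model_dir, else <stem>.pt.

    Hypothetical helper: the bundle's real lookup logic may differ.
    """
    base = Path(model_dir)
    fp16_path = base / f"{stem}.fp16.pt"
    if fp16_path.is_file():
        return fp16_path
    # Fall back to the full-precision checkpoint name (assumed layout).
    return base / f"{stem}.pt"
```

Under these assumptions, `pick_checkpoint(".", "llm")` would resolve to llm.fp16.pt whenever that file is present in the bundle root.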
Option 1: Using uv (Recommended)
This is the fastest and most reliable way to run the model, leveraging a fully locked and isolated environment.
# `uv run` automatically sets up the environment and installs dependencies from pyproject.toml
uv run inference.py "会讲话的花很奇怪吗?" output.wav
Option 2: Using standard pip
pip install -r requirements.txt
python inference.py "会讲话的花很奇怪吗?" output.wav
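For more than one sentence, the same command line can be driven from a short script. This is a minimal sketch that assumes only the `inference.py TEXT OUTPUT` invocation shown above; the helper names and the output naming scheme are illustrative and not part of the bundle.

```python
import subprocess
import sys
from pathlib import Path


def build_command(text: str, out_path: Path, use_uv: bool = False) -> list[str]:
    """Build the argv for one synthesis call, mirroring the commands above."""
    runner = ["uv", "run"] if use_uv else [sys.executable]
    return [*runner, "inference.py", text, str(out_path)]


def synthesize_all(sentences: list[str], out_dir: str = "outputs",
                   use_uv: bool = False) -> list[Path]:
    """Run inference.py once per sentence; call this from the bundle root."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for index, text in enumerate(sentences, start=1):
        out_path = target / f"line_{index:03d}.wav"  # hypothetical naming scheme
        # check=True raises CalledProcessError if a synthesis call fails.
        subprocess.run(build_command(text, out_path, use_uv), check=True)
        written.append(out_path)
    return written
```

Each call starts a fresh process and reloads the model, so this approach is convenient rather than fast. For example, `synthesize_all(["会讲话的花很奇怪吗?"])` would write outputs/line_001.wav.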
Training details
- Data: 654 train + 30 dev utterances (~11 min total) of studio-recorded Mandarin speech, all 48 kHz mono 16-bit. Volume-normalized to −23 dBFS voiced RMS with a 0.95 peak ceiling.
- LLM (Qwen2-0.5B): 100 epochs at lr 1e-5 (constant), accum_grad 4. CV loss minimum at epoch 4 (3.17); overfits sharply after ~10 epochs on this small dataset. Selected epoch 4.
- Flow (DiT CFM): 70 epochs (session died before reaching the planned 100). Epoch 47 had the lowest CV loss (0.552) but epoch 66 sounded better in a direct listen test — flow-matching CV loss is noisy and not a reliable proxy for output quality. Selected epoch 66.
- Hardware: single NVIDIA A5000 24 GB. Roughly 8 h LLM + 12 h flow.
- Inference defaults: ras_sampling(top_p=0.8, top_k=25), text_frontend=False. Serving entrypoints use fp16 checkpoints, skip initialization of the unused text-normalization frontend, drop the unused Qwen2 CausalLM head, and store Qwen2 text embeddings as per-row int8 with fp16 scales to reduce peak memory on Jetson. Real-time factor (RTF) ≈ 0.5–0.7 on A5000.
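To make "per-row int8 with fp16 scales" concrete, the sketch below quantizes each embedding row separately and dequantizes a row only when it is looked up. It is a pure-Python illustration under stated assumptions (symmetric quantization, scale = max |value| / 127, modest value ranges); the function names are hypothetical and the bundled cosyvoice/ package may implement this differently.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def quantize_rows(matrix):
    """Quantize each row to int8 with its own fp16 scale (symmetric, assumed)."""
    quantized, scales = [], []
    for row in matrix:
        max_abs = max((abs(v) for v in row), default=0.0)
        scale = to_fp16(max_abs / 127.0)
        if scale == 0.0:  # all-zero row, or a scale too small for fp16
            scale = 1.0
        # Clamp because rounding the scale to fp16 can push values past 127.
        q_row = [max(-127, min(127, round(v / scale))) for v in row]
        quantized.append(q_row)
        scales.append(scale)
    return quantized, scales


def lookup(quantized, scales, token_id):
    """Dequantize a single embedding row on demand."""
    scale = scales[token_id]
    return [q * scale for q in quantized[token_id]]
```

Storing one byte per weight plus one fp16 scale per row roughly halves the footprint of an fp16 embedding table, and per-row scales keep a single large-magnitude row from reducing the precision of the others.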
Limitations
- Tail-click artifact. 50–75% of generations contain a brief (20 ms) high-amplitude transient at the very end, an LLM-level sampling instability caused by the small training set. inference.py includes a remove_tail_click post-processing fade that suppresses this without affecting speech content.
- Mandarin only. No English or other languages.
- Single speaker. Speaker embedding is baked in (spk2info.pt); zero-shot voice cloning is not supported in this slim bundle.
- Small training corpus. ~11 min covers limited prosodic variation. Long, structurally unusual sentences may sound stilted; closing phoneme sequences outside the training distribution are the most likely failure cases.
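As a rough picture of what a tail fade does, the sketch below applies a raised-cosine fade-out to the last few milliseconds of a mono signal. It is a hypothetical stand-in, not the bundle's remove_tail_click: the function name, the window length, and the window shape are all assumptions, and the real function may locate the transient before fading.

```python
import math


def fade_out_tail(samples, sample_rate=24000, fade_ms=20.0):
    """Apply a raised-cosine fade-out to the last fade_ms of a mono signal.

    Hypothetical illustration; parameter values are not taken from inference.py.
    """
    n_fade = min(len(samples), int(sample_rate * fade_ms / 1000.0))
    if n_fade < 2:
        return list(samples)
    start = len(samples) - n_fade
    faded = list(samples)
    for i in range(n_fade):
        # Gain falls smoothly from 1.0 at the window start to 0.0 at the last sample.
        gain = 0.5 * (1.0 + math.cos(math.pi * i / (n_fade - 1)))
        faded[start + i] *= gain
    return faded
```

At the 24 kHz output rate, a 20 ms window covers 480 samples; everything before that window is returned unchanged.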
Notes
SFT-only slim bundle. Three large redundant files are omitted to save ~1.9 GB:
- campplus.onnx (27 MB): speaker encoder, only used for zero-shot prompts.
- speech_tokenizer_v3.onnx (925 MB): audio→token, only used for prompt audio / training.
- CosyVoice-BlankEN/model.safetensors (988 MB): base Qwen2 weights, overwritten by llm.pt immediately on load.
inference.py monkey-patches the load paths so these absences are silently handled. To restore zero-shot / cross-lingual / training capability, see the two comment blocks in inference.py.
No internet access needed at inference (text_frontend=False).
The memory-optimized serving path can be adjusted with environment variables:
- COSYVOICE_DISABLE_TEXT_FRONTEND=0 restores text-normalization frontend initialization.
- COSYVOICE_LLM_EMBED_INT8=0 disables Qwen2 text-embedding int8 storage.
- COSYVOICE_SHOW_INVALID_HTTP_WARNINGS=1 shows Uvicorn warnings from invalid/non-HTTP probes on the Gradio port.
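One way to apply these variables is to pass a modified environment when starting the launcher. The sketch below uses the variable names listed above; the helper names are hypothetical, and it assumes launch-web-app.sh hands its environment on to the serving process, which this card does not state explicitly.

```python
import os
import subprocess

# Variable names and values are taken from the list above.
SERVING_OVERRIDES = {
    "COSYVOICE_DISABLE_TEXT_FRONTEND": "0",       # restore text-normalization frontend init
    "COSYVOICE_LLM_EMBED_INT8": "0",              # keep text embeddings unquantized
    "COSYVOICE_SHOW_INVALID_HTTP_WARNINGS": "1",  # show Uvicorn warnings for invalid probes
}


def build_env(overrides, base=None):
    """Return a copy of the environment with the given overrides applied."""
    env = dict(os.environ if base is None else base)
    env.update(overrides)
    return env


def launch_web_app(overrides):
    """Start the bundled launcher with the overrides; run from the bundle root."""
    return subprocess.run(["./launch-web-app.sh"], env=build_env(overrides), check=True)
```

Calling `launch_web_app({"COSYVOICE_LLM_EMBED_INT8": "0"})` would change only that one setting and leave the other defaults in place.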
Acknowledgements
Built on CosyVoice by FunAudioLLM (Apache 2.0) and Matcha-TTS by Mehta et al. (MIT). The cosyvoice/ package and third_party/Matcha-TTS/ library are bundled from their respective repositories; cosyvoice/ includes serving-memory optimizations for this Hugging Face bundle.
The Web UI visual treatment, including the talking flower character art, background image, speech-bubble character area, and playful outlined card/button styling, is adapted from the original AI_TalkingFlower Hugging Face Space.
Model tree for jiaaom/CosyVoice3-TalkingFlowerZH
Base model
FunAudioLLM/CosyVoice2-0.5B