CosyVoice3-TalkingFlowerZH

CosyVoice3 SFT fine-tune for a Mandarin-speaking talking flower character. Outputs 24 kHz mono WAV. Single speaker (TalkingFlower), Mandarin only.

Quick start

This bundle is self-contained — no separate CosyVoice repository clone required.

Launching the Web UI

To launch the interactive web interface, run the included shell script:

./launch-web-app.sh

The launcher serves the memory-optimized fp16 path by default and uses the bundled *.fp16.pt checkpoints when available.
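
The "use fp16 when available" behavior can be pictured as a simple file-existence fallback. The snippet below is only a sketch of that idea, not the launcher's actual code: the function name `pick_checkpoint` and the `llm` / `flow` stems are assumptions made for illustration.

```python
from pathlib import Path


def pick_checkpoint(model_dir: str, stem: str) -> Path:
    """Prefer <stem>.fp16.pt inside model_dir; fall back to <stem>.pt.

    Hypothetical helper: the real launcher may resolve paths differently.
    """
    base = Path(model_dir)
    fp16_path = base / f"{stem}.fp16.pt"
    # Use the half-precision checkpoint only if it is actually present.
    return fp16_path if fp16_path.is_file() else base / f"{stem}.pt"
```

For example, `pick_checkpoint(".", "llm")` would return `llm.fp16.pt` when that file exists in the current directory and `llm.pt` otherwise.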

Option 1: Using uv (Recommended)

This is the fastest and most reliable way to run the model: uv runs it in a locked, isolated environment.

# `uv run` automatically sets up the environment and installs dependencies from pyproject.toml
uv run inference.py "会讲话的花很奇怪吗?" output.wav

Option 2: Using standard pip

pip install -r requirements.txt
python inference.py "会讲话的花很奇怪吗?" output.wav

Training details

  • Data: 654 train + 30 dev utterances (~11 min total) of studio-recorded Mandarin speech, all 48 kHz mono 16-bit. Volume-normalized to −23 dBFS voiced RMS with a 0.95 peak ceiling.
  • LLM (Qwen2-0.5B): 100 epochs at lr 1e-5 (constant), accum_grad 4. CV (dev-set) loss reached its minimum at epoch 4 (3.17); the model overfits sharply after ~10 epochs on this small dataset. Selected epoch 4.
  • Flow (DiT CFM): 70 epochs (session died before reaching the planned 100). Epoch 47 had the lowest CV loss (0.552) but epoch 66 sounded better in a direct listen test — flow-matching CV loss is noisy and not a reliable proxy for output quality. Selected epoch 66.
  • Hardware: single NVIDIA A5000 24 GB. Roughly 8 h LLM + 12 h flow.
  • Inference defaults: ras_sampling(top_p=0.8, top_k=25), text_frontend=False. To reduce peak memory on Jetson, the serving entrypoints use fp16 checkpoints, skip the unused text-normalization frontend initialization, drop the unused Qwen2 CausalLM head, and store Qwen2 text embeddings as per-row int8 with fp16 scales. RTF ≈ 0.5–0.7 on A5000.
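
The loudness normalization mentioned in the Data bullet (−23 dBFS voiced RMS, 0.95 peak ceiling) can be sketched as below. This is an illustration under assumptions, not the preprocessing script used for this model: the function name, the 20 ms frame size, and the simple −50 dBFS energy gate standing in for "voiced" detection are all hypothetical.

```python
import numpy as np


def normalize_voiced_rms(wav, sr, target_dbfs=-23.0, peak_ceiling=0.95,
                         frame_ms=20, gate_db=-50.0):
    """Scale wav so frames above an energy gate have the target RMS.

    Assumption: "voiced" is approximated by frames whose RMS exceeds gate_db.
    """
    wav = np.asarray(wav, dtype=np.float64)
    frame = max(1, int(sr * frame_ms / 1000))
    n = len(wav) // frame
    if n == 0:
        return wav.copy()
    frames = wav[: n * frame].reshape(n, frame)
    frame_rms = np.sqrt(np.mean(frames ** 2, axis=1))
    voiced = frame_rms > 10.0 ** (gate_db / 20.0)
    if not voiced.any():
        return wav.copy()  # nothing above the gate; leave the signal alone
    voiced_rms = np.sqrt(np.mean(frames[voiced] ** 2))
    gain = 10.0 ** (target_dbfs / 20.0) / voiced_rms
    # Cap the gain so the loudest sample never exceeds the peak ceiling.
    peak = np.max(np.abs(wav))
    if peak * gain > peak_ceiling:
        gain = peak_ceiling / peak
    return wav * gain
```

Limiting the gain rather than clipping keeps the waveform undistorted, at the cost of some clips landing slightly below the RMS target.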

Limitations

  • Tail-click artifact. 50–75 % of generations contain a brief (20 ms) high-amplitude transient at the very end — an LLM-level sampling instability caused by the small training set. inference.py includes a remove_tail_click post-processing fade that suppresses this without affecting speech content.
  • Mandarin only. No English or other languages.
  • Single speaker. Speaker embedding is baked in (spk2info.pt); zero-shot voice cloning is not supported in this slim bundle.
  • Small training corpus. ~11 min covers limited prosodic variation. Long, structurally unusual sentences may sound stilted; closing phoneme sequences outside the training distribution are the most likely failure cases.
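
For readers post-processing audio outside inference.py, a tail fade of the kind described in the first bullet might look like the sketch below. It is not the bundled remove_tail_click implementation: the name `fade_tail`, the 30 ms default, and the raised-cosine ramp shape are assumptions chosen for illustration.

```python
import numpy as np


def fade_tail(wav, sr=24000, fade_ms=30.0):
    """Apply a raised-cosine fade-out to the last fade_ms of a mono signal.

    Hypothetical stand-in for a tail-click suppressor; returns a float32 copy.
    """
    out = np.array(wav, dtype=np.float32, copy=True)
    n = min(len(out), int(sr * fade_ms / 1000))
    if n > 0:
        # Ramp goes smoothly from 1.0 down to 0.0 over the final n samples.
        ramp = 0.5 * (1.0 + np.cos(np.linspace(0.0, np.pi, n)))
        out[-n:] *= ramp.astype(np.float32)
    return out
```

A fixed-length fade like this assumes the transient sits in the trailing non-speech region; if speech runs right up to the final sample, the last few milliseconds would be attenuated as well.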

Notes

  • SFT-only slim bundle. Three large redundant files are omitted to save ~1.9 GB:

    • campplus.onnx (27 MB) — speaker encoder, only used for zero-shot prompts.
    • speech_tokenizer_v3.onnx (925 MB) — audio→token, only used for prompt audio / training.
    • CosyVoice-BlankEN/model.safetensors (988 MB) — base Qwen2 weights, overwritten by llm.pt immediately on load.

    inference.py monkey-patches the load paths so these absences are silently handled. To restore zero-shot / cross-lingual / training capability, see the two comment blocks in inference.py.

  • No internet access needed at inference (text_frontend=False).

  • The memory-optimized serving path can be adjusted with environment variables:

    • COSYVOICE_DISABLE_TEXT_FRONTEND=0 restores text-normalization frontend initialization.
    • COSYVOICE_LLM_EMBED_INT8=0 disables Qwen2 text-embedding int8 storage.
    • COSYVOICE_SHOW_INVALID_HTTP_WARNINGS=1 shows Uvicorn warnings from invalid/non-HTTP probes on the Gradio port.
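
To make the COSYVOICE_LLM_EMBED_INT8 toggle concrete, the sketch below shows one way per-row int8 storage with fp16 scales can work. It is a NumPy illustration of the general technique, not the code in cosyvoice/: the function names are hypothetical, and treating any value other than "0" as enabled is an assumption about how the flag is parsed.

```python
import os
import numpy as np


def int8_enabled(env=os.environ):
    """Assumed parsing: int8 storage is on unless the variable is set to "0"."""
    return env.get("COSYVOICE_LLM_EMBED_INT8", "1") != "0"


def quantize_rows_int8(emb):
    """Quantize each embedding row to int8 with one fp16 scale per row."""
    emb = np.asarray(emb, dtype=np.float32)
    max_abs = np.max(np.abs(emb), axis=1, keepdims=True)
    scales = (max_abs / 127.0).astype(np.float16)
    # All-zero rows (or scales that underflow in fp16) get a dummy scale of 1.
    scales[scales == 0] = np.float16(1.0)
    q = np.rint(emb / scales.astype(np.float32))
    return np.clip(q, -127, 127).astype(np.int8), scales


def lookup(q, scales, token_ids):
    """Dequantize only the requested rows, returning fp16 embeddings."""
    return q[token_ids].astype(np.float16) * scales[token_ids]
```

Stored this way, each row costs one byte per dimension plus a single two-byte scale, roughly half the footprint of an fp16 table, and only the rows actually looked up are expanded back to fp16.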

Acknowledgements

Built on CosyVoice by FunAudioLLM (Apache 2.0) and Matcha-TTS by Mehta et al. (MIT). The cosyvoice/ package and third_party/Matcha-TTS/ library are bundled from their respective repositories; cosyvoice/ includes serving-memory optimizations for this Hugging Face bundle.

The Web UI visual treatment, including the talking flower character art, background image, speech-bubble character area, and playful outlined card/button styling, is adapted from the original AI_TalkingFlower Hugging Face Space.
