CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training
Paper: arXiv:2505.17589
CosyVoice 3 text-to-speech model converted to MLX safetensors format with 4-bit quantization for Apple Silicon inference.
Converted from FunAudioLLM/Fun-CosyVoice3-0.5B-2512.
Swift inference: ivan-digital/qwen3-asr-swift
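As a rough illustration of what 4-bit quantization means for the LLM weights, the sketch below quantizes one group of values with a per-group scale and bias and then reconstructs them. It is a minimal sketch of a group-wise affine scheme of the kind MLX's quantization uses, not the converter's actual code. The names `QuantizedGroup`, `quantize4Bit`, and `dequantize` are hypothetical, and the group size used for this checkpoint is not stated here.

```swift
// Hypothetical container for one quantized group: 4-bit codes plus affine parameters.
struct QuantizedGroup {
    let codes: [UInt8]   // each code lies in 0...15
    let scale: Float
    let bias: Float
}

// Map each value to the nearest of 16 evenly spaced levels between the group's min and max.
func quantize4Bit(_ group: [Float]) -> QuantizedGroup {
    precondition(!group.isEmpty, "group must not be empty")
    let lo = group.min()!
    let hi = group.max()!
    let levels: Float = 15                       // 2^4 - 1 steps
    let scale = hi > lo ? (hi - lo) / levels : 1 // avoid dividing by zero for constant groups
    var codes: [UInt8] = []
    for x in group {
        let code = min(max(((x - lo) / scale).rounded(), 0), levels)
        codes.append(UInt8(code))
    }
    return QuantizedGroup(codes: codes, scale: scale, bias: lo)
}

// Reconstruct approximate values: w ≈ scale * code + bias.
func dequantize(_ q: QuantizedGroup) -> [Float] {
    var values: [Float] = []
    for code in q.codes {
        values.append(Float(code) * q.scale + q.bias)
    }
    return values
}

let group: [Float] = [-0.75, 0.0, 0.3, 0.75]
let quantized = quantize4Bit(group)
let restored = dequantize(quantized)
print("codes:", quantized.codes)
print("restored:", restored)
```

The reconstruction error per value is bounded by about half the scale, which is why smaller groups (tighter min/max ranges) trade extra scale and bias storage for accuracy.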
| Component | Architecture | Size |
|---|---|---|
| LLM | Qwen2.5-0.5B (24L, 896d, 14Q/2KV heads) | 467 MB (4-bit) |
| DiT Flow Matching | 22-layer DiT (1024d, 16 heads, 10 ODE steps) | 634 MB (fp16) |
| HiFi-GAN Vocoder | NSF + F0 predictor + ISTFT | 79 MB (fp16) |
| Total | | ~1.2 GB |
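The figures in the table can be sanity-checked with a little arithmetic. The sketch below derives the per-head dimension and the grouped-query ratio from the LLM row, estimates the KV cache footprint per token, and sums the component sizes. `LLMShape` is a hypothetical type for illustration; the even split of the hidden size across query heads and the fp16 KV cache are assumptions, not details taken from the converted model.

```swift
// Numbers copied from the LLM row of the table; the struct itself is illustrative.
struct LLMShape {
    let layers = 24
    let hiddenSize = 896
    let queryHeads = 14
    let kvHeads = 2

    // Per-head dimension, assuming the hidden size is split evenly across query heads.
    var headDim: Int { hiddenSize / queryHeads }

    // Grouped-query attention: number of query heads sharing one K/V head.
    var queriesPerKVHead: Int { queryHeads / kvHeads }

    // Bytes of K and V cached per token across all layers.
    func kvCacheBytesPerToken(bytesPerElement: Int) -> Int {
        2 * layers * kvHeads * headDim * bytesPerElement
    }
}

let shape = LLMShape()
print("head dim:", shape.headDim)                          // 64
print("query heads per KV head:", shape.queriesPerKVHead)  // 7
// Assumes an fp16 cache (2 bytes per element).
print("KV cache bytes per token:", shape.kvCacheBytesPerToken(bytesPerElement: 2)) // 12288

// Component sizes in MB from the Size column.
let componentSizesMB = [467, 634, 79]
let totalMB = componentSizesMB.reduce(0, +)
print("total MB:", totalMB)                                // 1180, i.e. about 1.2 GB
```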
Text → LLM (Qwen2.5-0.5B) → Speech Tokens (FSQ 6561) → DiT Flow Matching → Mel (80-band) → HiFi-GAN → Audio (24kHz)
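The speech-token vocabulary size in the pipeline, 6561, equals 3^8, which is what a finite scalar quantizer with eight dimensions of three levels each would produce. The sketch below shows how such a token id maps to and from eight base-3 digits. Treat the 8 x 3 factorization and the digit order as assumptions for illustration; the model card only states the codebook size, and `fsqDigits` and `fsqToken` are hypothetical helpers.

```swift
// Split a token id into base-`levels` digits, least significant digit first.
func fsqDigits(of token: Int, levels: Int = 3, dimensions: Int = 8) -> [Int] {
    var remainder = token
    var digits: [Int] = []
    for _ in 0..<dimensions {
        digits.append(remainder % levels)
        remainder /= levels
    }
    return digits
}

// Recombine digits (least significant first) into a single token id.
func fsqToken(from digits: [Int], levels: Int = 3) -> Int {
    var token = 0
    for digit in digits.reversed() {
        token = token * levels + digit
    }
    return token
}

// Codebook size for 8 dimensions with 3 levels each.
var codebookSize = 1
for _ in 0..<8 { codebookSize *= 3 }
print(codebookSize)                 // 6561

let digits = fsqDigits(of: 5)
print(digits)                       // [2, 1, 0, 0, 0, 0, 0, 0]
print(fsqToken(from: digits))       // 5
```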
Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian
- llm.safetensors: LLM weights (4-bit quantized)
- flow.safetensors: DiT flow matching decoder (fp16)
- hifigan.safetensors: HiFi-GAN vocoder (fp16, weight-norm folded)
- config.json: Model configuration

Conversion details:
- Weight normalization folded into the stored weights (w = g * v / ||v||)
- Conv1d weights transposed from PyTorch [out, in, kernel] to MLX [out, kernel, in]

For use with ivan-digital/qwen3-asr-swift:
import CosyVoiceTTS
let model = try await CosyVoiceTTSModel.fromPretrained()
let audio = model.synthesize(text: "Hello, how are you?", language: "english")
swift run cosyvoice-tts-cli --text "Hello, how are you?" --lang english --output hello.wav
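The two conversion steps listed with the files above, folding weight normalization (w = g * v / ||v||) and moving Conv1d weights from [out, in, kernel] to [out, kernel, in], can be sketched on plain nested arrays. This is an illustrative sketch, not the actual conversion script: `Tensor3`, `foldWeightNorm`, and `transposeToMLX` are hypothetical names, and the norm is assumed to be taken per output channel, as in PyTorch's default weight normalization.

```swift
typealias Tensor3 = [[[Float]]]

// Fold weight normalization: w = g * v / ||v||, with one gain and one norm per output channel.
// Input and output are indexed [out][in][kernel].
func foldWeightNorm(v: Tensor3, g: [Float]) -> Tensor3 {
    precondition(v.count == g.count, "expected one gain per output channel")
    var folded = v
    for o in v.indices {
        var sumOfSquares: Float = 0
        for row in v[o] {
            for x in row { sumOfSquares += x * x }
        }
        let scale = g[o] / sumOfSquares.squareRoot()
        for i in v[o].indices {
            for k in v[o][i].indices {
                folded[o][i][k] = v[o][i][k] * scale
            }
        }
    }
    return folded
}

// Reorder a Conv1d weight from [out][in][kernel] to [out][kernel][in].
func transposeToMLX(_ w: Tensor3) -> Tensor3 {
    var result: Tensor3 = []
    for channel in w {
        let kernelSize = channel.first?.count ?? 0
        var kernelMajor: [[Float]] = []
        for k in 0..<kernelSize {
            kernelMajor.append(channel.map { $0[k] })
        }
        result.append(kernelMajor)
    }
    return result
}

// One output channel, two input channels, kernel size three; ||v|| = 5.
let v: Tensor3 = [[[3, 0, 0], [0, 4, 0]]]
let g: [Float] = [10]
let folded = foldWeightNorm(v: v, g: g)
let mlxLayout = transposeToMLX(folded)
print(folded)      // [[[6.0, 0.0, 0.0], [0.0, 8.0, 0.0]]]
print(mlxLayout)   // [[[6.0, 0.0], [0.0, 8.0], [0.0, 0.0]]]
```

Folding once at conversion time means inference only needs the plain weight tensor, with no separate g and v parameters to load.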
Apache 2.0 (same as upstream CosyVoice 3)
@article{du2025cosyvoice3,
title={CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training},
author={Du, Zhihao and others},
journal={arXiv preprint arXiv:2505.17589},
year={2025}
}