---
language:
- zh
- en
- ja
- ko
- de
- es
- fr
- it
- ru
license: apache-2.0
tags:
- tts
- text-to-speech
- speech-synthesis
- mlx
- apple-silicon
- cosyvoice
base_model: FunAudioLLM/Fun-CosyVoice3-0.5B-2512
pipeline_tag: text-to-speech
---

# CosyVoice3-0.5B MLX 4-bit

[CosyVoice 3](https://arxiv.org/abs/2505.17589) text-to-speech model converted to MLX safetensors format with 4-bit quantization for Apple Silicon inference.

Converted from [FunAudioLLM/Fun-CosyVoice3-0.5B-2512](https://huggingface.co/FunAudioLLM/Fun-CosyVoice3-0.5B-2512).

**Swift inference**: [ivan-digital/qwen3-asr-swift](https://github.com/ivan-digital/qwen3-asr-swift)

## Model Details

| Component | Architecture | Size |
|-----------|-------------|------|
| LLM | Qwen2.5-0.5B (24L, 896d, 14Q/2KV heads) | 467 MB (4-bit) |
| DiT Flow Matching | 22-layer DiT (1024d, 16 heads, 10 ODE steps) | 634 MB (fp16) |
| HiFi-GAN Vocoder | NSF + F0 predictor + ISTFT | 79 MB (fp16) |
| **Total** | | **~1.2 GB** |

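The shorthand in the LLM row (24L, 896d, 14Q/2KV heads) can be unpacked with a little arithmetic. The sketch below derives the per-head dimension, the grouped-query sharing ratio, and the sum of the listed component sizes. The KV-cache figure assumes keys and values are cached in fp16, which this card does not state; all variable names are illustrative.

```swift
// Figures copied from the Model Details table.
let hiddenSize = 896
let layers = 24
let queryHeads = 14
let kvHeads = 2

// Grouped-query attention: several query heads share one key/value head.
let headDim = hiddenSize / queryHeads          // 896 / 14 = 64
let queriesPerKVHead = queryHeads / kvHeads    // 14 / 2 = 7

// Assumption (not stated in this card): the KV cache holds fp16 values.
let bytesPerValue = 2
// Keys and values (the leading 2) for every layer and every KV head.
let kvCacheBytesPerToken = 2 * layers * kvHeads * headDim * bytesPerValue

// Component sizes in MB, as listed in the table.
let componentSizesMB = [467, 634, 79]
let totalMB = componentSizesMB.reduce(0, +)

print("head dim: \(headDim), query heads per KV head: \(queriesPerKVHead)")
print("KV cache per token: \(kvCacheBytesPerToken) bytes")
print("total: \(totalMB) MB")
```

The components sum to 1180 MB, which matches the rounded ~1.2 GB total.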

## Pipeline

```
Text → LLM (Qwen2.5-0.5B) → Speech Tokens (FSQ 6561) → DiT Flow Matching → Mel (80-band) → HiFi-GAN → Audio (24 kHz)
```

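The speech-token vocabulary size of 6561 equals 3^8, which is what a finite scalar quantizer (FSQ) with 8 dimensions of 3 levels each would produce. That layout is an assumption here, not something this card states, and the digit ordering below is arbitrary. As a rough sketch, packing per-dimension levels into a single token ID is plain base-3 positional encoding:

```swift
// Assumption: 8 quantized dimensions with 3 levels each (3^8 = 6561).
let levels = 3
let dimensions = 8

/// Packs per-dimension level indices (each in 0..<levels) into one token ID,
/// treating the first dimension as the least significant base-3 digit.
func tokenID(fromLevels digits: [Int]) -> Int {
    precondition(digits.count == dimensions, "expected one digit per dimension")
    var id = 0
    var placeValue = 1
    for digit in digits {
        precondition((0..<levels).contains(digit), "digit out of range")
        id += digit * placeValue
        placeValue *= levels
    }
    return id
}

/// Inverse of tokenID(fromLevels:): recovers the per-dimension level indices.
func levelIndices(fromTokenID id: Int) -> [Int] {
    var remaining = id
    var digits: [Int] = []
    for _ in 0..<dimensions {
        digits.append(remaining % levels)
        remaining /= levels
    }
    return digits
}

// Number of distinct token IDs: levels multiplied by itself `dimensions` times.
let vocabularySize = (0..<dimensions).reduce(1) { size, _ in size * levels }

let example = [2, 0, 1, 0, 0, 0, 0, 0]
let id = tokenID(fromLevels: example)   // 2 + 1 * 9 = 11
print("vocabulary size: \(vocabularySize), example ID: \(id)")
```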

## Languages

Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian

## Files

- `llm.safetensors`: LLM weights (4-bit quantized)
- `flow.safetensors`: DiT flow matching decoder (fp16)
- `hifigan.safetensors`: HiFi-GAN vocoder (fp16, weight-norm folded)
- `config.json`: Model configuration

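To make "4-bit quantized" more concrete, here is a minimal sketch of affine group quantization, using the group size of 64 given under Conversion Details. It follows the general min/max scheme (one scale and one bias per group) and is not the actual MLX kernel. The type and function names are hypothetical, and the storage estimate assumes the scale and bias are each kept in fp16.

```swift
/// One quantized group: a 4-bit code per weight plus a shared scale and bias.
struct QuantizedGroup {
    let codes: [UInt8]   // values in 0...15, stored unpacked for clarity
    let scale: Float
    let bias: Float
}

/// Maps each weight to the nearest of 2^bits evenly spaced values between
/// the group's minimum and maximum.
func quantizeGroup(_ weights: [Float], bits: Int = 4) -> QuantizedGroup {
    let maxCode = Float((1 << bits) - 1)   // 15 for 4-bit
    let lo = weights.min() ?? 0
    let hi = weights.max() ?? 0
    // Guard against a constant group, where hi == lo would give a zero scale.
    let scale: Float = hi > lo ? (hi - lo) / maxCode : 1
    let codes = weights.map { w -> UInt8 in
        let q = ((w - lo) / scale).rounded()
        return UInt8(min(max(q, 0), maxCode))
    }
    return QuantizedGroup(codes: codes, scale: scale, bias: lo)
}

/// Reconstructs approximate weights from the codes.
func dequantize(_ group: QuantizedGroup) -> [Float] {
    return group.codes.map { Float($0) * group.scale + group.bias }
}

let groupSize = 64
// Synthetic weights evenly spread over [-0.5, 0.5), for illustration only.
let weights: [Float] = (0..<groupSize).map { Float($0) / Float(groupSize) - 0.5 }
let group = quantizeGroup(weights)
let restored = dequantize(group)
let maxError = zip(weights, restored).map { abs($0 - $1) }.max() ?? 0
print("max error: \(maxError), half step: \(group.scale / 2)")

// Effective storage, assuming an fp16 scale and fp16 bias per group.
let bitsPerWeight = 4.0 + Double(2 * 16) / Double(groupSize)
print("bits per weight: \(bitsPerWeight)")
```

The rounding error stays within half a quantization step, and under the fp16 assumption the per-group metadata raises the cost from 4 to 4.5 bits per weight.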

## Conversion Details

- LLM: 4-bit quantization (group_size=64) of attention projections, MLP, and speech head
- Flow: fp16 (flow matching is sensitive to quantization)
- HiFi-GAN: fp16 with weight normalization folded (`w = g * v / ||v||`)
- Conv1d weights transposed from PyTorch `[out, in, kernel]` to MLX `[out, kernel, in]`

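The last two steps above can be sketched on plain nested arrays. This is an illustration of the math only, not the conversion script used for this model. It assumes the norm is taken per output channel (the PyTorch default for weight normalization), and both function names are hypothetical.

```swift
/// Folds weight normalization into a plain weight: w = g * v / ||v||.
/// `v` has shape [out][in][kernel]; `g` holds one gain per output channel.
/// Assumption: the norm is computed per output channel over (in, kernel).
func foldWeightNorm(g: [Float], v: [[[Float]]]) -> [[[Float]]] {
    precondition(g.count == v.count, "one gain per output channel")
    var folded: [[[Float]]] = []
    for (gain, channel) in zip(g, v) {
        var sumOfSquares: Float = 0
        for row in channel {
            for x in row { sumOfSquares += x * x }
        }
        let factor = gain / sumOfSquares.squareRoot()
        folded.append(channel.map { row in row.map { $0 * factor } })
    }
    return folded
}

/// Reorders a Conv1d weight from [out][in][kernel] to [out][kernel][in].
func transposeToMLXLayout(_ w: [[[Float]]]) -> [[[Float]]] {
    return w.map { channel -> [[Float]] in
        let inCount = channel.count
        let kernelCount = channel.first?.count ?? 0
        return (0..<kernelCount).map { k in
            (0..<inCount).map { i in channel[i][k] }
        }
    }
}

// One output channel, two input channels, kernel size three.
let v: [[[Float]]] = [[[3, 0, 0], [0, 4, 0]]]
let g: [Float] = [10]

let w = foldWeightNorm(g: g, v: v)          // ||v|| = 5, so every entry is doubled
let mlxWeight = transposeToMLXLayout(w)     // shape [1][3][2]
print(w)
print(mlxWeight)
```

Folding removes the separate `g` and `v` tensors, so the vocoder needs only one weight per convolution at inference time.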
## Usage

For use with [ivan-digital/qwen3-asr-swift](https://github.com/ivan-digital/qwen3-asr-swift):

```swift
import CosyVoiceTTS

// Load the converted model.
let model = try await CosyVoiceTTSModel.fromPretrained()
// Synthesize speech for the given text and language.
let audio = model.synthesize(text: "Hello, how are you?", language: "english")
```

### CLI

```bash
swift run cosyvoice-tts-cli --text "Hello, how are you?" --lang english --output hello.wav
```
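The CLI writes a WAV file, and the pipeline produces 24 kHz audio. For readers who want to save audio themselves, the sketch below encodes mono floating-point samples as 16-bit PCM WAV data using only Foundation. It assumes samples are `Float` values in [-1, 1]; this card does not document the return type of `synthesize`, so `wavData` is a hypothetical standalone helper and not part of the package.

```swift
import Foundation

/// Encodes mono Float samples in [-1, 1] as an in-memory 16-bit PCM WAV file.
func wavData(samples: [Float], sampleRate: Int = 24_000) -> Data {
    let bytesPerSample = 2
    let dataSize = samples.count * bytesPerSample
    var data = Data()

    // Appends an integer in little-endian byte order, as RIFF requires.
    func append<T: FixedWidthInteger>(_ value: T) {
        withUnsafeBytes(of: value.littleEndian) { data.append(contentsOf: $0) }
    }

    data.append(contentsOf: Array("RIFF".utf8))
    append(UInt32(36 + dataSize))                  // size of everything after this field
    data.append(contentsOf: Array("WAVE".utf8))
    data.append(contentsOf: Array("fmt ".utf8))
    append(UInt32(16))                             // fmt chunk size
    append(UInt16(1))                              // audio format: PCM
    append(UInt16(1))                              // channels: mono
    append(UInt32(sampleRate))
    append(UInt32(sampleRate * bytesPerSample))    // byte rate
    append(UInt16(bytesPerSample))                 // block align
    append(UInt16(16))                             // bits per sample
    data.append(contentsOf: Array("data".utf8))
    append(UInt32(dataSize))

    for sample in samples {
        let clamped = max(-1, min(1, sample))      // avoid integer overflow on conversion
        append(Int16((clamped * 32767).rounded()))
    }
    return data
}

// One second of a 440 Hz tone at half amplitude, as stand-in audio.
let tone: [Float] = (0..<24_000).map { i in
    Float(sin(2.0 * Double.pi * 440.0 * Double(i) / 24_000.0)) * 0.5
}
let wav = wavData(samples: tone)
print("WAV size: \(wav.count) bytes")
```

The result can be saved with `try wav.write(to: URL(fileURLWithPath: "tone.wav"))`.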

## License

Apache 2.0 (same as upstream CosyVoice 3).

## Citation

```bibtex
@article{du2025cosyvoice3,
  title={CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training},
  author={Du, Zhihao and others},
  journal={arXiv preprint arXiv:2505.17589},
  year={2025}
}
```