metadata
language:
- zh
- en
- ja
- ko
- de
- es
- fr
- it
- ru
license: apache-2.0
tags:
- tts
- text-to-speech
- speech-synthesis
- mlx
- apple-silicon
- cosyvoice
base_model: FunAudioLLM/Fun-CosyVoice3-0.5B-2512
pipeline_tag: text-to-speech
CosyVoice3-0.5B MLX 4-bit
CosyVoice 3 text-to-speech model converted to MLX safetensors format with 4-bit quantization for Apple Silicon inference.
Converted from FunAudioLLM/Fun-CosyVoice3-0.5B-2512.
Swift inference: ivan-digital/qwen3-asr-swift
Model Details
| Component | Architecture | Size |
|---|---|---|
| LLM | Qwen2.5-0.5B (24L, 896d, 14Q/2KV heads) | 467 MB (4-bit) |
| DiT Flow Matching | 22-layer DiT (1024d, 16 heads, 10 ODE steps) | 634 MB (fp16) |
| HiFi-GAN Vocoder | NSF + F0 predictor + ISTFT | 79 MB (fp16) |
| Total | | ~1.2 GB |
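To make the "4-bit" entry in the LLM row more concrete, here is a minimal sketch of group-wise affine quantization with a group size of 64, the value given under Conversion Details. It assumes a min/max affine scheme with one scale and one bias per group. The converter's actual code is not shown in this card, so the type and function names below (`QuantizedGroup`, `quantize4Bit`, `dequantize`) are hypothetical.

```swift
import Foundation

/// One quantized group: 4-bit codes plus the affine parameters needed to decode them.
/// (Hypothetical type for illustration; not part of any published API.)
struct QuantizedGroup {
    let codes: [UInt8]   // each value is in 0...15
    let scale: Float
    let bias: Float
}

/// Affine 4-bit quantization of a flat weight row, one (scale, bias) pair per group.
/// Assumption: a min/max scheme where w is approximated by code * scale + bias.
func quantize4Bit(_ weights: [Float], groupSize: Int = 64) -> [QuantizedGroup] {
    precondition(groupSize > 0 && weights.count % groupSize == 0,
                 "row length must be a multiple of groupSize")
    let levels: Float = 15  // 2^4 - 1 steps between the group minimum and maximum
    var groups: [QuantizedGroup] = []
    for start in stride(from: 0, to: weights.count, by: groupSize) {
        let group = weights[start..<(start + groupSize)]
        let lo = group.min()!
        let hi = group.max()!
        // A constant group has hi == lo; any non-zero scale decodes it exactly.
        let scale: Float = hi > lo ? (hi - lo) / levels : 1
        let codes: [UInt8] = group.map { w in
            let q = ((w - lo) / scale).rounded()
            return UInt8(min(max(q, 0), levels))
        }
        groups.append(QuantizedGroup(codes: codes, scale: scale, bias: lo))
    }
    return groups
}

/// Reconstructs approximate weights from the quantized groups.
func dequantize(_ groups: [QuantizedGroup]) -> [Float] {
    groups.flatMap { g in g.codes.map { Float($0) * g.scale + g.bias } }
}
```

Under this scheme the reconstruction error of each weight is at most half of its group's scale. If the scale and bias are stored as fp16, they add 32 bits per 64 weights, so the effective cost is about 4.5 bits per quantized weight rather than exactly 4.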
Pipeline
Text → LLM (Qwen2.5-0.5B) → Speech Tokens (FSQ 6561) → DiT Flow Matching → Mel (80-band) → HiFi-GAN → Audio (24 kHz)
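The speech-token vocabulary size of 6561 equals 3^8, which is what a finite scalar quantizer (FSQ) with 8 dimensions and 3 levels per dimension would produce. The sketch below shows how such per-dimension levels could map to a single token index. The dimension count, the level count, and the digit order are assumptions inferred from the vocabulary size, not details stated in this card, and `fsqIndex` / `fsqDigits` are hypothetical helper names.

```swift
import Foundation

/// Packs per-dimension quantization levels into one token index.
/// Assumption: least-significant digit first, so index = sum(digit_i * levels^i).
func fsqIndex(digits: [Int], levels: Int = 3) -> Int {
    precondition(digits.allSatisfy { (0..<levels).contains($0) }, "digit out of range")
    var index = 0
    var place = 1
    for d in digits {
        index += d * place
        place *= levels
    }
    return index
}

/// Inverse of `fsqIndex`: unpacks a token index into its per-dimension levels.
func fsqDigits(index: Int, dimensions: Int = 8, levels: Int = 3) -> [Int] {
    var remaining = index
    var digits: [Int] = []
    for _ in 0..<dimensions {
        digits.append(remaining % levels)
        remaining /= levels
    }
    return digits
}

// 8 dimensions with 3 levels each give 3^8 distinct tokens.
let vocabularySize = (0..<8).reduce(1) { acc, _ in acc * 3 }
```

With these assumptions, valid token indices run from 0 to `vocabularySize - 1`, and every index round-trips through `fsqDigits` and `fsqIndex`.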
Languages
Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian
Files
- llm.safetensors: LLM weights (4-bit quantized)
- flow.safetensors: DiT flow matching decoder (fp16)
- hifigan.safetensors: HiFi-GAN vocoder (fp16, weight-norm folded)
- config.json: Model configuration
Conversion Details
- LLM: 4-bit quantization (group_size=64) of attention projections, MLP, and speech head
- Flow: fp16 (flow matching is sensitive to quantization)
- HiFi-GAN: fp16 with weight normalization folded (w = g * v / ||v||)
- Conv1d weights transposed from PyTorch [out, in, kernel] to MLX [out, kernel, in]
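The two HiFi-GAN conversion steps above can be sketched with plain nested arrays. This is an illustration under stated assumptions, not the converter's code: it assumes the weight-norm magnitude is taken per output channel over the input and kernel axes (the usual PyTorch default for Conv1d), and `foldWeightNorm` and `toMLXLayout` are hypothetical names.

```swift
import Foundation

/// Folds weight normalization into a plain weight: w = g * v / ||v||.
/// `v` uses the PyTorch layout [out][in][kernel]; `g` holds one gain per output channel.
/// Assumption: the norm is computed per output channel over the in and kernel axes.
func foldWeightNorm(v: [[[Float]]], g: [Float]) -> [[[Float]]] {
    precondition(v.count == g.count, "expected one gain per output channel")
    return zip(v, g).map { (channel, gain) -> [[Float]] in
        let sumOfSquares = channel.joined().reduce(Float(0)) { $0 + $1 * $1 }
        let norm = sumOfSquares.squareRoot()
        // An all-zero direction has no defined unit vector; keep it at zero.
        let factor: Float = norm > 0 ? gain / norm : 0
        return channel.map { row in row.map { $0 * factor } }
    }
}

/// Transposes a Conv1d weight from [out][in][kernel] (PyTorch) to [out][kernel][in] (MLX).
func toMLXLayout(_ w: [[[Float]]]) -> [[[Float]]] {
    w.map { channel -> [[Float]] in
        let inCount = channel.count
        let kernel = channel.first?.count ?? 0
        return (0..<kernel).map { k in (0..<inCount).map { i in channel[i][k] } }
    }
}
```

Folding first and transposing second keeps the norm computation in the original PyTorch layout. The folded weight is an ordinary Conv1d weight, so no separate g and v tensors are needed at inference time.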
Usage
For use with ivan-digital/qwen3-asr-swift:
import CosyVoiceTTS
let model = try await CosyVoiceTTSModel.fromPretrained()
let audio = model.synthesize(text: "Hello, how are you?", language: "english")
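To save the result without the CLI, the samples can be encoded as a WAV file. The sketch below assumes that `audio` is a `[Float]` array of mono samples in the range [-1, 1] at 24 kHz, the output rate given in the pipeline; the actual return type of `synthesize` is not stated in this card. `wavData` is a hypothetical helper, not part of the package.

```swift
import Foundation

/// Encodes mono Float samples as a 16-bit PCM WAV file held in memory.
/// Assumption: samples lie in [-1, 1]; values outside that range are clamped.
func wavData(samples: [Float], sampleRate: Int = 24_000) -> Data {
    let bytesPerSample = 2
    let dataSize = samples.count * bytesPerSample
    var data = Data()

    // WAV stores every integer field in little-endian byte order.
    func append<T: FixedWidthInteger>(_ value: T) {
        withUnsafeBytes(of: value.littleEndian) { data.append(contentsOf: $0) }
    }

    data.append(contentsOf: Array("RIFF".utf8))
    append(UInt32(36 + dataSize))                 // bytes remaining after this field
    data.append(contentsOf: Array("WAVE".utf8))
    data.append(contentsOf: Array("fmt ".utf8))
    append(UInt32(16))                            // size of the fmt chunk
    append(UInt16(1))                             // format tag 1 = PCM
    append(UInt16(1))                             // channel count (mono)
    append(UInt32(sampleRate))
    append(UInt32(sampleRate * bytesPerSample))   // byte rate
    append(UInt16(bytesPerSample))                // block align
    append(UInt16(16))                            // bits per sample
    data.append(contentsOf: Array("data".utf8))
    append(UInt32(dataSize))

    for sample in samples {
        let clamped = max(-1, min(1, sample))
        append(Int16((clamped * 32767).rounded()))
    }
    return data
}
```

If `audio` does have that shape, `try wavData(samples: audio).write(to: URL(fileURLWithPath: "hello.wav"))` would write a file comparable to the CLI output.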
CLI
swift run cosyvoice-tts-cli --text "Hello, how are you?" --lang english --output hello.wav
License
Apache 2.0 (same as upstream CosyVoice 3)
Citation
@article{du2025cosyvoice3,
title={CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training},
author={Du, Zhihao and others},
journal={arXiv preprint arXiv:2505.17589},
year={2025}
}