---
language:
- zh
- en
- ja
- ko
- de
- es
- fr
- it
- ru
license: apache-2.0
tags:
- tts
- text-to-speech
- speech-synthesis
- mlx
- apple-silicon
- cosyvoice
base_model: FunAudioLLM/Fun-CosyVoice3-0.5B-2512
pipeline_tag: text-to-speech
---

# CosyVoice3-0.5B MLX 4-bit

[CosyVoice 3](https://arxiv.org/abs/2505.17589) text-to-speech model converted to MLX safetensors format with 4-bit quantization for Apple Silicon inference.

Converted from [FunAudioLLM/Fun-CosyVoice3-0.5B-2512](https://huggingface.co/FunAudioLLM/Fun-CosyVoice3-0.5B-2512).

**Swift inference**: [ivan-digital/qwen3-asr-swift](https://github.com/ivan-digital/qwen3-asr-swift)

## Model Details

| Component | Architecture | Size |
|-----------|-------------|------|
| LLM | Qwen2.5-0.5B (24L, 896d, 14Q/2KV heads) | 467 MB (4-bit) |
| DiT Flow Matching | 22-layer DiT (1024d, 16 heads, 10 ODE steps) | 634 MB (fp16) |
| HiFi-GAN Vocoder | NSF + F0 predictor + ISTFT | 79 MB (fp16) |
| **Total** | | **~1.2 GB** |

## Pipeline

```
Text → LLM (Qwen2.5-0.5B) → Speech Tokens (FSQ 6561) → DiT Flow Matching → Mel (80-band) → HiFi-GAN → Audio (24kHz)
```

## Languages

Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian

## Files

- `llm.safetensors`: LLM weights (4-bit quantized)
- `flow.safetensors`: DiT flow matching decoder (fp16)
- `hifigan.safetensors`: HiFi-GAN vocoder (fp16, weight-norm folded)
- `config.json`: Model configuration

## Conversion Details

- LLM: 4-bit quantization (group_size=64) of attention projections, MLP, and speech head
- Flow: fp16 (flow matching is sensitive to quantization)
- HiFi-GAN: fp16 with weight normalization folded (`w = g * v / ||v||`)
- Conv1d weights transposed from PyTorch `[out, in, kernel]` to MLX `[out, kernel, in]`

## Usage

For use with [ivan-digital/qwen3-asr-swift](https://github.com/ivan-digital/qwen3-asr-swift):

```swift
import CosyVoiceTTS

let model = try await CosyVoiceTTSModel.fromPretrained()
let audio = model.synthesize(text: "Hello, how are you?", language: "english")
```

### CLI

```bash
swift run cosyvoice-tts-cli --text "Hello, how are you?" --lang english --output hello.wav
```

## License

Apache 2.0 (same as upstream CosyVoice 3)

## Citation

```bibtex
@article{du2025cosyvoice3,
  title={CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training},
  author={Du, Zhihao and others},
  journal={arXiv preprint arXiv:2505.17589},
  year={2025}
}
```
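
## Appendix: Weight Folding Sketch

The two HiFi-GAN items under Conversion Details (folding weight normalization via `w = g * v / ||v||` and transposing Conv1d weights from `[out, in, kernel]` to `[out, kernel, in]`) can be illustrated on a toy tensor. The sketch below is not the converter used for this repository. The function name `foldAndTranspose` is hypothetical, it uses plain nested Swift arrays rather than MLX arrays, and it assumes the norm is taken per output channel over the remaining two axes (PyTorch's default for `weight_norm`).

```swift
// Illustrative helper (hypothetical name, not part of the converter or the Swift package).
// v: direction tensor in PyTorch Conv1d layout [out][in][kernel]
// g: magnitude, one scalar per output channel (assumed per-channel norm)
// Returns the folded weight w = g * v / ||v|| in MLX Conv1d layout [out][kernel][in].
func foldAndTranspose(v: [[[Float]]], g: [Float]) -> [[[Float]]] {
    precondition(v.count == g.count, "g needs one entry per output channel")
    var result: [[[Float]]] = []
    for (o, filter) in v.enumerated() {
        // L2 norm over every element of this output channel's filter.
        // Note: an all-zero filter would divide by zero; a real converter should guard this.
        var sumSquares: Float = 0
        for row in filter {
            for x in row { sumSquares += x * x }
        }
        let scale = g[o] / sumSquares.squareRoot()

        let inChannels = filter.count
        let kernelSize = filter.first?.count ?? 0
        // Swap the last two axes while scaling: [in][kernel] -> [kernel][in].
        var transposed = [[Float]](
            repeating: [Float](repeating: 0, count: inChannels),
            count: kernelSize
        )
        for i in 0..<inChannels {
            for k in 0..<kernelSize {
                transposed[k][i] = filter[i][k] * scale
            }
        }
        result.append(transposed)
    }
    return result
}

// Toy input: 1 output channel, 2 input channels, kernel size 3.
// ||v|| = sqrt(3^2 + 4^2) = 5 and g = 10, so every element is scaled by 2.
let v: [[[Float]]] = [[[3, 0, 0], [0, 4, 0]]]
let g: [Float] = [10]
let folded = foldAndTranspose(v: v, g: g)
print(folded)  // [[[6.0, 0.0], [0.0, 8.0], [0.0, 0.0]]]
```

The result has shape `[1][3][2]`, so the kernel axis now precedes the input-channel axis. Folding at conversion time means the stored vocoder weights can be loaded directly, without keeping separate `g` and `v` tensors.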