---
language:
- zh
- en
- ja
- ko
- de
- es
- fr
- it
- ru
license: apache-2.0
tags:
- tts
- text-to-speech
- speech-synthesis
- mlx
- apple-silicon
- cosyvoice
base_model: FunAudioLLM/Fun-CosyVoice3-0.5B-2512
pipeline_tag: text-to-speech
---
# CosyVoice3-0.5B MLX 4-bit
[CosyVoice 3](https://arxiv.org/abs/2505.17589) text-to-speech model converted to MLX safetensors format with 4-bit quantization for Apple Silicon inference.
Converted from [FunAudioLLM/Fun-CosyVoice3-0.5B-2512](https://huggingface.co/FunAudioLLM/Fun-CosyVoice3-0.5B-2512).

**Swift inference**: [ivan-digital/qwen3-asr-swift](https://github.com/ivan-digital/qwen3-asr-swift)
## Model Details
| Component | Architecture | Size |
|-----------|-------------|------|
| LLM | Qwen2.5-0.5B (24L, 896d, 14Q/2KV heads) | 467 MB (4-bit) |
| DiT Flow Matching | 22-layer DiT (1024d, 16 heads, 10 ODE steps) | 634 MB (fp16) |
| HiFi-GAN Vocoder | NSF + F0 predictor + ISTFT | 79 MB (fp16) |
| **Total** | | **~1.2 GB** |
## Pipeline
```
Text β†’ LLM (Qwen2.5-0.5B) β†’ Speech Tokens (FSQ 6561) β†’ DiT Flow Matching β†’ Mel (80-band) β†’ HiFi-GAN β†’ Audio (24kHz)
```
## Languages
Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian
## Files
- `llm.safetensors` β€” LLM weights (4-bit quantized)
- `flow.safetensors` β€” DiT flow matching decoder (fp16)
- `hifigan.safetensors` β€” HiFi-GAN vocoder (fp16, weight-norm folded)
- `config.json` β€” Model configuration
## Conversion Details
- LLM: 4-bit quantization (group_size=64) of attention projections, MLP, and speech head
- Flow: fp16 (flow matching is sensitive to quantization)
- HiFi-GAN: fp16 with weight normalization folded (`w = g * v / ||v||`)
- Conv1d weights transposed from PyTorch `[out, in, kernel]` to MLX `[out, kernel, in]`
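
The three transformations above can be sketched in numpy. This is an illustrative sketch only: function names are hypothetical, and the actual conversion uses PyTorch and MLX APIs (MLX's grouped affine quantization stores a scale and bias per group of 64 weights, which is what `quantize_4bit` imitates here).

```python
import numpy as np

def fold_weight_norm(v, g):
    """Fold weight normalization into a plain weight: w = g * v / ||v||.

    `v` is the direction tensor, `g` the per-output-channel gain.
    The norm is taken over all axes except the first (output) axis,
    matching PyTorch's default weight_norm(dim=0).
    """
    axes = tuple(range(1, v.ndim))
    norm = np.sqrt((v ** 2).sum(axis=axes, keepdims=True))
    return g.reshape(-1, *([1] * (v.ndim - 1))) * v / norm

def conv1d_to_mlx(w):
    """Transpose Conv1d weights from PyTorch [out, in, kernel]
    to MLX's channels-last [out, kernel, in] layout."""
    return w.transpose(0, 2, 1)

def quantize_4bit(w, group_size=64):
    """Grouped affine 4-bit quantization of a 2-D weight matrix.

    Each contiguous group of `group_size` weights along the input
    dimension gets its own scale and bias, in the spirit of MLX's
    quantization. Returns (codes, scale, bias); dequantize with
    q * scale + bias.
    """
    out_dim, in_dim = w.shape
    g = w.reshape(out_dim, in_dim // group_size, group_size)
    w_min = g.min(axis=-1, keepdims=True)
    w_max = g.max(axis=-1, keepdims=True)
    scale = (w_max - w_min) / 15.0  # 4 bits -> 16 levels
    q = np.clip(np.round((g - w_min) / np.maximum(scale, 1e-8)), 0, 15)
    return q.astype(np.uint8), scale, w_min

def dequantize_4bit(q, scale, bias):
    """Reconstruct an fp weight matrix from grouped 4-bit codes."""
    return (q * scale + bias).reshape(q.shape[0], -1)
```

The per-group error of this scheme is bounded by half the group's scale, which is why the LLM tolerates it while the flow-matching decoder (kept at fp16 above) does not.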
## Usage
For use with [ivan-digital/qwen3-asr-swift](https://github.com/ivan-digital/qwen3-asr-swift):
```swift
import CosyVoiceTTS
let model = try await CosyVoiceTTSModel.fromPretrained()
let audio = model.synthesize(text: "Hello, how are you?", language: "english")
```
### CLI
```bash
swift run cosyvoice-tts-cli --text "Hello, how are you?" --lang english --output hello.wav
```
## License
Apache 2.0 (same as upstream CosyVoice 3)
## Citation
```bibtex
@article{du2025cosyvoice3,
  title={CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training},
  author={Du, Zhihao and others},
  journal={arXiv preprint arXiv:2505.17589},
  year={2025}
}
```