---
language:
  - zh
  - en
  - ja
  - ko
  - de
  - es
  - fr
  - it
  - ru
license: apache-2.0
tags:
  - tts
  - text-to-speech
  - speech-synthesis
  - mlx
  - apple-silicon
  - cosyvoice
base_model: FunAudioLLM/Fun-CosyVoice3-0.5B-2512
pipeline_tag: text-to-speech
---

# CosyVoice3-0.5B MLX 4-bit

[CosyVoice 3](https://arxiv.org/abs/2505.17589) text-to-speech model converted to MLX safetensors format with 4-bit quantization for Apple Silicon inference.

Converted from [FunAudioLLM/Fun-CosyVoice3-0.5B-2512](https://huggingface.co/FunAudioLLM/Fun-CosyVoice3-0.5B-2512).

**Swift inference**: [ivan-digital/qwen3-asr-swift](https://github.com/ivan-digital/qwen3-asr-swift)

## Model Details

| Component | Architecture | Size |
|-----------|-------------|------|
| LLM | Qwen2.5-0.5B (24L, 896d, 14Q/2KV heads) | 467 MB (4-bit) |
| DiT Flow Matching | 22-layer DiT (1024d, 16 heads, 10 ODE steps) | 634 MB (fp16) |
| HiFi-GAN Vocoder | NSF + F0 predictor + ISTFT | 79 MB (fp16) |
| **Total** | | **~1.2 GB** |
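
As a quick check on the table, the sketch below sums the component sizes and derives the grouped-query attention figures implied by the LLM row. The constant names are illustrative and not part of any package; the numbers are copied from the table.

```swift
// Component sizes in MB, copied from the Model Details table.
let componentSizesMB: [(name: String, sizeMB: Int)] = [
    ("LLM (4-bit)", 467),
    ("DiT flow matching (fp16)", 634),
    ("HiFi-GAN vocoder (fp16)", 79),
]

// 467 + 634 + 79 = 1180 MB, which rounds to the "~1.2 GB" total.
let totalMB = componentSizesMB.reduce(0) { $0 + $1.sizeMB }
let totalGB = Double(totalMB) / 1000.0

// Grouped-query attention figures from "896d, 14Q/2KV heads".
let hiddenSize = 896
let queryHeads = 14
let keyValueHeads = 2
let headDim = hiddenSize / queryHeads              // 64
let queriesPerKVHead = queryHeads / keyValueHeads  // 7

print("total: \(totalMB) MB (~\(totalGB) GB)")
print("head dim: \(headDim), query heads per KV head: \(queriesPerKVHead)")
```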

## Pipeline

```
Text → LLM (Qwen2.5-0.5B) → Speech Tokens (FSQ 6561) → DiT Flow Matching → Mel (80-band) → HiFi-GAN → Audio (24kHz)
```
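
The speech-token vocabulary size of 6561 equals 3^8. The sketch below assumes this corresponds to finite scalar quantization with 8 dimensions of 3 levels each; that factorization is an assumption here, not something this card states. The function names are hypothetical and only illustrate how a token id maps to per-dimension levels and back.

```swift
// Assumption: 8 FSQ dimensions with 3 levels each, giving 3^8 = 6561 token ids.
let fsqLevels = 3
let fsqDims = 8

/// Number of distinct tokens for a given level count and dimension count.
func codebookSize(levels: Int, dims: Int) -> Int {
    (0..<dims).reduce(1) { acc, _ in acc * levels }
}

/// Split a token id into per-dimension levels (least significant digit first).
func fsqDigits(of token: Int, levels: Int, dims: Int) -> [Int] {
    var rest = token
    var digits: [Int] = []
    for _ in 0..<dims {
        digits.append(rest % levels)
        rest /= levels
    }
    return digits
}

/// Inverse of `fsqDigits`: rebuild the token id from its per-dimension levels.
func fsqToken(from digits: [Int], levels: Int) -> Int {
    digits.reversed().reduce(0) { $0 * levels + $1 }
}

let size = codebookSize(levels: fsqLevels, dims: fsqDims)           // 6561
let digits = fsqDigits(of: 6560, levels: fsqLevels, dims: fsqDims)  // all 2s
let roundTrip = fsqToken(from: digits, levels: fsqLevels)           // 6560
print(size, digits, roundTrip)
```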

## Languages

Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian

## Files

- `llm.safetensors` — LLM weights (4-bit quantized)
- `flow.safetensors` — DiT flow matching decoder (fp16)
- `hifigan.safetensors` — HiFi-GAN vocoder (fp16, weight-norm folded)
- `config.json` — Model configuration

## Conversion Details

- LLM: 4-bit quantization (group_size=64) of attention projections, MLP, and speech head
- Flow: fp16 (flow matching is sensitive to quantization)
- HiFi-GAN: fp16 with weight normalization folded (`w = g * v / ||v||`)
- Conv1d weights transposed from PyTorch `[out, in, kernel]` to MLX `[out, kernel, in]`
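
The last two steps can be sketched on plain arrays. This is a minimal illustration, not the conversion script used for this repository: `Conv1dWeight`, `foldWeightNorm`, and `toMLXLayout` are hypothetical names, and the norm is assumed to be taken per output channel, which is the default behavior of PyTorch's `weight_norm`.

```swift
/// A Conv1d weight stored row-major with an explicit 3-D shape.
struct Conv1dWeight {
    var shape: (Int, Int, Int)
    var data: [Float]
}

/// Fold weight normalization: w = g * v / ||v||.
/// `v` is in PyTorch layout [out, in, kernel]; `g` holds one gain per output channel.
func foldWeightNorm(g: [Float], v: Conv1dWeight) -> Conv1dWeight {
    let (outCh, inCh, k) = v.shape
    let perChannel = inCh * k
    var w = v
    for o in 0..<outCh {
        let range = (o * perChannel)..<((o + 1) * perChannel)
        // L2 norm over all (in, kernel) entries of this output channel.
        let norm = v.data[range].reduce(Float(0)) { $0 + $1 * $1 }.squareRoot()
        for i in range { w.data[i] = g[o] * v.data[i] / norm }
    }
    return w
}

/// Transpose PyTorch layout [out, in, kernel] to MLX layout [out, kernel, in].
func toMLXLayout(_ w: Conv1dWeight) -> Conv1dWeight {
    let (outCh, inCh, k) = w.shape
    var out = [Float](repeating: 0, count: w.data.count)
    for o in 0..<outCh {
        for i in 0..<inCh {
            for t in 0..<k {
                out[(o * k + t) * inCh + i] = w.data[(o * inCh + i) * k + t]
            }
        }
    }
    return Conv1dWeight(shape: (outCh, k, inCh), data: out)
}

// One output channel, two input channels, kernel size two; ||v|| = 5.
let v = Conv1dWeight(shape: (1, 2, 2), data: [3, 4, 0, 0])
let folded = foldWeightNorm(g: [10], v: v)   // data: [6, 8, 0, 0]
let mlx = toMLXLayout(folded)                // data: [6, 0, 8, 0]
print(folded.data, mlx.data, mlx.shape)
```

Folding removes the separate `g` and `v` tensors, so inference only needs a single dense weight per convolution.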

## Usage

For use with [ivan-digital/qwen3-asr-swift](https://github.com/ivan-digital/qwen3-asr-swift):

```swift
import CosyVoiceTTS

let model = try await CosyVoiceTTSModel.fromPretrained()
let audio = model.synthesize(text: "Hello, how are you?", language: "english")
```
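
The snippet above does not show the type of `audio`. Assuming it can be obtained as an array of mono `Float` samples in [-1, 1] at 24 kHz (an assumption, not a documented part of the package API), a 16-bit PCM WAV file could be produced along these lines. `wavData` is a hypothetical helper, not a function from the package.

```swift
import Foundation

/// Encode mono Float samples in [-1, 1] as a 16-bit PCM WAV file.
/// The 24 kHz default matches the sample rate shown in the pipeline.
func wavData(samples: [Float], sampleRate: Int = 24_000) -> Data {
    var data = Data()
    // Append an integer in little-endian byte order, as WAV requires.
    func append<T: FixedWidthInteger>(_ value: T) {
        withUnsafeBytes(of: value.littleEndian) { data.append(contentsOf: $0) }
    }
    let byteCount = UInt32(samples.count * 2)
    data.append(contentsOf: Array("RIFF".utf8))
    append(UInt32(36) + byteCount)       // remaining file size after this field
    data.append(contentsOf: Array("WAVE".utf8))
    data.append(contentsOf: Array("fmt ".utf8))
    append(UInt32(16))                   // fmt chunk size
    append(UInt16(1))                    // format tag: PCM
    append(UInt16(1))                    // channels: mono
    append(UInt32(sampleRate))
    append(UInt32(sampleRate * 2))       // byte rate = rate * channels * 2 bytes
    append(UInt16(2))                    // block align
    append(UInt16(16))                   // bits per sample
    data.append(contentsOf: Array("data".utf8))
    append(byteCount)
    for sample in samples {
        let clipped = max(-1, min(1, sample))
        append(Int16(clipped * 32767))
    }
    return data
}

// Usage with placeholder samples; replace with the synthesized audio.
let wav = wavData(samples: [0, 0.5, -0.5, 1, -1])
try wav.write(to: URL(fileURLWithPath: "hello.wav"))
```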

### CLI

```bash
swift run cosyvoice-tts-cli --text "Hello, how are you?" --lang english --output hello.wav
```

## License

Apache 2.0 (same as upstream CosyVoice 3)

## Citation

```bibtex
@article{du2025cosyvoice3,
  title={CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training},
  author={Du, Zhihao and others},
  journal={arXiv preprint arXiv:2505.17589},
  year={2025}
}
```