CosyVoice3-0.5B MLX 4-bit

CosyVoice 3 text-to-speech model, converted to MLX safetensors format for inference on Apple Silicon. The LLM is quantized to 4 bits; the flow-matching decoder and the vocoder are kept in fp16.

Converted from FunAudioLLM/Fun-CosyVoice3-0.5B-2512.

Swift inference: ivan-digital/qwen3-asr-swift

Model Details

Component	Architecture	Size
LLM	Qwen2.5-0.5B (24L, 896d, 14Q/2KV heads)	467 MB (4-bit)
DiT Flow Matching	22-layer DiT (1024d, 16 heads, 10 ODE steps)	634 MB (fp16)
HiFi-GAN Vocoder	NSF + F0 predictor + ISTFT	79 MB (fp16)
Total		~1.2 GB
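
The "14Q/2KV heads" entry means the LLM uses grouped-query attention: 14 query heads share 2 key/value heads. The sketch below works out the resulting dimensions from the numbers in the table. The `AttentionShape` type and the consecutive-head grouping rule are illustrative assumptions, not code from the Swift package.

```swift
// Hypothetical helper: derives grouped-query attention shapes from the table above.
struct AttentionShape {
    let hiddenSize: Int
    let queryHeads: Int
    let kvHeads: Int

    // Width of a single attention head.
    var headDim: Int { hiddenSize / queryHeads }
    // Number of query heads that read from the same key/value head.
    var queriesPerKVHead: Int { queryHeads / kvHeads }
    // Output width of the key (or value) projection.
    var kvWidth: Int { kvHeads * headDim }

    // Assumed convention: consecutive query heads share one key/value head.
    func kvHead(forQueryHead q: Int) -> Int { q / queriesPerKVHead }
}

let llm = AttentionShape(hiddenSize: 896, queryHeads: 14, kvHeads: 2)
print(llm.headDim, llm.queriesPerKVHead, llm.kvWidth)
```

With these values each head is 64 wide, 7 query heads share each key/value head, and the key and value projections are 128 wide instead of 896, which shrinks the KV cache by the same factor of 7.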

Pipeline

Text → LLM (Qwen2.5-0.5B) → Speech Tokens (FSQ 6561) → DiT Flow Matching → Mel (80-band) → HiFi-GAN → Audio (24 kHz)
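
The flow-matching stage turns noise into a mel spectrogram by integrating an ODE over a fixed number of steps (10, per the table above). A minimal sketch of that idea with a plain Euler solver follows. The `toyVelocity` function is a hypothetical stand-in for the learned DiT, and the uniform time grid is an assumption; the real model may use a different schedule and conditioning.

```swift
// Toy stand-in for the learned velocity field (assumption, not the real DiT):
// the straight-line flow that carries any point to `target` at t = 1.
func toyVelocity(_ x: [Double], _ t: Double, target: [Double]) -> [Double] {
    return zip(x, target).map { pair in (pair.1 - pair.0) / (1.0 - t) }
}

// Euler integration of dx/dt = velocity(x, t) from t = 0 to t = 1.
func eulerSolve(from x0: [Double], steps: Int,
                velocity: ([Double], Double) -> [Double]) -> [Double] {
    var x = x0
    let dt = 1.0 / Double(steps)
    for k in 0..<steps {
        let t = Double(k) * dt          // start time of this step
        let v = velocity(x, t)
        x = zip(x, v).map { $0 + dt * $1 }
    }
    return x
}

let noise: [Double] = [0.3, -1.2, 0.8]      // starting point (would be Gaussian noise)
let target: [Double] = [1.0, 2.0, 3.0]      // stands in for mel-spectrogram values
let mel = eulerSolve(from: noise, steps: 10) { x, t in
    toyVelocity(x, t, target: target)
}
print(mel)
```

Each step costs one evaluation of the velocity network, so the step count trades synthesis speed against integration accuracy.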

Languages

Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian

Files

llm.safetensors — LLM weights (4-bit quantized)
flow.safetensors — DiT flow matching decoder (fp16)
hifigan.safetensors — HiFi-GAN vocoder (fp16, weight-norm folded)
config.json — Model configuration
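
To illustrate what "4-bit quantized" means for llm.safetensors, here is a minimal sketch of group-wise affine quantization with the group size of 64 listed under Conversion Details. It assumes the usual min/max scheme (one scale and one bias per group) and 16-bit storage for both; the names are hypothetical and MLX's actual kernels and bit packing differ in detail.

```swift
// One quantized group: 4-bit codes plus a scale and bias shared by the group.
struct QuantizedGroup {
    let codes: [UInt8]   // each value in 0...15
    let scale: Float
    let bias: Float
}

// Maps the group's [min, max] range onto the 16 available levels.
func quantize4Bit(_ group: [Float]) -> QuantizedGroup {
    let lo = group.min() ?? 0
    let hi = group.max() ?? 0
    let scale: Float = hi > lo ? (hi - lo) / 15 : 1   // 15 = 2^4 - 1 steps
    let codes = group.map { value -> UInt8 in
        let level = ((value - lo) / scale).rounded()
        return UInt8(min(15, max(0, level)))          // clamp against rounding drift
    }
    return QuantizedGroup(codes: codes, scale: scale, bias: lo)
}

// Reconstructs approximate weights: w ≈ code * scale + bias.
func dequantize(_ q: QuantizedGroup) -> [Float] {
    return q.codes.map { Float($0) * q.scale + q.bias }
}

let groupSize = 64
let weights = (0..<groupSize).map { Float($0) / Float(groupSize - 1) - 0.5 }
let quantized = quantize4Bit(weights)
let restored = dequantize(quantized)
let maxError = zip(weights, restored).map { abs($0 - $1) }.max() ?? 0

// Storage per weight: a 4-bit code plus a 16-bit scale and 16-bit bias per group (assumed).
let bitsPerWeight = 4.0 + 32.0 / Double(groupSize)
print(maxError, bitsPerWeight)
```

Under these assumptions the effective cost is 4.5 bits per quantized weight, and the reconstruction error within a group is bounded by half a quantization step.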

Conversion Details

LLM: 4-bit quantization (group_size=64) of attention projections, MLP, and speech head
Flow: fp16 (flow matching is sensitive to quantization)
HiFi-GAN: fp16 with weight normalization folded (w = g * v / ||v||)
Conv1d weights transposed from PyTorch [out, in, kernel] to MLX [out, kernel, in]
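
The last two steps can be sketched on nested arrays. This is an illustration only, not the conversion script: the function names are hypothetical, and the norm is assumed to be taken per output channel, as PyTorch's default weight normalization does for Conv1d.

```swift
typealias Tensor3 = [[[Float]]]   // indexed [out][in][kernel] in the PyTorch layout

// Folds weight normalization into a plain weight: w = g * v / ||v||,
// with one gain g and one norm per output channel (assumed).
func foldWeightNorm(v: Tensor3, g: [Float]) -> Tensor3 {
    precondition(v.count == g.count, "expected one gain per output channel")
    var folded: Tensor3 = []
    for (channel, gain) in zip(v, g) {
        let sumOfSquares = channel.joined().reduce(Float(0)) { $0 + $1 * $1 }
        let scale = gain / sumOfSquares.squareRoot()
        folded.append(channel.map { row in row.map { $0 * scale } })
    }
    return folded
}

// Transposes [out, in, kernel] to the MLX layout [out, kernel, in].
func toMLXLayout(_ w: Tensor3) -> Tensor3 {
    return w.map { channel in
        let inCount = channel.count
        let kernelSize = channel.first?.count ?? 0
        return (0..<kernelSize).map { k in
            (0..<inCount).map { i in channel[i][k] }
        }
    }
}

// One output channel, two input channels, kernel size three; ||v|| = 5.
let v: Tensor3 = [[[1, 2, 2], [0, 0, 4]]]
let folded = foldWeightNorm(v: v, g: [10])
let mlxWeight = toMLXLayout(folded)
print(folded, mlxWeight)
```

Folding happens once at conversion time, so inference loads a single weight tensor per convolution and never recomputes the norm.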

Usage

For use with ivan-digital/qwen3-asr-swift:

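import CosyVoiceTTS

// Load the model; fromPretrained() is async and can throw, so call it from an async context.
let model = try await CosyVoiceTTSModel.fromPretrained()
// Synthesize speech for the given text and language.
let audio = model.synthesize(text: "Hello, how are you?", language: "english")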
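
When calling the library directly rather than the CLI, the synthesized samples still have to be written to disk. The sketch below encodes samples as a 16-bit mono PCM WAV at 24 kHz. It assumes `audio` is an array of `Float` samples in the range -1 to 1, which the source does not confirm, and `wavData` is a hypothetical helper rather than part of the package.

```swift
import Foundation

// Encodes mono Float samples in [-1, 1] as a 16-bit PCM WAV file image.
func wavData(samples: [Float], sampleRate: UInt32 = 24_000) -> Data {
    var data = Data()
    // Appends an integer in little-endian byte order, as the WAV format requires.
    func append<T: FixedWidthInteger>(_ value: T) {
        withUnsafeBytes(of: value.littleEndian) { data.append(contentsOf: $0) }
    }
    let bytesPerSample: UInt32 = 2
    let payloadSize = UInt32(samples.count) * bytesPerSample

    data.append(contentsOf: Array("RIFF".utf8))
    append(36 + payloadSize)                 // size of everything after this field
    data.append(contentsOf: Array("WAVE".utf8))
    data.append(contentsOf: Array("fmt ".utf8))
    append(UInt32(16))                       // fmt chunk size
    append(UInt16(1))                        // format tag: PCM
    append(UInt16(1))                        // channels: mono
    append(sampleRate)
    append(sampleRate * bytesPerSample)      // byte rate
    append(UInt16(2))                        // block align
    append(UInt16(16))                       // bits per sample
    data.append(contentsOf: Array("data".utf8))
    append(payloadSize)
    for sample in samples {
        let clamped = max(-1, min(1, sample))
        append(Int16(clamped * 32767))
    }
    return data
}

let wav = wavData(samples: [0, 0.5, -0.5])
print(wav.count)
```

The result can be saved with `try wav.write(to: URL(fileURLWithPath: "hello.wav"))`, passing the synthesized samples in place of the three test values.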

CLI

swift run cosyvoice-tts-cli --text "Hello, how are you?" --lang english --output hello.wav

License

Apache 2.0 (same as upstream CosyVoice 3)

Citation

@article{du2025cosyvoice3,
  title={CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training},
  author={Du, Zhihao and others},
  journal={arXiv preprint arXiv:2505.17589},
  year={2025}
}
