CosyVoice3-0.5B MLX 4-bit

The CosyVoice 3 text-to-speech model, converted to MLX safetensors format for inference on Apple Silicon. The LLM is quantized to 4 bits; the flow-matching decoder and the vocoder remain in fp16.

Converted from FunAudioLLM/Fun-CosyVoice3-0.5B-2512.

Swift inference: ivan-digital/qwen3-asr-swift

Model Details

Component          | Architecture                                   | Size
-------------------|------------------------------------------------|----------------
LLM                | Qwen2.5-0.5B (24L, 896d, 14Q/2KV heads)        | 467 MB (4-bit)
DiT Flow Matching  | 22-layer DiT (1024d, 16 heads, 10 ODE steps)   | 634 MB (fp16)
HiFi-GAN Vocoder   | NSF + F0 predictor + ISTFT                     | 79 MB (fp16)
Total              |                                                | ~1.2 GB

Pipeline

Text → LLM (Qwen2.5-0.5B) → Speech Tokens (FSQ 6561) → DiT Flow Matching → Mel (80-band) → HiFi-GAN → Audio (24 kHz)
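
The speech-token stage uses a 6561-entry FSQ codebook. As a rough illustration of how such an index can be read, the sketch below splits a token index into base-3 digits. It assumes the codebook factors as 8 dimensions with 3 levels each (6561 = 3^8), which this card does not state; the function names and the digit order (least-significant first) are hypothetical and are not taken from the model's code.

```swift
// Hypothetical helper: split a speech-token index into base-3 digits.
// Assumption: the 6561-entry codebook is 8 dimensions x 3 levels (3^8 = 6561).
func fsqDigits(_ token: Int, levels: Int = 3, dims: Int = 8) -> [Int] {
    precondition(token >= 0, "token index must be non-negative")
    var remaining = token
    var digits: [Int] = []
    for _ in 0..<dims {
        digits.append(remaining % levels)  // least-significant digit first
        remaining /= levels
    }
    precondition(remaining == 0, "token index exceeds codebook size")
    return digits
}

// Inverse: recombine digits (least-significant first) into the token index.
func fsqIndex(_ digits: [Int], levels: Int = 3) -> Int {
    return digits.reversed().reduce(0) { $0 * levels + $1 }
}
```

The largest valid index, 6560, maps to eight digits that are all 2, and `fsqIndex(fsqDigits(t))` returns `t` for any index in range.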

Languages

Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian

Files

  • llm.safetensors: LLM weights (4-bit quantized)
  • flow.safetensors: DiT flow matching decoder (fp16)
  • hifigan.safetensors: HiFi-GAN vocoder (fp16, weight-norm folded)
  • config.json: Model configuration

Conversion Details

  • LLM: 4-bit quantization (group_size=64) of attention projections, MLP, and speech head
  • Flow: fp16 (flow matching is sensitive to quantization)
  • HiFi-GAN: fp16 with weight normalization folded (w = g * v / ||v||)
  • Conv1d weights transposed from PyTorch [out, in, kernel] to MLX [out, kernel, in]
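
To make the LLM bullet above more concrete, here is a minimal sketch of group-wise 4-bit affine quantization with group_size=64, written in plain Swift. It illustrates the general technique only: each group of 64 weights gets its own scale and bias, and each weight becomes a code in 0...15. The type and function names are hypothetical, the codes are left unpacked for readability, and the exact rounding and packing used by MLX may differ.

```swift
// Hypothetical container for one quantized group (not an MLX type).
struct QuantizedGroup {
    let codes: [UInt8]   // one 4-bit code (0...15) per weight, stored unpacked
    let scale: Float
    let bias: Float
}

// Quantize a flat weight array in groups of `groupSize` values.
func quantize4Bit(_ weights: [Float], groupSize: Int = 64) -> [QuantizedGroup] {
    precondition(groupSize > 0 && weights.count % groupSize == 0,
                 "weight count must be a multiple of groupSize")
    var groups: [QuantizedGroup] = []
    for start in stride(from: 0, to: weights.count, by: groupSize) {
        let group = weights[start..<(start + groupSize)]
        let lo = group.min()!
        let hi = group.max()!
        // 4 bits give 16 levels, so 15 steps span [lo, hi]; avoid a zero scale.
        let scale = max((hi - lo) / 15, Float.leastNormalMagnitude)
        let codes = group.map { w -> UInt8 in
            let q = ((w - lo) / scale).rounded()
            return UInt8(min(max(q, 0), 15))
        }
        groups.append(QuantizedGroup(codes: codes, scale: scale, bias: lo))
    }
    return groups
}

// Reconstruct approximate weights: w' = code * scale + bias.
func dequantize(_ groups: [QuantizedGroup]) -> [Float] {
    return groups.flatMap { g in g.codes.map { Float($0) * g.scale + g.bias } }
}
```

In this scheme the reconstruction error of each weight is at most half of its group's scale. If the scale and bias are stored as 16-bit values, they add 32 bits per 64 weights, or 0.5 extra bits per weight.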
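
The HiFi-GAN and Conv1d bullets above describe two tensor transformations. The sketch below shows both on nested Swift arrays: folding weight normalization (w = g * v / ||v||) and reordering a Conv1d weight from [out, in, kernel] to [out, kernel, in]. It assumes the norm is taken per output channel over all (in, kernel) entries, as in PyTorch's default weight normalization for convolutions, and that no channel has a zero norm. The function names are hypothetical; this is not the conversion script used for this model.

```swift
// Fold weight normalization into a plain weight tensor.
// v: direction tensor in PyTorch layout [out][in][kernel]; g: one gain per output channel.
func foldWeightNorm(v: [[[Float]]], g: [Float]) -> [[[Float]]] {
    precondition(v.count == g.count, "one gain per output channel expected")
    return zip(v, g).map { pair -> [[Float]] in
        let (channel, gain) = pair
        // L2 norm over all (in, kernel) entries of this output channel (assumed non-zero).
        let sumSquares = channel.joined().reduce(Float(0)) { $0 + $1 * $1 }
        let norm = sumSquares.squareRoot()
        return channel.map { row in row.map { gain * $0 / norm } }
    }
}

// Reorder [out][in][kernel] (PyTorch) to [out][kernel][in] (MLX).
func toMLXLayout(_ w: [[[Float]]]) -> [[[Float]]] {
    return w.map { channel -> [[Float]] in
        let kernelSize = channel.first?.count ?? 0
        return (0..<kernelSize).map { k in channel.map { $0[k] } }
    }
}
```

Folding first and transposing second keeps the norm computation in the original layout; the order does not affect the result, because the norm covers every entry of an output channel either way.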

Usage

For use with ivan-digital/qwen3-asr-swift:

import CosyVoiceTTS

let model = try await CosyVoiceTTSModel.fromPretrained()
let audio = model.synthesize(text: "Hello, how are you?", language: "english")
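
If the synthesized audio needs to be saved from Swift rather than through the CLI, something like the following could be used. This is a sketch under assumptions: it treats the audio as mono [Float] samples in the range [-1, 1] at the 24 kHz output rate, which this card does not confirm as the return type of synthesize, and wavData is a hypothetical helper, not part of the library.

```swift
import Foundation

// Hypothetical helper: encode mono Float samples in [-1, 1] as a 16-bit PCM WAV file.
func wavData(samples: [Float], sampleRate: UInt32 = 24_000) -> Data {
    var data = Data()
    // Append an integer in little-endian byte order, as the WAV format requires.
    func append<T: FixedWidthInteger>(_ value: T) {
        withUnsafeBytes(of: value.littleEndian) { data.append(contentsOf: $0) }
    }
    let byteCount = UInt32(samples.count * 2)       // 2 bytes per sample
    data.append(contentsOf: Array("RIFF".utf8))
    append(36 + byteCount)                          // size of everything after this field
    data.append(contentsOf: Array("WAVEfmt ".utf8))
    append(UInt32(16))                              // fmt chunk size
    append(UInt16(1))                               // audio format: PCM
    append(UInt16(1))                               // channels: mono
    append(sampleRate)
    append(sampleRate * 2)                          // byte rate = rate * block align
    append(UInt16(2))                               // block align: 2 bytes per frame
    append(UInt16(16))                              // bits per sample
    data.append(contentsOf: Array("data".utf8))
    append(byteCount)
    for s in samples {
        let clamped = max(-1, min(1, s))            // guard against overshoot
        append(Int16((clamped * 32767).rounded()))
    }
    return data
}
```

The result can be written with `try wavData(samples: samples).write(to: url)`, where `samples` and `url` are supplied by the caller.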

CLI

swift run cosyvoice-tts-cli --text "Hello, how are you?" --lang english --output hello.wav

License

Apache 2.0 (same as upstream CosyVoice 3)

Citation

@article{du2025cosyvoice3,
  title={CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training},
  author={Du, Zhihao and others},
  journal={arXiv preprint arXiv:2505.17589},
  year={2025}
}