Duplicated from aufklarer/CosyVoice3-0.5B-MLX-4bit

	---
	language:
	- zh
	- en
	- ja
	- ko
	- de
	- es
	- fr
	- it
	- ru
	license: apache-2.0
	tags:
	- tts
	- text-to-speech
	- speech-synthesis
	- mlx
	- apple-silicon
	- cosyvoice
	base_model: FunAudioLLM/Fun-CosyVoice3-0.5B-2512
	pipeline_tag: text-to-speech
	---

	# CosyVoice3-0.5B MLX 4-bit

	[CosyVoice 3](https://arxiv.org/abs/2505.17589) text-to-speech model converted to MLX safetensors format with 4-bit quantization for Apple Silicon inference.

	Converted from [FunAudioLLM/Fun-CosyVoice3-0.5B-2512](https://huggingface.co/FunAudioLLM/Fun-CosyVoice3-0.5B-2512).

	Swift inference: [ivan-digital/qwen3-asr-swift](https://github.com/ivan-digital/qwen3-asr-swift)

	## Model Details

	| Component | Architecture | Size |
	|-----------|--------------|------|
	| LLM | Qwen2.5-0.5B (24L, 896d, 14Q/2KV heads) | 467 MB (4-bit) |
	| DiT Flow Matching | 22-layer DiT (1024d, 16 heads, 10 ODE steps) | 634 MB (fp16) |
	| HiFi-GAN Vocoder | NSF + F0 predictor + ISTFT | 79 MB (fp16) |
	| Total | | ~1.2 GB |
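
The total in the table can be checked with a little arithmetic. The sketch below sums the three component sizes and also estimates the effective bits per weight for 4-bit quantization with `group_size=64`. The per-group overhead is an assumption (one fp16 scale and one fp16 bias per group, as in MLX-style affine quantization); the card itself does not state it, and the helper names are hypothetical.

```python
# Component sizes in MB, copied from the Model Details table.
COMPONENT_MB = {"llm_4bit": 467, "dit_flow_fp16": 634, "hifigan_fp16": 79}


def total_gb(sizes_mb, mb_per_gb=1000):
    """Sum component sizes and convert MB to GB (decimal units assumed)."""
    return sum(sizes_mb.values()) / mb_per_gb


def effective_bits_per_weight(bits=4, group_size=64, overhead_bits=32):
    """Bits per weight once per-group metadata is included.

    Assumption: each group of `group_size` weights also stores one fp16
    scale and one fp16 bias (16 + 16 = 32 bits of overhead per group).
    """
    return bits + overhead_bits / group_size


print(total_gb(COMPONENT_MB))        # 1.18, which the table rounds to ~1.2 GB
print(effective_bits_per_weight())   # 4.5
```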

	## Pipeline

	```
	Text → LLM (Qwen2.5-0.5B) → Speech Tokens (FSQ 6561) → DiT Flow Matching → Mel (80-band) → HiFi-GAN → Audio (24kHz)
	```
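
The speech-token vocabulary size of 6561 equals 3^8, which is consistent with a finite scalar quantizer (FSQ) that uses 8 dimensions with 3 levels each. The sketch below shows how such per-dimension levels could be packed into a single token id and unpacked again. The 8 x 3 factorization and the digit order are assumptions for illustration; the card only states the codebook size, and `fsq_index` / `fsq_digits` are hypothetical helpers, not part of any CosyVoice API.

```python
LEVELS = 3  # assumed number of quantization levels per dimension
DIMS = 8    # assumed number of FSQ dimensions; 3 ** 8 == 6561


def fsq_index(digits):
    """Pack per-dimension levels (each in 0..LEVELS-1) into one token id.

    Digits are treated as a base-LEVELS number, least significant first.
    """
    if len(digits) != DIMS or any(not 0 <= d < LEVELS for d in digits):
        raise ValueError("expected 8 digits, each in the range 0..2")
    index = 0
    for position, digit in enumerate(digits):
        index += digit * LEVELS ** position
    return index


def fsq_digits(index):
    """Unpack a token id back into its per-dimension levels."""
    if not 0 <= index < LEVELS ** DIMS:
        raise ValueError("token id out of range")
    digits = []
    for _ in range(DIMS):
        index, digit = divmod(index, LEVELS)
        digits.append(digit)
    return digits


print(LEVELS ** DIMS)          # 6561
print(fsq_index([2] * DIMS))   # 6560, the largest token id
```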

	## Languages

	Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian

	## Files

	- `llm.safetensors` — LLM weights (4-bit quantized)
	- `flow.safetensors` — DiT flow matching decoder (fp16)
	- `hifigan.safetensors` — HiFi-GAN vocoder (fp16, weight-norm folded)
	- `config.json` — Model configuration
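
Before loading the model, it can be useful to confirm that a local copy of the repository contains all four files listed above. This is a minimal sketch using only the Python standard library; `missing_files` is a hypothetical helper and the directory path is whatever location the files were downloaded to.

```python
from pathlib import Path

# File names taken from the Files list in this model card.
EXPECTED_FILES = (
    "llm.safetensors",
    "flow.safetensors",
    "hifigan.safetensors",
    "config.json",
)


def missing_files(model_dir):
    """Return the expected file names that are absent from `model_dir`."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list from `missing_files(path_to_model_dir)` means every expected file is present.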

	## Conversion Details

	- LLM: 4-bit quantization (group_size=64) of attention projections, MLP, and speech head
	- Flow: fp16 (flow matching is sensitive to quantization)
	- HiFi-GAN: fp16 with weight normalization folded (`w = g * v / ||v||`)
	- Conv1d weights transposed from PyTorch `[out, in, kernel]` to MLX `[out, kernel, in]`
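
The last two steps can be illustrated on plain nested lists. The sketch below folds weight normalization for a Conv1d weight and then reorders the axes from `[out, in, kernel]` to `[out, kernel, in]`. It assumes the norm is taken per output channel (the PyTorch `weight_norm` default of `dim=0`); the actual conversion script is not part of this card, so the function names here are hypothetical.

```python
import math


def fold_weight_norm(g, v):
    """Compute w = g * v / ||v|| separately for each output channel.

    g: one gain per output channel.
    v: nested list shaped [out][in][kernel].
    Assumption: the norm covers every element of a given output channel.
    """
    folded = []
    for gain, channel in zip(g, v):
        norm = math.sqrt(sum(x * x for row in channel for x in row))
        folded.append([[gain * x / norm for x in row] for row in channel])
    return folded


def to_mlx_layout(w):
    """Reorder [out][in][kernel] into [out][kernel][in]."""
    return [[list(column) for column in zip(*channel)] for channel in w]


# One output channel, two input channels, kernel size three.
v = [[[3.0, 4.0, 0.0], [0.0, 0.0, 0.0]]]
g = [10.0]

w = fold_weight_norm(g, v)   # ||v|| is 5, so the channel is scaled by 2
print(w)                     # [[[6.0, 8.0, 0.0], [0.0, 0.0, 0.0]]]
print(to_mlx_layout(w))      # [[[6.0, 0.0], [8.0, 0.0], [0.0, 0.0]]]
```

Folding at conversion time means the runtime only needs the single tensor `w` and does not have to recompute the normalization on every forward pass.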

	## Usage

	For use with [ivan-digital/qwen3-asr-swift](https://github.com/ivan-digital/qwen3-asr-swift):

	```swift
	import CosyVoiceTTS

	let model = try await CosyVoiceTTSModel.fromPretrained()
	let audio = model.synthesize(text: "Hello, how are you?", language: "english")
	```

	### CLI

	```bash
	swift run cosyvoice-tts-cli --text "Hello, how are you?" --lang english --output hello.wav
	```
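
To sanity-check the CLI output, the WAV header can be inspected for the 24 kHz sample rate given in the pipeline. This sketch uses Python's standard `wave` module and assumes the CLI writes an integer PCM WAV file; a float-encoded WAV would need a different reader. `wav_info` is a hypothetical helper.

```python
import wave


def wav_info(path):
    """Read sample rate, channel count, and duration from a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "seconds": frames / rate,
        }
```

For the command above, `wav_info("hello.wav")["sample_rate"]` would be expected to equal 24000.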

	## License

	Apache 2.0 (same as upstream CosyVoice 3)

	## Citation

	```bibtex
	@article{du2025cosyvoice3,
	title={CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training},
	author={Du, Zhihao and others},
	journal={arXiv preprint arXiv:2505.17589},
	year={2025}
	}
	```