	---
	license: apache-2.0
	language:
	- zh
	pipeline_tag: text-to-speech
	tags:
	- tts
	- cosyvoice3
	- coreml
	- apple-silicon
	- ane
	- mandarin
	library_name: fluidaudio
	---

	# CosyVoice3 (Mandarin) — CoreML Models for FluidAudio

	CoreML conversions of CosyVoice3's four inference stages, frozen to the exact
	shapes the [FluidAudio](https://github.com/FluidInference/FluidAudio) Swift
	package's `CosyVoice3TtsManager` loads at runtime. Targets Apple Silicon
	(M-series): CPU + Neural Engine for the LLM and HiFT stages, CPU + GPU for Flow.

	A default voice ships in `voices/` so the repo is self-contained. Additional
	voices (as they're extracted) live in the companion repo
	`FluidInference/cosyvoice3-voices-zh`.

	## Shipping configuration (frozen)

	Each model is shipped in two formats: `.mlpackage` (source, portable) and
	`.mlmodelc` (pre-compiled for macOS 14 / iOS 17 + Apple Silicon). Swift can
	load either; `.mlmodelc` skips the one-time compile step on first use
	(~20-30 s for Flow without it).
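
A minimal sketch of how a loader might prefer the pre-compiled bundle and fall back to compiling the source package. It assumes each model sits directly under the repo root as `<name>.mlmodelc` and `<name>.mlpackage`, which this README does not state. `preferredBundle` and `loadModel` are hypothetical helpers, not FluidAudio API. The CoreML part is guarded so the path logic also builds where CoreML is unavailable.

```swift
import Foundation
#if canImport(CoreML)
import CoreML
#endif

/// Hypothetical helper: returns `<name>.mlmodelc` if it exists under `root`,
/// otherwise `<name>.mlpackage` (assumed layout, not confirmed by the repo).
func preferredBundle(for name: String, in root: URL) -> URL {
    let compiled = root.appendingPathComponent("\(name).mlmodelc")
    if FileManager.default.fileExists(atPath: compiled.path) {
        return compiled
    }
    return root.appendingPathComponent("\(name).mlpackage")
}

#if canImport(CoreML)
/// Loads a model, compiling the source package first when no compiled bundle exists.
@available(macOS 13.0, iOS 16.0, *)
func loadModel(named name: String, in root: URL) async throws -> MLModel {
    var url = preferredBundle(for: name, in: root)
    if url.pathExtension == "mlpackage" {
        // One-time compile; CoreML returns the URL of a temporary .mlmodelc.
        url = try await MLModel.compileModel(at: url)
    }
    return try MLModel(contentsOf: url)
}
#endif
```

Deleting the `.mlmodelc` copies saves disk space, at the cost of taking the compile branch on first load.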

	| Model | Compute | Purpose | dtype |
	|---|---|---|---|
	| `LLM-Prefill-T256-M768-fp16` | CPU + ANE | Qwen2-0.5B prefill, 256-token context, 768-slot KV cache | fp16 |
	| `LLM-Decode-M768-fp16` | CPU + ANE | Single-step AR decode, 768-slot KV cache, 24 layers × 2 KV heads × 64 dim | fp16 |
	| `Flow-N250-fp16` | CPU + GPU | Speech-token → mel (80-bin, 24 kHz), N_total=250 | fp16 (pure CPU overflows fused LayerNorm → NaN; ANE refuses to compile; GPU path uses fp32 accumulators internally and is stable) |
	| `HiFT-T500-fp16` | CPU + ANE | Mel → 24 kHz PCM, T=500 frames | fp16 |
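
The compute column maps onto CoreML's `MLComputeUnits`. The sketch below transcribes the table into a lookup and builds an `MLModelConfiguration` per stage. `Placement`, `stagePlacement`, and `configuration(forStage:)` are illustrative names, and FluidAudio may configure its models differently.

```swift
import Foundation
#if canImport(CoreML)
import CoreML
#endif

/// Compute placement per stage, transcribed from the table above.
enum Placement: String {
    case cpuAndNeuralEngine
    case cpuAndGPU
}

let stagePlacement: [String: Placement] = [
    "LLM-Prefill-T256-M768-fp16": .cpuAndNeuralEngine,
    "LLM-Decode-M768-fp16": .cpuAndNeuralEngine,
    "Flow-N250-fp16": .cpuAndGPU,  // per the table, only the GPU path is stable in fp16
    "HiFT-T500-fp16": .cpuAndNeuralEngine,
]

#if canImport(CoreML)
/// Builds a CoreML configuration for one stage; returns nil for unknown names.
@available(macOS 13.0, iOS 16.0, *)
func configuration(forStage name: String) -> MLModelConfiguration? {
    guard let placement = stagePlacement[name] else { return nil }
    let config = MLModelConfiguration()
    switch placement {
    case .cpuAndNeuralEngine: config.computeUnits = .cpuAndNeuralEngine
    case .cpuAndGPU: config.computeUnits = .cpuAndGPU
    }
    return config
}
#endif
```

Note that `computeUnits` is a constraint on what CoreML may use, not a guarantee: CoreML can still run individual layers on the CPU.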

	Total footprint (`.mlmodelc` + `.mlpackage` + runtime tables): ~6.6 GB on
	disk. If you only need one format, delete the other after download.

	## Runtime tables

	`embeddings/`
	- `embeddings-runtime-fp32.safetensors` — 542 MB. Qwen2 `model.embed_tokens.weight`
	stored in its runtime dtype (fp32, after `.float()`). Required for bit-exact
	parity with the Python reference: shipping raw `.pt` weights introduces
	~4.7e-4 error through the HuggingFace dtype round-trip. Swift mmaps this file.
	- `speech_embedding-fp16.safetensors` — 12 MB. CosyVoice3 `speech_embedding`
	table (6761 × 896 fp16); row-lookup per decoded speech token.
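
The row lookup reduces to offset arithmetic over a flat fp16 buffer. The sketch below computes the byte range of one row relative to the start of the tensor data, after the safetensors header. It assumes a single row-major 6761 × 896 tensor with no padding, and `speechEmbeddingRowRange` is a hypothetical helper rather than FluidAudio API.

```swift
import Foundation

// Table geometry from the README: 6761 rows x 896 columns, fp16 (2 bytes each).
let speechVocabSize = 6761
let embeddingWidth = 896
let bytesPerFP16 = 2

/// Byte range of one row, relative to the start of the tensor's data
/// (after the safetensors header), assuming row-major layout.
/// Returns nil for token IDs outside the table.
func speechEmbeddingRowRange(token: Int) -> Range<Int>? {
    guard (0..<speechVocabSize).contains(token) else { return nil }
    let rowBytes = embeddingWidth * bytesPerFP16
    let start = token * rowBytes
    return start..<(start + rowBytes)
}

let speechTableBytes = speechVocabSize * embeddingWidth * bytesPerFP16
print(speechTableBytes)  // 12115712 bytes, consistent with the ~12 MB file size
```

Each row is 1792 bytes, so a memory-mapped file can serve a decoded token with a single contiguous read.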

	`voices/` — 11 zero-shot voice bundles (~1 MB total)
	- `cosyvoice3-default-zh.safetensors` — default voice from CosyVoice upstream
	`zero_shot_prompt.wav` (female speaker; prompt text "希望你以后能够做的比我还好呦。", roughly "I hope you will do even better than me in the future"; N_speech = 87).
	- `aishell3-zh-SSB*.safetensors` — 10 AISHELL-3 speakers bootstrapped via
	`verify/bootstrap_aishell3_voices.py` (5 female + 5 male, north + south
	accents). See `aishell3-bootstrap.json` for per-voice provenance.
	- Each `.safetensors` ships with a `.json` prompt-text sidecar and follows the
	schema documented in the companion `cosyvoice3-voices-zh` repo.

	`tokenizer/`
	- `vocab.json` + `merges.txt` + `tokenizer_config.json` — stock Qwen2 BPE
	tokenizer assets (copied from HuggingFace `FunAudioLLM/CosyVoice-BlankEN`).
	- `special_tokens.json` — 281 runtime-added CosyVoice3 special token → ID map
	(`<|endofprompt|>`, `[breath]`, ARPAbet phonemes, etc.). Covers IDs
	151643..151923.
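
A quick sanity check on the special-token map after download might look like the sketch below. It assumes `special_tokens.json` is a flat JSON object of token string to integer ID; the actual schema is not documented here, so treat `loadSpecialTokens` as a guess. Both function names are illustrative.

```swift
import Foundation

/// ID range the README gives for the CosyVoice3 special tokens.
let specialTokenIDs = 151643...151923

/// True when the map has exactly one token for every ID in the range.
func coversSpecialTokenRange(_ map: [String: Int]) -> Bool {
    return map.count == specialTokenIDs.count
        && Set(map.values) == Set(specialTokenIDs)
}

/// Decodes the file, ASSUMING a flat JSON object of token -> ID
/// (the real schema may differ).
func loadSpecialTokens(from url: URL) throws -> [String: Int] {
    let data = try Data(contentsOf: url)
    return try JSONDecoder().decode([String: Int].self, from: data)
}
```

The range 151643..151923 holds 281 IDs, which matches the token count stated above.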

	## Swift usage (FluidAudio)

	```swift
	import FluidAudio
	import Foundation

	// Placeholder path: point this at a local download of this repo.
	let modelsURL = URL(fileURLWithPath: "/path/to/repo")
	// Assumes the bundled `voices/` directory of this repo.
	let voiceURL = modelsURL.appendingPathComponent("voices")

	let manager = CosyVoice3TtsManager(
	    modelsDirectory: modelsURL,  // this repo root
	    tokenizerDirectory: modelsURL.appendingPathComponent("tokenizer"),
	    textEmbeddingsFile: modelsURL.appendingPathComponent(
	        "embeddings/embeddings-runtime-fp32.safetensors"),
	    specialTokensFile: modelsURL.appendingPathComponent("tokenizer/special_tokens.json"))
	try await manager.initialize()

	// Load the default zero-shot voice bundle.
	let prompt = try CosyVoice3PromptAssets.load(
	    from: voiceURL.appendingPathComponent("cosyvoice3-default-zh.safetensors"))

	let result = try await manager.synthesize(
	    text: "今天天气真的很不错，适合出门散步。",
	    promptAssets: prompt)
	// result.samples: [Float] @ 24 kHz mono
	```
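
To listen to the output, the `[Float]` samples can be wrapped in a WAV container. The sketch below is one way to do that with Foundation only, writing 16-bit PCM mono at 24 kHz. `wavData` is a hypothetical helper, not part of FluidAudio, which may ship its own audio writer.

```swift
import Foundation

/// Encodes mono Float samples in [-1, 1] as a 16-bit PCM WAV file image.
func wavData(samples: [Float], sampleRate: UInt32 = 24_000) -> Data {
    let bytesPerSample: UInt32 = 2
    let dataSize = UInt32(samples.count) * bytesPerSample
    var out = Data()

    // Appends an integer in little-endian byte order, as WAV requires.
    func put<T: FixedWidthInteger>(_ value: T) {
        withUnsafeBytes(of: value.littleEndian) { out.append(contentsOf: $0) }
    }

    out.append(contentsOf: Array("RIFF".utf8))
    put(36 + dataSize)                     // bytes remaining after this field
    out.append(contentsOf: Array("WAVEfmt ".utf8))
    put(UInt32(16))                        // fmt chunk size
    put(UInt16(1))                         // format tag: integer PCM
    put(UInt16(1))                         // channels: mono
    put(sampleRate)
    put(sampleRate * bytesPerSample)       // byte rate
    put(UInt16(bytesPerSample))            // block align
    put(UInt16(16))                        // bits per sample
    out.append(contentsOf: Array("data".utf8))
    put(dataSize)

    for sample in samples {
        // Clip before scaling so out-of-range values cannot overflow Int16.
        let clipped = max(-1, min(1, sample))
        put(Int16((clipped * 32767).rounded()))
    }
    return out
}
```

With the synthesis result above, `try wavData(samples: result.samples).write(to: outputURL)` would save the audio, where `outputURL` is any writable file URL you choose.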

	## Model graph quick reference

	- Qwen2 decoder: hidden=896, 24 layers, 14 Q heads, 2 KV heads, head_dim=64
	- Speech vocab: 6761 (6561 speech tokens + 200 slots for sos/eos/task_id/stop tokens)
	- SOS=6561, EOS=6562, TASK_ID=6563
	- Flow: 80-bin mel @ 24 kHz, hop=480, n_fft=1920
	- HiFT: iSTFT-based vocoder, upsamples mel to 24 kHz PCM
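
The numbers above imply a few derived sizes worth checking when allocating buffers. The sketch below is plain arithmetic on the listed constants. The KV-cache figure assumes separate K and V tensors in fp16 with no padding, which the actual CoreML state layout may not match.

```swift
import Foundation

// KV cache: layers x (K and V) x KV heads x slots x head_dim x 2 bytes (fp16).
let kvCacheBytes = 24 * 2 * 2 * 768 * 64 * 2
print(kvCacheBytes)  // 9437184 bytes = 9 MiB

// Grouped-query attention: 14 Q heads share 2 KV heads.
let queryHeadsPerKVHead = 14 / 2   // 7
let queryWidth = 14 * 64           // 896, equal to the hidden size

// Mel timing: 24 kHz sample rate with a hop of 480 samples.
let sampleRateHz = 24_000
let hopSamples = 480
let melFramesPerSecond = sampleRateHz / hopSamples  // 50
let hiftWindowSeconds = 500 / melFramesPerSecond    // 10 s for T=500
let hiftWindowSamples = 500 * hopSamples            // 240000 samples

// Speech vocab: 6761 rows minus 6561 speech tokens leaves 200 control slots.
let controlSlots = 6761 - 6561
```

At 50 mel frames per second, the frozen HiFT shape of T=500 frames corresponds to 10 seconds of audio per call.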

	## License

	Apache-2.0. Derived from FunAudioLLM/CosyVoice3 weights; see upstream license.