---
license: apache-2.0
language:
- zh
pipeline_tag: text-to-speech
tags:
- tts
- cosyvoice3
- coreml
- apple-silicon
- ane
- mandarin
library_name: fluidaudio
---

# CosyVoice3 (Mandarin): CoreML Models for FluidAudio

CoreML conversions of CosyVoice3's four inference stages, frozen to the exact
shapes that the [FluidAudio](https://github.com/FluidInference/FluidAudio) Swift
package's `CosyVoice3TtsManager` loads at runtime. They target Apple Silicon
(M-series): the LLM and HiFT stages run on CPU + Neural Engine, and Flow runs
on CPU + GPU.

A default voice and ten AISHELL-3 voices ship in `voices/` so the repo is
self-contained. Additional voices (as they're extracted) live in the companion
repo `FluidInference/cosyvoice3-voices-zh`.

## Shipping configuration (frozen)

Each model ships in two formats: `.mlpackage` (source, portable) and
`.mlmodelc` (pre-compiled for macOS 14 / iOS 17 + Apple Silicon). Swift can
load either; `.mlmodelc` skips the one-time compile step on first use
(~20-30 s for Flow without it).

| Model | Compute | Purpose | dtype |
|---|---|---|---|
| `LLM-Prefill-T256-M768-fp16` | CPU + ANE | Qwen2-0.5B prefill, 256-token context, 768-slot KV cache | fp16 |
| `LLM-Decode-M768-fp16` | CPU + ANE | Single-step AR decode, 768-slot KV cache, 24 layers × 2 KV heads × 64 dim | fp16 |
| `Flow-N250-fp16` | CPU + GPU | Speech-token → mel (80-bin, 24 kHz), N_total=250 | fp16 (pure CPU overflows in the fused LayerNorm and produces NaN; ANE refuses to compile; the GPU path uses fp32 accumulators internally and is stable) |
| `HiFT-T500-fp16` | CPU + ANE | Mel → 24 kHz PCM, T=500 frames | fp16 |

Total disk footprint (`.mlmodelc` + `.mlpackage` + runtime tables): ~6.6 GB.
If you only need one format, delete the other after download.

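As a rough illustration of the table above, the sketch below picks an `MLComputeUnits` value per model and loads the pre-compiled `.mlmodelc` when it exists, falling back to compiling the `.mlpackage`. It uses only public CoreML calls. `computeUnits(forModelNamed:)` and `loadModel(named:from:)` are hypothetical helpers, not FluidAudio API, and the assumption that each model sits directly in `directory` under the names from the table is not confirmed by this README. `CosyVoice3TtsManager` does its own loading, so this only matters if you want to inspect a model by hand. It assumes macOS 13 / iOS 16 or later, which `.cpuAndNeuralEngine` requires.

```swift
import CoreML
import Foundation

/// Compute units per shipped model, mirroring the table above.
/// Hypothetical helper: Flow is the only stage routed to the GPU.
func computeUnits(forModelNamed name: String) -> MLComputeUnits {
    return name.hasPrefix("Flow") ? .cpuAndGPU : .cpuAndNeuralEngine
}

/// Loads `<name>.mlmodelc` if present, otherwise compiles `<name>.mlpackage` first.
/// ASSUMPTION: both formats live directly inside `directory`.
func loadModel(named name: String, from directory: URL) async throws -> MLModel {
    let config = MLModelConfiguration()
    config.computeUnits = computeUnits(forModelNamed: name)

    let compiled = directory.appendingPathComponent("\(name).mlmodelc")
    if FileManager.default.fileExists(atPath: compiled.path) {
        return try MLModel(contentsOf: compiled, configuration: config)
    }

    // One-time compile step; CoreML writes the result to a temporary location.
    let package = directory.appendingPathComponent("\(name).mlpackage")
    let compiledURL = try await MLModel.compileModel(at: package)
    return try MLModel(contentsOf: compiledURL, configuration: config)
}
```

Keeping the compute-unit choice in one small function makes the Flow exception easy to see and to change if a later conversion behaves differently.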
## Runtime tables

`embeddings/`
- `embeddings-runtime-fp32.safetensors`: 542 MB. Qwen2 `model.embed_tokens.weight`
  at **runtime** (post-`.float()`) dtype. Required for bit-exact parity with
  the Python reference; shipping raw `.pt` weights introduces ~4.7e-4 error
  through the HuggingFace dtype round-trip. Swift mmaps this file.
- `speech_embedding-fp16.safetensors`: 12 MB. CosyVoice3 `speech_embedding`
  table (6761 × 896 fp16); one row lookup per decoded speech token.

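The per-token row lookup in the speech-embedding table comes down to simple offset arithmetic. The sketch below works it through for the 6761 × 896 fp16 shape quoted above. `TableLayout` is a hypothetical type, and it assumes a row-major payload; the offset of that payload inside the `.safetensors` file (after its header) would have to be added separately.

```swift
import Foundation

/// Layout of a dense 2-D table stored row-major (ASSUMPTION, see lead-in).
struct TableLayout {
    let rows: Int
    let columns: Int
    let bytesPerElement: Int

    /// Total payload size in bytes, excluding any file header.
    var byteCount: Int { rows * columns * bytesPerElement }

    /// Byte range of one row inside the payload, or nil if the row is out of range.
    func byteRange(forRow row: Int) -> Range<Int>? {
        guard (0..<rows).contains(row) else { return nil }
        let stride = columns * bytesPerElement
        return (row * stride)..<((row + 1) * stride)
    }
}

// Shape from this README: 6761 rows, 896 columns, 2 bytes per fp16 element.
let speechEmbedding = TableLayout(rows: 6761, columns: 896, bytesPerElement: 2)
print(speechEmbedding.byteCount)                 // 12115712 bytes, about 12 MB
print(speechEmbedding.byteRange(forRow: 6561)!)  // 11757312..<11759104
```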

`voices/`: 11 zero-shot voice bundles (~1 MB total)
- `cosyvoice3-default-zh.safetensors`: default voice from CosyVoice upstream
  `zero_shot_prompt.wav` (female; prompt text 希望你以后能够做的比我还好呦。,
  roughly "I hope you can do even better than me in the future";
  N_speech = 87).
- `aishell3-zh-SSB*.safetensors`: 10 AISHELL-3 speakers bootstrapped via
  `verify/bootstrap_aishell3_voices.py` (5 female + 5 male, north + south
  accents). See `aishell3-bootstrap.json` for per-voice provenance.
- Each `.safetensors` ships with a `.json` prompt-text sidecar and follows the
  schema documented in the companion `cosyvoice3-voices-zh` repo.

`tokenizer/`
- `vocab.json` + `merges.txt` + `tokenizer_config.json`: stock Qwen2 BPE
  tokenizer assets (copied from HuggingFace `FunAudioLLM/CosyVoice-BlankEN`).
- `special_tokens.json`: map from each of the 281 CosyVoice3 special tokens
  added at runtime to its ID (`<|endofprompt|>`, `[breath]`, ARPAbet
  phonemes, etc.). Covers IDs 151643..151923.

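A quick sanity check on the special-token table can be sketched as below: the ID range 151643..151923 is inclusive, which gives exactly 281 entries, and every loaded ID should fall inside it. The flat `{"token": id}` JSON layout is an assumption about `special_tokens.json`, `loadSpecialTokens(from:expecting:)` is a hypothetical helper, and the IDs in the sample payload are placeholders rather than the real assignments.

```swift
import Foundation

/// Decodes a token-to-ID map and rejects any ID outside the expected closed range.
/// ASSUMPTION: the file is a flat JSON object of the form {"token": id, ...}.
func loadSpecialTokens(from data: Data,
                       expecting range: ClosedRange<Int>) throws -> [String: Int] {
    let map = try JSONDecoder().decode([String: Int].self, from: data)
    for (token, id) in map where !range.contains(id) {
        throw NSError(domain: "SpecialTokens", code: 1,
                      userInfo: [NSLocalizedDescriptionKey:
                                 "\(token) has out-of-range ID \(id)"])
    }
    return map
}

// Inclusive range quoted in this README.
let specialTokenIDs = 151643...151923
print(specialTokenIDs.count)  // 281

// Tiny stand-in payload; these two IDs are placeholders, not the shipped values.
let sample = Data(#"{"<|endofprompt|>": 151900, "[breath]": 151901}"#.utf8)
let tokens = try loadSpecialTokens(from: sample, expecting: specialTokenIDs)
print(tokens.count)  // 2
```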

## Swift usage (FluidAudio)

```swift
import FluidAudio
import Foundation

// `modelsURL` is the file URL of a local copy of this repo (caller-supplied).
let manager = CosyVoice3TtsManager(
    modelsDirectory: modelsURL,  // this repo root
    tokenizerDirectory: modelsURL.appendingPathComponent("tokenizer"),
    textEmbeddingsFile: modelsURL.appendingPathComponent("embeddings/embeddings-runtime-fp32.safetensors"),
    specialTokensFile: modelsURL.appendingPathComponent("tokenizer/special_tokens.json"))
try await manager.initialize()

// Assumed here to be this repo's `voices/` directory, where the default voice ships.
let voiceURL = modelsURL.appendingPathComponent("voices")
let prompt = try CosyVoice3PromptAssets.load(
    from: voiceURL.appendingPathComponent("cosyvoice3-default-zh.safetensors"))

// Text: "The weather is really nice today, great for going out for a walk."
let result = try await manager.synthesize(
    text: "今天天气真的很不错，适合出门散步。",
    promptAssets: prompt)
// result.samples: [Float] @ 24 kHz mono
```

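To listen to the output, the `[Float]` samples need to be wrapped in an audio container. The sketch below is one minimal way to do that with Foundation only: it encodes mono samples as a 16-bit PCM WAV image at 24 kHz. `wavData(samples:sampleRate:)` is a hypothetical helper, not part of FluidAudio, and it assumes the samples are nominally in the range -1...1.

```swift
import Foundation

/// Encodes mono Float samples (nominally in -1...1) as a 16-bit PCM WAV file image.
func wavData(samples: [Float], sampleRate: Int = 24_000) -> Data {
    let bytesPerSample = 2
    let dataSize = samples.count * bytesPerSample
    var out = Data(capacity: 44 + dataSize)

    // Appends an integer in little-endian byte order, as RIFF requires.
    func append<T: FixedWidthInteger>(_ value: T) {
        withUnsafeBytes(of: value.littleEndian) { out.append(contentsOf: $0) }
    }

    out.append(contentsOf: Array("RIFF".utf8))
    append(UInt32(36 + dataSize))                // size of everything after this field
    out.append(contentsOf: Array("WAVEfmt ".utf8))
    append(UInt32(16))                           // fmt chunk size
    append(UInt16(1))                            // format tag 1 = integer PCM
    append(UInt16(1))                            // channels: mono
    append(UInt32(sampleRate))
    append(UInt32(sampleRate * bytesPerSample))  // byte rate
    append(UInt16(bytesPerSample))               // block align
    append(UInt16(16))                           // bits per sample
    out.append(contentsOf: Array("data".utf8))
    append(UInt32(dataSize))

    for sample in samples {
        // Clamp before scaling so out-of-range values cannot trap on conversion.
        let clamped = max(-1, min(1, sample))
        append(Int16(clamped * 32767))
    }
    return out
}
```

With the `result` from the snippet above, something like `try wavData(samples: result.samples).write(to: outputURL)` would save the audio, where `outputURL` is a file URL of your choosing.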

## Model graph quick reference

- Qwen2 decoder: hidden=896, 24 layers, 14 Q heads, 2 KV heads, head_dim=64
- Speech vocab: 6761 (6561 tokens + sos/eos/task_id/stops)
- SOS=6561, EOS=6562, TASK_ID=6563
- Flow: 80-bin mel @ 24 kHz, hop=480, n_fft=1920
- HiFT: iSTFT-based vocoder, upsamples mel to 24 kHz PCM

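The figures in this list can be cross-checked with a little arithmetic, sketched below. All inputs come from this README. The KV-cache size additionally assumes separate fp16 K and V tensors for each layer, and the audio-length figure assumes each mel frame corresponds to `hop` output samples; neither assumption is stated by the source.

```swift
import Foundation

// Figures quoted in this README.
let hidden = 896, layers = 24, queryHeads = 14, kvHeads = 2, headDim = 64
let cacheSlots = 768
let sampleRate = 24_000, hop = 480, hiftFrames = 500

// The query heads tile the hidden size exactly: 14 * 64 = 896.
precondition(queryHeads * headDim == hidden)

// Grouped-query attention: number of query heads sharing each KV head.
let queriesPerKVHead = queryHeads / kvHeads

// KV cache size in bytes. ASSUMPTION: separate K and V tensors, 2 bytes (fp16) each.
let kvCacheBytes = 2 * layers * kvHeads * cacheSlots * headDim * 2

// Mel frame rate, and the audio covered by one 500-frame HiFT window.
// ASSUMPTION: one mel frame maps to `hop` PCM samples.
let framesPerSecond = sampleRate / hop
let hiftSamples = hiftFrames * hop
let hiftSeconds = Double(hiftSamples) / Double(sampleRate)

print(queriesPerKVHead)  // 7
print(kvCacheBytes)      // 9437184, i.e. 9 MiB
print(framesPerSecond)   // 50
print(hiftSamples)       // 240000
print(hiftSeconds)       // 10.0
```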

## License

Apache-2.0. Derived from FunAudioLLM/CosyVoice3 weights; see the upstream license.