---
license: apache-2.0
language:
  - zh
pipeline_tag: text-to-speech
tags:
  - tts
  - cosyvoice3
  - coreml
  - apple-silicon
  - ane
  - mandarin
library_name: fluidaudio
---

# CosyVoice3 (Mandarin): CoreML Models for FluidAudio

CoreML conversions of CosyVoice3's four inference stages, frozen to the exact
shapes that the `CosyVoice3TtsManager` in the
[FluidAudio](https://github.com/FluidInference/FluidAudio) Swift package loads
at runtime. Targets Apple Silicon (M-series): the Neural Engine runs the LLM
and HiFT stages, and the GPU runs Flow.

`voices/` ships a default voice plus 10 AISHELL-3 speakers, so the repo is
self-contained. Additional voices, as they are extracted, live in the
companion repo `FluidInference/cosyvoice3-voices-zh`.

## Shipping configuration (frozen)

Each model is shipped in two formats: `.mlpackage` (source, portable) and
`.mlmodelc` (pre-compiled for macOS 14 / iOS 17 + Apple Silicon). Swift can
load either; `.mlmodelc` skips the one-time compile step on first use
(~20-30 s for Flow without it).

| Model | Compute | Purpose | dtype |
|---|---|---|---|
| `LLM-Prefill-T256-M768-fp16` | CPU + ANE | Qwen2-0.5B prefill, 256-token context, 768-slot KV cache | fp16 |
| `LLM-Decode-M768-fp16` | CPU + ANE | Single-step AR decode, 768-slot KV cache, 24 layers × 2 KV heads × 64 dim | fp16 |
| `Flow-N250-fp16` | CPU + GPU | Speech-token → mel (80-bin, 24 kHz), N_total=250 | fp16 (pure CPU overflows fused LayerNorm → NaN; ANE refuses to compile; GPU path uses fp32 accumulators internally and is stable) |
| `HiFT-T500-fp16` | CPU + ANE | Mel → 24 kHz PCM, T=500 frames | fp16 |

Total disk footprint (`.mlmodelc` + `.mlpackage` + runtime tables): ~6.6 GB.
If you only need one format, delete the other after download.
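The format choice and the per-model compute placement from the table above can be sketched in plain CoreML. This is only an illustration, not FluidAudio's actual loader: the `Placement` enum and the `preferredModelURL` and `loadModel` helpers are hypothetical names, and the sketch assumes each model sits in one directory as `<name>.mlmodelc` or `<name>.mlpackage`.

```swift
import Foundation
#if canImport(CoreML)
import CoreML
#endif

/// Compute placement per model, copied from the table above.
enum Placement { case cpuAndNeuralEngine, cpuAndGPU }

let placements: [String: Placement] = [
    "LLM-Prefill-T256-M768-fp16": .cpuAndNeuralEngine,
    "LLM-Decode-M768-fp16": .cpuAndNeuralEngine,
    "Flow-N250-fp16": .cpuAndGPU,
    "HiFT-T500-fp16": .cpuAndNeuralEngine,
]

/// Prefer the pre-compiled bundle; fall back to the portable package.
/// Assumes the hypothetical layout `<directory>/<name>.<ext>`.
func preferredModelURL(named name: String, in directory: URL,
                       fileManager: FileManager = .default) -> URL? {
    for ext in ["mlmodelc", "mlpackage"] {
        let url = directory.appendingPathComponent("\(name).\(ext)")
        if fileManager.fileExists(atPath: url.path) { return url }
    }
    return nil
}

#if canImport(CoreML)
/// Loads one model with the compute units listed in the table.
func loadModel(named name: String, in directory: URL) throws -> MLModel? {
    guard let url = preferredModelURL(named: name, in: directory),
          let placement = placements[name] else { return nil }
    let config = MLModelConfiguration()
    switch placement {
    case .cpuAndGPU: config.computeUnits = .cpuAndGPU
    case .cpuAndNeuralEngine: config.computeUnits = .cpuAndNeuralEngine
    }
    // A .mlpackage has to be compiled before loading; this is the one-time
    // step that shipping .mlmodelc avoids. (Synchronous variant shown;
    // newer SDKs also offer an async form.)
    let compiledURL = url.pathExtension == "mlmodelc"
        ? url
        : try MLModel.compileModel(at: url)
    return try MLModel(contentsOf: compiledURL, configuration: config)
}
#endif
```

With both formats present, `preferredModelURL` returns the `.mlmodelc` bundle, so deleting the `.mlpackage` copies changes nothing at load time under this sketch.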

## Runtime tables

`embeddings/`
- `embeddings-runtime-fp32.safetensors` (542 MB): Qwen2 `model.embed_tokens.weight`
  at **runtime** (post-`.float()`) dtype. Required for bit-exact parity with
  the Python reference; shipping raw `.pt` weights introduces ~4.7e-4 error
  through the HuggingFace dtype round-trip. Swift mmaps this file.
- `speech_embedding-fp16.safetensors` (12 MB): CosyVoice3 `speech_embedding`
  table (6761 × 896 fp16); row-lookup per decoded speech token.
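The per-token row lookup amounts to offset arithmetic over a flat table. The sketch below assumes a row-major 6761 × 896 fp16 layout and computes offsets relative to the start of the tensor data (a safetensors file puts a length-prefixed JSON header before that data, which is not handled here). `rowByteRange` is a hypothetical helper, not a FluidAudio API.

```swift
import Foundation

let speechVocab = 6761      // rows, from the table shape above
let hiddenSize = 896        // columns
let bytesPerElement = 2     // fp16

/// Byte range of one embedding row inside a flat, row-major fp16 buffer.
/// Returns nil for token IDs outside the speech vocabulary.
func rowByteRange(forToken token: Int) -> Range<Int>? {
    guard (0..<speechVocab).contains(token) else { return nil }
    let rowBytes = hiddenSize * bytesPerElement   // 1792 bytes per row
    let start = token * rowBytes
    return start..<(start + rowBytes)
}

// Whole-table size: 6761 * 896 * 2 = 12,115,712 bytes, i.e. the ~12 MB above.
let tableBytes = speechVocab * hiddenSize * bytesPerElement
print(tableBytes)
```

Because every row has the same width, a memory-mapped file can serve each lookup with a single slice and no copy of the full table.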

`voices/`: 11 zero-shot voice bundles (~1 MB total)
- `cosyvoice3-default-zh.safetensors`: default voice from CosyVoice upstream
  `zero_shot_prompt.wav` (female, 希望你以后能够做的比我还好呦。, N_speech = 87).
- `aishell3-zh-SSB*.safetensors`: 10 AISHELL-3 speakers bootstrapped via
  `verify/bootstrap_aishell3_voices.py` (5 female + 5 male, north + south
  accents). See `aishell3-bootstrap.json` for per-voice provenance.
- Each `.safetensors` ships with a `.json` prompt-text sidecar and follows the
  schema documented in the companion `cosyvoice3-voices-zh` repo.

`tokenizer/`
- `vocab.json` + `merges.txt` + `tokenizer_config.json`: stock Qwen2 BPE
  tokenizer assets (copied from HuggingFace `FunAudioLLM/CosyVoice-BlankEN`).
- `special_tokens.json`: map of the 281 runtime-added CosyVoice3 special tokens
  to their IDs (`<|endofprompt|>`, `[breath]`, ARPAbet phonemes, etc.). Covers
  IDs 151643..151923.
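A quick sanity check on the special-token map can catch a truncated or mismatched download. The sketch below assumes `special_tokens.json` is a flat `{"token": id}` JSON object, which this README does not spell out; `coversExactly` and `loadSpecialTokens` are hypothetical helper names.

```swift
import Foundation

/// True when the map's IDs are unique and fill the given range exactly.
func coversExactly(_ map: [String: Int], range: ClosedRange<Int>) -> Bool {
    let ids = Set(map.values)
    return ids.count == map.count                 // no duplicate IDs
        && ids.count == range.count               // right number of entries
        && ids.allSatisfy { range.contains($0) }  // nothing outside the range
}

// The documented ID range holds 151923 - 151643 + 1 = 281 entries.
let expectedRange = 151643...151923
print(expectedRange.count)

/// Hypothetical loader; assumes a flat {"token": id} JSON object.
func loadSpecialTokens(from url: URL) throws -> [String: Int] {
    try JSONDecoder().decode([String: Int].self, from: Data(contentsOf: url))
}
```

Usage would look like `coversExactly(try loadSpecialTokens(from: fileURL), range: expectedRange)`, where `fileURL` points at the downloaded file.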

## Swift usage (FluidAudio)

```swift
import FluidAudio

let manager = CosyVoice3TtsManager(
    modelsDirectory:     modelsURL,                            // this repo root
    tokenizerDirectory:  modelsURL.appendingPathComponent("tokenizer"),
    textEmbeddingsFile:  modelsURL.appendingPathComponent("embeddings/embeddings-runtime-fp32.safetensors"),
    specialTokensFile:   modelsURL.appendingPathComponent("tokenizer/special_tokens.json"))
try await manager.initialize()

// voiceURL: directory holding the voice bundles (for example this repo's `voices/` folder).
let prompt = try CosyVoice3PromptAssets.load(
    from: voiceURL.appendingPathComponent("cosyvoice3-default-zh.safetensors"))

let result = try await manager.synthesize(
    text: "今天天气真的很不错,适合出门散步。",
    promptAssets: prompt)
// result.samples: [Float] @ 24 kHz mono
```
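To listen to the output, the samples can be wrapped in a WAV container. The following is a minimal Foundation-only sketch, not part of FluidAudio; it assumes the samples are normalized to [-1, 1] (the README only states `[Float]` at 24 kHz mono) and writes 16-bit PCM. `wavData` is a hypothetical helper name.

```swift
import Foundation

/// Encodes mono Float samples as an in-memory 16-bit PCM WAV file.
/// Assumes samples are in [-1, 1]; values outside are clamped.
func wavData(from samples: [Float], sampleRate: UInt32 = 24_000) -> Data {
    let bytesPerSample: UInt32 = 2
    let dataSize = UInt32(samples.count) * bytesPerSample
    var data = Data()

    // Appends any fixed-width integer in little-endian byte order.
    func append<T: FixedWidthInteger>(_ value: T) {
        withUnsafeBytes(of: value.littleEndian) { data.append(contentsOf: $0) }
    }

    data.append(contentsOf: Array("RIFF".utf8))
    append(36 + dataSize)                  // remaining file size after this field
    data.append(contentsOf: Array("WAVE".utf8))
    data.append(contentsOf: Array("fmt ".utf8))
    append(UInt32(16))                     // fmt chunk size
    append(UInt16(1))                      // format tag: PCM
    append(UInt16(1))                      // channels: mono
    append(sampleRate)
    append(sampleRate * bytesPerSample)    // byte rate
    append(UInt16(bytesPerSample))         // block align
    append(UInt16(16))                     // bits per sample
    data.append(contentsOf: Array("data".utf8))
    append(dataSize)

    for sample in samples {
        let clamped = max(-1, min(1, sample))
        append(Int16((clamped * 32767).rounded()))
    }
    return data
}
```

With the result from the snippet above, `try wavData(from: result.samples).write(to: outputURL)` would save a playable file, where `outputURL` is any writable file URL you choose.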

## Model graph quick reference

- Qwen2 decoder: hidden=896, 24 layers, 14 Q heads, 2 KV heads, head_dim=64
- Speech vocab: 6761 (6561 tokens + sos/eos/task_id/stops)
- SOS=6561, EOS=6562, TASK_ID=6563
- Flow: 80-bin mel @ 24 kHz, hop=480, n_fft=1920
- HiFT: iSTFT-based vocoder, upsamples mel to 24 kHz PCM
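A few derived quantities follow from these numbers and the shipping table. The sketch below is back-of-the-envelope arithmetic under stated assumptions: one mel frame maps to `hop` output samples, the KV cache stores both K and V in fp16, and the cache length is the 768 slots listed for the LLM models. None of the variable names come from FluidAudio.

```swift
// Audio timing: hop=480 at 24 kHz, HiFT frozen to T=500 mel frames.
let sampleRate = 24_000
let hop = 480
let hiftFrames = 500
let framesPerSecond = sampleRate / hop                        // 50 mel frames per second
let samplesPerCall = hiftFrames * hop                         // 240,000 samples
let secondsPerCall = Double(samplesPerCall) / Double(sampleRate)  // 10 s of audio

// Attention geometry: 14 query heads of width 64 give the hidden size.
let qHeads = 14, kvHeads = 2, headDim = 64, layers = 24
let hidden = qHeads * headDim                                 // 896

// KV cache: K and V, per layer, per KV head, per slot (assumed fp16 storage).
let cacheSlots = 768
let kvElements = 2 * layers * kvHeads * headDim * cacheSlots  // 4,718,592 elements
let kvBytes = kvElements * 2                                  // 9,437,184 bytes (9 MiB)

// Speech vocabulary: 6561 tokens plus sos/eos/task_id/stop entries.
let speechVocab = 6761
let extraEntries = speechVocab - 6561                         // 200

print(secondsPerCall, hidden, kvBytes, extraEntries)
```

Under these assumptions a single fixed-shape HiFT call covers at most 10 s of audio, and the decode-step KV cache stays under 10 MB thanks to the 2-head grouped-query layout.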

## License

Apache-2.0. Derived from FunAudioLLM/CosyVoice3 weights; see upstream license.