---
license: apache-2.0
tags:
- coreml
- tts
- text-to-speech
- apple
- qwen3
language:
- en
- zh
library_name: coremltools
---

# Qwen3-TTS CoreML

CoreML conversion of [Qwen/Qwen3-TTS](https://huggingface.co/Qwen/Qwen3-TTS) (0.6B) for on-device inference on Apple platforms.

Supports English and Chinese text-to-speech synthesis.

## Models

| Model | Description | Size |
|-------|-------------|------|
| `qwen3_tts_lm_prefill_v9` | LM KV-cache prefill (text + speaker conditioning) | ~2.8 GB |
| `qwen3_tts_lm_decode_v10` | Autoregressive LM decode (CB0 codec token generation) | ~1.8 GB |
| `qwen3_tts_cp_prefill` | Code predictor prefill (CB1-15 conditioning) | ~432 MB |
| `qwen3_tts_cp_decode` | Code predictor decode (CB1-15 generation) | ~420 MB |
| `qwen3_tts_decoder_10s` | Audio decoder (16-codebook codes → 24 kHz waveform) | ~436 MB |
| `speaker_embedding_official.npy` | Default speaker embedding (1024-dim) | 4 KB |

**Total: ~5.9 GB**

## Pipeline

```
Text tokens + Speaker embedding
    ↓
LM Prefill (KV cache initialization)
    ↓
LM Decode (CB0 codec tokens, temperature=0.9, top_k=50)
    ↓
Code Predictor Prefill + Decode (CB1-15 per frame)
    ↓
Audio Decoder (16 codebooks → 24 kHz waveform)
    ↓
Silence trimming → Final audio
```
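
The per-frame control flow in the diagram can be sketched as follows. This is a minimal illustration, not FluidAudio's implementation: `PipelineStages`, `nextCB0`, `predictResidual`, and `generateFrames` are hypothetical names, the closures stand in for the CoreML models, and stopping when CB0 equals the EOS token is an assumption based on the parameters listed in this card.

```swift
/// Hypothetical stand-ins for the CoreML stages; these names are not part of FluidAudio.
struct PipelineStages {
    /// LM prefill/decode: returns the next CB0 token for the given frame index.
    var nextCB0: (Int) -> Int
    /// Code predictor: returns the 15 tokens for CB1-15 given a CB0 token.
    var predictResidual: (Int) -> [Int]
}

/// Runs the per-frame loop: one CB0 token from the LM, then CB1-15 from the code predictor.
/// Stops at EOS (assumed behavior) or after `maxFrames` frames.
/// Returns one 16-element row of codebook tokens per generated frame.
func generateFrames(
    stages: PipelineStages,
    maxFrames: Int = 125,
    eosTokenID: Int = 2150
) -> [[Int]] {
    var frames: [[Int]] = []
    for step in 0..<maxFrames {
        let cb0 = stages.nextCB0(step)
        if cb0 == eosTokenID { break }
        let residual = stages.predictResidual(cb0)
        precondition(residual.count == 15, "expected 15 tokens for CB1-15")
        frames.append([cb0] + residual)
    }
    return frames
}

// Toy usage: fake stages that emit EOS on the fourth step.
let toy = PipelineStages(
    nextCB0: { step in step < 3 ? 100 + step : 2150 },
    predictResidual: { cb0 in Array(repeating: cb0 % 7, count: 15) }
)
let frames = generateFrames(stages: toy)
print(frames.count)  // 3
```

In the real pipeline the resulting 16-codebook rows would then go to the audio decoder, followed by silence trimming.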

## Key Parameters

- **Sample rate:** 24,000 Hz
- **Codebooks:** 16 (CB0 from LM, CB1-15 from code predictor)
- **Max codec tokens:** 125 frames (~10 s of audio)
- **Sampling:** temperature=0.9, top_k=50 (both CB0 and CB1-15)
- **EOS token ID:** 2150 (in codec logit space)
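
For readers unfamiliar with the sampling settings, here is a generic sketch of temperature plus top-k sampling over a logit vector. It shows the standard technique only; the function name `sampleTopK` and its signature are assumptions for illustration, and the actual FluidAudio code may differ in detail.

```swift
import Foundation

/// Samples a token index from raw logits using temperature scaling and top-k filtering.
/// Generic sketch; defaults mirror the values listed in this card.
func sampleTopK<G: RandomNumberGenerator>(
    logits: [Float],
    temperature: Float = 0.9,
    topK: Int = 50,
    using rng: inout G
) -> Int {
    precondition(!logits.isEmpty && temperature > 0 && topK > 0)
    // Keep the indices of the k largest logits, highest first.
    let k = min(topK, logits.count)
    let top = Array(logits.indices.sorted { logits[$0] > logits[$1] }.prefix(k))
    // Unnormalized softmax over the kept logits; subtracting the max keeps exp() stable.
    let maxLogit = logits[top[0]]
    let weights = top.map { exp((logits[$0] - maxLogit) / temperature) }
    let total = weights.reduce(0, +)
    // Draw a point in [0, total) and walk the cumulative weights.
    var r = Float.random(in: 0..<total, using: &rng)
    for (index, weight) in zip(top, weights) {
        if r < weight { return index }
        r -= weight
    }
    return top[top.count - 1]  // fallback for floating-point rounding
}

var rng = SystemRandomNumberGenerator()
let greedy = sampleTopK(logits: [0.1, 2.0, -1.0], topK: 1, using: &rng)
print(greedy)  // 1, because top_k = 1 reduces to argmax
```

Taking the random number generator as a parameter makes the function easy to test with a seeded generator.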

## Usage

```swift
import FluidAudioTTS

let manager = Qwen3TtsManager()
try await manager.loadFromDirectory(modelDir)

let wav = try await manager.synthesize(
    text: "Hello world",
    tokenIds: [9707, 1879, ...], // Pre-tokenized with Qwen3 processor
    useSpeaker: true
)
```

See [FluidAudio](https://github.com/FluidInference/FluidAudio) for the full Swift framework.

## Conversion

Converted using [coremltools](https://github.com/apple/coremltools) from the original PyTorch weights. Conversion scripts are in the [mobius](https://github.com/FluidInference/mobius) repository.

## License

**Apache-2.0**, inherited from [Qwen/Qwen3-TTS](https://huggingface.co/Qwen/Qwen3-TTS).

## References

- [Qwen/Qwen3-TTS](https://huggingface.co/Qwen/Qwen3-TTS)
- [FluidAudio](https://github.com/FluidInference/FluidAudio)
- [mobius](https://github.com/FluidInference/mobius)