---
license: cc-by-4.0
library_name: coreml
tags:
- tts
- text-to-speech
- coreml
- apple
- on-device
language:
- en
pipeline_tag: text-to-speech
base_model:
- kyutai/pocket-tts
base_model_relation: finetune
---

# PocketTTS CoreML

CoreML conversion of [kyutai/pocket-tts](https://huggingface.co/kyutai/pocket-tts) for on-device
inference on Apple platforms.

## Models

| Model | Description | Size |
|-------|-------------|------|
| cond_step | KV cache prefill (voice + text conditioning) | ~200 MB |
| flowlm_step | Autoregressive generation (transformer_out + EOS) | ~200 MB |
| flow_decoder | Flow matching denoiser (8 Euler steps per frame) | ~190 MB |
| mimi_decoder | Streaming audio codec (1920 samples per frame) | ~11 MB |

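The "8 Euler steps per frame" entry for `flow_decoder` can be illustrated with a minimal fixed-step Euler integrator. This is only a sketch: the `VelocityField` closure type and the `eulerIntegrate` function are hypothetical names that stand in for the model call, and the uniform time grid from 0 to 1 is an assumption. This card does not specify the actual inputs, outputs, or step schedule of the CoreML model.

```swift
// Hypothetical signature for a velocity field: (latent x, time t) -> dx/dt.
// In the real pipeline this role would be played by the flow_decoder model.
typealias VelocityField = (_ x: [Float], _ t: Float) -> [Float]

/// Integrates dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step forward Euler.
/// The default of 8 steps mirrors the figure in the Models table.
func eulerIntegrate(from start: [Float],
                    steps: Int = 8,
                    velocity: VelocityField) -> [Float] {
    precondition(steps > 0, "steps must be positive")
    var x = start
    let dt = 1.0 / Float(steps)  // assumed uniform time grid
    for i in 0..<steps {
        let t = Float(i) * dt
        let v = velocity(x, t)
        precondition(v.count == x.count, "velocity must match the latent dimension")
        for j in x.indices {
            x[j] += dt * v[j]  // Euler update: x <- x + dt * v(x, t)
        }
    }
    return x
}

// Toy check: a constant velocity of (target - start) moves start onto target at t = 1.
let start: [Float] = [0, 1, -2]
let target: [Float] = [1, 3, 0]
let result = eulerIntegrate(from: start) { _, _ in
    zip(target, start).map { $0 - $1 }
}
print(result)
```

With a learned velocity field, each of the 8 iterations would be one model evaluation, so the per-frame cost of this stage scales linearly with the step count.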
## Voices

Four pre-encoded voices are included in `constants_bin/`:
- `alba` (default), `azelma`, `cosette`, `javert`

Voice cloning weights are **not included**; Kyutai gates them separately.

## Usage

```swift
import FluidAudioTTS

// Create the manager and initialize it before the first synthesis call.
let manager = PocketTtsManager()
try await manager.initialize()

// Synthesize speech for the given text.
let audio = try await manager.synthesize(text: "Hello, world!")
```

See [FluidAudio](https://github.com/FluidInference/FluidAudio) for the full Swift framework.

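To relate audio length to the 1920-sample frames listed under Models, the arithmetic below may help when sizing buffers. It is a rough sketch, not part of the FluidAudio API: the helper names are hypothetical, and the 24 kHz sample rate is an assumption that this card does not state.

```swift
// Assumption: 24 kHz output sample rate (not stated in this card).
let assumedSampleRate = 24_000.0
// From the Models table: mimi_decoder emits 1920 samples per frame.
let samplesPerFrame = 1920

/// Duration in seconds of a given number of decoded frames.
func duration(ofFrames frames: Int, sampleRate: Double = assumedSampleRate) -> Double {
    Double(frames * samplesPerFrame) / sampleRate
}

/// Smallest number of frames that covers at least `seconds` of audio.
func framesNeeded(forSeconds seconds: Double, sampleRate: Double = assumedSampleRate) -> Int {
    Int((seconds * sampleRate / Double(samplesPerFrame)).rounded(.up))
}

print(duration(ofFrames: 1))
print(framesNeeded(forSeconds: 1.0))
```

Under the 24 kHz assumption, one frame covers 80 ms, so one second of speech needs 13 frames once the 12.5-frame figure is rounded up.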
## License

CC-BY-4.0, inherited from [kyutai/pocket-tts](https://huggingface.co/kyutai/pocket-tts). Attribution to Kyutai is required.

## References

- https://huggingface.co/kyutai/pocket-tts
- https://arxiv.org/abs/2410.00037
- https://github.com/FluidInference/FluidAudio