---
license: cc-by-4.0
library_name: coreml
tags:
- tts
- text-to-speech
- coreml
- apple
- on-device
language:
- en
---

# PocketTTS CoreML

CoreML conversion of [kyutai/pocket-tts](https://huggingface.co/kyutai/pocket-tts) for on-device
inference on Apple platforms.

## Models

| Model | Description | Size |
|-------|-------------|------|
| cond_step | KV cache prefill (voice + text conditioning) | ~200 MB |
| flowlm_step | Autoregressive generation (transformer_out + EOS) | ~200 MB |
| flow_decoder | Flow matching denoiser (8 Euler steps per frame) | ~190 MB |
| mimi_decoder | Streaming audio codec (1920 samples per frame) | ~11 MB |
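
To make the per-frame numbers in the table concrete, the sketch below works through two things: the duration of one `mimi_decoder` frame, and what "8 Euler steps" means for a flow matching denoiser. It is an illustration only, not FluidAudio's implementation. The 24 kHz sample rate is an assumption taken from the Mimi codec rather than from this card, and `eulerIntegrate` with its toy velocity field is a hypothetical stand-in for `flow_decoder`, where a neural network would predict the velocity.

```swift
// Values from the Models table.
let samplesPerFrame = 1920
let eulerStepsPerFrame = 8

// Assumption: Mimi decodes audio at 24 kHz (not stated in this card).
let assumedSampleRate = 24_000.0

// Duration of one generated frame and the resulting frame rate.
let frameDuration = Double(samplesPerFrame) / assumedSampleRate
let framesPerSecond = assumedSampleRate / Double(samplesPerFrame)

// Assumption: one denoiser evaluation per Euler step.
let denoiserCallsPerSecond = Double(eulerStepsPerFrame) * framesPerSecond

/// Hypothetical helper: fixed-step Euler integration of
/// dx/dt = velocity(x, t) from t = 0 to t = 1.
func eulerIntegrate(
    from start: [Double],
    steps: Int,
    velocity: ([Double], Double) -> [Double]
) -> [Double] {
    var x = start
    let dt = 1.0 / Double(steps)
    for step in 0..<steps {
        let t = Double(step) * dt
        let v = velocity(x, t)
        for i in x.indices {
            x[i] += dt * v[i]
        }
    }
    return x
}

// Toy velocity field: a constant vector pointing from the start to the target.
// A real flow matching model would compute this from x, t, and conditioning.
let start: [Double] = [0.0, 0.0]
let target: [Double] = [1.0, -2.0]
let direction = zip(target, start).map { pair in pair.0 - pair.1 }

let result = eulerIntegrate(from: start, steps: eulerStepsPerFrame) { _, _ in
    direction
}

print("Frame duration: \(frameDuration) s")
print("Frames per second: \(framesPerSecond)")
print("Denoiser calls per second of audio: \(denoiserCallsPerSecond)")
print("Integrated sample: \(result)")
```

Under these assumptions each frame covers 80 ms of audio, so generation runs at 12.5 frames per second and needs about 100 denoiser evaluations per second of output. The toy field is constant, which is why eight steps land exactly on the target; a learned velocity field would only be approximated by the same loop.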

## Voices

Four pre-encoded voices are included in `constants_bin/`:
- `alba` (default), `azelma`, `cosette`, `javert`

Voice cloning weights are **not included**; Kyutai gates them separately.

## Usage

```swift
import FluidAudioTTS

let manager = PocketTtsManager()
try await manager.initialize()
let audio = try await manager.synthesize(text: "Hello, world!")
```

See https://github.com/FluidInference/FluidAudio for the full Swift framework.

## License

CC-BY-4.0, inherited from https://huggingface.co/kyutai/pocket-tts. Attribution to Kyutai is required.

## References

- https://huggingface.co/kyutai/pocket-tts
- https://arxiv.org/abs/2410.00037
- https://github.com/FluidInference/FluidAudio