Update README.md
README.md
@@ -1,55 +1,56 @@
---
license: cc-by-4.0
library_name: coreml
tags:
- tts
- text-to-speech
- coreml
- apple
- on-device
language:
- en
---

# PocketTTS CoreML

CoreML conversion of [kyutai/pocket-tts](https://huggingface.co/kyutai/pocket-tts) for on-device
inference on Apple platforms.

## Models

| Model | Description | Size |
|-------|-------------|------|
| cond_step | KV cache prefill (voice + text conditioning) | ~200MB |
| flowlm_step | Autoregressive generation (transformer_out + EOS) | ~200MB |
| flow_decoder | Flow matching denoiser (8 Euler steps per frame) | ~190MB |
| mimi_decoder | Streaming audio codec (1920 samples per frame) | ~11MB |
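
The per-frame figures in the table can be made concrete with a small sketch. This is not the code that ships with these models: it assumes a 24 kHz output sample rate (the table states only 1920 samples per frame), and `velocity` is a hypothetical closure standing in for `flow_decoder`. The names `frameDuration` and `eulerIntegrate` are illustrative only.

```swift
import Foundation

// Assumption: 24 kHz output sample rate; the table only gives samples per frame.
let assumedSampleRate = 24_000.0
let samplesPerFrame = 1920
let eulerStepsPerFrame = 8

/// Duration in seconds of one decoded audio frame at the assumed sample rate.
func frameDuration(samplesPerFrame: Int, sampleRate: Double) -> Double {
    return Double(samplesPerFrame) / sampleRate
}

/// Fixed-step Euler integration of dx/dt = velocity(x, t) from t = 0 to t = 1.
/// `velocity` is a hypothetical stand-in for a learned flow-matching network.
func eulerIntegrate(
    from start: [Double],
    steps: Int,
    velocity: ([Double], Double) -> [Double]
) -> [Double] {
    precondition(steps > 0, "steps must be positive")
    let dt = 1.0 / Double(steps)
    var x = start
    for step in 0..<steps {
        let t = Double(step) * dt
        let v = velocity(x, t)
        precondition(v.count == x.count, "velocity must match the state size")
        for i in x.indices {
            x[i] += dt * v[i]
        }
    }
    return x
}

// Toy velocity field with a constant value of 1.0 in every component,
// so eight Euler steps carry the zero vector to all ones.
let result = eulerIntegrate(from: [0.0, 0.0], steps: eulerStepsPerFrame) { x, _ in
    x.map { _ in 1.0 }
}
let seconds = frameDuration(samplesPerFrame: samplesPerFrame, sampleRate: assumedSampleRate)
print(result, seconds)
```

Under the 24 kHz assumption, each frame covers 80 ms of audio, so the denoiser would run its eight Euler steps once per 80 ms of generated speech.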

## Voices

Four pre-encoded voices are provided in `constants_bin/`:
- `alba` (default), `azelma`, `cosette`, `javert`

Voice cloning weights are **not included**; Kyutai gates them separately.
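
One way an app might guard against unknown voice names is sketched below. `BundledVoice` and `resolve(_:)` are hypothetical and not part of FluidAudio; the sketch only encodes the four names listed above and treats `alba` as the fallback because it is marked as the default.

```swift
import Foundation

/// Hypothetical enum for the four bundled voice names listed above.
/// It is not part of FluidAudio; it only illustrates validating a name.
enum BundledVoice: String, CaseIterable {
    case alba, azelma, cosette, javert

    /// `alba` is marked as the default voice in the list above.
    static let defaultVoice: BundledVoice = .alba

    /// Parses a user-supplied name, ignoring case and surrounding whitespace,
    /// and falls back to the default voice when the name is unknown.
    static func resolve(_ name: String) -> BundledVoice {
        let key = name.trimmingCharacters(in: .whitespacesAndNewlines).lowercased()
        return BundledVoice(rawValue: key) ?? defaultVoice
    }
}

print(BundledVoice.resolve(" Javert ").rawValue)
print(BundledVoice.resolve("unknown").rawValue)
```

How a resolved name is then passed to the synthesizer depends on the FluidAudio API and is not shown here.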

## Usage

```swift
import FluidAudioTTS

// Create the manager and prepare it before first use.
let manager = PocketTtsManager()
try await manager.initialize()
// Synthesize speech for the given text.
let audio = try await manager.synthesize(text: "Hello, world!")
```

See https://github.com/FluidInference/FluidAudio for the full Swift framework.

## License

CC-BY-4.0, inherited from https://huggingface.co/kyutai/pocket-tts. Attribution to Kyutai is required.

## References

- https://huggingface.co/kyutai/pocket-tts
- https://arxiv.org/abs/2410.00037
- https://github.com/FluidInference/FluidAudio