How to use slaughters85j/pocket-tts-coreml with Pocket-TTS:
```python
from pocket_tts import TTSModel
import scipy.io.wavfile

tts_model = TTSModel.load_model("slaughters85j/pocket-tts-coreml")
voice_state = tts_model.get_state_for_audio_prompt(
    "hf://kyutai/tts-voices/alba-mackenna/casual.wav"
)
audio = tts_model.generate_audio(voice_state, "Hello world, this is a test.")
# Audio is a 1D torch tensor containing PCM data.
scipy.io.wavfile.write("output.wav", tts_model.sample_rate, audio.numpy())
```
Pocket TTS (Core ML)
Core ML conversion of Kyutai's pocket-tts text-to-speech model (CaLM + Mimi) for Apple Silicon. Real-time-capable on M1/M2/M3 CPUs. Produced by the pocket-tts-macos conversion pipeline.
Two-stage pipeline: a stateful CaLM autoregressive transformer (text →
continuous latents) followed by a stateful Mimi streaming codec
(latents → 24 kHz PCM). Both use ct.StateType KV caches for in-runtime
state management.
Artifacts
| File | Size | Purpose |
|---|---|---|
| calm_stateful.mlpackage.zip | ~165 MB zipped (325 MB unzipped) | CaLM AR transformer (text → latent). Exposes next_latent, is_eos, and raw eos_logit (fp32) for Swift-side smoothing. |
| mimi_stateful.mlpackage.zip | ~20 MB zipped (39 MB unzipped) | Mimi streaming decoder (latent → PCM). 8192-slot K/V buffer (= 512 frames = ~41 s of audio per generation). |
| prompt_phase.mlpackage.zip | ~70 MB zipped (134 MB unzipped) | Voice + text prompt encoder (seeds CaLM KV cache). Stateless. |
| voice_prompt_phase.mlpackage.zip | ~125 MB zipped (252 MB unzipped) | Voice prompt encoder. Stateless. |
The end-to-end pipeline runs at 3.1× real-time on M1 Ultra (38 fps vs the 12.5 fps audio-frame-rate requirement). Validated end-to-end via a Swift harness that loads only the .mlpackage files (no Python at runtime).
Spec
CaLM (calm_stateful.mlpackage)
| Field | Value |
|---|---|
| Inputs | prev_latent [1, 1, 32] fp32, offset [1] int32, noise [1, 32] fp32 |
| State | 12 ct.StateType buffers (6 layers × {K, V}), fp16, MAX_SEQ=512 |
| Outputs | next_latent [1, 1, 32] fp32, is_eos [1, 1] fp32 (thresholded at -4.0), eos_logit [1, 1] fp32 (raw, pre-threshold) |
| Precision | fp32 compute + fp16 state (one-shot rounding at state-write boundary; no compounding through the residual stream) |
| Compute units | CPU_ONLY (see "Compute units" below) |
Mimi (mimi_stateful.mlpackage)
| Field | Value |
|---|---|
| Inputs | latent [1, 1, 32] fp32, offset [1] int32 |
| State | 11 ct.StateType buffers (conv + attn caches), fp16, MAX_MIMI_SEQ=8192 |
| Output | pcm [1, 1, 1920] fp32 (80 ms of audio @ 24 kHz mono) |
| Precision | fp32 compute + fp16 state |
| Compute units | CPU_ONLY |
Critical: Use eos_logit with smoothing, not the binarized is_eos
The is_eos output is the model's internal eos_logit > -4.0 threshold
check, computed inside the fp16-state path. Empirically, the Core ML
quantized state perturbs the EOS head by 3-8 logit units. That is much
larger than fp16 K/V rounding alone would predict: the distilled student
model has tightly fit the teacher's residual-stream distribution, and the
EOS head doesn't generalize to off-distribution inputs.
Result: EOS misfires in both directions depending on the sentence (sometimes 3 frames early, sometimes 2 frames late), so a fixed threshold offset won't work.
Recommended Swift integration: read eos_logit (fp32) and apply a
3-frame consecutive-fire smoothing rule (logit > -4.0 for 3 steps in a
row). This brings Core ML's effective firing point to within 1 frame of
the fp32 PyTorch reference on the canonical test.
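The consecutive-fire rule above can be sketched in a few lines. This is a minimal illustration of the logic in Python rather than the Swift harness itself; the names EosSmoother and first_eos_frame are hypothetical, and only the -4.0 threshold and the 3-frame window come from this card.

```python
from typing import Iterable, Optional

EOS_THRESHOLD = -4.0     # threshold stated in this card
CONSECUTIVE_FRAMES = 3   # steps in a row required before EOS fires


class EosSmoother:
    """Fires only after the raw eos_logit exceeds the threshold N steps in a row."""

    def __init__(self, threshold: float = EOS_THRESHOLD,
                 required: int = CONSECUTIVE_FRAMES) -> None:
        self.threshold = threshold
        self.required = required
        self.run = 0  # length of the current above-threshold streak

    def update(self, eos_logit: float) -> bool:
        # A single frame at or below the threshold resets the streak.
        self.run = self.run + 1 if eos_logit > self.threshold else 0
        return self.run >= self.required


def first_eos_frame(eos_logits: Iterable[float]) -> Optional[int]:
    """Return the index of the frame where smoothed EOS fires, or None."""
    smoother = EosSmoother()
    for index, logit in enumerate(eos_logits):
        if smoother.update(logit):
            return index
    return None


if __name__ == "__main__":
    # Frame 1 is an isolated spike and is ignored; frames 3, 4, 5 form the streak.
    print(first_eos_frame([-9.0, -3.5, -6.0, -3.0, -2.0, -1.0, 0.5]))
```

An isolated above-threshold frame does not end generation under this rule; only a sustained run does. The same counter translates directly to a Swift loop over the eos_logit output.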
Compute units: must be CPU_ONLY
Both at convert time AND at Swift load time:
```swift
import CoreML

let config = MLModelConfiguration()
config.computeUnits = .cpuOnly // required: do not use .all for these models
let model = try MLModel(contentsOf: url, configuration: config)
```
Why: kyutai-pocket-tts was authored against CPU SDPA; kyutai's own
release notes mention "we tried GPU, no speedup vs CPU." The fused
attention kernels on ANE/GPU produce numerical divergence that the
distilled student is sensitive to, and the EOS head amplifies small
residual-stream perturbations into multi-unit logit shifts. Reserve .all
for size-bound models that need it, not these.
Conversion notes
Critical precision detail: ct.StateType in coremltools 9.0 is fp16-only,
but the per-step compute runs in fp32 via compute_precision=FLOAT32. The
wrappers explicitly call .to(torch.float16) on state write and
.to(torch.float32) on state read inside the forward pass, so those casts
become the ONLY fp16 surface in the converted graph: one-shot rounding
per position written, with no compounding through the AR residual stream.
This idiom gave us an 11× CaLM drift improvement and a 1280× Mimi
improvement vs naive fp16-throughout conversion.
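The difference between rounding once at a write boundary and rounding on every step can be illustrated with a toy example. The sketch below is not the conversion code and uses neither coremltools nor PyTorch: to_fp16, one_shot_error, and accumulate are hypothetical helper names, Python's float stands in for the higher-precision compute path, and the running sum is only a stand-in for a value that is updated on every autoregressive step.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value and widen it back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def one_shot_error(values) -> float:
    """Worst relative error when each value is rounded to fp16 exactly once,
    as for a cache entry that is written once and afterwards only read."""
    return max(abs(to_fp16(v) - v) / abs(v) for v in values)


def accumulate(step: float, n: int, round_every_step: bool) -> float:
    """Add `step` n times, optionally rounding the running total to fp16
    after every addition (the fp16-throughout case)."""
    total = 0.0
    for _ in range(n):
        total += step
        if round_every_step:
            total = to_fp16(total)
    return total


if __name__ == "__main__":
    # One rounding per stored value: relative error stays within 2**-11.
    print(one_shot_error([0.1, 1.2345, -3.3, 100.7]))
    # High-precision running total vs. a total rounded to fp16 on every step.
    print(accumulate(0.001, 8000, round_every_step=False))
    print(accumulate(0.001, 8000, round_every_step=True))
```

With rounding on every step, the total stops tracking the true sum once the increment falls below half the fp16 spacing at the current magnitude. A value rounded once carries a single bounded error instead.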
Mimi's MAX_MIMI_SEQ=8192 (= 512 frames at 16 attn-tokens/frame =
~41 s of audio) covers any realistic generation. An earlier
MAX_MIMI_SEQ=1024 silently overflowed past frame 64: writes past the
buffer end are no-ops on Core ML's static graph, causing stale K/V reads
and audible distortion in the back half of long chunks. Do not reduce it.
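The buffer-capacity figures above follow from the numbers in the spec tables. The short calculation below reproduces them; the constants are taken from this card, while the function names are hypothetical and chosen only for illustration.

```python
SAMPLE_RATE_HZ = 24_000      # Mimi output sample rate
SAMPLES_PER_FRAME = 1_920    # length of the pcm output per step
ATTN_TOKENS_PER_FRAME = 16   # K/V slots consumed per audio frame


def frame_rate_hz() -> float:
    """Audio frames per second needed for real-time playback."""
    return SAMPLE_RATE_HZ / SAMPLES_PER_FRAME


def max_frames(kv_slots: int) -> int:
    """Number of whole frames that fit in a K/V buffer of the given size."""
    return kv_slots // ATTN_TOKENS_PER_FRAME


def max_audio_seconds(kv_slots: int) -> float:
    """Seconds of audio that can be generated before the buffer is full."""
    return max_frames(kv_slots) * SAMPLES_PER_FRAME / SAMPLE_RATE_HZ


if __name__ == "__main__":
    print(frame_rate_hz())  # 12.5 frames per second
    for slots in (1024, 8192):
        # 1024 slots: 64 frames, about 5.12 s; 8192 slots: 512 frames, about 40.96 s
        print(slots, max_frames(slots), round(max_audio_seconds(slots), 2))
```

The 1024-slot case shows why the earlier buffer failed on long chunks: it fills after 64 frames, roughly 5 seconds of audio.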
Prohibited uses (per upstream Kyutai license, preserved verbatim)
Use of this model must comply with all applicable laws and regulations and must not result in, involve, or facilitate any illegal, harmful, deceptive, fraudulent, or unauthorized activity. Prohibited uses include, without limitation:
- Voice impersonation or cloning without explicit and lawful consent.
- Misinformation, disinformation, or deception, including fake news, fraudulent calls, or presenting generated content as genuine recordings of real people or events.
- Generation of unlawful, harmful, libelous, abusive, harassing, discriminatory, hateful, or privacy-invasive content.
The maintainers of this conversion disclaim all liability for any non-compliant use.
Attribution
This is a derivative work of Kyutai's pocket-tts:
- Upstream model: kyutai/pocket-tts-without-voice-cloning
- Paper: Rouard, Orsini, Roebel, Zeghidour, Défossez: "Continuous Audio Language Models" (arXiv:2509.06926)
- Project page: kyutai-labs.github.io/pocket-tts
Mac port + Core ML conversion: pocket-tts-macos by @slaughters85j, listed officially on the kyutai-labs project page as the Mac port author.
License
CC-BY 4.0, inherited from the upstream Kyutai release. Attribution to Kyutai is required. The prohibited-use clause above is part of the license terms and must travel with any redistribution.