Pocket TTS (Core ML)

Core ML conversion of Kyutai's pocket-tts text-to-speech model (CaLM + Mimi) for Apple Silicon. Real-time-capable on M1/M2/M3 CPUs. Produced by the pocket-tts-macos conversion pipeline.

Two-stage pipeline: a stateful CaLM autoregressive transformer (text → continuous latents) followed by a stateful Mimi streaming codec (latents → 24 kHz PCM). Both use ct.StateType KV caches for in-runtime state management.

Artifacts

| File | Size | Purpose |
|---|---|---|
| `calm_stateful.mlpackage.zip` | ~165 MB zipped (325 MB unzipped) | CaLM AR transformer (text → latent). Exposes `next_latent`, `is_eos`, and raw `eos_logit` (fp32) for Swift-side smoothing. |
| `mimi_stateful.mlpackage.zip` | ~20 MB zipped (39 MB unzipped) | Mimi streaming decoder (latent → PCM). 8192-slot K/V buffer (= 512 frames = ~41 s of audio per generation). |
| `prompt_phase.mlpackage.zip` | ~70 MB zipped (134 MB unzipped) | Voice + text prompt encoder (seeds CaLM KV cache). Stateless. |
| `voice_prompt_phase.mlpackage.zip` | ~125 MB zipped (252 MB unzipped) | Voice prompt encoder. Stateless. |

End-to-end pipeline runs at 3.1× real-time on M1 Ultra (38 fps vs the 12.5 fps audio-frame-rate requirement). Validated end-to-end via a Swift harness that loads only the .mlpackage files (no Python at runtime).

Spec

CaLM (`calm_stateful.mlpackage`)


| Property | Value |
|---|---|
| Inputs | `prev_latent` `[1, 1, 32]` fp32, `offset` `[1]` int32, `noise` `[1, 32]` fp32 |
| State | 12 `ct.StateType` buffers (6 layers × {K, V}), fp16, MAX_SEQ=512 |
| Outputs | `next_latent` `[1, 1, 32]` fp32, `is_eos` `[1, 1]` fp32 (thresholded at -4.0), `eos_logit` `[1, 1]` fp32 (raw, pre-threshold) |
| Precision | fp32 compute + fp16 state (one-shot rounding at state-write boundary; no compounding through the residual stream) |
| Compute units | CPU_ONLY (see "Compute units" below) |

Mimi (`mimi_stateful.mlpackage`)


| Property | Value |
|---|---|
| Inputs | `latent` `[1, 1, 32]` fp32, `offset` `[1]` int32 |
| State | 11 `ct.StateType` buffers (conv + attn caches), fp16, MAX_MIMI_SEQ=8192 |
| Output | `pcm` `[1, 1, 1920]` fp32 (80 ms of audio @ 24 kHz mono) |
| Precision | fp32 compute + fp16 state |
| Compute units | CPU_ONLY |
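
The frame timing implied by the Mimi output shape can be checked with a few lines of arithmetic. This is a minimal sketch: the constants come from the spec above, while the names `sampleRate`, `samplesPerFrame`, and `audioSeconds(frames:)` are hypothetical and not part of the model interface.

```swift
// Constants taken from the Mimi spec: one `pcm` output holds 1920 samples at 24 kHz.
let sampleRate = 24_000.0
let samplesPerFrame = 1_920.0

// Duration of one decoded frame in seconds, and the resulting frame rate.
let frameSeconds = samplesPerFrame / sampleRate      // 80 ms per frame
let framesPerSecond = sampleRate / samplesPerFrame   // frames needed per second of audio

/// Hypothetical helper: audio duration produced by `frames` Mimi steps.
func audioSeconds(frames: Int) -> Double {
    return Double(frames) * frameSeconds
}
```

The frame rate computed here is the 12.5 fps real-time requirement quoted for the end-to-end pipeline.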

Critical: Use `eos_logit` with smoothing, not the binarized `is_eos`

The `is_eos` output is the model's internal `eos_logit > -4.0` threshold check, computed inside the fp16-state path. Empirically, the Core ML quantized (fp16) state perturbs the EOS head by 3-8 logit units, which is much larger than fp16 K/V rounding alone would predict. The distilled student model has tightly fit the teacher's residual-stream distribution, and the EOS head doesn't generalize to off-distribution inputs.

Result: EOS misfires in both directions depending on the sentence (sometimes 3 frames early, sometimes 2 frames late), so a fixed threshold offset won't work.

Recommended Swift integration: read `eos_logit` (fp32) and apply a 3-frame consecutive-fire smoothing rule (logit > -4.0 for 3 steps in a row). This brings Core ML's effective firing point to within 1 frame of the fp32 PyTorch reference on the canonical test.
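
The consecutive-fire rule can be sketched as a small value type that is fed one `eos_logit` per CaLM step. This is an illustrative sketch, not the harness's actual code: the type name `EOSSmoother` and its members are hypothetical, and only the -4.0 threshold and the 3-frame run length come from the text above.

```swift
/// Hypothetical helper: counts consecutive frames whose raw EOS logit
/// exceeds the threshold and reports when the run is long enough.
struct EOSSmoother {
    let threshold: Float
    let requiredRun: Int
    private(set) var run = 0

    init(threshold: Float = -4.0, requiredRun: Int = 3) {
        self.threshold = threshold
        self.requiredRun = requiredRun
    }

    /// Feed one frame's `eos_logit`. Returns true once the logit has been
    /// above the threshold on `requiredRun` consecutive frames.
    mutating func shouldStop(eosLogit: Float) -> Bool {
        // A single frame at or below the threshold resets the run.
        run = eosLogit > threshold ? run + 1 : 0
        return run >= requiredRun
    }
}

// Usage with made-up logits: the isolated spike at index 1 is ignored,
// and the stop is reported on the third frame of the sustained run.
var smoother = EOSSmoother()
let logits: [Float] = [-9.0, -3.5, -6.0, -3.0, -2.0, -1.0]
let stopIndex = logits.firstIndex { smoother.shouldStop(eosLogit: $0) }
```

In this sketch the stop is reported on the last frame of the run; whether the caller keeps or trims the frames inside the run is left as an application decision.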

Compute units — must be CPU_ONLY

Set CPU_ONLY both at conversion time and at Swift load time:

import CoreML

// `url` is the on-disk location of the model to load.
let config = MLModelConfiguration()
config.computeUnits = .cpuOnly  // required
let model = try MLModel(contentsOf: url, configuration: config)

Why: kyutai-pocket-tts was authored against CPU SDPA; Kyutai's own release notes mention "we tried GPU, no speedup vs CPU." The fused attention kernels on the ANE and GPU produce numerical divergence that the distilled student is sensitive to, and the EOS head amplifies small residual-stream perturbations into multi-unit logit shifts. Reserve `.all` for size-bound models that need it, not these.

Conversion notes

Critical precision detail: `ct.StateType` in coremltools 9.0 is fp16-only, but the per-step compute runs in fp32 via `compute_precision=FLOAT32`. The wrappers explicitly call `.to(torch.float16)` on state write and `.to(torch.float32)` on state read inside the forward pass, and those casts become the ONLY fp16 surface in the converted graph. Each position written is rounded once, with no compounding through the AR residual stream. This idiom gave us an 11× CaLM drift improvement and a 1280× Mimi improvement over a naive fp16-throughout conversion.

Mimi's MAX_MIMI_SEQ=8192 (= 512 frames at 16 attention tokens per frame = ~41 s of audio) covers any realistic generation. An earlier MAX_MIMI_SEQ=1024 silently overflowed past frame 64: writes past the buffer end are no-ops on Core ML's static graph, which caused stale K/V reads and audible distortion in the back half of long chunks. Do not reduce it.
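
Because an overflow is silent, a caller may want to check the frame budget before each Mimi step. The following is a minimal sketch of that arithmetic, assuming the figures quoted above (16 attention tokens per frame, 80 ms per frame); the type `MimiCacheBudget` and its members are hypothetical names introduced for illustration.

```swift
/// Hypothetical helper: converts a K/V slot count into a frame and time budget.
struct MimiCacheBudget {
    let slots: Int            // MAX_MIMI_SEQ
    let tokensPerFrame: Int   // attention tokens written per frame
    let frameSeconds: Double  // audio duration of one frame

    /// Number of whole frames that fit in the buffer.
    var maxFrames: Int { return slots / tokensPerFrame }

    /// Audio duration covered by the buffer, in seconds.
    var maxSeconds: Double { return Double(maxFrames) * frameSeconds }

    /// True if the zero-based `frameIndex` can be written without overflowing.
    func fits(frameIndex: Int) -> Bool {
        return frameIndex >= 0 && frameIndex < maxFrames
    }
}

let current = MimiCacheBudget(slots: 8192, tokensPerFrame: 16, frameSeconds: 0.08)
let earlier = MimiCacheBudget(slots: 1024, tokensPerFrame: 16, frameSeconds: 0.08)
```

With these inputs `current.maxFrames` is 512 and `earlier.maxFrames` is 64, matching the limits described above. Guarding each step with `fits(frameIndex:)` turns the silent overflow into an explicit stop condition.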

Prohibited uses (per upstream Kyutai license — preserved verbatim)

Use of this model must comply with all applicable laws and regulations and must not result in, involve, or facilitate any illegal, harmful, deceptive, fraudulent, or unauthorized activity. Prohibited uses include, without limitation:

Voice impersonation or cloning without explicit and lawful consent.
Misinformation, disinformation, or deception — including fake news, fraudulent calls, or presenting generated content as genuine recordings of real people or events.
Generation of unlawful, harmful, libelous, abusive, harassing, discriminatory, hateful, or privacy-invasive content.

The maintainers of this conversion disclaim all liability for any non-compliant use.

Attribution

This is a derivative work of Kyutai's pocket-tts:

Upstream model: kyutai/pocket-tts-without-voice-cloning
Paper: Rouard, Orsini, Roebel, Zeghidour, Défossez — "Continuous Audio Language Models" (arXiv:2509.06926)
Project page: kyutai-labs.github.io/pocket-tts

Mac port + Core ML conversion: pocket-tts-macos by @slaughters85j, listed officially on the kyutai-labs project page as the Mac port author.

License

CC-BY 4.0, inherited from the upstream Kyutai release. Attribution to Kyutai required. Prohibited-use clause above is part of the license terms and must travel with any redistribution.
