Pocket TTS (Core ML)

Core ML conversion of Kyutai's pocket-tts text-to-speech model (CaLM + Mimi) for Apple Silicon. Real-time-capable on M1/M2/M3 CPUs. Produced by the pocket-tts-macos conversion pipeline.

Two-stage pipeline: a stateful CaLM autoregressive transformer (text → continuous latents) followed by a stateful Mimi streaming codec (latents → 24 kHz PCM). Both use ct.StateType KV caches for in-runtime state management.

Artifacts

| File | Size | Purpose |
|---|---|---|
| calm_stateful.mlpackage.zip | ~165 MB zipped (325 MB unzipped) | CaLM AR transformer (text → latent). Exposes next_latent, is_eos, and raw eos_logit (fp32) for Swift-side smoothing. |
| mimi_stateful.mlpackage.zip | ~20 MB zipped (39 MB unzipped) | Mimi streaming decoder (latent → PCM). 8192-slot K/V buffer (= 512 frames = ~41 s of audio per generation). |
| prompt_phase.mlpackage.zip | ~70 MB zipped (134 MB unzipped) | Voice + text prompt encoder (seeds CaLM KV cache). Stateless. |
| voice_prompt_phase.mlpackage.zip | ~125 MB zipped (252 MB unzipped) | Voice prompt encoder. Stateless. |

The end-to-end pipeline runs at 3.1× real-time on M1 Ultra (38 fps vs the 12.5 fps audio-frame-rate requirement). Validated end-to-end via a Swift harness that loads only the .mlpackage files (no Python at runtime).

Spec

CaLM (calm_stateful.mlpackage)

Inputs: prev_latent [1, 1, 32] fp32; offset [1] int32; noise [1, 32] fp32
State: 12 ct.StateType buffers (6 layers × {K, V}), fp16, MAX_SEQ=512
Outputs: next_latent [1, 1, 32] fp32; is_eos [1, 1] fp32 (thresholded at -4.0); eos_logit [1, 1] fp32 (raw, pre-threshold)
Precision: fp32 compute + fp16 state (one-shot rounding at state-write boundary; no compounding through the residual stream)
Compute units: CPU_ONLY (see "Compute units" below)

Mimi (mimi_stateful.mlpackage)

Inputs: latent [1, 1, 32] fp32; offset [1] int32
State: 11 ct.StateType buffers (conv + attn caches), fp16, MAX_MIMI_SEQ=8192
Output: pcm [1, 1, 1920] fp32 (80 ms of audio @ 24 kHz mono)
Precision: fp32 compute + fp16 state
Compute units: CPU_ONLY
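The frame timing implied by the Mimi output shape can be checked with a few lines of arithmetic. The sketch below uses only figures stated in this card (1920 samples per frame, 24 kHz, and the 38 fps throughput quoted under Artifacts); the constant and function names are illustrative, not part of any shipped API.

```swift
// Figures taken from the Mimi output spec: pcm [1, 1, 1920] at 24 kHz mono.
let sampleRate = 24_000.0       // Hz
let samplesPerFrame = 1_920.0   // samples produced per Mimi step

// Seconds of audio per frame, and the frame rate needed to keep up with playback.
let frameDuration = samplesPerFrame / sampleRate
let requiredFPS = 1.0 / frameDuration

/// Ratio of measured throughput to the real-time requirement (illustrative helper).
func realTimeFactor(measuredFPS: Double) -> Double {
    return measuredFPS / requiredFPS
}

// 38 fps is the rounded M1 Ultra figure quoted in this card.
let rtf = realTimeFactor(measuredFPS: 38.0)
print(frameDuration, requiredFPS, rtf)
```

With the rounded 38 fps figure the ratio comes out at about 3.04, which matches the quoted 3.1× up to rounding of the measurement. Each frame has an 80 ms playback budget, so any per-step latency below that keeps streaming ahead of real time.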

Critical: Use eos_logit with smoothing, not the binarized is_eos

The is_eos output is the model's internal eos_logit > -4.0 threshold check, computed inside the fp16-state path. Empirically, the quantized (fp16) Core ML state perturbs the EOS head by 3-8 logit units, far more than fp16 K/V rounding alone would predict. The distilled student model has tightly fit the teacher's residual-stream distribution, and the EOS head does not generalize to off-distribution inputs.

Result: EOS misfires in both directions depending on the sentence (sometimes 3 frames early, sometimes 2 frames late), so a fixed threshold offset won't work.

Recommended Swift integration: read eos_logit (fp32) and apply a 3-frame consecutive-fire smoothing rule (logit > -4.0 for 3 steps in a row). This brings Core ML's effective firing point to within 1 frame of the fp32 PyTorch reference on the canonical test.
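A minimal sketch of the consecutive-fire rule, assuming the caller feeds one eos_logit value per CaLM step. The type name EOSSmoother, its method, and the logit trace are hypothetical and exist only for illustration; the -4.0 threshold and the 3-step run come from the text above.

```swift
/// Consecutive-fire EOS detector (hypothetical helper, not part of the model package).
/// Reports end-of-speech only after the raw logit has exceeded the threshold
/// on `requiredRun` steps in a row; any step at or below the threshold resets the run.
struct EOSSmoother {
    let threshold: Float
    let requiredRun: Int
    private(set) var run = 0

    init(threshold: Float = -4.0, requiredRun: Int = 3) {
        self.threshold = threshold
        self.requiredRun = requiredRun
    }

    /// Feed one raw eos_logit value; returns true once the run is long enough.
    mutating func update(_ eosLogit: Float) -> Bool {
        run = eosLogit > threshold ? run + 1 : 0
        return run >= requiredRun
    }
}

// Made-up logit trace: the single spike at step 1 is ignored,
// while the sustained run over steps 3, 4 and 5 fires.
var smoother = EOSSmoother()
let trace: [Float] = [-9.0, -3.5, -8.0, -3.9, -2.0, -1.0, 0.5]

var firstFire: Int? = nil
for (step, logit) in trace.enumerated() {
    if smoother.update(logit) {
        firstFire = step
        break
    }
}
print(firstFire as Any)
```

Keeping the counter in a small value type means one smoother instance can be created per utterance and discarded afterwards, so no state leaks between generations.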

Compute units: must be CPU_ONLY

Set CPU_ONLY both at conversion time and at Swift load time:

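import CoreML

// `url` must point to the compiled model (.mlmodelc), e.g. the URL returned by MLModel.compileModel(at:).
let config = MLModelConfiguration()
config.computeUnits = .cpuOnly  // required
let model = try MLModel(contentsOf: url, configuration: config)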

Why: kyutai-pocket-tts was authored against CPU SDPA; Kyutai's own release notes mention "we tried GPU, no speedup vs CPU." The fused attention kernels on ANE and GPU produce numerical divergence that the distilled student is sensitive to, and the EOS head amplifies small residual-stream perturbations into multi-unit logit shifts. Reserve .all for size-bound models that need it, not these.

Conversion notes

Critical precision detail: ct.StateType in coremltools 9.0 is fp16-only, but the per-step compute runs in fp32 via compute_precision=FLOAT32. The wrappers explicitly call .to(torch.float16) on state write and .to(torch.float32) on state read inside the forward pass, so those casts become the only fp16 surface in the converted graph. Each written position is rounded once, with no compounding through the AR residual stream. This idiom gave an 11× improvement in CaLM drift and a 1280× improvement for Mimi versus a naive fp16-throughout conversion.

Mimi's MAX_MIMI_SEQ=8192 (512 frames at 16 attn-tokens/frame, about 41 s of audio) covers any realistic generation. An earlier MAX_MIMI_SEQ=1024 silently overflowed past frame 64: writes past the buffer end are no-ops on Core ML's static graph, which caused stale K/V reads and audible distortion in the back half of long chunks. Do not reduce it.
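The buffer arithmetic above can be sketched as follows, along with a guard a caller might use to avoid the silent-overflow case. The function names are hypothetical; the 16 attn-tokens/frame and 80 ms/frame figures are taken from this card.

```swift
// Figures from this card: 16 attention tokens per Mimi frame, 80 ms of audio per frame.
let attnTokensPerFrame = 16
let frameSeconds = 1_920.0 / 24_000.0

/// Number of whole frames that fit in a K/V buffer with `slots` positions.
func frameCapacity(slots: Int) -> Int {
    return slots / attnTokensPerFrame
}

/// Seconds of audio that fit in a K/V buffer with `slots` positions.
func audioSeconds(slots: Int) -> Double {
    return Double(frameCapacity(slots: slots)) * frameSeconds
}

/// Hypothetical guard: true if `frames` Mimi steps stay inside the buffer.
/// Writes past the end are silent no-ops, so the caller has to check this itself.
func fitsInBuffer(frames: Int, slots: Int) -> Bool {
    return frames <= frameCapacity(slots: slots)
}

let currentFrames = frameCapacity(slots: 8192)   // shipped MAX_MIMI_SEQ
let currentSeconds = audioSeconds(slots: 8192)
let earlierFrames = frameCapacity(slots: 1024)   // earlier MAX_MIMI_SEQ that overflowed
print(currentFrames, currentSeconds, earlierFrames)
```

Because overflow does not raise an error, a check like fitsInBuffer before each chunk (or splitting long text into chunks that stay under the frame capacity) is one way to fail loudly instead of producing distorted audio.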

Prohibited uses (per upstream Kyutai license, preserved verbatim)

Use of this model must comply with all applicable laws and regulations and must not result in, involve, or facilitate any illegal, harmful, deceptive, fraudulent, or unauthorized activity. Prohibited uses include, without limitation:

  • Voice impersonation or cloning without explicit and lawful consent.
  • Misinformation, disinformation, or deception, including fake news, fraudulent calls, or presenting generated content as genuine recordings of real people or events.
  • Generation of unlawful, harmful, libelous, abusive, harassing, discriminatory, hateful, or privacy-invasive content.

The maintainers of this conversion disclaim all liability for any non-compliant use.

Attribution

This is a derivative work of Kyutai's pocket-tts:

Mac port + Core ML conversion: pocket-tts-macos by @slaughters85j, listed officially on the kyutai-labs project page as the Mac port author.

License

CC-BY 4.0, inherited from the upstream Kyutai release. Attribution to Kyutai is required. The prohibited-use clause above is part of the license terms and must travel with any redistribution.
