# KittenTTS CoreML

CoreML conversions of KittenTTS models for on-device text-to-speech on iOS and macOS.

Two models | 24kHz audio | FP32 CoreML | 8 voices | iOS 17+ / macOS 14+
## Models
| Model | Params | 5s Model | 10s Model | Speed Control |
|---|---|---|---|---|
| Nano | 15M | 61 MB | 62 MB | No |
| Mini | 80M | 280 MB | 281 MB | Yes |
Both models produce 24kHz audio and share the same 8 voices; Mini offers higher output quality plus speed control.
## Files

```
nano/
├── kittentts_5s.mlmodelc/        # 5-second Nano model (70 max tokens)
├── kittentts_10s.mlmodelc/       # 10-second Nano model (140 max tokens)
├── voices/                       # 8 voice embeddings (.bin, 256-dim, 1 KB each)
└── voices.npz                    # Same voices in NumPy format

mini/
├── kittentts_mini_5s.mlmodelc/   # 5-second Mini model (70 max tokens)
├── kittentts_mini_10s.mlmodelc/  # 10-second Mini model (140 max tokens)
├── voices/                       # 8 voice embeddings (.bin, 400×256, 400 KB each)
└── voices.npz                    # Same voices in NumPy format
```
## Voices

| Voice | Gender |
|---|---|
| expr-voice-2-m | Male |
| expr-voice-2-f | Female |
| expr-voice-3-m | Male |
| expr-voice-3-f | Female |
| expr-voice-4-m | Male |
| expr-voice-4-f | Female |
| expr-voice-5-m | Male |
| expr-voice-5-f | Female |
### Nano Voices

Each voice is a 256-dimensional float32 vector. The `.bin` files are raw binary (1,024 bytes each), loadable directly with `Data(contentsOf:)` in Swift.
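For offline inspection, the same raw bytes can be read with NumPy; a minimal sketch (the helper name and file path are illustrative):

```python
import numpy as np

def load_nano_voice(path: str) -> np.ndarray:
    """Load a raw Nano voice embedding: 256 float32 values (1,024 bytes)."""
    voice = np.fromfile(path, dtype=np.float32)
    assert voice.size == 256, f"expected 256 floats, got {voice.size}"
    return voice.reshape(1, 256)  # the shape the model's ref_s input expects
```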
### Mini Voices

Each voice is a 400×256 float32 matrix: a multi-embedding where each row is a length-dependent style vector. Select a row based on the number of phoneme tokens (clamped to 0-399). The `.bin` files are raw binary (409,600 bytes each).
```python
import numpy as np

voice_matrix = np.load("mini/voices.npz")["expr-voice-2-m"]  # (400, 256)
row_index = min(num_tokens, 399)
style = voice_matrix[row_index].reshape(1, 256)
```
## Model I/O

### Nano

#### Inputs

| Name | Shape | Type | Description |
|---|---|---|---|
| `input_ids` | [1, N] | INT32 | Phoneme token IDs (0-padded) |
| `ref_s` | [1, 256] | FLOAT32 | Voice style vector |
| `random_phases` | [1, 9] | FLOAT32 | Initial harmonic phases |
| `attention_mask` | [1, N] | INT32 | 1 = valid token, 0 = padding |
| `source_noise` | [1, T, 9] | FLOAT32 | Stochastic noise for unvoiced regions |

- 5s model: N=70, T=120,000
- 10s model: N=140, T=240,000
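Putting the input spec together, a NumPy sketch for the 5-second Nano model; note the noise distributions (uniform phases, Gaussian source noise) are assumptions here, not documented requirements:

```python
import numpy as np

N, T = 70, 120_000  # 5-second Nano model: 70 max tokens, 120,000 noise frames

def prepare_nano_inputs(token_ids, seed=0):
    """Pad phoneme token IDs to length N and build all five fixed-shape inputs."""
    rng = np.random.default_rng(seed)
    n = len(token_ids)
    assert n <= N, "text too long for the 5s model"
    input_ids = np.zeros((1, N), dtype=np.int32)
    input_ids[0, :n] = token_ids
    attention_mask = np.zeros((1, N), dtype=np.int32)
    attention_mask[0, :n] = 1  # 1 = valid token, 0 = padding
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "ref_s": np.zeros((1, 256), dtype=np.float32),  # swap in a real voice vector
        "random_phases": rng.uniform(-np.pi, np.pi, (1, 9)).astype(np.float32),
        "source_noise": rng.standard_normal((1, T, 9)).astype(np.float32),
    }
```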
#### Outputs

| Name | Shape | Type | Description |
|---|---|---|---|
| `audio` | [1, 1, T+20] | FLOAT32 | Audio waveform at 24kHz, zeroed past valid length |
| `audio_length_samples` | [1] | INT32 | Number of valid audio samples |
| `pred_dur` | [1, N] | FLOAT32 | Predicted duration per token (frames) |
### Mini

#### Inputs

| Name | Shape | Type | Description |
|---|---|---|---|
| `input_ids` | [1, N] | INT32 | Phoneme token IDs (0-padded) |
| `attention_mask` | [1, N] | INT32 | 1 = valid token, 0 = padding |
| `style` | [1, 256] | FLOAT32 | Voice style vector (row from voice matrix) |
| `speed` | [1] | FLOAT32 | Speech speed multiplier (1.0 = normal) |

- 5s model: N=70
- 10s model: N=140
#### Outputs

| Name | Shape | Type | Description |
|---|---|---|---|
| `audio` | [1, 1, T+20] | FLOAT32 | Audio waveform at 24kHz, zeroed past valid length |
| `audio_length_samples` | [1] | INT32 | Number of valid audio samples |
| `pred_dur` | [1, N] | FLOAT32 | Predicted duration per token (frames) |
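Because both models zero-pad the waveform to the fixed [1, 1, T+20] shape, consumers should slice the valid region out using `audio_length_samples`; a small sketch:

```python
import numpy as np

SAMPLE_RATE = 24_000  # both models emit 24kHz audio

def trim_audio(audio: np.ndarray, audio_length_samples: np.ndarray) -> np.ndarray:
    """Return only the valid samples from the fixed-length [1, 1, T+20] output."""
    n = int(audio_length_samples[0])
    return audio.reshape(-1)[:n]
```

Dividing the trimmed array's length by `SAMPLE_RATE` gives the clip duration in seconds.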
## Architecture

Text → Phonemes → ALBERT → Duration/F0/Energy → Style → Decoder → ISTFTNet → Audio
Both models are StyleTTS2-based with:
- ALBERT Encoder: Shared-weight transformer for phoneme context
- Predictor: Duration, F0, energy via bidirectional LSTMs
- Decoder: AdaIN decode blocks with style conditioning
- Generator: ISTFTNet vocoder with Snake activations, harmonic source module
| Component | Nano | Mini |
|---|---|---|
| ALBERT embed / hidden | 128 / 768 | 128 / 768 |
| ALBERT layers (shared) | 4 repeats of 1 | 12 repeats of 1 |
| Generator channels | 256→128→64 | 512→256→128 |
| Noise inputs | External | Internal |
| Speed control | No | Yes |
| Parameters | 15M | 80M |
## Conversion

Both models were converted from the INT8-quantized ONNX originals via:

- Extract and dequantize the ONNX weights (INT8 → FP32)
- Reconstruct the PyTorch model from the ONNX graph
- Load the dequantized weights
- Trace with `torch.jit.trace`
- Convert to a CoreML mlprogram (FP32, iOS 17+)
## Verification
| Metric | Nano | Mini |
|---|---|---|
| CoreML vs PyTorch correlation | 0.963 | 0.9994 |
| RMS ratio (CoreML/ONNX) | 0.99 | 0.99 |
| Parameters loaded | 561/573 | 413/523 |
Unloaded parameters are LayerNorm/InstanceNorm layers that default to weight=1, bias=0, matching the ONNX constants.
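The two fidelity metrics can be recomputed for any pair of equal-length waveforms with a few lines of NumPy; this is a sketch of the metric definitions, not the exact verification script:

```python
import numpy as np

def waveform_metrics(ref: np.ndarray, test: np.ndarray):
    """Pearson correlation and RMS ratio (test/ref) between two 1-D waveforms."""
    corr = float(np.corrcoef(ref, test)[0, 1])
    rms = lambda x: float(np.sqrt(np.mean(np.square(x))))
    return corr, rms(test) / rms(ref)
```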
## Source
- Nano original: KittenML/kitten-tts-nano-0.1 (15M params, distilled from Kokoro-82M)
- Mini original: KittenML/kitten-tts-mini-0.8 (80M params, StyleTTS2)
- Conversion code: FluidInference/mobius
- Sample rate: 24kHz