# KittenTTS CoreML

CoreML conversions of KittenTTS models for on-device text-to-speech on iOS and macOS.

Two models | 24kHz audio | FP32 CoreML | 8 voices | iOS 17+ / macOS 14+
## Models
| Model | Params | 5s Model | 10s Model | Speed Control |
|---|---|---|---|---|
| Nano | 15M | 61 MB | 62 MB | No |
| Mini | 80M | 280 MB | 281 MB | Yes |
Both models produce 24kHz audio and share the same 8 voices; Mini offers higher output quality plus speed control.
## Files

```
nano/
├── kittentts_5s.mlmodelc/        # 5-second Nano model (70 max tokens)
├── kittentts_10s.mlmodelc/       # 10-second Nano model (140 max tokens)
├── voices/                       # 8 voice embeddings (.bin, 256-dim, 1 KB each)
└── voices.npz                    # Same voices in NumPy format

mini/
├── kittentts_mini_5s.mlmodelc/   # 5-second Mini model (70 max tokens)
├── kittentts_mini_10s.mlmodelc/  # 10-second Mini model (140 max tokens)
├── voices/                       # 8 voice embeddings (.bin, 400×256, 400 KB each)
└── voices.npz                    # Same voices in NumPy format
```
## Voices

| Voice | Gender |
|---|---|
| expr-voice-2-m | Male |
| expr-voice-2-f | Female |
| expr-voice-3-m | Male |
| expr-voice-3-f | Female |
| expr-voice-4-m | Male |
| expr-voice-4-f | Female |
| expr-voice-5-m | Male |
| expr-voice-5-f | Female |
### Nano Voices

Each voice is a 256-dimensional float32 vector. The `.bin` files are raw binary (1,024 bytes each), loadable directly with `Data(contentsOf:)` in Swift.
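For offline inspection, the same raw bytes can be read with NumPy; a minimal sketch (the helper name and file path are illustrative):

```python
import numpy as np

def load_nano_voice(path: str) -> np.ndarray:
    """Load a raw Nano voice embedding: 256 float32 values (1,024 bytes)."""
    voice = np.fromfile(path, dtype=np.float32)
    assert voice.size == 256, f"expected 256 floats, got {voice.size}"
    return voice.reshape(1, 256)  # the shape the model's ref_s input expects
```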
### Mini Voices

Each voice is a 400×256 float32 matrix: a multi-embedding where each row is a length-dependent style vector. Select a row based on the number of phoneme tokens (clamped to 0-399). The `.bin` files are raw binary (409,600 bytes each).
```python
import numpy as np

voice_matrix = np.load("mini/voices.npz")["expr-voice-2-m"]  # (400, 256)
row_index = min(num_tokens, 399)
style = voice_matrix[row_index].reshape(1, 256)
```
## Model I/O

### Nano

#### Inputs

| Name | Shape | Type | Description |
|---|---|---|---|
| `input_ids` | [1, N] | INT32 | Phoneme token IDs (0-padded) |
| `ref_s` | [1, 256] | FLOAT32 | Voice style vector |
| `random_phases` | [1, 9] | FLOAT32 | Initial harmonic phases |
| `attention_mask` | [1, N] | INT32 | 1 = valid token, 0 = padding |
| `source_noise` | [1, T, 9] | FLOAT32 | Stochastic noise for unvoiced regions |

- 5s model: N=70, T=120,000
- 10s model: N=140, T=240,000
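Putting the input spec together, a NumPy sketch for the 5-second Nano model; note the noise distributions (uniform phases, Gaussian source noise) are assumptions here, not documented requirements:

```python
import numpy as np

N, T = 70, 120_000  # 5-second Nano model: 70 max tokens, 120,000 noise frames

def prepare_nano_inputs(token_ids, seed=0):
    """Pad phoneme token IDs to length N and build all five fixed-shape inputs."""
    rng = np.random.default_rng(seed)
    n = len(token_ids)
    assert n <= N, "text too long for the 5s model"
    input_ids = np.zeros((1, N), dtype=np.int32)
    input_ids[0, :n] = token_ids
    attention_mask = np.zeros((1, N), dtype=np.int32)
    attention_mask[0, :n] = 1  # 1 = valid token, 0 = padding
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "ref_s": np.zeros((1, 256), dtype=np.float32),  # swap in a real voice vector
        "random_phases": rng.uniform(-np.pi, np.pi, (1, 9)).astype(np.float32),
        "source_noise": rng.standard_normal((1, T, 9)).astype(np.float32),
    }
```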
#### Outputs

| Name | Shape | Type | Description |
|---|---|---|---|
| `audio` | [1, 1, T+20] | FLOAT32 | Audio waveform at 24kHz, zeroed past valid length |
| `audio_length_samples` | [1] | INT32 | Number of valid audio samples |
| `pred_dur` | [1, N] | FLOAT32 | Predicted duration per token (frames) |
### Mini

#### Inputs

| Name | Shape | Type | Description |
|---|---|---|---|
| `input_ids` | [1, N] | INT32 | Phoneme token IDs (0-padded) |
| `attention_mask` | [1, N] | INT32 | 1 = valid token, 0 = padding |
| `style` | [1, 256] | FLOAT32 | Voice style vector (row from voice matrix) |
| `speed` | [1] | FLOAT32 | Speech speed multiplier (1.0 = normal) |

- 5s model: N=70
- 10s model: N=140
#### Outputs

| Name | Shape | Type | Description |
|---|---|---|---|
| `audio` | [1, 1, T+20] | FLOAT32 | Audio waveform at 24kHz, zeroed past valid length |
| `audio_length_samples` | [1] | INT32 | Number of valid audio samples |
| `pred_dur` | [1, N] | FLOAT32 | Predicted duration per token (frames) |
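Because both models zero-pad the waveform to the fixed [1, 1, T+20] shape, consumers should slice the valid region out using `audio_length_samples`; a small sketch:

```python
import numpy as np

SAMPLE_RATE = 24_000  # both models emit 24kHz audio

def trim_audio(audio: np.ndarray, audio_length_samples: np.ndarray) -> np.ndarray:
    """Return only the valid samples from the fixed-length [1, 1, T+20] output."""
    n = int(audio_length_samples[0])
    return audio.reshape(-1)[:n]
```

Dividing the trimmed array's length by `SAMPLE_RATE` gives the clip duration in seconds.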
## Architecture

Text → Phonemes → ALBERT → Duration/F0/Energy → Style → Decoder → ISTFTNet → Audio
Both models are StyleTTS2-based with:
- ALBERT Encoder: Shared-weight transformer for phoneme context
- Predictor: Duration, F0, energy via bidirectional LSTMs
- Decoder: AdaIN decode blocks with style conditioning
- Generator: ISTFTNet vocoder with Snake activations, harmonic source module
| Component | Nano | Mini |
|---|---|---|
| ALBERT embed / hidden | 128 / 768 | 128 / 768 |
| ALBERT layers (shared) | 4 repeats of 1 | 12 repeats of 1 |
| Generator channels | 256→128→64 | 512→256→128 |
| Noise inputs | External | Internal |
| Speed control | No | Yes |
| Parameters | 15M | 80M |
## Conversion

Both models were converted from the INT8-quantized ONNX originals via:

- Extract and dequantize the ONNX weights (INT8 → FP32)
- Reconstruct the PyTorch model from the ONNX graph
- Load the dequantized weights
- Trace with `torch.jit.trace`
- Convert to a CoreML mlprogram (FP32, iOS 17+)
## Verification
| Metric | Nano | Mini |
|---|---|---|
| CoreML vs PyTorch correlation | 0.963 | 0.9994 |
| RMS ratio (CoreML/ONNX) | 0.99 | 0.99 |
| Parameters loaded | 561/573 | 413/523 |
Unloaded parameters are LayerNorm/InstanceNorm layers that default to weight=1, bias=0, matching the ONNX constants.
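The two fidelity metrics can be recomputed for any pair of equal-length waveforms with a few lines of NumPy; this is a sketch of the metric definitions, not the exact verification script:

```python
import numpy as np

def waveform_metrics(ref: np.ndarray, test: np.ndarray):
    """Pearson correlation and RMS ratio (test/ref) between two 1-D waveforms."""
    corr = float(np.corrcoef(ref, test)[0, 1])
    rms = lambda x: float(np.sqrt(np.mean(np.square(x))))
    return corr, rms(test) / rms(ref)
```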
## Source
- Nano original: KittenML/kitten-tts-nano-0.1 (15M params, distilled from Kokoro-82M)
- Mini original: KittenML/kitten-tts-mini-0.8 (80M params, StyleTTS2)
- Conversion code: FluidInference/mobius
- Sample rate: 24kHz