KittenTTS CoreML

CoreML conversions of KittenTTS models for on-device text-to-speech on iOS and macOS.

Two models | 24kHz audio | FP32 CoreML | 8 voices | iOS 17+ / macOS 14+

Models

| Model | Params | 5s Model | 10s Model | Speed Control |
|-------|--------|----------|-----------|---------------|
| Nano  | 15M    | 61 MB    | 62 MB     | No            |
| Mini  | 80M    | 280 MB   | 281 MB    | Yes           |

Both models produce 24kHz audio and share the same 8 voices; Mini offers higher-quality output and speed control.

Files

```
nano/
├── kittentts_5s.mlmodelc/          # 5-second Nano model (70 max tokens)
├── kittentts_10s.mlmodelc/         # 10-second Nano model (140 max tokens)
├── voices/                         # 8 voice embeddings (.bin, 256-dim, 1 KB each)
└── voices.npz                      # Same voices in numpy format

mini/
├── kittentts_mini_5s.mlmodelc/     # 5-second Mini model (70 max tokens)
├── kittentts_mini_10s.mlmodelc/    # 10-second Mini model (140 max tokens)
├── voices/                         # 8 voice embeddings (.bin, 400x256, 400 KB each)
└── voices.npz                      # Same voices in numpy format
```

Voices

| Voice          | Gender |
|----------------|--------|
| expr-voice-2-m | Male   |
| expr-voice-2-f | Female |
| expr-voice-3-m | Male   |
| expr-voice-3-f | Female |
| expr-voice-4-m | Male   |
| expr-voice-4-f | Female |
| expr-voice-5-m | Male   |
| expr-voice-5-f | Female |

Nano Voices

Each voice is a 256-dimensional float32 vector. The .bin files are raw binary (1,024 bytes each), loadable directly with Data(contentsOf:) in Swift.
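In Python, the same raw layout can be parsed with numpy. A minimal sketch; the helper name `load_nano_voice` and the file path are illustrative, not part of the release:

```python
import numpy as np

# Hypothetical helper: read a Nano voice embedding from its raw .bin file.
# Each file is 1,024 bytes = 256 little-endian float32 values.
def load_nano_voice(path: str) -> np.ndarray:
    style = np.fromfile(path, dtype=np.float32)
    assert style.size == 256, f"expected 256 floats, got {style.size}"
    return style.reshape(1, 256)  # matches the model's ref_s input shape

# e.g. ref_s = load_nano_voice("nano/voices/expr-voice-2-m.bin")
```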

Mini Voices

Each voice is a 400x256 float32 matrix: a multi-embedding where each row is a length-dependent style vector. Select a row based on the number of phoneme tokens (clamped to 0-399). The .bin files are raw binary (409,600 bytes each).

```python
voice_matrix = np.load("mini/voices.npz")["expr-voice-2-m"]  # (400, 256)
row_index = min(num_tokens, 399)
style = voice_matrix[row_index].reshape(1, 256)
```

Model I/O

Nano

Inputs

| Name           | Shape     | Type    | Description                           |
|----------------|-----------|---------|---------------------------------------|
| input_ids      | [1, N]    | INT32   | Phoneme token IDs (0-padded)          |
| ref_s          | [1, 256]  | FLOAT32 | Voice style vector                    |
| random_phases  | [1, 9]    | FLOAT32 | Initial harmonic phases               |
| attention_mask | [1, N]    | INT32   | 1 = valid token, 0 = padding          |
| source_noise   | [1, T, 9] | FLOAT32 | Stochastic noise for unvoiced regions |

- 5s model: N=70, T=120,000
- 10s model: N=140, T=240,000

Outputs

| Name                 | Shape       | Type    | Description                                    |
|----------------------|-------------|---------|------------------------------------------------|
| audio                | [1, 1, T+20] | FLOAT32 | Audio waveform at 24kHz, zeroed past valid length |
| audio_length_samples | [1]         | INT32   | Number of valid audio samples                  |
| pred_dur             | [1, N]      | FLOAT32 | Predicted duration per token (frames)          |
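Since the waveform buffer is fixed-length and zeroed past the valid region, trim it with `audio_length_samples` before playback. A sketch with placeholder arrays standing in for the CoreML outputs:

```python
import numpy as np

# Placeholders for the model outputs (5s Nano: T + 20 = 120,020 samples).
audio = np.zeros((1, 1, 120_020), dtype=np.float32)
audio_length_samples = np.array([96_000], dtype=np.int32)

# Keep only the valid prefix of the waveform.
valid = audio[0, 0, : int(audio_length_samples[0])]  # 1-D waveform
duration_sec = valid.shape[0] / 24_000               # 24 kHz sample rate
```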

Mini

Inputs

| Name           | Shape    | Type    | Description                              |
|----------------|----------|---------|------------------------------------------|
| input_ids      | [1, N]   | INT32   | Phoneme token IDs (0-padded)             |
| attention_mask | [1, N]   | INT32   | 1 = valid token, 0 = padding             |
| style          | [1, 256] | FLOAT32 | Voice style vector (row from voice matrix) |
| speed          | [1]      | FLOAT32 | Speech speed multiplier (1.0 = normal)   |

- 5s model: N=70
- 10s model: N=140

Outputs

| Name                 | Shape       | Type    | Description                                    |
|----------------------|-------------|---------|------------------------------------------------|
| audio                | [1, 1, T+20] | FLOAT32 | Audio waveform at 24kHz, zeroed past valid length |
| audio_length_samples | [1]         | INT32   | Number of valid audio samples                  |
| pred_dur             | [1, N]      | FLOAT32 | Predicted duration per token (frames)          |

Architecture

Text → Phonemes → ALBERT → Duration/F0/Energy → Style → Decoder → ISTFTNet → Audio

Both models are StyleTTS2-based with:

  • ALBERT Encoder: Shared-weight transformer for phoneme context
  • Predictor: Duration, F0, energy via bidirectional LSTMs
  • Decoder: AdaIN decode blocks with style conditioning
  • Generator: ISTFTNet vocoder with Snake activations, harmonic source module
Component Nano Mini
ALBERT embed / hidden 128 / 768 128 / 768
ALBERT layers (shared) 4 repeats of 1 12 repeats of 1
Generator channels 256β†’128β†’64 512β†’256β†’128
Noise inputs External Internal
Speed control No Yes
Parameters 15M 80M

Conversion

Both models were converted from the INT8-quantized ONNX originals via:

  1. Extract & dequantize ONNX weights (INT8 → FP32)
  2. Reconstruct PyTorch model from ONNX graph
  3. Load dequantized weights
  4. Trace with torch.jit.trace
  5. Convert to CoreML mlprogram (FP32, iOS 17+)
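Step 1 boils down to standard affine dequantization. A sketch with illustrative `scale` and `zero_point` values; in the real pipeline these come from each tensor's ONNX quantization parameters:

```python
import numpy as np

# Affine INT8 -> FP32 dequantization: w = (q - zero_point) * scale.
def dequantize(w_int8: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    return (w_int8.astype(np.float32) - zero_point) * scale

w_q = np.array([-128, 0, 127], dtype=np.int8)
w_fp32 = dequantize(w_q, scale=0.05, zero_point=0)  # approx [-6.4, 0.0, 6.35]
```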

Verification

| Metric                        | Nano    | Mini    |
|-------------------------------|---------|---------|
| CoreML vs PyTorch correlation | 0.963   | 0.9994  |
| RMS ratio (CoreML/ONNX)       | 0.99    | 0.99    |
| Parameters loaded             | 561/573 | 413/523 |

Unloaded parameters are LayerNorm/InstanceNorm layers that default to weight=1, bias=0, matching the ONNX constants.
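The two waveform metrics above can be computed with numpy. A sketch with synthetic placeholder waveforms standing in for the reference (PyTorch/ONNX) and CoreML outputs:

```python
import numpy as np

# Placeholder waveforms: `out` is `ref` plus 1% Gaussian noise.
rng = np.random.default_rng(0)
ref = rng.standard_normal(1000).astype(np.float32)
out = ref + 0.01 * rng.standard_normal(1000).astype(np.float32)

correlation = np.corrcoef(ref, out)[0, 1]               # Pearson correlation
rms_ratio = np.sqrt(np.mean(out**2) / np.mean(ref**2))  # RMS(out) / RMS(ref)
```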
