# Kokoro CoreML (HAR-Optimized)
High-performance **Kokoro TTS CoreML conversion** with Apple Neural Engine (ANE) optimized HAR decoder buckets.
This repository contains precompiled `.mlpackage` models for fast on-device speech synthesis on Apple platforms.
---
*Based on this open-source project:*
https://github.com/mattmireles/kokoro-coreml
## 📦 Included Models
### Duration Model (Stage 1)
- `kokoro_duration.mlpackage`
Handles variable-length text and predicts phoneme durations and intermediate features.
---
### HAR Decoder Buckets (Stage 2, ANE-Optimized)
Fixed-size audio synthesis models:
- `KokoroDecoder_HAR_1s.mlpackage`
- `KokoroDecoder_HAR_2s.mlpackage`
- `KokoroDecoder_HAR_3s.mlpackage`
- `KokoroDecoder_HAR_5s.mlpackage`
- `KokoroDecoder_HAR_8s.mlpackage`
- `KokoroDecoder_HAR_10s.mlpackage`
- `KokoroDecoder_HAR_15s.mlpackage`
- `KokoroDecoder_HAR_20s.mlpackage`
- `KokoroDecoder_HAR.mlpackage`
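
Because each bucket has a fixed length, the caller has to choose one per utterance. The sketch below shows one plausible rule: take the smallest bucket that is at least as long as the predicted audio. The function name `pick_bucket` and the choice to raise an error for over-length audio are illustrative assumptions, not part of this repository.

```python
from bisect import bisect_left

# Bucket lengths in seconds, taken from the filenames listed above.
BUCKET_SECONDS = [1, 2, 3, 5, 8, 10, 15, 20]


def pick_bucket(duration_s: float) -> str:
    """Return the filename of the smallest bucket that fits duration_s.

    Hypothetical helper: the selection rule is an assumption.
    """
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    # Index of the first bucket whose length is >= duration_s.
    i = bisect_left(BUCKET_SECONDS, duration_s)
    if i == len(BUCKET_SECONDS):
        # Longer than the largest bucket: the caller must split the text.
        raise ValueError("audio is longer than the largest bucket")
    return f"KokoroDecoder_HAR_{BUCKET_SECONDS[i]}s.mlpackage"
```

For example, a predicted length of 3.2 s would select the 5 s bucket, with the unused tail left as padding.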
---
### Decoder-Only Variants
- `kokoro_decoder_only_3s.mlpackage`
- `kokoro_decoder_only_5s.mlpackage`
- `kokoro_decoder_only_10s.mlpackage`
---
### F0 / Feature Variants
- `kokoro_f0n_3s.mlpackage`
- `kokoro_f0n_5s.mlpackage`
- `kokoro_f0n_10s.mlpackage`
---
### Vocoder Variants
- `KokoroVocoder.mlpackage`
- `KokoroVocoder_asr64_f0128.mlpackage`
- `KokoroVocoder_asr80_f0160.mlpackage`
- `KokoroVocoder_asr96_f0192.mlpackage`
- `KokoroVocoder_asr128_f0256.mlpackage`
- `KokoroVocoder_asr160_f0320.mlpackage`
- `KokoroVocoder_asr200_f0400.mlpackage`
---
### 🧪 Experimental / Alternative
- `kokoro_synthesizer_3s.mlpackage`
- `kokoro_synthesizer_3s_nolstm.mlpackage`
- `StyleTTS2_iSTFTNet_Decoder.mlpackage`
---
# Architecture
This CoreML conversion uses a **two-stage pipeline** to support Kokoro's dynamic operations while maximizing ANE performance.
---
## Stage 1: Duration Model (CPU/GPU)
**Input:** Variable-length text (`ct.RangeDim`)
**Process:** Transformer + LSTM duration prediction
**Output:** Phoneme durations + intermediate features
**Compute:** CPU / GPU
Why CPU/GPU rather than the ANE?
- LSTM layers are not ANE-compatible
- Text input is variable-length, which requires dynamic shapes
---
## Stage 2: HAR Decoder (ANE Optimized)
**Input:**
- Features from duration model
- Alignment matrix (built client-side)
**Process:**
Vocoder synthesis using iSTFTNet architecture
**Output:**
24kHz waveform audio
**Compute:**
Apple Neural Engine
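
The client-side alignment step can be illustrated with a short sketch. It expands per-phoneme durations (in frames) into a 0/1 matrix in which row `i` is active over the frames assigned to phoneme `i`, zero-padded to the bucket's frame capacity. The function name `build_alignment`, the phonemes-by-frames orientation, the float values, and the padding convention are all assumptions for illustration; the layout the shipped models actually expect may differ.

```python
def build_alignment(durations, total_frames):
    """Expand integer frame durations into a 0/1 alignment matrix.

    Returns nested lists of shape (len(durations), total_frames).
    Hypothetical helper: orientation and padding are assumptions.
    """
    if any(d < 0 for d in durations):
        raise ValueError("durations must be non-negative")
    if sum(durations) > total_frames:
        raise ValueError("durations exceed the bucket's frame capacity")

    matrix = [[0.0] * total_frames for _ in durations]
    start = 0
    for row, d in zip(matrix, durations):
        # Mark the contiguous run of frames that belongs to this phoneme.
        for t in range(start, start + d):
            row[t] = 1.0
        start += d
    return matrix


# Three phonemes lasting 2, 1, and 3 frames, padded to an 8-frame bucket.
alignment = build_alignment([2, 1, 3], total_frames=8)
```

Each used frame column contains exactly one `1.0`, and the padded columns at the end stay all zero. In practice the same matrix would typically be built as a NumPy or `MLMultiArray` tensor before being passed to the decoder.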
---
# Key Innovations
- **HAR Processing**: Harmonic/phase separation for ANE efficiency
- **Fixed-size Buckets**: Avoid CoreML dynamic shape issues
- **Client-side Alignment**: Swift/Python builds alignment matrix
- **On-demand Model Loading**: Memory optimized
- **MIL Graph Patching**: CoreML compatibility fixes
---
# ⚡ Performance
### Runs on ANE (HAR Models)
- Conv1D
- ConvTranspose1D
- LeakyReLU
- Element-wise ops
**Result:** ~17× faster than real-time synthesis
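
To put the speed-up in concrete terms, the arithmetic below converts it into an expected synthesis time and an output sample count at the 24 kHz rate stated above. The helper names are illustrative, and the ~17× figure is treated as a constant even though real throughput will vary by device and bucket.

```python
SAMPLE_RATE_HZ = 24_000   # output rate stated in the Stage 2 description
SPEEDUP = 17.0            # approximate figure quoted above; varies in practice


def synthesis_seconds(audio_seconds: float, speedup: float = SPEEDUP) -> float:
    """Estimated wall-clock time needed to synthesize audio_seconds of speech."""
    return audio_seconds / speedup


def sample_count(audio_seconds: float) -> int:
    """Number of waveform samples in audio_seconds of 24 kHz audio."""
    return round(audio_seconds * SAMPLE_RATE_HZ)
```

Under these assumptions, a 10 s clip contains 240,000 samples and would take about 10 / 17 ≈ 0.59 s to generate.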
---
### Runs on CPU/GPU (Duration Model)
- LSTM layers
- Transformer attention
- AdaLayerNorm
- Dynamic shape processing
---
# Production Optimizations
- Bucket auto-selection
- ~200 MB per loaded model
- Warm-up optimization
- Graceful bucket fallback
- Memory cleanup during idle
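
On-demand loading and idle cleanup can be sketched as a small least-recently-used cache. This is a minimal illustration, not the repository's implementation: `ModelCache`, its `loader` callback, and the `max_loaded` limit are hypothetical names. In a real app the loader would wrap whatever Core ML loading call the client uses.

```python
from collections import OrderedDict


class ModelCache:
    """Keep at most max_loaded models in memory, evicting the least recently used.

    Hypothetical sketch: loader is any callable that maps a model name to a
    loaded model object.
    """

    def __init__(self, loader, max_loaded=2):
        self._loader = loader
        self._max_loaded = max_loaded
        self._models = OrderedDict()

    def get(self, name):
        if name in self._models:
            # Cache hit: mark the model as most recently used.
            self._models.move_to_end(name)
            return self._models[name]
        # Cache miss: load on demand, then evict the oldest entries if needed.
        model = self._loader(name)
        self._models[name] = model
        while len(self._models) > self._max_loaded:
            self._models.popitem(last=False)
        return model

    def clear(self):
        """Drop every loaded model, e.g. when the app has been idle."""
        self._models.clear()
```

With roughly 200 MB per loaded model, a limit of two keeps the resident footprint near 400 MB, at the cost of a reload when a third bucket is requested.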
---
# 📥 Downloading
Because each `.mlpackage` is a directory rather than a single file, download the repository with the Hugging Face CLI:
```bash
huggingface-cli download <username>/<repo> --local-dir .
```
---
License: MIT