Kokoro-82M phoneme encoder: Hailo-10H HEF (experimental)
⚠️ Experimental / early-stage research artifact. This HEF is the ALBERT phoneme encoder subgraph of hexgrad/Kokoro-82M (StyleTTS2-based TTS), compiled for the Hailo-10H accelerator. Reproducibility is the headline; production audio quality is not evaluated yet, only the encoder block's INT8 numerical fidelity. See "Known limits" below.
This repo contains a HEF for the ALBERT phoneme encoder subgraph of Kokoro-82M on Hailo-10H (target architecture mercury) using DFC 5.3.0.
TL;DR
| Field | Value |
|---|---|
| HEF | kokoro-albert-encoder.hef (65.83 MiB) |
| Target | Hailo-10H (hailo10h / mercury) |
| DFC | 5.3.0 |
| Source model | hexgrad/Kokoro-82M (Apache-2.0, 82M total params; this HEF covers the ~67M phoneme-encoder subgraph) |
| Source ONNX | onnx-community/Kokoro-82M-v1.0-ONNX model.onnx (fp32, 311 MiB) |
| Subgraph cut | start_node=/encoder/bert/embeddings/LayerNorm/LayerNormalization_output_0, end_node=/encoder/bert_encoder/MatMul_output_0; i.e. the embedding-hidden-mapping projection + 12-layer ALBERT transformer + the final encoder MatMul (which projects 768 → 512 for the duration predictor). |
| Sequence length | 128 phoneme tokens (static) |
| Input | [1, 128, 128] NWC: output of CPU-side embedding + LayerNorm (FP32) |
| Output | [1, 128, 512] NWC: encoder MatMul output, feeds the duration predictor downstream |
| Quantization | INT8 (with ew_add* raised to a16_w16; ALBERT-specific Pow(x, 3.0) rewritten to Mul(Mul(x,x), x)) |
| Cosine vs FP32 | ~0.72 mean cosine on emulator (SDK_QUANTIZED vs SDK_FP_OPTIMIZED) over 16 random-Gaussian inputs |
| Compile wall | 2 h 25 min on Kaggle CPU (30 GB RAM session, after 10 iterations of Kaggle-side debugging) |
What this is
- A reproducible compile of Kokoro's phoneme encoder onto Hailo-10H.
- A worked example of the ALBERT-on-Hailo stack: the same RuVector recipe used for vanilla BERT (Keras-serializable monkey-patch + multiproc_policy=disabled) plus two ALBERT-specific ONNX surgeries:
  - Pow(x, 3.0) → Mul(Mul(x,x), x): ALBERT's tanh-approximation GELU contains a cubic that DFC's Pow op refuses (only Pow with even integer exponent ≤ 2 is supported).
  - Shape-fix + onnxsim before extract_model: Kokoro's ONNX has Shape(input_ids) nodes whose values flow through Concat/Reshape deeper in the graph; without folding them, onnx.utils.extract_model refuses to extract the encoder subgraph.
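The Pow(x, 3.0) rewrite should be numerically neutral, which a few lines of standard-library Python can check. This is a minimal sketch, assuming the usual tanh-approximation GELU, 0.5·x·(1 + tanh(√(2/π)·(x + 0.044715·x³))). The function names are illustrative; this is not the repo's replace_pow3_with_mul.py, which rewrites ONNX nodes rather than Python floats.

```python
import math

SQRT_2_OVER_PI = math.sqrt(2.0 / math.pi)

def gelu_tanh_pow(x: float) -> float:
    """Tanh-approximation GELU with the cubic written as Pow(x, 3.0)."""
    cube = math.pow(x, 3.0)
    return 0.5 * x * (1.0 + math.tanh(SQRT_2_OVER_PI * (x + 0.044715 * cube)))

def gelu_tanh_mul(x: float) -> float:
    """Same function with the cubic written as Mul(Mul(x, x), x)."""
    cube = (x * x) * x
    return 0.5 * x * (1.0 + math.tanh(SQRT_2_OVER_PI * (x + 0.044715 * cube)))

# Largest absolute difference between the two forms on a grid over [-8, 8],
# roughly the value range reported for the encoder output.
grid = [i / 100.0 for i in range(-800, 801)]
max_abs_diff = max(abs(gelu_tanh_pow(x) - gelu_tanh_mul(x)) for x in grid)
print(f"max |pow - mul| over grid: {max_abs_diff:.3e}")
```

Any difference the script reports is floating-point rounding, so the rewrite changes which ops the compiler sees rather than what the graph computes.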
What this is not
- End-to-end Kokoro TTS. This HEF covers only the encoder block. The full Kokoro pipeline also needs: token embedding + LayerNorm (CPU, cheap) on the input side, and the duration predictor + iSTFTNet decoder on the output side. Those are not compiled here; they would each need their own HEF or run on CPU.
- Audio-quality evaluated. We measured INT8 fidelity at the encoder MatMul output (cosine 0.72 vs FP32 on random inputs). We did NOT generate any audio with this HEF and have no listening test, no MOS estimate, and no Whisper-WER measurement. The downstream duration predictor and iSTFTNet decoder are sensitive to encoder output noise; the actual perceptual TTS quality is unknown.
- Tuned with real phoneme calibration. The calibration set was random Gaussian (standard normal), synthesized in crisphailo-kokoro-compile.py because no labelled phoneme corpus was wired into the bringup path. A real-phoneme calibration set might shift the cosine number in either direction.
- Optimized. Compile was at optimization_level=2 / compression_level=4 (the levels that fit in 30 GB RAM during bringup). optimization_level=4 (full QAT) was not attempted due to RAM/time budget.
Known limits
- Cosine ceiling: ~0.72 vs FP32 reference. The recipe imposes a quantization noise floor that's empirically identical for MiniLM BERT (10.67M, no-mask; cstr/all-MiniLM-L6-v2-hailo10h) and this Kokoro ALBERT-base (66.89M, no-mask). The plateau appears independent of model size, mask presence, or BERT vs ALBERT family; it's the noise floor of (matmul_correction zp_comp_block + ew_add a16_w16 + neg_exp rank=1) on Hailo-10H.
- Output range slightly tighter than FP32: FP32 emulator output range [-8.22, +7.55], INT8 emulator range [-7.40, +7.46]. Mild clipping in the tails. MSE 0.93 (RMSE ≈ 0.97) on a value range of ~16, i.e. roughly 6% relative error.
- Static sequence 128. For shorter phoneme sequences, pad with the Kokoro tokenizer's pad token; for longer ones, the HEF won't run at all (re-extract + recompile with the desired --seq).
- 6 NPU contexts in the compiled placement. This does not prevent inference, but expect ~6× the latency of a 1-context HEF on the same hardware.
- Encoder MatMul output is 512-d, not 768-d. The Kokoro StyleTTS2 architecture projects the 768-d ALBERT hidden state down to 512 at the encoder MatMul we cut at, before feeding the duration predictor. So downstream code expects a [B, T, 512] tensor, not [B, T, 768].
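Because the HEF is compiled for exactly 128 tokens, the host has to pad or reject every phoneme sequence before the embedding step. The following is a minimal sketch of that guard, assuming a pad token id of 0 (an assumption to verify against the Kokoro vocabulary); pad_or_reject is a hypothetical helper name, not part of the Kokoro or Hailo APIs.

```python
from typing import List, Sequence

SEQ_LEN = 128  # static sequence length baked into the HEF
PAD_ID = 0     # ASSUMPTION: pad token id; verify against the Kokoro vocabulary

def pad_or_reject(ids: Sequence[int], seq_len: int = SEQ_LEN,
                  pad_id: int = PAD_ID) -> List[int]:
    """Right-pad ids to seq_len, or raise if the sequence is too long."""
    if len(ids) > seq_len:
        raise ValueError(
            f"{len(ids)} tokens exceed the static length {seq_len}; "
            "re-extract and recompile with a larger --seq"
        )
    return list(ids) + [pad_id] * (seq_len - len(ids))

# Made-up token ids, only to show the call.
padded = pad_or_reject([12, 7, 93])
print(len(padded), padded[:5])
```

Raising on over-long input, rather than truncating silently, keeps dropped phonemes from going unnoticed in the synthesized speech.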
How to deploy on Hailo-10H
Prereqs:
- Hailo-10H device + HailoRT ≥ 5.0
- Python 3.10+ with hailo_platform, plus the upstream Kokoro pipeline (you need the CPU embedding + LayerNorm + duration predictor + decoder; this HEF is the encoder block only)
- A phonemizer (misaki for English, espeak-ng under the hood)
End-to-end inference is not turn-key yet. The integration would look like:
# Sketch only: tokenize_to_kokoro_ids, run_hef and feeds_with are placeholders
# the integrator must write; style and speed come from the Kokoro voice setup.
import onnxruntime as ort
from hailo_platform import HEF
from misaki.en import G2P
# 1. Text -> phonemes (CPU, misaki/espeak)
g2p = G2P()
phonemes, _ = g2p("Hello world.")
ids = tokenize_to_kokoro_ids(phonemes, max_length=128)  # [1, 128] int64
# 2. CPU embedding + LayerNorm in FP32
pre = ort.InferenceSession("kokoro-pre.onnx")  # extracted subgraph
layernorm_out = pre.run(
    ["/encoder/bert/embeddings/LayerNorm/LayerNormalization_output_0"],
    {"input_ids": ids},
)[0]  # [1, 128, 128] float32
# 3. Encoder on Hailo-10H
hef = HEF("kokoro-albert-encoder.hef")
# ... configure the device, infer with layernorm_out reshaped to [1, 1, 128, 128] NHWC ...
encoder_out = run_hef(hef, layernorm_out)  # [1, 128, 512] float32
# 4. CPU duration predictor + iSTFTNet decoder (FP32) -> waveform
post = ort.InferenceSession("kokoro-post.onnx")
audio = post.run(None, feeds_with(encoder_out, style, speed))[0]  # None = all outputs
The "kokoro-pre" and "kokoro-post" subgraphs are NOT included here; extracting them is left to the integrator. The starting point is model.onnx from onnx-community/Kokoro-82M-v1.0-ONNX and the cut nodes documented above.
How to recompile from scratch
This compile is RAM-heavy (14 GB peak) and slow (2.5 h on Kaggle CPU). The bundled Kaggle kernel script handles the full pipeline end-to-end.
# Option A: Kaggle (recommended for low-RAM environments)
# Mirror the kernel chr1str/crisphailo-kokoro-phoneme-encoder-compile from
# recipe/crisphailo-kokoro-compile.py. The kernel handles:
# - apt: python3.10-venv, libgraphviz-dev, ...
# - DFC venv (Python 3.10 is required for the wheel ABI)
# - Clone CrispHailo (private repo OK; uses GH_TOKEN from a Kaggle dataset)
# - Download Kokoro ONNX from HF
# - Run make_kokoro_encoder_only.py + replace_pow3_with_mul.py
# - Run compile_moonshine_ruvector.py (which is the generic ALBERT/BERT compile)
# Inputs to attach: chr1str/hailo-dfc-wheel-530, chr1str/crispasr-hf-token
# Wall: ~2.5 h, output: hefs/kokoro-albert-encoder.hef + .har
# Option B: local Linux box (need >= 16 GB RAM)
python3.10 -m venv dfcvenv
. dfcvenv/bin/activate
pip install hailo_dataflow_compiler-5.3.0-py3-none-linux_x86_64.whl
pip install numpy onnx onnxruntime onnxsim huggingface_hub
# 1. Download upstream Kokoro ONNX (fp32, 311 MiB)
python -c "from huggingface_hub import hf_hub_download; \
hf_hub_download('onnx-community/Kokoro-82M-v1.0-ONNX', 'onnx/model.onnx', \
local_dir='work/kokoro')"
# 2. Extract encoder subgraph (shape-fix + onnxsim + extract_model)
python recipe/make_kokoro_encoder_only.py \
--in work/kokoro/onnx/model.onnx \
--out work/kokoro/kokoro_albert_encoder_seq128.onnx \
--seq 128
# 3. Rewrite Pow(x, 3.0) -> Mul (ALBERT tanh-approximation GELU)
python recipe/replace_pow3_with_mul.py \
--in work/kokoro/kokoro_albert_encoder_seq128.onnx \
--out work/kokoro/kokoro_albert_encoder_nopow3.onnx
# 4. Compile (no real calib; uses random-normal, as in production bringup)
python recipe/compile_moonshine_ruvector.py \
--onnx work/kokoro/kokoro_albert_encoder_nopow3.onnx \
--input-node "/encoder/bert/embeddings/LayerNorm/LayerNormalization_output_0" \
--seq-len 128 --channels 128 --input-layout NWC \
--hef-out kokoro-albert-encoder.hef \
--har-out kokoro-albert-encoder.har \
--model-name kokoro_albert
For the cosine fidelity check at the end, use recipe/eval-kokoro.py (a Kaggle kernel script that runs SDK_FP_OPTIMIZED vs SDK_QUANTIZED and emits the JSON shipped here as kokoro-eval_results.json). Wall: ~15 min on Kaggle CPU.
Eval results
kokoro-eval_results.json summary:
{
  "input_shape": [-1, 1, 128, 128],
  "output_shape": [16, 1, 128, 512],
  "n_samples": 16,
  "cos_mean": 0.72425,
  "cos_min": 0.54747,
  "cos_max": 0.80045,
  "mse_mean": 0.93454,
  "fp_range": [-8.224, 7.549],
  "q_range": [-7.398, 7.456]
}
Method: 16 random-Gaussian inputs N(0,1) shaped [1, 1, 128, 128] fed to both the SDK_FP_OPTIMIZED and SDK_QUANTIZED emulator contexts on the same HAR, per-sample cosine + MSE computed on the flattened [N, 1, 128, 512] outputs.
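The per-sample metrics are simple enough to restate in code. The sketch below is a standard-library illustration of cosine similarity and MSE over flattened outputs, not the code in recipe/eval-kokoro.py; the two short vectors are made-up stand-ins for the flattened emulator outputs. The last lines redo the relative-error arithmetic from the numbers in kokoro-eval_results.json.

```python
import math
from typing import Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two flattened tensors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two flattened tensors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Made-up stand-ins for one sample's FP and quantized outputs.
fp_out = [0.5, -1.25, 3.0, 7.5]
q_out = [0.4, -1.0, 2.75, 7.4]
print(f"cosine={cosine(fp_out, q_out):.5f}  mse={mse(fp_out, q_out):.5f}")

# Relative error implied by the shipped eval numbers.
mse_mean = 0.93454
fp_lo, fp_hi = -8.224, 7.549
rmse = math.sqrt(mse_mean)
rel = rmse / (fp_hi - fp_lo)
print(f"RMSE {rmse:.3f} over range {fp_hi - fp_lo:.3f} -> {100 * rel:.1f}% relative")
```

Taking the square root first puts the error in the same units as the output range, which is where the roughly 6% relative figure comes from.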
Files in this repo
kokoro-albert-encoder.hef - compiled HEF (65.83 MiB)
kokoro-eval_results.json - FP32 vs INT8 emulator cosine (16 samples)
source/
  kokoro_albert_encoder_seq128.onnx - post-extract ONNX (23.11 MiB)
  kokoro_albert_encoder_nopow3.onnx - post-Pow3-surgery ONNX (23.12 MiB; the DFC input)
recipe/
  make_kokoro_encoder_only.py - shape-fix + onnxsim + extract_model
  replace_pow3_with_mul.py - universal Pow(x,3) → Mul rewrite
  compile_moonshine_ruvector.py - generic ALBERT/BERT compile driver
  crisphailo-kokoro-compile.py - Kaggle bootstrap kernel (end-to-end)
  eval-kokoro.py - Kaggle eval kernel (FP32 vs INT8)
Attribution + licensing
- Source TTS model: hexgrad/Kokoro-82M, Apache-2.0. StyleTTS2 architecture, by hexgrad. ONNX export by onnx-community.
- Compile recipe: derived from RuVector (compile-encoder-hef.py; MIT, Copyright (c) 2025 rUv) for the Keras-serializable monkey-patch + multiproc_policy=disabled pattern. ALBERT-specific extensions (Pow(x,3) rewrite, shape-fix + onnxsim for cut extraction) are original to this work.
- Hailo DFC: compilation requires the Hailo Dataflow Compiler under Hailo's Developer Zone EULA. This repo does NOT redistribute the DFC.
- This HEF: distributed under Apache-2.0 (inherits source model license). The Python recipe scripts are Apache-2.0 with per-file attribution headers.
See also
- cstr/all-MiniLM-L6-v2-hailo10h: sister HEF for sentence-transformers/all-MiniLM-L6-v2, same recipe family, same cosine plateau.
- CrispHailo: full bringup log covering 4 research sessions, the Kokoro Kaggle migration (10 iterations of debugging the dataset mount + Python 3.10 ABI + GH_TOKEN auth), and the 22-line ALBERT-Hailo-10H technique-notes addendum.
Citation
@misc{kokoro-encoder-hailo10h,
title = {Kokoro-82M phoneme encoder Hailo-10H HEF (experimental)},
author = {CrispHailo project},
year = {2026},
howpublished = {\url{https://huggingface.co/cstr/Kokoro-82M-encoder-hailo10h}}
}
Bringup log dates: research session 4, Kaggle compile completed 2026-05-27, eval completed 2026-05-29. Compile wall: 2 h 25 min on Kaggle CPU at optimization_level=2 / compression_level=4. Emulator eval wall: 80 min total (DFC venv setup + 16-sample INT8 eval at 834 s; SDK_QUANTIZED is ~22× slower than SDK_FP_OPTIMIZED on the 12-layer ALBERT graph).