Kokoro-82M phoneme encoder — Hailo-10H HEF (experimental)

⚠️ Experimental / early-stage research artifact. This HEF is the ALBERT phoneme encoder subgraph of hexgrad/Kokoro-82M (StyleTTS2-based TTS), compiled for the Hailo-10H accelerator. Reproducibility is the headline; production audio quality is not evaluated yet — only the encoder block's INT8 numerical fidelity. See "Known limits" below.

This repo contains a HEF for the ALBERT phoneme encoder subgraph of Kokoro-82M on Hailo-10H (target architecture mercury) using DFC 5.3.0.

TL;DR


HEF	`kokoro-albert-encoder.hef` (65.83 MiB)
Target	Hailo-10H (`hailo10h` / mercury)
DFC	5.3.0
Source model	`hexgrad/Kokoro-82M` (Apache-2.0, 82M total params; this HEF covers the ~67M phoneme-encoder subgraph)
Source ONNX	`onnx-community/Kokoro-82M-v1.0-ONNX` `model.onnx` (fp32, 311 MiB)
Subgraph cut	`start_node=/encoder/bert/embeddings/LayerNorm/LayerNormalization_output_0`, `end_node=/encoder/bert_encoder/MatMul_output_0` — i.e. the embedding-hidden-mapping projection + 12-layer ALBERT transformer + the final encoder MatMul (which projects 768 → 512 for the duration predictor).
Sequence length	128 phoneme tokens (static)
Input	`[1, 128, 128]` NWC — output of CPU-side embedding + LayerNorm (FP32)
Output	`[1, 128, 512]` NWC — encoder MatMul output, feeds the duration predictor downstream
Quantization	INT8 (with `ew_add*` raised to a16_w16; ALBERT-specific `Pow(x, 3.0)` rewritten to `Mul(Mul(x,x), x)`)
Cosine vs FP32	~0.72 mean cosine on emulator (SDK_QUANTIZED vs SDK_FP_OPTIMIZED) over 16 random-Gaussian inputs
Compile wall	2 h 25 min on Kaggle CPU (30 GB RAM session, after 10 iterations of Kaggle-side debugging)

What this is

A reproducible compile of Kokoro's phoneme encoder onto Hailo-10H.
A worked example of the ALBERT-on-Hailo stack: the same RuVector recipe used for vanilla BERT (Keras-serializable monkey-patch + multiproc_policy=disabled) plus two ALBERT-specific ONNX surgeries:
1. Pow(x, 3.0) → Mul(Mul(x,x), x) — ALBERT's tanh-approximation GELU contains a cubic that DFC's Pow op refuses (only Pow with even integer exponent ≤ 2 is supported).
2. Shape-fix + onnxsim before extract_model — Kokoro's ONNX has Shape(input_ids) nodes whose values flow through Concat/Reshape deeper in the graph; without folding them, onnx.utils.extract_model refuses to extract the encoder subgraph.
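
To illustrate why the first surgery should not change the encoder's numerics, here is a minimal sketch of the tanh-approximation GELU with the cube written both ways. It is plain Python for illustration only and is not the code in replace_pow3_with_mul.py; the function names are hypothetical.

```python
import math

SQRT_2_OVER_PI = math.sqrt(2.0 / math.pi)
GELU_COEFF = 0.044715  # standard constant of the tanh-approximation GELU


def gelu_tanh_pow(x: float) -> float:
    """tanh-approximation GELU with the cubic written as Pow(x, 3.0)."""
    inner = SQRT_2_OVER_PI * (x + GELU_COEFF * math.pow(x, 3.0))
    return 0.5 * x * (1.0 + math.tanh(inner))


def gelu_tanh_mul(x: float) -> float:
    """Same function with the cubic written as Mul(Mul(x, x), x)."""
    cube = (x * x) * x
    inner = SQRT_2_OVER_PI * (x + GELU_COEFF * cube)
    return 0.5 * x * (1.0 + math.tanh(inner))


# Compare both forms on a grid covering roughly the activation range
# reported for this encoder ([-8, 8] in steps of 0.1).
max_abs_diff = max(
    abs(gelu_tanh_pow(v / 10.0) - gelu_tanh_mul(v / 10.0))
    for v in range(-80, 81)
)
print(f"max |pow - mul| over [-8, 8]: {max_abs_diff:.3e}")
```

Any difference between the two forms is at floating-point rounding level, so the rewrite is a graph-level change for the compiler's benefit rather than a change to the model.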

What this is not

End-to-end Kokoro TTS. This HEF covers only the encoder block. The full Kokoro pipeline also needs: token embedding + LayerNorm (CPU, cheap) on the input side, and the duration predictor + iSTFTNet decoder on the output side. Those are not compiled here; they would each need their own HEF or run on CPU.
Audio-quality evaluated. We measured INT8 fidelity at the encoder MatMul output (cosine 0.72 vs FP32 on random inputs). We did NOT generate any audio with this HEF and have no listening test, no MOS estimate, and no Whisper-WER measurement. The downstream duration predictor and iSTFTNet decoder are sensitive to encoder output noise; the actual perceptual TTS quality is unknown.
Tuned with real phoneme calibration. The calibration set was random Gaussian noise (synthesized in crisphailo-kokoro-compile.py because no labelled phoneme corpus was wired into the bringup path). A real-phoneme calibration set might shift the cosine number, in either direction.
Optimized. Compile was at optimization_level=2 / compression_level=4 (the levels that fit in 30 GB RAM during bringup). optimization_level=4 (full QAT) was not attempted due to RAM/time budget.

Known limits

Cosine ceiling: ~0.72 vs FP32 reference. The recipe imposes a quantization noise floor that is empirically identical for MiniLM BERT (cstr/all-MiniLM-L6-v2-hailo10h; 10.67M, no-mask) and this Kokoro ALBERT-base (66.89M, no-mask). The plateau is independent of model size, mask presence, or BERT vs ALBERT family; it is the noise floor of (matmul_correction zp_comp_block + ew_add a16_w16 + neg_exp rank=1) on Hailo-10H.
Output range slightly tighter than FP32: FP32 emulator output range [-8.22, +7.55], INT8 emulator range [-7.40, +7.46]. Mild clipping in the tails. MSE 0.93 (RMSE ≈ 0.97) over a value range of ~16, i.e. about 6% relative error.
Static sequence 128. For shorter phoneme sequences, pad with the Kokoro tokenizer's pad token; for longer, the HEF won't run at all (re-extract + recompile with the desired --seq).
6 NPU contexts in the compiled placement. Not a problem for inference, but expect ~6× the latency of a 1-context HEF on the same hardware.
Encoder MatMul output is 512-d, not 768-d. The Kokoro StyleTTS2 architecture projects the 768-d ALBERT hidden state down to 512 at the encoder MatMul we cut at, before feeding the duration predictor. So downstream code expects a [B, T, 512] tensor, not [B, T, 768].
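
The static-sequence limit above can be enforced on the CPU side before anything reaches the device. The following is a minimal sketch, assuming right-padding and a pad id of 0; both are assumptions, and the real pad id should come from the Kokoro tokenizer. The helper name is hypothetical.

```python
SEQ_LEN = 128  # static sequence length baked into the HEF


def pad_to_static_length(token_ids, pad_id=0, seq_len=SEQ_LEN):
    """Right-pad token ids to seq_len and refuse longer inputs.

    pad_id=0 is an assumption; use the Kokoro tokenizer's actual pad id.
    Returns the padded list and the number of real (non-pad) tokens so
    downstream code can ignore the padded positions.
    """
    n_real = len(token_ids)
    if n_real > seq_len:
        raise ValueError(
            f"{n_real} tokens exceed the static length {seq_len}; "
            "re-extract and recompile with a larger --seq"
        )
    padded = list(token_ids) + [pad_id] * (seq_len - n_real)
    return padded, n_real


padded, n_real = pad_to_static_length([5, 6, 7])
print(len(padded), n_real)
```

The encoder is described above as compiled without a mask, so padded positions may influence the outputs at real positions; that effect has not been measured here.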

How to deploy on Hailo-10H

Prereqs:

Hailo-10H device + HailoRT ≥ 5.0
Python 3.10+ with hailo_platform, plus the upstream Kokoro pipeline (you need the CPU embedding + LayerNorm + duration predictor + decoder — this HEF is the encoder block only)
A phonemizer (misaki for English, espeak-ng under the hood)

End-to-end inference is not turn-key yet. The integration would look like:

# 1. Text -> phonemes (CPU, misaki/espeak)
from misaki.en import G2P
g2p = G2P()
phonemes, _ = g2p("Hello world.")
ids = tokenize_to_kokoro_ids(phonemes, max_length=128)   # [1, 128] int64

# 2. CPU embedding + LayerNorm in FP32
import onnxruntime as ort
pre = ort.InferenceSession("kokoro-pre.onnx")            # extracted subgraph
layernorm_out = pre.run(["/encoder/bert/embeddings/LayerNorm/LayerNormalization_output_0"],
                        {"input_ids": ids})[0]            # [1, 128, 128] float32

# 3. Encoder on Hailo-10H
from hailo_platform import HEF, VDevice, ConfigureParams  # plus the vstream helpers your HailoRT version needs
hef = HEF("kokoro-albert-encoder.hef")
# ... configure, infer with layernorm_out reshaped to [1, 1, 128, 128] NHWC ...
encoder_out = run_hef(hef, layernorm_out)                # [1, 128, 512] float32

# 4. CPU duration predictor + iSTFTNet decoder (FP32) -> waveform
post = ort.InferenceSession("kokoro-post.onnx")
audio = post.run(..., feeds_with(encoder_out, style, speed))[0]

The "kokoro-pre" and "kokoro-post" subgraphs are NOT included here — extracting them is left to the integrator. The starting point is model.onnx from onnx-community/Kokoro-82M-v1.0-ONNX and the cut nodes documented above.
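
The layout handling around step 3 can be isolated from the device code. Below is a minimal sketch with NumPy, assuming the device returns the encoder output as [1, 1, 128, 512] (the shape seen in the emulator eval) or already as [1, 128, 512]. The helper names are hypothetical, and no HailoRT call is involved.

```python
import numpy as np

SEQ_LEN, IN_CH, OUT_CH = 128, 128, 512


def to_hef_input(layernorm_out):
    """[1, 128, 128] NWC -> [1, 1, 128, 128] NHWC (H=1), contiguous float32."""
    x = np.asarray(layernorm_out, dtype=np.float32)
    if x.shape != (1, SEQ_LEN, IN_CH):
        raise ValueError(f"expected (1, {SEQ_LEN}, {IN_CH}), got {x.shape}")
    return np.ascontiguousarray(x[:, np.newaxis, :, :])


def from_hef_output(raw):
    """Drop the H=1 axis if present and return [1, 128, 512] float32.

    reshape raises ValueError when the element count does not match,
    which catches a wrong output tensor early.
    """
    y = np.asarray(raw, dtype=np.float32)
    return y.reshape(1, SEQ_LEN, OUT_CH)


hef_in = to_hef_input(np.zeros((1, SEQ_LEN, IN_CH)))
enc_out = from_hef_output(np.zeros((1, 1, SEQ_LEN, OUT_CH)))
print(hef_in.shape, enc_out.shape)
```

Keeping these checks outside the inference call makes shape mistakes fail on the CPU with a readable message instead of inside the runtime.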

How to recompile from scratch

This compile is RAM-heavy (~14 GB peak) and slow (~2.5 h on Kaggle CPU). The bundled Kaggle kernel script handles the full pipeline end-to-end.

# Option A — Kaggle (recommended for low-RAM environments)
# Mirror the kernel chr1str/crisphailo-kokoro-phoneme-encoder-compile from
# recipe/crisphailo-kokoro-compile.py. The kernel handles:
#   - apt: python3.10-venv, libgraphviz-dev, ...
#   - DFC venv (Python 3.10 is required for the wheel ABI)
#   - Clone CrispHailo (private repo OK — uses GH_TOKEN from a Kaggle dataset)
#   - Download Kokoro ONNX from HF
#   - Run make_kokoro_encoder_only.py + replace_pow3_with_mul.py
#   - Run compile_moonshine_ruvector.py (which is the generic ALBERT/BERT compile)
# Inputs to attach: chr1str/hailo-dfc-wheel-530, chr1str/crispasr-hf-token
# Wall: ~2.5 h, output: hefs/kokoro-albert-encoder.hef + .har

# Option B — local Linux box (need >= 16 GB RAM)
python3.10 -m venv dfcvenv
. dfcvenv/bin/activate
pip install hailo_dataflow_compiler-5.3.0-py3-none-linux_x86_64.whl
pip install numpy onnx onnxruntime onnxsim huggingface_hub

# 1. Download upstream Kokoro ONNX (fp32, 311 MiB)
python -c "from huggingface_hub import hf_hub_download; \
  hf_hub_download('onnx-community/Kokoro-82M-v1.0-ONNX', 'onnx/model.onnx', \
                  local_dir='work/kokoro')"

# 2. Extract encoder subgraph (shape-fix + onnxsim + extract_model)
python recipe/make_kokoro_encoder_only.py \
    --in work/kokoro/onnx/model.onnx \
    --out work/kokoro/kokoro_albert_encoder_seq128.onnx \
    --seq 128

# 3. Rewrite Pow(x, 3.0) -> Mul (ALBERT tanh-approximation GELU)
python recipe/replace_pow3_with_mul.py \
    --in work/kokoro/kokoro_albert_encoder_seq128.onnx \
    --out work/kokoro/kokoro_albert_encoder_nopow3.onnx

# 4. Compile (no real calib — uses random-normal, as in production bringup)
python recipe/compile_moonshine_ruvector.py \
    --onnx work/kokoro/kokoro_albert_encoder_nopow3.onnx \
    --input-node "/encoder/bert/embeddings/LayerNorm/LayerNormalization_output_0" \
    --seq-len 128 --channels 128 --input-layout NWC \
    --hef-out kokoro-albert-encoder.hef \
    --har-out kokoro-albert-encoder.har \
    --model-name kokoro_albert

For the cosine fidelity check at the end, use recipe/eval-kokoro.py (a Kaggle kernel script that runs SDK_FP_OPTIMIZED vs SDK_QUANTIZED and emits the JSON shipped here as kokoro-eval_results.json). Wall: ~15 min on Kaggle CPU.

Eval results

kokoro-eval_results.json summary:

{
  "input_shape":  [-1, 1, 128, 128],
  "output_shape": [16, 1, 128, 512],
  "n_samples": 16,
  "cos_mean": 0.72425,
  "cos_min":  0.54747,
  "cos_max":  0.80045,
  "mse_mean": 0.93454,
  "fp_range": [-8.224, 7.549],
  "q_range":  [-7.398, 7.456]
}

Method: 16 random-Gaussian inputs N(0,1) shaped [1, 1, 128, 128] fed to both the SDK_FP_OPTIMIZED and SDK_QUANTIZED emulator contexts on the same HAR, per-sample cosine + MSE computed on the flattened [N, 1, 128, 512] outputs.
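
The per-sample metrics described above can be sketched in a few lines. This is a pure-Python illustration of the method, not the code in recipe/eval-kokoro.py; the function names are hypothetical, and each output is assumed to be already flattened to a 1-D sequence of floats. The last lines redo the relative-error arithmetic from the reported summary numbers.

```python
import math


def cosine(a, b):
    """Cosine similarity between two flattened outputs."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mse(a, b):
    """Mean squared error between two flattened outputs."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def summarize(fp_outputs, q_outputs):
    """Per-sample cosine and MSE, reduced to the summary fields."""
    cos = [cosine(f, q) for f, q in zip(fp_outputs, q_outputs)]
    err = [mse(f, q) for f, q in zip(fp_outputs, q_outputs)]
    return {
        "n_samples": len(cos),
        "cos_mean": sum(cos) / len(cos),
        "cos_min": min(cos),
        "cos_max": max(cos),
        "mse_mean": sum(err) / len(err),
    }


# Relative error from the reported numbers: RMSE over the FP32 value range.
rmse = math.sqrt(0.93454)
fp_span = 7.549 - (-8.224)
relative_rmse = rmse / fp_span
print(f"RMSE {rmse:.3f} over span {fp_span:.3f} -> {relative_rmse:.1%}")
```

With the reported mse_mean and fp_range, this gives a relative RMSE of about 6%, the figure quoted under "Known limits".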

Files in this repo

kokoro-albert-encoder.hef                  ← compiled HEF (65.83 MiB)
kokoro-eval_results.json                   ← FP32 vs INT8 emulator cosine (16 samples)
source/
  kokoro_albert_encoder_seq128.onnx        ← post-extract ONNX (23.11 MiB)
  kokoro_albert_encoder_nopow3.onnx        ← post-Pow3-surgery ONNX (23.12 MiB; the DFC input)
recipe/
  make_kokoro_encoder_only.py              ← shape-fix + onnxsim + extract_model
  replace_pow3_with_mul.py                 ← universal Pow(x,3) → Mul rewrite
  compile_moonshine_ruvector.py            ← generic ALBERT/BERT compile driver
  crisphailo-kokoro-compile.py             ← Kaggle bootstrap kernel (end-to-end)
  eval-kokoro.py                           ← Kaggle eval kernel (FP32 vs INT8)

Attribution + licensing

Source TTS model: hexgrad/Kokoro-82M — Apache-2.0. StyleTTS2 architecture, by hexgrad. ONNX export by onnx-community.
Compile recipe: derived from RuVector — compile-encoder-hef.py (MIT, Copyright (c) 2025 rUv) for the Keras-serializable monkey-patch + multiproc_policy=disabled pattern. ALBERT-specific extensions (Pow(x,3) rewrite, shape-fix + onnxsim for cut extraction) are original to this work.
Hailo DFC: compilation requires the Hailo Dataflow Compiler under Hailo's Developer Zone EULA. This repo does NOT redistribute the DFC.
This HEF: distributed under Apache-2.0 (inherits source model license). The Python recipe scripts are Apache-2.0 with per-file attribution headers.

Citation

@misc{kokoro-encoder-hailo10h,
  title  = {Kokoro-82M phoneme encoder Hailo-10H HEF (experimental)},
  author = {CrispHailo project},
  year   = {2026},
  howpublished = {\url{https://huggingface.co/cstr/Kokoro-82M-encoder-hailo10h}}
}

Bringup log dates: research session 4, Kaggle compile completed 2026-05-27, eval completed 2026-05-29. Compile wall: 2 h 25 min on Kaggle CPU at optimization_level=2 / compression_level=4. Emulator eval wall: 80 min total (DFC venv setup + 16-sample INT8 eval at 834 s; SDK_QUANTIZED is ~22× slower than SDK_FP_OPTIMIZED on the 12-layer ALBERT graph).
