IMBE-ASR: Speech Recognition from Vocoder Parameters
ASR directly from P25 IMBE codec parameters: skip audio reconstruction and go straight from the digital bitstream to text. 1.9% WER on LibriSpeech-IMBE.
Edge-friendly speech recognition from IMBE vocoder parameters. Runs at 15x real-time on a Raspberry Pi 5.
Code: trunk-reporter/imbe-asr | Best model: imbe-asr-large-1024d | P25 fine-tuned: imbe-asr-base-512d-p25
Evaluated on the LibriSpeech-IMBE speaker-split validation set. The included 3-gram KenLM cuts WER by more than half with minimal overhead.
| Decode method | WER | CER |
|---|---|---|
| Greedy | 10.55% | 3.32% |
| Beam + 3-gram KenLM (α=0.7, β=2.0) | 4.84% | 1.99% |
For maximum accuracy, use imbe-asr-large-1024d (3.35% WER with a 5-gram LM). Use this model for edge deployment where the large model doesn't fit.
| Parameter | Value |
|---|---|
| d_model | 512 |
| Layers | 8 |
| Heads | 8 |
| d_ff | 2048 |
| Parameters | 48.6M |
Conformer-CTC, trained on ~1,220 hours of IMBE-encoded speech, 25 epochs.
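A back-of-the-envelope check that the table's configuration is consistent with the stated 48.6M parameters. The per-module breakdown below is an assumption about a standard Conformer block (MHSA, two macaron feed-forward modules, a convolution module with an assumed depthwise kernel of ~31), not the repo's exact architecture:

```python
d_model, d_ff, layers = 512, 2048, 8

mhsa = 4 * d_model * d_model                  # Q, K, V, and output projections
ffn = 2 * (2 * d_model * d_ff)                # two macaron feed-forward modules
conv = 2 * d_model * d_model + 31 * d_model   # pointwise convs + depthwise (kernel ~31, assumed)
per_layer = mhsa + ffn + conv

total = layers * per_layer
print(f"{total / 1e6:.1f}M")  # → 46.3M
```

The remaining ~2M parameters would come from the subsampling frontend, layer norms, and the CTC output head, which this sketch ignores.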
| File | Format | Size | Notes |
|---|---|---|---|
| model.safetensors | SafeTensors | 205 MB | PyTorch weights |
| config.json | JSON | — | Architecture config |
| model.onnx | ONNX fp32 | 195 MB | Full precision |
| model_int8.onnx | ONNX int8 | 57 MB | Quantized, Python ORT |
| model_uint8.onnx | ONNX uint8 | 59 MB | Quantized, C engine compatible |
| stats.npz | NumPy | 2 KB | Normalization stats (required) |
| lm/3gram.bin | KenLM trie (3-gram, q8) | 501 MB | Language model for beam search |
| lm/unigrams.txt | Vocabulary | 9 MB | Unigrams for beam decoder |
| Runtime | Format | 10s call | RTF | RAM |
|---|---|---|---|---|
| C engine (70KB) | fp32 ONNX | 660ms | 0.07x | ~300 MB |
| C engine (70KB) | uint8 ONNX | 788ms | 0.08x | ~140 MB |
| PyTorch | safetensors | 800ms | 0.08x | 995 MB |
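The RTF column is the standard real-time factor: processing time divided by audio duration, so values below 1.0x are faster than real time. A quick check against the table's first row:

```python
def rtf(latency_s: float, audio_s: float) -> float:
    """Real-time factor: processing time / audio duration."""
    return latency_s / audio_s

print(round(rtf(0.660, 10.0), 2))  # C engine, fp32: 0.07
print(round(rtf(0.788, 10.0), 2))  # C engine, uint8: 0.08
```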
Quick start (greedy path, int8 ONNX):

```python
import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("model_int8.onnx")
stats = np.load("stats.npz")

# raw_params: (num_frames, 170) array of per-frame IMBE codec parameters
features = ((raw_params - stats["mean"]) / stats["std"]).astype(np.float32)
log_probs, out_lengths = session.run(None, {
    "features": features.reshape(1, -1, 170),
    "lengths": np.array([features.shape[0]], dtype=np.int64),
})
```
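The model's `log_probs` output still needs CTC decoding to become text. A minimal greedy decoder, assuming index 0 is the CTC blank and the label order matches the `labels` list used in the beam-search example:

```python
import numpy as np

# Blank at index 0, then space + A-Z + apostrophe (same order as the beam-search example)
LABELS = [""] + list(" ABCDEFGHIJKLMNOPQRSTUVWXYZ'")

def greedy_ctc_decode(log_probs: np.ndarray) -> str:
    """Standard CTC greedy decode: argmax per frame, collapse repeats, drop blanks."""
    ids = log_probs.argmax(axis=-1)
    out, prev = [], -1
    for i in ids:
        if i != prev and i != 0:  # skip repeated labels and the blank (index 0)
            out.append(LABELS[i])
        prev = i
    return "".join(out)

# text = greedy_ctc_decode(log_probs[0, :out_lengths[0]])
```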
Beam search with the bundled 3-gram LM (install dependencies first: `pip install pyctcdecode kenlm`):

```python
import onnxruntime as ort
import numpy as np
from pyctcdecode import build_ctcdecoder

session = ort.InferenceSession("model_int8.onnx")
stats = np.load("stats.npz")

VOCAB = list(" ABCDEFGHIJKLMNOPQRSTUVWXYZ'")
labels = [""] + VOCAB  # index 0 is the CTC blank

decoder = build_ctcdecoder(
    labels=labels,
    kenlm_model_path="lm/3gram.bin",
    unigrams=open("lm/unigrams.txt").read().splitlines(),
    alpha=0.7,  # LM weight, tuned on LibriSpeech-IMBE
    beta=2.0,   # word insertion bonus
)

# raw_params: (num_frames, 170) array of per-frame IMBE codec parameters
features = ((raw_params - stats["mean"]) / stats["std"]).astype(np.float32)
log_probs, out_lengths = session.run(None, {
    "features": features.reshape(1, -1, 170),
    "lengths": np.array([features.shape[0]], dtype=np.int64),
})
text = decoder.decode(log_probs[0, :out_lengths[0]], beam_width=100)
```