IMBE-ASR Base (48.6M params, d=512, 8 layers)

Edge-friendly speech recognition from IMBE vocoder parameters. Runs at 15x real-time on a Raspberry Pi 5.

Code: trunk-reporter/imbe-asr | Best model: imbe-asr-large-1024d | P25 fine-tuned: imbe-asr-base-512d-p25

Results

Evaluated on the LibriSpeech-IMBE speaker-split validation set. The included 3-gram KenLM cuts WER by more than half (10.55% → 4.84%) at minimal overhead.

| Decode method | WER | CER |
|---|---|---|
| Greedy | 10.55% | 3.32% |
| Beam + 3-gram KenLM (α=0.7, β=2.0) | 4.84% | 1.99% |
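WER is word-level edit distance divided by the number of reference words. A minimal pure-Python sketch for spot-checking transcripts (not the project's evaluation code):

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level Levenshtein distance over reference length."""
    r, h = ref.split(), hyp.split()
    # One-row dynamic-programming Levenshtein over words
    d = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(h, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (rw != hw))
    return d[-1] / len(r)
```

For example, one substitution in a four-word reference gives `wer("the cat sat here", "the cat sit here") == 0.25`.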

For maximum accuracy, use imbe-asr-large-1024d (3.35% WER with 5-gram LM). Use this model for edge deployment where the large model doesn't fit.

Architecture

| Parameter | Value |
|---|---|
| d_model | 512 |
| Layers | 8 |
| Heads | 8 |
| d_ff | 2048 |
| Parameters | 48.6M |

A Conformer-CTC model, trained for 25 epochs on ~1,220 hours of IMBE-encoded speech.

Files

| File | Format | Size | Notes |
|---|---|---|---|
| model.safetensors | SafeTensors | 205 MB | PyTorch weights |
| config.json | JSON | | Architecture config |
| model.onnx | ONNX fp32 | 195 MB | Full precision |
| model_int8.onnx | ONNX int8 | 57 MB | Quantized, Python ORT |
| model_uint8.onnx | ONNX uint8 | 59 MB | Quantized, C engine compatible |
| stats.npz | NumPy | 2 KB | Normalization stats (required) |
| lm/3gram.bin | KenLM trie (3-gram, q8) | 501 MB | Language model for beam search |
| lm/unigrams.txt | Vocabulary | 9 MB | Unigrams for beam decoder |

Edge Deployment (Raspberry Pi 5, 4GB)

| Runtime | Format | 10 s call | RTF | RAM |
|---|---|---|---|---|
| C engine (70 KB) | fp32 ONNX | 660 ms | 0.07x | ~300 MB |
| C engine (70 KB) | uint8 ONNX | 788 ms | 0.08x | ~140 MB |
| PyTorch | safetensors | 800 ms | 0.08x | 995 MB |
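The RTF column is decode latency divided by audio duration (values below 1.0 are faster than real time), and the headline 15x figure is its reciprocal. A quick check against the table's first row:

```python
def rtf(latency_s: float, audio_s: float) -> float:
    """Real-time factor: processing time per second of audio."""
    return latency_s / audio_s

# C engine, fp32 ONNX: 660 ms to decode a 10 s call
r = rtf(0.660, 10.0)  # 0.066, i.e. ~0.07x, or ~15x real time
```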

Usage

Greedy decode (fast, no LM dependencies)

```python
import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("model_int8.onnx")
stats = np.load("stats.npz")

# raw_params: (num_frames, 170) array of IMBE vocoder parameters
features = ((raw_params - stats["mean"]) / stats["std"]).astype(np.float32)
log_probs, out_lengths = session.run(None, {
    "features": features.reshape(1, -1, 170),
    "lengths": np.array([features.shape[0]], dtype=np.int64),
})
```
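The snippet above stops at frame-level log-probabilities; turning them into text takes a standard CTC greedy collapse (argmax per frame, merge repeats, drop blanks). A sketch, assuming blank at index 0 and the character set listed in the beam-search example below:

```python
import numpy as np

VOCAB = list(" ABCDEFGHIJKLMNOPQRSTUVWXYZ'")
LABELS = [""] + VOCAB  # index 0 is the CTC blank

def greedy_ctc_decode(log_probs: np.ndarray) -> str:
    """Collapse a (T, vocab) log-probability matrix: take the argmax of each
    frame, merge consecutive repeats, then drop blank tokens."""
    ids = log_probs.argmax(axis=-1)
    out, prev = [], -1
    for i in ids:
        if i != prev and i != 0:
            out.append(LABELS[i])
        prev = i
    return "".join(out)
```

Applied to the session output: `text = greedy_ctc_decode(log_probs[0, :out_lengths[0]])`.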

Beam search + KenLM (recommended)

```python
import onnxruntime as ort
import numpy as np
from pyctcdecode import build_ctcdecoder

session = ort.InferenceSession("model_int8.onnx")
stats = np.load("stats.npz")

VOCAB = list(" ABCDEFGHIJKLMNOPQRSTUVWXYZ'")
labels = [""] + VOCAB  # index 0 is the CTC blank
decoder = build_ctcdecoder(
    labels=labels,
    kenlm_model_path="lm/3gram.bin",
    unigrams=open("lm/unigrams.txt").read().splitlines(),
    alpha=0.7,   # LM weight, tuned on LibriSpeech-IMBE
    beta=2.0,    # word insertion bonus
)

# raw_params: (num_frames, 170) array of IMBE vocoder parameters
features = ((raw_params - stats["mean"]) / stats["std"]).astype(np.float32)
log_probs, out_lengths = session.run(None, {
    "features": features.reshape(1, -1, 170),
    "lengths": np.array([features.shape[0]], dtype=np.int64),
})
text = decoder.decode(log_probs[0, :out_lengths[0]], beam_width=100)
```

Install decoder dependencies: `pip install pyctcdecode kenlm`

Limitations
