Python ONNX inference example needed

by seraaj - opened Jun 5

Hi,

Thank you for publishing this model — it looks exactly like what I need for a Quran memorization app with real-time word highlighting.

I'm trying to run inference with the ONNX model in Python, but I'm hitting shape errors with the cache tensors. I've tried various combinations of cache sizes and chunk lengths but keep getting broadcast errors from Where and Add nodes.

Could you share a minimal working Python example showing:

Correct initial shapes for cache_last_channel, cache_last_time, and cache_last_channel_len
The expected chunk size in samples or mel frames
How to decode the output token ids to Arabic text (tokenizer / vocab file)

I also noticed the repo only contains ONNX files and no .nemo checkpoint, so nemo_asr.models.ASRModel.from_pretrained() fails with a missing model_config.yaml error. Is there a .nemo file available, or is ONNX the only supported inference path?

Here is what I have so far that fails:

import onnxruntime as ort
import numpy as np
import librosa

sess = ort.InferenceSession("model.q8.onnx", providers=["CPUExecutionProvider"])

# What should these shapes be?
cache_last_channel     = np.zeros((1, 17, ???, 512), dtype=np.float32)
cache_last_time        = np.zeros((1, 17, 512, ???), dtype=np.float32)
cache_last_channel_len = np.array([0], dtype=np.int64)

# What chunk size in samples should we use?
chunk_samples = ???

Errors I'm seeing:

Where node ... Attempting to broadcast an axis by a dimension other than 1. 11 by 81
Add node ... 10 by 72

Thank you so much for your help!

Muno459

Owner Jun 5

Hi, and sorry for the missing pieces. I just pushed two things to the repo that should unblock you:

tokenizer.model (it was missing, and you need it to decode). It is a SentencePiece BPE model; the blank id is 1024.
streaming_inference_example.py, a tested minimal pure-ONNX example that transcribes correctly end to end.

Your broadcast errors come from the cache shapes. The correct INITIAL shapes (batch-first for ONNX) are:

cache_last_channel     = np.zeros((1, 17, 70, 512), np.float32)
cache_last_time        = np.zeros((1, 17, 512, 8),  np.float32)
cache_last_channel_len = np.zeros((1,), np.int64)

17 = encoder layers, 70 = left-context cache, 8 = conv cache. Feed each step's returned cache_*_next straight back in as the next step's cache_* (they are already batch-first).
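The cache feedback described above can be sketched roughly as follows. This is a minimal illustration, not the code in streaming_inference_example.py. The feature input names (audio_signal, length), the (1, 80, T) feature layout, and the output order (logits first, then the three cache_*_next tensors) are assumptions; check them against sess.get_inputs() and sess.get_outputs(). run_step is a hypothetical callable wrapping the session, so the loop can be exercised without the model file.

```python
import numpy as np

def init_caches():
    """Initial cache tensors with the batch-first shapes given above."""
    return {
        "cache_last_channel": np.zeros((1, 17, 70, 512), np.float32),
        "cache_last_time": np.zeros((1, 17, 512, 8), np.float32),
        "cache_last_channel_len": np.zeros((1,), np.int64),
    }

def stream(run_step, feats, chunk_frames=112):
    """Run a streaming model chunk by chunk, feeding caches back in.

    feats:    (80, T) float32 log-mel features (layout is an assumption).
    run_step: hypothetical callable, feeds dict -> (logits, channel_next,
              time_next, channel_len_next). With onnxruntime this could be
              `lambda feeds: sess.run(None, feeds)` if the outputs come
              back in that order (an assumption).
    """
    caches = init_caches()
    all_logits = []
    for start in range(0, feats.shape[1], chunk_frames):
        chunk = feats[:, start:start + chunk_frames][None]  # (1, 80, t)
        feeds = {
            "audio_signal": chunk.astype(np.float32),        # assumed name
            "length": np.array([chunk.shape[2]], np.int64),  # assumed name
            **caches,
        }
        logits, ch_next, tm_next, len_next = run_step(feeds)
        # The returned caches become the next step's inputs unchanged.
        caches = {
            "cache_last_channel": ch_next,
            "cache_last_time": tm_next,
            "cache_last_channel_len": len_next,
        }
        all_logits.append(logits)
    return all_logits
```

Note that the last chunk may be shorter than chunk_frames; whether the exported graph accepts a short final chunk or needs padding is not stated here, so treat that as something to verify.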

The model takes 80-dim log-mel features, NOT raw audio: n_fft=512, win_length=400 (25 ms Hann), hop_length=160 (10 ms), 80 Slaney-normalized mel bins, power=2.0, log(x + 2**-24), pre-emphasis 0.97. Then apply the fixed global CMVN from streaming_global_cmvn.npz (use the tlog_* constants for phone audio, clean_* for studio audio). Feed ~112 mel frames per step. Decode with CTC greedy (collapse repeats, drop blank=1024), then call tokenizer.decode.
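The CTC-greedy step mentioned above (collapse repeats, drop the blank) can be sketched like this. It is a generic illustration rather than the repo's own code; the function name ctc_greedy_ids and the (T, V) logits layout are assumptions, and the SentencePiece call is shown only as a comment because it needs tokenizer.model on disk.

```python
import numpy as np

BLANK_ID = 1024  # blank id stated in this thread

def ctc_greedy_ids(logits, blank_id=BLANK_ID):
    """Greedy CTC decode for one utterance.

    logits: array of shape (T, V) with one score vector per frame
            (layout is an assumption; squeeze the batch axis first).
    Returns the token ids left after collapsing consecutive repeats
    and dropping the blank.
    """
    best = np.asarray(logits).argmax(axis=-1)
    ids = []
    prev = None
    for tok in best.tolist():
        # Emit a token only when it differs from the previous frame
        # and is not the blank symbol.
        if tok != prev and tok != blank_id:
            ids.append(tok)
        prev = tok
    return ids

# Turning ids into text with SentencePiece (not executed here):
#   import sentencepiece as spm
#   sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
#   text = sp.decode(ids)
```

When decoding chunk by chunk, the repeat-collapsing state has to survive chunk boundaries; one simple option is to concatenate the per-chunk logits along the time axis before calling the function.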

ONNX is the supported inference path; no .nemo file is needed, and the example covers the whole pipeline. Let me know if anything else comes up. baraka Allah feek.
