Python ONNX inference example needed
Hi,
Thank you for publishing this model — it looks exactly like what I need for a Quran memorization app with real-time word highlighting.
I'm trying to run inference with the ONNX model in Python but I'm hitting shape errors with the cache tensors. I've tried various combinations of cache sizes and chunk lengths but keep getting Where node and Add node broadcast errors.
Could you share a minimal working Python example showing:
- Correct initial shapes for cache_last_channel, cache_last_time, and cache_last_channel_len
- The expected chunk size in samples or mel frames
- How to decode the output token ids to Arabic text (tokenizer / vocab file)
I also noticed the repo only contains ONNX files and no .nemo checkpoint, so nemo_asr.models.ASRModel.from_pretrained() fails with a missing model_config.yaml error. Is there a .nemo file available, or is ONNX the only supported inference path?
Here is what I have so far that fails:
import onnxruntime as ort
import numpy as np
import librosa
sess = ort.InferenceSession("model.q8.onnx", providers=["CPUExecutionProvider"])
# What should these shapes be?
cache_last_channel = np.zeros((1, 17, ???, 512), dtype=np.float32)
cache_last_time = np.zeros((1, 17, 512, ???), dtype=np.float32)
cache_last_channel_len = np.array([0], dtype=np.int64)
# What chunk size in samples should we use?
chunk_samples = ???
Errors I'm seeing:
Where node ... Attempting to broadcast an axis by a dimension other than 1. 11 by 81
Add node ... 10 by 72
Thank you so much for your help!
Hi, and sorry for the missing pieces. I just pushed two things to the repo that should unblock you:
- tokenizer.model (it was missing, you need it to decode). It is a SentencePiece BPE model, blank id = 1024.
- streaming_inference_example.py, a tested minimal pure-ONNX example that transcribes correctly end to end.
Your broadcast errors come from the cache shapes. The correct INITIAL shapes (batch-first for ONNX) are:
cache_last_channel = np.zeros((1, 17, 70, 512), np.float32)
cache_last_time = np.zeros((1, 17, 512, 8), np.float32)
cache_last_channel_len = np.zeros((1,), np.int64)
17 = encoder layers, 70 = left-context cache, 8 = conv cache. Feed each step's returned cache_*_next straight back in as the next step's cache_* (they are already batch-first).
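For reference, the cache hand-off described above can be sketched roughly as follows. This is only an illustration, not the code in streaming_inference_example.py: `init_caches`, `stream_chunks`, and the `run_step` callable are hypothetical names, and `run_step` stands in for whatever wraps `sess.run` in your code. The mel layout `(1, 80, T)` is an assumption; check the actual tensor names and shapes with `sess.get_inputs()` and `sess.get_outputs()`.

```python
import numpy as np

def init_caches(layers=17, left_ctx=70, d_model=512, conv_ctx=8):
    """Zero caches with the batch-first initial shapes quoted above."""
    return {
        "cache_last_channel": np.zeros((1, layers, left_ctx, d_model), np.float32),
        "cache_last_time": np.zeros((1, layers, d_model, conv_ctx), np.float32),
        "cache_last_channel_len": np.zeros((1,), np.int64),
    }

def stream_chunks(run_step, mel, chunk_frames=112):
    """Feed `mel` chunk by chunk, carrying the caches from step to step.

    mel:      normalized log-mel features, ASSUMED layout (1, 80, T).
    run_step: hypothetical callable (chunk, caches) -> (logits, next_caches).
              In real code it would call sess.run and rename each returned
              cache_*_next output back to its cache_* input name.
    """
    caches = init_caches()
    outputs = []
    for start in range(0, mel.shape[-1], chunk_frames):
        # The last chunk may be shorter than chunk_frames; whether the
        # exported model accepts that or needs padding is not covered here.
        chunk = mel[..., start:start + chunk_frames]
        logits, caches = run_step(chunk, caches)  # reuse the returned caches
        outputs.append(logits)
    return outputs, caches
```

The key point is that the zero caches are created once, before the first chunk, and every later step reuses what the previous step returned.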
The model takes 80-dim log-mel, NOT raw audio: n_fft=512, win_length=400 (25ms hann), hop_length=160 (10ms), 80 slaney-norm mel, power=2.0, log(x + 2**-24), preemph 0.97. Then apply the fixed-global CMVN in streaming_global_cmvn.npz (use the tlog_* constants for phone audio, clean_* for studio). Feed ~112 mel frames per step. Decode CTC-greedy (collapse repeats, drop blank=1024) then tokenizer.decode.
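To make the numbers above concrete, here is a small sketch of the chunk-size arithmetic and the CTC-greedy step. It assumes the logits for one step have shape `(T, vocab)` or `(1, T, vocab)` with the blank at index 1024, and a 16 kHz sample rate (implied by 400 samples = 25 ms). `ctc_greedy` is a hypothetical helper name, and the `prev` argument is my own addition so that repeats are also collapsed across chunk boundaries.

```python
import numpy as np

BLANK_ID = 1024        # blank id quoted above
HOP_LENGTH = 160       # 10 ms hop
SAMPLE_RATE = 16000    # ASSUMPTION: implied by win_length=400 being 25 ms
CHUNK_FRAMES = 112     # approximate mel frames per step

# Roughly how much audio one step covers (ignores STFT edge padding).
chunk_samples = CHUNK_FRAMES * HOP_LENGTH      # 17920 samples
chunk_seconds = chunk_samples / SAMPLE_RATE    # 1.12 s

def ctc_greedy(logits, blank=BLANK_ID, prev=None):
    """Greedy CTC: argmax per frame, collapse repeats, drop blanks.

    logits: array of shape (T, vocab) or (1, T, vocab) (assumed layout).
    prev:   last frame-level id of the previous chunk, or None at the start.
    Returns (token_ids, last_frame_id) so the state can be carried over.
    """
    frame_ids = np.asarray(logits).argmax(axis=-1).reshape(-1).tolist()
    tokens = []
    for i in frame_ids:
        if i != prev and i != blank:
            tokens.append(i)
        prev = i
    return tokens, prev
```

The resulting ids can then be turned into text with the SentencePiece Python package, for example `sentencepiece.SentencePieceProcessor(model_file="tokenizer.model").decode(ids)`.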
ONNX is the supported inference path; no .nemo file is needed, and the example covers the whole pipeline. Let me know if anything else comes up. baraka Allah feek.