language: ar
license: apache-2.0
library_name: nemo
pipeline_tag: automatic-speech-recognition
tags:
- automatic-speech-recognition
- streaming
- cache-aware
- quran
- arabic
- fastconformer
- onnx
datasets:
- tarteel-ai/everyayah
- tarteel-ai/tlog
- obadx/muaalem-annotated-v3
FastConformer Quran ASR - Streaming (cache-aware)
Low-latency streaming Arabic Quran speech recognition. It transcribes live while you recite (causal, cache-aware FastConformer), for recitation-tracking use cases such as live word highlighting and real-time feedback. Trained and validated on real phone-recorded recitation, not just clean studio audio.
This is the streaming counterpart to the offline model (highest accuracy): Muno459/fastconformer-quran
Offline vs streaming - which to use
| | Offline | Streaming (this repo) |
|---|---|---|
| When you get output | After the full recitation | Live, every ~1s chunk |
| Latency | High (whole-utterance) | Low (real-time) |
| Clean studio WER | ~1% | ~6% |
| Real phone-audio WER | ~9% (~4% normalized) | ~20% (~14% normalized) |
| Best for | Final transcript, max accuracy | Live tracking, feedback, mushaf follow-along |
Use offline when latency does not matter and you want the lowest error. Use streaming when you need output during recitation. Per the table above, the streaming model gives up roughly 5 WER points on clean studio audio and roughly 10 on phone audio in exchange for real-time output.
Full leakage-free leaderboard (this model vs nvidia, whisper, seamless, mms, omniASR, Tarteel): Quranic ASR Leaderboard.
WER notes: measured on a leakage-free held-out set - EveryAyah reciters never used in training,
a held-out QUL reciter, and real phone-recorded recitation (tlog). CTC greedy, [70,13] context
(1 s lookahead). "Normalized" = alef-insensitive, which removes a spelling-convention mismatch (the
eval references use Uthmani rasm; the model outputs imlaei), reflecting true word accuracy over
orthography. For reference, on the same held-out clips the public Arabic-ASR-leaderboard #1
(nvidia FastConformer) scores ~5.7% overall - the offline model here beats it (3% overall).
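To make the "normalized" figure concrete, here is a minimal sketch of an alef-insensitive WER. The exact normalization rule behind the numbers above is not spelled out in this card, so the sketch assumes the simplest reading: drop every alef-family character (bare, madda, hamza, wasla, and dagger alef) from both reference and hypothesis before computing word-level edit distance. `strip_alef` and `wer` are illustrative names, not part of this repo.

```python
import re

# Assumption: "alef-insensitive" means alef-family characters are ignored.
# U+0627 alef, U+0622 madda, U+0623/U+0625 hamza forms, U+0671 wasla, U+0670 dagger alef.
_ALEF_FAMILY = re.compile("[\u0627\u0622\u0623\u0625\u0671\u0670]")

def strip_alef(text: str) -> str:
    """Remove all alef-family characters from the text."""
    return _ALEF_FAMILY.sub("", text)

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / max(len(ref), 1)

# Same two words; the second is spelled without the medial alef in the
# reference (rasm-style) and with it in the hypothesis (imlaei-style).
ref = "\u0628\u0633\u0645 \u0627\u0644\u0639\u0644\u0645\u064a\u0646"
hyp = "\u0628\u0633\u0645 \u0627\u0644\u0639\u0627\u0644\u0645\u064a\u0646"

raw = wer(ref, hyp)                                  # 0.5: spelling mismatch counts as an error
normalized = wer(strip_alef(ref), strip_alef(hyp))   # 0.0: mismatch removed
```

The gap between `raw` and `normalized` on this toy pair is the same kind of gap the table reports between raw and normalized WER.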
Which ONNX do I use?
Three graphs, same weights, pick by what you are building:
| File | Cache-aware (live)? | Outputs encoder_output? | Use for |
|---|---|---|---|
| model.onnx / model.q8.onnx | Yes | No | Live word-by-word ASR (lowest overhead) |
| model_with_encoder.onnx / .q8 | No (full-context) | Yes | Record-then-tajweed (run once on the whole clip) |
| model_streaming_with_encoder.onnx / .q8 | Yes | Yes | Live ASR + live tajweed in one pass |
If you only need live text, use model.q8.onnx. If you score pronunciation only after the user stops,
model_with_encoder.onnx is simplest. If you want to stream text and feed the pronunciation head
while reciting, use model_streaming_with_encoder.q8.onnx - one forward returns logprobs,
encoder_output, and the next-step cache together.
Files
- model.onnx - fp32 cache-aware streaming ONNX (~459 MB)
- model.q8.onnx - INT8 dynamic-quantized (~132 MB, for on-device)
- model_with_encoder.onnx / .q8.onnx - full-context (no cache) graph that returns encoder_output (~459 / ~132 MB); feature source for record-then-tajweed
- model_streaming_with_encoder.onnx / .q8.onnx - cache-aware AND returns encoder_output in one pass (~459 / ~132 MB); for live ASR + live tajweed
- streaming_global_cmvn.npz - fixed-global CMVN constants (clean_* / tlog_* mean+std)
- streaming_inference_example.py - runnable pure-ONNX streaming reference (mel → chunked forward → CTC decode)
- head/pronunciation_head.pt - streaming-matched pronunciation head (see below)
- tajweed/head_scorer.py - loader/scorer for the pronunciation head
Pronunciation head
head/pronunciation_head.pt is a small (1.33 M-param) per-token classifier that takes pooled encoder
features (512-d) + token IDs and returns P(token correctly pronounced). It is trained on features
pooled from this streaming encoder ([70, 13] context), so its inputs match what the streaming model
emits live - not the offline encoder.
To score pronunciation: get encoder_output (from model_with_encoder.onnx after recording, or live
from model_streaming_with_encoder.onnx), CTC-align the tokens, mean-pool encoder_output over each
token's frame interval, and feed those 512-d vectors + token IDs to the head via
tajweed/head_scorer.py. The CoreML (ANE) build of this streaming head is in
Muno459/fastconformer-quran-coreml-streaming.
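The pooling step in the recipe above can be sketched as follows. This is an illustration only: it assumes you already have per-token frame intervals from your own CTC alignment (end index exclusive), and `pool_token_features` is a hypothetical helper name. The actual interface of tajweed/head_scorer.py is not reproduced here.

```python
import numpy as np

def pool_token_features(encoder_output, intervals):
    """Mean-pool encoder frames over each token's frame interval.

    encoder_output: array [512, T'] for one utterance (take index 0 of the
                    [B, 512, T'] ONNX output for a batch of one).
    intervals:      list of (start, end) frame indices per token, end exclusive
                    (assumed to come from a CTC alignment not shown here).
    Returns an array [num_tokens, 512].
    """
    encoder_output = np.asarray(encoder_output)
    dim, num_frames = encoder_output.shape
    pooled = []
    for start, end in intervals:
        if not (0 <= start < end <= num_frames):
            raise ValueError(f"bad frame interval ({start}, {end})")
        pooled.append(encoder_output[:, start:end].mean(axis=1))
    if not pooled:
        return np.zeros((0, dim), dtype=encoder_output.dtype)
    return np.stack(pooled)

# Toy example with a 2-d "encoder" and 6 frames instead of 512-d features.
toy = np.arange(12, dtype=np.float32).reshape(2, 6)
features = pool_token_features(toy, [(0, 2), (2, 6)])
```

Each row of `features` is the vector that, together with the matching token ID, would be passed to the pronunciation head.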
Does streaming hurt tajweed scoring?
Barely. The streaming (limited-context) encoder is much weaker for ASR text, which benefits from long-range context, but pronunciation correctness is a local acoustic judgement per letter, which limited context preserves. The table below uses the same held-out tokens, with each head fed its matched encoder features:
| Encoder feeding the head | ASR WER | Pron. AUC | TPR @ 1% FPR | TPR @ 5% FPR |
|---|---|---|---|---|
| Offline (full context) | 4.1 | 0.980 | 92.7% | 94.8% |
| Streaming ([70,13]) | 11.9 | 0.984 | 92.3% | 96.5% |
So you can run live tajweed off the streaming encoder with essentially no loss versus offline. Use the offline model only when you want the lowest text WER.
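For readers unfamiliar with the "TPR @ x% FPR" columns, here is a minimal sketch of how such a figure is typically computed: pick the score threshold that lets through at most the target fraction of negatives, then report the fraction of positives above it. Which class counts as "positive" in the table is not stated here, so the function is generic (label 1 = positive), and `tpr_at_fpr` is an illustrative name rather than the evaluation code behind the table.

```python
def tpr_at_fpr(scores, labels, max_fpr):
    """True-positive rate at the loosest threshold whose FPR is <= max_fpr.

    scores: higher means "more likely positive"; labels: 1 = positive, 0 = negative.
    """
    negatives = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    positives = [s for s, y in zip(scores, labels) if y == 1]
    if not negatives or not positives:
        raise ValueError("need at least one positive and one negative example")
    # Number of negatives allowed to score above the threshold.
    allowed = int(max_fpr * len(negatives))
    # Predict positive when score > threshold.
    threshold = negatives[allowed] if allowed < len(negatives) else float("-inf")
    return sum(s > threshold for s in positives) / len(positives)

scores = [0.95, 0.8, 0.5, 0.25, 0.9, 0.3, 0.2, 0.1]
labels = [1,    1,   1,   1,    0,   0,   0,   0]
strict = tpr_at_fpr(scores, labels, max_fpr=0.0)    # no false positives allowed
looser = tpr_at_fpr(scores, labels, max_fpr=0.25)   # one of four negatives allowed
```

A tighter FPR budget forces a higher threshold, which is why the 1% column in the table is lower than the 5% column.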
Inference
The model takes 80-dim log-mel features (the app extracts these), normalized with the supplied
fixed-global CMVN (use tlog_* for phone audio, clean_* for studio), fed chunk by chunk with the
cache tensors carried across steps:
# model.onnx
inputs : audio_signal[B,80,T], length, cache_last_channel, cache_last_time, cache_last_channel_len
outputs: logprobs[B,T',1025], encoded_lengths, cache_last_channel_next, cache_last_time_next, cache_last_channel_next_len
# model_streaming_with_encoder.onnx (adds encoder_output)
outputs: logprobs[B,T',1025], encoder_output[B,512,T'], encoded_lengths, cache_last_channel_next, cache_last_time_next, cache_last_channel_next_len
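The fixed-global CMVN step mentioned above can be sketched like this. It assumes the usual per-feature form (x - mean) / std applied to each of the 80 mel bins, and it assumes the .npz keys are named `tlog_mean`, `tlog_std`, `clean_mean`, and `clean_std`; the card only says `clean_*` / `tlog_*`, so check the file (or streaming_inference_example.py) for the real names. Both function names are illustrative.

```python
import numpy as np

def apply_global_cmvn(mel, mean, std, eps=1e-5):
    """Normalize log-mel features with fixed global statistics.

    mel: [80, T] log-mel features; mean, std: [80] per-bin constants.
    Returns float32 [80, T]. The eps floor is an added safety guard.
    """
    mel = np.asarray(mel, dtype=np.float32)
    mean = np.asarray(mean, dtype=np.float32).reshape(-1, 1)
    std = np.asarray(std, dtype=np.float32).reshape(-1, 1)
    return (mel - mean) / np.maximum(std, eps)

def load_cmvn(path="streaming_global_cmvn.npz", domain="tlog"):
    """Load (mean, std) for 'tlog' (phone audio) or 'clean' (studio).

    ASSUMPTION: key names '<domain>_mean' / '<domain>_std' are guesses.
    """
    stats = np.load(path)
    return stats[f"{domain}_mean"], stats[f"{domain}_std"]

# Synthetic check that does not need the real file.
toy_mel = np.full((80, 5), 3.0)
normalized_mel = apply_global_cmvn(toy_mel, mean=np.ones(80), std=np.full(80, 2.0))
```

Because the statistics are fixed constants rather than per-utterance estimates, the same call works on each chunk independently, which is what a streaming front end needs.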
Initialize the caches empty (cache_last_channel zeros [B,17,70,512], cache_last_time zeros
[B,17,512,8], cache_last_channel_len [0]), feed each audio chunk, carry cache_*_next into the
next step's cache_*, and CTC-greedy-decode logprobs (blank id = 1024). Carry your prev-token
state across chunks too, or you will see duplicated letters at chunk seams. Vocabulary is a 1024-token
Arabic BPE.
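The loop described above can be sketched as follows. This is not the shipped streaming_inference_example.py. It is a minimal illustration under stated assumptions: int64 dtype for `length` and `cache_last_channel_len`, a batch of one, and chunks that are already CMVN-normalized [80, T] arrays of whatever chunk length the model expects (the card does not give it). `init_caches`, `greedy_ctc_chunk`, `stream_decode`, and `make_onnx_step` are hypothetical helper names.

```python
import numpy as np

BLANK_ID = 1024  # CTC blank id stated above

def init_caches(batch=1):
    """Empty caches with the shapes listed above (dtypes are assumptions)."""
    return {
        "cache_last_channel": np.zeros((batch, 17, 70, 512), dtype=np.float32),
        "cache_last_time": np.zeros((batch, 17, 512, 8), dtype=np.float32),
        "cache_last_channel_len": np.zeros((batch,), dtype=np.int64),
    }

def greedy_ctc_chunk(logprobs, prev_id=BLANK_ID):
    """Greedy CTC over one chunk. logprobs: [T', V].

    prev_id is the argmax of the last frame of the previous chunk, so a token
    that straddles a chunk seam is not emitted twice.
    Returns (new_token_ids, last_frame_id).
    """
    tokens = []
    for frame_id in np.argmax(logprobs, axis=-1).tolist():
        if frame_id != BLANK_ID and frame_id != prev_id:
            tokens.append(frame_id)
        prev_id = frame_id
    return tokens, prev_id

def stream_decode(step_fn, chunks):
    """step_fn(chunk, caches) -> (logprobs [1, T', V], next_caches)."""
    caches, prev_id, token_ids = init_caches(batch=1), BLANK_ID, []
    for chunk in chunks:
        logprobs, caches = step_fn(chunk, caches)  # carry cache_*_next forward
        new_ids, prev_id = greedy_ctc_chunk(logprobs[0], prev_id)
        token_ids.extend(new_ids)
    return token_ids

def make_onnx_step(path="model.q8.onnx"):
    """Wrap an ONNX Runtime session as a step_fn (input dtypes are assumptions)."""
    import onnxruntime as ort  # imported lazily so the helpers above run without it
    sess = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    outputs = ["logprobs", "cache_last_channel_next",
               "cache_last_time_next", "cache_last_channel_next_len"]

    def step(chunk, caches):
        feeds = {"audio_signal": chunk[None].astype(np.float32),
                 "length": np.array([chunk.shape[-1]], dtype=np.int64),
                 **caches}
        logprobs, ch, tm, ch_len = sess.run(outputs, feeds)
        return logprobs, {"cache_last_channel": ch, "cache_last_time": tm,
                          "cache_last_channel_len": ch_len}
    return step
```

With the model file present, `token_ids = stream_decode(make_onnx_step(), chunks)` yields BPE token IDs; turning them into text requires the model's tokenizer, which is not shown here. Passing `step_fn` as a parameter keeps the cache and seam logic testable without a model.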
Training
FastConformer-Hybrid (17 layers, d_model 512), causal convolutions + chunked-limited attention. Warm-started from our offline Quran model, then adapted on a fully canonical-labeled corpus, all relabeled to each clip's canonical ayah text in one consistent imlaei orthography:
- EveryAyah (tarteel-ai/everyayah): multi-reciter Hafs studio, the clean backbone.
- Tarteel tlog (tarteel-ai/tlog): real phone recitation, upweighted for real-world robustness.
- Muaalem (obadx/muaalem-annotated-v3): additional clean Hafs recitation (~12 K clips).
All Hafs riwayah.
License
Apache 2.0.