language: ar
license: apache-2.0
library_name: nemo
pipeline_tag: automatic-speech-recognition
tags:
  - automatic-speech-recognition
  - streaming
  - cache-aware
  - quran
  - arabic
  - fastconformer
  - onnx
datasets:
  - tarteel-ai/everyayah
  - tarteel-ai/tlog
  - obadx/muaalem-annotated-v3

FastConformer Quran ASR - Streaming (cache-aware)

Low-latency streaming Arabic Quran speech recognition. It transcribes live while you recite (causal, cache-aware FastConformer), for recitation-tracking use cases such as live word highlighting and real-time feedback. Trained and validated on real phone-recorded recitation, not just clean studio.

This is the streaming counterpart to the offline model: 👉 Offline (highest accuracy): Muno459/fastconformer-quran

Offline vs streaming - which to use

| | Offline | Streaming (this repo) |
|---|---|---|
| When you get output | After the full recitation | Live, every ~1s chunk |
| Latency | High (whole-utterance) | Low (real-time) |
| Clean studio WER | ~1% | ~6% |
| Real phone-audio WER | ~9% (≈4% normalized) | ~20% (≈14% normalized) |
| Best for | Final transcript, max accuracy | Live tracking, feedback, mushaf follow-along |

Use offline when latency does not matter and you want the lowest error. Use streaming when you need output during recitation. The streaming model trades accuracy for real-time output: about 5 WER points on clean studio audio and about 10 on phone audio, per the table above.

📊 Full leakage-free leaderboard (this model vs nvidia, whisper, seamless, mms, omniASR, Tarteel): Quranic ASR Leaderboard.

WER notes: measured on a leakage-free held-out set - EveryAyah reciters never used in training, a held-out QUL reciter, and real phone-recorded recitation (tlog). CTC greedy, [70,13] context (1 s lookahead). "Normalized" = alef-insensitive, which removes a spelling-convention mismatch (the eval references use Uthmani rasm; the model outputs imlaei), reflecting true word accuracy over orthography. For reference, on the same held-out clips the public Arabic-ASR-leaderboard #1 (nvidia FastConformer) scores ~5.7% overall - the offline model here beats it (3% overall).
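
The "normalized" figure can be reproduced in spirit with a few lines of Python. This is a minimal sketch, not the leaderboard's scoring code: it assumes "alef-insensitive" means folding the hamza, madda, and wasla alef forms to a bare alef before computing word-level edit distance. The helper names (`normalize_alef`, `wer`) are illustrative.

```python
# Assumption: "alef-insensitive" = fold these alef variants to bare alef (U+0627).
ALEF_VARIANTS = "\u0623\u0625\u0622\u0671"  # alef with hamza above/below, madda, wasla
_ALEF_TABLE = str.maketrans({c: "\u0627" for c in ALEF_VARIANTS})


def normalize_alef(text: str) -> str:
    """Replace every alef variant with a bare alef."""
    return text.translate(_ALEF_TABLE)


def word_errors(ref_words, hyp_words) -> int:
    """Word-level Levenshtein distance (substitutions + insertions + deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, 1):
        cur = [i]
        for j, h in enumerate(hyp_words, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(ref: str, hyp: str, alef_insensitive: bool = False) -> float:
    """Word error rate of hyp against ref, optionally after alef folding."""
    if alef_insensitive:
        ref, hyp = normalize_alef(ref), normalize_alef(hyp)
    ref_words, hyp_words = ref.split(), hyp.split()
    return word_errors(ref_words, hyp_words) / max(len(ref_words), 1)
```

A reference word written with alef wasla and a hypothesis written with bare alef count as an error in the raw score but as a match in the normalized one, which is the spelling-convention mismatch the normalized figure removes.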

Which ONNX do I use?

Three graphs, same weights, pick by what you are building:

| File | Cache-aware (live)? | Outputs encoder_output? | Use for |
|---|---|---|---|
| model.onnx / model.q8.onnx | ✅ | ❌ | Live word-by-word ASR (lowest overhead) |
| model_with_encoder.onnx / .q8 | ❌ (full-context) | ✅ | Record-then-tajweed (run once on the whole clip) |
| model_streaming_with_encoder.onnx / .q8 | ✅ | ✅ | Live ASR + live tajweed in one pass |

If you only need live text, use model.q8.onnx. If you score pronunciation only after the user stops, model_with_encoder.onnx is simplest. If you want to stream text and feed the pronunciation head while reciting, use model_streaming_with_encoder.q8.onnx - one forward returns logprobs, encoder_output, and the next-step cache together.

Files

  • model.onnx - fp32 cache-aware streaming ONNX (~459 MB)
  • model.q8.onnx - INT8 dynamic-quantized (~132 MB, for on-device)
  • model_with_encoder.onnx / .q8.onnx - full-context (no cache) graph that returns encoder_output (~459 / ~132 MB); feature source for record-then-tajweed
  • model_streaming_with_encoder.onnx / .q8.onnx - cache-aware AND returns encoder_output in one pass (~459 / ~132 MB); for live ASR + live tajweed
  • streaming_global_cmvn.npz - fixed-global CMVN constants (clean_* / tlog_* mean+std)
  • streaming_inference_example.py - runnable pure-ONNX streaming reference (mel → chunked forward → CTC decode)
  • head/pronunciation_head.pt - streaming-matched pronunciation head (see below)
  • tajweed/head_scorer.py - loader/scorer for the pronunciation head

Pronunciation head

head/pronunciation_head.pt is a small (1.33 M-param) per-token classifier that takes pooled encoder features (512-d) + token IDs and returns P(token correctly pronounced). It is trained on features pooled from this streaming encoder ([70, 13] context), so its inputs match what the streaming model emits live - not the offline encoder.

To score pronunciation: get encoder_output (from model_with_encoder.onnx after recording, or live from model_streaming_with_encoder.onnx), CTC-align the tokens, mean-pool encoder_output over each token's frame interval, and feed those 512-d vectors + token IDs to the head via tajweed/head_scorer.py. The CoreML (ANE) build of this streaming head is in Muno459/fastconformer-quran-coreml-streaming.
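
The mean-pooling step can be sketched as follows. This is an illustration only: tajweed/head_scorer.py may already do this internally, and the function name and the `(start, end)` interval format (end exclusive, in encoder frames) are assumptions. The intervals are expected to come from your own CTC alignment.

```python
import numpy as np


def pool_token_features(encoder_output: np.ndarray, intervals) -> np.ndarray:
    """Mean-pool encoder frames over each token's frame interval.

    encoder_output: [512, T'] array for one utterance (one item of [B, 512, T']).
    intervals:      list of (start, end) frame indices per token, end exclusive
                    (hypothetical format; adapt it to your aligner's output).
    Returns an [num_tokens, 512] array, one pooled vector per token.
    """
    dim = encoder_output.shape[0]
    pooled = []
    for start, end in intervals:
        end = max(end, start + 1)  # guard against empty intervals
        pooled.append(encoder_output[:, start:end].mean(axis=1))
    if not pooled:
        return np.zeros((0, dim), dtype=encoder_output.dtype)
    return np.stack(pooled)
```

The resulting rows, together with the matching token IDs, are what the head consumes.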

Does streaming hurt tajweed scoring?

Barely. The streaming (limited-context) encoder is much weaker for ASR text (it needs long-range context), but pronunciation correctness is a local acoustic judgement per letter, which limited context preserves. Same held-out tokens, each head on its matched encoder features:

| Encoder feeding the head | ASR WER (%) | Pron. AUC | TPR @ 1% FPR | TPR @ 5% FPR |
|---|---|---|---|---|
| Offline (full context) | 4.1 | 0.980 | 92.7% | 94.8% |
| Streaming ([70,13]) | 11.9 | 0.984 | 92.3% | 96.5% |

So you can run live tajweed off the streaming encoder with essentially no loss versus offline. Use the offline model only when you want the lowest text WER.
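
For readers who want to compute the same kind of operating-point metric on their own data, here is a minimal sketch of "TPR at a fixed FPR". It is not the evaluation script behind the table. The card does not state which class is treated as positive, so the function takes generic scores (higher means more positive) and boolean labels.

```python
import numpy as np


def tpr_at_fpr(scores, labels, max_fpr: float) -> float:
    """True-positive rate at the loosest threshold whose FPR stays <= max_fpr.

    scores: per-token scores, higher = more likely positive.
    labels: True for positive tokens, False for negative tokens.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    negatives = np.sort(scores[~labels])[::-1]    # negative scores, descending
    allowed = int(np.floor(max_fpr * len(negatives)))  # false positives we may accept
    # Predict positive only when the score is strictly above this threshold,
    # so at most `allowed` negatives are accepted.
    threshold = negatives[allowed] if allowed < len(negatives) else -np.inf
    return float(np.mean(scores[labels] > threshold))
```

Call it as `tpr_at_fpr(scores, labels, 0.01)` or `tpr_at_fpr(scores, labels, 0.05)` to get the two operating points reported above; the input must contain at least one positive token.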

Inference

The model takes 80-dim log-mel features (the app extracts these), normalized with the supplied fixed-global CMVN (use tlog_* for phone audio, clean_* for studio), fed chunk by chunk with the cache tensors carried across steps:

# model.onnx
inputs : audio_signal[B,80,T], length, cache_last_channel, cache_last_time, cache_last_channel_len
outputs: logprobs[B,T',1025], encoded_lengths, cache_last_channel_next, cache_last_time_next, cache_last_channel_next_len

# model_streaming_with_encoder.onnx  (adds encoder_output)
outputs: logprobs[B,T',1025], encoder_output[B,512,T'], encoded_lengths, cache_last_channel_next, cache_last_time_next, cache_last_channel_next_len
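
The fixed-global CMVN step described above amounts to a per-mel-bin shift and scale. A minimal sketch follows; the key names `tlog_mean` / `tlog_std` are an assumption based on the "clean_* / tlog_* mean+std" description, so inspect `stats.files` on the loaded archive to confirm them, and treat streaming_inference_example.py as the reference.

```python
import numpy as np


def load_cmvn(path: str, domain: str = "tlog"):
    """Load mean/std for one domain ("tlog" for phone audio, "clean" for studio).

    Assumed key names: f"{domain}_mean" and f"{domain}_std".
    """
    stats = np.load(path)
    return stats[f"{domain}_mean"], stats[f"{domain}_std"]


def apply_global_cmvn(mel: np.ndarray, mean, std, min_std: float = 1e-5) -> np.ndarray:
    """Normalize log-mel features of shape [80, T] with fixed per-bin statistics."""
    mean = np.asarray(mean, dtype=np.float32).reshape(-1, 1)  # [80, 1]
    std = np.asarray(std, dtype=np.float32).reshape(-1, 1)    # [80, 1]
    std = np.maximum(std, min_std)                            # avoid division by zero
    return (mel - mean) / std
```

Because the statistics are fixed rather than estimated per utterance, the same call works chunk by chunk with no look-ahead.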

Initialize the caches empty (cache_last_channel zeros [B,17,70,512], cache_last_time zeros [B,17,512,8], cache_last_channel_len [0]), feed each audio chunk, carry cache_*_next into the next step's cache_*, and CTC-greedy-decode logprobs (blank id = 1024). Carry your prev-token state across chunks too, or you will see duplicated letters at chunk seams. Vocabulary is a 1024-token Arabic BPE.
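
The loop described above can be sketched as follows. The tensor names and shapes come from the I/O listing; the dtypes (float32 features and caches, int64 lengths) and the helper names are assumptions, so check `session.get_inputs()` on your graph. `session` is expected to be an `onnxruntime.InferenceSession` (or anything with the same `run(output_names, input_feed)` method). streaming_inference_example.py remains the reference.

```python
import numpy as np

BLANK_ID = 1024            # CTC blank id, per the card
LAYERS, D_MODEL = 17, 512  # cache dimensions, per the card


def init_caches(batch: int = 1) -> dict:
    """Empty caches for the first chunk (dtypes are assumptions)."""
    return {
        "cache_last_channel": np.zeros((batch, LAYERS, 70, D_MODEL), np.float32),
        "cache_last_time": np.zeros((batch, LAYERS, D_MODEL, 8), np.float32),
        "cache_last_channel_len": np.zeros((batch,), np.int64),
    }


def stream_step(session, chunk: np.ndarray, caches: dict):
    """Run one chunk [B, 80, T]; return (logprobs, caches for the next step)."""
    feeds = {
        "audio_signal": chunk.astype(np.float32),
        "length": np.full((chunk.shape[0],), chunk.shape[2], np.int64),
        **caches,
    }
    names = ["logprobs", "cache_last_channel_next",
             "cache_last_time_next", "cache_last_channel_next_len"]
    logprobs, ch, tm, ch_len = session.run(names, feeds)
    return logprobs, {"cache_last_channel": ch,
                      "cache_last_time": tm,
                      "cache_last_channel_len": ch_len}


class StreamingGreedyCTC:
    """Greedy CTC decoder that remembers the last frame's token across chunks."""

    def __init__(self, blank_id: int = BLANK_ID):
        self.blank_id = blank_id
        self.prev = blank_id  # state carried across chunk seams

    def step(self, logprobs: np.ndarray) -> list:
        """logprobs: [T', vocab] for one utterance; returns newly emitted token ids."""
        emitted = []
        for token in logprobs.argmax(axis=-1):
            token = int(token)
            # Emit on a change of non-blank token; repeats collapse, blanks separate.
            if token != self.blank_id and token != self.prev:
                emitted.append(token)
            self.prev = token
        return emitted
```

A typical loop creates `caches = init_caches()` and one `StreamingGreedyCTC()` per stream, then for each chunk calls `logprobs, caches = stream_step(session, chunk, caches)` followed by `decoder.step(logprobs[0])`. Resetting the decoder on every chunk instead would re-emit a token that straddles a seam, which is the duplicated-letter effect mentioned above.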

Training

FastConformer-Hybrid (17 layers, d_model 512) with causal convolutions and chunked limited-context attention. The model is warm-started from our offline Quran model, then adapted on a fully canonical-labeled corpus: every clip is relabeled to its canonical ayah text in one consistent imlaei orthography. The corpus combines:

  • EveryAyah (tarteel-ai/everyayah): multi-reciter Hafs studio, the clean backbone.
  • Tarteel tlog (tarteel-ai/tlog): real phone recitation, upweighted for real-world robustness.
  • Muaalem (obadx/muaalem-annotated-v3): additional clean Hafs recitation (~12 K clips).

All Hafs riwayah.

License

Apache 2.0.