FastConformer Quran ASR: Streaming (cache-aware)

Low-latency streaming Arabic Quran speech recognition. It transcribes live while you recite (causal, cache-aware FastConformer), for recitation-tracking use cases such as live word highlighting and real-time feedback. Trained and validated on real phone-recorded recitation, not just clean studio audio.

This is the streaming counterpart to the offline model. 👉 Offline (highest accuracy): Muno459/fastconformer-quran

Offline vs streaming: which to use

| | Offline | Streaming (this repo) |
|---|---|---|
| When you get output | After the full recitation | Live, every ~1s chunk |
| Latency | High (whole-utterance) | Low (real-time) |
| Clean studio WER | <1% | ~5% |
| Real phone-audio WER | ~22% (≈14% normalized) | ~29% (≈23% normalized) |
| Best for | Final transcript, max accuracy | Live tracking, feedback, mushaf follow-along |

Use offline when latency does not matter and you want the lowest error. Use streaming when you need output during recitation. The streaming model pays for real-time output in accuracy: roughly 4 points of WER on clean studio audio and 7 to 9 points on phone audio.

WER notes: held-out evaluation on real phone recordings and clean studio audio, CTC greedy decoding, [70,13] attention context (70 frames of left context, 13 frames of lookahead, ~1s). "Normalized" removes a spelling-convention mismatch (the eval references use Uthmani rasm; the model outputs imlaei), so it reflects true word accuracy rather than orthography.
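For readers who want to reproduce numbers of this kind, here is a minimal sketch of word error rate: word-level edit distance divided by the reference word count. The function name and the `normalize` hook are illustrative assumptions. The actual Uthmani-to-imlaei mapping behind the "normalized" figures is not given here, so the hook is left for the caller to supply.

```python
def word_error_rate(reference, hypothesis, normalize=None):
    """Word-level Levenshtein distance divided by the reference word count."""
    if normalize is not None:
        # Hypothetical hook: e.g. map both strings to one spelling convention first.
        reference, hypothesis = normalize(reference), normalize(hypothesis)
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words consumed so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,              # deletion (reference word missing from hypothesis)
                cur[j - 1] + 1,           # insertion (extra hypothesis word)
                prev[j - 1] + (r != h),   # substitution, or a match at no cost
            ))
        prev = cur
    return prev[-1] / len(ref)
```

Passing a spelling-normalization function as `normalize` yields a "normalized" WER, and omitting it yields the raw figure.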

Files

  • model.onnx: fp32 cache-aware streaming ONNX (~459 MB)
  • model.q8.onnx: INT8 dynamic-quantized (~132 MB, for on-device)
  • streaming_global_cmvn.npz: fixed-global CMVN constants (clean_* / tlog_* mean+std)
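The CMVN constants might be applied along these lines. This is a sketch under assumptions, not the reference implementation: the key names `tlog_mean` / `tlog_std` (and `clean_mean` / `clean_std`) are guessed from the `clean_*` / `tlog_*` pattern above, and the statistics are assumed to be per-mel-bin vectors of length 80 applied as `(x - mean) / std`. Inspect the `.npz` keys and shapes before relying on this.

```python
import numpy as np


def load_cmvn(path, domain="tlog"):
    """Load mean/std for one domain ('tlog' or 'clean').

    The key names '<domain>_mean' and '<domain>_std' are assumptions; check
    np.load(path).files for the names actually stored in the archive.
    """
    stats = np.load(path)
    return stats[f"{domain}_mean"], stats[f"{domain}_std"]


def apply_global_cmvn(features, mean, std, eps=1e-5):
    """Normalize log-mel features of shape [80, T] with fixed global statistics."""
    features = np.asarray(features, dtype=np.float32)
    # Reshape to column vectors so the statistics broadcast over the time axis.
    mean = np.asarray(mean, dtype=np.float32).reshape(-1, 1)
    std = np.asarray(std, dtype=np.float32).reshape(-1, 1)
    # Guard against division by a zero or near-zero standard deviation.
    return (features - mean) / np.maximum(std, eps)
```

Because the statistics are fixed rather than computed per utterance, each incoming chunk can be normalized independently as it arrives, which is what makes this form of CMVN suitable for streaming.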

Inference

The model takes 80-dim log-mel features (the app extracts these), normalized with the supplied fixed-global CMVN (use tlog_* for phone audio, clean_* for studio), fed chunk by chunk with the cache tensors carried across steps:

inputs : audio_signal[B,80,T], length, cache_last_channel, cache_last_time, cache_last_channel_len
outputs: logprobs[B,T',1025], encoded_lengths, cache_last_channel_next, cache_last_time_next, cache_last_channel_next_len

Initialize the caches empty, feed each audio chunk, carry cache_*_next into the next step's cache_*, and CTC-greedy-decode logprobs (blank id = 1024). Vocabulary is a 1024-token Arabic BPE.
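The loop described above can be sketched as follows. It covers only the parts stated in this card: carrying the caches from one step to the next and CTC-greedy decoding with blank id 1024. `run_step` is a hypothetical callable standing in for one model call, and batch size 1 is assumed, with `logprobs` already sliced to shape [T', 1025]. Note that the previous frame's id must also be carried across chunk boundaries, otherwise a token that spans two chunks would be emitted twice.

```python
import numpy as np

BLANK_ID = 1024  # from this card: 1024 BPE tokens, blank is the last index


def ctc_greedy_step(logprobs, prev_id, blank_id=BLANK_ID):
    """Greedy-decode one chunk of log-probs [T', V].

    Returns the newly emitted token ids and the id of the last frame,
    which the caller passes back in for the next chunk.
    """
    new_ids = []
    for frame_id in np.argmax(logprobs, axis=-1).tolist():
        # CTC collapse rule: emit only non-blank ids that differ from the previous frame.
        if frame_id != blank_id and frame_id != prev_id:
            new_ids.append(frame_id)
        prev_id = frame_id
    return new_ids, prev_id


def stream_decode(chunks, run_step, caches, blank_id=BLANK_ID):
    """Feed chunks in order, carrying caches and the last frame id across steps.

    run_step(chunk, caches) -> (logprobs [T', V], next_caches) is a hypothetical
    wrapper around one inference call; it is not part of any library.
    """
    tokens, prev_id = [], blank_id
    for chunk in chunks:
        logprobs, caches = run_step(chunk, caches)  # cache_*_next becomes the next cache_*
        new_ids, prev_id = ctc_greedy_step(logprobs, prev_id, blank_id)
        tokens.extend(new_ids)
    return tokens
```

With ONNX Runtime, `run_step` would typically wrap `session.run(None, feed)`, where `feed` maps the five input names listed above to the current chunk, its length, and the three cache tensors. The cache shapes are not documented here; they can be read from `session.get_inputs()`. The returned token ids still need to be mapped to text with the model's BPE vocabulary.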

Training

FastConformer-Hybrid (17 layers, d_model 512), causal convolutions + chunked-limited attention. Warm-started from our offline Quran model, then adapted on a fully canonical-labeled corpus: EveryAyah (multi-reciter Hafs) plus real phone recitation (Tarteel tlog) labeled with each clip's canonical ayah text, all in one consistent imlaei orthography.

License

Apache 2.0.
