FastConformer Quran ASR: Streaming (cache-aware)
Low-latency streaming Arabic Quran speech recognition. The model transcribes live while you recite (causal, cache-aware FastConformer) and targets recitation-tracking use cases such as live word highlighting and real-time feedback. It was trained and validated on real phone-recorded recitation, not just clean studio audio.
This is the streaming counterpart to the offline model. Offline (highest accuracy): Muno459/fastconformer-quran
Offline vs streaming: which to use
| | Offline | Streaming (this repo) |
|---|---|---|
| When you get output | After the full recitation | Live, every ~1s chunk |
| Latency | High (whole-utterance) | Low (real-time) |
| Clean studio WER | <1% | ~5% |
| Real phone-audio WER | ~22% (≈14% normalized) | ~29% (≈23% normalized) |
| Best for | Final transcript, max accuracy | Live tracking, feedback, mushaf follow-along |
Use offline when latency does not matter and you want the lowest error. Use streaming when you need output during recitation. The streaming model pays a few points of accuracy for real-time output.
WER notes: held-out eval on real phone recordings + clean studio, CTC greedy, [70,13] context
(~1s lookahead). "Normalized" removes a spelling-convention mismatch (the eval references use Uthmani
rasm; the model outputs imlaei), so it reflects true word accuracy rather than orthography.
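As a rough illustration of the raw versus normalized figures, the sketch below computes word-level WER with an optional normalizer applied to both sides. The `toy_normalize` table is purely hypothetical (one well-known rasm/imlaei spelling pair) and is not the normalization used for the numbers above.

```python
def word_error_rate(reference, hypothesis, normalize=None):
    """Word-level edit distance divided by the number of reference words."""
    if normalize is not None:
        reference, hypothesis = normalize(reference), normalize(hypothesis)
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, keeping only one previous row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / max(len(ref), 1)


# Hypothetical spelling table: maps a rasm-style spelling to its imlaei form.
TOY_SPELLING = {"الصلوة": "الصلاة"}


def toy_normalize(text):
    """Replace each word found in the toy table; leave other words as-is."""
    return " ".join(TOY_SPELLING.get(w, w) for w in text.split())
```

With a reference of `"أقيموا الصلوة"` and a hypothesis of `"أقيموا الصلاة"`, the raw score counts one substitution in two words, while the normalized score treats the two spellings as the same word.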
Files
- model.onnx: fp32 cache-aware streaming ONNX (~459 MB)
- model.q8.onnx: INT8 dynamic-quantized (~132 MB, for on-device)
- streaming_global_cmvn.npz: fixed-global CMVN constants (clean_*/tlog_* mean+std)
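A minimal sketch of applying the fixed-global CMVN constants to a block of log-mel features. The key names `<prefix>_mean` and `<prefix>_std` are an assumption based on the "clean_*/tlog_* mean+std" description, and the small `eps` is an added safeguard; inspect the archive's `.files` attribute to confirm the actual names.

```python
import numpy as np


def load_cmvn(path, prefix="tlog"):
    """Load per-bin mean and std from the .npz archive.

    ASSUMPTION: keys are named "<prefix>_mean" and "<prefix>_std".
    """
    stats = np.load(path)
    return stats[f"{prefix}_mean"], stats[f"{prefix}_std"]


def apply_global_cmvn(features, mean, std, eps=1e-5):
    """Normalize log-mel features of shape [80, T] with fixed per-bin stats."""
    mean = np.asarray(mean, dtype=np.float32).reshape(-1, 1)
    std = np.asarray(std, dtype=np.float32).reshape(-1, 1)
    # Broadcast the [80, 1] statistics across the time axis.
    return (features.astype(np.float32) - mean) / (std + eps)
```

Because the statistics are fixed rather than estimated per utterance, the same call works on a single streaming chunk and on a full recording.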
Inference
The model takes 80-dim log-mel features (the app extracts these), normalized with the supplied
fixed-global CMVN (use tlog_* for phone audio, clean_* for studio), fed chunk by chunk with the
cache tensors carried across steps:
inputs : audio_signal[B,80,T], length, cache_last_channel, cache_last_time, cache_last_channel_len
outputs: logprobs[B,T',1025], encoded_lengths, cache_last_channel_next, cache_last_time_next, cache_last_channel_next_len
Initialize the caches empty, feed each audio chunk, carry cache_*_next into the next step's
cache_*, and CTC-greedy-decode logprobs (blank id = 1024). Vocabulary is a 1024-token Arabic BPE.
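The loop described above might look like the following sketch. It assumes that `session` exposes an onnxruntime-style `run(output_names, feeds)` method, that outputs come back in the order listed above, and that `length` is an int64 tensor. The card does not give the cache shapes, so the caller supplies the initial empty caches.

```python
import numpy as np

BLANK_ID = 1024  # blank index stated in the model card


def ctc_greedy_step(logprobs, prev_id, blank_id=BLANK_ID):
    """Collapse one chunk of frame log-probs [T', V] into token ids.

    prev_id is the argmax of the last frame of the previous chunk, so a
    repeat that straddles a chunk boundary is still collapsed.
    """
    tokens = []
    for frame_id in np.argmax(logprobs, axis=-1).tolist():
        if frame_id != prev_id and frame_id != blank_id:
            tokens.append(frame_id)
        prev_id = frame_id
    return tokens, prev_id


def stream_decode(session, chunks, caches, blank_id=BLANK_ID):
    """Feed feature chunks [1, 80, T] one by one, carrying caches forward.

    caches: dict with keys cache_last_channel, cache_last_time and
    cache_last_channel_len holding the initial (empty) cache arrays.
    """
    tokens, prev_id = [], blank_id
    for chunk in chunks:
        feeds = {
            "audio_signal": chunk.astype(np.float32),
            # ASSUMPTION: the exported graph expects an int64 length.
            "length": np.array([chunk.shape[-1]], dtype=np.int64),
            **caches,
        }
        # ASSUMPTION: outputs are returned in the order listed above.
        logprobs, enc_len, ch_next, time_next, ch_len_next = session.run(None, feeds)
        # The *_next outputs become the cache inputs of the next step.
        caches = {
            "cache_last_channel": ch_next,
            "cache_last_time": time_next,
            "cache_last_channel_len": ch_len_next,
        }
        valid = int(enc_len[0])  # ignore frames past the encoded length
        new_tokens, prev_id = ctc_greedy_step(logprobs[0, :valid], prev_id, blank_id)
        tokens.extend(new_tokens)
    return tokens
```

With onnxruntime, `session` would be an `onnxruntime.InferenceSession` opened on `model.onnx`, and `session.get_inputs()` reports the name, shape, and type of each cache input, which is one way to size the initial arrays. The returned ids still need to be mapped to text with the model's BPE vocabulary.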
Training
FastConformer-Hybrid (17 layers, d_model 512), causal convolutions + chunked-limited attention. Warm-started from our offline Quran model, then adapted on a fully canonical-labeled corpus: EveryAyah (multi-reciter Hafs) plus real phone recitation (Tarteel tlog) labeled with each clip's canonical ayah text, all in one consistent imlaei orthography.
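For intuition on how the [70,13] context quoted in the WER notes relates to the ~1s lookahead, here is a small arithmetic sketch. It assumes the usual NeMo FastConformer setup of a 10 ms feature hop with 8x subsampling (80 ms per encoder frame) and the NeMo convention that a chunk spans the right context plus one frame; this card confirms neither.

```python
def context_to_ms(left_frames, right_frames, hop_ms=10, subsampling=8):
    """Convert an attention context given in encoder frames to milliseconds.

    ASSUMPTIONS: hop_ms and subsampling are typical FastConformer values,
    and chunk size = right context + 1 frame (NeMo convention).
    """
    frame_ms = hop_ms * subsampling
    return {
        "frame_ms": frame_ms,
        "left_context_ms": left_frames * frame_ms,
        "lookahead_ms": right_frames * frame_ms,
        "chunk_ms": (right_frames + 1) * frame_ms,
    }


timing = context_to_ms(70, 13)
```

Under these assumptions, `timing["lookahead_ms"]` is 13 × 80 = 1040 ms and `timing["chunk_ms"]` is 14 × 80 = 1120 ms, which is consistent with the "~1s" figures quoted above.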
License
Apache 2.0.