FastConformer-Quran
State-of-the-art automatic speech recognition for Quranic recitation, with a multi-signal mispronunciation detector built on top.
- 🚀 0.029% WER on EveryAyah test (Hafs riwayah). 13× better than the public Tarteel Whisper on the same audio.
- 🎯 82% sensitivity at 7% FPR on a 39K-token held-out mispronunciation benchmark.
- 📱 iOS-ready via the companion CoreML repo: Muno459/fastconformer-quran-coreml.
Headline numbers
Word and character error rates on EveryAyah test (500 clips, CTC decoder)
| Model | WER loose | WER strict | CER strict | RTF |
|---|---|---|---|---|
| FastConformer-Quran (this model) | 0.029% | 0.175% | 0.027% | 0.0012 |
| Tarteel Whisper (`tarteel-ai/whisper-base-ar-quran`) | 0.380% | 0.409% | 0.080% | 0.0199 |
| Tarteel Whisper published claim | 5.7544% | n/a | n/a | n/a |
13× lower loose WER (0.029% vs. 0.380%) and roughly 16× faster by RTF (0.0012 vs. 0.0199) than Tarteel Whisper on the same hardware.
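The loose and strict columns differ only in how text is normalized before scoring. Below is a minimal sketch of that distinction, assuming loose scoring strips Arabic diacritics and strict scoring keeps them. The set of stripped marks and the function names are illustrative assumptions, not the evaluation script behind the table.

```python
import re

# Assumed diacritic set: tashkeel (U+064B-U+0652) plus dagger alif (U+0670).
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    """Remove harakat so that only the consonantal skeleton is compared."""
    return _DIACRITICS.sub("", text)


def edit_distance(ref: list, hyp: list) -> int:
    """Word-level Levenshtein distance with unit costs."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(ref: str, hyp: str, strict: bool = True) -> float:
    """Word error rate; strict=False strips diacritics first (loose WER)."""
    if not strict:
        ref, hyp = strip_diacritics(ref), strip_diacritics(hyp)
    ref_words, hyp_words = ref.split(), hyp.split()
    return edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)
```

A hypothesis that gets every letter right but one vowel mark wrong counts as a word error under `strict=True` and as no error under `strict=False`, which is why the strict column is always at least as high as the loose one.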
Zero-shot reciter generalization (1,760 clips, 36 reciters)
Of the 36 reciters in the EveryAyah test split, 30 have zero training samples in our manifest:
| Group | Clips | This model | Tarteel Whisper | Ratio |
|---|---|---|---|---|
| Seen reciters (5) | 300 | 0.186% | 0.876% | 4.7× |
| Unseen reciters (30) | 1,460 | 0.230% | 0.635% | 2.8× |
| Aggregate | 1,760 | 0.222% | 0.676% | 3.0× |
Perfect (0.000%) WER on 16 zero-shot reciters including alafasy, husary, minshawi, mahmoud_ali_al_banna, mustafa_ismail, ahmed_ibn_ali_al_ajamy, and akram_alalaqimy. The model is not memorizing voices.
Phone-mic / user audio (tlog held-out, 800 clips, CTC)
| Metric | Value |
|---|---|
| WER loose | 21.24% |
| CER loose | 5.79% |
| WER strict | 40.91% |
| CER strict | 8.47% |
The tlog dataset has substantial label noise (filename-to-ayah mismatches, user-added basmala, partial recitations). Part of the apparent WER therefore comes from clips where the model correctly transcribes what was recited but the reference metadata is wrong.
Demo audio
Alafasy reciting Q 1:4:
Predicted: مَالِكِ يَوْمِ الدِّينِ ✓
Abdullah Basfar reciting Q 112:1:
Predicted: قُلْ هُوَ اللَّهُ أَحَدٌ ✓
Alafasy reciting Q 38:67:
Predicted: قُلْ هُوَ نَبَأٌ عَظِيمٌ ✓
Try the live demo: Space.
Mispronunciation detection
We also ship a multi-signal pronunciation scorer combining three orthogonal signals on the same CTC architecture:
| Signal | What it measures | Style-invariant |
|---|---|---|
| Pronunciation head v7 | Learned P(token correctly pronounced) from a 1.33 M-parameter MLP over encoder features | ✓ |
| Reference-anchor distance | Cosine distance to master qari centroid bank (multi-ayah aware) | partial |
| CTC GOP | log P(expected token) minus max log P(non-blank token), averaged over CTC interval | ✓ |
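
The CTC GOP row can be sketched as follows. This is a minimal illustration, assuming the token's CTC interval is given as a half-open frame range and that the maximum is taken over all non-blank tokens, including the expected one (so the score is at most 0). The function name and arguments are hypothetical, not the shipped scorer.

```python
import numpy as np


def ctc_gop(logprobs: np.ndarray, token_id: int, start: int, end: int,
            blank_id: int) -> float:
    """Goodness-of-pronunciation for one token over frames [start, end).

    logprobs: (T, V) frame-level log-probabilities from the CTC head.
    """
    frames = logprobs[start:end]                      # (n, V)
    expected = frames[:, token_id]                    # log P(expected token)
    non_blank = np.delete(frames, blank_id, axis=1)   # drop the blank column
    best = non_blank.max(axis=1)                      # max log P(non-blank)
    return float(np.mean(expected - best))
```

A score near 0 means the expected token was the most likely non-blank token on most frames; strongly negative scores mean some other token dominated the interval.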
A token is flagged by the consensus rule when at least 2 of 3 signals agree (default thresholds: head < 0.5, anchor > 0.20, GOP < -3.0).
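The 2-of-3 rule with the default thresholds can be written as a small vectorized vote. This is a sketch, assuming per-token arrays for the three signals and assuming that a token without anchor coverage carries `NaN` and therefore casts no anchor vote; the function name is hypothetical.

```python
import numpy as np


def consensus_flags(head_prob, anchor_dist, gop,
                    head_thr=0.5, anchor_thr=0.20, gop_thr=-3.0):
    """Flag a token when at least two of the three signals vote 'mispronounced'."""
    head_prob, anchor_dist, gop = (np.asarray(a, dtype=float)
                                   for a in (head_prob, anchor_dist, gop))
    votes = ((head_prob < head_thr).astype(int)
             + (anchor_dist > anchor_thr).astype(int)   # NaN compares False
             + (gop < gop_thr).astype(int))
    return votes >= 2
```

For example, a token with head probability 0.3 and anchor distance 0.3 is flagged even if its GOP is -1.0, while a token with only a low GOP is not.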
Held-out evaluation
On 39,173 tokens from 996 tlog clips that were never seen by the pronunciation head during training, with per-token consensus labels from our ASR and ElevenLabs Scribe v2:
| Detector | TPR @ 1% FPR | TPR @ 5% FPR | AUC |
|---|---|---|---|
| GOP (style-invariant) | 72.4% | 77.6% | 0.969 |
| Pronunciation head | 73.7% | 84.2% | 0.953 |
| Anchor distance (where bank covers) | 0.7% | 10.5% | 0.732 |
| Consensus (2 of 3) | n/a | n/a | 82.2% TPR / 7.2% FPR |
The combined detector catches ~82% of real mispronunciations at a 7% false-positive rate. This is the deployable operating point.
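The "TPR @ x% FPR" columns can be reproduced from per-token scores and labels along these lines. This is a sketch under stated assumptions: scores are oriented so that higher means more suspicious (for example, negated GOP or one minus the head probability), labels are 1 for mispronounced tokens, and the helper name is hypothetical.

```python
import numpy as np


def tpr_at_fpr(scores, labels, target_fpr):
    """Return (tpr, fpr, threshold) at the loosest threshold with FPR <= target."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(bool)
    neg = np.sort(scores[~labels])[::-1]   # correct tokens, highest score first
    pos = scores[labels]                   # mispronounced tokens
    k = int(np.floor(target_fpr * len(neg) + 1e-9))  # false positives allowed
    # Flag tokens scoring strictly above the (k+1)-th highest negative score.
    threshold = neg[k] if k < len(neg) else -np.inf
    fpr = float(np.mean(neg > threshold)) if len(neg) else 0.0
    tpr = float(np.mean(pos > threshold)) if len(pos) else 0.0
    return tpr, fpr, threshold
```

The consensus row has no such curve because the 2-of-3 vote yields a single operating point rather than a continuous score.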
What this model is and isn't
Is:
- The best published ASR for Quranic recitation in Hafs riwayah.
- A frame-level CTC model with 512-dim encoder features exposed for downstream scoring.
- Production-ready ONNX (fp32 437 MB, fp16 219 MB), running at RTF ~0.001 on an RTX 4090.
- Diacritic-aware: outputs fully harakat-marked Arabic text.
Isn't:
- A general Arabic ASR. Trained only on Quranic audio with a 1,024-token BPE tokenizer. Performance on dialectal, news, or conversational Arabic will be poor by design.
- A native streaming model. The offline model can be run on overlapping chunks for a responsive UX (~5-8 second latency), which is appropriate for ayah-by-ayah recitation. See the CoreML repo for the chunked-streaming pattern with Swift code. True token-by-token low-latency streaming would require a separately-trained cache-aware variant (multi-day GPU job, deferred).
- Trained for other qira'at. Hafs riwayah only.
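Running the offline model on overlapping chunks starts with slicing the incoming audio into windows. The sketch below shows only that slicing step in Python; the chunk and hop lengths are hypothetical values, and this is not the Swift implementation from the CoreML repo. Merging the overlapping transcripts is left out.

```python
def chunk_spans(n_samples: int, chunk_len: int, hop_len: int) -> list:
    """Return (start, end) sample ranges of overlapping chunks covering the audio."""
    if n_samples <= 0:
        return []
    spans = []
    start = 0
    while True:
        end = min(start + chunk_len, n_samples)
        spans.append((start, end))
        if end >= n_samples:      # the last chunk reached the end of the audio
            break
        start += hop_len
    return spans


SAMPLE_RATE = 16_000
# Hypothetical setting: 8 s windows advanced by 4 s, i.e. 50% overlap.
spans = chunk_spans(20 * SAMPLE_RATE, 8 * SAMPLE_RATE, 4 * SAMPLE_RATE)
```

Each span would be passed through the feature pipeline and the ONNX session independently; a larger overlap gives the model more context at chunk boundaries at the cost of more compute.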
How it was built
Stage 1: base training on EveryAyah. Fine-tuned NVIDIA's stt_en_fastconformer_hybrid_large_pc from English-pretrained weights to Arabic + Quran on 22 K EveryAyah clips. Reached 0.0757% WER on the held-out test split.
Stage 2: pronunciation scoring stack. Built a per-token head on top of frozen encoder features (512-dim pooled + 64-dim token embedding + 16-dim Quran-phonology features into an MLP). Initial training on weak labels (CTC-vs-expected disagreement + GOP scores) plus master qari anchors (Husary, Abdul Basit, Alafasy clean recitations).
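The head's input layout (512 + 64 + 16 = 592 dimensions per token) can be illustrated with NumPy. This is a rough sketch: mean pooling over the token's frames, the single hidden layer, its width, and the random weights are all assumptions for illustration and do not describe the released 1.33 M-parameter head.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 256  # hypothetical hidden width, not the released head's size


def head_input(encoder_features, start, end, token_emb, phon_feats):
    """Concatenate pooled encoder frames with token and phonology features."""
    pooled = encoder_features[:, start:end].mean(axis=1)     # (512,)
    return np.concatenate([pooled, token_emb, phon_feats])   # (592,)


def mlp_score(x, w1, b1, w2, b2):
    """One-hidden-layer MLP ending in a sigmoid: P(token correctly pronounced)."""
    h = np.maximum(x @ w1 + b1, 0.0)          # ReLU
    logit = h @ w2 + b2
    return float(1.0 / (1.0 + np.exp(-logit)))


# Random weights stand in for trained parameters.
w1 = rng.normal(scale=0.02, size=(592, HIDDEN))
b1 = np.zeros(HIDDEN)
w2 = rng.normal(scale=0.02, size=HIDDEN)
b2 = 0.0

x = head_input(rng.normal(size=(512, 40)), 10, 14,
               rng.normal(size=64), rng.normal(size=16))
score = mlp_score(x, w1, b1, w2, b2)
```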
Stage 3: phone-audio fine-tune. Three rounds of low-LR continuation on EveryAyah and tlog: 22,585 clean clips, 6,869 high-quality tlog clips (full weight), 20,589 borderline tlog clips (half weight). LR schedule 1e-5, 5e-6, 2.5e-6 over six epochs total. Trend was monotonic improvement on a held-out tlog slice through round three, then saturated.
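The full-weight and half-weight mixing can be sketched as a weighted mean over per-clip losses. The source labels, the weight of 1.0 for clean EveryAyah clips, and the normalization by the weight sum are assumptions; only the 1.0 and 0.5 tlog weights and the three learning rates come from the description above.

```python
import numpy as np

# Per-source sample weights; the weight for clean clips is assumed to be 1.0.
SOURCE_WEIGHT = {"everyayah_clean": 1.0, "tlog_high": 1.0, "tlog_borderline": 0.5}
ROUND_LR = [1e-5, 5e-6, 2.5e-6]  # one learning rate per fine-tuning round


def weighted_mean_loss(losses, sources):
    """Average per-clip losses, down-weighting borderline tlog clips."""
    w = np.array([SOURCE_WEIGHT[s] for s in sources])
    return float(np.sum(w * np.asarray(losses, dtype=float)) / np.sum(w))


# Effective number of clips once the half weight is applied.
effective_clips = 22_585 * 1.0 + 6_869 * 1.0 + 20_589 * 0.5
```

Under these weights the 50,043 clips count as 39,748.5 effective samples, so the borderline tlog clips contribute roughly a quarter of the training signal rather than the 41% their raw count would suggest.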
Stage 4: dual-ASR consensus labels. Ran ElevenLabs Scribe v2 over the 6,168 highest-quality tlog clips. Aligned Scribe transcripts vs. expected ayah text at the character level (diacritic-insensitive Levenshtein), then aligned vs. our ASR output. Asymmetric-trust consensus rule:
A token is labeled CORRECT if EITHER ASR or Scribe says correct. It is labeled WRONG only when BOTH agree wrong.
Result: 144,664 per-token labels at ~98.0% positive rate, with 6,789 ASR-vs-Scribe disagreements flagged as high-information tokens. Cost: ~$11 in ElevenLabs Creator-plan credits.
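The asymmetric-trust rule reduces to a two-input truth table. Below is a minimal sketch, assuming each system has already produced a per-token correct/wrong verdict from its alignment against the expected text; the function name and return shape are hypothetical.

```python
def consensus_label(asr_correct: bool, scribe_correct: bool) -> tuple:
    """Return (label, high_information) for one token.

    label is "WRONG" only when both systems say wrong; any disagreement
    between the two systems is marked as a high-information token.
    """
    label = "CORRECT" if (asr_correct or scribe_correct) else "WRONG"
    high_information = asr_correct != scribe_correct
    return label, high_information


for asr_ok in (True, False):
    for scribe_ok in (True, False):
        print(asr_ok, scribe_ok, consensus_label(asr_ok, scribe_ok))
```

Because a single "correct" vote is enough, the rule biases labels toward CORRECT, which is consistent with the ~98% positive rate reported above.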
Stage 5: final pronunciation head. Retrained on encoder features extracted with the final ASR (consistent features for both consensus labels and master qari anchors).
Stage 6: tajweed rule engine. Pure-Python rule engine that takes the expected diacritized text, audio, and per-token alignment, producing per-letter tajweed feedback across 27 dispatched rules (noon sakinah, meem sakinah, madd typology, qalqalah sughra/kubra, ra tafkheem/tarqeeq, Allah lafdh, lam shamsiyyah, hamzat wasl, idgham types, leen letters, more).
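To give a flavor of what one text-side rule might look like, the sketch below locates qalqalah candidates: one of the five qalqalah letters carrying a sukun. It is a simplified illustration and not the shipped engine. It assumes sukun is encoded as U+0652, ignores the sughra/kubra distinction and stopping positions, and uses no audio or alignment.

```python
QALQALAH_LETTERS = set("\u0642\u0637\u0628\u062c\u062f")  # qaf, ta, ba, jim, dal
SUKUN = "\u0652"  # assumed sukun encoding; some Quran texts use other code points


def find_qalqalah(text: str) -> list:
    """Return (index, letter) for each qalqalah letter directly followed by sukun."""
    hits = []
    for i, ch in enumerate(text[:-1]):
        if ch in QALQALAH_LETTERS and text[i + 1] == SUKUN:
            hits.append((i, ch))
    return hits


# Ya-fatha, dal-sukun, kha-damma, lam-damma: the dal at index 2 is a candidate.
word = "\u064a\u064e\u062f\u0652\u062e\u064f\u0644\u064f"
candidates = find_qalqalah(word)
```

A full rule would then check the aligned audio segment for the expected acoustic release before producing per-letter feedback.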
Architecture
- Backbone: NVIDIA FastConformer Large encoder + CTC head (114.6 M params)
- Tokenizer: SentencePiece BPE, 1,024 vocab + 1 blank = 1,025 output classes
- Audio: 16 kHz, log-mel features (80 channels)
- Decoder: CTC greedy_batched (frame-independent, noise-robust, deterministic)
- Output: token log-probabilities and 512-dim encoder features per output frame
CTC is the right decoder for this task: it provides frame-level alignment for the pronunciation pipeline, applies no auto-correction that would mask user mispronunciations, and runs faster at inference. Loose WER on EveryAyah is already 0.029%, so there is no quality reason to add decoder complexity.
Quick start (CTC, ONNX)
```python
import numpy as np
import onnxruntime as ort
import sentencepiece as spm
import soundfile as sf

# Load the ONNX graph that exposes both CTC log-probs and encoder features.
session = ort.InferenceSession(
    "onnx/model_with_encoder.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

# Read as float32; soundfile returns float64 by default.
wav, sr = sf.read("clip.wav", dtype="float32")  # 16 kHz mono float32
features = log_mel(wav)          # not defined here; see tajweed/aligner.py for the pipeline
features = features[None, ...]   # (B=1, 80, T_in)
length = np.array([features.shape[2]], dtype=np.int64)

logprobs, encoder_features = session.run(
    ["logprobs", "encoder_output"],
    {"audio_signal": features, "length": length},
)
# encoder_features: (1, 512, T_out). Feed to the pronunciation head.
# logprobs: (1, T_out, 1025). Argmax + CTC collapse to get tokens.
```
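The "argmax + CTC collapse" step mentioned in the last comment can be sketched as a greedy decoder: take the best class per frame, merge consecutive repeats, then drop blanks. The blank index is passed in explicitly because its position is an assumption here (index 1024, the last class, would follow the usual NeMo convention).

```python
import numpy as np


def ctc_greedy_decode(logprobs: np.ndarray, blank_id: int) -> list:
    """Greedy CTC decoding for one utterance.

    logprobs: (T, V) frame-level log-probabilities.
    Returns token ids with repeats merged and blanks removed.
    """
    best = np.asarray(logprobs).argmax(axis=-1)
    tokens = []
    prev = None
    for t in best:
        t = int(t)
        if t != prev and t != blank_id:   # new non-blank symbol
            tokens.append(t)
        prev = t
    return tokens
```

With the quick-start outputs, `sp.decode(ctc_greedy_decode(logprobs[0], blank_id=1024))` would turn the ids into text, under the blank-index assumption above.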
The full multi-signal scorer (CTC + head + anchor + GOP + tajweed) is in tajweed/full_scorer.py.
Files
| Path | Description |
|---|---|
| `nemo/fastconformer-quran-phase4c.nemo` | NeMo checkpoint (438 MB) |
| `onnx/model.onnx` | CTC-only ONNX, fp32 (437 MB) |
| `onnx/model.fp16.onnx` | CTC-only ONNX, fp16 (219 MB) |
| `onnx/model_with_encoder.onnx` | CTC + encoder features, fp32 (437 MB) |
| `head/pronunciation_head.pt` | Pronunciation head v7 (5.4 MB) |
| `tajweed/` | Python module: text analyzer, 27 rules, full scorer |
| `tokenizer.model` | SentencePiece tokenizer |
| `model_config.yaml` | NeMo model config |
| `demo/` | Three sample clips with known transcriptions |
For iOS / CoreML deployment, see Muno459/fastconformer-quran-coreml.
Datasets
- Tarteel EveryAyah (`tarteel-ai/everyayah`): CC-BY 4.0. ~30 K studio recitation clips.
- Tarteel tlog (`tarteel-ai/tlog`): gated. Real user phone recordings.
License
Apache 2.0, matching the upstream FastConformer-Hybrid license.
Citation
```bibtex
@misc{fastconformer-quran-2026,
  title  = {FastConformer-Quran: Quranic ASR and unsupervised mispronunciation scoring},
  author = {Anon},
  year   = {2026},
  url    = {https://huggingface.co/Muno459/fastconformer-quran},
}
```
Evaluation results
- WER (loose, no diacritics) on EveryAyah test: 0.029% (self-reported)
- CER (strict, with diacritics) on EveryAyah test: 0.027% (self-reported)
- WER (strict, with diacritics) on EveryAyah test: 0.175% (self-reported)
- WER (loose) on 30 unseen qaris on EveryAyah test (zero-shot reciters): 0.230% (self-reported)