FastConformer-Quran

State-of-the-art automatic speech recognition for Quranic recitation, with a multi-signal mispronunciation detector built on top.

🚀 0.029% WER on EveryAyah test (Hafs riwayah), 13× lower than the public Tarteel Whisper on the same audio.
🎯 82% sensitivity at 7% FPR on a 39K-token held-out mispronunciation benchmark.
📱 iOS-ready via the companion CoreML repo: Muno459/fastconformer-quran-coreml.


Headline numbers

Word and character error rates on EveryAyah test (500 clips, CTC decoder)

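| Model | WER loose (no diacritics) | WER strict (with diacritics) | CER strict (with diacritics) | RTF |
|---|---|---|---|---|
| FastConformer-Quran (this model) | 0.029% | 0.175% | 0.027% | 0.0012 |
| Tarteel Whisper (tarteel-ai/whisper-base-ar-quran) | 0.380% | 0.409% | 0.080% | 0.0199 |
| Tarteel Whisper published claim | 5.7544% | n/a | n/a | n/a |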

13× lower loose WER (0.380% vs. 0.029%) and roughly 16× faster (RTF 0.0199 vs. 0.0012) on the same hardware.
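The difference between the loose and strict columns can be illustrated with a minimal sketch. It assumes that "loose" means all Unicode combining marks (harakat, shadda, sukun) are stripped before scoring and that WER is word-level Levenshtein distance divided by the reference length. The exact normalization used for the numbers above is not stated here, so `strip_diacritics`, `edit_distance`, and `wer` are illustrative names, not the project's scoring code.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Drop Unicode combining marks (category Mn), which covers Arabic harakat."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref: str, hyp: str, strict: bool = True) -> float:
    """Word error rate; strict keeps diacritics, loose strips them first (assumption)."""
    if not strict:
        ref, hyp = strip_diacritics(ref), strip_diacritics(hyp)
    ref_words, hyp_words = ref.split(), hyp.split()
    return edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)
```

A hypothesis that differs from the reference only in a vowel mark counts as an error under strict scoring and as a match under loose scoring, which is why the strict WER is the higher of the two.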

Zero-shot reciter generalization (1,760 clips, 36 reciters)

Of the 36 reciters in the EveryAyah test split, 30 have zero training samples in our manifest:

| Group | Clips | This model (WER loose) | Tarteel Whisper (WER loose) | Ratio |
|---|---|---|---|---|
| Seen reciters (5) | 300 | 0.186% | 0.876% | 4.7× |
| Unseen reciters (30) | 1,460 | 0.230% | 0.635% | 2.8× |
| Aggregate | 1,760 | 0.222% | 0.676% | 3.0× |

Perfect (0.000%) WER on 16 zero-shot reciters including alafasy, husary, minshawi, mahmoud_ali_al_banna, mustafa_ismail, ahmed_ibn_ali_al_ajamy, and akram_alalaqimy. The model is not memorizing voices.

Phone-mic / user audio (tlog held-out, 800 clips, CTC)

| Metric | Value |
|---|---|
| WER loose | 21.24% |
| CER loose | 5.79% |
| WER strict | 40.91% |
| CER strict | 8.47% |

The tlog dataset has substantial label noise (filename-to-ayah mismatches, user-added basmala, partial recitations). Part of the apparent WER comes from the model correctly transcribing what was said while the metadata is wrong.


Demo audio

Alafasy reciting Q 1:4:

Predicted: مَالِكِ يَوْمِ الدِّينِ

Abdullah Basfar reciting Q 112:1:

Predicted: قُلْ هُوَ اللَّهُ أَحَدٌ

Alafasy reciting Q 38:67:

Predicted: قُلْ هُوَ نَبَأٌ عَظِيمٌ

Try the live demo: Space.


Mispronunciation detection

We also ship a multi-signal pronunciation scorer combining three orthogonal signals on the same CTC architecture:

| Signal | What it measures | Style-invariant |
|---|---|---|
| Pronunciation head v7 | Learned P(token correctly pronounced) from a 1.33 M-parameter MLP over encoder features | |
| Reference-anchor distance | Cosine distance to master qari centroid bank (multi-ayah aware) | partial |
| CTC GOP | log P(expected token) minus max log P(non-blank token), averaged over the token's CTC interval | yes |
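As a rough illustration of the CTC GOP signal, the sketch below computes the score for one token from per-frame log-probabilities. It assumes the maximum is taken over all non-blank tokens, including the expected one (so the score is at most 0), and that the token's frame interval `[start, end)` comes from an external alignment. The function name and argument layout are hypothetical.

```python
import math


def ctc_gop(logprobs, expected_id: int, start: int, end: int, blank_id: int) -> float:
    """Mean over frames of log P(expected) - max log P(non-blank).

    logprobs: per-frame lists of log-probabilities, shape (T, vocab).
    [start, end): frame interval assigned to the token (assumed given).
    """
    frames = logprobs[start:end]
    if not frames:
        raise ValueError("empty CTC interval")
    diffs = []
    for frame in frames:
        best_non_blank = max(lp for tok, lp in enumerate(frame) if tok != blank_id)
        diffs.append(frame[expected_id] - best_non_blank)
    return sum(diffs) / len(diffs)


# Toy example: two real tokens (ids 0 and 1) plus a blank (id 2).
toy = [
    [math.log(0.7), math.log(0.2), math.log(0.1)],
    [math.log(0.6), math.log(0.3), math.log(0.1)],
]
gop_top = ctc_gop(toy, expected_id=0, start=0, end=2, blank_id=2)    # expected token is the top choice
gop_other = ctc_gop(toy, expected_id=1, start=0, end=2, blank_id=2)  # expected token is not the top choice
```

Under this reading a score of 0 means the expected token was the top non-blank choice in every frame, and more negative scores mean the acoustic evidence favored a different token.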

A token is flagged by the consensus rule when at least 2 of 3 signals agree (default thresholds: head < 0.5, anchor > 0.20, GOP < -3.0).
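The 2-of-3 rule with the default thresholds can be written as a small voting function. This is a minimal sketch: the function name is hypothetical, and treating a missing anchor score (tokens the centroid bank does not cover) as a non-vote is an assumption rather than documented behavior.

```python
from typing import Optional

HEAD_THRESHOLD = 0.5     # flag when head probability is below this
ANCHOR_THRESHOLD = 0.20  # flag when anchor cosine distance is above this
GOP_THRESHOLD = -3.0     # flag when GOP is below this


def flag_token(head_prob: float, anchor_dist: Optional[float], gop: float) -> bool:
    """Return True when at least two of the three signals vote 'mispronounced'."""
    votes = [
        head_prob < HEAD_THRESHOLD,
        anchor_dist is not None and anchor_dist > ANCHOR_THRESHOLD,  # None = no coverage (assumption)
        gop < GOP_THRESHOLD,
    ]
    return sum(votes) >= 2
```

For example, `flag_token(0.3, 0.1, -4.0)` flags the token because the head and GOP both vote against it, while `flag_token(0.3, 0.1, -1.0)` does not because only the head does.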

Held-out evaluation

On 39,173 tokens from 996 tlog clips that were never seen by the pronunciation head during training, with per-token consensus labels from our ASR and ElevenLabs Scribe v2:

| Detector | TPR @ 1% FPR | TPR @ 5% FPR | AUC / operating point |
|---|---|---|---|
| GOP (style-invariant) | 72.4% | 77.6% | 0.969 |
| Pronunciation head | 73.7% | 84.2% | 0.953 |
| Anchor distance (where bank covers) | 0.7% | 10.5% | 0.732 |
| Consensus (2 of 3) | n/a | n/a | 82.2% TPR / 7.2% FPR |

The combined detector catches ~82% of real mispronunciations at a 7% false-positive rate. This is the deployable operating point.
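For readers reproducing the "TPR @ x% FPR" columns, one common way to compute such a number is sketched below. It assumes scores are oriented so that higher means "more likely mispronounced" (so head probabilities and GOP values would be negated first) and that labels use 1 for mispronounced tokens. This is not necessarily the exact evaluation script behind the table.

```python
def tpr_at_fpr(scores, labels, target_fpr: float) -> float:
    """True-positive rate at the strictest threshold whose FPR stays <= target_fpr."""
    negatives = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    positives = [s for s, y in zip(scores, labels) if y == 1]
    if not negatives or not positives:
        raise ValueError("need at least one positive and one negative token")
    allowed_fp = int(target_fpr * len(negatives))  # false positives we may accept
    # Flag tokens scoring strictly above the (allowed_fp + 1)-th highest negative.
    threshold = negatives[allowed_fp] if allowed_fp < len(negatives) else float("-inf")
    return sum(s > threshold for s in positives) / len(positives)
```

The consensus detector has no tunable score, so it yields a single (TPR, FPR) point rather than a curve, which is why its fixed-FPR columns read n/a.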


What this model is and isn't

Is:

  • The best published ASR for Quranic recitation in Hafs riwayah.
  • A frame-level CTC model with 512-dim encoder features exposed for downstream scoring.
  • Production-ready ONNX (fp32 437 MB, fp16 219 MB), running at RTF ~0.001 on an RTX 4090.
  • Diacritic-aware: outputs fully harakat-marked Arabic text.

Isn't:

  • A general Arabic ASR. Trained only on Quranic audio with a 1,024-token BPE tokenizer. Performance on dialectal, news, or conversational Arabic will be poor by design.
  • A native streaming model. The offline model can be run on overlapping chunks for a responsive UX (~5-8 second latency), which is appropriate for ayah-by-ayah recitation. See the CoreML repo for the chunked-streaming pattern with Swift code. True token-by-token low-latency streaming would require a separately-trained cache-aware variant (multi-day GPU job, deferred).
  • Trained for other qira'at. Hafs riwayah only.
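The overlapping-chunk pattern mentioned in the streaming bullet can be sketched as a function that yields sample ranges to feed to the offline model one at a time. The 8-second chunk and 2-second overlap are illustrative assumptions, not values taken from the CoreML repo, and `chunk_bounds` is a hypothetical helper.

```python
def chunk_bounds(num_samples: int, sample_rate: int = 16000,
                 chunk_s: float = 8.0, overlap_s: float = 2.0):
    """Return (start, end) sample indices of overlapping chunks covering the audio."""
    chunk = int(chunk_s * sample_rate)
    hop = chunk - int(overlap_s * sample_rate)
    if hop <= 0:
        raise ValueError("overlap must be shorter than the chunk")
    bounds = []
    start = 0
    while True:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end == num_samples:  # last chunk reaches the end of the audio
            break
        start += hop
    return bounds
```

Each chunk would be transcribed independently and the overlapping regions reconciled afterwards. The reconciliation strategy is outside the scope of this sketch.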

How it was built

Stage 1: base training on EveryAyah. Fine-tuned NVIDIA's stt_en_fastconformer_hybrid_large_pc from English-pretrained weights to Arabic + Quran on 22 K EveryAyah clips. Reached 0.0757% WER on the held-out test split.

Stage 2: pronunciation scoring stack. Built a per-token head on top of frozen encoder features (512-dim pooled + 64-dim token embedding + 16-dim Quran-phonology features into an MLP). Initial training on weak labels (CTC-vs-expected disagreement + GOP scores) plus master qari anchors (Husary, Abdul Basit, Alafasy clean recitations).
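The per-token input described in Stage 2 adds up to 512 + 64 + 16 = 592 dimensions. The sketch below shows one way to assemble it, assuming "pooled" means mean-pooling the encoder frames that fall inside the token's interval. The pooling method, the function names, and the frame layout (a list of 512-dim vectors, i.e. the ONNX output transposed to time-major) are assumptions.

```python
def mean_pool(frames):
    """Average a list of equal-length feature vectors along the time axis."""
    if not frames:
        raise ValueError("token interval has no frames")
    return [sum(column) / len(frames) for column in zip(*frames)]


def head_input(encoder_frames, token_embedding, phonology_features):
    """Concatenate pooled encoder features, token embedding, and phonology features."""
    pooled = mean_pool(encoder_frames)
    if (len(pooled), len(token_embedding), len(phonology_features)) != (512, 64, 16):
        raise ValueError("expected 512-, 64-, and 16-dim inputs")
    return pooled + list(token_embedding) + list(phonology_features)  # 592 dims
```

The resulting 592-dim vector is what an MLP head of this kind would consume, one vector per expected token.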

Stage 3: phone-audio fine-tune. Three rounds of low-LR continuation on EveryAyah and tlog: 22,585 clean clips, 6,869 high-quality tlog clips (full weight), and 20,589 borderline tlog clips (half weight). The LR schedule was 1e-5, 5e-6, 2.5e-6 over six epochs total. Performance on a held-out tlog slice improved monotonically through round three, then saturated.

Stage 4: dual-ASR consensus labels. Ran ElevenLabs Scribe v2 over the 6,168 highest-quality tlog clips. Aligned Scribe transcripts vs. expected ayah text at the character level (diacritic-insensitive Levenshtein), then aligned vs. our ASR output. Asymmetric-trust consensus rule:

A token is labeled CORRECT if EITHER ASR or Scribe says correct. It is labeled WRONG only when BOTH agree wrong.

Result: 144,664 per-token labels at ~98.0% positive rate, with 6,789 ASR-vs-Scribe disagreements flagged as high-information tokens. Cost: ~$11 in ElevenLabs Creator-plan credits.
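The asymmetric-trust rule can be sketched as follows. The example assumes the three token sequences (expected text, our ASR output, Scribe output) have already been aligned position by position, which in the actual pipeline is done by the character-level Levenshtein alignment described above. Token matching here is simple diacritic-insensitive equality, and all function names are hypothetical.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Drop Unicode combining marks so comparison ignores harakat."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def consensus_label(asr_correct: bool, scribe_correct: bool):
    """CORRECT if either system agrees with the expected token; WRONG only if both disagree."""
    label = "CORRECT" if (asr_correct or scribe_correct) else "WRONG"
    disagreement = asr_correct != scribe_correct  # high-information token
    return label, disagreement


def label_tokens(expected, asr, scribe):
    """Label pre-aligned token lists of equal length (alignment is assumed done)."""
    labels = []
    for exp_tok, asr_tok, scribe_tok in zip(expected, asr, scribe):
        target = strip_diacritics(exp_tok)
        labels.append(consensus_label(strip_diacritics(asr_tok) == target,
                                      strip_diacritics(scribe_tok) == target))
    return labels
```

Because a single "correct" vote is enough, the rule biases labels toward CORRECT, which is consistent with the high positive rate reported above.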

Stage 5: final pronunciation head. Retrained on encoder features extracted with the final ASR (consistent features for both consensus labels and master qari anchors).

Stage 6: tajweed rule engine. Pure-Python rule engine that takes the expected diacritized text, audio, and per-token alignment, producing per-letter tajweed feedback across 27 dispatched rules (noon sakinah, meem sakinah, madd typology, qalqalah sughra/kubra, ra tafkheem/tarqeeq, Allah lafdh, lam shamsiyyah, hamzat wasl, idgham types, leen letters, more).


Architecture

  • Backbone: NVIDIA FastConformer Large encoder + CTC head (114.6 M params)
  • Tokenizer: SentencePiece BPE, 1,024 vocab + 1 blank = 1,025 output classes
  • Audio: 16 kHz, log-mel features (80 channels)
  • Decoder: CTC greedy_batched (frame-independent, noise-robust, deterministic)
  • Output: token log-probabilities and 512-dim encoder features per output frame

CTC is the right decoder for this task: it gives frame-level alignment for the pronunciation pipeline, performs no auto-correction that would mask user mispronunciations, and runs faster at inference. Loose WER on EveryAyah is already 0.029%, so there is no quality reason to add decoder complexity.


Quick start (CTC, ONNX)

import numpy as np, soundfile as sf, onnxruntime as ort, sentencepiece as spm

session = ort.InferenceSession("onnx/model_with_encoder.onnx",
                                providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

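wav, sr = sf.read("clip.wav", dtype="float32")   # mono; sf.read returns float64 unless dtype is set
assert sr == 16000, "the model expects 16 kHz audio"
features = log_mel(wav)                  # not defined here; see tajweed/aligner.py for the pipeline
features = features[None, ...].astype(np.float32)   # (B=1, 80, T_in)
length = np.array([features.shape[2]], dtype=np.int64)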

logprobs, encoder_features = session.run(
    ["logprobs", "encoder_output"],
    {"audio_signal": features, "length": length},
)
# encoder_features: (1, 512, T_out). Feed to the pronunciation head.
# logprobs:         (1, T_out, 1025). Argmax + CTC collapse to get tokens.
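The "argmax + CTC collapse" step mentioned in the last comment can be sketched in plain Python. The function below takes per-frame log-probabilities for one clip and returns token ids. It assumes nothing about the model beyond the blank id, which is passed in explicitly.

```python
def ctc_greedy_decode(frame_logprobs, blank_id: int):
    """Greedy CTC decoding: per-frame argmax, merge repeats, drop blanks."""
    best_path = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_logprobs]
    tokens = []
    previous = None
    for token in best_path:
        # Emit a token only when it differs from the previous frame and is not blank.
        if token != previous and token != blank_id:
            tokens.append(token)
        previous = token
    return tokens
```

With the outputs above this would be used roughly as `ids = ctc_greedy_decode(logprobs[0].tolist(), blank_id=1024)` followed by `text = sp.decode(ids)`. The value `blank_id=1024` assumes the blank is the last of the 1,025 output classes, which is worth verifying against model_config.yaml.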

The full multi-signal scorer (CTC + head + anchor + GOP + tajweed) is in tajweed/full_scorer.py.


Files

| Path | Description |
|---|---|
| nemo/fastconformer-quran-phase4c.nemo | NeMo checkpoint (438 MB) |
| onnx/model.onnx | CTC-only ONNX, fp32 (437 MB) |
| onnx/model.fp16.onnx | CTC-only ONNX, fp16 (219 MB) |
| onnx/model_with_encoder.onnx | CTC + encoder features, fp32 (437 MB) |
| head/pronunciation_head.pt | Pronunciation head v7 (5.4 MB) |
| tajweed/ | Python module: text analyzer, 27 rules, full scorer |
| tokenizer.model | SentencePiece tokenizer |
| model_config.yaml | NeMo model config |
| demo/ | Three sample clips with known transcriptions |

For iOS / CoreML deployment, see Muno459/fastconformer-quran-coreml.


Datasets

  • Tarteel EveryAyah (tarteel-ai/everyayah): CC-BY 4.0. ~30 K studio recitation clips.
  • Tarteel tlog (tarteel-ai/tlog): gated. Real user phone recordings.

License

Apache 2.0, matching the upstream FastConformer-Hybrid license.

Citation

@misc{fastconformer-quran-2026,
  title  = {FastConformer-Quran: Quranic ASR and unsupervised mispronunciation scoring},
  author = {Anon},
  year   = {2026},
  url    = {https://huggingface.co/Muno459/fastconformer-quran},
}

Evaluation results

  • WER (loose, no diacritics) on EveryAyah test (self-reported): 0.029%
  • CER (strict, with diacritics) on EveryAyah test (self-reported): 0.027%
  • WER (strict, with diacritics) on EveryAyah test (self-reported): 0.175%
  • WER (loose) on 30 unseen qaris on EveryAyah test (zero-shot reciters, self-reported): 0.230%