FastConformer-Quran — Streaming (CoreML / Apple Neural Engine)

Real-time, on-device streaming Quranic recitation ASR for iOS & macOS. Cache-aware FastConformer-Hybrid (CTC), fp16, runs on the Apple Neural Engine at a few milliseconds per chunk — built for live recitation tracking (word highlighting, follow-along, real-time feedback).

This is the streaming member of the FastConformer-Quran family. For maximum-accuracy full-utterance transcription see the offline CoreML repo; for the source model / ONNX / .nemo, see Muno459/fastconformer-quran.

Riwayah: Hafs only — not a general Arabic ASR.
Output: Arabic with full tashkīl (diacritics).
Architecture: cache-aware FastConformer-Hybrid, CTC head, att_context_size = [70, 13] (~1.04 s lookahead), fixed chunk of 112 mel frames (1120 ms).
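The timing figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch: the 8× encoder subsampling factor is an assumption that this card does not state (it is inferred from 1.04 s / 13 frames = 80 ms per encoder frame), and all constant names are illustrative.

```swift
// Timing arithmetic for the streaming configuration.
// Assumption: 8x encoder subsampling (inferred, not stated in this card).
let melHopMs = 10            // hop between mel frames, in ms
let subsampling = 8          // assumed encoder subsampling factor
let leftContextFrames = 70   // left context from att_context_size = [70, 13]
let lookaheadFrames = 13     // right context from att_context_size = [70, 13]
let chunkMelFrames = 112     // fixed chunk length in mel frames
let sampleRate = 16_000

let encoderFrameMs = melHopMs * subsampling            // 80 ms per encoder frame
let leftContextMs = leftContextFrames * encoderFrameMs // 5600 ms of history
let lookaheadMs = lookaheadFrames * encoderFrameMs     // 1040 ms of lookahead
let chunkMs = chunkMelFrames * melHopMs                // 1120 ms per chunk
let chunkSampleCount = chunkMs * sampleRate / 1000     // 17,920 samples per chunk

print("lookahead \(lookaheadMs) ms, chunk \(chunkMs) ms, \(chunkSampleCount) samples")
```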

✅ Verified on the Apple Neural Engine

Measured on-device (Apple Silicon, MLComputeUnits.cpuAndNeuralEngine):

Metric	Result
Decoding	11 / 11 Al-Fātiḥah + Al-Ikhlās ayāt correct (incl. the Basmala), 0 NaN
ANE residency	~99% — 1094 ops on ANE / 9 on CPU (no silent GPU/CPU fallback)
Latency	5–8 ms per 1120 ms chunk (real-time, large margin)
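To put the latency row in perspective, the real-time factor (processing time divided by audio duration) can be worked out from the two numbers in the table. The function name below is illustrative, not part of any API.

```swift
// Real-time factor (RTF) = processing time / audio duration.
// Values below 1.0 mean the model runs faster than real time.
func realTimeFactor(latencyMs: Double, audioMs: Double) -> Double {
    return latencyMs / audioMs
}

let chunkAudioMs = 1120.0
let bestRTF = realTimeFactor(latencyMs: 5, audioMs: chunkAudioMs)   // about 0.0045
let worstRTF = realTimeFactor(latencyMs: 8, audioMs: chunkAudioMs)  // about 0.0071

// Equivalent speed-up over real time: 1120 / 8 = 140x up to 1120 / 5 = 224x.
print("RTF between \(bestRTF) and \(worstRTF)")
```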

The chunked, limited-context attention bounds fp16 accumulation by design, so CTC margins stay safely positive in fp16 on the ANE.

Scope: the on-device check above covers 11 clean EveryAyah test ayāt, which is strong evidence that the design holds. A broad multi-reciter on-device WER sweep is future work. A few frames sit on a thin positive margin (~0.01–0.46 nats), so very noisy input could surface an edge case.

Accuracy (held-out WER / CER %)

Evaluated with greedy CTC decoding on a leakage-free held-out set: EveryAyah reciters never used in training, a held-out QUL reciter, and real phone-recorded recitation. Alef-insensitive WER is shown in parentheses:

Test set	Streaming WER	CER
EveryAyah (held-out reciters, clean studio)	6.3 (6.0)	2.2
QUL — Al-Nufais (held-out reciter, clean)	11.6 (11.2)	6.7
Real phone recitation (tlog)	19.6 (14.3)	7.1
All	9.8 (8.6)	4.0
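For readers who want to reproduce this kind of score, here is a minimal sketch of word-level WER with an optional alef normalization. The exact normalization behind the "alef-insensitive" figures is not specified in this card; mapping the hamza, madda, and wasla alef variants to bare alef is an assumption, and both function names are hypothetical.

```swift
// Assumption: "alef-insensitive" maps the alef variants U+0622, U+0623, U+0625, U+0671
// to bare alef (U+0627). Works on Unicode scalars because a letter plus its tashkil
// forms a single Character in Swift.
func normalizeAlef(_ text: String) -> String {
    let variants: Set<UInt32> = [0x0622, 0x0623, 0x0625, 0x0671]
    let bareAlef: Unicode.Scalar = "\u{0627}"
    var scalars = String.UnicodeScalarView()
    for scalar in text.unicodeScalars {
        scalars.append(variants.contains(scalar.value) ? bareAlef : scalar)
    }
    return String(scalars)
}

// Word error rate: word-level Levenshtein distance divided by the reference length.
func wordErrorRate(reference: String, hypothesis: String) -> Double {
    let ref = reference.split(separator: " ").map(String.init)
    let hyp = hypothesis.split(separator: " ").map(String.init)
    guard !ref.isEmpty else { return hyp.isEmpty ? 0 : 1 }
    var previous = Array(0...hyp.count)          // distances against an empty reference
    for i in 1...ref.count {
        var current = [i] + Array(repeating: 0, count: hyp.count)
        for j in 0..<hyp.count {
            let substitution = previous[j] + (ref[i - 1] == hyp[j] ? 0 : 1)
            let deletion = previous[j + 1] + 1
            let insertion = current[j] + 1
            current[j + 1] = min(substitution, deletion, insertion)
        }
        previous = current
    }
    return Double(previous[hyp.count]) / Double(ref.count)
}
```

Passing both strings through `normalizeAlef` before calling `wordErrorRate` gives the alef-insensitive variant; calling `wordErrorRate` directly gives the strict one.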

Streaming trades some accuracy for low-latency, state-carrying inference. If you need the lowest WER and latency isn't critical, the offline variant scores ~3% WER on the same clips.

Files

File	Purpose	Size
`fastconformer-quran-streaming.mlpackage`	Cache-aware streaming encoder + CTC head	~204 MB
`pronunciation-head.mlpackage`	Per-token pronunciation scorer (streaming-matched)	~5 MB
`tokenizer.model` / `tokens.txt`	SentencePiece BPE (1024 pieces + blank id 1024)	—

Streaming model I/O (fixed shapes, fp16)

Inputs

Name	Shape	dtype
`audio_signal`	`(1, 80, 112)`	float16
`cache_last_channel`	`(1, 17, 70, 512)`	float16
`cache_last_time`	`(1, 17, 512, 8)`	float16
`cache_last_channel_len`	`(1,)`	int32

Outputs

Name	Shape
`logprobs`	`(1, 13, 1025)`
`encoder_output`	`(1, 512, 13)`
`cache_last_channel_next`	`(1, 17, 70, 512)`
`cache_last_time_next`	`(1, 17, 512, 8)`
`cache_last_channel_len_next`	`(1,)`

All shapes are concrete (no dynamic axes, no audio-length input), so the Neural Engine pre-compiles one kernel and runs it without fallback. Feed each chunk in order, then carry the three *_next cache tensors into the next call.
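As a rough guide to how much state is carried between calls, the cache sizes follow directly from the shapes listed above. This sketch only multiplies out the dimensions at 2 bytes per fp16 value; the helper name is illustrative.

```swift
// Element and byte counts of the fp16 caches carried from one call to the next.
func elementCount(_ shape: [Int]) -> Int {
    return shape.reduce(1, *)
}

let attentionCacheShape = [1, 17, 70, 512]   // cache_last_channel
let convCacheShape = [1, 17, 512, 8]         // cache_last_time
let bytesPerFloat16 = 2

let attentionCacheBytes = elementCount(attentionCacheShape) * bytesPerFloat16  // 1,218,560
let convCacheBytes = elementCount(convCacheShape) * bytesPerFloat16            // 139,264
let totalCacheBytes = attentionCacheBytes + convCacheBytes                     // 1,357,824

print("cache state per call: \(totalCacheBytes) bytes")
```

At roughly 1.36 MB in total, the caches are small next to the ~204 MB model.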

Quick start (Swift)

import CoreML

let cfg = MLModelConfiguration()
cfg.computeUnits = .cpuAndNeuralEngine
let model = try FastConformerQuranStreaming(configuration: cfg)

// Empty caches: shapes must match the spec exactly.
var cacheLC  = try MLMultiArray(shape: [1, 17, 70, 512], dataType: .float16)  // attention cache
var cacheLT  = try MLMultiArray(shape: [1, 17, 512, 8],  dataType: .float16)  // conv cache
var cacheLen = try MLMultiArray(shape: [1], dataType: .int32); cacheLen[0] = 0
// zero(_:) is an app-side helper (not defined here) that fills an MLMultiArray with zeros.
zero(cacheLC); zero(cacheLT)

// Fixed chunk: 112 mel frames = 1120 ms = 17,920 samples @ 16 kHz.
let CHUNK_SAMPLES = 112 * 160
var buffer = [Float](), transcript = ""

// computeLogMel, ctcCollapse and sentencePieceDecode are app-side helpers (not defined here).
func feed(_ samples: [Float]) throws {
    buffer.append(contentsOf: samples)
    // Run the model once per complete chunk; leftover samples wait for the next call.
    while buffer.count >= CHUNK_SAMPLES {
        let chunk = Array(buffer.prefix(CHUNK_SAMPLES))
        buffer.removeFirst(CHUNK_SAMPLES)
        let feats = computeLogMel(chunk)                       // (1, 80, 112) Float16
        let out = try model.prediction(audio_signal: feats,
                                       cache_last_channel: cacheLC,
                                       cache_last_time: cacheLT,
                                       cache_last_channel_len: cacheLen)
        // Carry the updated caches into the next call.
        cacheLC = out.cache_last_channel_next
        cacheLT = out.cache_last_time_next
        cacheLen = out.cache_last_channel_len_next
        transcript += sentencePieceDecode(ctcCollapse(out.logprobs))
    }
}

Feature extraction (must match exactly)

80-channel log-mel, identical to NeMo FilterbankFeatures:

16 kHz, mono
window 25 ms (400 samples), Hann · hop 10 ms (160 samples) · 512-pt FFT
80 mel bins (Slaney), power spectrum, log(mel + 1e-5)
pre-emphasis 0.97 (applied to the waveform, before the STFT); per-feature mean/variance normalization (applied to the log-mel output)
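Two of the steps listed above are simple enough to sketch without an FFT: pre-emphasis on the waveform and per-feature normalization of the log-mel matrix. This is an illustration under stated assumptions, not the reference implementation: the unbiased (n − 1) variance and the 1e-5 stabilizer added to the standard deviation are assumptions about the NeMo behaviour, and both function names are hypothetical. Check the output against the Python reference before relying on it.

```swift
// Pre-emphasis on the raw waveform: y[0] = x[0], y[n] = x[n] - coeff * x[n - 1].
func preEmphasis(_ x: [Float], coeff: Float = 0.97) -> [Float] {
    guard !x.isEmpty else { return [] }
    var y = x
    for n in 1..<x.count {
        y[n] = x[n] - coeff * x[n - 1]
    }
    return y
}

// Per-feature normalization: each mel bin (row) is normalized over the frames passed in.
// Assumptions: unbiased variance (n - 1) and a 1e-5 stabilizer added to the std.
// `feats` is laid out as [melBin][frame].
func normalizePerFeature(_ feats: [[Float]]) -> [[Float]] {
    return feats.map { row -> [Float] in
        guard row.count > 1 else { return row }
        let mean = row.reduce(0, +) / Float(row.count)
        let sumSquares = row.map { ($0 - mean) * ($0 - mean) }.reduce(0, +)
        let std = (sumSquares / Float(row.count - 1)).squareRoot() + 1e-5
        return row.map { ($0 - mean) / std }
    }
}
```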

Python reference: tajweed/aligner.py. A Swift port that uses Accelerate for the FFT takes roughly 200 lines.

Decoding

Argmax logprobs per frame → token IDs.
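CTC collapse: merge consecutive identical IDs first, then drop blanks (id 1024). The order matters: dropping blanks first would wrongly merge a token that is genuinely repeated with a blank in between.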
SentencePiece-decode (tokenizer.model) → Arabic text. Append across chunks for a rolling transcript.
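The first two decoding steps can be sketched in plain Swift as follows. This is a minimal illustration, not the card's own helper: it assumes the logprobs have already been copied into a `[[Float]]` of shape frames × vocabulary, and the names `argmax` and `ctcGreedy` are hypothetical. The `previousID` argument carries the last frame's ID across calls, so a token that straddles a chunk boundary is emitted only once.

```swift
// Index of the largest log-probability in one frame.
func argmax(_ frame: [Float]) -> Int {
    var best = 0
    for i in frame.indices where frame[i] > frame[best] {
        best = i
    }
    return best
}

// Greedy CTC decoding for one chunk: merge consecutive repeats, then drop blanks.
// `previousID` is the argmax of the last frame seen so far (start it at the blank ID).
func ctcGreedy(_ logprobs: [[Float]], previousID: inout Int, blankID: Int = 1024) -> [Int] {
    var tokens: [Int] = []
    for frame in logprobs {
        let id = argmax(frame)
        if id != previousID && id != blankID {
            tokens.append(id)
        }
        previousID = id
    }
    return tokens
}

// Toy example with a 4-entry vocabulary where id 3 plays the role of blank.
var lastID = 3
let firstChunk = ctcGreedy([[0, -9, -9, -9], [0, -9, -9, -9], [-9, -9, -9, 0]],
                           previousID: &lastID, blankID: 3)
let secondChunk = ctcGreedy([[0, -9, -9, -9], [-9, 0, -9, -9]],
                            previousID: &lastID, blankID: 3)
print(firstChunk, secondChunk)
```

In the toy run, token 0 appears in both chunks because a blank separates the two occurrences. Without that blank, the second occurrence would be merged into the first.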

Pronunciation head (optional)

pronunciation-head.mlpackage is trained on features pooled from this streaming encoder, so its input distribution matches what the model emits on-device. Inputs: the pooled encoder_output for each token (512-d) plus the token ID. Output: prob_correct, the probability that the token was pronounced correctly. All ops are ANE-friendly, and the output is sigmoid-bounded, so fp16 raises no precision concerns.
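The card does not say how encoder frames are pooled per token, so the following is only one plausible reading: a mean over the frames assigned to a token. Both the pooling method and the way frames are assigned to tokens are assumptions here, and `meanPool` is a hypothetical helper.

```swift
// Mean-pool the encoder frames in `frames` into one vector (one value per channel).
// `encoder` is laid out as [channel][frame], i.e. encoder_output (1, 512, 13)
// with the batch axis removed. Assumption: mean pooling over the token's frames.
func meanPool(_ encoder: [[Float]], frames: Range<Int>) -> [Float] {
    precondition(!frames.isEmpty, "a token needs at least one frame")
    return encoder.map { channel in
        frames.reduce(Float(0)) { $0 + channel[$1] } / Float(frames.count)
    }
}

// Toy example: 2 channels, 3 frames, pooling the first two frames.
let pooled = meanPool([[1, 2, 3], [4, 4, 4]], frames: 0..<2)
print(pooled)
```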

Precision note

fp16 throughout (no int8/int4; the ANE is natively fp16). The RelPositionalEncoding xscale multiply (×√512) can exceed the fp16 maximum on large activations, so the product is saturated to ±65504: (x·√512).clamp(±65504). This matches the ANE's own saturating behaviour, so it is a no-op on-device, yet it prevents inf→NaN if an op is ever evicted off the ANE. The clamp is baked into the graph as a single clip op.
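A small numeric sketch shows where the saturation starts to matter. It uses Float (fp32) arithmetic purely to illustrate the numbers and is not the exported graph; the names are illustrative.

```swift
// Scale by sqrt(512) and saturate to the fp16 range, mirroring the clip described above.
let fp16Max: Float = 65_504
let xscale = Float(512).squareRoot()     // about 22.627

func scaleAndSaturate(_ x: Float) -> Float {
    return min(max(x * xscale, -fp16Max), fp16Max)
}

// Activations larger than this magnitude would overflow fp16 after scaling.
let overflowThreshold = fp16Max / xscale // about 2894.9

print(scaleAndSaturate(100), scaleAndSaturate(3000), overflowThreshold)
```

Any activation with magnitude above roughly 2895 is the case the clip guards against; smaller values pass through unchanged.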

License

This model is licensed under the Quran-Lab No-Profit License v1.0 (NPL-1.0). See LICENSE for the full terms. In short:

No profit. You may not use this model, any derivative (fine-tune, quantization, distillation, re-export), or any service built on it to make money.
Local use is free. Running it yourself must cost users nothing.
Hosted services: cost recovery only. If you serve it over a network, you may recover only the strictly necessary direct running costs (compute, storage, bandwidth), never a profit.
Share-alike. Every derivative must be released under this same license.
Attribution. Credit Muno459 / Quran-Lab, fastconformer-quran.

This replaces the previous Apache-2.0 license. The change is not retroactive: copies obtained earlier under Apache-2.0 keep those terms, but all use going forward is governed by NPL-1.0.

Upstream components this model builds on (NVIDIA FastConformer and the upstream Muno459/fastconformer-quran) were released under Apache-2.0 and retain their own Apache-2.0 terms for those original portions. The NPL-1.0 terms above apply to this model as distributed here.

Citation

@misc{fastconformer_quran_coreml_streaming_2026,
  title  = {FastConformer-Quran (Streaming, CoreML): on-device Quranic ASR for Apple Neural Engine},
  year   = {2026},
  url    = {https://huggingface.co/Muno459/fastconformer-quran-coreml-streaming}
}

Benchmark

Leakage-free held-out WER vs nvidia / whisper / seamless / mms / omniASR / Tarteel: Quranic ASR Leaderboard.

Base model: Muno459/fastconformer-quran-streaming
