FastConformer-Quran CoreML (Streaming)
iOS and macOS deployment of Muno459/fastconformer-quran, cache-aware streaming variant (~1 s lookahead). Apple-silicon native, full fp16 precision (no integer reduction), Neural Engine optimized.
For the offline full-utterance variant, see Muno459/fastconformer-quran-coreml-offline.
Precision
fp16 throughout, no integer reduction. Apple's Neural Engine is natively fp16, so fp16 weights and activations are the accuracy-preserving choice on-device. No int8 / int4 used anywhere in this release.
FP16 overflow fix
The vanilla FP16 export produced all-NaN logprobs on real audio. Root cause: RelPositionalEncoding multiplies the pre-encode output (peaks ~5 200) by xscale = √512 ≈ 22.6, producing values up to 117 924, about 1.8× the FP16 maximum of 65 504. Fix: pos_enc.forward is patched to clamp(x, -2400, 2400) before the xscale multiply. The clamp is baked into the traced graph as a MIL clip op and has no effect on transcription accuracy for real audio.
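The overflow arithmetic can be checked in a few lines of Swift. This is a minimal sketch that uses `Float` to stand in for the fp16 range; the names `fp16Max`, `preEncodePeak`, and `clampLimit` are illustrative, and the peak is the approximate figure quoted above rather than a measured constant.

```swift
// Values quoted in this section; the names are illustrative only.
let fp16Max: Float = 65_504            // largest finite half-precision value
let xscale = Float(512).squareRoot()   // sqrt(512), roughly 22.6
let preEncodePeak: Float = 5_200       // approximate peak of the pre-encode output
let clampLimit: Float = 2_400          // clamp bound applied before the multiply

// Without the clamp, the scaled peak exceeds the fp16 range.
let unclamped = preEncodePeak * xscale

// With the clamp, the largest possible product stays inside the range.
let clamped = min(max(preEncodePeak, -clampLimit), clampLimit) * xscale

print("unclamped:", unclamped, "overflows:", unclamped > fp16Max)
print("clamped:", clamped, "overflows:", clamped > fp16Max)
```

With a bound of 2400 the worst-case product is about 54 300, which leaves headroom below 65 504.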
Models
| File | Purpose | Size | Inputs | Outputs |
|---|---|---|---|---|
| fastconformer-quran-streaming.mlpackage | Cache-aware streaming ASR + CTC head | 204 MB | audio_signal: (1, 80, 112), cache_last_channel: (1, 17, 70, 512), cache_last_time: (1, 17, 512, 4), cache_last_channel_len: (1,) | logprobs: (1, 12, 1025), encoder_output: (1, 512, 12), updated cache tensors |
| pronunciation-head.mlpackage | Pronunciation head v7 | 5 MB | encoder_feature: (N, 512), token_id: (N,) | prob_correct: (N,) |
Both packages are pure CTC end-to-end.
Streaming variant details
Trained on the NVIDIA Arabic FastConformer base + Quran fine-tune with chunked_limited attention. Validated val_wer 1.25% at the recommended preset (training-subset metric); offline equivalent measures 0.13% on the full validation set, so streaming carries a real but small accuracy penalty in exchange for state-carrying low-latency inference.
| Preset (att_context_size) | Lookahead | End-to-end latency | Status |
|---|---|---|---|
| [70, 13] | 13 frames | ~1040 ms (recommended) | Clean |
| [70, 6] | 6 frames | 480 ms | Mostly clean |
| [70, 1] | 1 frame | 80 ms | Degraded |
| [70, 0] | 0 frames | 0 ms | Degraded |
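
The latency column follows from the lookahead: each encoder frame appears to cover 80 ms of audio, which is inferred from the table values (13 frames gives 1040 ms) rather than stated in the model spec. A small sketch, with `lookaheadLatencyMs` as a hypothetical helper:

```swift
// Assumption: one encoder frame spans 80 ms, inferred from the table above.
let encoderFrameMs = 80

/// Lookahead latency in milliseconds for an att_context_size preset [left, right].
func lookaheadLatencyMs(attContextSize: [Int]) -> Int {
    precondition(attContextSize.count == 2, "expected [left, right]")
    return attContextSize[1] * encoderFrameMs
}

let presets = [[70, 13], [70, 6], [70, 1], [70, 0]]
for preset in presets {
    print(preset, "->", lookaheadLatencyMs(attContextSize: preset), "ms")
}
```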
Why the low-latency presets are degraded: each Conformer layer has a symmetric depthwise conv (kernel 31) that wants ~640 ms of future audio per layer regardless of attention lookahead. Cleaning those up would require retraining with causal convolutions. The [70, 13] preset has enough lookahead to absorb the conv leak and gives full-quality output.
Use [70, 13]. It carries state across chunks via the cache tensors, so a real ~1 s rolling transcript is possible without the chunked-overlap stitching the offline pattern needs.
The model is fully ANE-specialized: chunk size is fixed at 112 audio frames (1120 ms), there is no length input (the wrapper assumes the full chunk is valid), and every input/output shape is concrete so the Neural Engine pre-compiles one kernel and runs it without GPU/CPU fallback.
Quick start (Swift, Core ML)
```swift
import CoreML

let cfg = MLModelConfiguration()
cfg.computeUnits = .all
let stream = try await FastConformerQuranStreaming(configuration: cfg)

// Initialize empty caches (FIXED shapes, exactly as declared in the model spec).
var cacheLC = try MLMultiArray(shape: [1, 17, 70, 512], dataType: .float16) // (B, L, T_left, D)
var cacheLT = try MLMultiArray(shape: [1, 17, 512, 4], dataType: .float16)
var cacheLen = try MLMultiArray(shape: [1], dataType: .int32)
cacheLen[0] = 0
// zero(_:), computeLogMel(_:), ctcCollapse(_:), and sentencePieceDecode(_:model:)
// are helpers the caller supplies; they are not part of Core ML.
zero(cacheLC); zero(cacheLT)

// Chunk size is FIXED at 112 audio frames (1120 ms = 17920 samples at 16 kHz).
// Each model call must receive exactly this many samples; feed(_:) buffers the rest.
let CHUNK_FRAMES = 112
let CHUNK_SAMPLES = CHUNK_FRAMES * 160 // 17920

var buffer: [Float] = []

func feed(_ samples: [Float]) async throws -> String {
    buffer.append(contentsOf: samples)
    var emitted = ""
    while buffer.count >= CHUNK_SAMPLES {
        let chunk = Array(buffer.prefix(CHUNK_SAMPLES))
        buffer.removeFirst(CHUNK_SAMPLES)
        let features = computeLogMel(chunk) // (1, 80, 112), Float16
        // Propagate inference errors to the caller instead of crashing via try!.
        let out = try await stream.prediction(
            audio_signal: features,
            cache_last_channel: cacheLC,
            cache_last_time: cacheLT,
            cache_last_channel_len: cacheLen)
        // Carry the updated caches into the next chunk.
        cacheLC = out.cache_last_channel_next
        cacheLT = out.cache_last_time_next
        cacheLen = out.cache_last_channel_len_next
        emitted += sentencePieceDecode(ctcCollapse(out.logprobs), model: "tokenizer.model")
    }
    return emitted
}
```
Trade-off vs. smart chunked offline inference:
- Streaming: ~1 s rolling latency, one forward pass per chunk, no stitching logic, slightly higher WER vs offline at the [70, 13] preset.
- Smart chunked offline: same ~1 s latency window with the offline model, slightly higher per-chunk compute, no cache state to manage, lower WER.
- Pick streaming if you want a clean state-carrying pipeline. Pick offline chunked if you want maximum accuracy and don't mind a stitching helper.
Feature extraction
The model expects 80-channel log-mel features computed identically to NVIDIA's NeMo FilterbankFeatures default:
- 16 kHz sample rate
- 25 ms window (400 samples) with Hann
- 10 ms hop (160 samples)
- 512-point FFT
- 80 mel bins, mel_floor = 1e-5
- Per-utterance mean and variance normalization per channel
A pure-Swift implementation will be ~200 lines using Accelerate for the FFT. The exact Python reference is in tajweed/aligner.py on the main repo.
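
Of the steps above, the per-channel normalization is easy to get subtly wrong. The sketch below normalizes each mel channel to zero mean and unit variance over the frames it is given. The function name `normalizePerChannel`, the `[channel][frame]` layout, the unbiased (N - 1) variance, and the `epsilon` added to the standard deviation are all assumptions made for illustration; check them against the Python reference before relying on numerical parity.

```swift
/// Normalizes each channel (row) to zero mean and unit variance across frames.
/// Layout assumption: features[channel][frame].
func normalizePerChannel(_ features: [[Float]], epsilon: Float = 1e-5) -> [[Float]] {
    return features.map { (channel: [Float]) -> [Float] in
        // A single frame has no variance; return zeros rather than divide by zero.
        guard channel.count > 1 else { return channel.map { _ in 0 } }
        let n = Float(channel.count)
        let mean = channel.reduce(0, +) / n
        // Unbiased variance (divide by N - 1); an assumption, see the text above.
        let variance = channel.reduce(0) { $0 + ($1 - mean) * ($1 - mean) } / (n - 1)
        let std = variance.squareRoot() + epsilon
        return channel.map { ($0 - mean) / std }
    }
}

// Toy input: two channels, four frames each.
let demo: [[Float]] = [[1, 2, 3, 4], [10, 10, 10, 10]]
let normalized = normalizePerChannel(demo)
print(normalized)
```

A constant channel maps to all zeros here because the epsilon keeps the divisor non-zero.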
Tokenizer
tokenizer.model is a SentencePiece BPE model with 1,024 pieces plus 1 blank (id 1024). For iOS, use a SentencePiece Swift port or implement BPE decoding manually (~50 lines).
Decoding pipeline:
- Argmax over logprobs per frame to get a sequence of token IDs
- CTC collapse: merge consecutive identical IDs, then remove blanks (id 1024)
- SentencePiece decode to final Arabic text
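
The three steps above can be sketched in plain Swift. This is a greedy decoder under stated assumptions: logprobs are already unpacked into a `[frame][vocab]` array, and `ctcGreedyDecode`, `joinPieces`, `oneHot`, and the four-entry toy vocabulary are hypothetical names and data for illustration. The real model uses 1,024 pieces with blank id 1024.

```swift
import Foundation

/// Greedy CTC decode: per-frame argmax, merge repeats, then drop blanks.
func ctcGreedyDecode(_ logprobs: [[Float]], blankID: Int) -> [Int] {
    var tokens: [Int] = []
    var previous = -1
    for frame in logprobs {
        // Argmax over the vocabulary axis for this frame.
        guard let best = frame.indices.max(by: { frame[$0] < frame[$1] }) else { continue }
        // Emit only when the label changes and is not blank.
        if best != previous && best != blankID {
            tokens.append(best)
        }
        previous = best
    }
    return tokens
}

/// Joins SentencePiece pieces; U+2581 marks a word boundary.
func joinPieces(_ ids: [Int], vocabulary: [String]) -> String {
    let text = ids.map { vocabulary[$0] }.joined()
    return text.replacingOccurrences(of: "\u{2581}", with: " ")
        .trimmingCharacters(in: .whitespaces)
}

/// Builds one frame of toy logprobs that peaks at `index`.
func oneHot(_ index: Int, size: Int) -> [Float] {
    var row = [Float](repeating: -10, count: size)
    row[index] = 0
    return row
}

// Toy setup: three pieces plus a blank at id 3.
let toyVocabulary = ["\u{2581}a", "b", "\u{2581}c"]
let frames = [0, 0, 3, 1, 1, 3, 1, 2].map { oneHot($0, size: 4) }
let ids = ctcGreedyDecode(frames, blankID: 3)
let text = joinPieces(ids, vocabulary: toyVocabulary)
print(ids, text)
```

Merging repeats before dropping blanks matters: the blank between the two runs of id 1 is what keeps the doubled "b" in the output.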
License
Apache 2.0. Same license as the upstream Muno459/fastconformer-quran and NVIDIA FastConformer-Hybrid.
Citation
@misc{fastconformer-quran-coreml-streaming-2026,
title = {FastConformer-Quran CoreML (Streaming): on-device Quranic ASR for iOS},
author = {Anon},
year = {2026},
url = {https://huggingface.co/Muno459/fastconformer-quran-coreml-streaming},
}