FastConformer-Quran CoreML (Streaming)

iOS and macOS deployment of Muno459/fastconformer-quran, cache-aware streaming variant (~1 s lookahead). Native to Apple silicon, full fp16 precision (no integer quantization), optimized for the Neural Engine.

For the offline full-utterance variant, see Muno459/fastconformer-quran-coreml-offline.

Precision

fp16 throughout, no integer quantization. Apple's Neural Engine computes natively in fp16, so fp16 weights and activations are the accuracy-preserving choice on-device. No int8 or int4 is used anywhere in this release.

FP16 overflow fix

The vanilla FP16 export produced all-NaN logprobs on real audio. Root cause: RelPositionalEncoding multiplies the pre-encode output (peaks ~5,200) by xscale = √512 ≈ 22.6, producing values up to 117,924, about 1.8× the FP16 maximum of 65,504. Fix: pos_enc.forward is patched to apply clamp(x, -2400, 2400) before the xscale multiply. The clamp is baked into the traced graph as a MIL clip op and has no effect on transcription accuracy for real audio.
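The arithmetic behind the overflow can be checked with a few lines of Swift. This is only a numeric sketch using the figures quoted above; it works in Float rather than a real half-precision type, and the names fp16Max, clampLimit and scaled are illustrative, not part of the model code.

```swift
import Foundation

let fp16Max: Float = 65504             // largest finite half-precision value
let xscale = Float(512).squareRoot()   // √512 ≈ 22.63
let clampLimit: Float = 2400           // clamp bound quoted in this section

/// Multiplies x by xscale, optionally clamping x to ±clampLimit first.
func scaled(_ x: Float, clamped: Bool) -> Float {
    let v = clamped ? min(max(x, -clampLimit), clampLimit) : x
    return v * xscale
}

let peak: Float = 5200                 // approximate pre-encode peak quoted above
print(scaled(peak, clamped: false) > fp16Max)   // true: would overflow in FP16
print(scaled(peak, clamped: true) < fp16Max)    // true: stays representable
```

With the clamp in place the largest possible product is 2400 × √512 ≈ 54,306, which leaves headroom below 65,504.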

Models

| File | Purpose | Size | Inputs | Outputs |
|---|---|---|---|---|
| fastconformer-quran-streaming.mlpackage | Cache-aware streaming ASR + CTC head | 204 MB | audio_signal: (1, 80, 112), cache_last_channel: (1, 17, 70, 512), cache_last_time: (1, 17, 512, 4), cache_last_channel_len: (1,) | logprobs: (1, 12, 1025), encoder_output: (1, 512, 12), updated cache tensors |
| pronunciation-head.mlpackage | Pronunciation head v7 | 5 MB | encoder_feature: (N, 512), token_id: (N,) | prob_correct: (N,) |

Both packages are pure CTC end-to-end.

Streaming variant details

Built on the NVIDIA Arabic FastConformer base plus the Quran fine-tune, trained with chunked_limited attention. At the recommended preset, val_wer is 1.25% (training-subset metric); the offline equivalent measures 0.13% on the full validation set, so streaming carries a real but small accuracy penalty in exchange for state-carrying low-latency inference.

| Preset (att_context_size) | Lookahead | End-to-end latency | Status |
|---|---|---|---|
| [70, 13] | 13 frames | ~1040 ms (recommended) | Clean |
| [70, 6] | 6 frames | 480 ms | Mostly clean |
| [70, 1] | 1 frame | 80 ms | Degraded |
| [70, 0] | 0 frames | 0 ms | Degraded |
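The latency column is consistent with each lookahead frame covering 80 ms (13 frames give ~1040 ms, 1 frame gives 80 ms). A minimal sketch of that relationship follows, assuming the 80 ms figure inferred from the table; encoderFrameMs and lookaheadMs are illustrative names, not part of the model API.

```swift
// Assumption: one encoder frame spans 80 ms, as implied by the preset table.
let encoderFrameMs = 80

/// Lookahead latency in milliseconds for an att_context_size preset [left, right].
func lookaheadMs(preset: [Int]) -> Int {
    precondition(preset.count == 2, "preset must be [left, right]")
    return preset[1] * encoderFrameMs
}

for preset in [[70, 13], [70, 6], [70, 1], [70, 0]] {
    print(preset, lookaheadMs(preset: preset), "ms")
}
```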

Why the low-latency presets are degraded: each Conformer layer has a symmetric depthwise conv (kernel 31) that needs ~640 ms of future audio per layer regardless of attention lookahead. Fixing those presets would require retraining with causal convolutions. The [70, 13] preset has enough lookahead to absorb the conv leak and gives full-quality output.

Use [70, 13]. It carries state across chunks via the cache tensors, so a real ~1 s rolling transcript is possible without the chunked-overlap stitching the offline pattern needs.

The model is fully ANE-specialized: chunk size is fixed at 112 audio frames (1120 ms), there is no length input (the wrapper assumes the full chunk is valid), and every input/output shape is concrete so the Neural Engine pre-compiles one kernel and runs it without GPU/CPU fallback.

Quick start (Swift, Core ML)

import CoreML

let cfg = MLModelConfiguration()
cfg.computeUnits = .all
let stream = try await FastConformerQuranStreaming(configuration: cfg)

// Initialize empty caches (FIXED shapes, exactly as declared in the model spec)
var cacheLC = try MLMultiArray(shape: [1, 17, 70, 512], dataType: .float16)  // (B, L, T_left, D)
var cacheLT = try MLMultiArray(shape: [1, 17, 512, 4], dataType: .float16)
var cacheLen = try MLMultiArray(shape: [1], dataType: .int32)
cacheLen[0] = 0
zero(cacheLC); zero(cacheLT)   // zero(_:) is a caller-supplied helper that fills an MLMultiArray with 0

// Chunk size is FIXED at 112 audio frames (1120 ms = 17920 samples at 16 kHz).
// Caller must always feed exactly this many samples per call.
let CHUNK_FRAMES = 112
let CHUNK_SAMPLES = CHUNK_FRAMES * 160   // 17920
var buffer: [Float] = []

func feed(_ samples: [Float]) async throws -> String {
    buffer.append(contentsOf: samples)
    var emitted = ""
    while buffer.count >= CHUNK_SAMPLES {
        let chunk = Array(buffer.prefix(CHUNK_SAMPLES))
        buffer.removeFirst(CHUNK_SAMPLES)
        let features = computeLogMel(chunk)                            // (1, 80, 112), Float16
        // Propagate prediction errors to the caller instead of crashing via try!
        let out = try await stream.prediction(
            audio_signal: features,
            cache_last_channel: cacheLC,
            cache_last_time: cacheLT,
            cache_last_channel_len: cacheLen)
        cacheLC = out.cache_last_channel_next
        cacheLT = out.cache_last_time_next
        cacheLen = out.cache_last_channel_len_next
        emitted += sentencePieceDecode(ctcCollapse(out.logprobs), model: "tokenizer.model")
    }
    return emitted
}
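Because the model only accepts full 17920-sample chunks, the buffering logic can be isolated from Core ML and tested on its own. The sketch below is one possible way to do that; FixedChunker is a hypothetical helper, and zero-padding the final partial chunk in flush() is an assumption that this model card does not confirm.

```swift
/// Accumulates audio samples and hands back fixed-size chunks.
struct FixedChunker {
    let chunkSize: Int
    private var pending: [Float] = []

    init(chunkSize: Int) { self.chunkSize = chunkSize }

    /// Appends samples and returns every complete chunk now available.
    mutating func push(_ samples: [Float]) -> [[Float]] {
        pending.append(contentsOf: samples)
        var chunks: [[Float]] = []
        var start = 0
        while pending.count - start >= chunkSize {
            chunks.append(Array(pending[start..<start + chunkSize]))
            start += chunkSize
        }
        pending.removeFirst(start)   // drop only what was emitted
        return chunks
    }

    /// End of stream: returns the leftover samples padded with zeros, if any.
    /// (Assumption: trailing silence is acceptable input for the model.)
    mutating func flush() -> [Float]? {
        guard !pending.isEmpty else { return nil }
        let padded = pending + [Float](repeating: 0, count: chunkSize - pending.count)
        pending.removeAll()
        return padded
    }
}
```

A caller would create it once with FixedChunker(chunkSize: 17920), run one prediction per chunk returned by push, and call flush() when the recording stops.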

Trade-off vs. smart chunked offline inference:

  • Streaming: ~1 s rolling latency, one forward pass per chunk, no stitching logic, slightly higher WER vs offline at the [70, 13] preset.
  • Smart chunked offline: same ~1 s latency window with the offline model, slightly higher per-chunk compute, no cache state to manage, lower WER.
  • Pick streaming if you want a clean state-carrying pipeline. Pick offline chunked if you want maximum accuracy and don't mind a stitching helper.

Feature extraction

The model expects 80-channel log-mel features computed identically to NVIDIA's NeMo FilterbankFeatures default:

  • 16 kHz sample rate
  • 25 ms window (400 samples) with Hann
  • 10 ms hop (160 samples)
  • 512-point FFT
  • 80 mel bins, mel_floor = 1e-5
  • Per-utterance mean and variance normalization per channel

A pure-Swift implementation will be ~200 lines using Accelerate for the FFT. The exact Python reference is in tajweed/aligner.py on the main repo.
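The last step in the list, per-channel mean and variance normalization, is the easiest part to get subtly wrong. The sketch below shows one way to write it in plain Swift. The unbiased (N - 1) variance and the 1e-5 added to the standard deviation are assumptions about NeMo's per-feature normalization and should be checked against the Python reference; normalizePerChannel is an illustrative name.

```swift
import Foundation

/// Normalizes each mel channel to zero mean and unit variance.
/// `features` is laid out as [channel][frame], e.g. 80 rows of T frames.
func normalizePerChannel(_ features: [[Float]], eps: Float = 1e-5) -> [[Float]] {
    return features.map { (row: [Float]) -> [Float] in
        // A channel with fewer than two frames has no defined variance.
        guard row.count > 1 else { return row.map { _ in 0 } }
        let n = Float(row.count)
        let mean = row.reduce(0, +) / n
        // Assumption: unbiased estimator (divide by N - 1).
        let variance = row.reduce(0) { $0 + ($1 - mean) * ($1 - mean) } / (n - 1)
        let std = variance.squareRoot() + eps   // eps guards against constant channels
        return row.map { ($0 - mean) / std }
    }
}
```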

Tokenizer

tokenizer.model is a SentencePiece BPE model with 1,024 pieces plus 1 blank (id 1024). For iOS, use a SentencePiece Swift port or implement BPE decoding manually (~50 lines).
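For the manual route, the core of SentencePiece detokenization is joining the pieces and turning the U+2581 word-boundary marker into a space. The sketch below assumes the piece table has already been extracted from tokenizer.model into an array indexed by token ID (that extraction is not shown), and it ignores byte-fallback pieces; detokenize and vocab are illustrative names.

```swift
import Foundation

/// Converts token IDs to text using a piece table indexed by ID.
func detokenize(_ ids: [Int], vocab: [String]) -> String {
    let joined = ids
        .filter { vocab.indices.contains($0) }   // skips the blank id and any out-of-range id
        .map { vocab[$0] }
        .joined()
    // U+2581 marks the start of a word in SentencePiece pieces.
    let text = joined.replacingOccurrences(of: "\u{2581}", with: " ")
    // Drop the single leading space produced by the first word marker.
    return text.hasPrefix(" ") ? String(text.dropFirst()) : text
}
```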

Decoding pipeline:

  1. Argmax over logprobs per frame to get a sequence of token IDs
  2. CTC collapse: dedupe consecutive identical IDs, then remove blanks (id 1024)
  3. SentencePiece decode to final Arabic text
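Steps 1 and 2 of the pipeline above can be sketched in plain Swift as a greedy decoder. It assumes the logprobs have already been copied out of the MLMultiArray into a [frame][vocab] array (12 frames of 1025 values per chunk); greedyCTC is an illustrative name, not a function shipped with the model.

```swift
/// Greedy CTC decoding: per-frame argmax, merge repeats, drop blanks.
func greedyCTC(_ logprobs: [[Float]], blankID: Int = 1024) -> [Int] {
    var tokens: [Int] = []
    var previous = -1   // no frame seen yet
    for frame in logprobs {
        // Argmax over the vocabulary axis; skip empty frames.
        guard let best = frame.indices.max(by: { frame[$0] < frame[$1] }) else { continue }
        // Repeats are merged against the previous frame BEFORE blanks are
        // dropped, so "a, blank, a" still yields two separate tokens.
        if best != previous && best != blankID {
            tokens.append(best)
        }
        previous = best
    }
    return tokens
}
```

When decoding chunk by chunk, a token that straddles a chunk boundary could be emitted twice; carrying `previous` from one chunk to the next is one way to avoid that.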

License

Apache 2.0. Same license as the upstream Muno459/fastconformer-quran and NVIDIA FastConformer-Hybrid.

Citation

@misc{fastconformer-quran-coreml-streaming-2026,
  title  = {FastConformer-Quran CoreML (Streaming): on-device Quranic ASR for iOS},
  author = {Anon},
  year   = {2026},
  url    = {https://huggingface.co/Muno459/fastconformer-quran-coreml-streaming},
}