--- language: - fa license: cc-by-4.0 library_name: nemo pipeline_tag: automatic-speech-recognition tags: - automatic-speech-recognition - fastconformer - persian - fa - streaming - cache-aware - on-device - webgpu - rnnt base_model: nvidia/stt_en_fastconformer_hybrid_medium_streaming_80ms metrics: - wer - cer --- # Shenava-Rizeh 0.9 — Persian Cache-Aware Streaming ASR (32M) A tiny (**32M-param**) **cache-aware, multi-latency streaming** Persian (Farsi) ASR model — FastConformer-Hybrid (RNNT + CTC), 16 kHz. Built for **fully offline, on-device, real-time** captioning (WebGPU / WASM / NeMo), part of the [VisualEars](https://huggingface.co/Reza2kn) project (SLT 2026). **One model serves the entire latency–accuracy curve** (0 / 80 / 480 / 1040 ms) — pick your operating point at inference time, no re-training. Its larger sibling is [`Shenava-Koochik-0.9`](https://huggingface.co/Reza2kn/Shenava-Koochik-0.9) (114M). ## 📊 Results — Golden6669 (held-out gold Persian eval) Evaluated on [`Reza2kn/visualears-golden-6669`](https://huggingface.co/datasets/Reza2kn/visualears-golden-6669) (6,669 clips, official Persian normalizer), RNNT head: | `att_context_size` | **Latency** | **WER** | **CER** | WER_bf | |---|---|---|---|---| | `[70, 0]` | **0 ms** (real-time) | **11.08%** | **3.17%** | 9.68% | | `[70, 1]` | 80 ms | 10.85% | 3.08% | 9.43% | | `[70, 6]` | 480 ms | 10.56% | 2.93% | 9.14% | | `[70, 13]` | 1040 ms | **10.46%** | **2.89%** | 9.03% | **The curve is nearly flat — only 0.62 pp WER from 0 → 1040 ms.** You get near-best accuracy at **zero lookahead**: a 32M model doing **11.08% WER / 3.17% CER at true real-time**, fully on-device. For reference, the previous-generation single-latency `fa32M` scored 17.40% (and could not run below its trained latency). ### Flatness holds per-condition The latency penalty is uniform across acoustic conditions — low-latency does **not** fray on the hard far-field/obstructed tail (Golden6669 is deliberately ~94% non-clean): | Condition | n | 0 ms WER | 1040 ms WER | Δ | |---|---|---|---|---| | clean | 400 | 13.38 | 12.62 | +0.76 | | obstructed | 4,335 | 10.87 | 10.20 | +0.67 | | far-field | 1,934 | 11.24 | 10.77 | **+0.47** | Far-field has the *smallest* 0→1040 ms gap. (`clean` scores worst here — a quirk of Golden6669's small 400-clip clean slice, not a streaming effect.) `WER_bf` = boundary-forgiven WER (utterances perfect modulo Persian word-spacing conventions counted correct). ## How it was trained - **Base:** `nvidia/stt_en_fastconformer_hybrid_medium_streaming_80ms` (English cache-aware streaming), persianized by swapping in a Persian **BPE-1024** tokenizer and reinitializing the decoder + joint (encoder kept). - **Multi-latency:** `att_context_size = [[70,13],[70,6],[70,1],[70,0]]` (chunked_limited) — one checkpoint covers 0 / 80 / 480 / 1040 ms. - **Phase A:** ~7,386 h / 3.66M clips of cleaned, teacher-pseudo-labeled Persian ([`Reza2kn/visualears-persian-pseudo-asr`](https://huggingface.co/datasets/Reza2kn/visualears-persian-pseudo-asr)). - **Phase B:** gold-anchor fine-tune on 355 human-verified gold + active-learning corrections. - Trajectory: random-init decoder **→ Phase A 17.94% → Phase B 10.46%** (@1040 ms). ## Usage (NeMo) ```python from nemo.collections.asr.models import ASRModel m = ASRModel.restore_from("shenava-rizeh-0.9.nemo").cuda().eval() m.encoder.set_default_att_context_size([70, 0]) # 0 ms (real-time); or [70,13] for best WER print(m.transcribe(["clip.wav"])[0].text) ``` `[70,0]`=0 ms · `[70,1]`=80 ms · `[70,6]`=480 ms · `[70,13]`=1040 ms (1 encoder frame = 80 ms, FastConformer subsampling 8). ## Notes - **Version 0.9** — Phase B on 355 human-verified gold. **v1.0** follows a larger active-learning gold round (the 6K worst-disagreement clips now under review on Argilla). - The CTC head is the deployment head for real-time/WebGPU; RNNT is the higher-accuracy offline/rescorer head. - Larger streaming sibling: [`Shenava-Koochik-0.9`](https://huggingface.co/Reza2kn/Shenava-Koochik-0.9) (114M). Offline flagship: [`shenava-fa-fastconformer-115m`](https://huggingface.co/Reza2kn/shenava-fa-fastconformer-115m) (7.29%). ## Citation ```bibtex @misc{shenava_rizeh_2026, title = {Shenava-Rizeh: Persian Cache-Aware Streaming ASR (32M)}, author = {Sayar, Reza}, year = {2026}, howpublished = {Hugging Face}, url = {https://huggingface.co/Reza2kn/Shenava-Rizeh-0.9} } ```