---
language:
- fa
license: cc-by-4.0
library_name: nemo
pipeline_tag: automatic-speech-recognition
tags:
- automatic-speech-recognition
- fastconformer
- persian
- fa
- streaming
- cache-aware
- on-device
- webgpu
- rnnt
base_model: nvidia/stt_en_fastconformer_hybrid_medium_streaming_80ms
metrics:
- wer
- cer
---

# Shenava-Rizeh 0.9: Persian Cache-Aware Streaming ASR (32M)

A tiny (**32M-param**) **cache-aware, multi-latency streaming** Persian (Farsi) ASR model: FastConformer-Hybrid (RNNT + CTC), 16 kHz. Built for **fully offline, on-device, real-time** captioning (WebGPU / WASM / NeMo) as part of the [VisualEars](https://huggingface.co/Reza2kn) project (SLT 2026).

**One model serves the entire latency–accuracy curve** (0 / 80 / 480 / 1040 ms): pick your operating point at inference time, with no retraining. Its larger sibling is [`Shenava-Koochik-0.9`](https://huggingface.co/Reza2kn/Shenava-Koochik-0.9) (114M).

## 📊 Results: Golden6669 (held-out gold Persian eval)

Evaluated on [`Reza2kn/visualears-golden-6669`](https://huggingface.co/datasets/Reza2kn/visualears-golden-6669) (6,669 clips, official Persian normalizer), RNNT head:

| `att_context_size` | **Latency** | **WER** | **CER** | WER_bf |
|---|---|---|---|---|
| `[70, 0]` | **0 ms** (real-time) | **11.08%** | **3.17%** | 9.68% |
| `[70, 1]` | 80 ms | 10.85% | 3.08% | 9.43% |
| `[70, 6]` | 480 ms | 10.56% | 2.93% | 9.14% |
| `[70, 13]` | 1040 ms | **10.46%** | **2.89%** | 9.03% |

**The curve is nearly flat: only 0.62 pp WER from 0 → 1040 ms.** You get near-best accuracy at **zero lookahead**: a 32M model reaching **11.08% WER / 3.17% CER at true real-time**, fully on-device. For reference, the previous-generation single-latency `fa32M` scored 17.40% (and could not run below its trained latency).

### Flatness holds per-condition

The latency penalty stays small across acoustic conditions (0.47 to 0.76 pp): low latency does **not** degrade on the hard far-field/obstructed tail (Golden6669 is deliberately ~94% non-clean):

| Condition | n | 0 ms WER (%) | 1040 ms WER (%) | Δ (pp) |
|---|---|---|---|---|
| clean | 400 | 13.38 | 12.62 | +0.76 |
| obstructed | 4,335 | 10.87 | 10.20 | +0.67 |
| far-field | 1,934 | 11.24 | 10.77 | **+0.47** |

Far-field has the *smallest* 0 → 1040 ms gap. (`clean` scores worst here, which is a quirk of Golden6669's small 400-clip clean slice, not a streaming effect.)
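
The per-condition figures can be re-derived from the table with a few lines of arithmetic. The snippet below is only a sanity check on the published numbers, not part of the evaluation pipeline; the variable names are illustrative.

```python
# (condition, clips, WER at 0 ms, WER at 1040 ms), copied from the table above
rows = [
    ("clean", 400, 13.38, 12.62),
    ("obstructed", 4335, 10.87, 10.20),
    ("far-field", 1934, 11.24, 10.77),
]

total = sum(n for _, n, _, _ in rows)
non_clean_share = sum(n for name, n, _, _ in rows if name != "clean") / total

# Latency penalty per condition, in percentage points
deltas = {name: round(w0 - w1040, 2) for name, _, w0, w1040 in rows}

print(total)                     # 6669
print(f"{non_clean_share:.1%}")  # 94.0%
print(deltas)                    # {'clean': 0.76, 'obstructed': 0.67, 'far-field': 0.47}
```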

`WER_bf` is boundary-forgiven WER: utterances whose only differences from the reference are Persian word-spacing conventions are counted as correct.
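
The exact forgiveness rule used by the evaluation is not spelled out here, so the following is a minimal sketch under one plausible assumption: an utterance is forgiven when reference and hypothesis are identical once spaces and zero-width non-joiners (U+200C) are removed. The function names (`edit_distance`, `strip_boundaries`, `wer`) are hypothetical and do not come from the project's scorer.

```python
ZWNJ = "\u200c"  # zero-width non-joiner, common at Persian word boundaries

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]

def strip_boundaries(text):
    """Drop spaces and ZWNJ so only the character sequence remains."""
    return text.replace(" ", "").replace(ZWNJ, "")

def wer(pairs, forgive_boundaries=False):
    """Corpus WER over (reference, hypothesis) pairs, as a fraction."""
    errors = words = 0
    for ref, hyp in pairs:
        ref_words = ref.split()
        words += len(ref_words)
        if forgive_boundaries and strip_boundaries(ref) == strip_boundaries(hyp):
            continue  # assumed rule: a spacing-only mismatch counts as correct
        errors += edit_distance(ref_words, hyp.split())
    return errors / words

pairs = [
    ("من می\u200cروم", "من می روم"),  # differs only in ZWNJ vs. space
    ("سلام دنیا", "سلام دنیا"),        # exact match
]
print(wer(pairs))                           # 0.5
print(wer(pairs, forgive_boundaries=True))  # 0.0
```

Under this assumed rule, `WER_bf` can never exceed plain WER, which matches the direction of the gap in the results table.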

## How it was trained

- **Base:** `nvidia/stt_en_fastconformer_hybrid_medium_streaming_80ms` (English cache-aware streaming), persianized by swapping in a Persian **BPE-1024** tokenizer and reinitializing the decoder + joint (encoder kept).
- **Multi-latency:** `att_context_size = [[70,13],[70,6],[70,1],[70,0]]` (chunked_limited), so one checkpoint covers 0 / 80 / 480 / 1040 ms.
- **Phase A:** ~7,386 h / 3.66M clips of cleaned, teacher-pseudo-labeled Persian ([`Reza2kn/visualears-persian-pseudo-asr`](https://huggingface.co/datasets/Reza2kn/visualears-persian-pseudo-asr)).
- **Phase B:** gold-anchor fine-tune on 355 human-verified gold + active-learning corrections.
- **Trajectory:** random-init decoder **→ Phase A 17.94% → Phase B 10.46%** (@1040 ms).
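
A single checkpoint can serve several latencies when training exposes the encoder to every context size in the list. The sketch below illustrates that idea by drawing one context per training step. It assumes uniform sampling, which this card does not state, and `sample_context` and its `probs` argument are hypothetical names rather than the actual training code.

```python
import random

# Context sizes from the list above: [left frames, right (lookahead) frames]
ATT_CONTEXT_SIZES = [[70, 13], [70, 6], [70, 1], [70, 0]]

def sample_context(rng, probs=None):
    """Pick the attention context for one training step.

    probs is a hypothetical per-context weight list; None means uniform.
    """
    return rng.choices(ATT_CONTEXT_SIZES, weights=probs, k=1)[0]

rng = random.Random(0)
counts = {tuple(c): 0 for c in ATT_CONTEXT_SIZES}
for _ in range(10_000):
    counts[tuple(sample_context(rng))] += 1
print(counts)  # each context is drawn roughly a quarter of the time
```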

## Usage (NeMo)

```python
from nemo.collections.asr.models import ASRModel

# Load the checkpoint from a local .nemo file and move it to the GPU for inference.
m = ASRModel.restore_from("shenava-rizeh-0.9.nemo").cuda().eval()
# Choose the operating point: [70, 0] = 0 ms (real-time); [70, 13] = 1040 ms, best WER.
m.encoder.set_default_att_context_size([70, 0])
print(m.transcribe(["clip.wav"])[0].text)
```

`[70,0]` = 0 ms · `[70,1]` = 80 ms · `[70,6]` = 480 ms · `[70,13]` = 1040 ms (1 encoder frame = 80 ms, FastConformer subsampling 8).
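
The mapping above is just the right-context frame count times 80 ms. A small helper makes it easy to pick a context from a latency budget; `lookahead_ms` and `context_for_budget` are illustrative names, not part of NeMo.

```python
FRAME_MS = 80  # one encoder frame, per the note above

def lookahead_ms(att_context_size):
    """Lookahead latency implied by a [left, right] context, in ms."""
    _left, right = att_context_size
    return right * FRAME_MS

def context_for_budget(budget_ms, contexts=((70, 0), (70, 1), (70, 6), (70, 13))):
    """Largest trained context whose lookahead fits the latency budget."""
    fitting = [c for c in contexts if lookahead_ms(c) <= budget_ms]
    return list(max(fitting, key=lookahead_ms))

print([lookahead_ms(c) for c in ([70, 0], [70, 1], [70, 6], [70, 13])])  # [0, 80, 480, 1040]
print(context_for_budget(500))  # [70, 6]
```

The returned list can be passed to `set_default_att_context_size` as in the usage snippet. A negative budget leaves no fitting context, so `max` raises `ValueError`.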

## Notes

- **Version 0.9:** Phase B on 355 human-verified gold. **v1.0** will follow a larger active-learning gold round (the 6K worst-disagreement clips are now under review on Argilla).
- The CTC head is the deployment head for real-time/WebGPU; RNNT is the higher-accuracy offline/rescorer head.
- Larger streaming sibling: [`Shenava-Koochik-0.9`](https://huggingface.co/Reza2kn/Shenava-Koochik-0.9) (114M). Offline flagship: [`shenava-fa-fastconformer-115m`](https://huggingface.co/Reza2kn/shenava-fa-fastconformer-115m) (7.29%).
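
For readers unfamiliar with the CTC head mentioned above, greedy CTC decoding reduces to a per-frame argmax followed by a collapse step, with no autoregressive decoder loop. The sketch below shows only that collapse step. It assumes the blank id sits right after the 1024-token BPE vocabulary, which is an assumption and should be checked against the exported model.

```python
def ctc_greedy_collapse(frame_ids, blank_id):
    """Collapse per-frame argmax ids: merge repeats, then drop blanks."""
    out, prev = [], None
    for t in frame_ids:
        if t != prev and t != blank_id:
            out.append(t)
        prev = t
    return out

BLANK = 1024  # assumed: blank placed after a 1024-token BPE vocabulary
print(ctc_greedy_collapse([5, 5, BLANK, 5, 7, 7, BLANK, BLANK, 9], BLANK))  # [5, 5, 7, 9]
```

A blank between two identical ids keeps both, which is how CTC represents a genuinely repeated token.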

## Citation

```bibtex
@misc{shenava_rizeh_2026,
  title        = {Shenava-Rizeh: Persian Cache-Aware Streaming ASR (32M)},
  author       = {Sayar, Reza},
  year         = {2026},
  howpublished = {Hugging Face},
  url          = {https://huggingface.co/Reza2kn/Shenava-Rizeh-0.9}
}
```