Shenava-Rizeh-0.9 / README.md
Reza2kn's picture
Upload README.md with huggingface_hub
bb3648e verified
|
Raw
History Blame Contribute Delete
4.46 kB
---
language:
- fa
license: cc-by-4.0
library_name: nemo
pipeline_tag: automatic-speech-recognition
tags:
- automatic-speech-recognition
- fastconformer
- persian
- fa
- streaming
- cache-aware
- on-device
- webgpu
- rnnt
base_model: nvidia/stt_en_fastconformer_hybrid_medium_streaming_80ms
metrics:
- wer
- cer
---
# Shenava-Rizeh 0.9 — Persian Cache-Aware Streaming ASR (32M)
A tiny (**32M-param**) **cache-aware, multi-latency streaming** Persian (Farsi) ASR model — FastConformer-Hybrid (RNNT + CTC), 16 kHz. Built for **fully offline, on-device, real-time** captioning (WebGPU / WASM / NeMo), part of the [VisualEars](https://huggingface.co/Reza2kn) project (SLT 2026).
**One model serves the entire latency–accuracy curve** (0 / 80 / 480 / 1040 ms) — pick your operating point at inference time, no re-training. Its larger sibling is [`Shenava-Koochik-0.9`](https://huggingface.co/Reza2kn/Shenava-Koochik-0.9) (114M).
## 📊 Results — Golden6669 (held-out gold Persian eval)
Evaluated on [`Reza2kn/visualears-golden-6669`](https://huggingface.co/datasets/Reza2kn/visualears-golden-6669) (6,669 clips, official Persian normalizer), RNNT head:
| `att_context_size` | **Latency** | **WER** | **CER** | WER_bf |
|---|---|---|---|---|
| `[70, 0]` | **0 ms** (real-time) | **11.08%** | **3.17%** | 9.68% |
| `[70, 1]` | 80 ms | 10.85% | 3.08% | 9.43% |
| `[70, 6]` | 480 ms | 10.56% | 2.93% | 9.14% |
| `[70, 13]` | 1040 ms | **10.46%** | **2.89%** | 9.03% |
**The curve is nearly flat — only 0.62 pp WER from 0 → 1040 ms.** You get near-best accuracy at **zero lookahead**: a 32M model doing **11.08% WER / 3.17% CER at true real-time**, fully on-device. For reference, the previous-generation single-latency `fa32M` scored 17.40% (and could not run below its trained latency).
### Flatness holds per-condition
The latency penalty is uniform across acoustic conditions — low-latency does **not** fray on the hard far-field/obstructed tail (Golden6669 is deliberately ~94% non-clean):
| Condition | n | 0 ms WER | 1040 ms WER | Δ |
|---|---|---|---|---|
| clean | 400 | 13.38 | 12.62 | +0.76 |
| obstructed | 4,335 | 10.87 | 10.20 | +0.67 |
| far-field | 1,934 | 11.24 | 10.77 | **+0.47** |
Far-field has the *smallest* 0→1040 ms gap. (`clean` scores worst here — a quirk of Golden6669's small 400-clip clean slice, not a streaming effect.)
`WER_bf` = boundary-forgiven WER (utterances perfect modulo Persian word-spacing conventions counted correct).
## How it was trained
- **Base:** `nvidia/stt_en_fastconformer_hybrid_medium_streaming_80ms` (English cache-aware streaming), persianized by swapping in a Persian **BPE-1024** tokenizer and reinitializing the decoder + joint (encoder kept).
- **Multi-latency:** `att_context_size = [[70,13],[70,6],[70,1],[70,0]]` (chunked_limited) — one checkpoint covers 0 / 80 / 480 / 1040 ms.
- **Phase A:** ~7,386 h / 3.66M clips of cleaned, teacher-pseudo-labeled Persian ([`Reza2kn/visualears-persian-pseudo-asr`](https://huggingface.co/datasets/Reza2kn/visualears-persian-pseudo-asr)).
- **Phase B:** gold-anchor fine-tune on 355 human-verified gold + active-learning corrections.
- Trajectory: random-init decoder **→ Phase A 17.94% → Phase B 10.46%** (@1040 ms).
## Usage (NeMo)
```python
from nemo.collections.asr.models import ASRModel
m = ASRModel.restore_from("shenava-rizeh-0.9.nemo").cuda().eval()
m.encoder.set_default_att_context_size([70, 0]) # 0 ms (real-time); or [70,13] for best WER
print(m.transcribe(["clip.wav"])[0].text)
```
`[70,0]`=0 ms · `[70,1]`=80 ms · `[70,6]`=480 ms · `[70,13]`=1040 ms (1 encoder frame = 80 ms, FastConformer subsampling 8).
## Notes
- **Version 0.9** — Phase B on 355 human-verified gold. **v1.0** follows a larger active-learning gold round (the 6K worst-disagreement clips now under review on Argilla).
- The CTC head is the deployment head for real-time/WebGPU; RNNT is the higher-accuracy offline/rescorer head.
- Larger streaming sibling: [`Shenava-Koochik-0.9`](https://huggingface.co/Reza2kn/Shenava-Koochik-0.9) (114M). Offline flagship: [`shenava-fa-fastconformer-115m`](https://huggingface.co/Reza2kn/shenava-fa-fastconformer-115m) (7.29%).
## Citation
```bibtex
@misc{shenava_rizeh_2026,
title = {Shenava-Rizeh: Persian Cache-Aware Streaming ASR (32M)},
author = {Sayar, Reza},
year = {2026},
howpublished = {Hugging Face},
url = {https://huggingface.co/Reza2kn/Shenava-Rizeh-0.9}
}
```