Upload README.md with huggingface_hub

bb3648e verified 6 days ago

4.46 kB

	---
	language:
	- fa
	license: cc-by-4.0
	library_name: nemo
	pipeline_tag: automatic-speech-recognition
	tags:
	- automatic-speech-recognition
	- fastconformer
	- persian
	- fa
	- streaming
	- cache-aware
	- on-device
	- webgpu
	- rnnt
	base_model: nvidia/stt_en_fastconformer_hybrid_medium_streaming_80ms
	metrics:
	- wer
	- cer
	---

	# Shenava-Rizeh 0.9 — Persian Cache-Aware Streaming ASR (32M)

	A tiny (32M-param) cache-aware, multi-latency streaming Persian (Farsi) ASR model — FastConformer-Hybrid (RNNT + CTC), 16 kHz. Built for fully offline, on-device, real-time captioning (WebGPU / WASM / NeMo), part of the [VisualEars](https://huggingface.co/Reza2kn) project (SLT 2026).

	One model serves the entire latency–accuracy curve (0 / 80 / 480 / 1040 ms) — pick your operating point at inference time, no re-training. Its larger sibling is [`Shenava-Koochik-0.9`](https://huggingface.co/Reza2kn/Shenava-Koochik-0.9) (114M).

	## 📊 Results — Golden6669 (held-out gold Persian eval)

	Evaluated on [`Reza2kn/visualears-golden-6669`](https://huggingface.co/datasets/Reza2kn/visualears-golden-6669) (6,669 clips, official Persian normalizer), RNNT head:

	\| `att_context_size` \| Latency \| WER \| CER \| WER_bf \|
	\|---\|---\|---\|---\|---\|
	\| `[70, 0]` \| 0 ms (real-time) \| 11.08% \| 3.17% \| 9.68% \|
	\| `[70, 1]` \| 80 ms \| 10.85% \| 3.08% \| 9.43% \|
	\| `[70, 6]` \| 480 ms \| 10.56% \| 2.93% \| 9.14% \|
	\| `[70, 13]` \| 1040 ms \| 10.46% \| 2.89% \| 9.03% \|

	The curve is nearly flat — only 0.62 pp WER from 0 → 1040 ms. You get near-best accuracy at zero lookahead: a 32M model doing 11.08% WER / 3.17% CER at true real-time, fully on-device. For reference, the previous-generation single-latency `fa32M` scored 17.40% (and could not run below its trained latency).

	### Flatness holds per-condition

	The latency penalty is uniform across acoustic conditions — low-latency does not fray on the hard far-field/obstructed tail (Golden6669 is deliberately ~94% non-clean):

	\| Condition \| n \| 0 ms WER \| 1040 ms WER \| Δ \|
	\|---\|---\|---\|---\|---\|
	\| clean \| 400 \| 13.38 \| 12.62 \| +0.76 \|
	\| obstructed \| 4,335 \| 10.87 \| 10.20 \| +0.67 \|
	\| far-field \| 1,934 \| 11.24 \| 10.77 \| +0.47 \|

	Far-field has the smallest 0→1040 ms gap. (`clean` scores worst here — a quirk of Golden6669's small 400-clip clean slice, not a streaming effect.)

	`WER_bf` = boundary-forgiven WER (utterances perfect modulo Persian word-spacing conventions counted correct).

	## How it was trained

	- Base: `nvidia/stt_en_fastconformer_hybrid_medium_streaming_80ms` (English cache-aware streaming), persianized by swapping in a Persian BPE-1024 tokenizer and reinitializing the decoder + joint (encoder kept).
	- Multi-latency: `att_context_size = [[70,13],[70,6],[70,1],[70,0]]` (chunked_limited) — one checkpoint covers 0 / 80 / 480 / 1040 ms.
	- Phase A: ~7,386 h / 3.66M clips of cleaned, teacher-pseudo-labeled Persian ([`Reza2kn/visualears-persian-pseudo-asr`](https://huggingface.co/datasets/Reza2kn/visualears-persian-pseudo-asr)).
	- Phase B: gold-anchor fine-tune on 355 human-verified gold + active-learning corrections.
	- Trajectory: random-init decoder → Phase A 17.94% → Phase B 10.46% (@1040 ms).

	## Usage (NeMo)

	```python
	from nemo.collections.asr.models import ASRModel
	m = ASRModel.restore_from("shenava-rizeh-0.9.nemo").cuda().eval()
	m.encoder.set_default_att_context_size([70, 0]) # 0 ms (real-time); or [70,13] for best WER
	print(m.transcribe(["clip.wav"])[0].text)
	```

	`[70,0]`=0 ms · `[70,1]`=80 ms · `[70,6]`=480 ms · `[70,13]`=1040 ms (1 encoder frame = 80 ms, FastConformer subsampling 8).

	## Notes
	- Version 0.9 — Phase B on 355 human-verified gold. v1.0 follows a larger active-learning gold round (the 6K worst-disagreement clips now under review on Argilla).
	- The CTC head is the deployment head for real-time/WebGPU; RNNT is the higher-accuracy offline/rescorer head.
	- Larger streaming sibling: [`Shenava-Koochik-0.9`](https://huggingface.co/Reza2kn/Shenava-Koochik-0.9) (114M). Offline flagship: [`shenava-fa-fastconformer-115m`](https://huggingface.co/Reza2kn/shenava-fa-fastconformer-115m) (7.29%).

	## Citation
	```bibtex
	@misc{shenava_rizeh_2026,
	title = {Shenava-Rizeh: Persian Cache-Aware Streaming ASR (32M)},
	author = {Sayar, Reza},
	year = {2026},
	howpublished = {Hugging Face},
	url = {https://huggingface.co/Reza2kn/Shenava-Rizeh-0.9}
	}
	```