	---
	language:
	- hi
	license: other
	license_name: skunkworks-modified-mit
	license_link: LICENSE
	pretty_name: Varuna STT
	library_name: nemo
	tags:
	- automatic-speech-recognition
	- hindi
	- asr
	- speech
	- conformer
	- rnnt
	- nemo
	- varuna
	pipeline_tag: automatic-speech-recognition
	base_model: nvidia/nemotron-speech-streaming-en-0.6b
	metrics:
	- wer
	- cer
	model-index:
	- name: Varuna STT
	  results:
	  - task:
	      type: automatic-speech-recognition
	    dataset:
	      name: SkunkWorkLabs Hindi ASR Benchmark — kathbath
	      type: SkunkWorkLabs/hindi-asr-benchmark
	      config: kathbath
	      split: eval
	    metrics:
	    - type: wer
	      value: 16.82
	    - type: cer
	      value: 6.36
	  - task:
	      type: automatic-speech-recognition
	    dataset:
	      name: SkunkWorkLabs Hindi ASR Benchmark — kathbath_noisy
	      type: SkunkWorkLabs/hindi-asr-benchmark
	      config: kathbath_noisy
	      split: eval
	    metrics:
	    - type: wer
	      value: 19.06
	    - type: cer
	      value: 8.00
	  - task:
	      type: automatic-speech-recognition
	    dataset:
	      name: SkunkWorkLabs Hindi ASR Benchmark — commonvoice
	      type: SkunkWorkLabs/hindi-asr-benchmark
	      config: commonvoice
	      split: eval
	    metrics:
	    - type: wer
	      value: 24.16
	    - type: cer
	      value: 10.72
	  - task:
	      type: automatic-speech-recognition
	    dataset:
	      name: SkunkWorkLabs Hindi ASR Benchmark — fleurs
	      type: SkunkWorkLabs/hindi-asr-benchmark
	      config: fleurs
	      split: eval
	    metrics:
	    - type: wer
	      value: 17.29
	    - type: cer
	      value: 7.20
	  - task:
	      type: automatic-speech-recognition
	    dataset:
	      name: SkunkWorkLabs Hindi ASR Benchmark — indictts
	      type: SkunkWorkLabs/hindi-asr-benchmark
	      config: indictts
	      split: eval
	    metrics:
	    - type: wer
	      value: 9.75
	    - type: cer
	      value: 2.75
	  - task:
	      type: automatic-speech-recognition
	    dataset:
	      name: SkunkWorkLabs Hindi ASR Benchmark — mucs
	      type: SkunkWorkLabs/hindi-asr-benchmark
	      config: mucs
	      split: eval
	    metrics:
	    - type: wer
	      value: 24.60
	    - type: cer
	      value: 10.75
	---

	# Varuna STT 🌊

	Varuna STT is a 0.6B-parameter Hindi automatic speech recognition (ASR) model
	fine-tuned from NVIDIA's [`nemotron-speech-streaming-en-0.6b`](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b)
	base model on a curated mix of Hindi speech corpora. It outputs natural-style
	Hindi text directly from the acoustic signal: digits, ordinals (`1st`/`3rd`),
	Indian numbering (lakh/crore comma placement), and punctuation (the danda `।`
	plus `,`, `?`, `!`). The output is ready to drop into voicebot / IVR /
	transcription pipelines without a separate inverse text normalization (ITN)
	postprocessor.

	- Architecture: Conformer encoder + RNN-T decoder (NeMo `EncDecRNNTBPEModel`)
	- Parameters: 0.6 B
	- Language: Hindi (`hi`)
	- Sample rate: 16 kHz mono
	- Output style: Inverse-Text-Normalized (ITN) — digits, ordinals, punctuation
	- License: SkunkWorks Modified MIT (see `LICENSE`)

	## ⚡ Inference speed (NVIDIA H100 PCIe)

	Measured on 20 sample clips from the Kathbath validation split (~5 s mean clip duration), batch size 1, `greedy_batch` RNN-T decoding:

	| Metric | Value |
	|---|---|
	| RTFx | 25.13× |
	| Mean per-clip latency | 208 ms |
	| p50 latency | 175 ms |
	| p90 latency | 362 ms |

	(RTFx = audio_seconds / wall_seconds. 25× means 1 hour of audio is transcribed in ~2.4 minutes on a single H100.)
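
The metrics in the table above can be reproduced from per-clip timings along the following lines. This is a minimal sketch: the timing values are made up for illustration (they are not the measurements behind this card), and the nearest-rank percentile is only one of several common definitions, so it may differ slightly from how the reported p50/p90 were computed.

```python
import math


def rtfx(audio_seconds: float, wall_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio processed per wall-clock second."""
    return audio_seconds / wall_seconds


def minutes_per_audio_hour(rtfx_value: float) -> float:
    """Wall-clock minutes needed to transcribe one hour of audio."""
    return 60.0 / rtfx_value


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile (an assumed definition, see the note above)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical per-clip numbers, for illustration only.
clip_durations_s = [4.0, 5.0, 6.0, 5.0]
clip_latencies_s = [0.16, 0.20, 0.24, 0.20]

print("RTFx:", rtfx(sum(clip_durations_s), sum(clip_latencies_s)))
print("p50 latency (s):", percentile(clip_latencies_s, 50))
print("p90 latency (s):", percentile(clip_latencies_s, 90))
print("minutes per audio hour at 25.13x:", minutes_per_audio_hour(25.13))
```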

	## 📊 Benchmark — Vistaar-style normalized WER % / CER %

	Evaluated on six Hindi held-out subsets from the
	[`SkunkWorkLabs/hindi-asr-benchmark`](https://huggingface.co/datasets/SkunkWorkLabs/hindi-asr-benchmark) dataset.
	References and hypotheses both pass through the same Vistaar-style normalizer
	([Bhogale et al., Interspeech 2023](https://www.isca-archive.org/interspeech_2023/bhogale23_interspeech.pdf))
	plus digit / ordinal expansion, so all systems are compared in a style-neutral way.
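
As a rough illustration of the scoring described above, the sketch below computes per-utterance WER and CER with a Levenshtein edit distance. The `normalize` function is a deliberately simplified stand-in (NFC plus punctuation removal) and is an assumption, not the Vistaar normalizer: it does not expand digits or ordinals. The card also does not say how scores are aggregated across utterances, so corpus-level numbers may be computed differently.

```python
import string
import unicodedata

# Simplified stand-in for the benchmark normalizer (assumption, not Vistaar):
# NFC normalization, then drop ASCII punctuation and the Devanagari dandas.
_PUNCT = set(string.punctuation) | {"।", "॥"}


def normalize(text: str) -> str:
    text = unicodedata.normalize("NFC", text)
    text = "".join(ch for ch in text if ch not in _PUNCT)
    return " ".join(text.split())


def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance between two sequences (rolling-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def _error_rate(ref_units: list, hyp_units: list) -> float:
    if not ref_units:
        raise ValueError("reference is empty after normalization")
    return 100.0 * edit_distance(ref_units, hyp_units) / len(ref_units)


def wer(ref: str, hyp: str) -> float:
    """Word error rate in percent for one utterance."""
    return _error_rate(normalize(ref).split(), normalize(hyp).split())


def cer(ref: str, hyp: str) -> float:
    """Character error rate in percent, counted over Unicode code points."""
    return _error_rate(list(normalize(ref)), list(normalize(hyp)))


print(wer("यह एक परीक्षण है।", "यह एक परीक्षा है"))
```

Because both strings go through the same `normalize` call, a hypothesis that only differs from the reference in punctuation scores 0 % under this sketch.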

	### WER %

	| Subset | n | Varuna STT | ElevenLabs Scribe v1 | Deepgram Nova-2 | Sarvam Saarika v2.5 |
	|---|---|---|---|---|---|
	| indictts | 98 | 9.75 🥇 | 13.20 | 15.41 | 14.71 |
	| fleurs (test) | 417 | 17.29 | 11.93 | 21.22 | 15.74 |
	| kathbath | 1,929 | 16.82 | 13.32 | 20.55 | 16.62 |
	| kathbath_noisy | 1,929 | 19.06 | 13.16 | 21.98 | 17.75 |
	| commonvoice | 1,727 | 24.16 | 17.02 | 28.34 | 19.32 |
	| mucs | 3,897 | 24.60 | 10.97 | 20.54 | 12.72 |

	### CER %

	| Subset | Varuna STT | ElevenLabs Scribe v1 | Deepgram Nova-2 | Sarvam Saarika v2.5 |
	|---|---|---|---|---|
	| indictts | 2.75 🥇 | 4.16 | 8.53 | 6.51 |
	| fleurs (test) | 7.20 | 5.68 | 16.74 | 7.08 |
	| kathbath | 6.36 🥇 | 6.50 | 13.53 | 7.42 |
	| kathbath_noisy | 8.00 | 5.87 | 14.75 | 7.82 |
	| commonvoice | 10.72 | 8.96 | 20.25 | 9.87 |
	| mucs | 10.75 | 3.94 | 9.94 | 4.79 |

	Varuna leads on `indictts` (both metrics) and narrowly leads on `kathbath` CER (6.36 vs 6.50 for ElevenLabs Scribe v1). Its largest WER gaps to the best system are on the conversational / codec-degraded subsets (`commonvoice`, `mucs`).

	## 🚀 Usage

	```python
	from inference import VarunaSTT

	model = VarunaSTT()  # auto-picks GPU if available
	texts = model.transcribe(["clip1.wav", "clip2.wav"])  # expects 16 kHz mono audio
	for text in texts:
	    print(text)
	```

	CLI:
	```bash
	python inference.py --audio path/to/clip.wav
	```
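
For a whole folder of recordings, a small batching helper along these lines may be convenient. It is only a sketch: `transcribe_dir` and `batched` are hypothetical helpers, not part of this repo, and the only assumption about the model is the one visible in the snippet above, namely that `transcribe` takes a list of paths and returns one string per path in the same order.

```python
from pathlib import Path
from typing import Callable, Iterator


def batched(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def transcribe_dir(
    transcribe: Callable[[list[str]], list[str]],
    audio_dir: str,
    batch_size: int = 8,
) -> dict[str, str]:
    """Transcribe every .wav file in audio_dir and map path -> text."""
    paths = sorted(str(p) for p in Path(audio_dir).glob("*.wav"))
    results: dict[str, str] = {}
    for batch in batched(paths, batch_size):
        # Assumes one output string per input path, in input order.
        for path, text in zip(batch, transcribe(batch)):
            results[path] = text
    return results


# Usage with the class from the snippet above (not executed here):
# model = VarunaSTT()
# transcripts = transcribe_dir(model.transcribe, "recordings/")
```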

	You'll need:
	- `nemo_toolkit[asr]>=2.4`
	- `omegaconf`, `torch`, `soundfile`
	- The base `nemotron-speech-streaming-en-0.6b.nemo` file (download separately
	from [`nvidia/nemotron-speech-streaming-en-0.6b`](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b))

	Files in this repo:
	- `varuna.ckpt` — fine-tuned weights
	- `tokenizer.model`, `tokenizer.vocab`, `vocab.txt` — bilingual EN-1024 / HI-512 BPE tokenizer
	- `inference.py` — minimal inference example

	## 🛠 Training

	Fine-tuned from NVIDIA `nemotron-speech-streaming-en-0.6b` using the NeMo
	ASR framework. Hindi training mix:

	| Source | Approx. hours |
	|---|---|
	| Shrutilipi (Hindi) | ~1,500 |
	| IndicVoices (Hindi) | ~1,000 |
	| Kathbath (Hindi) | ~137 |
	| IndicVoices-R | ~150 |
	| Gramvaani | ~100 |
	| Vaani | ~50 |
	| Lahaja | ~30 |
	| IndicTTS | ~30 |
	| Short-form domain | ~20 |

	All Hindi training labels were converted to ITN style (digits, ordinals, `।`/`,`
	punctuation, Indian-numbering commas) using a Gemma 4 normalization pass following
	NVIDIA Riva Hindi ITN conventions.

	The model uses a bilingual EN-1024 + HI-512 BPE tokenizer (1,536 sub-word tokens
	total) shared across languages. Training used RNN-T loss with SpecAugment and
	mixed-precision (bf16) on NVIDIA H100s.

	## 📋 Output convention

	Varuna emits ITN-style Hindi:

	| Spoken | Output |
	|---|---|
	| `पाँच सौ` (five hundred) | `500` |
	| `दो लाख पचास हजार` (two lakh fifty thousand) | `2,50,000` |
	| `तीन करोड़` (three crore) | `3,00,00,000` |
	| `पहला` (first) | `1st` |
	| `तीसरा` (third) | `3rd` |
	| End of sentence | `।` |

	This is what voicebot / IVR / call-center products typically want. If your
	downstream consumer expects spelled-out Devanagari, post-process the model
	output with a reverse-ITN step. At benchmark time we use a Vistaar-style
	normalizer (strip punctuation + IndicNormalizer NFC/NFD + digit/ordinal
	expansion); see
	[AI4Bharat/vistaar/evaluation.py](https://github.com/AI4Bharat/vistaar/blob/master/evaluation.py)
	for the reference implementation.
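
The Indian-numbering comma placement shown in the table above (last three digits grouped together, then groups of two) can be sketched as follows. `format_indian` is a hypothetical helper written for illustration, for example to build expected strings when testing a downstream parser. It is not part of the model or its training pipeline.

```python
def format_indian(n: int) -> str:
    """Format an integer with Indian digit grouping (thousand, lakh, crore)."""
    sign = "-" if n < 0 else ""
    digits = str(abs(n))
    if len(digits) <= 3:
        return sign + digits
    # The rightmost group has three digits; every group to its left has two.
    head, tail = digits[:-3], digits[-3:]
    groups = []
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    groups.insert(0, head)
    return sign + ",".join(groups + [tail])


print(format_indian(500))
print(format_indian(250000))
print(format_indian(30000000))
```

The three calls correspond to the `पाँच सौ`, `दो लाख पचास हजार`, and `तीन करोड़` rows of the table.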

	## ⚠️ Limitations

	- Code-switching not supported yet. Varuna is trained on monolingual Hindi
	audio. Inputs that mix English words mid-sentence (e.g., conversational
	Hindi-English) may produce transliteration artifacts or substitutions. A
	bilingual fine-tune is on the roadmap.
	- Codec-degraded audio. Performance on telephony / heavily compressed audio
	(e.g., MUCS subset) is weaker than on studio-clean speech (CER 10.75 % vs
	2.75 % on IndicTTS). Codec-augmentation training is planned.
	- Audio format. Expects 16 kHz mono. Other sample rates need resampling
	upstream.
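
To catch format mismatches before they reach the model, a pre-flight check like the one below may help. It is a sketch using only Python's standard `wave` module, which reads PCM WAV files only, so other containers or float-encoded WAVs need a different reader. The helper names are hypothetical, and the actual resampling is left to an audio library of your choice.

```python
import wave

TARGET_RATE_HZ = 16_000  # sample rate the model expects
TARGET_CHANNELS = 1      # mono


def wav_format(path: str) -> tuple[int, int]:
    """Return (sample_rate_hz, channel_count) read from a PCM WAV header."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate(), wav.getnchannels()


def needs_conversion(path: str) -> bool:
    """True if the file is not already 16 kHz mono."""
    rate, channels = wav_format(path)
    return rate != TARGET_RATE_HZ or channels != TARGET_CHANNELS
```

Files for which `needs_conversion` returns `True` should be resampled and downmixed before being passed to the model.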

	## 🔗 Links

	- 📊 Benchmark dataset: [`SkunkWorkLabs/hindi-asr-benchmark`](https://huggingface.co/datasets/SkunkWorkLabs/hindi-asr-benchmark) — 6 Hindi subsets with embedded audio + outputs from Varuna and 3 commercial systems.
	- 🧪 Vistaar normalizer reference: [AI4Bharat/vistaar](https://github.com/AI4Bharat/vistaar)
	- 🛠 Base model: [nvidia/nemotron-speech-streaming-en-0.6b](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b)

	## 📬 Contact

	Need help with the training recipe or want to fine-tune Varuna on
	your own data? Reach out: harshris2314@gmail.com.

	## 📝 Citation

	If you use Varuna STT in research or production, please cite:

	```bibtex
	@misc{skunkworks-varuna-stt-2026,
	title = {Varuna STT: A Hindi ASR model fine-tuned from NVIDIA NeMo Nemotron},
	author = {SkunkWorks Labs},
	year = {2026},
	publisher = {Hugging Face},
	url = {https://huggingface.co/SkunkWorkLabs/varuna-stt}
	}
	```