# Varuna STT 🌊
Varuna STT is a 0.6B-parameter Hindi automatic speech recognition (ASR) model
fine-tuned from NVIDIA's `nemotron-speech-streaming-en-0.6b` base model
on a curated mix of Hindi speech corpora. It outputs natural-style Hindi
text — digits, ordinals (1st/3rd), Indian numbering (lakh/crore comma
placement), and Devanagari punctuation (।, ,, ?, !) — directly from the
acoustic signal, ready to drop into voicebot / IVR / transcription pipelines
without a separate ITN postprocessor.
- Architecture: Conformer encoder + RNN-T decoder (NeMo `EncDecRNNTBPEModel`)
- Parameters: 0.6 B
- Language: Hindi (`hi`)
- Sample rate: 16 kHz mono
- Output style: inverse-text-normalized (ITN) — digits, ordinals, punctuation
- License: SkunkWorks Modified MIT (see `LICENSE`)
## ⚡ Inference speed (NVIDIA H100 PCIe)

Measured on 20 sample clips from the Kathbath validation set (~5 s mean clip duration), batch size 1, `greedy_batch` RNN-T decoding:
| Metric | Value |
|---|---|
| RTFx | 25.13× |
| Mean per-clip latency | 208 ms |
| p50 latency | 175 ms |
| p90 latency | 362 ms |
(RTFx = audio_seconds / wall_seconds. 25× means 1 hour of audio is transcribed in ~2.4 minutes on a single H100.)
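The RTFx arithmetic in parentheses can be sanity-checked in a couple of lines (the 25.13× figure comes from the table above; the helper names are illustrative):

```python
def rtfx(audio_seconds: float, wall_seconds: float) -> float:
    """Real-time factor: seconds of audio processed per wall-clock second."""
    return audio_seconds / wall_seconds

def transcription_minutes(audio_minutes: float, rtfx_value: float) -> float:
    """Wall-clock minutes needed to transcribe `audio_minutes` of audio."""
    return audio_minutes / rtfx_value

# 1 hour of audio at 25.13x real time:
print(round(transcription_minutes(60, 25.13), 1))  # ~2.4 minutes
```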
## 📊 Benchmark — Vistaar-style normalized WER % / CER %
Evaluated on six held-out Hindi subsets from the `SkunkWorkLabs/hindi-asr-benchmark` dataset. References and hypotheses both pass through the same Vistaar-style normalizer (Bhogale et al., Interspeech 2023) plus digit/ordinal expansion, so all systems are compared in a style-neutral way.
### WER %
| Subset | n | Varuna STT | ElevenLabs Scribe v1 | Deepgram Nova-2 | Sarvam Saarika v2.5 |
|---|---|---|---|---|---|
| indictts | 98 | 9.75 🥇 | 13.20 | 15.41 | 14.71 |
| fleurs (test) | 417 | 17.29 | 11.93 | 21.22 | 15.74 |
| kathbath | 1,929 | 16.82 | 13.32 | 20.55 | 16.62 |
| kathbath_noisy | 1,929 | 19.06 | 13.16 | 21.98 | 17.75 |
| commonvoice | 1,727 | 24.16 | 17.02 | 28.34 | 19.32 |
| mucs | 3,897 | 24.60 | 10.97 | 20.54 | 12.72 |
### CER %
| Subset | Varuna STT | ElevenLabs Scribe v1 | Deepgram Nova-2 | Sarvam Saarika v2.5 |
|---|---|---|---|---|
| indictts | 2.75 🥇 | 4.16 | 8.53 | 6.51 |
| fleurs (test) | 7.20 | 5.68 | 16.74 | 7.08 |
| kathbath | 6.36 🥇 | 6.50 | 13.53 | 7.42 |
| kathbath_noisy | 8.00 | 5.87 | 14.75 | 7.82 |
| commonvoice | 10.72 | 8.96 | 20.25 | 9.87 |
| mucs | 10.75 | 3.94 | 9.94 | 4.79 |
Varuna leads on indictts (both metrics) and on kathbath CER. It has the most headroom on conversational and codec-degraded subsets (commonvoice, mucs).
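For reference, the WER/CER numbers above are standard Levenshtein-based metrics, computed after both sides pass the same normalizer. A minimal, dependency-free sketch of the computation:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edits / reference word count."""
    r, h = ref.split(), hyp.split()
    return edit_distance(r, h) / len(r)

def cer(ref: str, hyp: str) -> float:
    """Character error rate: character-level edits / reference char count."""
    r, h = list(ref.replace(" ", "")), list(hyp.replace(" ", ""))
    return edit_distance(r, h) / len(r)
```

One substitution in a three-word reference gives a WER of 33.3 %, which is how the percentages in the tables should be read.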
## 🚀 Usage

```python
from inference import VarunaSTT

model = VarunaSTT()  # auto-picks GPU if available
texts = model.transcribe(["clip1.wav", "clip2.wav"])  # 16 kHz mono
for t in texts:
    print(t)
```

CLI:

```bash
python inference.py --audio path/to/clip.wav
```
You'll need:

- `nemo_toolkit[asr]>=2.4`, `omegaconf`, `torch`, `soundfile`
- The base `nemotron-speech-streaming-en-0.6b.nemo` file (download separately from `nvidia/nemotron-speech-streaming-en-0.6b`)

Files in this repo:

- `varuna.ckpt` — fine-tuned weights
- `tokenizer.model`, `tokenizer.vocab`, `vocab.txt` — bilingual EN-1024 / HI-512 BPE tokenizer
- `inference.py` — minimal inference example
## 🛠 Training

Fine-tuned from NVIDIA `nemotron-speech-streaming-en-0.6b` using the NeMo ASR framework. Hindi training mix:
| Source | Approx. hours |
|---|---|
| Shrutilipi (Hindi) | ~1,500 |
| IndicVoices (Hindi) | ~1,000 |
| Kathbath (Hindi) | ~137 |
| IndicVoices-R | ~150 |
| Gramvaani | ~100 |
| Vaani | ~50 |
| Lahaja | ~30 |
| IndicTTS | ~30 |
| Short-form domain | ~20 |
All Hindi training labels were ITN-normalized (digits, ordinals, । and , punctuation, Indian-numbering commas) using a Gemma 4 normalization pass following NVIDIA Riva Hindi ITN conventions.
Bilingual EN-1024 + HI-512 BPE tokenizer (1,536 sub-word tokens total) shared across languages. RNN-T loss with SpecAugment, mixed-precision (bf16) training on NVIDIA H100s.
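SpecAugment, mentioned above, regularizes ASR training by zeroing random frequency bands and time spans of the input spectrogram. A toy, dependency-free sketch of the idea (mask counts and widths here are illustrative, not the actual training config):

```python
import random

def spec_augment(spec, n_freq_masks=2, freq_width=3,
                 n_time_masks=2, time_width=4, rng=None):
    """Zero out random frequency bands and time spans of a (time x freq)
    spectrogram, in the spirit of SpecAugment (Park et al., 2019).
    `spec` is a list of rows; the input is not modified."""
    rng = rng or random.Random(0)
    T, F = len(spec), len(spec[0])
    out = [row[:] for row in spec]
    for _ in range(n_freq_masks):          # frequency masks
        w = rng.randint(0, freq_width)
        f0 = rng.randint(0, max(0, F - w))
        for t in range(T):
            for f in range(f0, min(F, f0 + w)):
                out[t][f] = 0.0
    for _ in range(n_time_masks):          # time masks
        w = rng.randint(0, time_width)
        t0 = rng.randint(0, max(0, T - w))
        for t in range(t0, min(T, t0 + w)):
            for f in range(F):
                out[t][f] = 0.0
    return out
```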
## 📋 Output convention
Varuna emits ITN-style Hindi:
| Spoken | Output |
|---|---|
| पाँच सौ (five hundred) | 500 |
| दो लाख पचास हजार (two lakh fifty thousand) | 2,50,000 |
| तीन करोड़ (three crore) | 3,00,00,000 |
| पहला (first) | 1st |
| तीसरा (third) | 3rd |
| end of sentence | । |
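The lakh/crore comma placement above follows the standard Indian grouping rule: the last three digits form one group, then pairs of digits after that. A small standalone sketch of that rule (`indian_group` is an illustrative helper, not part of this repo):

```python
def indian_group(n: int) -> str:
    """Format an integer with Indian-numbering commas:
    last 3 digits, then groups of 2 (thousand, lakh, crore, ...)."""
    s = str(abs(n))
    if len(s) <= 3:
        out = s
    else:
        head, tail = s[:-3], s[-3:]
        parts = []
        while len(head) > 2:
            parts.append(head[-2:])
            head = head[:-2]
        parts.append(head)
        out = ",".join(reversed(parts)) + "," + tail
    return ("-" if n < 0 else "") + out

print(indian_group(250000))    # 2,50,000
print(indian_group(30000000))  # 3,00,00,000
```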
This is what voicebot / IVR / call-center products typically want. If your downstream consumer expects spelled-out Devanagari, post-process the model output with a reverse-ITN. We use a Vistaar-style normalizer at benchmark time (strip punctuation + IndicNormalizer NFC/NFD + digit/ordinal expansion) — see AI4Bharat/vistaar/evaluation.py for the reference implementation.
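As a rough illustration, a hypothetical minimal version of that benchmark-time normalization might look like the following. This is only the NFC and punctuation-stripping part; the digit/ordinal expansion and IndicNormalizer steps are omitted, and the real reference implementation is AI4Bharat/vistaar's `evaluation.py`:

```python
import unicodedata

# Punctuation to strip, including the Devanagari danda (।).
# This set is illustrative, not the exact Vistaar list.
PUNCT = "।,?!.\"'"

def normalize(text: str) -> str:
    """NFC-normalize, strip punctuation, collapse whitespace, lowercase."""
    text = unicodedata.normalize("NFC", text)
    text = "".join(ch for ch in text if ch not in PUNCT)
    return " ".join(text.split()).lower()
```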
## ⚠️ Limitations
- Code-switching not supported yet. Varuna is trained on monolingual Hindi audio. Inputs that mix English words mid-sentence (e.g., conversational Hindi-English) may produce transliteration artifacts or substitutions. A bilingual fine-tune is on the roadmap.
- Codec-degraded audio. Performance on telephony / heavily compressed audio (e.g., MUCS subset) is weaker than on studio-clean speech (CER 10.75 % vs 2.75 % on IndicTTS). Codec-augmentation training is planned.
- Audio format. Expects 16 kHz mono. Other sample rates need resampling upstream.
## 🔗 Links

- 📊 Benchmark dataset: `SkunkWorkLabs/hindi-asr-benchmark` — 6 Hindi subsets with embedded audio plus outputs from Varuna and 3 commercial systems.
- 🧪 Vistaar normalizer reference: AI4Bharat/vistaar
- 🛠 Base model: nvidia/nemotron-speech-streaming-en-0.6b
## 📬 Contact
Need help with the training recipe or want to fine-tune Varuna on your own data? Reach out: harshris2314@gmail.com.
## 📝 Citation

If you use Varuna STT in research or production, please cite:

```bibtex
@misc{skunkworks-varuna-stt-2026,
  title     = {Varuna STT: A Hindi ASR model fine-tuned from NVIDIA NeMo Nemotron},
  author    = {SkunkWorks Labs},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/SkunkWorkLabs/varuna-stt}
}
```