# Varuna STT 🌊
Varuna STT is a 0.6B-parameter Hindi automatic speech recognition (ASR) model
fine-tuned from NVIDIA's `nemotron-speech-streaming-en-0.6b` base model
on a curated mix of Hindi speech corpora. It outputs natural-style Hindi
text — digits, ordinals (1st/3rd), Indian numbering (lakh/crore comma
placement), and Devanagari punctuation (।, ,, ?, !) — directly from the
acoustic signal, ready to drop into voicebot / IVR / transcription pipelines
without a separate ITN postprocessor.
- Architecture: Conformer encoder + RNN-T decoder (NeMo `EncDecRNNTBPEModel`)
- Parameters: 0.6 B
- Language: Hindi (`hi`)
- Sample rate: 16 kHz mono
- Output style: inverse-text-normalized (ITN) — digits, ordinals, punctuation
- License: SkunkWorks Modified MIT (see `LICENSE`)
## ⚡ Inference speed (NVIDIA H100 PCIe)

Measured on 20 sample clips from the Kathbath validation set (~5 s mean clip duration), batch size 1, `greedy_batch` RNN-T decoding:
| Metric | Value |
|---|---|
| RTFx | 25.13× |
| Mean per-clip latency | 208 ms |
| p50 latency | 175 ms |
| p90 latency | 362 ms |
(RTFx = audio_seconds / wall_seconds. 25× means 1 hour of audio is transcribed in ~2.4 minutes on a single H100.)
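The RTFx arithmetic in parentheses can be sanity-checked in a couple of lines (the 25.13× figure comes from the table above; the helper names are illustrative):

```python
def rtfx(audio_seconds: float, wall_seconds: float) -> float:
    """Real-time factor: seconds of audio processed per wall-clock second."""
    return audio_seconds / wall_seconds

def transcription_minutes(audio_minutes: float, rtfx_value: float) -> float:
    """Wall-clock minutes needed to transcribe `audio_minutes` of audio."""
    return audio_minutes / rtfx_value

# 1 hour of audio at 25.13x real time:
print(round(transcription_minutes(60, 25.13), 1))  # ~2.4 minutes
```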
## 📊 Benchmark — Vistaar-style normalized WER % / CER %
Evaluated on six held-out Hindi subsets from the `SkunkWorkLabs/hindi-asr-benchmark` dataset. References and hypotheses both pass through the same Vistaar-style normalizer (Bhogale et al., Interspeech 2023) plus digit/ordinal expansion, so all systems are compared in a style-neutral way.
### WER %
| Subset | n | Varuna STT | ElevenLabs Scribe v1 | Deepgram Nova-2 | Sarvam Saarika v2.5 |
|---|---|---|---|---|---|
| indictts | 98 | 9.75 🥇 | 13.20 | 15.41 | 14.71 |
| fleurs (test) | 417 | 17.29 | 11.93 | 21.22 | 15.74 |
| kathbath | 1,929 | 16.82 | 13.32 | 20.55 | 16.62 |
| kathbath_noisy | 1,929 | 19.06 | 13.16 | 21.98 | 17.75 |
| commonvoice | 1,727 | 24.16 | 17.02 | 28.34 | 19.32 |
| mucs | 3,897 | 24.60 | 10.97 | 20.54 | 12.72 |
### CER %
| Subset | Varuna STT | ElevenLabs Scribe v1 | Deepgram Nova-2 | Sarvam Saarika v2.5 |
|---|---|---|---|---|
| indictts | 2.75 🥇 | 4.16 | 8.53 | 6.51 |
| fleurs (test) | 7.20 | 5.68 | 16.74 | 7.08 |
| kathbath | 6.36 🥇 | 6.50 | 13.53 | 7.42 |
| kathbath_noisy | 8.00 | 5.87 | 14.75 | 7.82 |
| commonvoice | 10.72 | 8.96 | 20.25 | 9.87 |
| mucs | 10.75 | 3.94 | 9.94 | 4.79 |
Varuna leads on indictts (both metrics) and on kathbath CER. It has the most headroom on conversational and codec-degraded subsets (commonvoice, mucs).
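For reference, the WER/CER numbers above are standard Levenshtein-based metrics, computed after both sides pass the same normalizer. A minimal, dependency-free sketch of the computation:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edits / reference word count."""
    r, h = ref.split(), hyp.split()
    return edit_distance(r, h) / len(r)

def cer(ref: str, hyp: str) -> float:
    """Character error rate: character-level edits / reference char count."""
    r, h = list(ref.replace(" ", "")), list(hyp.replace(" ", ""))
    return edit_distance(r, h) / len(r)
```

One substitution in a three-word reference gives a WER of 33.3 %, which is how the percentages in the tables should be read.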
## 🚀 Usage

```python
from inference import VarunaSTT

model = VarunaSTT()  # auto-picks GPU if available
texts = model.transcribe(["clip1.wav", "clip2.wav"])  # 16 kHz mono
for t in texts:
    print(t)
```

CLI:

```bash
python inference.py --audio path/to/clip.wav
```
You'll need:

- `nemo_toolkit[asr]>=2.4`, `omegaconf`, `torch`, `soundfile`
- The base `nemotron-speech-streaming-en-0.6b.nemo` file (download separately from `nvidia/nemotron-speech-streaming-en-0.6b`)

Files in this repo:

- `varuna.ckpt` — fine-tuned weights
- `tokenizer.model`, `tokenizer.vocab`, `vocab.txt` — bilingual EN-1024 / HI-512 BPE tokenizer
- `inference.py` — minimal inference example
## 🛠 Training

Fine-tuned from NVIDIA `nemotron-speech-streaming-en-0.6b` using the NeMo ASR framework. Hindi training mix:
| Source | Approx. hours |
|---|---|
| Shrutilipi (Hindi) | ~1,500 |
| IndicVoices (Hindi) | ~1,000 |
| Kathbath (Hindi) | ~137 |
| IndicVoices-R | ~150 |
| Gramvaani | ~100 |
| Vaani | ~50 |
| Lahaja | ~30 |
| IndicTTS | ~30 |
| Short-form domain | ~20 |
All Hindi training labels were ITN-normalized (digits, ordinals, । and , punctuation, Indian-numbering commas) using a Gemma 4 normalization pass following NVIDIA Riva Hindi ITN conventions.
Bilingual EN-1024 + HI-512 BPE tokenizer (1,536 sub-word tokens total) shared across languages. RNN-T loss with SpecAugment, mixed-precision (bf16) training on NVIDIA H100s.
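SpecAugment, mentioned above, regularizes ASR training by zeroing random frequency bands and time spans of the input spectrogram. A toy, dependency-free sketch of the idea (mask counts and widths here are illustrative, not the actual training config):

```python
import random

def spec_augment(spec, n_freq_masks=2, freq_width=3,
                 n_time_masks=2, time_width=4, rng=None):
    """Zero out random frequency bands and time spans of a (time x freq)
    spectrogram, in the spirit of SpecAugment (Park et al., 2019).
    `spec` is a list of rows; the input is not modified."""
    rng = rng or random.Random(0)
    T, F = len(spec), len(spec[0])
    out = [row[:] for row in spec]
    for _ in range(n_freq_masks):          # frequency masks
        w = rng.randint(0, freq_width)
        f0 = rng.randint(0, max(0, F - w))
        for t in range(T):
            for f in range(f0, min(F, f0 + w)):
                out[t][f] = 0.0
    for _ in range(n_time_masks):          # time masks
        w = rng.randint(0, time_width)
        t0 = rng.randint(0, max(0, T - w))
        for t in range(t0, min(T, t0 + w)):
            for f in range(F):
                out[t][f] = 0.0
    return out
```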
## 📋 Output convention
Varuna emits ITN-style Hindi:
| Spoken | Output |
|---|---|
| पाँच सौ (five hundred) | 500 |
| दो लाख पचास हजार (two lakh fifty thousand) | 2,50,000 |
| तीन करोड़ (three crore) | 3,00,00,000 |
| पहला (first) | 1st |
| तीसरा (third) | 3rd |
| end of sentence | । |
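The lakh/crore comma placement above follows the standard Indian grouping rule: the last three digits form one group, then pairs of digits after that. A small standalone sketch of that rule (`indian_group` is an illustrative helper, not part of this repo):

```python
def indian_group(n: int) -> str:
    """Format an integer with Indian-numbering commas:
    last 3 digits, then groups of 2 (thousand, lakh, crore, ...)."""
    s = str(abs(n))
    if len(s) <= 3:
        out = s
    else:
        head, tail = s[:-3], s[-3:]
        parts = []
        while len(head) > 2:
            parts.append(head[-2:])
            head = head[:-2]
        parts.append(head)
        out = ",".join(reversed(parts)) + "," + tail
    return ("-" if n < 0 else "") + out

print(indian_group(250000))    # 2,50,000
print(indian_group(30000000))  # 3,00,00,000
```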
This is what voicebot / IVR / call-center products typically want. If your downstream consumer expects spelled-out Devanagari, post-process the model output with a reverse-ITN. We use a Vistaar-style normalizer at benchmark time (strip punctuation + IndicNormalizer NFC/NFD + digit/ordinal expansion) — see AI4Bharat/vistaar/evaluation.py for the reference implementation.
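As a rough illustration, a hypothetical minimal version of that benchmark-time normalization might look like the following. This is only the NFC and punctuation-stripping part; the digit/ordinal expansion and IndicNormalizer steps are omitted, and the real reference implementation is AI4Bharat/vistaar's `evaluation.py`:

```python
import unicodedata

# Punctuation to strip, including the Devanagari danda (।).
# This set is illustrative, not the exact Vistaar list.
PUNCT = "।,?!.\"'"

def normalize(text: str) -> str:
    """NFC-normalize, strip punctuation, collapse whitespace, lowercase."""
    text = unicodedata.normalize("NFC", text)
    text = "".join(ch for ch in text if ch not in PUNCT)
    return " ".join(text.split()).lower()
```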
## ⚠️ Limitations
- Code-switching not supported yet. Varuna is trained on monolingual Hindi audio. Inputs that mix English words mid-sentence (e.g., conversational Hindi-English) may produce transliteration artifacts or substitutions. A bilingual fine-tune is on the roadmap.
- Codec-degraded audio. Performance on telephony / heavily compressed audio (e.g., MUCS subset) is weaker than on studio-clean speech (CER 10.75 % vs 2.75 % on IndicTTS). Codec-augmentation training is planned.
- Audio format. Expects 16 kHz mono. Other sample rates need resampling upstream.
## 🔗 Links

- 📊 Benchmark dataset: `SkunkWorkLabs/hindi-asr-benchmark` — 6 Hindi subsets with embedded audio plus outputs from Varuna and 3 commercial systems.
- 🧪 Vistaar normalizer reference: AI4Bharat/vistaar
- 🛠 Base model: nvidia/nemotron-speech-streaming-en-0.6b
## 📬 Contact
Need help with the training recipe or want to fine-tune Varuna on your own data? Reach out: harshris2314@gmail.com.
## 📝 Citation

If you use Varuna STT in research or production, please cite:

```bibtex
@misc{skunkworks-varuna-stt-2026,
  title     = {Varuna STT: A Hindi ASR model fine-tuned from NVIDIA NeMo Nemotron},
  author    = {SkunkWorks Labs},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/SkunkWorkLabs/varuna-stt}
}
```