---
language:
- hi
license: other
license_name: skunkworks-modified-mit
license_link: LICENSE
pretty_name: Varuna STT
library_name: nemo
tags:
- automatic-speech-recognition
- hindi
- asr
- speech
- conformer
- rnnt
- nemo
- varuna
pipeline_tag: automatic-speech-recognition
base_model: nvidia/nemotron-speech-streaming-en-0.6b
metrics:
- wer
- cer
model-index:
- name: Varuna STT
  results:
  - task:
      type: automatic-speech-recognition
    dataset:
      name: SkunkWorkLabs Hindi ASR Benchmark — kathbath
      type: SkunkWorkLabs/hindi-asr-benchmark
      config: kathbath
      split: eval
    metrics:
    - type: wer
      value: 16.82
    - type: cer
      value: 6.36
  - task:
      type: automatic-speech-recognition
    dataset:
      name: SkunkWorkLabs Hindi ASR Benchmark — kathbath_noisy
      type: SkunkWorkLabs/hindi-asr-benchmark
      config: kathbath_noisy
      split: eval
    metrics:
    - type: wer
      value: 19.06
    - type: cer
      value: 8.00
  - task:
      type: automatic-speech-recognition
    dataset:
      name: SkunkWorkLabs Hindi ASR Benchmark — commonvoice
      type: SkunkWorkLabs/hindi-asr-benchmark
      config: commonvoice
      split: eval
    metrics:
    - type: wer
      value: 24.16
    - type: cer
      value: 10.72
  - task:
      type: automatic-speech-recognition
    dataset:
      name: SkunkWorkLabs Hindi ASR Benchmark — fleurs
      type: SkunkWorkLabs/hindi-asr-benchmark
      config: fleurs
      split: eval
    metrics:
    - type: wer
      value: 17.29
    - type: cer
      value: 7.20
  - task:
      type: automatic-speech-recognition
    dataset:
      name: SkunkWorkLabs Hindi ASR Benchmark — indictts
      type: SkunkWorkLabs/hindi-asr-benchmark
      config: indictts
      split: eval
    metrics:
    - type: wer
      value: 9.75
    - type: cer
      value: 2.75
  - task:
      type: automatic-speech-recognition
    dataset:
      name: SkunkWorkLabs Hindi ASR Benchmark — mucs
      type: SkunkWorkLabs/hindi-asr-benchmark
      config: mucs
      split: eval
    metrics:
    - type: wer
      value: 24.60
    - type: cer
      value: 10.75
---
# Varuna STT 🌊
**Varuna STT** is a 0.6B-parameter Hindi automatic speech recognition (ASR) model
fine-tuned from NVIDIA's [`nemotron-speech-streaming-en-0.6b`](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b)
base model on a curated mix of Hindi speech corpora. It outputs natural-style Hindi
text (digits, ordinals such as `1st`/`3rd`, Indian numbering with lakh/crore comma
placement, and punctuation: `।`, `,`, `?`, `!`) directly from the
acoustic signal, so it can drop into voicebot / IVR / transcription pipelines
without a separate inverse text normalization (ITN) postprocessor.
- **Architecture:** Conformer encoder + RNN-T decoder (NeMo `EncDecRNNTBPEModel`)
- **Parameters:** 0.6 B
- **Language:** Hindi (`hi`)
- **Sample rate:** 16 kHz mono
- **Output style:** Inverse-Text-Normalized (ITN) — digits, ordinals, punctuation
- **License:** SkunkWorks Modified MIT (see `LICENSE`)
## ⚡ Inference speed (NVIDIA H100 PCIe)
Measured on 20 sample clips from the Kathbath validation split (~5 s mean clip duration), batch size 1, `greedy_batch` RNN-T decoding:
| Metric | Value |
|---|---|
| **RTFx** | **25.13×** |
| Mean per-clip latency | 208 ms |
| p50 latency | 175 ms |
| p90 latency | 362 ms |
(RTFx = audio_seconds / wall_seconds. 25× means 1 hour of audio is transcribed in ~2.4 minutes on a single H100.)
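
As a rough illustration of how the figures above relate to each other, the sketch below computes RTFx and latency statistics from per-clip timings. The function name `summarize_timings` and the sample timings are hypothetical and are not the measurements behind the table; the nearest-rank percentile is an assumption and may differ from the method used for the reported p50/p90.

```python
import math

def summarize_timings(audio_seconds, wall_seconds):
    """Return RTFx and latency stats (ms) for per-clip timings.

    audio_seconds[i] is the duration of clip i; wall_seconds[i] is the
    wall-clock time spent transcribing it.
    """
    if len(audio_seconds) != len(wall_seconds) or not audio_seconds:
        raise ValueError("need two non-empty lists of equal length")

    def percentile(values, pct):
        # Nearest-rank percentile: the smallest value with at least
        # pct% of the data at or below it.
        ordered = sorted(values)
        rank = max(1, math.ceil(pct / 100 * len(ordered)))
        return ordered[rank - 1]

    latencies_ms = [w * 1000 for w in wall_seconds]
    return {
        # RTFx = total audio time / total processing time.
        "rtfx": sum(audio_seconds) / sum(wall_seconds),
        "mean_ms": sum(latencies_ms) / len(latencies_ms),
        "p50_ms": percentile(latencies_ms, 50),
        "p90_ms": percentile(latencies_ms, 90),
    }

# Hypothetical timings for four clips (not the benchmark data).
stats = summarize_timings([5.0, 4.0, 6.0, 5.0], [0.2, 0.15, 0.3, 0.15])
print(stats)

# Minutes of compute needed per hour of audio at a given RTFx.
print(60 / stats["rtfx"])
```

Summing audio and wall time over all clips, rather than averaging per-clip ratios, keeps long clips from being under-weighted.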
## 📊 Benchmark — Vistaar-style normalized WER % / CER %
Evaluated on six Hindi held-out subsets from the
[`SkunkWorkLabs/hindi-asr-benchmark`](https://huggingface.co/datasets/SkunkWorkLabs/hindi-asr-benchmark) dataset.
References and hypotheses both pass through the same Vistaar-style normalizer
([Bhogale et al., Interspeech 2023](https://www.isca-archive.org/interspeech_2023/bhogale23_interspeech.pdf))
plus digit / ordinal expansion, so all systems are compared in a style-neutral way.
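
For readers who want to see the shape of such a comparison, here is a minimal sketch of normalized WER / CER scoring. It is not the Vistaar implementation: `normalize` is a simplified, hypothetical stand-in that only applies NFC, strips punctuation, and collapses whitespace, with no digit or ordinal expansion. Whether spaces count toward CER is also an assumption here (they do).

```python
import unicodedata

def normalize(text):
    """Simplified stand-in for a Vistaar-style normalizer (assumption):
    NFC-normalize, drop punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    kept = [ch for ch in text if not unicodedata.category(ch).startswith("P")]
    return " ".join("".join(kept).split())

def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match)
            ))
        prev = curr
    return prev[-1]

def wer(ref, hyp):
    """Word error rate in percent for one reference/hypothesis pair."""
    ref_words = normalize(ref).split()
    if not ref_words:
        raise ValueError("reference is empty after normalization")
    return 100 * edit_distance(ref_words, normalize(hyp).split()) / len(ref_words)

def cer(ref, hyp):
    """Character error rate in percent (spaces included)."""
    ref_chars = normalize(ref)
    if not ref_chars:
        raise ValueError("reference is empty after normalization")
    return 100 * edit_distance(ref_chars, normalize(hyp)) / len(ref_chars)

reference = "आज मौसम अच्छा है।"
hypothesis = "आज मौसम अच्छा था"
print(wer(reference, hypothesis))  # one substituted word out of four
print(cer(reference, hypothesis))
```

Corpus-level scores are usually computed by summing edit distances and reference lengths over all utterances before dividing, not by averaging per-utterance rates.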
### WER %
| Subset | n | **Varuna STT** | ElevenLabs Scribe v1 | Deepgram Nova-2 | Sarvam Saarika v2.5 |
|---|---|---|---|---|---|
| **indictts** | 98 | **9.75 🥇** | 13.20 | 15.41 | 14.71 |
| **fleurs (test)** | 417 | 17.29 | **11.93** | 21.22 | 15.74 |
| **kathbath** | 1,929 | 16.82 | **13.32** | 20.55 | 16.62 |
| **kathbath_noisy** | 1,929 | 19.06 | **13.16** | 21.98 | 17.75 |
| **commonvoice** | 1,727 | 24.16 | **17.02** | 28.34 | 19.32 |
| **mucs** | 3,897 | 24.60 | **10.97** | 20.54 | 12.72 |
### CER %
| Subset | **Varuna STT** | ElevenLabs Scribe v1 | Deepgram Nova-2 | Sarvam Saarika v2.5 |
|---|---|---|---|---|
| **indictts** | **2.75 🥇** | 4.16 | 8.53 | 6.51 |
| **fleurs (test)** | 7.20 | **5.68** | 16.74 | 7.08 |
| **kathbath** | **6.36 🥇** | 6.50 | 13.53 | 7.42 |
| **kathbath_noisy** | 8.00 | **5.87** | 14.75 | 7.82 |
| **commonvoice** | 10.72 | **8.96** | 20.25 | 9.87 |
| **mucs** | 10.75 | **3.94** | 9.94 | 4.79 |
Varuna leads on `indictts` (both metrics) and narrowly leads on `kathbath` CER (6.36 vs 6.50 for ElevenLabs Scribe v1). Its largest gaps to the best system are on the conversational / codec-degraded subsets (`commonvoice`, `mucs`).
## 🚀 Usage
```python
from inference import VarunaSTT
model = VarunaSTT() # auto-picks GPU if available
texts = model.transcribe(["clip1.wav", "clip2.wav"]) # 16 kHz mono
for t in texts:
    print(t)
```
CLI:
```bash
python inference.py --audio path/to/clip.wav
```
You'll need:
- `nemo_toolkit[asr]>=2.4`
- `omegaconf`, `torch`, `soundfile`
- The base `nemotron-speech-streaming-en-0.6b.nemo` file (download separately
from [`nvidia/nemotron-speech-streaming-en-0.6b`](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b))
Files in this repo:
- `varuna.ckpt` — fine-tuned weights
- `tokenizer.model`, `tokenizer.vocab`, `vocab.txt` — bilingual EN-1024 / HI-512 BPE tokenizer
- `inference.py` — minimal inference example
## 🛠 Training
Fine-tuned from **NVIDIA `nemotron-speech-streaming-en-0.6b`** using the NeMo
ASR framework. Hindi training mix:
| Source | Approx. hours |
|---|---|
| Shrutilipi (Hindi) | ~1,500 |
| IndicVoices (Hindi) | ~1,000 |
| Kathbath (Hindi) | ~137 |
| IndicVoices-R | ~150 |
| Gramvaani | ~100 |
| Vaani | ~50 |
| Lahaja | ~30 |
| IndicTTS | ~30 |
| Short-form domain | ~20 |
All Hindi training labels were inverse-text-normalized (digits, ordinals, `।`/`,` punctuation,
Indian-numbering commas) using a Gemma 4 normalization pass following NVIDIA Riva
Hindi ITN conventions.
The tokenizer is a bilingual EN-1024 + HI-512 BPE tokenizer (1,536 sub-word tokens total) shared across
languages. Training used RNN-T loss with SpecAugment and mixed precision (bf16) on
NVIDIA H100s.
## 📋 Output convention
Varuna emits **ITN-style** Hindi:
| spoken | output |
|---|---|
| `पाँच सौ` (five hundred) | `500` |
| `दो लाख पचास हजार` | `2,50,000` |
| `तीन करोड़` | `3,00,00,000` |
| `पहला` (first) | `1st` |
| `तीसरा` | `3rd` |
| End of sentence | `।` |
This is what voicebot / IVR / call-center products typically want. If your
downstream consumer expects spelled-out Devanagari, post-process the model
output with a reverse-ITN (text normalization) step. At benchmark time we use a
Vistaar-style normalizer (strip punctuation + IndicNormalizer NFC/NFD + digit/ordinal expansion); see
[AI4Bharat/vistaar/evaluation.py](https://github.com/AI4Bharat/vistaar/blob/master/evaluation.py)
for the reference implementation.
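
The lakh/crore rows in the table above follow the Indian grouping rule: the last three digits form one group, and the remaining digits are grouped in twos. The sketch below illustrates that rule only; `indian_grouping` is a hypothetical helper, not part of this repo, and says nothing about how the model itself produces its output.

```python
def indian_grouping(n):
    """Format a non-negative integer with Indian-style commas."""
    if n < 0:
        raise ValueError("expected a non-negative integer")
    digits = str(n)
    if len(digits) <= 3:
        return digits
    head, tail = digits[:-3], digits[-3:]
    # Split the leading part into groups of two, working from the right.
    groups = []
    while head:
        groups.insert(0, head[-2:])
        head = head[:-2]
    return ",".join(groups + [tail])

for value in (500, 250000, 30000000):
    print(value, "->", indian_grouping(value))
# 500 -> 500
# 250000 -> 2,50,000
# 30000000 -> 3,00,00,000
```

A helper like this can be handy when writing test fixtures that check transcripts against expected numeric strings.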
## ⚠️ Limitations
- **Code-switching not supported yet.** Varuna is trained on monolingual Hindi
audio. Inputs that mix English words mid-sentence (e.g., conversational
Hindi-English) may produce transliteration artifacts or substitutions. A
bilingual fine-tune is on the roadmap.
- **Codec-degraded audio.** Performance on telephony / heavily compressed audio
(e.g., MUCS subset) is weaker than on studio-clean speech (CER 10.75 % vs
2.75 % on IndicTTS). Codec-augmentation training is planned.
- **Audio format.** Expects 16 kHz mono. Other sample rates need resampling
upstream.
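
Because the model expects 16 kHz mono input, a pre-flight check before transcription can catch mismatched files early. The sketch below uses only Python's standard `wave` module, so it handles PCM WAV files only; `check_wav_format` is a hypothetical helper and is not part of `inference.py`. It reports mismatches and leaves the actual resampling to whichever audio library you already use.

```python
import os
import tempfile
import wave

def check_wav_format(path, expected_rate=16000, expected_channels=1):
    """Return (ok, message) for whether a PCM WAV file matches the
    expected sample rate and channel count."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    if rate != expected_rate or channels != expected_channels:
        return False, (f"{path}: {rate} Hz, {channels} channel(s); "
                       "resample or downmix first")
    return True, f"{path}: OK"

# Demo on a generated file: 0.1 s of 16-bit silence at 8 kHz mono.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_8k.wav")
with wave.open(demo_path, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)   # 16-bit samples
    out.setframerate(8000)
    out.writeframes(b"\x00\x00" * 800)

ok, message = check_wav_format(demo_path)
print(ok, message)
```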
## 🔗 Links
- 📊 **Benchmark dataset:** [`SkunkWorkLabs/hindi-asr-benchmark`](https://huggingface.co/datasets/SkunkWorkLabs/hindi-asr-benchmark) — 6 Hindi subsets with embedded audio + outputs from Varuna and 3 commercial systems.
- 🧪 **Vistaar normalizer reference:** [AI4Bharat/vistaar](https://github.com/AI4Bharat/vistaar)
- 🛠 **Base model:** [nvidia/nemotron-speech-streaming-en-0.6b](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b)
## 📬 Contact
Need help with the **training recipe** or want to **fine-tune Varuna** on
your own data? Reach out: **harshris2314@gmail.com**.
## 📝 Citation
If you use Varuna STT in research or production, please cite:
```bibtex
@misc{skunkworks-varuna-stt-2026,
title = {Varuna STT: A Hindi ASR model fine-tuned from NVIDIA NeMo Nemotron},
author = {SkunkWorks Labs},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/SkunkWorkLabs/varuna-stt}
}
```