---
language:
- hi
license: other
license_name: skunkworks-modified-mit
license_link: LICENSE
pretty_name: Varuna STT
library_name: nemo
tags:
- automatic-speech-recognition
- hindi
- asr
- speech
- conformer
- rnnt
- nemo
- varuna
pipeline_tag: automatic-speech-recognition
base_model: nvidia/nemotron-speech-streaming-en-0.6b
metrics:
- wer
- cer
model-index:
- name: Varuna STT
  results:
  - task:
      type: automatic-speech-recognition
    dataset:
      name: SkunkWorkLabs Hindi ASR Benchmark — kathbath
      type: SkunkWorkLabs/hindi-asr-benchmark
      config: kathbath
      split: eval
    metrics:
    - type: wer
      value: 16.82
    - type: cer
      value: 6.36
  - task:
      type: automatic-speech-recognition
    dataset:
      name: SkunkWorkLabs Hindi ASR Benchmark — kathbath_noisy
      type: SkunkWorkLabs/hindi-asr-benchmark
      config: kathbath_noisy
      split: eval
    metrics:
    - type: wer
      value: 19.06
    - type: cer
      value: 8.00
  - task:
      type: automatic-speech-recognition
    dataset:
      name: SkunkWorkLabs Hindi ASR Benchmark — commonvoice
      type: SkunkWorkLabs/hindi-asr-benchmark
      config: commonvoice
      split: eval
    metrics:
    - type: wer
      value: 24.16
    - type: cer
      value: 10.72
  - task:
      type: automatic-speech-recognition
    dataset:
      name: SkunkWorkLabs Hindi ASR Benchmark — fleurs
      type: SkunkWorkLabs/hindi-asr-benchmark
      config: fleurs
      split: eval
    metrics:
    - type: wer
      value: 17.29
    - type: cer
      value: 7.20
  - task:
      type: automatic-speech-recognition
    dataset:
      name: SkunkWorkLabs Hindi ASR Benchmark — indictts
      type: SkunkWorkLabs/hindi-asr-benchmark
      config: indictts
      split: eval
    metrics:
    - type: wer
      value: 9.75
    - type: cer
      value: 2.75
  - task:
      type: automatic-speech-recognition
    dataset:
      name: SkunkWorkLabs Hindi ASR Benchmark — mucs
      type: SkunkWorkLabs/hindi-asr-benchmark
      config: mucs
      split: eval
    metrics:
    - type: wer
      value: 24.60
    - type: cer
      value: 10.75
---

# Varuna STT 🌊

**Varuna STT** is a 0.6B-parameter Hindi automatic speech recognition (ASR) model
fine-tuned from NVIDIA's [`nemotron-speech-streaming-en-0.6b`](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b)
base model on a curated mix of Hindi speech corpora. It outputs natural-style Hindi
text directly from the acoustic signal: digits, ordinals (`1st`/`3rd`), Indian
numbering (lakh/crore comma placement), and punctuation (`।`, `,`, `?`, `!`).
The output is ready to drop into voicebot, IVR, and transcription pipelines
without a separate inverse text normalization (ITN) postprocessor.

- **Architecture:** Conformer encoder + RNN-T decoder (NeMo `EncDecRNNTBPEModel`)
- **Parameters:** 0.6 B
- **Language:** Hindi (`hi`)
- **Sample rate:** 16 kHz mono
- **Output style:** Inverse-text-normalized (ITN): digits, ordinals, punctuation
- **License:** SkunkWorks Modified MIT (see `LICENSE`)

## ⚡ Inference speed (NVIDIA H100 PCIe)

Measured on 20 sample clips from the Kathbath validation split (~5 s mean clip duration), with batch size 1 and `greedy_batch` RNN-T decoding:

| Metric | Value |
|---|---|
| **RTFx** | **25.13×** |
| Mean per-clip latency | 208 ms |
| p50 latency | 175 ms |
| p90 latency | 362 ms |

RTFx = audio_seconds / wall_seconds. An RTFx of about 25× means that 1 hour of audio is transcribed in ~2.4 minutes on a single H100.

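The RTFx arithmetic can be checked in a few lines. This is only an illustrative sketch: `rtfx` and `minutes_to_transcribe` are hypothetical helper names that do not exist in this repository, and the input is the RTFx figure quoted in the table.

```python
def rtfx(audio_seconds: float, wall_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio processed per second of wall time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


def minutes_to_transcribe(audio_hours: float, rtfx_value: float) -> float:
    """Wall-clock minutes needed to transcribe `audio_hours` of audio at a given RTFx."""
    if rtfx_value <= 0:
        raise ValueError("rtfx_value must be positive")
    return audio_hours * 60.0 / rtfx_value


# One hour of audio at the reported RTFx of 25.13
print(round(minutes_to_transcribe(1.0, 25.13), 2))  # 2.39
```

Note that RTFx measured at batch size 1 on short clips says little about throughput with larger batches or longer audio, so treat the projection as a rough estimate.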
## 📊 Benchmark: Vistaar-style normalized WER % / CER %

Evaluated on six held-out Hindi subsets from the
[`SkunkWorkLabs/hindi-asr-benchmark`](https://huggingface.co/datasets/SkunkWorkLabs/hindi-asr-benchmark) dataset.
References and hypotheses both pass through the same Vistaar-style normalizer
([Bhogale et al., Interspeech 2023](https://www.isca-archive.org/interspeech_2023/bhogale23_interspeech.pdf))
plus digit / ordinal expansion, so that every system is scored independently of its output style.
Lower is better; the best score in each row is in bold.

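To make "style-neutral" concrete, here is a rough sketch of the kind of normalization applied to both reference and hypothesis before scoring. It is not the Vistaar implementation: `normalize_for_scoring` is a hypothetical name, it only covers Unicode normalization, punctuation stripping, and whitespace cleanup, and it leaves out the digit / ordinal expansion step entirely.

```python
import unicodedata


def normalize_for_scoring(text: str) -> str:
    """Simplified scoring normalizer (assumption: not the benchmark's exact rules)."""
    # Canonical composition so visually identical strings compare equal.
    text = unicodedata.normalize("NFC", text)
    # Replace every Unicode punctuation character (category P*) with a space.
    # This covers ASCII punctuation as well as the Devanagari danda.
    kept = [" " if unicodedata.category(ch).startswith("P") else ch for ch in text]
    # Collapse runs of whitespace and trim the ends.
    return " ".join("".join(kept).split())


print(normalize_for_scoring("नमस्ते, दुनिया।"))  # नमस्ते दुनिया
```

Because the same function is applied to both sides, a system is not penalized for emitting (or omitting) punctuation that the reference transcript treats differently.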
### WER %

| Subset | n | **Varuna STT** | ElevenLabs Scribe v1 | Deepgram Nova-2 | Sarvam Saarika v2.5 |
|---|---|---|---|---|---|
| **indictts** | 98 | **9.75 🥇** | 13.20 | 15.41 | 14.71 |
| **fleurs (test)** | 417 | 17.29 | **11.93** | 21.22 | 15.74 |
| **kathbath** | 1,929 | 16.82 | **13.32** | 20.55 | 16.62 |
| **kathbath_noisy** | 1,929 | 19.06 | **13.16** | 21.98 | 17.75 |
| **commonvoice** | 1,727 | 24.16 | **17.02** | 28.34 | 19.32 |
| **mucs** | 3,897 | 24.60 | **10.97** | 20.54 | 12.72 |

### CER %

| Subset | **Varuna STT** | ElevenLabs Scribe v1 | Deepgram Nova-2 | Sarvam Saarika v2.5 |
|---|---|---|---|---|
| **indictts** | **2.75 🥇** | 4.16 | 8.53 | 6.51 |
| **fleurs (test)** | 7.20 | **5.68** | 16.74 | 7.08 |
| **kathbath** | **6.36 🥇** | 6.50 | 13.53 | 7.42 |
| **kathbath_noisy** | 8.00 | **5.87** | 14.75 | 7.82 |
| **commonvoice** | 10.72 | **8.96** | 20.25 | 9.87 |
| **mucs** | 10.75 | **3.94** | 9.94 | 4.79 |

Varuna leads on `indictts` on both metrics and narrowly leads on `kathbath` CER (6.36 vs 6.50 for ElevenLabs Scribe v1). Its largest WER gaps to the best system are on the conversational / codec-degraded subsets (`commonvoice`, `mucs`), which is where it has the most headroom.

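For reference, WER and CER are both edit distance divided by reference length, computed over words and characters respectively. The sketch below uses a plain dynamic-programming Levenshtein distance and assumes both strings have already been normalized. The names `edit_distance`, `wer`, and `cer` are illustrative; this is not the scoring code used for the tables above.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent for a single utterance."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate in percent, counted over Unicode code points (spaces included)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


# One substituted word out of four
print(wer("यह एक परीक्षण है", "यह एक परीक्षा है"))  # 25.0
```

Corpus-level figures are commonly aggregated as total errors over total reference length rather than as a mean of per-utterance rates; this card does not state which aggregation it uses.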
## 🚀 Usage

```python
from inference import VarunaSTT  # inference.py ships in this repo

model = VarunaSTT()  # auto-picks GPU if available
texts = model.transcribe(["clip1.wav", "clip2.wav"])  # 16 kHz mono
for t in texts:
    print(t)
```

CLI:

```bash
python inference.py --audio path/to/clip.wav
```

You'll need:

- `nemo_toolkit[asr]>=2.4`
- `omegaconf`, `torch`, `soundfile`
- The base `nemotron-speech-streaming-en-0.6b.nemo` file (download separately
  from [`nvidia/nemotron-speech-streaming-en-0.6b`](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b))

Files in this repo:

- `varuna.ckpt`: fine-tuned weights
- `tokenizer.model`, `tokenizer.vocab`, `vocab.txt`: bilingual EN-1024 / HI-512 BPE tokenizer
- `inference.py`: minimal inference example

## 🛠 Training

Fine-tuned from **NVIDIA `nemotron-speech-streaming-en-0.6b`** using the NeMo
ASR framework. Hindi training mix:

| Source | Approx. hours |
|---|---|
| Shrutilipi (Hindi) | ~1,500 |
| IndicVoices (Hindi) | ~1,000 |
| Kathbath (Hindi) | ~137 |
| IndicVoices-R | ~150 |
| Gramvaani | ~100 |
| Vaani | ~50 |
| Lahaja | ~30 |
| IndicTTS | ~30 |
| Short-form domain | ~20 |

All Hindi training labels were ITN-normalized (digits, ordinals, `।`/`,` punctuation,
Indian-numbering commas) using a Gemma 4 normalization pass following NVIDIA Riva
Hindi ITN conventions.

The tokenizer is a bilingual EN-1024 + HI-512 BPE tokenizer (1,536 sub-word tokens
total) shared across languages. Training used RNN-T loss with SpecAugment and
mixed precision (bf16) on NVIDIA H100s.

## 📋 Output convention

Varuna emits **ITN-style** Hindi:

| Spoken | Output |
|---|---|
| `पाँच सौ` (five hundred) | `500` |
| `दो लाख पचास हजार` (two lakh fifty thousand) | `2,50,000` |
| `तीन करोड़` (three crore) | `3,00,00,000` |
| `पहला` (first) | `1st` |
| `तीसरा` (third) | `3rd` |
| End of sentence | `।` |

This is what voicebot / IVR / call-center products typically want. If your
downstream consumer expects spelled-out Devanagari, post-process the model
output with a reverse-ITN step. At benchmark time we use a Vistaar-style
normalizer (punctuation stripping, IndicNormalizer NFC/NFD, digit/ordinal
expansion); see
[AI4Bharat/vistaar/evaluation.py](https://github.com/AI4Bharat/vistaar/blob/master/evaluation.py)
for the reference implementation.

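The Indian-numbering comma placement shown in the table groups the last three digits first and then every two digits to the left. A minimal sketch of that rule follows, for downstream code that needs to produce or parse such strings. `format_indian` and `parse_indian` are hypothetical helpers written for illustration, not part of this repository or of the model's decoding.

```python
def format_indian(n: int) -> str:
    """Group digits Indian-style: the last three digits, then groups of two."""
    sign = "-" if n < 0 else ""
    digits = str(abs(n))
    if len(digits) <= 3:
        return sign + digits
    head, tail = digits[:-3], digits[-3:]
    groups = []
    # Peel two-digit groups off the right end of the remaining digits.
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    groups.insert(0, head)
    return sign + ",".join(groups + [tail])


def parse_indian(text: str) -> int:
    """Inverse of format_indian: drop the grouping commas and parse the integer."""
    return int(text.replace(",", ""))


print(format_indian(250000))        # 2,50,000
print(format_indian(30000000))      # 3,00,00,000
print(parse_indian("3,00,00,000"))  # 30000000
```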
## ⚠️ Limitations

- **Code-switching not supported yet.** Varuna is trained on monolingual Hindi
  audio. Inputs that mix English words mid-sentence (e.g., conversational
  Hindi-English) may produce transliteration artifacts or substitutions. A
  bilingual fine-tune is on the roadmap.
- **Codec-degraded audio.** Performance on telephony / heavily compressed audio
  (e.g., the MUCS subset) is weaker than on studio-clean speech (CER 10.75 % vs
  2.75 % on IndicTTS). Codec-augmentation training is planned.
- **Audio format.** Expects 16 kHz mono. Audio at any other sample rate must be
  resampled upstream before it is passed to the model.

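Since the model expects 16 kHz mono input, it can help to validate files before transcription. The following sketch uses only Python's standard-library `wave` module, which reads uncompressed PCM WAV files; other formats would need a different reader. `is_16k_mono` and `split_by_format` are hypothetical helper names, not functions provided by `inference.py`.

```python
import wave
from typing import Iterable, List, Tuple


def is_16k_mono(path: str) -> bool:
    """Return True if the PCM WAV file at `path` is 16 kHz and single-channel."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate() == 16000 and wav.getnchannels() == 1


def split_by_format(paths: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split paths into (ready, needs_conversion) based on the 16 kHz mono check."""
    ready, needs_conversion = [], []
    for path in paths:
        (ready if is_16k_mono(path) else needs_conversion).append(path)
    return ready, needs_conversion
```

Files in the second list would then be resampled or downmixed with an audio tool of your choice before being passed to `transcribe`.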
## 🔗 Links

- 📊 **Benchmark dataset:** [`SkunkWorkLabs/hindi-asr-benchmark`](https://huggingface.co/datasets/SkunkWorkLabs/hindi-asr-benchmark): 6 Hindi subsets with embedded audio + outputs from Varuna and 3 commercial systems.
- 🧪 **Vistaar normalizer reference:** [AI4Bharat/vistaar](https://github.com/AI4Bharat/vistaar)
- 🛠 **Base model:** [nvidia/nemotron-speech-streaming-en-0.6b](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b)

## 📬 Contact

Need help with the **training recipe** or want to **fine-tune Varuna** on
your own data? Reach out: **harshris2314@gmail.com**.

## 📝 Citation

If you use Varuna STT in research or production, please cite:

```bibtex
@misc{skunkworks-varuna-stt-2026,
  title     = {Varuna STT: A Hindi ASR model fine-tuned from NVIDIA NeMo Nemotron},
  author    = {SkunkWorks Labs},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/SkunkWorkLabs/varuna-stt}
}
```