---
language:
- hi
license: other
license_name: skunkworks-modified-mit
license_link: LICENSE
pretty_name: Varuna STT
library_name: nemo
tags:
- automatic-speech-recognition
- hindi
- asr
- speech
- conformer
- rnnt
- nemo
- varuna
pipeline_tag: automatic-speech-recognition
base_model: nvidia/nemotron-speech-streaming-en-0.6b
metrics:
- wer
- cer
model-index:
- name: Varuna STT
  results:
  - task:
      type: automatic-speech-recognition
    dataset:
      name: SkunkWorkLabs Hindi ASR Benchmark kathbath
      type: SkunkWorkLabs/hindi-asr-benchmark
      config: kathbath
      split: eval
    metrics:
    - type: wer
      value: 16.82
    - type: cer
      value: 6.36
  - task:
      type: automatic-speech-recognition
    dataset:
      name: SkunkWorkLabs Hindi ASR Benchmark kathbath_noisy
      type: SkunkWorkLabs/hindi-asr-benchmark
      config: kathbath_noisy
      split: eval
    metrics:
    - type: wer
      value: 19.06
    - type: cer
      value: 8.00
  - task:
      type: automatic-speech-recognition
    dataset:
      name: SkunkWorkLabs Hindi ASR Benchmark commonvoice
      type: SkunkWorkLabs/hindi-asr-benchmark
      config: commonvoice
      split: eval
    metrics:
    - type: wer
      value: 24.16
    - type: cer
      value: 10.72
  - task:
      type: automatic-speech-recognition
    dataset:
      name: SkunkWorkLabs Hindi ASR Benchmark fleurs
      type: SkunkWorkLabs/hindi-asr-benchmark
      config: fleurs
      split: eval
    metrics:
    - type: wer
      value: 17.29
    - type: cer
      value: 7.20
  - task:
      type: automatic-speech-recognition
    dataset:
      name: SkunkWorkLabs Hindi ASR Benchmark indictts
      type: SkunkWorkLabs/hindi-asr-benchmark
      config: indictts
      split: eval
    metrics:
    - type: wer
      value: 9.75
    - type: cer
      value: 2.75
  - task:
      type: automatic-speech-recognition
    dataset:
      name: SkunkWorkLabs Hindi ASR Benchmark mucs
      type: SkunkWorkLabs/hindi-asr-benchmark
      config: mucs
      split: eval
    metrics:
    - type: wer
      value: 24.60
    - type: cer
      value: 10.75
---
# Varuna STT 🌊
**Varuna STT** is a 0.6B-parameter Hindi automatic speech recognition (ASR) model
fine-tuned from NVIDIA's [`nemotron-speech-streaming-en-0.6b`](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b)
base model on a curated mix of Hindi speech corpora. It outputs natural-style Hindi
text directly from the acoustic signal, including digits, ordinals (`1st`/`3rd`),
Indian numbering (lakh/crore comma placement), and punctuation (the Devanagari
danda `।` plus `,`, `?`, `!`). The output is ready to drop into voicebot / IVR /
transcription pipelines without a separate inverse text normalization (ITN) postprocessor.
- **Architecture:** Conformer encoder + RNN-T decoder (NeMo `EncDecRNNTBPEModel`)
- **Parameters:** 0.6 B
- **Language:** Hindi (`hi`)
- **Sample rate:** 16 kHz mono
- **Output style:** Inverse-Text-Normalized (ITN) — digits, ordinals, punctuation
- **License:** SkunkWorks Modified MIT (see `LICENSE`)
## ⚡ Inference speed (NVIDIA H100 PCIe)
Measured on 20 sample clips from the Kathbath validation split (~5 s mean clip duration), with batch size 1 and `greedy_batch` RNN-T decoding:
| Metric | Value |
|---|---|
| **RTFx** | **25.13×** |
| Mean per-clip latency | 208 ms |
| p50 latency | 175 ms |
| p90 latency | 362 ms |
(RTFx = audio_seconds / wall_seconds. 25× means 1 hour of audio is transcribed in ~2.4 minutes on a single H100.)
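As a minimal sketch of how figures like these can be derived from raw timings, the snippet below computes RTFx and the latency summary from per-clip durations and wall-clock times. The function names and the three sample measurements are hypothetical, and two details are assumptions rather than the benchmark's documented method: RTFx is aggregated over all clips (total audio / total wall time), and p90 uses inclusive interpolation between sorted values.

```python
import statistics

def rtfx(audio_seconds, wall_seconds):
    """Inverse real-time factor: total audio duration / total processing time."""
    return sum(audio_seconds) / sum(wall_seconds)

def latency_summary(wall_seconds):
    """Mean, p50 and p90 per-clip latency in milliseconds."""
    ms = sorted(w * 1000.0 for w in wall_seconds)
    # quantiles(n=10) returns the 9 decile cut points; index 8 is the 90th percentile.
    deciles = statistics.quantiles(ms, n=10, method="inclusive")
    return {
        "mean_ms": statistics.fmean(ms),
        "p50_ms": statistics.median(ms),
        "p90_ms": deciles[8],
    }

# Hypothetical measurements for three clips (not the benchmark data).
durations = [5.0, 4.0, 6.0]   # seconds of audio per clip
timings = [0.20, 0.15, 0.25]  # seconds of wall-clock time per clip

print(f"RTFx: {rtfx(durations, timings):.2f}x")
print(latency_summary(timings))
```

With only 20 clips the tail percentiles are sensitive to single outliers, so a larger sample is worth timing before relying on p90.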
## 📊 Benchmark — Vistaar-style normalized WER % / CER %
Evaluated on six Hindi held-out subsets from the
[`SkunkWorkLabs/hindi-asr-benchmark`](https://huggingface.co/datasets/SkunkWorkLabs/hindi-asr-benchmark) dataset.
References and hypotheses both pass through the same Vistaar-style normalizer
([Bhogale et al., Interspeech 2023](https://www.isca-archive.org/interspeech_2023/bhogale23_interspeech.pdf))
plus digit / ordinal expansion, so all systems are compared in a style-neutral way.
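For readers who want to reproduce the metric itself, here is a minimal pure-Python sketch of WER and CER as edit distance over words and characters. It assumes both strings have already passed through the normalizer (which is not reproduced here), aggregates errors at corpus level, and counts spaces and individual Unicode code points as characters for CER. Those choices are assumptions and may differ from the benchmark's scoring script.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def error_rate(pairs, unit):
    """Corpus-level error rate in percent; unit is 'word' (WER) or 'char' (CER)."""
    errors = total = 0
    for ref, hyp in pairs:
        r = ref.split() if unit == "word" else list(ref)
        h = hyp.split() if unit == "word" else list(hyp)
        errors += edit_distance(r, h)
        total += len(r)
    return 100.0 * errors / total

# Hypothetical, already-normalized (reference, hypothesis) pair.
pairs = [("मेरा नाम राम है", "मेरा नाम श्याम है")]
print(f"WER: {error_rate(pairs, 'word'):.2f} %")
print(f"CER: {error_rate(pairs, 'char'):.2f} %")
```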
### WER %
| Subset | n | **Varuna STT** | ElevenLabs Scribe v1 | Deepgram Nova-2 | Sarvam Saarika v2.5 |
|---|---|---|---|---|---|
| **indictts** | 98 | **9.75 🥇** | 13.20 | 15.41 | 14.71 |
| **fleurs (test)** | 417 | 17.29 | **11.93** | 21.22 | 15.74 |
| **kathbath** | 1,929 | 16.82 | **13.32** | 20.55 | 16.62 |
| **kathbath_noisy** | 1,929 | 19.06 | **13.16** | 21.98 | 17.75 |
| **commonvoice** | 1,727 | 24.16 | **17.02** | 28.34 | 19.32 |
| **mucs** | 3,897 | 24.60 | **10.97** | 20.54 | 12.72 |
### CER %
| Subset | **Varuna STT** | ElevenLabs Scribe v1 | Deepgram Nova-2 | Sarvam Saarika v2.5 |
|---|---|---|---|---|
| **indictts** | **2.75 🥇** | 4.16 | 8.53 | 6.51 |
| **fleurs (test)** | 7.20 | **5.68** | 16.74 | 7.08 |
| **kathbath** | **6.36 🥇** | 6.50 | 13.53 | 7.42 |
| **kathbath_noisy** | 8.00 | **5.87** | 14.75 | 7.82 |
| **commonvoice** | 10.72 | **8.96** | 20.25 | 9.87 |
| **mucs** | 10.75 | **3.94** | 9.94 | 4.79 |
Varuna leads on `indictts` on both metrics and narrowly leads on `kathbath` CER (6.36 vs. 6.50 for ElevenLabs Scribe v1). It trails the best system on the other subsets, with its largest WER gaps on the conversational / codec-degraded subsets (`commonvoice`, `mucs`).
## 🚀 Usage
```python
from inference import VarunaSTT  # inference.py ships in this repo

model = VarunaSTT()  # auto-picks GPU if available
texts = model.transcribe(["clip1.wav", "clip2.wav"])  # 16 kHz mono
for t in texts:
    print(t)
```
CLI:
```bash
python inference.py --audio path/to/clip.wav
```
You'll need:
- `nemo_toolkit[asr]>=2.4`
- `omegaconf`, `torch`, `soundfile`
- The base `nemotron-speech-streaming-en-0.6b.nemo` file (download separately
from [`nvidia/nemotron-speech-streaming-en-0.6b`](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b))
Files in this repo:
- `varuna.ckpt` — fine-tuned weights
- `tokenizer.model`, `tokenizer.vocab`, `vocab.txt` — bilingual EN-1024 / HI-512 BPE tokenizer
- `inference.py` — minimal inference example
## 🛠 Training
Fine-tuned from **NVIDIA `nemotron-speech-streaming-en-0.6b`** using the NeMo
ASR framework. Hindi training mix:
| Source | Approx. hours |
|---|---|
| Shrutilipi (Hindi) | ~1,500 |
| IndicVoices (Hindi) | ~1,000 |
| Kathbath (Hindi) | ~137 |
| IndicVoices-R | ~150 |
| Gramvaani | ~100 |
| Vaani | ~50 |
| Lahaja | ~30 |
| IndicTTS | ~30 |
| Short-form domain | ~20 |
All Hindi training labels were ITN-normalized (digits, ordinals, `।`/`,` punctuation,
Indian-numbering commas) using a Gemma 4 normalization pass following NVIDIA Riva
Hindi ITN conventions.
The tokenizer is a bilingual EN-1024 + HI-512 BPE model (1,536 sub-word tokens total)
shared across languages. Training used RNN-T loss with SpecAugment and
mixed-precision (bf16) on NVIDIA H100s.
## 📋 Output convention
Varuna emits **ITN-style** Hindi:
| spoken | output |
|---|---|
| `पाँच सौ` (five hundred) | `500` |
| `दो लाख पचास हजार` | `2,50,000` |
| `तीन करोड़` | `3,00,00,000` |
| `पहला` (first) | `1st` |
| `तीसरा` | `3rd` |
| End of sentence | `।` |
This is what voicebot / IVR / call-center products typically want. If your
downstream consumer expects spelled-out Devanagari, post-process the model
output with a text-normalization (reverse-ITN) step. At benchmark time we use a
Vistaar-style normalizer (strip punctuation + IndicNormalizer NFC/NFD + digit/ordinal
expansion); see
[AI4Bharat/vistaar/evaluation.py](https://github.com/AI4Bharat/vistaar/blob/master/evaluation.py)
for the reference implementation.
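The Indian-numbering comma placement in the table above (last three digits, then groups of two) can be sketched in a few lines. This is only an illustration of the convention, useful for building test fixtures or sanity-checking transcripts. The model emits these strings directly, and `indian_grouping` is a hypothetical helper, not part of this repo.

```python
def indian_grouping(n):
    """Format a non-negative integer with Indian-style commas (lakh/crore grouping)."""
    if n < 0:
        raise ValueError("expected a non-negative integer")
    digits = str(n)
    if len(digits) <= 3:
        return digits
    # The rightmost group has three digits; every group to its left has two.
    head, tail = digits[:-3], digits[-3:]
    groups = []
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    groups.insert(0, head)
    return ",".join(groups + [tail])

for value in (500, 250000, 30000000):
    print(value, "->", indian_grouping(value))
```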
## ⚠️ Limitations
- **Code-switching not supported yet.** Varuna is trained on monolingual Hindi
audio. Inputs that mix English words mid-sentence (e.g., conversational
Hindi-English) may produce transliteration artifacts or substitutions. A
bilingual fine-tune is on the roadmap.
- **Codec-degraded audio.** Performance on telephony / heavily compressed audio
(e.g., MUCS subset) is weaker than on studio-clean speech (CER 10.75 % vs
2.75 % on IndicTTS). Codec-augmentation training is planned.
- **Audio format.** Expects 16 kHz mono. Other sample rates need resampling
upstream.
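Because the model expects 16 kHz mono input, it can help to check clips before transcription. The sketch below uses only Python's standard-library `wave` module, so it handles PCM WAV files only. `check_wav_format` is a hypothetical helper, not part of `inference.py`, and it reports mismatches rather than resampling.

```python
import wave

def check_wav_format(path, expected_rate=16000, expected_channels=1):
    """Return a list of format problems for a PCM WAV file; empty if it matches."""
    problems = []
    # wave.open raises wave.Error for non-PCM or malformed WAV files.
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
    if rate != expected_rate:
        problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
    if channels != expected_channels:
        problems.append(f"{channels} channel(s), expected {expected_channels}")
    return problems

# Example (path is illustrative):
# for issue in check_wav_format("clip1.wav"):
#     print("clip1.wav:", issue)
```

Clips that fail the check need to be resampled or downmixed with an audio tool of your choice before they are passed to the model.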
## 🔗 Links
- 📊 **Benchmark dataset:** [`SkunkWorkLabs/hindi-asr-benchmark`](https://huggingface.co/datasets/SkunkWorkLabs/hindi-asr-benchmark) — 6 Hindi subsets with embedded audio + outputs from Varuna and 3 commercial systems.
- 🧪 **Vistaar normalizer reference:** [AI4Bharat/vistaar](https://github.com/AI4Bharat/vistaar)
- 🛠 **Base model:** [nvidia/nemotron-speech-streaming-en-0.6b](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b)
## 📬 Contact
Need help with the **training recipe** or want to **fine-tune Varuna** on
your own data? Reach out: **harshris2314@gmail.com**.
## 📝 Citation
If you use Varuna STT in research or production, please cite:
```bibtex
@misc{skunkworks-varuna-stt-2026,
title = {Varuna STT: A Hindi ASR model fine-tuned from NVIDIA NeMo Nemotron},
author = {SkunkWorks Labs},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/SkunkWorkLabs/varuna-stt}
}
```