---
language:
  - hi
license: other
license_name: skunkworks-modified-mit
license_link: LICENSE
pretty_name: Varuna STT
library_name: nemo
tags:
  - automatic-speech-recognition
  - hindi
  - asr
  - speech
  - conformer
  - rnnt
  - nemo
  - varuna
pipeline_tag: automatic-speech-recognition
base_model: nvidia/nemotron-speech-streaming-en-0.6b
metrics:
  - wer
  - cer
model-index:
  - name: Varuna STT
    results:
      - task:
          type: automatic-speech-recognition
        dataset:
          name: SkunkWorkLabs Hindi ASR Benchmark — kathbath
          type: SkunkWorkLabs/hindi-asr-benchmark
          config: kathbath
          split: eval
        metrics:
          - type: wer
            value: 16.82
          - type: cer
            value: 6.36
      - task:
          type: automatic-speech-recognition
        dataset:
          name: SkunkWorkLabs Hindi ASR Benchmark — kathbath_noisy
          type: SkunkWorkLabs/hindi-asr-benchmark
          config: kathbath_noisy
          split: eval
        metrics:
          - type: wer
            value: 19.06
          - type: cer
            value: 8.00
      - task:
          type: automatic-speech-recognition
        dataset:
          name: SkunkWorkLabs Hindi ASR Benchmark — commonvoice
          type: SkunkWorkLabs/hindi-asr-benchmark
          config: commonvoice
          split: eval
        metrics:
          - type: wer
            value: 24.16
          - type: cer
            value: 10.72
      - task:
          type: automatic-speech-recognition
        dataset:
          name: SkunkWorkLabs Hindi ASR Benchmark — fleurs
          type: SkunkWorkLabs/hindi-asr-benchmark
          config: fleurs
          split: eval
        metrics:
          - type: wer
            value: 17.29
          - type: cer
            value: 7.20
      - task:
          type: automatic-speech-recognition
        dataset:
          name: SkunkWorkLabs Hindi ASR Benchmark — indictts
          type: SkunkWorkLabs/hindi-asr-benchmark
          config: indictts
          split: eval
        metrics:
          - type: wer
            value: 9.75
          - type: cer
            value: 2.75
      - task:
          type: automatic-speech-recognition
        dataset:
          name: SkunkWorkLabs Hindi ASR Benchmark — mucs
          type: SkunkWorkLabs/hindi-asr-benchmark
          config: mucs
          split: eval
        metrics:
          - type: wer
            value: 24.60
          - type: cer
            value: 10.75
---

# Varuna STT 🌊

**Varuna STT** is a 0.6B-parameter Hindi automatic speech recognition (ASR) model
fine-tuned from NVIDIA's [`nemotron-speech-streaming-en-0.6b`](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b)
base on a curated mix of Hindi speech corpora. It emits written-form Hindi text
directly from the acoustic signal: digits, ordinals (`1st`/`3rd`), Indian
numbering (lakh/crore comma placement), and punctuation (the Devanagari danda
`।` plus `,`, `?`, `!`). The output can go straight into voicebot / IVR /
transcription pipelines without a separate inverse text normalization (ITN)
postprocessor.

- **Architecture:** Conformer encoder + RNN-T decoder (NeMo `EncDecRNNTBPEModel`)
- **Parameters:** 0.6 B
- **Language:** Hindi (`hi`)
- **Sample rate:** 16 kHz mono
- **Output style:** Inverse-Text-Normalized (ITN) — digits, ordinals, punctuation
- **License:** SkunkWorks Modified MIT (see `LICENSE`)

## ⚡ Inference speed (NVIDIA H100 PCIe)

Measured on 20 sample clips from the Kathbath validation split (~5 s mean clip duration), with batch size 1 and `greedy_batch` RNN-T decoding:

| Metric | Value |
|---|---|
| **RTFx** | **25.13×** |
| Mean per-clip latency | 208 ms |
| p50 latency | 175 ms |
| p90 latency | 362 ms |

(RTFx = audio_seconds / wall_seconds. 25× means 1 hour of audio is transcribed in ~2.4 minutes on a single H100.)
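
The RTFx definition above can be sketched as follows. This is a minimal illustration, not the script used to produce the table: `summarize`, `time_clips`, and `nearest_rank` are hypothetical helper names, and the nearest-rank percentile is an assumption, since the card does not say which percentile definition was used.

```python
import math
import time
from typing import Callable, Sequence


def nearest_rank(sorted_values: Sequence[float], percentile: float) -> float:
    """Nearest-rank percentile of an ascending list (assumed definition)."""
    rank = max(1, math.ceil(percentile / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(latencies_s: Sequence[float], durations_s: Sequence[float]) -> dict:
    """Turn per-clip wall times and audio durations into RTFx and latency stats."""
    wall_seconds = sum(latencies_s)
    ordered = sorted(latencies_s)
    return {
        # RTFx = audio_seconds / wall_seconds, as defined in the card.
        "rtfx": sum(durations_s) / wall_seconds,
        "mean_ms": 1000 * wall_seconds / len(ordered),
        "p50_ms": 1000 * nearest_rank(ordered, 50),
        "p90_ms": 1000 * nearest_rank(ordered, 90),
    }


def time_clips(transcribe_one: Callable[[str], object], paths: Sequence[str]) -> list:
    """Time one clip per call (batch size 1); `transcribe_one` is any callable."""
    latencies = []
    for path in paths:
        start = time.perf_counter()
        transcribe_one(path)
        latencies.append(time.perf_counter() - start)
    return latencies
```

With the reported RTFx of 25.13, one hour of audio takes about 3600 / 25.13 ≈ 143 s of wall time, which is the ~2.4 minutes quoted above.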

## 📊 Benchmark — Vistaar-style normalized WER % / CER %

Evaluated on six Hindi held-out subsets from the
[`SkunkWorkLabs/hindi-asr-benchmark`](https://huggingface.co/datasets/SkunkWorkLabs/hindi-asr-benchmark) dataset.
References and hypotheses both pass through the same Vistaar-style normalizer
([Bhogale et al., Interspeech 2023](https://www.isca-archive.org/interspeech_2023/bhogale23_interspeech.pdf))
plus digit / ordinal expansion, so all systems are compared in a style-neutral way.
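
As a rough illustration of how such scores are computed, the sketch below normalizes both sides and then divides edit distance by reference length. It is a simplified stand-in, not the Vistaar implementation: `normalize` only applies Unicode NFC and strips punctuation (it omits IndicNormalizer and digit / ordinal expansion), the corpus-level aggregation is an assumption, and the function names are hypothetical.

```python
import unicodedata
from typing import Sequence


def normalize(text: str) -> str:
    """Simplified normalizer: NFC, punctuation to spaces, collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    # Unicode categories starting with "P" cover `,`, `?`, `!` and the danda `।`.
    kept = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in text
    )
    return " ".join(kept.split())


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (substitutions, insertions, deletions all cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


def error_rate(refs: Sequence[str], hyps: Sequence[str], unit: str = "word") -> float:
    """Corpus-level WER (unit="word") or CER (unit="char"), in percent."""
    errors = total = 0
    for ref, hyp in zip(refs, hyps):
        r, h = normalize(ref), normalize(hyp)
        if unit == "word":
            r, h = r.split(), h.split()
        # For unit="char" the strings are compared code point by code point.
        errors += edit_distance(r, h)
        total += len(r)
    if total == 0:
        raise ValueError("references are empty after normalization")
    return 100.0 * errors / total
```

Because the same `normalize` runs on references and hypotheses, a system that prints `।` and one that omits it are scored identically, which is the point of the style-neutral comparison.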

### WER %

| Subset | n | **Varuna STT** | ElevenLabs Scribe v1 | Deepgram Nova-2 | Sarvam Saarika v2.5 |
|---|---|---|---|---|---|
| **indictts**       | 98    | **9.75 🥇** | 13.20 | 15.41 | 14.71 |
| **fleurs (test)**  | 417   | 17.29       | **11.93** | 21.22 | 15.74 |
| **kathbath**       | 1,929 | 16.82       | **13.32** | 20.55 | 16.62 |
| **kathbath_noisy** | 1,929 | 19.06       | **13.16** | 21.98 | 17.75 |
| **commonvoice**    | 1,727 | 24.16       | **17.02** | 28.34 | 19.32 |
| **mucs**           | 3,897 | 24.60       | **10.97** | 20.54 | 12.72 |

### CER %

| Subset | **Varuna STT** | ElevenLabs Scribe v1 | Deepgram Nova-2 | Sarvam Saarika v2.5 |
|---|---|---|---|---|
| **indictts**       | **2.75 🥇** | 4.16 | 8.53 | 6.51 |
| **fleurs (test)**  | 7.20        | **5.68** | 16.74 | 7.08 |
| **kathbath**       | **6.36 🥇** | 6.50 | 13.53 | 7.42 |
| **kathbath_noisy** | 8.00        | **5.87** | 14.75 | 7.82 |
| **commonvoice**    | 10.72       | **8.96** | 20.25 | 9.87 |
| **mucs**           | 10.75       | **3.94** | 9.94 | 4.79 |

Varuna leads on `indictts` (both metrics) and narrowly leads on `kathbath` CER (6.36 vs 6.50). On every other subset and metric ElevenLabs Scribe v1 is ahead, and Varuna has the most headroom on the conversational / codec-degraded subsets (`commonvoice`, `mucs`).

## 🚀 Usage

```python
from inference import VarunaSTT

model = VarunaSTT()                                    # auto-picks GPU if available
texts = model.transcribe(["clip1.wav", "clip2.wav"])   # 16 kHz mono
for t in texts:
    print(t)
```

CLI:
```bash
python inference.py --audio path/to/clip.wav
```

You'll need:
- `nemo_toolkit[asr]>=2.4`
- `omegaconf`, `torch`, `soundfile`
- The base `nemotron-speech-streaming-en-0.6b.nemo` file (download separately
  from [`nvidia/nemotron-speech-streaming-en-0.6b`](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b))

Files in this repo:
- `varuna.ckpt` — fine-tuned weights
- `tokenizer.model`, `tokenizer.vocab`, `vocab.txt` — bilingual EN-1024 / HI-512 BPE tokenizer
- `inference.py` — minimal inference example

## 🛠 Training

Fine-tuned from **NVIDIA `nemotron-speech-streaming-en-0.6b`** using the NeMo
ASR framework. Hindi training mix:

| Source | Approx. hours |
|---|---|
| Shrutilipi (Hindi)  | ~1,500 |
| IndicVoices (Hindi) | ~1,000 |
| Kathbath (Hindi)    | ~137 |
| IndicVoices-R       | ~150 |
| Gramvaani           | ~100 |
| Vaani               | ~50 |
| Lahaja              | ~30 |
| IndicTTS            | ~30 |
| Short-form domain   | ~20 |

All Hindi training labels were ITN-normalized (digits, ordinals, `।`/`,` punctuation,
Indian-numbering commas) using a Gemma 4 normalization pass following NVIDIA Riva
Hindi ITN conventions.

The tokenizer is a bilingual BPE model with 1,024 English and 512 Hindi sub-word
tokens (1,536 total), shared across languages. Training used RNN-T loss with
SpecAugment and mixed precision (bf16) on NVIDIA H100s.

## 📋 Output convention

Varuna emits **ITN-style** Hindi:

| Spoken | Output |
|---|---|
| `पाँच सौ` (five hundred) | `500` |
| `दो लाख पचास हजार` (two lakh fifty thousand) | `2,50,000` |
| `तीन करोड़` (three crore) | `3,00,00,000` |
| `पहला` (first) | `1st` |
| `तीसरा` (third) | `3rd` |
| End of sentence | `।` |

This is what voicebot / IVR / call-center products typically want. If your
downstream consumer expects spelled-out Devanagari, post-process the model
output with a reverse-ITN (text normalization) step. At benchmark time we use a
Vistaar-style normalizer (strip punctuation + IndicNormalizer NFC/NFD +
digit/ordinal expansion); see
[AI4Bharat/vistaar/evaluation.py](https://github.com/AI4Bharat/vistaar/blob/master/evaluation.py)
for the reference implementation.
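
The Indian-numbering comma placement shown in the table can be sketched in a few lines. This only illustrates the grouping rule (the last three digits, then pairs); `format_indian` is a hypothetical helper and is not part of the model or the training pipeline.

```python
def format_indian(n: int) -> str:
    """Group digits Indian-style: last three digits, then pairs (lakh, crore)."""
    sign = "-" if n < 0 else ""
    digits = str(abs(n))
    if len(digits) <= 3:
        return sign + digits
    head, tail = digits[:-3], digits[-3:]
    groups = []
    while head:
        # Peel two digits at a time off the right of the remaining prefix.
        groups.insert(0, head[-2:])
        head = head[:-2]
    return sign + ",".join(groups + [tail])


print(format_indian(250000))    # 2,50,000
print(format_indian(30000000))  # 3,00,00,000
```

A helper like this is handy when building expected strings for tests against the model's ITN output.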

## ⚠️ Limitations

- **Code-switching not supported yet.** Varuna is trained on monolingual Hindi
  audio. Inputs that mix English words mid-sentence (e.g., conversational
  Hindi-English) may produce transliteration artifacts or substitutions. A
  bilingual fine-tune is on the roadmap.
- **Codec-degraded audio.** Performance on telephony / heavily compressed audio
  (e.g., MUCS subset) is weaker than on studio-clean speech (CER 10.75 % vs
  2.75 % on IndicTTS). Codec-augmentation training is planned.
- **Audio format.** Expects 16 kHz mono. Other sample rates need resampling
  upstream.
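
A cheap guard for the audio-format requirement is to inspect the WAV header before transcription. The sketch below uses only Python's standard `wave` module, which reads PCM WAV files; `wav_format_problems` is a hypothetical helper, and it reports mismatches rather than resampling.

```python
import wave

EXPECTED_RATE = 16_000   # Hz, per the model card
EXPECTED_CHANNELS = 1    # mono


def wav_format_problems(path: str) -> list:
    """Return human-readable mismatches; an empty list means 16 kHz mono."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    problems = []
    if rate != EXPECTED_RATE:
        problems.append(f"sample rate is {rate} Hz, expected {EXPECTED_RATE} Hz")
    if channels != EXPECTED_CHANNELS:
        problems.append(f"{channels} channels, expected {EXPECTED_CHANNELS}")
    return problems
```

Files that fail the check can be resampled and downmixed with an audio tool of your choice before being passed to the model.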

## 🔗 Links

- 📊 **Benchmark dataset:** [`SkunkWorkLabs/hindi-asr-benchmark`](https://huggingface.co/datasets/SkunkWorkLabs/hindi-asr-benchmark) — 6 Hindi subsets with embedded audio + outputs from Varuna and 3 commercial systems.
- 🧪 **Vistaar normalizer reference:** [AI4Bharat/vistaar](https://github.com/AI4Bharat/vistaar)
- 🛠 **Base model:** [nvidia/nemotron-speech-streaming-en-0.6b](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b)

## 📬 Contact

Need help with the **training recipe** or want to **fine-tune Varuna** on
your own data? Reach out: **harshris2314@gmail.com**.

## 📝 Citation

If you use Varuna STT in research or production, please cite:

```bibtex
@misc{skunkworks-varuna-stt-2026,
  title = {Varuna STT: A Hindi ASR model fine-tuned from NVIDIA NeMo Nemotron},
  author = {SkunkWorks Labs},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/SkunkWorkLabs/varuna-stt}
}
```