Varuna STT 🌊

Varuna STT is a 0.6B-parameter Hindi automatic speech recognition (ASR) model fine-tuned from NVIDIA's nemotron-speech-streaming-en-0.6b base on a curated mix of Hindi speech corpora. It outputs natural-style Hindi text — digits, ordinals (1st/3rd), Indian numbering (lakh/crore comma placement), and Devanagari punctuation (।, ?, !) — directly from the acoustic signal, ready to drop into voicebot / IVR / transcription pipelines without a separate ITN postprocessor.

  • Architecture: Conformer encoder + RNN-T decoder (NeMo EncDecRNNTBPEModel)
  • Parameters: 0.6 B
  • Language: Hindi (hi)
  • Sample rate: 16 kHz mono
  • Output style: Inverse-Text-Normalized (ITN) — digits, ordinals, punctuation
  • License: SkunkWorks Modified MIT (see LICENSE)

⚡ Inference speed (NVIDIA H100 PCIe)

Measured on 20 sample clips from Kathbath val (~5 s mean clip duration), batch=1, greedy_batch RNN-T decoding:

| Metric | Value |
| --- | --- |
| RTFx | 25.13× |
| Mean per-clip latency | 208 ms |
| p50 latency | 175 ms |
| p90 latency | 362 ms |

(RTFx = audio_seconds / wall_seconds. 25× means 1 hour of audio is transcribed in ~2.4 minutes on a single H100.)
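The numbers above can be reproduced from per-clip timings with a few lines of Python. This is a sketch: the exact percentile convention used for the table is not specified, so nearest-rank interpolation is assumed here.

```python
import statistics


def speed_summary(audio_secs, wall_secs):
    """RTFx plus mean/p50/p90 latency (ms) from per-clip timings."""
    rtfx = sum(audio_secs) / sum(wall_secs)
    lat = sorted(t * 1000 for t in wall_secs)  # per-clip latency in ms

    def pct(p):
        # nearest-rank percentile over the sorted latencies
        return lat[min(len(lat) - 1, round(p / 100 * (len(lat) - 1)))]

    return {"rtfx": rtfx, "mean_ms": statistics.mean(lat),
            "p50_ms": pct(50), "p90_ms": pct(90)}
```

Feed it one `audio_secs` / `wall_secs` entry per clip; with batch=1 the wall time per clip is also the latency.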

📊 Benchmark — Vistaar-style normalized WER % / CER %

Evaluated on six Hindi held-out subsets from the SkunkWorkLabs/hindi-asr-benchmark dataset. References and hypotheses both pass through the same Vistaar-style normalizer (Bhogale et al., Interspeech 2023) plus digit / ordinal expansion, so all systems are compared in a style-neutral way.
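As an illustration of the protocol, a minimal normalized-WER computation might look like the following. The normalizer here is a toy stand-in: the real Vistaar-style pass also performs digit/ordinal expansion and IndicNormalizer-based Unicode cleanup, which this sketch omits.

```python
import re
import unicodedata


def normalize(text):
    """Toy stand-in for the Vistaar-style normalizer: NFC + strip punctuation."""
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"[।,?!.]", " ", text)
    return " ".join(text.split())


def wer(ref, hyp):
    """Word error rate between normalized reference and hypothesis."""
    r, h = normalize(ref).split(), normalize(hyp).split()
    # single-row Levenshtein distance over word sequences
    d = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(h, 1):
            prev, d[j] = d[j], min(d[j] + 1,            # deletion
                                   d[j - 1] + 1,        # insertion
                                   prev + (rw != hw))   # substitution / match
    return d[-1] / max(len(r), 1)
```

Because both sides pass through the same `normalize`, a hypothesis that differs only in punctuation scores 0.0.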

WER %

| Subset | n | Varuna STT | ElevenLabs Scribe v1 | Deepgram Nova-2 | Sarvam Saarika v2.5 |
| --- | --- | --- | --- | --- | --- |
| indictts | 98 | 9.75 🥇 | 13.20 | 15.41 | 14.71 |
| fleurs (test) | 417 | 17.29 | 11.93 | 21.22 | 15.74 |
| kathbath | 1,929 | 16.82 | 13.32 | 20.55 | 16.62 |
| kathbath_noisy | 1,929 | 19.06 | 13.16 | 21.98 | 17.75 |
| commonvoice | 1,727 | 24.16 | 17.02 | 28.34 | 19.32 |
| mucs | 3,897 | 24.60 | 10.97 | 20.54 | 12.72 |

CER %

| Subset | Varuna STT | ElevenLabs Scribe v1 | Deepgram Nova-2 | Sarvam Saarika v2.5 |
| --- | --- | --- | --- | --- |
| indictts | 2.75 🥇 | 4.16 | 8.53 | 6.51 |
| fleurs (test) | 7.20 | 5.68 | 16.74 | 7.08 |
| kathbath | 6.36 🥇 | 6.50 | 13.53 | 7.42 |
| kathbath_noisy | 8.00 | 5.87 | 14.75 | 7.82 |
| commonvoice | 10.72 | 8.96 | 20.25 | 9.87 |
| mucs | 10.75 | 3.94 | 9.94 | 4.79 |

Varuna leads on indictts (both metrics) and on kathbath CER. It has more headroom on conversational / codec-degraded subsets (commonvoice, mucs).

🚀 Usage

```python
from inference import VarunaSTT

model = VarunaSTT()                                    # auto-picks GPU if available
texts = model.transcribe(["clip1.wav", "clip2.wav"])   # 16 kHz mono
for t in texts:
    print(t)
```

CLI:

```shell
python inference.py --audio path/to/clip.wav
```

Files in this repo:

  • varuna.ckpt — fine-tuned weights
  • tokenizer.model, tokenizer.vocab, vocab.txt — bilingual EN-1024 / HI-512 BPE tokenizer
  • inference.py — minimal inference example

🛠 Training

Fine-tuned from NVIDIA nemotron-speech-streaming-en-0.6b using the NeMo ASR framework. Hindi training mix:

| Source | Approx. hours |
| --- | --- |
| Shrutilipi (Hindi) | ~1,500 |
| IndicVoices (Hindi) | ~1,000 |
| IndicVoices-R | ~150 |
| Kathbath (Hindi) | ~137 |
| Gramvaani | ~100 |
| Vaani | ~50 |
| Lahaja | ~30 |
| IndicTTS | ~30 |
| Short-form domain | ~20 |

All Hindi training labels were ITN-normalized (digits, ordinals, punctuation, Indian-numbering commas) using a Gemma 4 normalization pass following NVIDIA Riva Hindi ITN conventions.

Bilingual EN-1024 + HI-512 BPE tokenizer (1,536 sub-word tokens total) shared across languages. RNN-T loss with SpecAugment, mixed-precision (bf16) training on NVIDIA H100s.

📋 Output convention

Varuna emits ITN-style Hindi:

| Spoken | Output |
| --- | --- |
| पाँच सौ (five hundred) | 500 |
| दो लाख पचास हजार (two lakh fifty thousand) | 2,50,000 |
| तीन करोड़ (three crore) | 3,00,00,000 |
| पहला (first) | 1st |
| तीसरा (third) | 3rd |
| end of sentence | । |
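The lakh/crore comma placement above follows the Indian grouping convention — last three digits, then groups of two — which can be sketched as (`indian_format` is an illustrative helper, not part of this repo):

```python
def indian_format(n: int) -> str:
    """Group a non-negative integer Indian-style: last 3 digits, then pairs."""
    s = str(n)
    if len(s) <= 3:
        return s
    head, tail = s[:-3], s[-3:]     # tail keeps the final 3-digit group
    parts = []
    while len(head) > 2:            # peel 2-digit groups off the head
        parts.insert(0, head[-2:])
        head = head[:-2]
    parts.insert(0, head)
    return ",".join(parts + [tail])
```

For example, `indian_format(250000)` gives `"2,50,000"` and `indian_format(30000000)` gives `"3,00,00,000"`, matching the rows above.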

This is what voicebot / IVR / call-center products typically want. If your downstream consumer expects spelled-out Devanagari, post-process the model output with a reverse-ITN. We use a Vistaar-style normalizer at benchmark time (strip punctuation + IndicNormalizer NFC/NFD + digit/ordinal expansion) — see AI4Bharat/vistaar/evaluation.py for the reference implementation.

⚠️ Limitations

  • Code-switching not supported yet. Varuna is trained on monolingual Hindi audio. Inputs that mix English words mid-sentence (e.g., conversational Hindi-English) may produce transliteration artifacts or substitutions. A bilingual fine-tune is on the roadmap.
  • Codec-degraded audio. Performance on telephony / heavily compressed audio (e.g., MUCS subset) is weaker than on studio-clean speech (CER 10.75 % vs 2.75 % on IndicTTS). Codec-augmentation training is planned.
  • Audio format. Expects 16 kHz mono. Other sample rates need resampling upstream.
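If your audio arrives at a different rate or with multiple channels, a dependency-free linear-interpolation resampler is enough for a quick test. This is a sketch, not part of the repo; for production quality, prefer a proper polyphase resampler such as torchaudio's or librosa's.

```python
def to_16k_mono(channels, sr, target_sr=16000):
    """Average channel lists to mono, then linearly resample to target_sr.

    channels: list of per-channel sample lists of equal length.
    """
    # downmix: average corresponding samples across channels
    mono = [sum(frame) / len(frame) for frame in zip(*channels)]
    if sr == target_sr:
        return mono
    n_out = int(len(mono) * target_sr / sr)
    out = []
    for i in range(n_out):
        # fractional position of output sample i in the input signal
        pos = i * (len(mono) - 1) / max(n_out - 1, 1)
        lo = int(pos)
        hi = min(lo + 1, len(mono) - 1)
        frac = pos - lo
        out.append(mono[lo] * (1 - frac) + mono[hi] * frac)
    return out
```

For example, 441 samples at 44.1 kHz come out as 160 samples at 16 kHz, ready to write to a mono WAV for `inference.py`.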

📬 Contact

Need help with the training recipe or want to fine-tune Varuna on your own data? Reach out: harshris2314@gmail.com.

📝 Citation

If you use Varuna STT in research or production, please cite:

@misc{skunkworks-varuna-stt-2026,
  title = {Varuna STT: A Hindi ASR model fine-tuned from NVIDIA NeMo Nemotron},
  author = {SkunkWorks Labs},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/SkunkWorkLabs/varuna-stt}
}
