metadata
language:
  - hi
license: other
license_name: skunkworks-modified-mit
license_link: LICENSE
pretty_name: Varuna STT
library_name: nemo
tags:
  - automatic-speech-recognition
  - hindi
  - asr
  - speech
  - conformer
  - rnnt
  - nemo
  - varuna
pipeline_tag: automatic-speech-recognition
base_model: nvidia/nemotron-speech-streaming-en-0.6b
metrics:
  - wer
  - cer
model-index:
  - name: Varuna STT
    results:
      - task:
          type: automatic-speech-recognition
        dataset:
          name: SkunkWorkLabs Hindi ASR Benchmark (kathbath)
          type: SkunkWorkLabs/hindi-asr-benchmark
          config: kathbath
          split: eval
        metrics:
          - type: wer
            value: 16.82
          - type: cer
            value: 6.36
      - task:
          type: automatic-speech-recognition
        dataset:
          name: SkunkWorkLabs Hindi ASR Benchmark (kathbath_noisy)
          type: SkunkWorkLabs/hindi-asr-benchmark
          config: kathbath_noisy
          split: eval
        metrics:
          - type: wer
            value: 19.06
          - type: cer
            value: 8
      - task:
          type: automatic-speech-recognition
        dataset:
          name: SkunkWorkLabs Hindi ASR Benchmark (commonvoice)
          type: SkunkWorkLabs/hindi-asr-benchmark
          config: commonvoice
          split: eval
        metrics:
          - type: wer
            value: 24.16
          - type: cer
            value: 10.72
      - task:
          type: automatic-speech-recognition
        dataset:
          name: SkunkWorkLabs Hindi ASR Benchmark (fleurs)
          type: SkunkWorkLabs/hindi-asr-benchmark
          config: fleurs
          split: eval
        metrics:
          - type: wer
            value: 17.29
          - type: cer
            value: 7.2
      - task:
          type: automatic-speech-recognition
        dataset:
          name: SkunkWorkLabs Hindi ASR Benchmark (indictts)
          type: SkunkWorkLabs/hindi-asr-benchmark
          config: indictts
          split: eval
        metrics:
          - type: wer
            value: 9.75
          - type: cer
            value: 2.75
      - task:
          type: automatic-speech-recognition
        dataset:
          name: SkunkWorkLabs Hindi ASR Benchmark (mucs)
          type: SkunkWorkLabs/hindi-asr-benchmark
          config: mucs
          split: eval
        metrics:
          - type: wer
            value: 24.6
          - type: cer
            value: 10.75

Varuna STT 🌊

Varuna STT is a 0.6B-parameter Hindi automatic speech recognition (ASR) model, fine-tuned from NVIDIA's nemotron-speech-streaming-en-0.6b base on a curated mix of Hindi speech corpora. It emits natural-style Hindi text directly from the acoustic signal, including digits, ordinals (1st/3rd), Indian numbering (lakh/crore comma placement), and punctuation (।, ,, ?, !), so the output can drop into voicebot / IVR / transcription pipelines without a separate inverse text normalization (ITN) postprocessor.

  • Architecture: Conformer encoder + RNN-T decoder (NeMo EncDecRNNTBPEModel)
  • Parameters: 0.6 B
  • Language: Hindi (hi)
  • Sample rate: 16 kHz mono
  • Output style: Inverse-Text-Normalized (ITN) — digits, ordinals, punctuation
  • License: SkunkWorks Modified MIT (see LICENSE)

⚡ Inference speed (NVIDIA H100 PCIe)

Measured on 20 sample clips from Kathbath val (~5 s mean clip duration), batch=1, greedy_batch RNN-T decoding:

| Metric | Value |
| --- | --- |
| RTFx | 25.13× |
| Mean per-clip latency | 208 ms |
| p50 latency | 175 ms |
| p90 latency | 362 ms |

(RTFx = audio_seconds / wall_seconds. 25× means 1 hour of audio is transcribed in ~2.4 minutes on a single H100.)
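The metrics in the table can be reproduced with a small timing harness. The sketch below is an illustration and not the script used for the numbers above: `benchmark`, `percentile`, and the `transcribe_one` callable are hypothetical names, percentiles use the nearest-rank method (an assumption), and no warm-up pass or GPU synchronization is included.

```python
import math
import time
from typing import Callable, Dict, Sequence, Tuple


def percentile(values: Sequence[float], q: float) -> float:
    """Nearest-rank percentile (q in 0..100) of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def benchmark(transcribe_one: Callable[[str], str],
              clips: Sequence[Tuple[str, float]]) -> Dict[str, float]:
    """Time batch=1 transcription over (path, duration_seconds) pairs."""
    latencies = []
    for path, _ in clips:
        start = time.perf_counter()
        transcribe_one(path)  # result is discarded; only timing matters here
        latencies.append(time.perf_counter() - start)

    audio_seconds = sum(duration for _, duration in clips)
    wall_seconds = sum(latencies)
    return {
        "rtfx": audio_seconds / wall_seconds,  # RTFx = audio_seconds / wall_seconds
        "mean_ms": 1000 * wall_seconds / len(latencies),
        "p50_ms": 1000 * percentile(latencies, 50),
        "p90_ms": 1000 * percentile(latencies, 90),
    }
```

With the wrapper from the Usage section, `transcribe_one` could be something like `lambda p: model.transcribe([p])[0]`. Because RTFx divides total audio by total wall time, it need not equal the mean clip duration divided by the mean latency when clip lengths vary.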

📊 Benchmark — Vistaar-style normalized WER % / CER %

Evaluated on six Hindi held-out subsets from the SkunkWorkLabs/hindi-asr-benchmark dataset. References and hypotheses both pass through the same Vistaar-style normalizer (Bhogale et al., Interspeech 2023) plus digit / ordinal expansion, so all systems are compared in a style-neutral way.

WER %

| Subset | n | Varuna STT | ElevenLabs Scribe v1 | Deepgram Nova-2 | Sarvam Saarika v2.5 |
| --- | --- | --- | --- | --- | --- |
| indictts | 98 | 9.75 🥇 | 13.20 | 15.41 | 14.71 |
| fleurs (test) | 417 | 17.29 | 11.93 | 21.22 | 15.74 |
| kathbath | 1,929 | 16.82 | 13.32 | 20.55 | 16.62 |
| kathbath_noisy | 1,929 | 19.06 | 13.16 | 21.98 | 17.75 |
| commonvoice | 1,727 | 24.16 | 17.02 | 28.34 | 19.32 |
| mucs | 3,897 | 24.60 | 10.97 | 20.54 | 12.72 |

CER %

| Subset | Varuna STT | ElevenLabs Scribe v1 | Deepgram Nova-2 | Sarvam Saarika v2.5 |
| --- | --- | --- | --- | --- |
| indictts | 2.75 🥇 | 4.16 | 8.53 | 6.51 |
| fleurs (test) | 7.20 | 5.68 | 16.74 | 7.08 |
| kathbath | 6.36 🥇 | 6.50 | 13.53 | 7.42 |
| kathbath_noisy | 8.00 | 5.87 | 14.75 | 7.82 |
| commonvoice | 10.72 | 8.96 | 20.25 | 9.87 |
| mucs | 10.75 | 3.94 | 9.94 | 4.79 |

Varuna leads on indictts (both metrics) and on kathbath CER (6.36 % vs. 6.50 % for ElevenLabs Scribe v1). It trails the best system on the remaining subsets, with the widest WER gaps on the conversational / codec-degraded ones (commonvoice, mucs).
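To score your own transcripts in a comparable way, WER and CER can be computed as edit distance over words or characters after normalizing both sides identically. The sketch below is a simplified stand-in, not the Vistaar normalizer: `normalize`, `edit_distance`, and `error_rate` are hypothetical helpers, digit / ordinal expansion is omitted, and counting spaces as characters in CER is an assumption.

```python
import unicodedata
from typing import List, Sequence


def normalize(text: str) -> str:
    """Simplified normalizer: NFC, drop punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    # Unicode category "P*" covers ASCII punctuation and the Devanagari danda.
    kept = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    return " ".join(kept.split())


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance: substitutions + deletions + insertions."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


def error_rate(refs: List[str], hyps: List[str], unit: str = "word") -> float:
    """Corpus-level WER (unit="word") or CER (unit="char"), in percent."""
    errors = total = 0
    for ref, hyp in zip(refs, hyps):
        ref_norm, hyp_norm = normalize(ref), normalize(hyp)
        if unit == "word":
            ref_units, hyp_units = ref_norm.split(), hyp_norm.split()
        else:
            ref_units, hyp_units = list(ref_norm), list(hyp_norm)
        errors += edit_distance(ref_units, hyp_units)
        total += len(ref_units)
    return 100.0 * errors / total
```

Errors and reference lengths are summed over the whole subset before dividing, so long utterances weigh more than short ones. Averaging per-utterance rates instead would give different numbers.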

🚀 Usage

from inference import VarunaSTT

model = VarunaSTT()                                    # auto-picks GPU if available
texts = model.transcribe(["clip1.wav", "clip2.wav"])   # 16 kHz mono
for t in texts: print(t)

CLI:

python inference.py --audio path/to/clip.wav

You'll need:

Files in this repo:

  • varuna.ckpt — fine-tuned weights
  • tokenizer.model, tokenizer.vocab, vocab.txt — bilingual EN-1024 / HI-512 BPE tokenizer
  • inference.py — minimal inference example

🛠 Training

Fine-tuned from NVIDIA nemotron-speech-streaming-en-0.6b using the NeMo ASR framework. Hindi training mix:

| Source | Approx. hours |
| --- | --- |
| Shrutilipi (Hindi) | ~1,500 |
| IndicVoices (Hindi) | ~1,000 |
| Kathbath (Hindi) | ~137 |
| IndicVoices-R | ~150 |
| Gramvaani | ~100 |
| Vaani | ~50 |
| Lahaja | ~30 |
| IndicTTS | ~30 |
| Short-form domain | ~20 |

All Hindi training labels were ITN-normalized (digits, ordinals, /, punctuation, Indian-numbering commas) using a Gemma 4 normalization pass following NVIDIA Riva Hindi ITN conventions.

The tokenizer is a bilingual EN-1024 + HI-512 BPE model (1,536 sub-word tokens total) shared across languages. Training used RNN-T loss with SpecAugment and mixed-precision (bf16) on NVIDIA H100s.

📋 Output convention

Varuna emits ITN-style Hindi:

| spoken | output |
| --- | --- |
| पाँच सौ (five hundred) | 500 |
| दो लाख पचास हजार | 2,50,000 |
| तीन करोड़ | 3,00,00,000 |
| पहला (first) | 1st |
| तीसरा | 3rd |
| End of sentence | । |

This is what voicebot / IVR / call-center products typically want. If your downstream consumer expects spelled-out Devanagari, post-process the model output with a reverse-ITN step that expands digits and ordinals back into words. At benchmark time we use a Vistaar-style normalizer (strip punctuation + IndicNormalizer NFC/NFD + digit/ordinal expansion); see AI4Bharat/vistaar/evaluation.py for the reference implementation.
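The Indian-numbering comma placement used in the output convention groups the last three digits together and then every two digits to the left (thousand, lakh, crore). A minimal sketch of that grouping rule follows; `format_indian` is a hypothetical helper for checking or generating such strings, not part of this repo.

```python
def format_indian(number: int) -> str:
    """Group digits Indian-style: the last three digits, then pairs."""
    sign = "-" if number < 0 else ""
    digits = str(abs(number))
    if len(digits) <= 3:
        return sign + digits

    head, tail = digits[:-3], digits[-3:]
    groups = []
    while head:
        groups.insert(0, head[-2:])  # peel two digits at a time from the right
        head = head[:-2]
    return sign + ",".join(groups + [tail])


print(format_indian(250000))    # 2,50,000
print(format_indian(30000000))  # 3,00,00,000
```

Stripping the commas (`"2,50,000".replace(",", "")`) recovers the plain integer string if a downstream system expects ungrouped digits.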

⚠️ Limitations

  • Code-switching not supported yet. Varuna is trained on monolingual Hindi audio. Inputs that mix English words mid-sentence (e.g., conversational Hindi-English) may produce transliteration artifacts or substitutions. A bilingual fine-tune is on the roadmap.
  • Codec-degraded audio. Performance on telephony / heavily compressed audio (e.g., MUCS subset) is weaker than on studio-clean speech (CER 10.75 % vs 2.75 % on IndicTTS). Codec-augmentation training is planned.
  • Audio format. Expects 16 kHz mono. Other sample rates need resampling upstream.
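Since the model expects 16 kHz mono input, it can help to verify files before transcription. The sketch below uses only the standard-library `wave` module and assumes uncompressed PCM WAV input; `check_wav` is a hypothetical helper, and it only reports mismatches rather than resampling.

```python
import wave
from typing import List


def check_wav(path: str, expected_rate: int = 16000) -> List[str]:
    """Return a list of format problems; an empty list means 16 kHz mono."""
    problems = []
    # wave.open raises wave.Error for non-PCM or malformed files.
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    if rate != expected_rate:
        problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
    if channels != 1:
        problems.append(f"{channels} channels, expected mono")
    return problems
```

Files that fail the check need to be resampled or downmixed upstream with an audio tool of your choice before being passed to the model.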

🔗 Links

📬 Contact

Need help with the training recipe or want to fine-tune Varuna on your own data? Reach out: harshris2314@gmail.com.

📝 Citation

If you use Varuna STT in research or production, please cite:

@misc{skunkworks-varuna-stt-2026,
  title = {Varuna STT: A Hindi ASR model fine-tuned from NVIDIA NeMo Nemotron},
  author = {SkunkWorks Labs},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/SkunkWorkLabs/varuna-stt}
}