language:
  - hi
license: other
license_name: skunkworks-modified-mit
license_link: LICENSE
pretty_name: Varuna STT
library_name: nemo
tags:
  - automatic-speech-recognition
  - hindi
  - asr
  - speech
  - conformer
  - rnnt
  - nemo
  - varuna
pipeline_tag: automatic-speech-recognition
base_model: nvidia/nemotron-speech-streaming-en-0.6b
metrics:
  - wer
  - cer
model-index:
  - name: Varuna STT
    results:
      - task:
          type: automatic-speech-recognition
        dataset:
          name: SkunkWorkLabs Hindi ASR Benchmark — kathbath
          type: SkunkWorkLabs/hindi-asr-benchmark
          config: kathbath
          split: eval
        metrics:
          - type: wer
            value: 16.82
          - type: cer
            value: 6.36
      - task:
          type: automatic-speech-recognition
        dataset:
          name: SkunkWorkLabs Hindi ASR Benchmark — kathbath_noisy
          type: SkunkWorkLabs/hindi-asr-benchmark
          config: kathbath_noisy
          split: eval
        metrics:
          - type: wer
            value: 19.06
          - type: cer
            value: 8
      - task:
          type: automatic-speech-recognition
        dataset:
          name: SkunkWorkLabs Hindi ASR Benchmark — commonvoice
          type: SkunkWorkLabs/hindi-asr-benchmark
          config: commonvoice
          split: eval
        metrics:
          - type: wer
            value: 24.16
          - type: cer
            value: 10.72
      - task:
          type: automatic-speech-recognition
        dataset:
          name: SkunkWorkLabs Hindi ASR Benchmark — fleurs
          type: SkunkWorkLabs/hindi-asr-benchmark
          config: fleurs
          split: eval
        metrics:
          - type: wer
            value: 17.29
          - type: cer
            value: 7.2
      - task:
          type: automatic-speech-recognition
        dataset:
          name: SkunkWorkLabs Hindi ASR Benchmark — indictts
          type: SkunkWorkLabs/hindi-asr-benchmark
          config: indictts
          split: eval
        metrics:
          - type: wer
            value: 9.75
          - type: cer
            value: 2.75
      - task:
          type: automatic-speech-recognition
        dataset:
          name: SkunkWorkLabs Hindi ASR Benchmark — mucs
          type: SkunkWorkLabs/hindi-asr-benchmark
          config: mucs
          split: eval
        metrics:
          - type: wer
            value: 24.6
          - type: cer
            value: 10.75

Varuna STT 🌊

Varuna STT is a 0.6B-parameter Hindi automatic speech recognition (ASR) model fine-tuned from NVIDIA's nemotron-speech-streaming-en-0.6b base on a curated mix of Hindi speech corpora. It produces natural-style Hindi text directly from the acoustic signal, including digits, ordinals (1st/3rd), Indian numbering (lakh/crore comma placement), and punctuation (। , ? !), so the output can drop into voicebot / IVR / transcription pipelines without a separate inverse text normalization (ITN) postprocessor.

Architecture: Conformer encoder + RNN-T decoder (NeMo EncDecRNNTBPEModel)
Parameters: 0.6 B
Language: Hindi (hi)
Sample rate: 16 kHz mono
Output style: Inverse-Text-Normalized (ITN) — digits, ordinals, punctuation
License: SkunkWorks Modified MIT (see LICENSE)

⚡ Inference speed (NVIDIA H100 PCIe)

Measured on 20 sample clips from the Kathbath validation split (~5 s mean clip duration), with batch size 1 and greedy_batch RNN-T decoding:

Metric	Value
RTFx	25.13×
Mean per-clip latency	208 ms
p50 latency	175 ms
p90 latency	362 ms

(RTFx = audio_seconds / wall_seconds. 25× means 1 hour of audio is transcribed in ~2.4 minutes on a single H100.)
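The RTFx definition above can be sketched in a few lines. This is only an illustration, not the benchmarking script used for the numbers in the table: the function names are hypothetical, the sample timings are made up, and the nearest-rank percentile is an assumption, since the card does not say how p50 / p90 were computed.

```python
import math


def rtfx(audio_seconds, wall_seconds):
    """Total seconds of audio divided by total wall-clock seconds spent transcribing."""
    return sum(audio_seconds) / sum(wall_seconds)


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100] (one of several common definitions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def minutes_to_transcribe(audio_hours, rtfx_value):
    """Wall-clock minutes needed to transcribe `audio_hours` of audio at a given RTFx."""
    return audio_hours * 60 / rtfx_value


# Hypothetical per-clip measurements (seconds), for illustration only.
clip_durations = [5.0, 5.0, 5.625]
clip_latencies = [0.25, 0.25, 0.125]

print(f"RTFx: {rtfx(clip_durations, clip_latencies):.2f}x")
print(f"p50 latency: {percentile(clip_latencies, 50) * 1000:.0f} ms")
print(f"1 h of audio at 25.13x: {minutes_to_transcribe(1, 25.13):.1f} min")
```

Note that RTFx is computed here from totals rather than as a mean of per-clip ratios, so long clips weigh more than short ones.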

📊 Benchmark — Vistaar-style normalized WER % / CER %

Evaluated on six Hindi held-out subsets from the SkunkWorkLabs/hindi-asr-benchmark dataset. References and hypotheses both pass through the same Vistaar-style normalizer (Bhogale et al., Interspeech 2023) plus digit / ordinal expansion, so all systems are compared in a style-neutral way.

WER %

Subset	n	Varuna STT	ElevenLabs Scribe v1	Deepgram Nova-2	Sarvam Saarika v2.5
indictts	98	9.75 🥇	13.20	15.41	14.71
fleurs (test)	417	17.29	11.93	21.22	15.74
kathbath	1,929	16.82	13.32	20.55	16.62
kathbath_noisy	1,929	19.06	13.16	21.98	17.75
commonvoice	1,727	24.16	17.02	28.34	19.32
mucs	3,897	24.60	10.97	20.54	12.72

CER %

Subset	Varuna STT	ElevenLabs Scribe v1	Deepgram Nova-2	Sarvam Saarika v2.5
indictts	2.75 🥇	4.16	8.53	6.51
fleurs (test)	7.20	5.68	16.74	7.08
kathbath	6.36 🥇	6.50	13.53	7.42
kathbath_noisy	8.00	5.87	14.75	7.82
commonvoice	10.72	8.96	20.25	9.87
mucs	10.75	3.94	9.94	4.79

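Varuna leads on indictts (both metrics) and on kathbath CER (6.36 % vs 6.50 % for ElevenLabs Scribe v1). On every other subset and metric it trails ElevenLabs Scribe v1 and Sarvam Saarika v2.5, with the largest WER gaps on the conversational / codec-degraded subsets (commonvoice, mucs).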
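For readers who want to sanity-check scores like the ones above, here is a minimal sketch of normalized WER / CER using only the standard library. It is not the benchmark's scoring code: `normalize` is a simplified stand-in (NFC plus punctuation stripping) that omits IndicNormalizer and the digit / ordinal expansion, all function names are hypothetical, and the scores are per utterance, whereas benchmark tables typically aggregate errors over a whole subset.

```python
import unicodedata


def normalize(text):
    """Simplified normalizer: NFC, drop Unicode punctuation (including the danda), squeeze spaces."""
    text = unicodedata.normalize("NFC", text)
    kept = [ch for ch in text if not unicodedata.category(ch).startswith("P")]
    return " ".join("".join(kept).split())


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance as a percentage of the reference length."""
    if not ref_units:
        raise ValueError("reference is empty after normalization")
    return 100.0 * edit_distance(ref_units, hyp_units) / len(ref_units)


def wer(reference, hypothesis):
    """Word error rate in percent, computed on normalized text."""
    return error_rate(normalize(reference).split(), normalize(hypothesis).split())


def cer(reference, hypothesis):
    """Character error rate in percent, computed on normalized text (spaces count as characters)."""
    return error_rate(list(normalize(reference)), list(normalize(hypothesis)))
```

Because both reference and hypothesis go through the same `normalize`, a hypothesis that differs only in punctuation scores 0 % under this sketch.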

🚀 Usage

from inference import VarunaSTT

# Load the model; it picks a GPU automatically if one is available.
model = VarunaSTT()

# Inputs must be 16 kHz mono audio files.
texts = model.transcribe(["clip1.wav", "clip2.wav"])
for text in texts:
    print(text)

CLI:

python inference.py --audio path/to/clip.wav

You'll need:

nemo_toolkit[asr]>=2.4
omegaconf, torch, soundfile
The base nemotron-speech-streaming-en-0.6b.nemo file (download separately from nvidia/nemotron-speech-streaming-en-0.6b)

Files in this repo:

varuna.ckpt — fine-tuned weights
tokenizer.model, tokenizer.vocab, vocab.txt — bilingual EN-1024 / HI-512 BPE tokenizer
inference.py — minimal inference example

🛠 Training

Fine-tuned from NVIDIA nemotron-speech-streaming-en-0.6b using the NeMo ASR framework. Hindi training mix:

Source	Approx. hours
Shrutilipi (Hindi)	~1,500
IndicVoices (Hindi)	~1,000
Kathbath (Hindi)	~137
IndicVoices-R	~150
Gramvaani	~100
Vaani	~50
Lahaja	~30
IndicTTS	~30
Short-form domain	~20

All Hindi training labels were ITN-normalized (digits, ordinals, । and comma punctuation, Indian-numbering commas) using a Gemma 4 normalization pass following NVIDIA Riva Hindi ITN conventions.

The model uses a bilingual BPE tokenizer shared across languages (EN-1024 + HI-512, 1,536 sub-word tokens total). Training used RNN-T loss with SpecAugment and mixed precision (bf16) on NVIDIA H100s.

📋 Output convention

Varuna emits ITN-style Hindi:

spoken	output
`पाँच सौ` (five hundred)	`500`
`दो लाख पचास हजार`	`2,50,000`
`तीन करोड़`	`3,00,00,000`
`पहला` (first)	`1st`
`तीसरा`	`3rd`
End of sentence	`।`

This is the style that voicebot / IVR / call-center products typically want. If your downstream consumer expects spelled-out Devanagari, post-process the model output with a text-normalization (reverse-ITN) step. At benchmark time we use a Vistaar-style normalizer (strip punctuation + IndicNormalizer NFC/NFD + digit/ordinal expansion); see AI4Bharat/vistaar/evaluation.py for the reference implementation.
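The Indian-numbering comma placement shown in the table above (the last three digits form one group, every group before that has two digits) can be sketched as follows. This is an illustrative helper for downstream code that needs to produce or parse such numbers; the function names are hypothetical and are not part of this repo.

```python
def indian_group(n: int) -> str:
    """Format a non-negative integer with Indian-style commas (3 digits, then 2, 2, ...)."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    digits = str(n)
    if len(digits) <= 3:
        return digits
    head, tail = digits[:-3], digits[-3:]
    groups = []
    # Peel two-digit groups off the right end of everything before the last three digits.
    while len(head) > 2:
        groups.append(head[-2:])
        head = head[:-2]
    groups.append(head)
    return ",".join(reversed(groups)) + "," + tail


def parse_grouped(text: str) -> int:
    """Turn a comma-grouped number string back into an integer."""
    return int(text.replace(",", ""))


print(indian_group(250000))          # 2,50,000
print(indian_group(30000000))        # 3,00,00,000
print(parse_grouped("2,50,000"))     # 250000
```

`parse_grouped` simply drops the commas, so it accepts Western-style grouping as well.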

⚠️ Limitations

Code-switching not supported yet. Varuna is trained on monolingual Hindi audio. Inputs that mix English words mid-sentence (e.g., conversational Hindi-English) may produce transliteration artifacts or substitutions. A bilingual fine-tune is on the roadmap.
Codec-degraded audio. Performance on telephony / heavily compressed audio (e.g., MUCS subset) is weaker than on studio-clean speech (CER 10.75 % vs 2.75 % on IndicTTS). Codec-augmentation training is planned.
Audio format. Expects 16 kHz mono. Other sample rates need resampling upstream.
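Since the model expects 16 kHz mono input, it can help to check files before transcribing them. Below is a minimal sketch using Python's standard `wave` module; it only handles PCM WAV files and only reports problems, leaving the actual resampling or downmixing to whichever audio tool you prefer. The function name `wav_format_problems` is hypothetical and is not part of inference.py.

```python
import wave

EXPECTED_RATE = 16_000      # Hz, per the audio-format note above
EXPECTED_CHANNELS = 1       # mono


def wav_format_problems(path):
    """Return a list of format problems for a PCM WAV file; an empty list means it looks fine."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()

    problems = []
    if rate != EXPECTED_RATE:
        problems.append(f"sample rate is {rate} Hz, expected {EXPECTED_RATE} Hz")
    if channels != EXPECTED_CHANNELS:
        problems.append(f"file has {channels} channels, expected mono")
    return problems
```

A caller might skip or resample any clip for which this returns a non-empty list, rather than passing it to the model unchanged.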

🔗 Links

📊 Benchmark dataset: SkunkWorkLabs/hindi-asr-benchmark — 6 Hindi subsets with embedded audio + outputs from Varuna and 3 commercial systems.
🧪 Vistaar normalizer reference: AI4Bharat/vistaar
🛠 Base model: nvidia/nemotron-speech-streaming-en-0.6b

📬 Contact

Need help with the training recipe or want to fine-tune Varuna on your own data? Reach out: harshris2314@gmail.com.

📝 Citation

If you use Varuna STT in research or production, please cite:

@misc{skunkworks-varuna-stt-2026,
  title = {Varuna STT: A Hindi ASR model fine-tuned from NVIDIA NeMo Nemotron},
  author = {SkunkWorks Labs},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/SkunkWorkLabs/varuna-stt}
}