---
language:
- hi
license: other
license_name: skunkworks-modified-mit
license_link: LICENSE
pretty_name: Varuna STT
library_name: nemo
tags:
- automatic-speech-recognition
- hindi
- asr
- speech
- conformer
- rnnt
- nemo
- varuna
pipeline_tag: automatic-speech-recognition
base_model: nvidia/nemotron-speech-streaming-en-0.6b
metrics:
- wer
- cer
model-index:
- name: Varuna STT
  results:
  - task:
      type: automatic-speech-recognition
    dataset:
      name: SkunkWorkLabs Hindi ASR Benchmark — kathbath
      type: SkunkWorkLabs/hindi-asr-benchmark
      config: kathbath
      split: eval
    metrics:
    - type: wer
      value: 16.82
    - type: cer
      value: 6.36
  - task:
      type: automatic-speech-recognition
    dataset:
      name: SkunkWorkLabs Hindi ASR Benchmark — kathbath_noisy
      type: SkunkWorkLabs/hindi-asr-benchmark
      config: kathbath_noisy
      split: eval
    metrics:
    - type: wer
      value: 19.06
    - type: cer
      value: 8.00
  - task:
      type: automatic-speech-recognition
    dataset:
      name: SkunkWorkLabs Hindi ASR Benchmark — commonvoice
      type: SkunkWorkLabs/hindi-asr-benchmark
      config: commonvoice
      split: eval
    metrics:
    - type: wer
      value: 24.16
    - type: cer
      value: 10.72
  - task:
      type: automatic-speech-recognition
    dataset:
      name: SkunkWorkLabs Hindi ASR Benchmark — fleurs
      type: SkunkWorkLabs/hindi-asr-benchmark
      config: fleurs
      split: eval
    metrics:
    - type: wer
      value: 17.29
    - type: cer
      value: 7.20
  - task:
      type: automatic-speech-recognition
    dataset:
      name: SkunkWorkLabs Hindi ASR Benchmark — indictts
      type: SkunkWorkLabs/hindi-asr-benchmark
      config: indictts
      split: eval
    metrics:
    - type: wer
      value: 9.75
    - type: cer
      value: 2.75
  - task:
      type: automatic-speech-recognition
    dataset:
      name: SkunkWorkLabs Hindi ASR Benchmark — mucs
      type: SkunkWorkLabs/hindi-asr-benchmark
      config: mucs
      split: eval
    metrics:
    - type: wer
      value: 24.60
    - type: cer
      value: 10.75
---

# Varuna STT 🌊

**Varuna STT** is a 0.6B-parameter Hindi automatic speech recognition (ASR) model fine-tuned from NVIDIA's
[`nemotron-speech-streaming-en-0.6b`](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b) base on a curated mix of Hindi speech corpora. It produces natural-style Hindi text directly from the acoustic signal: digits, ordinals (`1st`/`3rd`), Indian numbering (lakh/crore comma placement), and Devanagari punctuation (`।`, `,`, `?`, `!`). The output is ready to drop into voicebot / IVR / transcription pipelines without a separate ITN postprocessor.

- **Architecture:** Conformer encoder + RNN-T decoder (NeMo `EncDecRNNTBPEModel`)
- **Parameters:** 0.6 B
- **Language:** Hindi (`hi`)
- **Sample rate:** 16 kHz mono
- **Output style:** Inverse-Text-Normalized (ITN): digits, ordinals, punctuation
- **License:** SkunkWorks Modified MIT (see `LICENSE`)

## ⚡ Inference speed (NVIDIA H100 PCIe)

Measured on 20 sample clips from Kathbath val (~5 s mean clip duration), batch=1, greedy_batch RNN-T decoding:

| Metric | Value |
|---|---|
| **RTFx** | **25.13×** |
| Mean per-clip latency | 208 ms |
| p50 latency | 175 ms |
| p90 latency | 362 ms |

(RTFx = audio_seconds / wall_seconds. 25× means 1 hour of audio is transcribed in ~2.4 minutes on a single H100.)

## 📊 Benchmark: Vistaar-style normalized WER % / CER %

Evaluated on six Hindi held-out subsets from the [`SkunkWorkLabs/hindi-asr-benchmark`](https://huggingface.co/datasets/SkunkWorkLabs/hindi-asr-benchmark) dataset. References and hypotheses both pass through the same Vistaar-style normalizer ([Bhogale et al., Interspeech 2023](https://www.isca-archive.org/interspeech_2023/bhogale23_interspeech.pdf)) plus digit / ordinal expansion, so all systems are compared in a style-neutral way.
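For readers who want to sanity-check scores on their own transcripts, here is a minimal sketch of how WER and CER can be computed once both strings have been normalized. The `normalize` function is a deliberately simplified stand-in (Unicode NFC plus punctuation stripping); it does not reproduce the Vistaar normalizer or the digit / ordinal expansion used for the numbers in this card, so its scores will not match them exactly. All function names are illustrative, and the scores are per-utterance.

```python
import re
import unicodedata

# Punctuation removed before scoring: the Devanagari dandas plus common ASCII marks.
# Assumption: this is a simplified stand-in, not the Vistaar normalizer.
_PUNCT = re.compile(r"[।॥,.?!;:\"']")


def normalize(text):
    """NFC-normalize, drop punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = _PUNCT.sub("", text)
    return " ".join(text.split())


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(ref, hyp):
    """Word error rate in percent for a single utterance."""
    ref_words = normalize(ref).split()
    hyp_words = normalize(hyp).split()
    return 100.0 * edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)


def cer(ref, hyp):
    """Character error rate in percent, counted over Unicode code points."""
    ref_chars = normalize(ref)
    hyp_chars = normalize(hyp)
    return 100.0 * edit_distance(ref_chars, hyp_chars) / max(len(ref_chars), 1)
```

Because punctuation is stripped on both sides, `wer("आज मौसम अच्छा है।", "आज मौसम अच्छा है")` scores the two strings as identical, which is the style-neutral behaviour described above.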
### WER %

| Subset | n | **Varuna STT** | ElevenLabs Scribe v1 | Deepgram Nova-2 | Sarvam Saarika v2.5 |
|---|---|---|---|---|---|
| **indictts** | 98 | **9.75 🥇** | 13.20 | 15.41 | 14.71 |
| **fleurs (test)** | 417 | 17.29 | **11.93** | 21.22 | 15.74 |
| **kathbath** | 1,929 | 16.82 | **13.32** | 20.55 | 16.62 |
| **kathbath_noisy** | 1,929 | 19.06 | **13.16** | 21.98 | 17.75 |
| **commonvoice** | 1,727 | 24.16 | **17.02** | 28.34 | 19.32 |
| **mucs** | 3,897 | 24.60 | **10.97** | 20.54 | 12.72 |

### CER %

| Subset | **Varuna STT** | ElevenLabs Scribe v1 | Deepgram Nova-2 | Sarvam Saarika v2.5 |
|---|---|---|---|---|
| **indictts** | **2.75 🥇** | 4.16 | 8.53 | 6.51 |
| **fleurs (test)** | 7.20 | **5.68** | 16.74 | 7.08 |
| **kathbath** | **6.36 🥇** | 6.50 | 13.53 | 7.42 |
| **kathbath_noisy** | 8.00 | **5.87** | 14.75 | 7.82 |
| **commonvoice** | 10.72 | **8.96** | 20.25 | 9.87 |
| **mucs** | 10.75 | **3.94** | 9.94 | 4.79 |

Varuna leads on `indictts` (both metrics) and narrowly leads on `kathbath` CER (6.36 vs 6.50 for ElevenLabs Scribe v1). Its largest gaps to the best system are on the conversational / codec-degraded subsets (`commonvoice`, `mucs`).

## 🚀 Usage

```python
from inference import VarunaSTT

model = VarunaSTT()  # auto-picks GPU if available
texts = model.transcribe(["clip1.wav", "clip2.wav"])  # 16 kHz mono
for t in texts:
    print(t)
```

CLI:

```bash
python inference.py --audio path/to/clip.wav
```

You'll need:

- `nemo_toolkit[asr]>=2.4`
- `omegaconf`, `torch`, `soundfile`
- The base `nemotron-speech-streaming-en-0.6b.nemo` file (download separately from [`nvidia/nemotron-speech-streaming-en-0.6b`](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b))

Files in this repo:

- `varuna.ckpt`: fine-tuned weights
- `tokenizer.model`, `tokenizer.vocab`, `vocab.txt`: bilingual EN-1024 / HI-512 BPE tokenizer
- `inference.py`: minimal inference example

## 🛠 Training

Fine-tuned from **NVIDIA `nemotron-speech-streaming-en-0.6b`** using the NeMo ASR framework. Hindi training mix:

| Source | Approx. hours |
|---|---|
| Shrutilipi (Hindi) | ~1,500 |
| IndicVoices (Hindi) | ~1,000 |
| Kathbath (Hindi) | ~137 |
| IndicVoices-R | ~150 |
| Gramvaani | ~100 |
| Vaani | ~50 |
| Lahaja | ~30 |
| IndicTTS | ~30 |
| Short-form domain | ~20 |

All Hindi training labels were ITN-normalized (digits, ordinals, `।`/`,` punctuation, Indian-numbering commas) using a Gemma 4 normalization pass following NVIDIA Riva Hindi ITN conventions.

The tokenizer is a bilingual EN-1024 + HI-512 BPE model (1,536 sub-word tokens total) shared across languages. Training used RNN-T loss with SpecAugment and mixed precision (bf16) on NVIDIA H100s.

## 📋 Output convention

Varuna emits **ITN-style** Hindi:

| spoken | output |
|---|---|
| `पाँच सौ` (five hundred) | `500` |
| `दो लाख पचास हजार` | `2,50,000` |
| `तीन करोड़` | `3,00,00,000` |
| `पहला` (first) | `1st` |
| `तीसरा` | `3rd` |
| End of sentence | `।` |

This is what voicebot / IVR / call-center products typically want. If your downstream consumer expects spelled-out Devanagari, post-process the model output with a reverse-ITN step.

We use a Vistaar-style normalizer at benchmark time (strip punctuation + IndicNormalizer NFC/NFD + digit/ordinal expansion). See [AI4Bharat/vistaar/evaluation.py](https://github.com/AI4Bharat/vistaar/blob/master/evaluation.py) for the reference implementation.

## ⚠️ Limitations

- **Code-switching not supported yet.** Varuna is trained on monolingual Hindi audio. Inputs that mix English words mid-sentence (e.g., conversational Hindi-English) may produce transliteration artifacts or substitutions. A bilingual fine-tune is on the roadmap.
- **Codec-degraded audio.** Performance on telephony / heavily compressed audio (e.g., MUCS subset) is weaker than on studio-clean speech (CER 10.75 % vs 2.75 % on IndicTTS). Codec-augmentation training is planned.
- **Audio format.** Expects 16 kHz mono. Other sample rates need resampling upstream.
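Since the model expects 16 kHz mono input, a cheap pre-flight check on the WAV header can catch mismatched clips before they reach the model. The sketch below uses only Python's standard-library `wave` module, which reads PCM WAV files; the helper names `wav_format` and `needs_conversion` are hypothetical and are not part of `inference.py`. It only detects a mismatch. The actual resampling or downmixing is left to whichever audio library you already use.

```python
import wave

TARGET_RATE = 16_000   # Hz, per the "Audio format" limitation above
TARGET_CHANNELS = 1    # mono


def wav_format(path):
    """Return (sample_rate_hz, channels) read from a PCM WAV header."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate(), wav.getnchannels()


def needs_conversion(path):
    """True if the clip must be resampled or downmixed before transcription."""
    rate, channels = wav_format(path)
    return rate != TARGET_RATE or channels != TARGET_CHANNELS
```

A typical use is to filter a batch first, for example `ready = [p for p in paths if not needs_conversion(p)]`, and send the remaining clips through a resampler before transcription.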
## 🔗 Links

- 📊 **Benchmark dataset:** [`SkunkWorkLabs/hindi-asr-benchmark`](https://huggingface.co/datasets/SkunkWorkLabs/hindi-asr-benchmark): 6 Hindi subsets with embedded audio + outputs from Varuna and 3 commercial systems.
- 🧪 **Vistaar normalizer reference:** [AI4Bharat/vistaar](https://github.com/AI4Bharat/vistaar)
- 🛠 **Base model:** [nvidia/nemotron-speech-streaming-en-0.6b](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b)

## 📬 Contact

Need help with the **training recipe** or want to **fine-tune Varuna** on your own data? Reach out: **harshris2314@gmail.com**.

## 📝 Citation

If you use Varuna STT in research or production, please cite:

```bibtex
@misc{skunkworks-varuna-stt-2026,
  title     = {Varuna STT: A Hindi ASR model fine-tuned from NVIDIA NeMo Nemotron},
  author    = {SkunkWorks Labs},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/SkunkWorkLabs/varuna-stt}
}
```