+SkunkWorks Modified MIT License
+Copyright (c) 2026 SkunkWorks Labs
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software, model weights, and associated documentation files (the
+"Software"), to deal in the Software without restriction, including without
+limitation the rights to use, copy, modify, merge, publish, distribute,
+sublicense, and/or sell copies of the Software, and to permit persons to whom
+the Software is furnished to do so, subject to the following conditions:
+1. ATTRIBUTION
+   The above copyright notice and this permission notice shall be included in
+   all copies or substantial portions of the Software. Any product, paper, or
+   public-facing distribution that uses Varuna STT must visibly credit
+   "Varuna STT by SkunkWorks Labs" with a link to the source repository.
+2. UPSTREAM ATTRIBUTION
+   Varuna STT is fine-tuned from NVIDIA NeMo's
+   `nemotron-speech-streaming-en-0.6b` base model. Use of this Software is
+   subject to the upstream NeMo / NVIDIA model license, which the user is
+   responsible for reviewing and complying with separately.
+3. NO WARRANTY
+   THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+   IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+   FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+   AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+   LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+   FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+   IN THE SOFTWARE.

README.md ADDED

	@@ -0,0 +1,263 @@

+---
+language:
+  - hi
+license: other
+license_name: skunkworks-modified-mit
+license_link: LICENSE
+pretty_name: Varuna STT
+library_name: nemo
+tags:
+  - automatic-speech-recognition
+  - hindi
+  - asr
+  - speech
+  - conformer
+  - rnnt
+  - nemo
+  - varuna
+pipeline_tag: automatic-speech-recognition
+base_model: nvidia/nemotron-speech-streaming-en-0.6b
+metrics:
+  - wer
+  - cer
+model-index:
+  - name: Varuna STT
+    results:
+      - task:
+          type: automatic-speech-recognition
+        dataset:
+          name: SkunkWorkLabs Hindi ASR Benchmark — kathbath
+          type: SkunkWorkLabs/hindi-asr-benchmark
+          config: kathbath
+          split: eval
+        metrics:
+          - type: wer
+            value: 16.82
+          - type: cer
+            value: 6.36
+      - task:
+          type: automatic-speech-recognition
+        dataset:
+          name: SkunkWorkLabs Hindi ASR Benchmark — kathbath_noisy
+          type: SkunkWorkLabs/hindi-asr-benchmark
+          config: kathbath_noisy
+          split: eval
+        metrics:
+          - type: wer
+            value: 19.06
+          - type: cer
+            value: 8.00
+      - task:
+          type: automatic-speech-recognition
+        dataset:
+          name: SkunkWorkLabs Hindi ASR Benchmark — commonvoice
+          type: SkunkWorkLabs/hindi-asr-benchmark
+          config: commonvoice
+          split: eval
+        metrics:
+          - type: wer
+            value: 24.16
+          - type: cer
+            value: 10.72
+      - task:
+          type: automatic-speech-recognition
+        dataset:
+          name: SkunkWorkLabs Hindi ASR Benchmark — fleurs
+          type: SkunkWorkLabs/hindi-asr-benchmark
+          config: fleurs
+          split: eval
+        metrics:
+          - type: wer
+            value: 17.29
+          - type: cer
+            value: 7.20
+      - task:
+          type: automatic-speech-recognition
+        dataset:
+          name: SkunkWorkLabs Hindi ASR Benchmark — indictts
+          type: SkunkWorkLabs/hindi-asr-benchmark
+          config: indictts
+          split: eval
+        metrics:
+          - type: wer
+            value: 9.75
+          - type: cer
+            value: 2.75
+      - task:
+          type: automatic-speech-recognition
+        dataset:
+          name: SkunkWorkLabs Hindi ASR Benchmark — mucs
+          type: SkunkWorkLabs/hindi-asr-benchmark
+          config: mucs
+          split: eval
+        metrics:
+          - type: wer
+            value: 24.60
+          - type: cer
+            value: 10.75
+---
+# Varuna STT 🌊
+**Varuna STT** is a 0.6B-parameter Hindi automatic speech recognition (ASR) model
+fine-tuned from NVIDIA's [`nemotron-speech-streaming-en-0.6b`](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b)
+base on a curated mix of Hindi speech corpora. It outputs natural-style Hindi
+text — digits, ordinals (`1st`/`3rd`), Indian numbering (lakh/crore comma
+placement), and Devanagari punctuation (`।`, `,`, `?`, `!`) — directly from the
+acoustic signal, ready to drop into voicebot / IVR / transcription pipelines
+without a separate ITN postprocessor.
+- **Architecture:** Conformer encoder + RNN-T decoder (NeMo `EncDecRNNTBPEModel`)
+- **Parameters:** 0.6 B
+- **Language:** Hindi (`hi`)
+- **Sample rate:** 16 kHz mono
+- **Output style:** Inverse-Text-Normalized (ITN) — digits, ordinals, punctuation
+- **License:** SkunkWorks Modified MIT (see `LICENSE`)
+## ⚡ Inference speed (NVIDIA H100 PCIe)
+Measured on 20 sample clips from the Kathbath validation split (~5 s mean clip duration) with batch size 1 and `greedy_batch` RNN-T decoding:
+| Metric | Value |
+|---|---|
+| **RTFx** | **25.13×** |
+| Mean per-clip latency | 208 ms |
+| p50 latency | 175 ms |
+| p90 latency | 362 ms |
+(RTFx = audio_seconds / wall_seconds. 25× means 1 hour of audio is transcribed in ~2.4 minutes on a single H100.)
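The RTFx definition above can be turned into a few lines of Python. This is a minimal sketch, not the benchmark script: the helper names are hypothetical, and the nearest-rank percentile is an assumption, since the README does not say how p50/p90 were computed.

```python
import math


def rtfx(audio_seconds: float, wall_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio processed per wall-clock second."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


def minutes_to_transcribe(audio_hours: float, rtfx_value: float) -> float:
    """Wall-clock minutes needed to transcribe `audio_hours` of audio at a given RTFx."""
    return audio_hours * 60.0 / rtfx_value


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile (assumed method; other definitions interpolate)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # 1 hour of audio at the reported RTFx of 25.13
    print(f"{minutes_to_transcribe(1.0, 25.13):.2f} min")
```

Per-clip latencies collected in a list can be passed to `percentile(latencies, 50)` and `percentile(latencies, 90)` to get figures comparable to the table.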
+## 📊 Benchmark — Vistaar-style normalized WER % / CER %
+Evaluated on six Hindi held-out subsets from the
+[`SkunkWorkLabs/hindi-asr-benchmark`](https://huggingface.co/datasets/SkunkWorkLabs/hindi-asr-benchmark) dataset.
+References and hypotheses both pass through the same Vistaar-style normalizer
+([Bhogale et al., Interspeech 2023](https://www.isca-archive.org/interspeech_2023/bhogale23_interspeech.pdf))
+plus digit / ordinal expansion, so all systems are compared in a style-neutral way.
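For readers who want to reproduce the metrics, WER and CER can be sketched with a plain Levenshtein distance. This is an illustrative sketch under stated assumptions: it scores at corpus level (total edits over total reference length), counts CER over Unicode code points including spaces, and expects text that has already been normalized. The function names are hypothetical and this is not the benchmark's own scoring code.

```python
from collections.abc import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (substitutions, insertions, deletions all cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(refs: list[str], hyps: list[str]) -> float:
    """Corpus-level word error rate in percent (assumed aggregation)."""
    edits = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    total = sum(len(r.split()) for r in refs)
    return 100.0 * edits / total


def cer(refs: list[str], hyps: list[str]) -> float:
    """Corpus-level character error rate in percent, over code points."""
    edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return 100.0 * edits / total
```

Averaging per-utterance error rates instead of pooling edits gives different numbers on short clips, so the aggregation choice should match whatever the compared systems were scored with.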
+### WER %
+| Subset | n | **Varuna STT** | ElevenLabs Scribe v1 | Deepgram Nova-2 | Sarvam Saarika v2.5 |
+|---|---|---|---|---|---|
+| **indictts**       | 98    | **9.75 🥇** | 13.20 | 15.41 | 14.71 |
+| **fleurs (test)**  | 417   | 17.29       | **11.93** | 21.22 | 15.74 |
+| **kathbath**       | 1,929 | 16.82       | **13.32** | 20.55 | 16.62 |
+| **kathbath_noisy** | 1,929 | 19.06       | **13.16** | 21.98 | 17.75 |
+| **commonvoice**    | 1,727 | 24.16       | **17.02** | 28.34 | 19.32 |
+| **mucs**           | 3,897 | 24.60       | **10.97** | 20.54 | 12.72 |
+### CER %
+| Subset | **Varuna STT** | ElevenLabs Scribe v1 | Deepgram Nova-2 | Sarvam Saarika v2.5 |
+|---|---|---|---|---|
+| **indictts**       | **2.75 🥇** | 4.16 | 8.53 | 6.51 |
+| **fleurs (test)**  | 7.20        | **5.68** | 16.74 | 7.08 |
+| **kathbath**       | **6.36 🥇** | 6.50 | 13.53 | 7.42 |
+| **kathbath_noisy** | 8.00        | **5.87** | 14.75 | 7.82 |
+| **commonvoice**    | 10.72       | **8.96** | 20.25 | 9.87 |
+| **mucs**           | 10.75       | **3.94** | 9.94 | 4.79 |
+Varuna leads on `indictts` (both metrics) and narrowly leads on `kathbath` CER (6.36 vs 6.50 for ElevenLabs Scribe v1). It trails the best system on the remaining subsets, with the largest WER gaps on `commonvoice` and `mucs` (conversational / codec-degraded audio).
+## 🚀 Usage
+```python
+from inference import VarunaSTT
+model = VarunaSTT()                                    # auto-picks GPU if available
+texts = model.transcribe(["clip1.wav", "clip2.wav"])   # 16 kHz mono
+for t in texts:
+    print(t)
+```
+CLI:
+```bash
+python inference.py --audio path/to/clip.wav
+```
+You'll need:
+- `nemo_toolkit[asr]>=2.4`
+- `omegaconf`, `torch`, `soundfile`
+- The base `nemotron-speech-streaming-en-0.6b.nemo` file (download separately
+  from [`nvidia/nemotron-speech-streaming-en-0.6b`](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b))
+Files in this repo:
+- `varuna.ckpt` — fine-tuned weights
+- `tokenizer.model`, `tokenizer.vocab`, `vocab.txt` — bilingual EN-1024 / HI-512 BPE tokenizer
+- `inference.py` — minimal inference example
+## 🛠 Training
+Fine-tuned from **NVIDIA `nemotron-speech-streaming-en-0.6b`** using the NeMo
+ASR framework. Hindi training mix:
+| Source | Approx. hours |
+|---|---|
+| Shrutilipi (Hindi)  | ~1,500 |
+| IndicVoices (Hindi) | ~1,000 |
+| Kathbath (Hindi)    | ~137 |
+| IndicVoices-R       | ~150 |
+| Gramvaani           | ~100 |
+| Vaani               | ~50 |
+| Lahaja              | ~30 |
+| IndicTTS            | ~30 |
+| Short-form domain   | ~20 |
+All Hindi training labels were ITN-normalized (digits, ordinals, `।`/`,` punctuation,
+Indian-numbering commas) using a Gemma 4 normalization pass following NVIDIA Riva
+Hindi ITN conventions.
+The tokenizer is a bilingual BPE tokenizer with 1,024 English and 512 Hindi
+sub-word tokens (1,536 total), shared across languages. Training used RNN-T
+loss with SpecAugment and mixed-precision (bf16) on NVIDIA H100s.
+## 📋 Output convention
+Varuna emits **ITN-style** Hindi:
+| spoken | output |
+|---|---|
+| `पाँच सौ` (five hundred) | `500` |
+| `दो लाख पचास हजार` | `2,50,000` |
+| `तीन करोड़` | `3,00,00,000` |
+| `पहला` (first) | `1st` |
+| `तीसरा` | `3rd` |
+| End of sentence | `।` |
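The lakh/crore comma placement in the table groups the last three digits together and every pair of digits before them. A minimal sketch of that grouping rule follows; `indian_grouping` is a hypothetical helper for illustration, since the model emits these strings directly and does not call any such function.

```python
def indian_grouping(n: int) -> str:
    """Format an integer with Indian-numbering commas (3 digits, then groups of 2)."""
    sign = "-" if n < 0 else ""
    digits = str(abs(n))
    if len(digits) <= 3:
        return sign + digits
    head, tail = digits[:-3], digits[-3:]
    groups = []
    while len(head) > 2:
        groups.insert(0, head[-2:])  # peel off two digits at a time from the right
        head = head[:-2]
    groups.insert(0, head)
    return sign + ",".join(groups + [tail])


if __name__ == "__main__":
    for value in (500, 250000, 30000000):
        print(value, "->", indian_grouping(value))
```

A helper like this is handy when writing test references for downstream code that parses Varuna's numeric output.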
+This is what voicebot / IVR / call-center products typically want. If your
+downstream consumer expects spelled-out Devanagari, post-process the model
+output with a reverse-ITN (text normalization) step. At benchmark time we apply
+a Vistaar-style normalizer (strip punctuation + IndicNormalizer NFC/NFD +
+digit/ordinal expansion) to both references and hypotheses; see
+[AI4Bharat/vistaar/evaluation.py](https://github.com/AI4Bharat/vistaar/blob/master/evaluation.py)
+for the reference implementation.
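To give a feel for what the punctuation-stripping part of such a normalizer does, here is a simplified standard-library sketch. It is an assumption-laden stand-in, not the Vistaar code: it applies Unicode NFC, removes commas inside numbers, replaces punctuation with spaces, and collapses whitespace. Digit/ordinal expansion and the IndicNormalizer script-specific rules are deliberately left out.

```python
import re
import unicodedata


def simple_normalize(text: str) -> str:
    """Simplified scoring normalizer: NFC, drop punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    # Remove grouping commas so "2,50,000" stays one token instead of three.
    text = re.sub(r"(?<=\d),(?=\d)", "", text)
    # Unicode category "P*" covers ASCII punctuation and the Devanagari danda.
    no_punct = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in text
    )
    return " ".join(no_punct.split())


if __name__ == "__main__":
    print(simple_normalize("कुल 2,50,000 रुपये।"))
```

Whatever normalizer is used, applying the identical function to references and hypotheses is what keeps the comparison style-neutral.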
+## ⚠️ Limitations
+- **Code-switching not supported yet.** Varuna is trained on monolingual Hindi
+  audio. Inputs that mix English words mid-sentence (e.g., conversational
+  Hindi-English) may produce transliteration artifacts or substitutions. A
+  bilingual fine-tune is on the roadmap.
+- **Codec-degraded audio.** Performance on telephony / heavily compressed audio
+  (e.g., MUCS subset) is weaker than on studio-clean speech (CER 10.75 % vs
+  2.75 % on IndicTTS). Codec-augmentation training is planned.
+- **Audio format.** Expects 16 kHz mono. Other sample rates need resampling
+  upstream.
+## 🔗 Links
+- 📊 **Benchmark dataset:** [`SkunkWorkLabs/hindi-asr-benchmark`](https://huggingface.co/datasets/SkunkWorkLabs/hindi-asr-benchmark) — 6 Hindi subsets with embedded audio + outputs from Varuna and 3 commercial systems.
+- 🧪 **Vistaar normalizer reference:** [AI4Bharat/vistaar](https://github.com/AI4Bharat/vistaar)
+- 🛠 **Base model:** [nvidia/nemotron-speech-streaming-en-0.6b](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b)
+## 📝 Citation
+If you use Varuna STT in research or production, please cite:
+```bibtex
+@misc{skunkworks-varuna-stt-2026,
+  title = {Varuna STT: A Hindi ASR model fine-tuned from NVIDIA NeMo Nemotron},
+  author = {SkunkWorks Labs},
+  year = {2026},
+  publisher = {Hugging Face},
+  url = {https://huggingface.co/SkunkWorkLabs/varuna-stt}
+}
+```

inference.py ADDED

	@@ -0,0 +1,82 @@

+"""
+Varuna STT — inference example.
+Usage:
+    pip install "nemo_toolkit[asr]>=2.4" omegaconf torch soundfile
+    python inference.py --audio path/to/clip.wav
+    # Programmatic
+    from inference import VarunaSTT
+    model = VarunaSTT()
+    print(model.transcribe(["a.wav", "b.wav"]))
+"""
+from __future__ import annotations
+import argparse
+from pathlib import Path
+import torch
+from omegaconf import OmegaConf, open_dict
+from nemo.collections.asr.models import EncDecRNNTBPEModel
+# ── Paths (adjust if you move the files) ──────────────────────────────────────
+HERE = Path(__file__).resolve().parent
+NEMOTRON_BASE = HERE / "nemotron-speech-streaming-en-0.6b.nemo"
+TOKENIZER_DIR = HERE                    # contains tokenizer.model, vocab.txt
+CKPT_PATH     = HERE / "varuna.ckpt"
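+class VarunaSTT:
+    """Load the Nemotron base model, swap in the Varuna tokenizer and weights,
+    and expose a simple `transcribe` method for 16 kHz mono audio files."""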
+    def __init__(self, device: str | None = None,
+                 base: Path = NEMOTRON_BASE,
+                 ckpt: Path = CKPT_PATH,
+                 tokenizer_dir: Path = TOKENIZER_DIR):
+        self.device = device or ("cuda" if torch.cuda.is_available() else "cpu")
+        self.model = EncDecRNNTBPEModel.restore_from(str(base), map_location=self.device)
+        self.model.change_vocabulary(new_tokenizer_dir=str(tokenizer_dir),
+                                      new_tokenizer_type="bpe")
+        # Greedy-batch RNN-T decoding (deterministic, fast on GPU)
+        decoding_cfg = OmegaConf.to_container(self.model.cfg.decoding, resolve=True)
+        decoding_cfg = OmegaConf.create(decoding_cfg)
+        with open_dict(decoding_cfg):
+            decoding_cfg.strategy = "greedy_batch"
+            if "greedy" not in decoding_cfg:
+                decoding_cfg.greedy = {}
+            decoding_cfg.greedy.use_cuda_graph_decoder = False
+        self.model.change_decoding_strategy(decoding_cfg)
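+        # Load fine-tuned weights. weights_only=False unpickles arbitrary Python
+        # objects, so only load checkpoints from a source you trust.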
+        state = torch.load(str(ckpt), map_location=self.device, weights_only=False)
+        sd = state["state_dict"] if "state_dict" in state else state
+        self.model.load_state_dict(sd, strict=False)
+        self.model = self.model.to(self.device).eval()
+    @torch.inference_mode()
+    def transcribe(self, audio_paths: list[str], batch_size: int = 8) -> list[str]:
+        """Transcribe audio file(s) at 16 kHz mono. Returns plain Hindi text per clip."""
+        out = self.model.transcribe(audio=list(audio_paths),
+                                    batch_size=batch_size,
+                                    return_hypotheses=False,
+                                    verbose=False)
+        if isinstance(out, tuple):
+            out = out[0]
+        return [h.text if hasattr(h, "text") else h for h in out]
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--audio", nargs="+", required=True)
+    ap.add_argument("--batch-size", type=int, default=8)
+    ap.add_argument("--device", default=None)
+    args = ap.parse_args()
+    model = VarunaSTT(device=args.device)
+    for path, hyp in zip(args.audio, model.transcribe(args.audio, args.batch_size)):
+        print(f"[{path}]\n  {hyp}")
+if __name__ == "__main__":
+    main()