harsh2ai committed on
Commit 2e67c80 · verified · 1 Parent(s): a562499

initial release: Varuna STT

Files changed (3)
  1. LICENSE +31 -0
  2. README.md +263 -0
  3. inference.py +82 -0
LICENSE ADDED
@@ -0,0 +1,31 @@
SkunkWorks Modified MIT License

Copyright (c) 2026 SkunkWorks Labs

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software, model weights, and associated documentation files (the
"Software"), to deal in the Software without restriction, including without
limitation the rights to use, copy, modify, merge, publish, distribute,
sublicense, and/or sell copies of the Software, and to permit persons to whom
the Software is furnished to do so, subject to the following conditions:

1. ATTRIBUTION
   The above copyright notice and this permission notice shall be included in
   all copies or substantial portions of the Software. Any product, paper, or
   public-facing distribution that uses Varuna STT must visibly credit
   "Varuna STT by SkunkWorks Labs" with a link to the source repository.

2. UPSTREAM ATTRIBUTION
   Varuna STT is fine-tuned from NVIDIA NeMo's
   `nemotron-speech-streaming-en-0.6b` base model. Use of this Software is
   subject to the upstream NeMo / NVIDIA model license, which the user is
   responsible for reviewing and complying with separately.

3. NO WARRANTY
   THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
   IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
   FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
   AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
   LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
   FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
   IN THE SOFTWARE.
README.md ADDED
@@ -0,0 +1,263 @@
---
language:
- hi
license: other
license_name: skunkworks-modified-mit
license_link: LICENSE
pretty_name: Varuna STT
library_name: nemo
tags:
- automatic-speech-recognition
- hindi
- asr
- speech
- conformer
- rnnt
- nemo
- varuna
pipeline_tag: automatic-speech-recognition
base_model: nvidia/nemotron-speech-streaming-en-0.6b
metrics:
- wer
- cer
model-index:
- name: Varuna STT
  results:
  - task:
      type: automatic-speech-recognition
    dataset:
      name: SkunkWorkLabs Hindi ASR Benchmark — kathbath
      type: SkunkWorkLabs/hindi-asr-benchmark
      config: kathbath
      split: eval
    metrics:
    - type: wer
      value: 16.82
    - type: cer
      value: 6.36
  - task:
      type: automatic-speech-recognition
    dataset:
      name: SkunkWorkLabs Hindi ASR Benchmark — kathbath_noisy
      type: SkunkWorkLabs/hindi-asr-benchmark
      config: kathbath_noisy
      split: eval
    metrics:
    - type: wer
      value: 19.06
    - type: cer
      value: 8.00
  - task:
      type: automatic-speech-recognition
    dataset:
      name: SkunkWorkLabs Hindi ASR Benchmark — commonvoice
      type: SkunkWorkLabs/hindi-asr-benchmark
      config: commonvoice
      split: eval
    metrics:
    - type: wer
      value: 24.16
    - type: cer
      value: 10.72
  - task:
      type: automatic-speech-recognition
    dataset:
      name: SkunkWorkLabs Hindi ASR Benchmark — fleurs
      type: SkunkWorkLabs/hindi-asr-benchmark
      config: fleurs
      split: eval
    metrics:
    - type: wer
      value: 17.29
    - type: cer
      value: 7.20
  - task:
      type: automatic-speech-recognition
    dataset:
      name: SkunkWorkLabs Hindi ASR Benchmark — indictts
      type: SkunkWorkLabs/hindi-asr-benchmark
      config: indictts
      split: eval
    metrics:
    - type: wer
      value: 9.75
    - type: cer
      value: 2.75
  - task:
      type: automatic-speech-recognition
    dataset:
      name: SkunkWorkLabs Hindi ASR Benchmark — mucs
      type: SkunkWorkLabs/hindi-asr-benchmark
      config: mucs
      split: eval
    metrics:
    - type: wer
      value: 24.60
    - type: cer
      value: 10.75
---

# Varuna STT 🌊

**Varuna STT** is a 0.6B-parameter Hindi automatic speech recognition (ASR) model
fine-tuned from NVIDIA's [`nemotron-speech-streaming-en-0.6b`](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b)
base model on a curated mix of Hindi speech corpora. It outputs natural-style Hindi
text directly from the acoustic signal, including digits, ordinals (`1st`/`3rd`),
Indian numbering (lakh/crore comma placement), and punctuation (`।`, `,`, `?`, `!`),
so transcripts can go into voicebot / IVR / transcription pipelines without a
separate inverse text normalization (ITN) postprocessor.

- **Architecture:** Conformer encoder + RNN-T decoder (NeMo `EncDecRNNTBPEModel`)
- **Parameters:** 0.6 B
- **Language:** Hindi (`hi`)
- **Sample rate:** 16 kHz mono
- **Output style:** Inverse-text-normalized (ITN): digits, ordinals, punctuation
- **License:** SkunkWorks Modified MIT (see `LICENSE`)

## ⚡ Inference speed (NVIDIA H100 PCIe)

Measured on 20 sample clips from the Kathbath validation set (~5 s mean clip duration), batch size 1, `greedy_batch` RNN-T decoding:

| Metric | Value |
|---|---|
| **RTFx** | **25.13×** |
| Mean per-clip latency | 208 ms |
| p50 latency | 175 ms |
| p90 latency | 362 ms |

(RTFx = audio_seconds / wall_seconds. An RTFx of about 25× means 1 hour of audio is transcribed in ~2.4 minutes on a single H100.)

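The RTFx arithmetic above can be reproduced in a few lines of Python. This is a minimal sketch: the function names and the per-clip numbers are illustrative assumptions, not the measurement script behind the table, and the README does not say which percentile method produced the p50 / p90 values.

```python
import statistics


def rtfx(audio_seconds: float, wall_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio processed per second of wall time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


def minutes_to_transcribe(audio_hours: float, rtfx_value: float) -> float:
    """Wall-clock minutes needed to transcribe `audio_hours` of audio at a given RTFx."""
    return audio_hours * 60.0 / rtfx_value


# Hypothetical per-clip measurements (NOT the README's raw data).
clip_seconds = [4.0, 5.0, 6.0, 5.0]
latencies_s = [0.16, 0.20, 0.24, 0.20]

overall_rtfx = rtfx(sum(clip_seconds), sum(latencies_s))
mean_latency = statistics.mean(latencies_s)
p50_latency = statistics.median(latencies_s)
# statistics.quantiles with n=10 returns 9 cut points; the last one is the 90th percentile.
p90_latency = statistics.quantiles(latencies_s, n=10)[-1]

print(f"RTFx: {overall_rtfx:.2f}x")
print(f"mean / p50 / p90 latency: {mean_latency:.3f} / {p50_latency:.3f} / {p90_latency:.3f} s")
print(f"1 h of audio at RTFx 25.13: {minutes_to_transcribe(1.0, 25.13):.2f} min")
```

Plugging in the reported RTFx of 25.13 gives 60 / 25.13 ≈ 2.39 minutes per hour of audio, which is where the "~2.4 minutes" figure comes from.
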
## 📊 Benchmark: Vistaar-style normalized WER % / CER %

Evaluated on six held-out Hindi subsets from the
[`SkunkWorkLabs/hindi-asr-benchmark`](https://huggingface.co/datasets/SkunkWorkLabs/hindi-asr-benchmark) dataset.
References and hypotheses both pass through the same Vistaar-style normalizer
([Bhogale et al., Interspeech 2023](https://www.isca-archive.org/interspeech_2023/bhogale23_interspeech.pdf))
plus digit / ordinal expansion, so all systems are compared in a style-neutral way.

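For readers who want to see how normalized WER and CER are computed in principle, here is a minimal sketch. The `simple_normalize` function is a stand-in assumption that only applies Unicode NFC and strips a few punctuation marks; it is not the Vistaar normalizer and does no digit / ordinal expansion, so its numbers will not match the tables in this README. CER is computed here over characters including spaces, which is also an assumption.

```python
import re
import unicodedata


def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def simple_normalize(text: str) -> str:
    """Hypothetical stand-in normalizer: NFC, strip punctuation, collapse spaces."""
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"[।,?!.]", " ", text)
    return " ".join(text.split())


def wer(ref: str, hyp: str) -> float:
    """Word error rate in percent, after normalizing both sides identically."""
    ref_words = simple_normalize(ref).split()
    hyp_words = simple_normalize(hyp).split()
    if not ref_words:
        raise ValueError("reference is empty after normalization")
    return 100.0 * edit_distance(ref_words, hyp_words) / len(ref_words)


def cer(ref: str, hyp: str) -> float:
    """Character error rate in percent, after normalizing both sides identically."""
    ref_chars = list(simple_normalize(ref))
    hyp_chars = list(simple_normalize(hyp))
    if not ref_chars:
        raise ValueError("reference is empty after normalization")
    return 100.0 * edit_distance(ref_chars, hyp_chars) / len(ref_chars)


# A trailing danda on one side does not count as an error once both sides are normalized.
print(wer("यह एक परीक्षण है।", "यह एक परीक्षण है"))
print(wer("a b c d", "a x c d"))
```

Normalizing the reference and the hypothesis with the same function is the point of the exercise: a system that emits `।` or digits is not penalized against one that does not.
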
### WER %

Lower is better; the best score in each row is in bold.

| Subset | n | **Varuna STT** | ElevenLabs Scribe v1 | Deepgram Nova-2 | Sarvam Saarika v2.5 |
|---|---|---|---|---|---|
| **indictts** | 98 | **9.75 🥇** | 13.20 | 15.41 | 14.71 |
| **fleurs (test)** | 417 | 17.29 | **11.93** | 21.22 | 15.74 |
| **kathbath** | 1,929 | 16.82 | **13.32** | 20.55 | 16.62 |
| **kathbath_noisy** | 1,929 | 19.06 | **13.16** | 21.98 | 17.75 |
| **commonvoice** | 1,727 | 24.16 | **17.02** | 28.34 | 19.32 |
| **mucs** | 3,897 | 24.60 | **10.97** | 20.54 | 12.72 |

### CER %

| Subset | **Varuna STT** | ElevenLabs Scribe v1 | Deepgram Nova-2 | Sarvam Saarika v2.5 |
|---|---|---|---|---|
| **indictts** | **2.75 🥇** | 4.16 | 8.53 | 6.51 |
| **fleurs (test)** | 7.20 | **5.68** | 16.74 | 7.08 |
| **kathbath** | **6.36 🥇** | 6.50 | 13.53 | 7.42 |
| **kathbath_noisy** | 8.00 | **5.87** | 14.75 | 7.82 |
| **commonvoice** | 10.72 | **8.96** | 20.25 | 9.87 |
| **mucs** | 10.75 | **3.94** | 9.94 | 4.79 |

Varuna leads on `indictts` (both metrics) and narrowly leads on `kathbath` CER (6.36 vs 6.50 for ElevenLabs Scribe v1). It has the most headroom on the conversational / codec-degraded subsets (`commonvoice`, `mucs`).

## 🚀 Usage

```python
from inference import VarunaSTT

model = VarunaSTT()                                    # auto-picks GPU if available
texts = model.transcribe(["clip1.wav", "clip2.wav"])  # 16 kHz mono
for t in texts:
    print(t)
```

CLI:
```bash
python inference.py --audio path/to/clip.wav
```

You'll need:
- `nemo_toolkit[asr]>=2.4`
- `omegaconf`, `torch`, `soundfile`
- The base `nemotron-speech-streaming-en-0.6b.nemo` file (download separately
  from [`nvidia/nemotron-speech-streaming-en-0.6b`](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b)).
  `inference.py` looks for it next to the script by default; pass `base=` to `VarunaSTT` to use another location.

Files in this repo:
- `varuna.ckpt`: fine-tuned weights
- `tokenizer.model`, `tokenizer.vocab`, `vocab.txt`: bilingual EN-1024 / HI-512 BPE tokenizer
- `inference.py`: minimal inference example

## 🛠 Training

Fine-tuned from **NVIDIA `nemotron-speech-streaming-en-0.6b`** using the NeMo
ASR framework. Hindi training mix (roughly 3,000 hours in total):

| Source | Approx. hours |
|---|---|
| Shrutilipi (Hindi) | ~1,500 |
| IndicVoices (Hindi) | ~1,000 |
| Kathbath (Hindi) | ~137 |
| IndicVoices-R | ~150 |
| Gramvaani | ~100 |
| Vaani | ~50 |
| Lahaja | ~30 |
| IndicTTS | ~30 |
| Short-form domain | ~20 |

All Hindi training labels were ITN-normalized (digits, ordinals, `।`/`,` punctuation,
Indian-numbering commas) using a Gemma 4 normalization pass following NVIDIA Riva
Hindi ITN conventions.

The tokenizer is a bilingual EN-1024 + HI-512 BPE tokenizer (1,536 sub-word tokens
in total) shared across languages. Training used RNN-T loss with SpecAugment and
mixed precision (bf16) on NVIDIA H100s.

## 📋 Output convention

Varuna emits **ITN-style** Hindi:

| Spoken | Output |
|---|---|
| `पाँच सौ` (five hundred) | `500` |
| `दो लाख पचास हजार` (two lakh fifty thousand) | `2,50,000` |
| `तीन करोड़` (three crore) | `3,00,00,000` |
| `पहला` (first) | `1st` |
| `तीसरा` (third) | `3rd` |
| End of sentence | `।` |

This is what voicebot / IVR / call-center products typically want. If your
downstream consumer expects spelled-out Devanagari, post-process the model
output with a text-normalization (reverse-ITN) step. At benchmark time we use a
Vistaar-style normalizer (punctuation stripping, IndicNormalizer NFC/NFD,
digit/ordinal expansion); see
[AI4Bharat/vistaar/evaluation.py](https://github.com/AI4Bharat/vistaar/blob/master/evaluation.py)
for the reference implementation.

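The Indian-numbering comma placement shown in the table (last three digits first, then groups of two) can be illustrated with a short helper. This is a sketch for downstream formatting or test fixtures; `indian_grouping` is a hypothetical name, and the model itself produces these strings directly rather than through code like this.

```python
def indian_grouping(n: int) -> str:
    """Format an integer with lakh/crore-style commas, e.g. 250000 -> '2,50,000'."""
    sign = "-" if n < 0 else ""
    digits = str(abs(n))
    if len(digits) <= 3:
        return sign + digits

    # The rightmost group has three digits; every group to its left has two.
    head, tail = digits[:-3], digits[-3:]
    groups = []
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    groups.insert(0, head)  # remaining one or two leading digits
    return sign + ",".join(groups + [tail])


print(indian_grouping(500))         # 500
print(indian_grouping(250_000))     # 2,50,000
print(indian_grouping(30_000_000))  # 3,00,00,000
```
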
## ⚠️ Limitations

- **Code-switching not supported yet.** Varuna is trained on monolingual Hindi
  audio. Inputs that mix English words mid-sentence (e.g., conversational
  Hindi-English) may produce transliteration artifacts or substitutions. A
  bilingual fine-tune is on the roadmap.
- **Codec-degraded audio.** Performance on telephony / heavily compressed audio
  (e.g., the MUCS subset) is weaker than on studio-clean speech (CER 10.75 % on
  MUCS vs 2.75 % on IndicTTS). Codec-augmentation training is planned.
- **Audio format.** Expects 16 kHz mono. Audio at other sample rates needs to be
  resampled upstream.

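Since the model expects 16 kHz mono input, it can help to check files before sending them for transcription. The sketch below uses only Python's standard-library `wave` module, which reads uncompressed PCM WAV files; `check_wav_format` and `write_silent_wav` are hypothetical helper names, not part of this repo, and the actual resampling step is left to whatever audio tool you already use.

```python
import tempfile
import wave
from pathlib import Path


def check_wav_format(path, expected_rate=16000, expected_channels=1):
    """Return a list of human-readable problems; an empty list means the file matches."""
    with wave.open(str(path), "rb") as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
    problems = []
    if rate != expected_rate:
        problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
    if channels != expected_channels:
        problems.append(f"{channels} channel(s), expected {expected_channels}")
    return problems


def write_silent_wav(path, rate, channels, seconds=0.1):
    """Write a short 16-bit PCM file of silence (used only for the demo below)."""
    n_frames = int(rate * seconds)
    with wave.open(str(path), "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wf.setframerate(rate)
        wf.writeframes(b"\x00\x00" * n_frames * channels)


# Demo: one file in the expected format, one that needs resampling and downmixing.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.wav"
    bad = Path(tmp) / "bad.wav"
    write_silent_wav(good, rate=16000, channels=1)
    write_silent_wav(bad, rate=8000, channels=2)
    ok_problems = check_wav_format(good)
    bad_problems = check_wav_format(bad)

print(ok_problems)
print(bad_problems)
```

Files that fail the check should be converted to 16 kHz mono before being passed to `VarunaSTT.transcribe`.
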
## 🔗 Links

- 📊 **Benchmark dataset:** [`SkunkWorkLabs/hindi-asr-benchmark`](https://huggingface.co/datasets/SkunkWorkLabs/hindi-asr-benchmark): six Hindi subsets with embedded audio plus outputs from Varuna and three commercial systems.
- 🧪 **Vistaar normalizer reference:** [AI4Bharat/vistaar](https://github.com/AI4Bharat/vistaar)
- 🛠 **Base model:** [nvidia/nemotron-speech-streaming-en-0.6b](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b)

## 📝 Citation

If you use Varuna STT in research or production, please cite:

```bibtex
@misc{skunkworks-varuna-stt-2026,
  title = {Varuna STT: A Hindi ASR model fine-tuned from NVIDIA NeMo Nemotron},
  author = {SkunkWorks Labs},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/SkunkWorkLabs/varuna-stt}
}
```
inference.py ADDED
@@ -0,0 +1,82 @@
"""
Varuna STT: inference example.

Usage:
    pip install "nemo_toolkit[asr]>=2.4" omegaconf torch soundfile

    python inference.py --audio path/to/clip.wav

    # Programmatic
    from inference import VarunaSTT
    model = VarunaSTT()
    print(model.transcribe(["a.wav", "b.wav"]))
"""
from __future__ import annotations

import argparse
from pathlib import Path

import torch
from omegaconf import OmegaConf, open_dict

from nemo.collections.asr.models import EncDecRNNTBPEModel

# ── Paths (adjust if you move the files) ──────────────────────────────────────
HERE = Path(__file__).resolve().parent
NEMOTRON_BASE = HERE / "nemotron-speech-streaming-en-0.6b.nemo"
TOKENIZER_DIR = HERE  # contains tokenizer.model, vocab.txt
CKPT_PATH = HERE / "varuna.ckpt"


class VarunaSTT:
    """Base Nemotron model with the Varuna tokenizer and fine-tuned weights applied."""

    def __init__(self, device: str | None = None,
                 base: Path = NEMOTRON_BASE,
                 ckpt: Path = CKPT_PATH,
                 tokenizer_dir: Path = TOKENIZER_DIR):
        self.device = device or ("cuda" if torch.cuda.is_available() else "cpu")
        # Restore the base model, then switch it to the tokenizer shipped in this repo.
        self.model = EncDecRNNTBPEModel.restore_from(str(base), map_location=self.device)
        self.model.change_vocabulary(new_tokenizer_dir=str(tokenizer_dir),
                                     new_tokenizer_type="bpe")

        # Greedy-batch RNN-T decoding (deterministic, fast on GPU)
        decoding_cfg = OmegaConf.to_container(self.model.cfg.decoding, resolve=True)
        decoding_cfg = OmegaConf.create(decoding_cfg)
        with open_dict(decoding_cfg):
            decoding_cfg.strategy = "greedy_batch"
            if "greedy" not in decoding_cfg:
                decoding_cfg.greedy = {}
            decoding_cfg.greedy.use_cuda_graph_decoder = False
        self.model.change_decoding_strategy(decoding_cfg)

        # Load fine-tuned weights. weights_only=False unpickles arbitrary objects,
        # so only load checkpoints from a source you trust.
        state = torch.load(str(ckpt), map_location=self.device, weights_only=False)
        # Accept either a full training checkpoint or a bare state dict.
        sd = state["state_dict"] if "state_dict" in state else state
        # strict=False tolerates key mismatches between checkpoint and model.
        self.model.load_state_dict(sd, strict=False)
        self.model = self.model.to(self.device).eval()

    @torch.inference_mode()
    def transcribe(self, audio_paths: list[str], batch_size: int = 8) -> list[str]:
        """Transcribe audio file(s) at 16 kHz mono. Returns plain Hindi text per clip."""
        out = self.model.transcribe(audio=list(audio_paths),
                                    batch_size=batch_size,
                                    return_hypotheses=False,
                                    verbose=False)
        # The return type may be a tuple, and entries may be Hypothesis objects
        # or plain strings, so handle each case.
        if isinstance(out, tuple):
            out = out[0]
        return [h.text if hasattr(h, "text") else h for h in out]


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--audio", nargs="+", required=True)
    ap.add_argument("--batch-size", type=int, default=8)
    ap.add_argument("--device", default=None)
    args = ap.parse_args()

    model = VarunaSTT(device=args.device)
    for path, hyp in zip(args.audio, model.transcribe(args.audio, args.batch_size)):
        print(f"[{path}]\n  {hyp}")


if __name__ == "__main__":
    main()