ChristophSchuhmann committed on
Commit
c438395
·
verified ·
1 Parent(s): 2b9e5d6

Mirror nvidia/nemotron-3.5-asr-streaming-0.6b

.gitattributes CHANGED
@@ -76,3 +76,7 @@ code/VibeVoice/finetuning-asr/toy_dataset/1.mp3 filter=lfs diff=lfs merge=lfs -t
76
  models/parakeet-tdt-0.6b-v3/parakeet-tdt-0.6b-v3.nemo filter=lfs diff=lfs merge=lfs -text
77
  models/diar_sortformer_4spk-v1/diar_sortformer_4spk-v1.nemo filter=lfs diff=lfs merge=lfs -text
78
  code/universal-audio-annotation-pipeline/docs/gemma12_dicow_demo.html filter=lfs diff=lfs merge=lfs -text
79
+ models/nemotron-3.5-asr-streaming-0.6b/latency_vs_parallel.png filter=lfs diff=lfs merge=lfs -text
80
+ models/nemotron-3.5-asr-streaming-0.6b/model_architecture.png filter=lfs diff=lfs merge=lfs -text
81
+ models/nemotron-3.5-asr-streaming-0.6b/model_overview.png filter=lfs diff=lfs merge=lfs -text
82
+ models/nemotron-3.5-asr-streaming-0.6b/nemotron-3.5-asr-streaming-0.6b.nemo filter=lfs diff=lfs merge=lfs -text
models/nemotron-3.5-asr-streaming-0.6b/.gitattributes ADDED
@@ -0,0 +1,37 @@
1
+ *.7z filter=lfs diff=lfs merge=lfs -text
2
+ *.arrow filter=lfs diff=lfs merge=lfs -text
3
+ *.bin filter=lfs diff=lfs merge=lfs -text
4
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
5
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
6
+ *.ftz filter=lfs diff=lfs merge=lfs -text
7
+ *.gz filter=lfs diff=lfs merge=lfs -text
8
+ *.h5 filter=lfs diff=lfs merge=lfs -text
9
+ *.joblib filter=lfs diff=lfs merge=lfs -text
10
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
+ *.model filter=lfs diff=lfs merge=lfs -text
13
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
14
+ *.npy filter=lfs diff=lfs merge=lfs -text
15
+ *.npz filter=lfs diff=lfs merge=lfs -text
16
+ *.onnx filter=lfs diff=lfs merge=lfs -text
17
+ *.ot filter=lfs diff=lfs merge=lfs -text
18
+ *.parquet filter=lfs diff=lfs merge=lfs -text
19
+ *.pb filter=lfs diff=lfs merge=lfs -text
20
+ *.pickle filter=lfs diff=lfs merge=lfs -text
21
+ *.pkl filter=lfs diff=lfs merge=lfs -text
22
+ *.pt filter=lfs diff=lfs merge=lfs -text
23
+ *.pth filter=lfs diff=lfs merge=lfs -text
24
+ *.rar filter=lfs diff=lfs merge=lfs -text
25
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
26
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
28
+ *.tar filter=lfs diff=lfs merge=lfs -text
29
+ *.tflite filter=lfs diff=lfs merge=lfs -text
30
+ *.tgz filter=lfs diff=lfs merge=lfs -text
31
+ *.wasm filter=lfs diff=lfs merge=lfs -text
32
+ *.xz filter=lfs diff=lfs merge=lfs -text
33
+ *.zip filter=lfs diff=lfs merge=lfs -text
34
+ *.zst filter=lfs diff=lfs merge=lfs -text
35
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ *.nemo filter=lfs diff=lfs merge=lfs -text
37
+ *.png filter=lfs diff=lfs merge=lfs -text
models/nemotron-3.5-asr-streaming-0.6b/README.md ADDED
@@ -0,0 +1,593 @@
1
+ ---
2
+ license: other
3
+ license_name: openmdw-1.1
4
+ license_link: >-
5
+ https://openmdw.ai/license/1-1/
6
+ library_name: nemo
7
+ language:
8
+ - en
9
+ - es
10
+ - de
11
+ - fr
12
+ - it
13
+ - ar
14
+ - ja
15
+ - ko
16
+ - pt
17
+ - ru
18
+ - hi
19
+ - zh
20
+ - vi
21
+ - he
22
+ - nl
23
+ - cs
24
+ - da
25
+ - pl
26
+ - 'no'
27
+ - sv
28
+ - th
29
+ - tr
30
+ - bg
31
+ - el
32
+ - et
33
+ - fi
34
+ - hr
35
+ - hu
36
+ - lt
37
+ - lv
38
+ - ro
39
+ - sk
40
+ - uk
41
+ - mt
42
+ - sl
43
+ datasets:
44
+ - nvidia/Granary
45
+ - multilingual_librispeech
46
+ - fleurs
47
+ - mozilla-foundation/common_voice_8_0
48
+ - voxpopuli
49
+ - europarl
50
+ thumbnail: null
51
+ tags:
52
+ - speech-recognition
53
+ - cache-aware ASR
54
+ - automatic-speech-recognition
55
+ - streaming-asr
56
+ - multilingual
57
+ - speech
58
+ - audio
59
+ - FastConformer
60
+ - RNNT
61
+ - Parakeet
62
+ - ASR
63
+ - pytorch
64
+ - NeMo
65
+ widget:
66
+ - example_title: Librispeech sample 1
67
+ src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
68
+ - example_title: Librispeech sample 2
69
+ src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
70
+ model-index:
71
+ - name: nemotron-asr-streaming-multilingual-0.6b
72
+ results:
73
+ - task:
74
+ name: Automatic Speech Recognition
75
+ type: automatic-speech-recognition
76
+ dataset:
77
+ name: FLEURS (English)
78
+ type: google/fleurs
79
+ config: en_us
80
+ split: test
81
+ metrics:
82
+ - name: WER (1.12s frame size, LangID)
83
+ type: wer
84
+ value: 7.91
85
+ - task:
86
+ name: Automatic Speech Recognition
87
+ type: automatic-speech-recognition
88
+ dataset:
89
+ name: FLEURS (Spanish)
90
+ type: google/fleurs
91
+ config: es_419
92
+ split: test
93
+ metrics:
94
+ - name: WER (1.12s frame size, LangID)
95
+ type: wer
96
+ value: 4.11
97
+ - task:
98
+ name: Automatic Speech Recognition
99
+ type: automatic-speech-recognition
100
+ dataset:
101
+ name: FLEURS (French)
102
+ type: google/fleurs
103
+ config: fr_fr
104
+ split: test
105
+ metrics:
106
+ - name: WER (1.12s frame size, LangID)
107
+ type: wer
108
+ value: 9.03
109
+ - task:
110
+ name: Automatic Speech Recognition
111
+ type: automatic-speech-recognition
112
+ dataset:
113
+ name: FLEURS (Italian)
114
+ type: google/fleurs
115
+ config: it_it
116
+ split: test
117
+ metrics:
118
+ - name: WER (1.12s frame size, LangID)
119
+ type: wer
120
+ value: 4.25
121
+ - task:
122
+ name: Automatic Speech Recognition
123
+ type: automatic-speech-recognition
124
+ dataset:
125
+ name: FLEURS (Portuguese)
126
+ type: google/fleurs
127
+ config: pt_br
128
+ split: test
129
+ metrics:
130
+ - name: WER (1.12s frame size, LangID)
131
+ type: wer
132
+ value: 5.48
133
+ - task:
134
+ name: Automatic Speech Recognition
135
+ type: automatic-speech-recognition
136
+ dataset:
137
+ name: FLEURS (German)
138
+ type: google/fleurs
139
+ config: de_de
140
+ split: test
141
+ metrics:
142
+ - name: WER (1.12s frame size, LangID)
143
+ type: wer
144
+ value: 8.31
145
+ - task:
146
+ name: Automatic Speech Recognition
147
+ type: automatic-speech-recognition
148
+ dataset:
149
+ name: FLEURS (Hindi)
150
+ type: google/fleurs
151
+ config: hi_in
152
+ split: test
153
+ metrics:
154
+ - name: WER (1.12s frame size, LangID)
155
+ type: wer
156
+ value: 6.81
157
+ - task:
158
+ name: Automatic Speech Recognition
159
+ type: automatic-speech-recognition
160
+ dataset:
161
+ name: FLEURS (Korean)
162
+ type: google/fleurs
163
+ config: ko_kr
164
+ split: test
165
+ metrics:
166
+ - name: WER (1.12s frame size, LangID)
167
+ type: wer
168
+ value: 7.12
169
+ metrics:
170
+ - wer
171
+ pipeline_tag: automatic-speech-recognition
172
+ ---
173
+
174
+ # Nemotron 3.5 ASR
175
+
176
+ <style>
177
+ h1, h2, h3, h4, h5, h6 {
178
+ color: #76b900; /* NVIDIA green */
179
+ font-weight: 700;
180
+ }
181
+
182
+ hr {
183
+ border: none;
184
+ border-top: 1px solid #e5e7eb;
185
+ margin: 2rem 0;
186
+ }
187
+
188
+ /* Improve list spacing */
189
+ ul, ol {
190
+ margin-top: 0.5rem;
191
+ margin-bottom: 0.5rem;
192
+ }
193
+
194
+ /* Badge alignment consistency */
195
+ img {
196
+ display: inline;
197
+ vertical-align: middle;
198
+ }
199
+ </style>
200
+
201
+ <p align="center">
202
+ <a href="#model-architecture"><img src="https://img.shields.io/badge/Model_Arch-FastConformer--CacheAware--RNNT-lightgrey#model-badge" alt="Model architecture"/></a>
203
+ &nbsp;
204
+ <a href="#model-architecture"><img src="https://img.shields.io/badge/Params-600M-lightgrey#model-badge" alt="Model size"/></a>
205
+ &nbsp;
206
+ <a href="#supported-languages"><img src="https://img.shields.io/badge/Language-Multilingual-lightgrey#model-badge" alt="Language"/></a>
207
+ </p>
208
+
209
+ <p align="center">
210
+ <img src="model_overview.png" alt="Nemotron 3.5 ASR overview: multilingual audio across 40 language-locales is transcribed by a cache-aware FastConformer-RNNT model with language-ID prompting into punctuated text with an automatic language tag" width="900"/>
211
+ </p>
212
+
213
+ > [!Note]
214
+ > This model is the multilingual extension of [nvidia/nemotron-speech-streaming-en-0.6b](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b), adding language-ID prompt conditioning to support transcription across **40 language-locales** from a single model.
215
+
216
+ **Nemotron 3.5 ASR** is a multilingual, streaming Automatic Speech Recognition (ASR) model engineered to deliver high-quality transcription across both low-latency streaming and high-throughput batch workloads. Developed by NVIDIA, this 600M-parameter model transcribes speech into text with native support for punctuation and capitalization, and offers runtime flexibility with configurable chunk sizes, including 80ms, 160ms, 320ms, 560ms, and 1120ms.
217
+
218
+ By leveraging a state-of-the-art **Cache-Aware FastConformer-RNNT** architecture, the model eliminates redundant overlapping computations common in traditional "buffered" streaming. This allows it to process only new audio chunks while reusing cached encoder context, significantly improving computational efficiency and minimizing end-to-end delay without sacrificing accuracy.
219
+
220
+ It was trained on a massive ASR dataset and is engineered to perform across diverse and challenging acoustic conditions.
221
+
222
+ This model is ready for commercial use.
223
+
224
+ ---
225
+
226
+ ## License/Terms of Use
227
+
228
+ Governing Terms: Use of the model is governed by the [OpenMDW-1.1](https://openmdw.ai/license/1-1/) license.
229
+
230
+ ## Deployment Geography
231
+
232
+ Global
233
+
234
+ ## Use Case
235
+
236
+ This model is for transcription of multilingual audio.
237
+
238
+ ## Release Date
239
+
240
+ - Hugging Face [06/04/2026] via https://huggingface.co/nvidia/nemotron-3.5-asr-streaming-0.6b
241
+
242
+ ## References
243
+
244
+ <a id="ref-1"></a>[1] [Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition](https://arxiv.org/abs/2312.17279)
245
+
246
+ <a id="ref-2"></a>[2] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084)
247
+
248
+ <a id="ref-3"></a>[3] [NVIDIA Granary](https://huggingface.co/datasets/nvidia/Granary)
249
+
250
+ <a id="ref-4"></a>[4] [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo)
251
+
252
+ ## Why Choose Nemotron 3.5 ASR?
253
+
254
+ - 🌍 **Single Multilingual Model:** Transcribes 40 language-locales from one model through language-ID prompt conditioning, with optional automatic language detection.
255
+ - ⚡ **Native Streaming Architecture:** Cache-aware design enables efficient processing of continuous audio streams, designed and optimized for low-latency voice agent applications.
256
+ - 💰 **Improved Operational Efficiency:** Delivers superior throughput compared to traditional buffered streaming approaches. This allows for a higher number of parallel streams within the same GPU memory constraints, directly reducing operational costs for production environments.
257
+ - 🎛️ **Dynamic Runtime Flexibility:** Choose the optimal operating point on the latency-accuracy Pareto curve at inference time. No re-training is required to adjust for different use-case requirements.
258
+ - 📝 **Punctuation & Capitalization:** Built-in support for punctuation and capitalization in output text.
259
+
260
+ ---
261
+
262
+ ## Supported Languages
263
+
264
+ The model supports **40 language-locales** in total, across three tiers:
265
+
266
+ - **Transcription-ready (19 locales):** highest-accuracy ASR, ready out of the box.
267
+ - **Broad-coverage (13 locales):** production ASR across an additional 13 locales.
268
+ - **Adaptation-ready (8 locales):** recognized by the tokenizer; fine-tune on in-domain data to unlock full transcription.
269
+
270
+ | Tier | Languages (locales) |
271
+ | :--- | :--- |
272
+ | **Transcription-ready (19 locales)** | English (en-US, en-GB), Spanish (es-US, es-ES), French (fr-FR, fr-CA), Italian (it-IT), Portuguese (pt-BR, pt-PT), Dutch (nl-NL), German (de-DE), Turkish (tr-TR), Russian (ru-RU), Arabic (ar-AR), Hindi (hi-IN), Japanese (ja-JP), Korean (ko-KR), Vietnamese (vi-VN), Ukrainian (uk-UA) |
273
+ | **Broad-coverage (13 locales)** | Polish (pl-PL), Swedish (sv-SE), Czech (cs-CZ), Norwegian Bokmål (nb-NO), Danish (da-DK), Bulgarian (bg-BG), Finnish (fi-FI), Croatian (hr-HR), Slovak (sk-SK), Mandarin (zh-CN), Hungarian (hu-HU), Romanian (ro-RO), Estonian (et-EE) |
274
+ | **Adaptation-ready (8 locales)** | Greek (el-GR), Lithuanian (lt-LT), Latvian (lv-LV), Maltese (mt-MT), Slovenian (sl-SI), Hebrew (he-IL), Thai (th-TH), Norwegian Nynorsk (nn-NO) |
275
+
276
+ > **Note:** Transcription-ready and broad-coverage locales (**32 total**) produce ASR transcription out of the box; adaptation-ready locales require fine-tuning on in-domain data to enable full transcription. The model supports uppercase and lowercase letters, punctuation, spaces, and apostrophes.
277
+
278
+ > **Note:** We recommend the [Nemotron ASR Streaming (English)](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b) model for English-only transcription use cases. For all other transcription-ready locales, we recommend Nemotron 3.5 ASR to leverage its expanded multilingual capabilities.
279
+
280
+ > [!Tip]
281
+ > **Automatic language detection / language tagging:** When run with `target_lang=auto`, the model detects the spoken language and emits the corresponding **language code/tag** in the output following the terminal punctuation. This lets a single deployment transcribe mixed-language traffic and automatically label each utterance with its detected language — no separate language-ID component required.
282
+
283
+ ---
284
+
285
+ ## Model Architecture
286
+
287
+ **Architecture Type:** FastConformer-CacheAware-RNNT with Prompt
288
+
289
+ This model consists of a cache-aware streaming Parakeet (FastConformer) encoder with an RNN-T decoder and language-ID prompt conditioning. It is based on the Cache-Aware [\[1\]](#ref-1) FastConformer [\[2\]](#ref-2) architecture with 24 encoder layers and an RNNT (Recurrent Neural Network Transducer) decoder. The cache-aware streaming design enables efficient processing of audio in chunks while maintaining context from previous frames. Unlike buffered inference, this model maintains caches for all encoder self-attention and convolution layers. This enables reuse of hidden states at every streaming step, where cached activations eliminate redundant computations. As a result, there are no overlapping computations; each processed frame is strictly non-overlapping. This model leverages prompts to guide the transcription process, enabling language-specific transcription from a single ASR model through language ID conditioning.
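As a rough illustration of why avoiding overlapping computation matters, the sketch below counts how many frames pass through an encoder under the two schemes. It is a toy model, not NeMo code: the function names and the assumption that a buffered decoder re-encodes its entire buffer (left context plus chunk plus lookahead) at every step are ours, and the numbers are arbitrary.

```python
import math


def buffered_frames(total_frames, chunk, left_ctx, right_ctx):
    """Frames encoded when every step re-encodes its whole buffer (toy assumption)."""
    steps = math.ceil(total_frames / chunk)
    return steps * (left_ctx + chunk + right_ctx)


def cache_aware_frames(total_frames):
    """Frames encoded when each frame is processed once and context comes from caches."""
    return total_frames


total = 1400  # 1400 frames of 80 ms = 112 s of audio (arbitrary example)
buf = buffered_frames(total, chunk=14, left_ctx=56, right_ctx=0)
cached = cache_aware_frames(total)
print(buf, cached, buf / cached)  # 7000 1400 5.0
```

Under these toy numbers the buffered scheme encodes five times as many frames for the same audio. The real gap depends on the buffer and chunk sizes a given deployment uses.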
290
+
291
+ <p align="center">
292
+ <img src="model_architecture.png" alt="Nemotron 3.5 ASR architecture: FastConformer encoder and language-ID encoding are concatenated, projected, and fed to the RNNT decoder" width="900"/>
293
+ </p>
294
+
295
+ The language-ID prompt is fused with the acoustic representation as follows:
296
+
297
+ - **FastConformer encoder** processes audio into an acoustic embedding of shape (D=512, T).
298
+ - **Language Encoding** expands a 128-dim one-hot language vector across the time axis → (K=128, T), broadcasting the language identity to every frame.
299
+ - **Concatenation** along the feature axis → fused tensor (D + K, T).
300
+ - **Projection layer** maps the fused features to the RNNT decoder.
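The shape bookkeeping in the steps above can be sketched in plain Python. This is only an illustration under stated assumptions: `fuse`, `one_hot`, the frame count `T = 5`, and the language index `3` are hypothetical. The learned projection layer is omitted, since its output size is not given here.

```python
D, K, T = 512, 128, 5  # D and K come from the description above; T is arbitrary


def one_hot(index, size):
    """Return a one-hot list of length `size` with a 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def fuse(acoustic, lang_index, num_langs=K):
    """Concatenate a (D, T) acoustic embedding with a broadcast (K, T) language encoding."""
    num_frames = len(acoustic[0])
    lang_vec = one_hot(lang_index, num_langs)
    # Broadcast: repeat each language component across every frame.
    lang_rows = [[value] * num_frames for value in lang_vec]
    # Concatenate along the feature axis -> (D + K, T).
    return acoustic + lang_rows


acoustic = [[0.0] * T for _ in range(D)]  # stand-in for encoder output
fused = fuse(acoustic, lang_index=3)      # index 3 is a hypothetical language slot
print(len(fused), len(fused[0]))          # 640 5
```

Every frame carries the same language identity, so the decoder sees the language signal at each time step.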
301
+
302
+ **Network Architecture:**
303
+ - Encoder: Cache-Aware FastConformer with 24 layers
304
+ - Decoder: RNNT (Recurrent Neural Network Transducer)
305
+ - Parameters: 600M
306
+
307
+ **This model was developed based on [nvidia/nemotron-speech-streaming-en-0.6b](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b).**
308
+
309
+ ---
310
+
311
+ ## Results at a Glance
312
+
313
+ ASR performance is measured using Word Error Rate (WER) on the **FLEURS** test sets. Accuracy stays strong across both modes and improves as the chunk size grows, while remaining competitive even at the lowest-latency 80ms setting. Full tables are in [Performance](#performance).
314
+
315
+ <p align="center">
316
+ <img src="fleurs_wer_vs_chunk_size.png" alt="FLEURS average WER vs streaming chunk size (LangID vs Auto-detect)" width="900"/>
317
+ </p>
318
+
319
+ <p align="center">
320
+ <img src="fleurs_langid_vs_auto.png" alt="FLEURS WER by language: LangID vs Auto-detect at 320ms chunk" width="900"/>
321
+ </p>
322
+
323
+ > **Note:** Japanese and Korean are measured using Character Error Rate (CER) rather than WER, as is standard for these languages.
324
+
325
+ ---
326
+
327
+ ## Throughput & Efficiency
328
+
329
+ Despite being **roughly half the size** (0.6B vs. 1.1B), Nemotron 3.5 ASR serves **far more concurrent streams at far lower latency** than the [Parakeet RNNT 1.1B multilingual model](https://build.nvidia.com/nvidia/parakeet-1_1b-rnnt-multilingual-asr), which runs on buffered streaming. The cache-aware streaming design avoids the redundant recomputation of buffered inference, so a single H100 can sustain dramatically higher concurrency at every chunk size — directly lowering the cost per stream in production. At the lowest-latency 80ms setting, Nemotron sustains **~17× more concurrent streams** (240 vs. 14); at the 1120ms setting it sustains **6× more** (2,400 vs. 400). The latency-vs-concurrency curves tell the same story: Nemotron (solid green) holds low final-token latency well past 1,000 parallel requests, while Parakeet RNNT 1.1B (dashed blue) saturates after only a few hundred.
330
+
331
+ <p align="center">
332
+ <img src="throughput_vs_chunk.png" alt="Concurrent streams supported on a single H100: Nemotron ASR streaming vs Parakeet RNNT, across chunk sizes" width="900"/>
333
+ </p>
334
+
335
+ <p align="center">
336
+ <img src="latency_vs_parallel.png" alt="Median final-token latency vs number of parallel requests on a single H100, Nemotron vs Parakeet RNNT across chunk sizes" width="900"/>
337
+ </p>
338
+
339
+ > Measured on a single NVIDIA H100. Throughput is the number of real-time streams sustainable in parallel; latency is the median final-token latency at a given level of concurrency.
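The concurrency ratios quoted above follow directly from the stream counts. A short check, with the dictionary layout and key names being our own:

```python
# Concurrent streams on a single H100, as quoted in the text above.
streams = {
    "80ms": {"nemotron": 240, "parakeet_rnnt_1.1b": 14},
    "1120ms": {"nemotron": 2400, "parakeet_rnnt_1.1b": 400},
}

ratios = {
    chunk: counts["nemotron"] / counts["parakeet_rnnt_1.1b"]
    for chunk, counts in streams.items()
}

for chunk, ratio in ratios.items():
    print(f"{chunk}: {ratio:.1f}x")
# 80ms: 17.1x
# 1120ms: 6.0x
```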
340
+
341
+ ---
342
+
343
+ ## Explore more from NVIDIA
344
+
345
+ For documentation, deployment guides, enterprise-ready APIs, and the latest open models—including Nemotron and other cutting-edge speech, translation, and generative AI—visit the NVIDIA Developer Portal at [developer.nvidia.com](https://developer.nvidia.com/).
346
+ Join the community to access tools, support, and resources to accelerate your development with NVIDIA's NeMo, Speech NIM, and foundation models.
347
+
348
+ - What is [Nemotron](https://www.nvidia.com/en-us/ai-data-science/foundation-models/nemotron/)?
349
+ - NVIDIA Developer [Nemotron](https://developer.nvidia.com/nemotron)
350
+ - [NVIDIA Speech NIM](https://docs.nvidia.com/nim/speech/latest/about/index.html)
351
+ - [NeMo Documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/models.html)
352
+
353
+ Also, check out the following NVIDIA speech models:
354
+ - Nemotron ASR Streaming (English) (Nemotron 3 ASR) - https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b
355
+ - Multitalker Parakeet Streaming - https://huggingface.co/nvidia/multitalker-parakeet-streaming-0.6b-v1
356
+ - Parakeet Realtime EOU - https://huggingface.co/nvidia/parakeet_realtime_eou_120m-v1
357
+
358
+ ---
359
+
360
+ ## NVIDIA NeMo
361
+
362
+ To train, fine-tune, or perform inference with this model, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo) [\[4\]](#ref-4). We recommend installing it after you have installed Cython and the latest PyTorch version.
363
+
364
+ ```bash
365
+ apt-get update && apt-get install -y libsndfile1 ffmpeg
366
+ pip install Cython packaging
367
+ pip install "git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[asr]"
368
+ ```
369
+
370
+ ## How to Use this Model
371
+
372
+ The model is available for use in the NeMo Framework, and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
373
+
374
+ ### Loading the Model
375
+
376
+ ```python
377
+ import nemo.collections.asr as nemo_asr
378
+ asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="nvidia/nemotron-3.5-asr-streaming-0.6b")
379
+ ```
380
+
381
+ ### Streaming Inference
382
+
383
+ You can use the cache-aware streaming inference script from NeMo - [NeMo/examples/asr/asr_cache_aware_streaming/speech_to_text_cache_aware_streaming_infer.py](https://github.com/NVIDIA-NeMo/NeMo/blob/main/examples/asr/asr_cache_aware_streaming/speech_to_text_cache_aware_streaming_infer.py)
384
+
385
+ This is a prompt-conditioned multilingual model: pass the target language with `target_lang` (e.g. `en-US`, `es-ES`, `de-DE`), or use `target_lang=auto` for automatic language detection.
386
+
387
+ ```bash
388
+ # target_lang: language key (e.g. en-US) or "auto" for automatic language detection
+ # att_context_size: set the second value to the desired right context from {0,1,3,6,13}
+ # strip_lang_tags: true removes the detected language tag from the text; false keeps it in the output
+ cd NeMo
+ python examples/asr/asr_cache_aware_streaming/speech_to_text_cache_aware_streaming_infer.py \
+     model_path=<model_path> \
+     dataset_manifest=<dataset_manifest> \
+     batch_size=<batch_size> \
+     target_lang=<lang_id> \
+     att_context_size="[56,13]" \
+     strip_lang_tags=true \
+     output_path=<output_folder>
397
+ ```
398
+
399
+ **`strip_lang_tags`** controls how the detected language tag is handled in the output. The model appends a language tag (e.g. `<en-US>`) after the transcript's terminal punctuation:
400
+ - `strip_lang_tags=false` (keep): the tag is left in the output, so you can read the detected language directly from each utterance — useful for mixed-language traffic and language labeling.
401
+ - `strip_lang_tags=true` (remove): the tag is stripped, leaving only the clean transcript text — useful when you only need the spoken words.
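When the tag is kept (`strip_lang_tags=false`), downstream code may want to separate it from the transcript. The sketch below is one possible way to do that, not part of NeMo: `split_lang_tag` and the regular expression are hypothetical. It assumes the tag is the last token and looks like `<en-US>`, as in the example above.

```python
import re

# Assumption: the tag is the final token and has the form <xx> or <xx-YY>.
LANG_TAG = re.compile(r"\s*<([A-Za-z]{2,3}(?:-[A-Za-z]{2,4})?)>\s*$")


def split_lang_tag(text):
    """Return (transcript, language_tag); language_tag is None if no tag is found."""
    match = LANG_TAG.search(text)
    if match is None:
        return text.strip(), None
    return text[:match.start()].strip(), match.group(1)


print(split_lang_tag("Hello there. <en-US>"))  # ('Hello there.', 'en-US')
print(split_lang_tag("No tag here."))          # ('No tag here.', None)
```

If the actual output format differs (for example, a different bracket style), the pattern would need adjusting.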
402
+
403
+ ### Setting up Streaming Configuration
404
+
405
+ Latency is set by the `att_context_size` parameter, where `att_context_size = [num_frames_left_context, num_frames_right_context]`, both measured in **80ms frames**:
406
+
407
+ * [56, 0]: Chunk size = 1 (1 × 80ms = 0.08s)
408
+ * [56, 1]: Chunk size = 2 (2 × 80ms = 0.16s)
409
+ * [56, 3]: Chunk size = 4 (4 × 80ms = 0.32s)
410
+ * [56, 6]: Chunk size = 7 (7 × 80ms = 0.56s)
411
+ * [56, 13]: Chunk size = 14 (14 × 80ms = 1.12s)
412
+
413
+ Here, chunk size = current frame + right context; chunks are processed without overlap.
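The mapping from `att_context_size` to chunk size and duration can be reproduced with a few lines of Python. The helper name `chunk_info` is our own; the 80 ms frame length and the "current frame + right context" rule come from the text above.

```python
FRAME_MS = 80  # duration of one encoder frame, per the text above


def chunk_info(att_context_size):
    """Return (chunk size in frames, chunk duration in ms) for [left, right] context."""
    _left, right = att_context_size
    chunk_frames = 1 + right  # current frame + right context
    return chunk_frames, chunk_frames * FRAME_MS


for right in (0, 1, 3, 6, 13):
    frames, ms = chunk_info([56, right])
    print(f"[56, {right}] -> {frames} frames, {ms} ms")
# [56, 0] -> 1 frames, 80 ms
# [56, 1] -> 2 frames, 160 ms
# [56, 3] -> 4 frames, 320 ms
# [56, 6] -> 7 frames, 560 ms
# [56, 13] -> 14 frames, 1120 ms
```

By the same arithmetic, the left context of 56 frames corresponds to 56 × 80 ms = 4.48 s of cached history.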
414
+
415
+ ### Input(s): <br>
416
+
417
+ **Input Type(s):** Audio, Lang ID <br>
418
+
419
+ **Input Format(s):** wav, string <br>
420
+
421
+ **Input Parameters:** One-Dimensional (1D) for audio and One-Dimensional (1D) for Lang ID <br>
422
+
423
+ **Other Properties Related to Input:** Maximum Length in seconds specific to GPU Memory, No Pre-Processing Needed, Mono channel is required.
424
+
425
+ By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions. <br>
426
+
427
+ ### Output
428
+
429
+ **Output Type(s):** Text String in Input Language <br>
430
+
431
+ **Output Format(s):** String <br>
432
+
433
+ **Output Parameters:** One-Dimensional (1D) <br>
434
+
435
+ **Other Properties Related to Output:** No Maximum Character Length; output includes punctuation and capitalization.
436
+
439
+ ---
440
+
441
+ ## Software Integration
442
+
443
+ **Runtime Engine:** NeMo 26.06
444
+
445
+ **Supported Hardware Microarchitecture Compatibility:**
446
+ - NVIDIA Ampere
447
+ - NVIDIA Blackwell
448
+ - NVIDIA Hopper
449
+ - NVIDIA Jetson
450
+ - NVIDIA Lovelace
451
+ - NVIDIA Turing
452
+ - NVIDIA Volta
453
+
454
+ **Supported Operating System(s):**
455
+ * Linux <br>
456
+ * Linux 4 Tegra <br>
457
+
458
+ The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.<br>
459
+
460
+
461
+
462
+ ---
463
+
464
+
465
+ ## Model Version(s):
466
+ nemotron-3.5-asr-streaming-0.6b-v1 <br>
467
+
468
+ ## Training and Evaluation Datasets:
469
+
470
+ ### Training Datasets
471
+
472
+ The model was trained on speech data across 40 language-locales. The training data is a dynamic blend of public and proprietary internal datasets, normalized so that transcripts use spoken forms with punctuation and capitalization, including:
473
+
474
+
475
+ - NVIDIA Riva multilingual ASR training set (Proprietary)
476
+ - NVIDIA Granary [\[3\]](#ref-3)
477
+ - Multilingual LibriSpeech (MLS)
478
+ - Mozilla Common Voice
479
+ - FLEURS
480
+ - VoxPopuli / Europarl-ASR
481
+
482
+ **Data Modality:** Audio <br>
483
+
484
+ **Audio Training Data Size:** 10,000 to 1 Million Hours <br>
485
+
486
+ **Data Collection Method by dataset** <br>
487
+ * Human <br>
488
+
489
+ **Labeling Method by dataset** <br>
490
+ * Human <br>
491
+ * Synthetic: Synthetic labels were generated from an ensemble of ASR models ([NVIDIA Canary](https://build.nvidia.com/nvidia/canary-1b-asr), [Parakeet Multilingual 1.1B RNNT](https://build.nvidia.com/nvidia/parakeet-1_1b-rnnt-multilingual-asr), [Parakeet CTC 1.1B](https://build.nvidia.com/nvidia/parakeet-ctc-1_1b-asr), [OpenAI Whisper](https://huggingface.co/openai/whisper-large-v3), and [FunASR](https://github.com/modelscope/FunASR)), with punctuation and capitalization (PnC) generated from [Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B).
492
+
493
+
494
+
495
+
496
+ ### Evaluation Datasets
497
+
498
+ The model was evaluated on multilingual ASR benchmarks:
499
+
500
+ - FLEURS
501
+ - Mozilla Common Voice (MCV)
502
+ - Multilingual LibriSpeech (MLS)
503
+ - NVIDIA internal multilingual evaluation sets
504
+
505
+ **Data Collection Method by dataset** <br>
506
+ * Human <br>
507
+
508
+ **Labeling Method by dataset** <br>
509
+ * Human <br>
510
+
511
+ ---
512
+
513
+ ## Performance
514
+
515
+ ASR performance is measured using the Word Error Rate (WER). The tables below report WER (%) on the **FLEURS** test sets across configurable streaming chunk sizes, in two modes:
516
+ - **Language Input (LangID):** the target language is provided to the model.
517
+ - **Auto-detect:** the model automatically detects the spoken language.
518
+
519
+ > **Note:** Japanese, Korean, and Mandarin are evaluated using Character Error Rate (CER) rather than WER, as is standard for these languages.
+ >
520
+ > **Note on text normalization:** WER/CER are computed after text normalization that aligns the reference and hypothesis (e.g., casing, punctuation, numerals, and formatting conventions). Normalization is not perfect across all 40 language-locales, and residual mismatches between normalized text can inflate the reported error rates — actual transcription quality may be somewhat better than the numbers suggest.
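For readers unfamiliar with the metrics, the sketch below shows the standard edit-distance definition of WER and CER. It is not the evaluation script behind the reported numbers: it applies no text normalization, the function names are our own, and removing spaces before computing CER is an assumption. References are assumed to be non-empty.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate in percent, splitting on whitespace."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate in percent, ignoring spaces (an assumption)."""
    ref_chars = list(reference.replace(" ", ""))
    hyp_chars = list(hypothesis.replace(" ", ""))
    return 100.0 * edit_distance(ref_chars, hyp_chars) / len(ref_chars)


# One deleted word out of six reference words.
print(wer("the cat sat on the mat", "the cat sat on mat"))
# One substituted character out of four.
print(cer("abcd", "abed"))  # 25.0
```

Because the error count is divided by the reference length, a single formatting mismatch in a short utterance can move the score noticeably. This is the normalization effect the note above describes.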
521
+
522
+ ### Transcription-ready (19 locales)
523
+
524
+ _Languages are ordered by accuracy (lowest WER first)._
525
+
526
+ <table>
527
+ <thead>
528
+ <tr><th rowspan="2" align="left">Language</th><th colspan="5" align="center" style="background-color:#76b900;color:#ffffff">Language Input (LangID)</th><th colspan="5" align="center" style="background-color:#6b7280;color:#ffffff;border-left:2px solid #cbd5e1;">Auto-detect</th></tr>
529
+ <tr><th align="center" style="background-color:#eef6e0">80ms</th><th align="center" style="background-color:#eef6e0">160ms</th><th align="center" style="background-color:#eef6e0">320ms</th><th align="center" style="background-color:#eef6e0">560ms</th><th align="center" style="background-color:#eef6e0">1.12s</th><th align="center" style="background-color:#f3f4f6;border-left:2px solid #cbd5e1;">80ms</th><th align="center" style="background-color:#f3f4f6;">160ms</th><th align="center" style="background-color:#f3f4f6;">320ms</th><th align="center" style="background-color:#f3f4f6;">560ms</th><th align="center" style="background-color:#f3f4f6;">1.12s</th></tr>
530
+ </thead>
531
+ <tbody>
532
+ <tr><td align="left">Spanish (es-US, es-ES)</td><td align="center" style="background-color:#eef6e0;">4.87</td><td align="center" style="background-color:#eef6e0;">4.64</td><td align="center" style="background-color:#eef6e0;">4.39</td><td align="center" style="background-color:#eef6e0;">4.26</td><td align="center" style="background-color:#eef6e0;">4.11</td><td align="center" style="background-color:#f3f4f6;border-left:2px solid #cbd5e1;">5.04</td><td align="center" style="background-color:#f3f4f6;">4.82</td><td align="center" style="background-color:#f3f4f6;">4.48</td><td align="center" style="background-color:#f3f4f6;">4.34</td><td align="center" style="background-color:#f3f4f6;">4.13</td></tr>
533
+ <tr><td align="left">Italian (it-IT)</td><td align="center" style="background-color:#eef6e0;">5.23</td><td align="center" style="background-color:#eef6e0;">4.85</td><td align="center" style="background-color:#eef6e0;">4.83</td><td align="center" style="background-color:#eef6e0;">4.41</td><td align="center" style="background-color:#eef6e0;">4.25</td><td align="center" style="background-color:#f3f4f6;border-left:2px solid #cbd5e1;">5.28</td><td align="center" style="background-color:#f3f4f6;">4.89</td><td align="center" style="background-color:#f3f4f6;">4.84</td><td align="center" style="background-color:#f3f4f6;">4.47</td><td align="center" style="background-color:#f3f4f6;">4.32</td></tr>
534
+ <tr><td align="left">Portuguese (pt-BR, pt-PT)</td><td align="center" style="background-color:#eef6e0;">6.29</td><td align="center" style="background-color:#eef6e0;">6.10</td><td align="center" style="background-color:#eef6e0;">5.81</td><td align="center" style="background-color:#eef6e0;">5.65</td><td align="center" style="background-color:#eef6e0;">5.48</td><td align="center" style="background-color:#f3f4f6;border-left:2px solid #cbd5e1;">6.41</td><td align="center" style="background-color:#f3f4f6;">6.19</td><td align="center" style="background-color:#f3f4f6;">5.82</td><td align="center" style="background-color:#f3f4f6;">5.57</td><td align="center" style="background-color:#f3f4f6;">5.47</td></tr>
535
+ <tr><td align="left">Hindi (hi-IN)</td><td align="center" style="background-color:#eef6e0;">8.13</td><td align="center" style="background-color:#eef6e0;">7.97</td><td align="center" style="background-color:#eef6e0;">7.41</td><td align="center" style="background-color:#eef6e0;">7.05</td><td align="center" style="background-color:#eef6e0;">6.81</td><td align="center" style="background-color:#f3f4f6;border-left:2px solid #cbd5e1;">11.47</td><td align="center" style="background-color:#f3f4f6;">10.83</td><td align="center" style="background-color:#f3f4f6;">9.88</td><td align="center" style="background-color:#f3f4f6;">9.26</td><td align="center" style="background-color:#f3f4f6;">8.23</td></tr>
536
+ <tr><td align="left">Korean (ko-KR)</td><td align="center" style="background-color:#eef6e0;">7.59</td><td align="center" style="background-color:#eef6e0;">7.70</td><td align="center" style="background-color:#eef6e0;">7.27</td><td align="center" style="background-color:#eef6e0;">7.18</td><td align="center" style="background-color:#eef6e0;">7.12</td><td align="center" style="background-color:#f3f4f6;border-left:2px solid #cbd5e1;">8.31</td><td align="center" style="background-color:#f3f4f6;">8.18</td><td align="center" style="background-color:#f3f4f6;">7.81</td><td align="center" style="background-color:#f3f4f6;">7.49</td><td align="center" style="background-color:#f3f4f6;">7.30</td></tr>
537
+ <tr><td align="left">English (en-US, en-GB)</td><td align="center" style="background-color:#eef6e0;">9.43</td><td align="center" style="background-color:#eef6e0;">8.88</td><td align="center" style="background-color:#eef6e0;">8.27</td><td align="center" style="background-color:#eef6e0;">7.99</td><td align="center" style="background-color:#eef6e0;">7.91</td><td align="center" style="background-color:#f3f4f6;border-left:2px solid #cbd5e1;">9.72</td><td align="center" style="background-color:#f3f4f6;">9.34</td><td align="center" style="background-color:#f3f4f6;">8.84</td><td align="center" style="background-color:#f3f4f6;">8.80</td><td align="center" style="background-color:#f3f4f6;">8.84</td></tr>
538
+ <tr><td align="left">German (de-DE)</td><td align="center" style="background-color:#eef6e0;">9.81</td><td align="center" style="background-color:#eef6e0;">9.21</td><td align="center" style="background-color:#eef6e0;">8.83</td><td align="center" style="background-color:#eef6e0;">8.42</td><td align="center" style="background-color:#eef6e0;">8.31</td><td align="center" style="background-color:#f3f4f6;border-left:2px solid #cbd5e1;">9.90</td><td align="center" style="background-color:#f3f4f6;">9.37</td><td align="center" style="background-color:#f3f4f6;">8.87</td><td align="center" style="background-color:#f3f4f6;">8.58</td><td align="center" style="background-color:#f3f4f6;">8.22</td></tr>
539
+ <tr><td align="left">French (fr-FR, fr-CA)</td><td align="center" style="background-color:#eef6e0;">10.97</td><td align="center" style="background-color:#eef6e0;">10.60</td><td align="center" style="background-color:#eef6e0;">9.79</td><td align="center" style="background-color:#eef6e0;">9.45</td><td align="center" style="background-color:#eef6e0;">9.03</td><td align="center" style="background-color:#f3f4f6;border-left:2px solid #cbd5e1;">11.03</td><td align="center" style="background-color:#f3f4f6;">10.60</td><td align="center" style="background-color:#f3f4f6;">9.84</td><td align="center" style="background-color:#f3f4f6;">9.46</td><td align="center" style="background-color:#f3f4f6;">9.02</td></tr>
540
+ <tr><td align="left">Russian (ru-RU)</td><td align="center" style="background-color:#eef6e0;">10.84</td><td align="center" style="background-color:#eef6e0;">10.73</td><td align="center" style="background-color:#eef6e0;">9.87</td><td align="center" style="background-color:#eef6e0;">9.60</td><td align="center" style="background-color:#eef6e0;">9.17</td><td align="center" style="background-color:#f3f4f6;border-left:2px solid #cbd5e1;">12.47</td><td align="center" style="background-color:#f3f4f6;">12.09</td><td align="center" style="background-color:#f3f4f6;">11.01</td><td align="center" style="background-color:#f3f4f6;">10.57</td><td align="center" style="background-color:#f3f4f6;">10.03</td></tr>
541
+ <tr><td align="left">Turkish (tr-TR)</td><td align="center" style="background-color:#eef6e0;">12.34</td><td align="center" style="background-color:#eef6e0;">12.33</td><td align="center" style="background-color:#eef6e0;">12.05</td><td align="center" style="background-color:#eef6e0;">11.34</td><td align="center" style="background-color:#eef6e0;">11.17</td><td align="center" style="background-color:#f3f4f6;border-left:2px solid #cbd5e1;">12.61</td><td align="center" style="background-color:#f3f4f6;">12.28</td><td align="center" style="background-color:#f3f4f6;">11.93</td><td align="center" style="background-color:#f3f4f6;">11.51</td><td align="center" style="background-color:#f3f4f6;">11.32</td></tr>
542
+ <tr><td align="left">Vietnamese (vi-VN)</td><td align="center" style="background-color:#eef6e0;">13.41</td><td align="center" style="background-color:#eef6e0;">12.87</td><td align="center" style="background-color:#eef6e0;">12.29</td><td align="center" style="background-color:#eef6e0;">11.78</td><td align="center" style="background-color:#eef6e0;">11.18</td><td align="center" style="background-color:#f3f4f6;border-left:2px solid #cbd5e1;">13.59</td><td align="center" style="background-color:#f3f4f6;">13.02</td><td align="center" style="background-color:#f3f4f6;">12.40</td><td align="center" style="background-color:#f3f4f6;">12.02</td><td align="center" style="background-color:#f3f4f6;">11.22</td></tr>
543
+ <tr><td align="left">Dutch (nl-NL)</td><td align="center" style="background-color:#eef6e0;">14.03</td><td align="center" style="background-color:#eef6e0;">13.43</td><td align="center" style="background-color:#eef6e0;">12.17</td><td align="center" style="background-color:#eef6e0;">11.97</td><td align="center" style="background-color:#eef6e0;">11.46</td><td align="center" style="background-color:#f3f4f6;border-left:2px solid #cbd5e1;">14.09</td><td align="center" style="background-color:#f3f4f6;">13.80</td><td align="center" style="background-color:#f3f4f6;">12.62</td><td align="center" style="background-color:#f3f4f6;">12.24</td><td align="center" style="background-color:#f3f4f6;">11.70</td></tr>
544
+ <tr><td align="left">Japanese (ja-JP)</td><td align="center" style="background-color:#eef6e0;">13.87</td><td align="center" style="background-color:#eef6e0;">12.90</td><td align="center" style="background-color:#eef6e0;">12.22</td><td align="center" style="background-color:#eef6e0;">11.91</td><td align="center" style="background-color:#eef6e0;">11.48</td><td align="center" style="background-color:#f3f4f6;border-left:2px solid #cbd5e1;">14.97</td><td align="center" style="background-color:#f3f4f6;">13.85</td><td align="center" style="background-color:#f3f4f6;">13.00</td><td align="center" style="background-color:#f3f4f6;">12.38</td><td align="center" style="background-color:#f3f4f6;">11.66</td></tr>
545
+ <tr><td align="left">Arabic (ar-AR)</td><td align="center" style="background-color:#eef6e0;">13.17</td><td align="center" style="background-color:#eef6e0;">12.65</td><td align="center" style="background-color:#eef6e0;">12.55</td><td align="center" style="background-color:#eef6e0;">12.13</td><td align="center" style="background-color:#eef6e0;">12.03</td><td align="center" style="background-color:#f3f4f6;border-left:2px solid #cbd5e1;">13.47</td><td align="center" style="background-color:#f3f4f6;">12.85</td><td align="center" style="background-color:#f3f4f6;">12.67</td><td align="center" style="background-color:#f3f4f6;">12.18</td><td align="center" style="background-color:#f3f4f6;">12.06</td></tr>
546
+ <tr><td align="left">Ukrainian (uk-UA)</td><td align="center" style="background-color:#eef6e0;">15.70</td><td align="center" style="background-color:#eef6e0;">15.21</td><td align="center" style="background-color:#eef6e0;">14.55</td><td align="center" style="background-color:#eef6e0;">13.67</td><td align="center" style="background-color:#eef6e0;">13.07</td><td align="center" style="background-color:#f3f4f6;border-left:2px solid #cbd5e1;">18.81</td><td align="center" style="background-color:#f3f4f6;">17.96</td><td align="center" style="background-color:#f3f4f6;">16.79</td><td align="center" style="background-color:#f3f4f6;">15.60</td><td align="center" style="background-color:#f3f4f6;">14.59</td></tr>
547
+ <tr><td align="left"><strong>Average</strong></td><td align="center" style="background-color:#eef6e0;"><strong>10.38</strong></td><td align="center" style="background-color:#eef6e0;"><strong>10.00</strong></td><td align="center" style="background-color:#eef6e0;"><strong>9.49</strong></td><td align="center" style="background-color:#eef6e0;"><strong>9.12</strong></td><td align="center" style="background-color:#eef6e0;"><strong>8.84</strong></td><td align="center" style="background-color:#f3f4f6;border-left:2px solid #cbd5e1;"><strong>11.14</strong></td><td align="center" style="background-color:#f3f4f6;"><strong>10.67</strong></td><td align="center" style="background-color:#f3f4f6;"><strong>10.05</strong></td><td align="center" style="background-color:#f3f4f6;"><strong>9.63</strong></td><td align="center" style="background-color:#f3f4f6;"><strong>9.21</strong></td></tr>
548
+ </tbody>
549
+ </table>
550
+
551
+ ### Broad-coverage (13 locales)
552
+
553
+ _Languages are ordered by accuracy (lowest WER first)._
554
+
555
+ <table>
556
+ <thead>
557
+ <tr><th rowspan="2" align="left">Language</th><th colspan="5" align="center" style="background-color:#76b900;color:#ffffff">Language Input (LangID)</th><th colspan="5" align="center" style="background-color:#6b7280;color:#ffffff;border-left:2px solid #cbd5e1;">Auto-detect</th></tr>
558
+ <tr><th align="center" style="background-color:#eef6e0">80ms</th><th align="center" style="background-color:#eef6e0">160ms</th><th align="center" style="background-color:#eef6e0">320ms</th><th align="center" style="background-color:#eef6e0">560ms</th><th align="center" style="background-color:#eef6e0">1.12s</th><th align="center" style="background-color:#f3f4f6;border-left:2px solid #cbd5e1;">80ms</th><th align="center" style="background-color:#f3f4f6;">160ms</th><th align="center" style="background-color:#f3f4f6;">320ms</th><th align="center" style="background-color:#f3f4f6;">560ms</th><th align="center" style="background-color:#f3f4f6;">1.12s</th></tr>
559
+ </thead>
560
+ <tbody>
561
+ <tr><td align="left">Polish (pl-PL)</td><td align="center" style="background-color:#eef6e0;">19.88</td><td align="center" style="background-color:#eef6e0;">18.92</td><td align="center" style="background-color:#eef6e0;">17.48</td><td align="center" style="background-color:#eef6e0;">16.61</td><td align="center" style="background-color:#eef6e0;">15.15</td><td align="center" style="background-color:#f3f4f6;border-left:2px solid #cbd5e1;">22.65</td><td align="center" style="background-color:#f3f4f6;">21.63</td><td align="center" style="background-color:#f3f4f6;">20.05</td><td align="center" style="background-color:#f3f4f6;">18.52</td><td align="center" style="background-color:#f3f4f6;">16.55</td></tr>
562
+ <tr><td align="left">Norwegian Bokmål (nb-NO)</td><td align="center" style="background-color:#eef6e0;">20.43</td><td align="center" style="background-color:#eef6e0;">20.07</td><td align="center" style="background-color:#eef6e0;">18.90</td><td align="center" style="background-color:#eef6e0;">18.44</td><td align="center" style="background-color:#eef6e0;">18.10</td><td align="center" style="background-color:#f3f4f6;border-left:2px solid #cbd5e1;">20.91</td><td align="center" style="background-color:#f3f4f6;">20.19</td><td align="center" style="background-color:#f3f4f6;">19.29</td><td align="center" style="background-color:#f3f4f6;">18.76</td><td align="center" style="background-color:#f3f4f6;">18.01</td></tr>
563
+ <tr><td align="left">Finnish (fi-FI)</td><td align="center" style="background-color:#eef6e0;">21.19</td><td align="center" style="background-color:#eef6e0;">20.57</td><td align="center" style="background-color:#eef6e0;">20.05</td><td align="center" style="background-color:#eef6e0;">18.94</td><td align="center" style="background-color:#eef6e0;">18.34</td><td align="center" style="background-color:#f3f4f6;border-left:2px solid #cbd5e1;">21.61</td><td align="center" style="background-color:#f3f4f6;">20.88</td><td align="center" style="background-color:#f3f4f6;">20.40</td><td align="center" style="background-color:#f3f4f6;">19.36</td><td align="center" style="background-color:#f3f4f6;">18.72</td></tr>
564
+ <tr><td align="left">Mandarin (zh-CN)</td><td align="center" style="background-color:#eef6e0;">20.56</td><td align="center" style="background-color:#eef6e0;">20.22</td><td align="center" style="background-color:#eef6e0;">20.03</td><td align="center" style="background-color:#eef6e0;">19.51</td><td align="center" style="background-color:#eef6e0;">19.28</td><td align="center" style="background-color:#f3f4f6;border-left:2px solid #cbd5e1;">22.45</td><td align="center" style="background-color:#f3f4f6;">21.07</td><td align="center" style="background-color:#f3f4f6;">20.59</td><td align="center" style="background-color:#f3f4f6;">20.40</td><td align="center" style="background-color:#f3f4f6;">19.87</td></tr>
565
+ <tr><td align="left">Czech (cs-CZ)</td><td align="center" style="background-color:#eef6e0;">24.18</td><td align="center" style="background-color:#eef6e0;">23.20</td><td align="center" style="background-color:#eef6e0;">22.41</td><td align="center" style="background-color:#eef6e0;">21.04</td><td align="center" style="background-color:#eef6e0;">20.41</td><td align="center" style="background-color:#f3f4f6;border-left:2px solid #cbd5e1;">25.81</td><td align="center" style="background-color:#f3f4f6;">25.12</td><td align="center" style="background-color:#f3f4f6;">23.68</td><td align="center" style="background-color:#f3f4f6;">22.55</td><td align="center" style="background-color:#f3f4f6;">21.45</td></tr>
566
+ <tr><td align="left">Bulgarian (bg-BG)</td><td align="center" style="background-color:#eef6e0;">24.50</td><td align="center" style="background-color:#eef6e0;">23.58</td><td align="center" style="background-color:#eef6e0;">22.80</td><td align="center" style="background-color:#eef6e0;">21.70</td><td align="center" style="background-color:#eef6e0;">20.53</td><td align="center" style="background-color:#f3f4f6;border-left:2px solid #cbd5e1;">28.28</td><td align="center" style="background-color:#f3f4f6;">27.22</td><td align="center" style="background-color:#f3f4f6;">25.54</td><td align="center" style="background-color:#f3f4f6;">24.05</td><td align="center" style="background-color:#f3f4f6;">21.84</td></tr>
567
+ <tr><td align="left">Slovak (sk-SK)</td><td align="center" style="background-color:#eef6e0;">25.08</td><td align="center" style="background-color:#eef6e0;">24.14</td><td align="center" style="background-color:#eef6e0;">23.73</td><td align="center" style="background-color:#eef6e0;">22.51</td><td align="center" style="background-color:#eef6e0;">21.28</td><td align="center" style="background-color:#f3f4f6;border-left:2px solid #cbd5e1;">27.59</td><td align="center" style="background-color:#f3f4f6;">26.06</td><td align="center" style="background-color:#f3f4f6;">25.61</td><td align="center" style="background-color:#f3f4f6;">24.15</td><td align="center" style="background-color:#f3f4f6;">22.68</td></tr>
568
+ <tr><td align="left">Swedish (sv-SE)</td><td align="center" style="background-color:#eef6e0;">25.61</td><td align="center" style="background-color:#eef6e0;">24.85</td><td align="center" style="background-color:#eef6e0;">23.63</td><td align="center" style="background-color:#eef6e0;">22.72</td><td align="center" style="background-color:#eef6e0;">22.17</td><td align="center" style="background-color:#f3f4f6;border-left:2px solid #cbd5e1;">26.28</td><td align="center" style="background-color:#f3f4f6;">25.56</td><td align="center" style="background-color:#f3f4f6;">24.18</td><td align="center" style="background-color:#f3f4f6;">23.57</td><td align="center" style="background-color:#f3f4f6;">22.53</td></tr>
569
+ <tr><td align="left">Croatian (hr-HR)</td><td align="center" style="background-color:#eef6e0;">27.92</td><td align="center" style="background-color:#eef6e0;">27.09</td><td align="center" style="background-color:#eef6e0;">25.79</td><td align="center" style="background-color:#eef6e0;">24.92</td><td align="center" style="background-color:#eef6e0;">23.97</td><td align="center" style="background-color:#f3f4f6;border-left:2px solid #cbd5e1;">32.13</td><td align="center" style="background-color:#f3f4f6;">31.20</td><td align="center" style="background-color:#f3f4f6;">29.65</td><td align="center" style="background-color:#f3f4f6;">28.95</td><td align="center" style="background-color:#f3f4f6;">27.46</td></tr>
570
+ <tr><td align="left">Romanian (ro-RO)</td><td align="center" style="background-color:#eef6e0;">31.52</td><td align="center" style="background-color:#eef6e0;">30.93</td><td align="center" style="background-color:#eef6e0;">29.04</td><td align="center" style="background-color:#eef6e0;">27.77</td><td align="center" style="background-color:#eef6e0;">25.90</td><td align="center" style="background-color:#f3f4f6;border-left:2px solid #cbd5e1;">34.22</td><td align="center" style="background-color:#f3f4f6;">33.26</td><td align="center" style="background-color:#f3f4f6;">30.97</td><td align="center" style="background-color:#f3f4f6;">29.84</td><td align="center" style="background-color:#f3f4f6;">26.88</td></tr>
571
+ <tr><td align="left">Estonian (et-EE)</td><td align="center" style="background-color:#eef6e0;">29.95</td><td align="center" style="background-color:#eef6e0;">29.66</td><td align="center" style="background-color:#eef6e0;">28.59</td><td align="center" style="background-color:#eef6e0;">27.37</td><td align="center" style="background-color:#eef6e0;">26.35</td><td align="center" style="background-color:#f3f4f6;border-left:2px solid #cbd5e1;">30.58</td><td align="center" style="background-color:#f3f4f6;">30.09</td><td align="center" style="background-color:#f3f4f6;">28.72</td><td align="center" style="background-color:#f3f4f6;">28.03</td><td align="center" style="background-color:#f3f4f6;">27.19</td></tr>
572
+ <tr><td align="left">Danish (da-DK)</td><td align="center" style="background-color:#eef6e0;">32.62</td><td align="center" style="background-color:#eef6e0;">31.51</td><td align="center" style="background-color:#eef6e0;">30.00</td><td align="center" style="background-color:#eef6e0;">28.92</td><td align="center" style="background-color:#eef6e0;">27.49</td><td align="center" style="background-color:#f3f4f6;border-left:2px solid #cbd5e1;">33.15</td><td align="center" style="background-color:#f3f4f6;">31.77</td><td align="center" style="background-color:#f3f4f6;">30.22</td><td align="center" style="background-color:#f3f4f6;">29.33</td><td align="center" style="background-color:#f3f4f6;">27.81</td></tr>
573
+ <tr><td align="left">Hungarian (hu-HU)</td><td align="center" style="background-color:#eef6e0;">32.70</td><td align="center" style="background-color:#eef6e0;">32.03</td><td align="center" style="background-color:#eef6e0;">30.92</td><td align="center" style="background-color:#eef6e0;">29.72</td><td align="center" style="background-color:#eef6e0;">28.68</td><td align="center" style="background-color:#f3f4f6;border-left:2px solid #cbd5e1;">33.40</td><td align="center" style="background-color:#f3f4f6;">32.39</td><td align="center" style="background-color:#f3f4f6;">31.49</td><td align="center" style="background-color:#f3f4f6;">30.20</td><td align="center" style="background-color:#f3f4f6;">29.18</td></tr>
574
+ <tr><td align="left"><strong>Average</strong></td><td align="center" style="background-color:#eef6e0;"><strong>25.86</strong></td><td align="center" style="background-color:#eef6e0;"><strong>25.14</strong></td><td align="center" style="background-color:#eef6e0;"><strong>24.11</strong></td><td align="center" style="background-color:#eef6e0;"><strong>23.09</strong></td><td align="center" style="background-color:#eef6e0;"><strong>22.13</strong></td><td align="center" style="background-color:#f3f4f6;border-left:2px solid #cbd5e1;"><strong>27.62</strong></td><td align="center" style="background-color:#f3f4f6;"><strong>26.65</strong></td><td align="center" style="background-color:#f3f4f6;"><strong>25.41</strong></td><td align="center" style="background-color:#f3f4f6;"><strong>24.44</strong></td><td align="center" style="background-color:#f3f4f6;"><strong>23.09</strong></td></tr>
575
+ </tbody>
576
+ </table>
577
+
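The **Average** row of the broad-coverage table can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes the row is an unweighted (macro) mean over the 13 locales, which the card does not state explicitly, and it copies the 80 ms columns from the table above. The variable and function names are hypothetical.

```python
# WER values (%) at the 80 ms chunk size, copied row by row from the
# broad-coverage table (Polish ... Hungarian).
langid_80ms = [19.88, 20.43, 21.19, 20.56, 24.18, 24.50, 25.08,
               25.61, 27.92, 31.52, 29.95, 32.62, 32.70]
auto_80ms = [22.65, 20.91, 21.61, 22.45, 25.81, 28.28, 27.59,
             26.28, 32.13, 34.22, 30.58, 33.15, 33.40]


def macro_average(values):
    """Unweighted mean: every locale counts equally (an assumption)."""
    return sum(values) / len(values)


langid_avg = macro_average(langid_80ms)
auto_avg = macro_average(auto_80ms)

# Relative WER increase when the language is auto-detected instead of given.
relative_increase = (auto_avg - langid_avg) / langid_avg * 100

print(f"LangID average:      {langid_avg:.2f}")
print(f"Auto-detect average: {auto_avg:.2f}")
print(f"Relative increase:   {relative_increase:.1f}%")
```

Under that assumption the two means round to 25.86 and 27.62, matching the 80 ms entries of the Average row. The same check can be repeated for the other chunk sizes.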
578
+ ### Adaptation-ready languages (fine-tune to enable)
579
+
580
+ These **8 language-locales** are recognized by the tokenizer but are not tuned for production transcription out of the box: **Greek (el-GR), Hebrew (he-IL), Lithuanian (lt-LT), Slovenian (sl-SI), Latvian (lv-LV), Maltese (mt-MT), Thai (th-TH), and Norwegian Nynorsk (nn-NO)**. Fine-tuning on in-domain data is recommended to bring them to production quality.
581
+
582
+ See our [blog post](https://huggingface.co/blog/nvidia/fine-tuning-nemotron-35-asr) on **how to fine-tune Nemotron 3.5 ASR to improve these languages**, which includes before/after results.
583
+
584
+
585
+ ---
586
+
587
+ ## Ethical Considerations
588
+
589
+ NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
590
+
591
+ Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
592
+
593
+ ---
models/nemotron-3.5-asr-streaming-0.6b/arch_slide10.png ADDED
models/nemotron-3.5-asr-streaming-0.6b/avg_wer_summary.png ADDED
models/nemotron-3.5-asr-streaming-0.6b/bias.md ADDED
@@ -0,0 +1,9 @@
1
+ Field | Response
2
+ :---------------------------------------------------------------------------------------------------|:---------------
3
+ What is the language balance of the model validation data? | en-US: 25.63%, es-ES: 10.50%, pt-PT: 5.26%, hi-IN: 5.10%, fr-FR: 4.56%, de-DE: 4.28%, pt-BR: 3.14%, mt-MT: 2.83%, hu-HU: 2.82%, ro-RO: 2.81%, bg-BG: 2.78%, el-GR: 2.77%, lt-LT: 2.26%, fi-FI: 2.26%, it-IT: 2.14%, zh-CN: 2.14%, lv-LV: 2.06%, ja-JP: 1.98%, sk-SK: 1.90%, ru-RU: 1.76%, ko-KR: 1.72%, et-EE: 1.67%, sl-SI: 1.59%, ar-AR: 1.57%, es-US: 1.43%, he-IL: 0.85%, vi-VN: 0.83%, fr-CA: 0.32%, nl-NL: 0.21%, tr-TR: 0.19%, en-GB: 0.14%, pl-PL: 0.14%, uk-UA: 0.11%, th-TH: 0.09%, hr-HR: 0.04%, cs-CZ: 0.03%, da-DK: 0.03%, nb-NO: 0.03%, sv-SE: 0.03%, nn-NO: 0.00%
4
+ What is the geographic origin language balance of the model validation data? | Europe: 55.01%, North America: 27.38%, Asia: 11.86%, South America: 3.14%, Middle East: 2.61%
5
+ What is the accent balance of the model validation data? | en-US: 25.63%, es-ES: 10.50%, pt-PT: 5.26%, hi-IN: 5.10%, fr-FR: 4.56%, de-DE: 4.28%, pt-BR: 3.14%, mt-MT: 2.83%, hu-HU: 2.82%, ro-RO: 2.81%, bg-BG: 2.78%, el-GR: 2.77%, lt-LT: 2.26%, fi-FI: 2.26%, it-IT: 2.14%, zh-CN: 2.14%, lv-LV: 2.06%, ja-JP: 1.98%, sk-SK: 1.90%, ru-RU: 1.76%, ko-KR: 1.72%, et-EE: 1.67%, sl-SI: 1.59%, ar-AR: 1.57%, es-US: 1.43%, he-IL: 0.85%, vi-VN: 0.83%, fr-CA: 0.32%, nl-NL: 0.21%, tr-TR: 0.19%, en-GB: 0.14%, pl-PL: 0.14%, uk-UA: 0.11%, th-TH: 0.09%, hr-HR: 0.04%, cs-CZ: 0.03%, da-DK: 0.03%, nb-NO: 0.03%, sv-SE: 0.03%, nn-NO: 0.00%
6
+ Participation considerations from adversely impacted groups ([protected classes](https://www.senate.ca.gov/protected-classes)) in model design and testing: | Age, Gender, Linguistic Background
7
+ Measures taken to mitigate against unwanted bias: | Used a custom dataset to evaluate model performance across genders, age groups, and linguistic backgrounds.
8
+
9
+
models/nemotron-3.5-asr-streaming-0.6b/explainability.md ADDED
@@ -0,0 +1,15 @@
1
+ Field | Response
2
+ :------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------
3
+ Intended Task/Domain: | Speech Transcription
4
+ Model Type: | FastConformer-CacheAware-RNNT
5
+ Intended Users: | This model is intended for developers and data scientists building interactive call centers, virtual assistants, and language learning assistants.
6
+ Output: | Transcribed text with timestamps and confidence scores
7
+ Describe how the model works:                                                                         |  The model transcribes audio input into text in the language of the input audio.
8
+ Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: | Age, Gender, National Origin
9
+ Technical Limitations & Mitigation: | Transcripts may not be 100% accurate. Accuracy varies depending on the characteristics of the input audio, such as domain, use case, accent, noise, speech type, and speech context.
10
+ Verified to have met prescribed NVIDIA quality standards: | Yes
11
+ Performance Metrics:                                                                                  |  Word Error Rate (WER), Silence Robustness (characters per minute of silent audio), Latency (in milliseconds), Throughput (total audio processed per unit of time)
12
+ Potential Known Risks: | Not recommended for word-for-word transcription as accuracy varies based on the characteristics of input audio (domain, use case, accent, noise, speech type, and context of speech)
13
+ Licensing: | Governing Terms: Use of the model is governed by the [OpenMDW-1.1](https://openmdw.ai/license/1-1/) license.
14
+
15
+
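For readers unfamiliar with the Word Error Rate metric listed above, here is a minimal sketch of the standard definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. It is not NVIDIA's evaluation code. Splitting on whitespace is an assumption, and real evaluations usually normalize text (case, punctuation) first, which is not shown here. The function name is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Dynamic-programming edit distance over words, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"{wer * 100:.2f}%")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.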
models/nemotron-3.5-asr-streaming-0.6b/fleurs_langid_vs_auto.png ADDED
models/nemotron-3.5-asr-streaming-0.6b/fleurs_wer_vs_chunk_size.png ADDED
models/nemotron-3.5-asr-streaming-0.6b/latency_vs_parallel.png ADDED

Git LFS Details

  • SHA256: 3e5636204786c68914f69c3f3158b4e698774b6e0dbd9120dbd996947e396638
  • Pointer size: 131 Bytes
  • Size of remote file: 139 kB
models/nemotron-3.5-asr-streaming-0.6b/model_architecture.png ADDED

Git LFS Details

  • SHA256: 3146643d1a7c8dd424adcb221a5dabbda6951da1ef3937f7d15b29e46e5fa272
  • Pointer size: 131 Bytes
  • Size of remote file: 151 kB
models/nemotron-3.5-asr-streaming-0.6b/model_overview.png ADDED

Git LFS Details

  • SHA256: e4fa275d1d6cb0b01064df92c051dc0888dc63d18a8b3c8ae9b35268f2c89427
  • Pointer size: 131 Bytes
  • Size of remote file: 205 kB
models/nemotron-3.5-asr-streaming-0.6b/nemotron-3.5-asr-streaming-0.6b.nemo ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:210214ed94039bf6bfbb9a047c7fa289628db75b103e2bf6381fa78285436a74
3
+ size 2368284501
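The three lines above are a Git LFS pointer: the repository stores only the spec version, the SHA-256 object ID, and the byte size, while the `.nemo` checkpoint itself lives in LFS storage. As a hedged sketch, a downloaded file could be checked against the pointer as follows. The helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical, not part of any Git LFS tooling.

```python
import hashlib
from pathlib import Path

# Pointer text copied from the diff above.
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:210214ed94039bf6bfbb9a047c7fa289628db75b103e2bf6381fa78285436a74
size 2368284501
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file's size and hash agree with the pointer."""
    path = Path(path)
    if path.stat().st_size != pointer["size"]:
        return False  # cheap size check before hashing
    hasher = hashlib.new(pointer["algo"])
    with path.open("rb") as handle:
        # Read in chunks so multi-gigabyte files are never loaded whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER)
print(pointer["algo"], pointer["size"])
```

Once the checkpoint has been downloaded, `matches_pointer("nemotron-3.5-asr-streaming-0.6b.nemo", pointer)` would confirm that the local file is the object the pointer refers to. The local path here is an assumption.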
models/nemotron-3.5-asr-streaming-0.6b/privacy.md ADDED
@@ -0,0 +1,17 @@
1
+ Field | Response
2
+ :----------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------
3
+ Generatable or reverse engineerable personal data? | No
4
+ Personal data used to create this model? | Yes - Voice
5
+ Was consent obtained for any personal data used? | Yes
6
+ Is a mechanism in place to honor data subject right of access or deletion of personal data? | Yes
7
+ If personal data was collected for the development of the model, was it collected directly by NVIDIA? | Yes
8
+ If personal data was collected for the development of the model by NVIDIA, do you maintain or have access to disclosures made to data subjects? | Yes
9
+ If personal data was collected for the development of this AI model, was it minimized to only what was required? | Yes
10
+ Is there provenance for all datasets used in training? | Yes
11
+ Does data labeling (annotation, metadata) comply with privacy laws? | Yes
12
+ Is data compliant with data subject requests for data correction or removal, if such a request was made? | No, not possible with externally-sourced data.
13
+ Applicable Privacy Policy | https://www.nvidia.com/en-us/about-nvidia/privacy-policy/
14
+ How often is the dataset reviewed?                                                                                                    | The dataset is reviewed when it is first added; later reviews are conducted as needed or when changes are requested.
15
+ Was data from user interactions with the AI model (e.g. user input and prompts) used to train the model? | No
16
+
17
+
models/nemotron-3.5-asr-streaming-0.6b/safety.md ADDED
@@ -0,0 +1,8 @@
1
+ Field | Response
2
+ :---------------------------------------------------|:----------------------------------
3
+ Model Application Field(s): | Speech Transcription
4
+ Describe the life critical impact (if present). | Not Applicable
5
+ Use Case Restrictions: | Abide by Governing Terms: Use of the model is governed by the [OpenMDW-1.1](https://openmdw.ai/license/1-1/) license.
6
+ Model and dataset restrictions: | The principle of least privilege (PoLP) is applied, limiting access for dataset generation and model development. Restrictions enforce dataset access during training, and dataset license constraints are adhered to.
7
+
8
+
models/nemotron-3.5-asr-streaming-0.6b/throughput_vs_chunk.png ADDED