Thanks for raising this, @Amirjab21. As discussed and confirmed in the Hugging Face model page thread, the model's forward pass maintains a fixed-size encoder cache and a fixed-size RNN-T decoder hidden state; neither grows with the total audio duration, so per-chunk cost stays constant regardless of input length.
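For readers who land here later, here is a minimal sketch of the idea in plain PyTorch. The dimensions and the `stream_step` function below are hypothetical, chosen for illustration only, and are not the model's actual API; the real cache sizes come from the model's config. The RNN-T decoder's recurrent state is fixed-size for the same reason.

```python
import torch

# Hypothetical sizes for illustration only; the real model's cache
# dimensions are defined by its config, not these values.
CHUNK_FRAMES = 40   # new feature frames consumed per streaming step
CACHE_FRAMES = 70   # fixed left-context window carried between steps
FEAT_DIM = 80

def stream_step(chunk: torch.Tensor, cache: torch.Tensor) -> torch.Tensor:
    """Process one chunk with a fixed-size rolling cache.

    The cache keeps only the last CACHE_FRAMES frames of context,
    so its shape never changes no matter how much audio has been
    consumed so far.
    """
    context = torch.cat([cache, chunk], dim=0)  # [CACHE + CHUNK, FEAT]
    # ... encoder forward over `context` would run here ...
    return context[-CACHE_FRAMES:]              # slice back to a fixed window

cache = torch.zeros(CACHE_FRAMES, FEAT_DIM)
for step in range(1000):  # arbitrary stream length
    chunk = torch.randn(CHUNK_FRAMES, FEAT_DIM)
    cache = stream_step(chunk, cache)
    # State size is constant in stream length, i.e. O(1) per chunk.
    assert cache.shape == (CACHE_FRAMES, FEAT_DIM)
```

Because the carried state is sliced back to a fixed window every step, per-chunk memory and compute do not depend on how long the stream has been running, which is why inference speed should not degrade as audio length increases.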
We're glad that, after retesting, you no longer observe a degradation in inference speed as audio length increases. This matches the intended design and expected performance characteristics of the cache-aware streaming architecture.
Thanks again for taking the time to investigate and share your findings, and please feel free to reach out if you encounter any other issues or have additional questions.