Kunal Dhawan
Recent Activity
Add 🤗 Transformers support
Add 🤗 Transformers support
Add 🤗 Transformers support
docs(readme): add 🤗 Transformers usage + transformers/hf-asr-leaderboard tags
Add 🤗 Transformers support
Thanks for the feedback, @alvesman! We released nemotron-3.5-asr-streaming-0.6b today. It supports 40 language-locales across the globe and is based on the same cache-aware streaming architecture discussed in this blog. Please test it out and let us know how it goes.
Code-switching?
How to Fine-Tune Nemotron 3.5 ASR for Your Language, Domain, or Accent
docs(readme): add NIM Try via API (Nemotron ASR Streaming)
Deploy Streaming nemotron speech model
Yes, @Amirjab21, all the code is open-sourced :)
Training script: https://github.com/NVIDIA-NeMo/NeMo/blob/main/examples/asr/asr_transducer/speech_to_text_rnnt_bpe.py
Streaming config: https://github.com/NVIDIA-NeMo/NeMo/blob/main/examples/asr/conf/fastconformer/cache_aware_streaming/fastconformer_ctc_bpe_streaming.yaml
Inference script: https://github.com/NVIDIA-NeMo/NeMo/blob/main/examples/asr/asr_cache_aware_streaming/speech_to_text_cache_aware_streaming_infer.py
Thanks for raising this, @Amirjab21. As discussed and confirmed in the Hugging Face model page thread, the model’s forward pass maintains a fixed-size encoder cache and a fixed-size RNN-T decoder hidden state. Both are independent of the total audio duration and do not grow with input length.
After retesting, we’re glad to see that you no longer observe a degradation in inference speed as audio length increases. This aligns with the intended design and expected performance characteristics of the cache-aware streaming architecture.
Thanks again for taking the time to investigate and share your findings, and please feel free to reach out if you encounter any other issues or have additional questions.
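For anyone who wants to see why fixed-size state implies flat per-chunk cost, here is a minimal toy sketch. It is not NeMo code: the class `ToyStreamingState`, the helper `state_size_after`, and the sizes (8 cached frames, 4 hidden values) are all hypothetical, chosen only to illustrate that the carried state holds the same number of values after 10 chunks as after 1000.

```python
from collections import deque


class ToyStreamingState:
    """Hypothetical stand-in for the state a streaming model carries between chunks."""

    def __init__(self, cache_frames=8, decoder_hidden=4):
        # Bounded cache: once full, appending a frame drops the oldest one.
        self.encoder_cache = deque(maxlen=cache_frames)
        # Fixed-length vector standing in for the decoder hidden state.
        self.decoder_state = [0.0] * decoder_hidden

    def step(self, chunk):
        """Consume one chunk of frames and update the carried state in place."""
        self.encoder_cache.extend(chunk)
        mean = sum(chunk) / len(chunk)
        # Toy update rule: the values change, the length does not.
        self.decoder_state = [0.9 * h + 0.1 * mean for h in self.decoder_state]

    def num_values(self):
        return len(self.encoder_cache) + len(self.decoder_state)


def state_size_after(num_chunks, chunk_len=16):
    """Stream `num_chunks` chunks and report how many values the state holds."""
    state = ToyStreamingState()
    for i in range(num_chunks):
        state.step([float(i)] * chunk_len)
    return state.num_values()


# Same state size for a short stream and one 100x longer.
assert state_size_after(10) == state_size_after(1000) == 12
```

Because the amount of state read and written per step is constant, the work per chunk in this toy does not depend on how much audio has already been streamed.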
Does decoding efficiency decrease as the audio length increases?
Smaller model planned?
Can we expect an ONNX quant?
Multilingual version planned?
Thank you for the question, @Amirjab21! This is one of the key advantages of a native streaming model. The audio is not processed in a single pass over the full input; instead, it is consumed incrementally in small chunks as they arrive, with the relevant context preserved in the model’s cache. This design lets the model handle arbitrarily long audio streams without an explicit duration limit: context is carried forward through the cache, and computation is performed only on the new incoming frames, rather than reprocessing the entire audio or chunking it to a fixed maximum length.
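The idea can be sketched with a toy example. This is not the model's encoder: a causal moving average stands in for the real layers, and the names `offline_causal_average`, `streaming_causal_average`, and the `CONTEXT` size are hypothetical. It only shows that processing chunk by chunk with a small carried cache can reproduce the full-pass result while touching each frame once.

```python
from collections import deque
from typing import Iterable, Iterator, List

CONTEXT = 4  # hypothetical number of past frames kept in the cache


def offline_causal_average(frames: List[float], context: int = CONTEXT) -> List[float]:
    """Reference: one pass over the full input, averaging each frame with its past context."""
    out = []
    for i in range(len(frames)):
        window = frames[max(0, i - context): i + 1]
        out.append(sum(window) / len(window))
    return out


def streaming_causal_average(
    chunks: Iterable[List[float]], context: int = CONTEXT
) -> Iterator[List[float]]:
    """Consume chunks as they arrive; past context comes from a fixed-size cache."""
    cache = deque(maxlen=context)  # oldest frames fall out automatically
    for chunk in chunks:
        out = []
        for frame in chunk:
            window = list(cache) + [frame]
            out.append(sum(window) / len(window))
            cache.append(frame)  # carry context forward to later frames and chunks
        yield out


# Usage: split a signal into chunks of 8 frames and compare with the full pass.
frames = [float(i % 7) for i in range(50)]
chunks = [frames[i:i + 8] for i in range(0, len(frames), 8)]
streamed = [y for out in streaming_causal_average(chunks) for y in out]
assert streamed == offline_causal_average(frames)
```

The stream can be any length, including an unbounded generator, since only the cache and the current chunk are ever held in memory.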