Thank you for the question, @Amirjab21! This is one of the key advantages of a native streaming model. The audio is not processed in a single pass over the full input; instead, it is consumed incrementally in small chunks as they arrive, with the relevant context preserved in the model's cache. This design lets the model handle arbitrarily long audio streams without an explicit duration limit: context is carried forward through the cache, and computation is performed only on the newly arriving frames, rather than reprocessing the entire audio or chunking it to a fixed maximum length.
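To make the idea concrete, here is a minimal conceptual sketch (not the actual NeMo API; all names, shapes, and constants are illustrative) of cache-aware streaming: a bounded cache carries left context forward, so each step computes only over the new chunk, and per-step cost does not depend on total stream length.

```python
import numpy as np

CHUNK_FRAMES = 16   # frames consumed per streaming step (illustrative)
CACHE_FRAMES = 64   # bounded left context kept in the cache (illustrative)
FEAT_DIM = 80       # feature dimension, e.g. mel bins (illustrative)

def stream_step(chunk: np.ndarray, cache: np.ndarray):
    """Process one new chunk given the cached left context.

    Returns (output_for_this_chunk, updated_cache). Compute cost depends
    only on CHUNK_FRAMES + CACHE_FRAMES, never on total audio length.
    """
    context = np.concatenate([cache, chunk], axis=0)
    output = context.mean(axis=0, keepdims=True)  # stand-in for the real encoder step
    new_cache = context[-CACHE_FRAMES:]           # keep only the bounded context
    return output, new_cache

# Simulate an arbitrarily long stream of feature frames.
cache = np.zeros((CACHE_FRAMES, FEAT_DIM), dtype=np.float32)
for _ in range(1000):  # could run indefinitely
    chunk = np.random.randn(CHUNK_FRAMES, FEAT_DIM).astype(np.float32)
    out, cache = stream_step(chunk, cache)
```

Because the cache is truncated to a fixed size each step, memory and per-step compute stay constant no matter how long the stream runs.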
Hi @kunaldhawan,
Based on my testing, it seems that inference slows down as the audio length increases. I tested a 30-minute audio file, and by the final stage the per-chunk processing time had increased by approximately 10 ms.
Here is my test discussion:
https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b/discussions/9
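For reference, this is roughly how I would check whether per-step latency grows with stream position; the sketch below is hypothetical (`process_chunk` is a stand-in for the model's actual streaming call, and the chunk shapes are illustrative), but the early-vs-late comparison is the measurement that matters.

```python
import time
import numpy as np

def process_chunk(chunk: np.ndarray, state):
    """Placeholder for the model's streaming step."""
    time.sleep(0.001)  # stand-in for real model compute
    return state

state = None
latencies = []
for step in range(100):  # e.g. successive chunks of a 30-minute file
    chunk = np.random.randn(16, 80).astype(np.float32)
    t0 = time.perf_counter()
    state = process_chunk(chunk, state)
    latencies.append((time.perf_counter() - t0) * 1000.0)  # ms per step

# A growing gap between early and late chunks suggests per-step cost is
# not constant (e.g. an unbounded cache or accumulating state somewhere).
print(f"first 10 chunks: {np.mean(latencies[:10]):.2f} ms")
print(f"last 10 chunks:  {np.mean(latencies[-10:]):.2f} ms")
```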