Thanks for raising this, @Amirjab21. As discussed and confirmed in the Hugging Face model page thread, the model's forward pass maintains a fixed-size encoder cache and a fixed-size RNN-T decoder hidden state; neither grows with the total audio duration, so per-chunk cost stays constant regardless of input length.
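For readers who land here later, here is a minimal sketch of the idea in plain PyTorch. The dimensions and the `stream_step` function below are hypothetical, chosen for illustration only, and are not the model's actual API; the real cache sizes come from the model's config. The RNN-T decoder's recurrent state is fixed-size for the same reason.

```python
import torch

# Hypothetical sizes for illustration only; the real model's cache
# dimensions are defined by its config, not these values.
CHUNK_FRAMES = 40   # new feature frames consumed per streaming step
CACHE_FRAMES = 70   # fixed left-context window carried between steps
FEAT_DIM = 80

def stream_step(chunk: torch.Tensor, cache: torch.Tensor) -> torch.Tensor:
    """Process one chunk with a fixed-size rolling cache.

    The cache keeps only the last CACHE_FRAMES frames of context,
    so its shape never changes no matter how much audio has been
    consumed so far.
    """
    context = torch.cat([cache, chunk], dim=0)  # [CACHE + CHUNK, FEAT]
    # ... encoder forward over `context` would run here ...
    return context[-CACHE_FRAMES:]              # slice back to a fixed window

cache = torch.zeros(CACHE_FRAMES, FEAT_DIM)
for step in range(1000):  # arbitrary stream length
    chunk = torch.randn(CHUNK_FRAMES, FEAT_DIM)
    cache = stream_step(chunk, cache)
    # State size is constant in stream length, i.e. O(1) per chunk.
    assert cache.shape == (CACHE_FRAMES, FEAT_DIM)
```

Because the carried state is sliced back to a fixed window every step, per-chunk memory and compute do not depend on how long the stream has been running, which is why inference speed should not degrade as audio length increases.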
We're glad that, after retesting, you no longer observe a degradation in inference speed as audio length increases. This matches the intended design and expected performance characteristics of the cache-aware streaming architecture.
Thanks again for taking the time to investigate and share your findings, and please feel free to reach out if you encounter any other issues or have additional questions.