This is super interesting work - thanks for sharing your empirical results as well.
On the HF demo, it says that there is no duration limit for the ASR model. Does this mean that the ASR model itself applies a sliding window over long audio? Or is this handled by an inference pipeline that splits the audio and feeds the model sequences of at most its maximum length?