Model not performing on audio longer than 30 seconds

#1
by abhayatwork - opened

I am getting incomplete transcripts when I use the model on audio longer than 30 seconds. Is this expected behavior from the model?

Yes, the model processes at most 30 seconds of audio at a time. You can manually split your audio into 30-second segments, transcribe each one, and concatenate the transcriptions afterward.
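A minimal sketch of the manual splitting step, assuming a mono waveform sampled at 16 kHz (the rate Whisper-style models expect); the function name and constants here are illustrative, not from the model's API:

```python
import numpy as np

SAMPLE_RATE = 16_000            # assumed sample rate for the model
CHUNK_SECONDS = 30
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS

def split_audio(waveform: np.ndarray) -> list:
    """Split a mono waveform into consecutive segments of at most 30 seconds."""
    return [waveform[i:i + CHUNK_SAMPLES]
            for i in range(0, len(waveform), CHUNK_SAMPLES)]

# Example: 75 seconds of audio yields chunks of 30 s, 30 s, and 15 s.
audio = np.zeros(SAMPLE_RATE * 75, dtype=np.float32)
chunks = split_audio(audio)
print([len(c) / SAMPLE_RATE for c in chunks])  # [30.0, 30.0, 15.0]
```

Each chunk can then be fed to the model separately and the resulting texts joined in order. Note that naive fixed-length cuts can split a word across a boundary, which is one reason to prefer the pipeline or a VAD-based splitter.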

Alternatively, you can use the Hugging Face pipeline for automatic speech recognition (pipeline("automatic-speech-recognition")), which handles the chunking and stitching internally, making the process much simpler and more seamless.
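A sketch of the pipeline approach, assuming a Whisper checkpoint (the model id and file name below are placeholders; substitute the checkpoint from this model page). The `chunk_length_s` argument is what enables long-form transcription:

```python
from transformers import pipeline

# Placeholder model id - replace with the checkpoint from this model page.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    chunk_length_s=30,   # process long audio in 30-second windows
)

# The pipeline chunks, transcribes, and stitches the segments internally.
result = asr("long_audio.wav")
print(result["text"])
```

This avoids hand-rolled splitting and handles overlap between chunks for you, at the cost of downloading the model weights on first use.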

Thanks, I suspected as much and was looking for a mention of it on the model page. You should document it there. I implemented VAD + ASR as an alternative.

abhayatwork changed discussion status to closed
