Model not performing on audio longer than 30 seconds

#1
by abhayatwork - opened

I am getting incomplete transcripts when I use the model on audio longer than 30 seconds. Is this expected behavior from the model?

Yes, the model processes at most 30 seconds of audio at a time. You can manually split your audio into 30-second segments, transcribe each one, and concatenate the transcriptions afterward.
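A minimal sketch of the manual splitting step, assuming a mono waveform sampled at 16 kHz (the rate Whisper-style models expect); the function name and constants here are illustrative, not from the model's API:

```python
import numpy as np

SAMPLE_RATE = 16_000            # assumed sample rate for the model
CHUNK_SECONDS = 30
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS

def split_audio(waveform: np.ndarray) -> list:
    """Split a mono waveform into consecutive segments of at most 30 seconds."""
    return [waveform[i:i + CHUNK_SAMPLES]
            for i in range(0, len(waveform), CHUNK_SAMPLES)]

# Example: 75 seconds of audio yields chunks of 30 s, 30 s, and 15 s.
audio = np.zeros(SAMPLE_RATE * 75, dtype=np.float32)
chunks = split_audio(audio)
print([len(c) / SAMPLE_RATE for c in chunks])  # [30.0, 30.0, 15.0]
```

Each chunk can then be fed to the model separately and the resulting texts joined in order. Note that naive fixed-length cuts can split a word across a boundary, which is one reason to prefer the pipeline or a VAD-based splitter.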

Alternatively, you can use the Hugging Face pipeline for automatic speech recognition (pipeline("automatic-speech-recognition")), which handles the chunking and stitching internally, making the process much simpler and more seamless.
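A sketch of the pipeline approach, assuming a Whisper checkpoint (the model id and file name below are placeholders; substitute the checkpoint from this model page). The `chunk_length_s` argument is what enables long-form transcription:

```python
from transformers import pipeline

# Placeholder model id - replace with the checkpoint from this model page.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    chunk_length_s=30,   # process long audio in 30-second windows
)

# The pipeline chunks, transcribes, and stitches the segments internally.
result = asr("long_audio.wav")
print(result["text"])
```

This avoids hand-rolled splitting and handles overlap between chunks for you, at the cost of downloading the model weights on first use.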

Thanks, I suspected as much and was looking for a mention of it on the model page. You should document it there. I implemented VAD + ASR as an alternative.

abhayatwork changed discussion status to closed
