Model not performing on audio beyond 30 sec
#1 by abhayatwork - opened
I am getting incomplete transcripts when running the model on audio longer than 30 seconds. Is this expected behavior?
Yes, the model processes audio in chunks of at most 30 seconds. You can manually split your audio into 30-second segments, transcribe each one, and concatenate the transcriptions afterward.
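A minimal sketch of the manual approach, assuming the audio is already decoded to a flat list of samples at 16 kHz (the sample rate and the `transcribe()` call at the end are placeholders, not part of any specific API):

```python
SAMPLE_RATE = 16_000           # assumed sample rate of the decoded audio
CHUNK_SECONDS = 30
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS

def split_into_chunks(samples):
    """Split a 1-D sequence of audio samples into 30-second segments."""
    return [samples[i:i + CHUNK_SAMPLES]
            for i in range(0, len(samples), CHUNK_SAMPLES)]

# 75 seconds of silence as a stand-in signal
audio = [0.0] * (SAMPLE_RATE * 75)
chunks = split_into_chunks(audio)
print(len(chunks))  # 3 segments: 30 s + 30 s + 15 s

# Then transcribe each segment and join the results, e.g.:
# transcript = " ".join(transcribe(chunk) for chunk in chunks)
```

Note that hard cuts every 30 seconds can split a word at a boundary, which is one reason the pipeline approach below (or VAD-based segmentation) usually gives cleaner results.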
Alternatively, you can use the Hugging Face pipeline for automatic speech recognition (pipeline("automatic-speech-recognition")), which handles the chunking and stitching internally and makes the whole process much simpler.
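For reference, a sketch of the pipeline route. `chunk_length_s` and `stride_length_s` are real parameters of the Transformers ASR pipeline; the model id and file name below are placeholder assumptions, so swap in whatever checkpoint and audio you are actually using:

```python
from transformers import pipeline

def build_long_form_asr(model_id: str):
    """Build an ASR pipeline that chunks long audio internally."""
    return pipeline(
        "automatic-speech-recognition",
        model=model_id,            # placeholder: use your checkpoint
        chunk_length_s=30,         # process long audio in 30 s windows
        stride_length_s=5,         # overlap windows so boundary words are kept
    )

# Usage (downloads the model, so not executed here):
# asr = build_long_form_asr("openai/whisper-small")
# print(asr("long_audio.wav")["text"])
```

The stride makes adjacent windows overlap, and the pipeline merges the overlapping transcriptions so words falling on a chunk boundary are not truncated.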
Thanks, I anticipated this and was looking for a mention of it on the model page; it would be worth documenting there. I ended up implementing VAD + ASR as an alternative.
abhayatwork changed discussion status to closed