Issue with transcribing audio as a list when using timestamps
I ran into an issue when transcribing audio passed as a list to a model with timestamps enabled. I get the following error:
AttributeError: 'tuple' object has no attribute 'audio'
It is raised by this line:
audio=[audio.squeeze()[:audio_len] for audio, audio_len in zip(batch.audio, batch.audio_lens)]
Without timestamps, the model transcribes audio in this format without any problem.
I’m wondering how to use the model with timestamps for this type of audio input, i.e. a list of numpy arrays or torch tensors.
Has anyone run into this problem or handled this scenario before?
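One possible workaround, sketched here and not tested against the model itself, is to write each array to a temporary WAV file and pass the file paths instead of the arrays. The helper names write_wav and arrays_to_wav_files are hypothetical, and the 16 kHz mono format with samples as floats in [-1, 1] is an assumption. It also assumes that timestamps work when the input is a list of file paths.

```python
import array
import os
import wave


def write_wav(samples, path, sample_rate=16000):
    """Write a 1-D sequence of floats in [-1, 1] as 16-bit mono PCM."""
    # Clip each sample, then scale to the signed 16-bit range.
    pcm = array.array(
        "h", (int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples)
    )
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)   # mono (assumption)
        wf.setsampwidth(2)   # 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(pcm.tobytes())
    return path


def arrays_to_wav_files(arrays, out_dir, sample_rate=16000):
    """Write every array to out_dir and return the list of file paths."""
    return [
        write_wav(a, os.path.join(out_dir, f"utt_{i}.wav"), sample_rate)
        for i, a in enumerate(arrays)
    ]


# Intended usage (assumed, not verified here):
#   paths = arrays_to_wav_files(audio_list, some_temp_dir)
#   hypotheses = model.transcribe(paths, timestamps=True)
```

Iterating element by element works for plain lists, 1-D numpy arrays, and 1-D torch tensors, but it is slow for long recordings; an audio I/O library would be the faster choice if one is already installed.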
in EncDecMultiTaskModel._transcribe_output_processing(self, outputs, trcfg)
1045 del enc_states, enc_mask, decoder_input_ids
1047 if trcfg.timestamps and self.timestamps_asr_model is not None:
1048 hypotheses = get_forced_aligned_timestamps_with_external_model(
-> 1049 audio=[audio.squeeze()[:audio_len] for audio, audio_len in zip(batch.audio, batch.audio_lens)],
1050 batch_size=len(batch.audio),
1051 external_ctc_model=self.timestamps_asr_model,
1052 main_model_predictions=hypotheses,
1053 timestamp_type='char' if merge_to_be_done else ['word', 'segment'],
1054 viterbi_device=trcfg._internal.device,
1055 )
1056 elif trcfg.timestamps:
1057 hypotheses = process_aed_timestamp_outputs(
1058 hypotheses, self.encoder.subsampling_factor, self.cfg['preprocessor']['window_stride']
1059 )
AttributeError: 'tuple' object has no attribute 'audio'
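The message suggests that batch is a plain tuple when timestamps are requested for list input, while the code at line 1049 expects an object with audio and audio_lens attributes. A minimal sketch that reproduces the same error outside the library is below. Batch and trim are hypothetical stand-ins, and plain lists replace tensors, so the squeeze() call is left out.

```python
from typing import NamedTuple


class Batch(NamedTuple):
    # Hypothetical stand-in for a batch object that exposes named fields.
    audio: list
    audio_lens: list


def trim(batch):
    # Mirrors the list comprehension from the traceback, using plain lists.
    return [a[:n] for a, n in zip(batch.audio, batch.audio_lens)]


named = Batch(audio=[[0.1, 0.2, 0.0], [0.3, 0.0, 0.0]], audio_lens=[2, 1])
plain = ([[0.1, 0.2, 0.0], [0.3, 0.0, 0.0]], [2, 1])

print(trim(named))  # attribute access works on the named batch

try:
    trim(plain)  # a plain tuple has no .audio attribute
except AttributeError as err:
    print(err)
```

If this reading is right, the two values would have to be unpacked by position (audio, audio_lens = batch) for the tuple case, but that is a guess about the library internals and not a confirmed fix.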