How do I reduce hallucination, and why do my audios lose the last line after 30-second chunking?

#2
by iamgrootns - opened

So I was trying to do transcription using the north model. I do get transcriptions, but on longer audios the output hallucinates. I am transcribing the audio in chunks of 30 seconds, and the last line of each chunk does not get transcribed; it goes missing from the transcription.

For example, I said "hey, how are you", and someone replied "I have been good, how have you been".
I get "i said hey how are you, and someone said i have been good", but the small trailing part — "how have you been" — is missing from the transcription.
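A likely cause of the dropped tail: a hard cut every 30 s can land mid-sentence, and Whisper-style models tend to drop words truncated at the chunk edge. A minimal sketch of overlap chunking, assuming 16 kHz mono audio in a NumPy array (the function name and the 5 s overlap are my own choices, not from any library):

```python
import numpy as np

def chunk_with_overlap(audio, sr=16000, chunk_s=30, overlap_s=5):
    """Split audio into chunk_s-second windows that overlap by overlap_s seconds,
    so speech cut at one chunk boundary is fully contained in the next chunk."""
    step = (chunk_s - overlap_s) * sr   # hop between chunk starts
    size = chunk_s * sr                 # samples per chunk
    chunks = []
    for start in range(0, len(audio), step):
        chunks.append(audio[start:start + size])
        if start + size >= len(audio):  # this chunk already reaches the end
            break
    return chunks
```

Note that overlapping chunks will transcribe the boundary region twice, so the overlapping text has to be deduplicated when joining. The transformers ASR pipeline can handle all of this for you via `chunk_length_s=30` plus `stride_length_s`, which overlaps chunks and merges the text at the boundaries — usually the easier fix.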

import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

class JiviService:

    def __init__(self):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        print(f"Using JIVI SERVICE device: {self.device}")
        try:
            self.processor = WhisperProcessor.from_pretrained("jiviai/audioX-north-v1")
            self.model = WhisperForConditionalGeneration.from_pretrained("jiviai/audioX-north-v1").to(self.device)
            self.model.config.forced_decoder_ids = None
        except Exception as e:
            print(f"Error loading model: {e}")
            raise RuntimeError(f"Could not load the transcription model: {e}")

This is how I am using the model. I have two functions: one for transcribing the audio and one for transcribing the chunks.

        inputs = self.processor(audio_np, sampling_rate=16000, return_tensors="pt")
        input_features = inputs.input_features.to(self.device)

        predicted_ids = self.model.generate(input_features, task="transcribe", language="hi")
        transcription = self.processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

This is how the chunk function transcribes each audio chunk.
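On the hallucination side, repeated or invented text often comes from greedy decoding, especially over silence or noise. A hedged sketch of decoding settings to try in that `generate` call — these are generic Hugging Face `generate()` kwargs, not settings documented for audioX-north specifically, and the values are starting points to tune:

```python
# Hedged: standard GenerationMixin kwargs; tune values on your own audio.
anti_hallucination = {
    "num_beams": 5,             # beam search is steadier than greedy decoding
    "no_repeat_ngram_size": 3,  # blocks verbatim loops, a classic hallucination mode
    "repetition_penalty": 1.2,  # mildly discourages repeated tokens
}

# In the existing chunk function:
# predicted_ids = self.model.generate(
#     input_features, task="transcribe", language="hi", **anti_hallucination
# )
```

It also helps to skip chunks that are pure silence (e.g. check the signal energy before transcribing), since silent input is where Whisper-family models hallucinate most.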

Any help would be great.

Also, is there any way to get timestamps?
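Yes — the transformers ASR pipeline supports `return_timestamps=True`, which returns a `"chunks"` list of `{"timestamp": (start, end), "text": ...}` segments alongside the full text. A small helper to render those segments (the function name and output format are my own, not part of transformers):

```python
def format_timestamps(chunks):
    """Render `return_timestamps=True` segments as "[MM:SS - MM:SS] text" lines."""
    def fmt(seconds):
        s = int(seconds)
        return f"{s // 60:02d}:{s % 60:02d}"
    lines = []
    for c in chunks:
        start, end = c["timestamp"]
        lines.append(f"[{fmt(start)} - {fmt(end)}] {c['text'].strip()}")
    return "\n".join(lines)
```

Sketch of the pipeline side, assuming the model works with the standard ASR pipeline: `asr = pipeline("automatic-speech-recognition", model="jiviai/audioX-north-v1", chunk_length_s=30, return_timestamps=True)`, then `format_timestamps(asr("file.wav")["chunks"])`.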
