How do I reduce hallucination, and why do my audios lose the last line after 30-second chunking?

#2
by iamgrootns - opened

So I was trying to do transcription using the north model. I do get transcriptions, but on longer audios the output hallucinates. I am transcribing the audio in chunks of 30 seconds, and the last line of each chunk does not get transcribed; it goes missing from the transcription.

For example, I said "hey, how are you", and someone replied "I have been good, how have you been".
I get "i said hey how are you, and someone said i have been good", but the small trailing part — "how have you been" — is missing from the transcription.
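A likely cause of the dropped tail: a hard cut every 30 s can land mid-sentence, and Whisper-style models tend to drop words truncated at the chunk edge. A minimal sketch of overlap chunking, assuming 16 kHz mono audio in a NumPy array (the function name and the 5 s overlap are my own choices, not from any library):

```python
import numpy as np

def chunk_with_overlap(audio, sr=16000, chunk_s=30, overlap_s=5):
    """Split audio into chunk_s-second windows that overlap by overlap_s seconds,
    so speech cut at one chunk boundary is fully contained in the next chunk."""
    step = (chunk_s - overlap_s) * sr   # hop between chunk starts
    size = chunk_s * sr                 # samples per chunk
    chunks = []
    for start in range(0, len(audio), step):
        chunks.append(audio[start:start + size])
        if start + size >= len(audio):  # this chunk already reaches the end
            break
    return chunks
```

Note that overlapping chunks will transcribe the boundary region twice, so the overlapping text has to be deduplicated when joining. The transformers ASR pipeline can handle all of this for you via `chunk_length_s=30` plus `stride_length_s`, which overlaps chunks and merges the text at the boundaries — usually the easier fix.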

import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

class JiviService:

    def __init__(self):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        print(f"Using JIVI SERVICE device: {self.device}")
        try:
            self.processor = WhisperProcessor.from_pretrained("jiviai/audioX-north-v1")
            self.model = WhisperForConditionalGeneration.from_pretrained("jiviai/audioX-north-v1").to(self.device)
            self.model.config.forced_decoder_ids = None
        except Exception as e:
            print(f"Error loading model: {e}")
            raise RuntimeError(f"Could not load the transcription model: {e}")

This is how I am using the model. I have two functions: one for transcribing the audio and one for transcribing the chunks.

        inputs = self.processor(audio_np, sampling_rate=16000, return_tensors="pt")
        input_features = inputs.input_features.to(self.device)

        predicted_ids = self.model.generate(input_features, task="transcribe", language="hi")
        transcription = self.processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

This is how the chunk function transcribes each audio chunk.
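On the hallucination side, repeated or invented text often comes from greedy decoding, especially over silence or noise. A hedged sketch of decoding settings to try in that `generate` call — these are generic Hugging Face `generate()` kwargs, not settings documented for audioX-north specifically, and the values are starting points to tune:

```python
# Hedged: standard GenerationMixin kwargs; tune values on your own audio.
anti_hallucination = {
    "num_beams": 5,             # beam search is steadier than greedy decoding
    "no_repeat_ngram_size": 3,  # blocks verbatim loops, a classic hallucination mode
    "repetition_penalty": 1.2,  # mildly discourages repeated tokens
}

# In the existing chunk function:
# predicted_ids = self.model.generate(
#     input_features, task="transcribe", language="hi", **anti_hallucination
# )
```

It also helps to skip chunks that are pure silence (e.g. check the signal energy before transcribing), since silent input is where Whisper-family models hallucinate most.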

Any help would be great.

Also, is there any way to get timestamps?
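Yes — the transformers ASR pipeline supports `return_timestamps=True`, which returns a `"chunks"` list of `{"timestamp": (start, end), "text": ...}` segments alongside the full text. A small helper to render those segments (the function name and output format are my own, not part of transformers):

```python
def format_timestamps(chunks):
    """Render `return_timestamps=True` segments as "[MM:SS - MM:SS] text" lines."""
    def fmt(seconds):
        s = int(seconds)
        return f"{s // 60:02d}:{s % 60:02d}"
    lines = []
    for c in chunks:
        start, end = c["timestamp"]
        lines.append(f"[{fmt(start)} - {fmt(end)}] {c['text'].strip()}")
    return "\n".join(lines)
```

Sketch of the pipeline side, assuming the model works with the standard ASR pipeline: `asr = pipeline("automatic-speech-recognition", model="jiviai/audioX-north-v1", chunk_length_s=30, return_timestamps=True)`, then `format_timestamps(asr("file.wav")["chunks"])`.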
