Hints for language

#19
by oleslav - opened

Hi, I would like to know whether it is possible to provide a language hint before starting a transcription via vLLM.

I have an example where the whole audio is recorded in German, but the beginning is transcribed in Portuguese (Brazilian).

"""
Vou tentar com os resultados que tomam para o forward support. Achim Müller ist mein Name. Ja, hallo, Heinz Heinisch ist mein Name. Ich besitze einen Thermomix TM6. Leider geht die Maschine nicht mehr an. Haben Sie versucht, Ihren Thermomix auch an anderen Steckdosen anzuschließen? Ja, ich hatte sämtliche Steckdosen in der Küche ausprobiert. Anscheinend ist das Gerät defekt und muss zu uns eingesandt werden. Ich benötige Ihre E-Mail-Adresse und dann würde ich Ihnen alle Versandpapiere dann per E-Mail zukommen lassen. Ja, sehr gerne. Meine E-Mail lautet heinzheinrich at gmail.com. Danke Ihnen auch wieder.
"""

Any tips to make it work properly? One idea I had is to send a first audio chunk with a really strong, clear German pronunciation, but that is really not the best way to do it.

We need the official to provide the training methodology, and then train the specific model based on the official training methodology provided.

@liuyt6515 I am sorry, but I didn't understand you. Could you rephrase your sentence?

Mistral AI_ org

We need the official to provide the training methodology, and then train the specific model based on the official training methodology provided.

I think that's unrelated to the question

Mistral AI_ org

@oleslav Voxtral Realtime doesn't support language hints the way Voxtral Mini does, for example. So there is no way of "hinting" a language via the request (we might add this in a future version).

What you can do to make sure the model stays in the correct language is pretty much exactly what you said: prepend a "dummy" audio chunk of clean target-language speech, 1 second or longer. That will bias the model nicely towards your target language.
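
A minimal sketch of that priming idea, assuming both clips are already raw 16-bit mono PCM at the same sample rate (16 kHz here is an assumption, not something confirmed in this thread). The function names are hypothetical helpers, not part of vLLM or Voxtral. The primer's own words will presumably show up at the start of the transcript, so they may need to be dropped afterwards.

```python
def prepend_primer(primer_pcm: bytes, audio_pcm: bytes) -> bytes:
    """Return the audio with a short target-language clip placed in front.

    Both inputs are assumed to be raw 16-bit mono PCM at the same sample rate.
    """
    if len(primer_pcm) % 2 or len(audio_pcm) % 2:
        raise ValueError("expected 16-bit PCM (even number of bytes)")
    return primer_pcm + audio_pcm


def primer_duration_s(primer_pcm: bytes, sample_rate: int = 16000) -> float:
    """Length of the primer in seconds, to check it is about 1 s or longer."""
    return len(primer_pcm) / 2 / sample_rate
```

The combined bytes would then be streamed to the model in place of the original audio.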

@patrickvonplaten Okay, I got it. Thank you for your response and for a great model! I really like it, it beats enterprise solutions like Deepgram :D

@oleslav Were you able to fix it with the bias trick? I tried, but no matter what I do, some parts are always transcribed in either Arabic or Russian, even though the input is clear Italian in my case.

@bugtoo Unfortunately, I haven't tried it yet, but if I get any results, I will let you know.

@oleslav I tried, and it does work sometimes, but the trick is not usable for real-time agents. I guess we really need to wait for an explicit hint flag.
I found that the same clear Italian audio gets mistaken for Arabic, French, or Russian simply by adding different amounts of silence padding at the beginning (like 100 ms, 200 ms, 300 ms). If the audio starts right away with speech (no initial silence, not even 30 ms), the detection is mostly OK.
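
For anyone who wants to test the silence observation, here is a rough sketch that strips leading silence from raw 16-bit mono PCM before sending it. It uses a plain amplitude threshold (the value 500 is an arbitrary assumption), not a real voice activity detector, and the function name is hypothetical.

```python
from array import array


def trim_leading_silence(pcm: bytes, threshold: int = 500) -> bytes:
    """Drop samples from the start until one exceeds the amplitude threshold.

    Input is assumed to be raw 16-bit mono PCM in native byte order.
    Returns empty bytes if no sample is louder than the threshold.
    """
    samples = array("h")
    samples.frombytes(pcm)
    for i, sample in enumerate(samples):
        if abs(sample) > threshold:
            return samples[i:].tobytes()
    return b""
```

Noisy recordings would likely need a more robust detector than a fixed threshold.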

@bugtoo Thank you for the insight 🤗

oleslav changed discussion status to closed
