Finetuning for low-resource languages.
I was attempting to fine-tune the OpenAI Whisper model for Dholuo, and I noticed that the tokenizer requires specifying a “similar” language. I’m assuming this relates to linguistic similarity, possibly language families or shared phonological features.
How did your team determine which of the languages supported by these open-source models are most suitable as proxies for a low-resource language like Dholuo? What criteria or methodology did you use to decide linguistic similarity in this context?
Hey @martinoywa,
For Whisper fine-tuning, we did not use proxy language tokens. Instead, we extended the tokenizer to add languages that the base model does not originally support. Have a look at the training procedure in the paza-whisper-large-v3 model card to learn more.