Finetuning for low-resource languages.
I was attempting to fine-tune the OpenAI Whisper model for Dholuo, and I noticed that the tokenizer requires specifying a “similar” language. I’m assuming this relates to linguistic similarity, possibly language families or shared phonological features.
How did your team determine which of the languages supported by these open-source models are most suitable as proxies for a low-resource language like Dholuo? What criteria or methodology did you use to decide linguistic similarity in this context?
Hey @martinoywa,
For Whisper fine-tuning, we did not use proxy language tokens. Instead, we extended the tokenizer to add languages that the base model does not originally support. Have a look at the training procedure in the paza-whisper-large-v3 model card to learn more.