Experience on finetuning for specific language

#13
by Huy227 - opened

Hi, I've tested the model and it performs well in most of my cases. However, there is still a small issue with tone marks. My language is Vietnamese, which is based on the Latin alphabet. The model sometimes confuses words, for example "Phường" -> "Phòng" or "Tỉnh" -> "Tinh", so I want to improve performance on Vietnamese while also preventing catastrophic forgetting. I've checked the finetuning notebook, and I think I should freeze the vision weights and unfreeze the language weights for my case; maybe I should use LoRA instead of full fine-tuning to learn the correct tone marks? Also, how many samples would be a good starting point? Should I collect PDF files with diverse structures to preserve the model's generality?

LightOn AI org

Hi,
Yes, I think finetuning makes a lot of sense in this case!

Yes! Freeze the vision weights and train only the language weights with LoRA; it works like magic. Tested.
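The recipe above (freeze the vision tower, LoRA on the language side) can be sketched in plain PyTorch. Note this is a minimal illustration, not LightOn's actual fine-tuning code: the `LoRALinear` wrapper and the `vision_encoder`/`language_head` names are hypothetical stand-ins for the real model's submodules; in practice you would apply `peft`'s `LoraConfig` to the actual checkpoint.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base Linear plus a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weight
        # A is small random, B starts at zero, so the adapter is a no-op at init
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)

# Hypothetical stand-in for a vision-language OCR model
model = nn.ModuleDict({
    "vision_encoder": nn.Linear(32, 32),
    "language_head": nn.Linear(32, 32),
})

# 1) Freeze the vision weights entirely
for p in model["vision_encoder"].parameters():
    p.requires_grad = False

# 2) Wrap the language layer with LoRA: base stays frozen, only A/B train
model["language_head"] = LoRALinear(model["language_head"], r=8, alpha=16)

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # only the LoRA A and B matrices remain trainable
```

Because `lora_B` is initialized to zero, the wrapped layer reproduces the pretrained outputs exactly at step 0, which is what limits catastrophic forgetting: the optimizer only moves the small A/B matrices away from the base behavior.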
