Training scripts?

by danielschnell - opened May 19, 2024


Hi,

We have been using this model to train a classifier for Icelandic homographs, and the results were quite good. See https://github.com/grammatek/IceHoc.
I'd be interested in the training scripts for this LM, especially the parts covering dataset preparation and cleaning. Would you be willing to share those scripts?

Kv,
Daniel.

jonfd

Owner May 22, 2024

Hi Daniel,

Happy to hear that the model performed so well on homograph classification. When pre-training the model, I followed Stefan Schweter's instructions:

https://github.com/stefan-it/turkish-bert/blob/master/convbert/CHEATSHEET.md
https://github.com/stefan-it/turkish-bert/blob/master/electra/CHEATSHEET.md

I used the pre-training script from the ConvBERT repository. Since the pre-training corpus (i.e., the Icelandic Gigaword Corpus) doesn't contain any web-crawled or noisy documents, I didn't perform any filtering or cleaning beforehand.

Best regards,
Jón
