This is an encoder language model pre-trained from scratch on transcriptions of the archives of the Dutch East India Company. It is therefore specialized in Early Modern Dutch as used in the archive (1602–1800).
The model follows the RoBERTa architecture and can be fine-tuned on any downstream NLP task.
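
As a minimal usage sketch, the checkpoint can be loaded with the Hugging Face `transformers` library. The repository id below is a placeholder, not the confirmed model id; substitute the actual published checkpoint.

```python
# Minimal usage sketch with Hugging Face transformers.
# NOTE: the repository id is a placeholder; replace it with the actual model id.
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

model_id = "globalise/GloBERTise"  # placeholder id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Masked-token prediction on an Early Modern Dutch sentence
# (RoBERTa-style models use the <mask> token).
fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill("De schepen zijn <mask> naar Batavia vertrokken."))
```

For fine-tuning, the same checkpoint can be loaded with a task-specific head (e.g. `AutoModelForTokenClassification` or `AutoModelForSequenceClassification`), as with any RoBERTa-style encoder.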
Of the four GloBERTise models I pre-trained (as of August 2025), this version performs best when tested on binary event detection.
Comparison to other models: the `num_training_steps` and `num_warmup_steps` settings were adapted relative to GloBERTise-v01 and GloBERTise-v01-rerun; everything else is identical. Relative to GloBERTise-rerun, only the random seed differs; the parameter settings are the same.
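
For reference, these two settings control the learning-rate schedule. The sketch below shows where they plug in, assuming a linear warmup/decay schedule; the step counts and the stand-in module are illustrative placeholders, not the values actually used for this model.

```python
# Illustrative sketch only: how num_warmup_steps and num_training_steps
# enter the learning-rate schedule. Values below are placeholders.
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(8, 8)  # stand-in module for the actual model

optimizer = torch.optim.AdamW(
    model.parameters(), lr=3e-4, betas=(0.9, 0.98), weight_decay=0.01
)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1_000,     # placeholder: LR ramps up linearly over these steps
    num_training_steps=50_000,  # placeholder: then decays linearly to zero here
)
```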
See my GitHub repos:
- for pre-training: https://github.com/globalise-huygens/GloBERTise
- for evaluation: https://github.com/globalise-huygens/GloBERTise-eval

And a small presentation: https://docs.google.com/presentation/d/1gkg5hChWAMXA6mxfgFkkvIieWdj_17yKitwBkBNcJBo/edit?usp=sharing
Most important parameter settings:

| Parameter | Value |
|-----------|-------|
| learning_rate | 0.0003 |
| betas | [0.9, 0.98] |
| weight_decay | 0.01 |
| num_train_epochs | 2 |
| per_device_train_batch_size | 40 |
| gradient_accumulation_steps | 10 |
| fp16 | true |
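
These settings map onto Hugging Face `TrainingArguments` roughly as sketched below, assuming the standard `Trainer` setup; the output path is a placeholder, and `betas` correspond to `adam_beta1`/`adam_beta2`.

```python
# Sketch: the table above expressed as Hugging Face TrainingArguments.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="globertise-pretraining",  # placeholder output path
    learning_rate=3e-4,
    adam_beta1=0.9,
    adam_beta2=0.98,
    weight_decay=0.01,
    num_train_epochs=2,
    per_device_train_batch_size=40,
    gradient_accumulation_steps=10,
    fp16=True,
)
```

With gradient accumulation, the effective batch size per device is 40 × 10 = 400 examples per optimizer step.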