# Lb_mBERT
Lb_mBERT is a BERT-like language model for the Luxembourgish language.
We used the weights of the multilingual BERT (mBERT) model as a starting point and continued pre-training it on the masked language modelling (MLM) task, using the same corpus as for our LuxemBERT model (https://huggingface.co/lothritz/LuxemBERT).
On some downstream tasks, it achieves higher performance than the original LuxemBERT and another Luxembourgish BERT model called DA_BERT (https://huggingface.co/iolariu/DA_BERT). A minimal usage sketch follows below.
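The model can be used for masked-token prediction with the Hugging Face `transformers` library. The sketch below makes two assumptions: the repository ID `lothritz/Lb_mBERT` is inferred from the LuxemBERT naming convention and may differ from the actual model ID, and the Luxembourgish example sentence is illustrative only.

```python
# Minimal sketch: masked-token prediction with Lb_mBERT via transformers.
# NOTE: the repository ID "lothritz/Lb_mBERT" is an assumption (inferred from
# the LuxemBERT naming convention); replace it with the actual model ID.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="lothritz/Lb_mBERT")

# mBERT-derived models mark the position to predict with the [MASK] token.
predictions = fill_mask("Lëtzebuerg ass e [MASK] an Europa.")
for p in predictions:
    print(p["token_str"], p["score"])
```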
If you would like to know more about our work or the pre-training corpus, or if you use our models or datasets, please check out and cite the following papers:
```
@inproceedings{lothritz-etal-2022-luxembert,
    title = "{L}uxem{BERT}: Simple and Practical Data Augmentation in Language Model Pre-Training for {L}uxembourgish",
    author = "Lothritz, Cedric and
      Lebichot, Bertrand and
      Allix, Kevin and
      Veiber, Lisa and
      Bissyande, Tegawende and
      Klein, Jacques and
      Boytsov, Andrey and
      Lefebvre, Cl{\'e}ment and
      Goujon, Anne",
    booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
    month = jun,
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2022.lrec-1.543",
    pages = "5080--5089",
    abstract = "Pre-trained Language Models such as BERT have become ubiquitous in NLP where they have achieved state-of-the-art performance in most NLP tasks. While these models are readily available for English and other widely spoken languages, they remain scarce for low-resource languages such as Luxembourgish. In this paper, we present LuxemBERT, a BERT model for the Luxembourgish language that we create using the following approach: we augment the pre-training dataset by considering text data from a closely related language that we partially translate using a simple and straightforward method. We are then able to produce the LuxemBERT model, which we show to be effective for various NLP tasks: it outperforms a simple baseline built with the available Luxembourgish text data as well the multilingual mBERT model, which is currently the only option for transformer-based language models in Luxembourgish. Furthermore, we present datasets for various downstream NLP tasks that we created for this study and will make available to researchers on request.",
}
```
```
@inproceedings{lothritz2023comparing,
    title={Comparing Pre-Training Schemes for Luxembourgish BERT Models},
    author={Lothritz, Cedric and Ezzini, Saad and Purschke, Christoph and Bissyande, Tegawend{\'e} Fran{\c{c}}ois D Assise and Klein, Jacques and Olariu, Isabella and Boytsov, Andrey and Lefebvre, Clement and Goujon, Anne},
    booktitle={Proceedings of the 19th Conference on Natural Language Processing (KONVENS 2023)},
    year={2023}
}
```