|
|
--- |
|
|
library_name: transformers |
|
|
license: apache-2.0 |
|
|
--- |
|
|
|
|
|
# BERnaT: Basque Encoders for Representing Natural Textual Diversity |
|
|
|
|
|
Submitted to LREC 2026 |
|
|
|
|
|
## Model Description |
|
|
|
|
|
BERnaT is a family of monolingual Basque encoder-only language models trained to better represent linguistic variation—including standard, dialectal, historical, and informal Basque—rather than focusing solely on standard textual corpora. Models were trained on corpora that combine high-quality standard Basque with varied sources such as social media and historical texts, aiming to enhance robustness and generalization across natural language understanding (NLU) tasks. |
|
|
|
|
|
- **Developed by:** HiTZ Research Center & IXA Research group (University of the Basque Country UPV/EHU) |
|
|
- **Funded by:** Ikergaitu and ALIA projects (Basque and Spanish Government) |
|
|
- **License:** Apache 2.0 |
|
|
- **Model type:** Encoder-only Transformer models (RoBERTa-style); see the encoder usage sketch after this list


- **Languages:** Basque (Euskara)
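
Because BERnaT models are encoder-only, their hidden states can be used directly as contextual token or sentence representations. The snippet below is a minimal sketch of this usage (the mean-pooling step is an illustrative choice on our part, not a recommendation from the authors):

```python
# Minimal sketch: using BERnaT as a text encoder via its last hidden states.
# Mean pooling is an illustrative aggregation choice, not an official recommendation.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("HiTZ/BERnaT-base")
model = AutoModel.from_pretrained("HiTZ/BERnaT-base")

inputs = tokenizer("Kaixo! Ni euskalduna naiz!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

token_embeddings = outputs.last_hidden_state        # shape: (1, seq_len, hidden_size)
sentence_embedding = token_embeddings.mean(dim=1)   # simple mean pooling over tokens
print(sentence_embedding.shape)
```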
|
|
|
|
|
|
|
|
## Getting Started |
|
|
|
|
|
You can use this model directly for masked-token prediction, as in the example below, or fine-tune it for your task of interest; a fine-tuning sketch follows the example.
|
|
|
|
|
```python
>>> from transformers import pipeline

>>> pipe = pipeline("fill-mask", model='HiTZ/BERnaT-base')

>>> pipe("Kaixo! Ni <mask> naiz!")
[{'score': 0.022003261372447014,
  'token': 7497,
  'token_str': ' euskalduna',
  'sequence': 'Kaixo! Ni euskalduna naiz!'},
 {'score': 0.016429167240858078,
  'token': 14067,
  'token_str': ' Olentzero',
  'sequence': 'Kaixo! Ni Olentzero naiz!'},
 {'score': 0.012804778292775154,
  'token': 31087,
  'token_str': ' ahobizi',
  'sequence': 'Kaixo! Ni ahobizi naiz!'},
 {'score': 0.01173020526766777,
  'token': 331,
  'token_str': ' ez',
  'sequence': 'Kaixo! Ni ez naiz!'},
 {'score': 0.010091394186019897,
  'token': 7618,
  'token_str': ' irakaslea',
  'sequence': 'Kaixo! Ni irakaslea naiz!'}]
```
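
For downstream NLU tasks, the model can be fine-tuned with the standard `Trainer` API. The following is an illustrative sketch only: the CSV files, label count, and hyperparameters are placeholders chosen for demonstration, not the settings used by the BERnaT authors.

```python
# Illustrative fine-tuning sketch: the dataset files, label count, and
# hyperparameters below are placeholders, not the authors' settings.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

model_name = "HiTZ/BERnaT-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Replace with your own labeled Basque dataset (columns: "text", "label").
dataset = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="bernat-finetuned",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,  # recent transformers versions name this argument processing_class
)
trainer.train()
```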
|
|
|
|
|
## Training Data |
|
|
|
|
|
The BERnaT family was pre-trained on a combination of: |
|
|
- Standard Basque corpora (e.g., Wikipedia, Egunkaria, EusCrawl). |
|
|
- Diverse corpora including Basque social media text and historical Basque books. |
|
|
- Combined corpora for the unified BERnaT models. |
|
|
|
|
|
The training objective is masked language modeling (MLM) on encoder-only architectures, at three parameter scales: medium (51M), base (124M), and large (355M).
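
As an illustration of the MLM objective (the masking probability and sentences below are placeholders, not the exact pre-training configuration), `DataCollatorForLanguageModeling` can be used to randomly mask tokens and compute the loss:

```python
# Illustrative sketch of the MLM objective; masking ratio and sentences are placeholders.
from transformers import AutoTokenizer, AutoModelForMaskedLM, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("HiTZ/BERnaT-base")
model = AutoModelForMaskedLM.from_pretrained("HiTZ/BERnaT-base")

# The collator randomly replaces a fraction of tokens with <mask> and builds the labels.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

sentences = ["Euskara hizkuntza ederra da.", "Kaixo, zer moduz zaude?"]
batch = [tokenizer(s) for s in sentences]
inputs = collator(batch)

outputs = model(**inputs)
print(outputs.loss)  # cross-entropy loss over the masked positions only
```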
|
|
|
|
|
## Evaluation |
|
|
|
|
|
|                     | **AVG standard tasks** | **AVG diverse tasks** | **AVG overall** |
|---------------------|:----------------------:|:---------------------:|:---------------:|
| **BERnaT_standard** |                        |                       |                 |
| medium              | 74.10                  | 70.30                 | 72.58           |
| base                | 75.33                  | 71.26                 | 73.70           |
| large               | 76.83                  | 73.13                 | 75.35           |
| **BERnaT_diverse**  |                        |                       |                 |
| medium              | 71.66                  | 69.91                 | 70.96           |
| base                | 72.44                  | 71.43                 | 72.04           |
| large               | 74.48                  | 71.87                 | 73.43           |
| **BERnaT**          |                        |                       |                 |
| medium              | 73.56                  | 70.59                 | 72.37           |
| base                | 75.42                  | 71.28                 | 73.76           |
| large               | **77.88**              | **73.77**             | **76.24**       |
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
This work has been partially supported by the Basque Government (research group funding IT1570-22 and the IKER-GAITU project), the Spanish Ministry for Digital Transformation and Civil Service, and the EU-funded NextGenerationEU Recovery, Transformation and Resilience Plan (ILENIA project, 2022/TL22/00215335, and ALIA project). The project also received funding from the European Union’s Horizon Europe research and innovation programme under Grant Agreement No 101135724 (Topic HORIZON-CL4-2023-HUMAN-01-21), and from DeepKnowledge (PID2021-127777OB-C21), funded by MCIN/AEI/10.13039/501100011033 and FEDER. Jaione Bengoetxea, Julen Etxaniz and Ekhi Azurmendi hold PhD grants from the Basque Government (PRE_2024_1_0028, PRE_2024_2_0028 and PRE_2024_1_0035, respectively). Maite Heredia and Mikel Zubillaga hold PhD grants from the University of the Basque Country UPV/EHU (PIF23/218 and PIF24/04, respectively). The models were trained on the Leonardo supercomputer at CINECA under the EuroHPC Joint Undertaking, project EHPC-EXT-2024E01-042.
|
|
|
|
|
## Citation
|
|
|
|
|
To cite our work, please use: |
|
|
|
|
|
```bibtex |
|
|
@misc{azurmendi2025bernatbasqueencodersrepresenting, |
|
|
title={BERnaT: Basque Encoders for Representing Natural Textual Diversity}, |
|
|
author={Ekhi Azurmendi and Joseba Fernandez de Landa and Jaione Bengoetxea and Maite Heredia and Julen Etxaniz and Mikel Zubillaga and Ander Soraluze and Aitor Soroa}, |
|
|
year={2025}, |
|
|
eprint={2512.03903}, |
|
|
archivePrefix={arXiv}, |
|
|
primaryClass={cs.CL}, |
|
|
url={https://arxiv.org/abs/2512.03903}, |
|
|
} |
|
|
``` |