---
library_name: transformers
license: apache-2.0
pipeline_tag: fill-mask
---

# BERnaT: Basque Encoders for Representing Natural Textual Diversity

Submitted to LREC 2026

## Model Description

BERnaT is a family of monolingual Basque encoder-only language models trained to better represent linguistic variation, including standard, dialectal, historical, and informal Basque, rather than focusing solely on standard textual corpora. The models were trained on corpora that combine high-quality standard Basque with more varied sources, such as social media and historical texts, to improve robustness and generalization across natural language understanding (NLU) tasks.

- **Developed by:** HiTZ Research Center & IXA Research Group (University of the Basque Country UPV/EHU)
- **Funded by:** Ikergaitu and ALIA projects (Basque and Spanish Governments)
- **License:** Apache 2.0
- **Model type:** Encoder-only Transformer models (RoBERTa-style)
- **Languages:** Basque (Euskara)


## Getting Started

You can use this model directly for masked-token prediction, as shown in the example below, or fine-tune it for your task of interest (see the fine-tuning sketch after the example).

```python
>>> from transformers import pipeline
>>> pipe = pipeline("fill-mask", model='HiTZ/BERnaT-base')
>>> pipe("Kaixo! Ni <mask> naiz!")
[{'score': 0.022003261372447014,
  'token': 7497,
  'token_str': ' euskalduna',
  'sequence': 'Kaixo! Ni euskalduna naiz!'},
 {'score': 0.016429167240858078,
  'token': 14067,
  'token_str': ' Olentzero',
  'sequence': 'Kaixo! Ni Olentzero naiz!'},
 {'score': 0.012804778292775154,
  'token': 31087,
  'token_str': ' ahobizi',
  'sequence': 'Kaixo! Ni ahobizi naiz!'},
 {'score': 0.01173020526766777,
  'token': 331,
  'token_str': ' ez',
  'sequence': 'Kaixo! Ni ez naiz!'},
 {'score': 0.010091394186019897,
  'token': 7618,
  'token_str': ' irakaslea',
  'sequence': 'Kaixo! Ni irakaslea naiz!'}]
```
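
To adapt BERnaT to a downstream task, you can fine-tune it with the standard `transformers` Trainer API. The snippet below is a minimal sketch, not the authors' training setup: the two-example toy dataset, the binary label scheme, and the hyperparameters are illustrative placeholders to be replaced with your own task data and settings.

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Hypothetical toy dataset; replace with your own labeled Basque data.
data = Dataset.from_dict({
    "text": ["Kaixo, zer moduz?", "Hau oso txarra da."],
    "label": [1, 0],
})

tokenizer = AutoTokenizer.from_pretrained("HiTZ/BERnaT-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "HiTZ/BERnaT-base", num_labels=2
)

def tokenize(batch):
    return tokenizer(
        batch["text"], truncation=True, padding="max_length", max_length=128
    )

data = data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="bernat-finetuned",
        num_train_epochs=1,
        per_device_train_batch_size=2,
        report_to="none",
    ),
    train_dataset=data,
)
trainer.train()
```

The same pattern applies to other encoder tasks (e.g., token classification with `AutoModelForTokenClassification`); only the head class and the label format change.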

## Training Data

The BERnaT family was pre-trained on a combination of:
- Standard Basque corpora (e.g., Wikipedia, Egunkaria, EusCrawl).
- Diverse corpora including Basque social media text and historical Basque books.
- Combined corpora for the unified BERnaT models.

The training objective is masked language modeling (MLM) on encoder-only architectures, trained at three sizes: medium (51M parameters), base (124M), and large (355M).
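
As a sketch of this objective (not the authors' training code), the snippet below masks tokens with `DataCollatorForLanguageModeling` and computes the MLM loss for one toy sentence. The example sentence and the 15% masking rate (the standard RoBERTa-style default) are illustrative assumptions.

```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
)

tokenizer = AutoTokenizer.from_pretrained("HiTZ/BERnaT-base")
model = AutoModelForMaskedLM.from_pretrained("HiTZ/BERnaT-base")

# Randomly select 15% of tokens for masking and build the MLM labels.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

encoding = tokenizer("Euskara hizkuntza ederra da.", return_tensors="pt")
batch = collator([{k: v[0] for k, v in encoding.items()}])

# Cross-entropy is computed only on the masked positions (labels are -100 elsewhere).
loss = model(**batch).loss
print(float(loss))
```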

## Evaluation

Average scores of the three BERnaT variants (trained on standard, diverse, and combined corpora, respectively) on standard and diverse Basque NLU tasks:

|                     | **AVG standard tasks** | **AVG diverse tasks** | **AVG overall** |
|---------------------|:----------------------:|:---------------------:|:---------------:|
| **BERnaT_standard** |                        |                       |                 |
| medium              |          74.10         |         70.30         |      72.58      |
| base                |          75.33         |         71.26         |      73.70      |
| large               |          76.83         |         73.13         |      75.35      |
| **BERnaT_diverse**  |                        |                       |                 |
| medium              |          71.66         |         69.91         |      70.96      |
| base                |          72.44         |         71.43         |      72.04      |
| large               |          74.48         |         71.87         |      73.43      |
| **BERnaT**          |                        |                       |                 |
| medium              |          73.56         |         70.59         |      72.37      |
| base                |          75.42         |         71.28         |      73.76      |
| large               |        **77.88**       |       **73.77**       |    **76.24**    |

## Acknowledgments

This work has been partially supported by the Basque Government (Research group funding IT1570-22 and IKER-GAITU project), the Spanish Ministry for Digital Transformation and Civil Service, and the EU-funded NextGenerationEU Recovery, Transformation and Resilience Plan (ILENIA project, 2022/TL22/00215335; and ALIA project). The project also received funding from the European Union’s Horizon Europe research and innovation program under Grant Agreement No 101135724, Topic HORIZON-CL4-2023-HUMAN-01-21, and from DeepKnowledge (PID2021-127777OB-C21), funded by MCIN/AEI/10.13039/501100011033 and FEDER. Jaione Bengoetxea, Julen Etxaniz and Ekhi Azurmendi hold a PhD grant from the Basque Government (PRE_2024_1_0028, PRE_2024_2_0028 and PRE_2024_1_0035, respectively). Maite Heredia and Mikel Zubillaga hold a PhD grant from the University of the Basque Country UPV/EHU (PIF23/218 and PIF24/04, respectively). The models were trained on the Leonardo supercomputer at CINECA under the EuroHPC Joint Undertaking, project EHPC-EXT-2024E01-042.

## Citation

To cite our work, please use:

```bibtex
@misc{azurmendi2025bernatbasqueencodersrepresenting,
      title={BERnaT: Basque Encoders for Representing Natural Textual Diversity}, 
      author={Ekhi Azurmendi and Joseba Fernandez de Landa and Jaione Bengoetxea and Maite Heredia and Julen Etxaniz and Mikel Zubillaga and Ander Soraluze and Aitor Soroa},
      year={2025},
      eprint={2512.03903},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2512.03903}, 
}
```