|
|
--- |
|
|
library_name: transformers |
|
|
license: apache-2.0 |
|
|
--- |
|
|
|
|
|
# BERnaT: Basque Encoders for Representing Natural Textual Diversity |
|
|
|
|
|
Submitted to LREC 2026 |
|
|
|
|
|
## Model Description |
|
|
|
|
|
BERnaT is a family of monolingual Basque encoder-only language models trained to better represent linguistic variation—including standard, dialectal, historical, and informal Basque—rather than focusing solely on standard textual corpora. Models were trained on corpora that combine high-quality standard Basque with varied sources such as social media and historical texts, aiming to enhance robustness and generalization across natural language understanding (NLU) tasks. |
|
|
|
|
|
- **Developed by:** HiTZ Research Center & IXA Research group (University of the Basque Country UPV/EHU) |
|
|
- **Funded by:** Ikergaitu and ALIA projects (Basque and Spanish Government) |
|
|
- **License:** Apache 2.0 |
|
|
- **Model type:** Encoder-only Transformer models (RoBERTa-style); see the encoder usage sketch after this list


- **Languages:** Basque (Euskara)
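
Because BERnaT models are encoder-only, their hidden states can be used directly as contextual token or sentence representations. The snippet below is a minimal sketch of this usage (the mean-pooling step is an illustrative choice on our part, not a recommendation from the authors):

```python
# Minimal sketch: using BERnaT as a text encoder via its last hidden states.
# Mean pooling is an illustrative aggregation choice, not an official recommendation.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("HiTZ/BERnaT-base")
model = AutoModel.from_pretrained("HiTZ/BERnaT-base")

inputs = tokenizer("Kaixo! Ni euskalduna naiz!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

token_embeddings = outputs.last_hidden_state        # shape: (1, seq_len, hidden_size)
sentence_embedding = token_embeddings.mean(dim=1)   # simple mean pooling over tokens
print(sentence_embedding.shape)
```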
|
|
|
|
|
|
|
|
## Getting Started |
|
|
|
|
|
You can use this model directly for masked-token prediction, as in the example below, or fine-tune it for your task of interest; a fine-tuning sketch follows the example.
|
|
|
|
|
```python
>>> from transformers import pipeline

>>> pipe = pipeline("fill-mask", model='HiTZ/BERnaT-base')

>>> pipe("Kaixo! Ni <mask> naiz!")
[{'score': 0.022003261372447014,
  'token': 7497,
  'token_str': ' euskalduna',
  'sequence': 'Kaixo! Ni euskalduna naiz!'},
 {'score': 0.016429167240858078,
  'token': 14067,
  'token_str': ' Olentzero',
  'sequence': 'Kaixo! Ni Olentzero naiz!'},
 {'score': 0.012804778292775154,
  'token': 31087,
  'token_str': ' ahobizi',
  'sequence': 'Kaixo! Ni ahobizi naiz!'},
 {'score': 0.01173020526766777,
  'token': 331,
  'token_str': ' ez',
  'sequence': 'Kaixo! Ni ez naiz!'},
 {'score': 0.010091394186019897,
  'token': 7618,
  'token_str': ' irakaslea',
  'sequence': 'Kaixo! Ni irakaslea naiz!'}]
```
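
For downstream NLU tasks, the model can be fine-tuned with the standard `Trainer` API. The following is an illustrative sketch only: the CSV files, label count, and hyperparameters are placeholders chosen for demonstration, not the settings used by the BERnaT authors.

```python
# Illustrative fine-tuning sketch: the dataset files, label count, and
# hyperparameters below are placeholders, not the authors' settings.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

model_name = "HiTZ/BERnaT-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Replace with your own labeled Basque dataset (columns: "text", "label").
dataset = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="bernat-finetuned",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,  # recent transformers versions name this argument processing_class
)
trainer.train()
```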
|
|
|
|
|
## Training Data |
|
|
|
|
|
The BERnaT family was pre-trained on a combination of: |
|
|
- Standard Basque corpora (e.g., Wikipedia, Egunkaria, EusCrawl). |
|
|
- Diverse corpora including Basque social media text and historical Basque books. |
|
|
- Combined corpora for the unified BERnaT models. |
|
|
|
|
|
The training objective is masked language modeling (MLM) on encoder-only architectures, at three parameter scales: medium (51M), base (124M), and large (355M).
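
As an illustration of the MLM objective (the masking probability and sentences below are placeholders, not the exact pre-training configuration), `DataCollatorForLanguageModeling` can be used to randomly mask tokens and compute the loss:

```python
# Illustrative sketch of the MLM objective; masking ratio and sentences are placeholders.
from transformers import AutoTokenizer, AutoModelForMaskedLM, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("HiTZ/BERnaT-base")
model = AutoModelForMaskedLM.from_pretrained("HiTZ/BERnaT-base")

# The collator randomly replaces a fraction of tokens with <mask> and builds the labels.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

sentences = ["Euskara hizkuntza ederra da.", "Kaixo, zer moduz zaude?"]
batch = [tokenizer(s) for s in sentences]
inputs = collator(batch)

outputs = model(**inputs)
print(outputs.loss)  # cross-entropy loss over the masked positions only
```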
|
|
|
|
|
## Evaluation |
|
|
|
|
|
|                     | **AVG standard tasks** | **AVG diverse tasks** | **AVG overall** |
|---------------------|:----------------------:|:---------------------:|:---------------:|
| **BERnaT_standard** |                        |                       |                 |
| medium              | 74.10                  | 70.30                 | 72.58           |
| base                | 75.33                  | 71.26                 | 73.70           |
| large               | 76.83                  | 73.13                 | 75.35           |
| **BERnaT_diverse**  |                        |                       |                 |
| medium              | 71.66                  | 69.91                 | 70.96           |
| base                | 72.44                  | 71.43                 | 72.04           |
| large               | 74.48                  | 71.87                 | 73.43           |
| **BERnaT**          |                        |                       |                 |
| medium              | 73.56                  | 70.59                 | 72.37           |
| base                | 75.42                  | 71.28                 | 73.76           |
| large               | **77.88**              | **73.77**             | **76.24**       |
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
This work has been partially supported by the Basque Government (research group funding IT1570-22 and the IKER-GAITU project), the Spanish Ministry for Digital Transformation and Civil Service, and the EU-funded NextGenerationEU Recovery, Transformation and Resilience Plan (ILENIA project, 2022/TL22/00215335, and ALIA project). The project also received funding from the European Union’s Horizon Europe research and innovation programme under Grant Agreement No 101135724 (Topic HORIZON-CL4-2023-HUMAN-01-21), and from DeepKnowledge (PID2021-127777OB-C21), funded by MCIN/AEI/10.13039/501100011033 and FEDER. Jaione Bengoetxea, Julen Etxaniz and Ekhi Azurmendi hold PhD grants from the Basque Government (PRE_2024_1_0028, PRE_2024_2_0028 and PRE_2024_1_0035, respectively). Maite Heredia and Mikel Zubillaga hold PhD grants from the University of the Basque Country UPV/EHU (PIF23/218 and PIF24/04, respectively). The models were trained on the Leonardo supercomputer at CINECA under the EuroHPC Joint Undertaking, project EHPC-EXT-2024E01-042.
|
|
|
|
|
## Citation
|
|
|
|
|
To cite our work, please use: |
|
|
|
|
|
```bibtex |
|
|
@misc{azurmendi2025bernatbasqueencodersrepresenting, |
|
|
title={BERnaT: Basque Encoders for Representing Natural Textual Diversity}, |
|
|
author={Ekhi Azurmendi and Joseba Fernandez de Landa and Jaione Bengoetxea and Maite Heredia and Julen Etxaniz and Mikel Zubillaga and Ander Soraluze and Aitor Soroa}, |
|
|
year={2025}, |
|
|
eprint={2512.03903}, |
|
|
archivePrefix={arXiv}, |
|
|
primaryClass={cs.CL}, |
|
|
url={https://arxiv.org/abs/2512.03903}, |
|
|
} |
|
|
``` |