Fill-Mask · Transformers · Safetensors · roberta
nineunaiz committed · verified · Commit 8ed80a0 · 1 Parent(s): e754251

Update README.md

Files changed (1):
  1. README.md +17 -15
README.md CHANGED
@@ -7,21 +7,23 @@ license: apache-2.0
 
 Submitted to LREC 2026
 
- ## Abstract
-
- Language models depend on massive text corpora that are often filtered for quality, a process that can unintentionally exclude non-standard linguistic varieties, reduce model robustness and reinforce representational biases. In this paper, we argue that language models should aim to capture the full spectrum of language variation (dialectal, historical, informal, etc.) rather than relying solely on standardized text. Focusing on Basque, a morphologically rich and low-resource language, we construct new corpora combining standard, social media, and historical sources, and pre-train the BERnaT family of encoder-only models in three configurations: standard, diverse, and combined. We further propose an evaluation framework that separates Natural Language Understanding (NLU) tasks into standard and diverse subsets to assess linguistic generalization. Results show that models trained on both standard and diverse data consistently outperform those trained on standard corpora, improving performance across all task types without compromising standard benchmark accuracy. These findings highlight the importance of linguistic diversity in building inclusive, generalizable language models.
-
- ## Results
+ ## Model Description
+
+ BERnaT is a family of monolingual Basque encoder-only language models trained to better represent linguistic variation (including standard, dialectal, historical, and informal Basque) rather than standard text alone. The models were trained on corpora that combine high-quality standard Basque with varied sources such as social media and historical texts, aiming to improve robustness and generalization across natural language understanding (NLU) tasks.
+
+ **Model Types**: Encoder-only Transformer models (RoBERTa-style)
+ **Languages**: Basque (Euskara)
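+
+ Because these are fill-mask models, they can be queried directly with the `transformers` pipeline. The sketch below is illustrative only; the repo id is a placeholder assumption, not a confirmed checkpoint name:
+
+ ```python
+ # Minimal fill-mask sketch using the Hugging Face pipeline API.
+ # NOTE: "nineunaiz/BERnaT-base" is a hypothetical repo id; substitute
+ # the actual BERnaT checkpoint you want to use.
+ from transformers import pipeline
+
+ fill = pipeline("fill-mask", model="nineunaiz/BERnaT-base")
+
+ # RoBERTa-style models use <mask> as the mask token.
+ for pred in fill("Euskara <mask> hizkuntza da."):
+     print(f"{pred['token_str']!r}  score={pred['score']:.3f}")
+ ```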
+
+ ## Training Data
+
+ The BERnaT family was pre-trained on a combination of:
+ - Standard Basque corpora (e.g., Wikipedia, Egunkaria, EusCrawl).
+ - Diverse corpora, including Basque social media text and historical Basque books.
+ - Combined corpora for the unified BERnaT models.
+
+ The training objective is masked language modeling (MLM) on encoder-only architectures at three sizes: medium (51M parameters), base (124M), and large (355M).
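+
+ The MLM setup can be illustrated with the standard `transformers` data collator. This is a sketch under assumptions: the tokenizer id is a placeholder, and the 15% masking rate is the common BERT/RoBERTa default, not a figure stated on this card:
+
+ ```python
+ # Sketch of dynamic masking for the MLM objective described above.
+ from transformers import AutoTokenizer, DataCollatorForLanguageModeling
+
+ # Hypothetical repo id; substitute the actual BERnaT checkpoint.
+ tok = AutoTokenizer.from_pretrained("nineunaiz/BERnaT-base")
+
+ collator = DataCollatorForLanguageModeling(
+     tokenizer=tok,
+     mlm=True,
+     mlm_probability=0.15,  # assumed standard masking rate
+ )
+
+ # One toy example; the collator randomly masks tokens on the fly.
+ batch = collator([tok("Euskarazko testu adibide bat.")])
+ # Positions that were not masked get label -100 and are ignored by the loss.
+ print(batch["input_ids"])
+ print(batch["labels"])
+ ```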
+
+ ## Evaluation
 
  | | **AVG standard tasks** | **AVG diverse tasks** | **AVG overall** |
  |---------------------|:----------------------:|:---------------------:|:---------------:|