Fill-Mask · Transformers · Safetensors · roberta
nineunaiz committed · verified · Commit 8ed80a0 · 1 Parent(s): e754251

Update README.md

Files changed (1):
  1. README.md +17 -15
README.md CHANGED
@@ -7,21 +7,23 @@ license: apache-2.0
 
 Submitted to LREC 2026
 
- ## Abstract
-
- Language models depend on massive text corpora that are often filtered for quality, a process that can unintentionally exclude non-standard linguistic varieties, reduce model robustness and reinforce representational biases. In this paper, we argue that language models should aim to capture the full spectrum of language variation (dialectal, historical, informal, etc.) rather than relying solely on standardized text. Focusing on Basque, a morphologically rich and low-resource language, we construct new corpora combining standard, social media, and historical sources, and pre-train the BERnaT family of encoder-only models in three configurations: standard, diverse, and combined. We further propose an evaluation framework that separates Natural Language Understanding (NLU) tasks into standard and diverse subsets to assess linguistic generalization. Results show that models trained on both standard and diverse data consistently outperform those trained on standard corpora, improving performance across all task types without compromising standard benchmark accuracy. These findings highlight the importance of linguistic diversity in building inclusive, generalizable language models.
-
- ## Results
+ ## Model Description
+
+ BERnaT is a family of monolingual Basque encoder-only language models trained to better represent linguistic variation (including standard, dialectal, historical, and informal Basque) rather than standard text alone. The models were trained on corpora that combine high-quality standard Basque with varied sources such as social media and historical texts, aiming to improve robustness and generalization across natural language understanding (NLU) tasks.
+
+ **Model Types**: Encoder-only Transformer models (RoBERTa-style)
+ **Languages**: Basque (Euskara)
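+
+ Because these are fill-mask models, they can be queried directly with the `transformers` pipeline. The sketch below is illustrative only; the repo id is a placeholder assumption, not a confirmed checkpoint name:
+
+ ```python
+ # Minimal fill-mask sketch using the Hugging Face pipeline API.
+ # NOTE: "nineunaiz/BERnaT-base" is a hypothetical repo id; substitute
+ # the actual BERnaT checkpoint you want to use.
+ from transformers import pipeline
+
+ fill = pipeline("fill-mask", model="nineunaiz/BERnaT-base")
+
+ # RoBERTa-style models use <mask> as the mask token.
+ for pred in fill("Euskara <mask> hizkuntza da."):
+     print(f"{pred['token_str']!r}  score={pred['score']:.3f}")
+ ```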
+
+ ## Training Data
+
+ The BERnaT family was pre-trained on a combination of:
+ - Standard Basque corpora (e.g., Wikipedia, Egunkaria, EusCrawl).
+ - Diverse corpora, including Basque social media text and historical Basque books.
+ - Combined corpora for the unified BERnaT models.
+
+ The training objective is masked language modeling (MLM) on encoder-only architectures at three sizes: medium (51M parameters), base (124M), and large (355M).
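+
+ The MLM setup can be illustrated with the standard `transformers` data collator. This is a sketch under assumptions: the tokenizer id is a placeholder, and the 15% masking rate is the common BERT/RoBERTa default, not a figure stated on this card:
+
+ ```python
+ # Sketch of dynamic masking for the MLM objective described above.
+ from transformers import AutoTokenizer, DataCollatorForLanguageModeling
+
+ # Hypothetical repo id; substitute the actual BERnaT checkpoint.
+ tok = AutoTokenizer.from_pretrained("nineunaiz/BERnaT-base")
+
+ collator = DataCollatorForLanguageModeling(
+     tokenizer=tok,
+     mlm=True,
+     mlm_probability=0.15,  # assumed standard masking rate
+ )
+
+ # One toy example; the collator randomly masks tokens on the fly.
+ batch = collator([tok("Euskarazko testu adibide bat.")])
+ # Positions that were not masked get label -100 and are ignored by the loss.
+ print(batch["input_ids"])
+ print(batch["labels"])
+ ```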
+
+ ## Evaluation
 
  | | **AVG standard tasks** | **AVG diverse tasks** | **AVG overall** |
  |---------------------|:----------------------:|:---------------------:|:---------------:|