Fill-Mask · Transformers · Safetensors · roberta
nineunaiz committed f71174e (verified) · 1 parent: 1d0db7f

Update README.md

Files changed (1): README.md +0 -5
README.md CHANGED
@@ -10,19 +10,14 @@ Submitted to LREC 2026
 ## Model Description
 
 BERnaT is a family of monolingual Basque encoder-only language models trained to better represent linguistic variation—including standard, dialectal, historical, and informal Basque—rather than focusing solely on standard textual corpora. Models were trained on corpora that combine high-quality standard Basque with varied sources such as social media and historical texts, aiming to enhance robustness and generalization across natural language understanding (NLU) tasks.
-
 **Model Types**: Encoder-only Transformer models (RoBERTa-style)
-
 **Languages**: Basque (Euskara)
 
 ## Training Data
 
 The BERnaT family was pre-trained on a combination of:
-
 - Standard Basque corpora (e.g., Wikipedia, Egunkaria, EusCrawl).
-
 - Diverse corpora including Basque social media text and historical Basque books.
-
 - Combined corpora for the unified BERnaT models.
 
 The training objective is masked language modeling (MLM) on encoder-only architectures across medium (51M), base (124M), and large (355M) sizes.
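Since the card is tagged for the fill-mask pipeline and describes MLM pre-training, a short usage sketch may help. It assumes the standard `transformers` fill-mask pipeline; the repo ID below is a placeholder, as the card does not state the published checkpoint path.

```python
from transformers import pipeline

# Placeholder repo ID (assumption) — substitute the actual BERnaT
# checkpoint path on the Hub.
fill_mask = pipeline("fill-mask", model="nineunaiz/BERnaT-base")

# Build a masked Basque sentence using the model's own mask token
# (RoBERTa-style tokenizers typically use "<mask>").
masked = f"Euskara {fill_mask.tokenizer.mask_token} hizkuntza da."

# Each prediction carries the filled-in token and its probability.
for pred in fill_mask(masked):
    print(f"{pred['token_str']}\t{pred['score']:.3f}")
```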
 