---
library_name: transformers
license: mit
language:
- gl
base_model:
- microsoft/mdeberta-v3-base
pipeline_tag: fill-mask
---

# mDeBERTa-gl

**mDeBERTa-gl** is a continued pretraining checkpoint based on [**microsoft/mdeberta-v3-base**](https://huggingface.co/microsoft/mdeberta-v3-base), adapted to Galician through large-scale masked-language modeling. It is intended as a strong general-purpose encoder for downstream NLP tasks in Galician.

## Training

- **Base model:** microsoft/mdeberta-v3-base  
- **Epochs:** 3  
- **Learning rate:** 6e-4  
- **MLM probability:** 0.15  
- **Max sequence length:** 512  
- **Total batch size:** 1024  
- **Training examples:** 10,335,227  
- **Mask token:** `[MASK]`
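The MLM probability of 0.15 above means that, on average, 15% of tokens in each sequence are selected as prediction targets. The sketch below illustrates the standard BERT-style corruption recipe (80% replaced with `[MASK]`, 10% replaced with a random token, 10% left unchanged) using only the standard library; `MASK_ID` and `VOCAB_SIZE` are placeholder assumptions, and the actual training presumably relied on `transformers`' `DataCollatorForLanguageModeling`, which implements the same scheme over the real tokenizer vocabulary.

```python
import random

MASK_ID = 4          # placeholder id for [MASK]; an assumption, not the real id
VOCAB_SIZE = 251000  # rough mdeberta-v3 vocabulary size; an assumption

def mask_tokens(input_ids, mlm_probability=0.15, rng=random):
    """Return (corrupted_inputs, labels); labels are -100 where no loss is taken."""
    masked = list(input_ids)
    labels = [-100] * len(input_ids)
    for i, tok in enumerate(input_ids):
        if rng.random() >= mlm_probability:
            continue                      # token not selected for prediction
        labels[i] = tok                   # the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_ID           # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
        # remaining 10%: keep the original token unchanged
    return masked, labels
```

Keeping some selected tokens unchanged forces the encoder to build useful representations for every position, not only for visible `[MASK]` placeholders.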

## Intended uses

- Masked language modeling (fill-mask)  
- Encoder for classification, NER, QA, and general Galician NLP tasks  
- Further domain adaptation via fine-tuning  

## How to use

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model_id = "proxectonos/mdeberta-gl"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Print the top predictions for the masked token
for pred in fill_mask("O Parlamento de Galicia aprobou a [MASK] hoxe."):
    print(pred["token_str"], round(pred["score"], 3))
```

## Funding
This work is funded by the Spanish Ministerio para la Transformación Digital y de la Función Pública and by the EU's NextGenerationEU programme, within the framework of the project Desarrollo de Modelos ALIA.

## Citation

Please reference this model as: **mdeberta-gl (Proxecto Nós Team, 2025)**.