---
library_name: transformers
license: mit
language:
- gl
base_model:
- microsoft/mdeberta-v3-base
pipeline_tag: fill-mask
---
|
|
|
|
|
# mDeBERTa-gl |
|
|
|
|
|
**mDeBERTa-gl** is a continued-pretraining checkpoint of [**microsoft/mdeberta-v3-base**](https://huggingface.co/microsoft/mdeberta-v3-base), adapted to Galician through large-scale masked language modeling. It is intended as a strong general-purpose encoder for downstream NLP tasks in Galician.
|
|
|
|
|
## Training |
|
|
|
|
|
- **Base model:** microsoft/mdeberta-v3-base
- **Epochs:** 3
- **Learning rate:** 6e-4
- **MLM probability:** 0.15
- **Max sequence length:** 512
- **Total batch size:** 1024
- **Training examples:** 10,335,227
- **Mask token:** `[MASK]`
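The per-device batch size and gradient-accumulation split are not stated, but the figures above are enough for a back-of-the-envelope check of the training length (a sketch derived from the listed numbers, not part of the released configuration):

```python
import math

# Figures taken from the training summary above.
examples = 10_335_227
total_batch_size = 1024
epochs = 3

# Optimizer steps per epoch, assuming the final partial batch is kept.
steps_per_epoch = math.ceil(examples / total_batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch)  # 10093
print(total_steps)      # 30279
```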
|
|
|
|
|
## Intended uses |
|
|
|
|
|
- Masked language modeling (fill-mask)
- Encoder for classification, NER, QA, and general Galician NLP tasks
- Further domain adaptation via fine-tuning
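For the encoder use case, the checkpoint can be loaded with a task-specific head in the usual `transformers` way; the head is randomly initialised and must be fine-tuned on labelled data. The task and `num_labels` value below are illustrative placeholders, not part of this release:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_id = "proxectonos/mdeberta-gl"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Adds a fresh token-classification head on top of the pretrained encoder;
# num_labels=9 is a placeholder (e.g. a CoNLL-style NER label scheme).
model = AutoModelForTokenClassification.from_pretrained(model_id, num_labels=9)
```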
|
|
|
|
|
## How to use |
|
|
|
|
|
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model_id = "proxectonos/mdeberta-gl"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
fill_mask("O Parlamento de Galicia aprobou a [MASK] hoxe.")
```
|
|
|
|
|
## Funding |
|
|
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública and by the EU (NextGenerationEU) within the framework of the project "Desarrollo de Modelos ALIA".
|
|
|
|
|
## Citation |
|
|
|
|
|
Please reference this model as: **mdeberta-gl (Proxecto Nós Team, 2025)**. |
|
|
|