# MrBERT-nos-gl
MrBERT-nos-gl is a domain-adapted encoder model obtained by continued pre-training of BSC-LT/MrBERT on CorpusNÓS, a large-scale Galician corpus (~1.9B tokens). It inherits MrBERT's ModernBERT architecture, with efficient modeling of contexts up to 1,024 tokens via rotary position embeddings (RoPE) and sliding-window attention, and extends its coverage of Galician and Portuguese, two closely related Iberian languages that are underrepresented in most multilingual encoders.
The model is designed as a general-purpose encoder suitable for fine-tuning on downstream tasks such as named entity recognition, part-of-speech tagging, text classification, semantic similarity, question answering, and cross-lingual retrieval. It is the foundation for the MrBERT-nos-gl model collection.
The model was developed as part of Proxecto Nós, an initiative to build language technology for Galician.
## Technical Description
MrBERT-nos-gl starts from the MrBERT base checkpoint and continues pre-training with a masked language modeling (MLM) objective on a combined Galician and Portuguese corpus. The architecture is identical to the base model's; only the training regime and data distribution differ.
### Model Architecture
| Description | Value |
|---|---|
| Base Model | BSC-LT/MrBERT |
| Model Parameters | 308M |
| Tokenizer | SentencePiece (SPM) |
| Vocabulary Size | 256,000 |
| Hidden Layers | 22 |
| Hidden Size | 768 |
| Intermediate Size (FFN) | 1,152 |
| Attention Heads | 12 (head size: 64) |
| Attention Type | RoPE (Rotary Positional Embedding) |
| Sliding Window Size | 128 tokens |
| Global Attention Every N Layers | 3 |
| FFN Layer | Gated Linear Unit (GLU) |
| Normalization | Pre-norm LayerNorm (ε=1e-5) |
| Activation Function | GeLU |
| Precision | bfloat16 (AMP) |
| Context Length | 1,024 tokens |
| Weight Initialisation | Full Megatron |
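
The table values can be cross-checked against the released configuration. A minimal sketch, assuming the checkpoint exposes the standard ModernBERT config field names in `transformers`:

```python
from transformers import AutoConfig

# Assumes standard ModernBERT config field names; adjust if the
# released config uses different keys.
config = AutoConfig.from_pretrained("proxectonos/MrBERT-nos-gl")
print(config.hidden_size)              # expected: 768
print(config.num_hidden_layers)        # expected: 22
print(config.num_attention_heads)      # expected: 12
print(config.intermediate_size)        # expected: 1152
print(config.max_position_embeddings)  # expected: 1024
```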
### Continued Pre-training Hyperparameters
| Hyperparameter | Value |
|---|---|
| Objective | Masked Language Modeling (MLM) |
| Train MLM Mask Probability | 30% |
| Eval MLM Mask Probability | 15% |
| Peak Learning Rate | 1e-5 |
| LR Scheduler | Warmup–Stable–Decay (WSD) |
| Decay Budget | 0 tokens (no decay phase) |
| Final LR Factor (α_f) | 0.0 |
| Optimizer | Decoupled StableAdamW |
| Optimizer β1 / β2 | 0.9 / 0.98 |
| Optimizer ε | 1e-6 |
| Weight Decay | 1e-5 |
| Bias & Norm Weight Decay | Disabled (filter_bias_norm_wd: true) |
| Global Batch Size | 512 sequences |
| Device Microbatch Size (train) | 16 sequences |
| Device Batch Size (eval) | 32 sequences |
| Training Budget | 300,000,000 tokens |
| Sequence Packing | Disabled |
| Padding Strategy | Unpadded (Flash Attention compatible) |
| Count Padding Tokens in Budget | No |
| Batch Size Warmup | Ramped from the device microbatch size to the full global batch size over the first 30M tokens |
| Attention Dropout | 0.0 (train) / 0.1 (output projection) |
| Seed | 17 |
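
To illustrate the 30% training mask rate, here is a minimal sketch using the generic Hugging Face MLM collator; the actual pre-training stack may implement masking differently:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("proxectonos/MrBERT-nos-gl")

# 30% of tokens are selected for the MLM objective, matching the
# "Train MLM Mask Probability" row above (15% would be used for eval).
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.30
)

batch = collator([tokenizer("A lingua galega é unha lingua romance.")])
print(batch["input_ids"][0])  # some tokens replaced by the mask token
print(batch["labels"][0])     # -100 everywhere except selected positions
```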
## Usage
This model is a base encoder intended for fine-tuning, not for direct text generation. Load it with the fill-mask pipeline for masked language modeling, or use it as a backbone for downstream task fine-tuning (see the fine-tuning sketch at the end of this section).
### Installation

```bash
pip install transformers torch
```
### Masked language modeling
```python
from transformers import pipeline

mlm = pipeline("fill-mask", model="proxectonos/MrBERT-nos-gl")

# Use the tokenizer's actual mask token rather than hard-coding "[MASK]",
# since SentencePiece vocabularies often name it differently (e.g. "<mask>").
mask = mlm.tokenizer.mask_token
results = mlm(f"A lingua galega é unha das linguas {mask} de Europa.")
for r in results:
    print(f"{r['token_str']:<20} {r['score']*100:.1f}%")
```
### Feature extraction / embeddings
```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("proxectonos/MrBERT-nos-gl")
model = AutoModel.from_pretrained("proxectonos/MrBERT-nos-gl")

inputs = tokenizer("A lingua galega é unha das linguas romances de Europa.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the last hidden state for a sentence embedding
embeddings = outputs.last_hidden_state.mean(dim=1)
```
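
Plain mean pooling works for a single sentence, but in padded batches the padding positions should be excluded from the average. A small sketch, reusing the `tokenizer` and `model` from the block above:

```python
# Masked mean pooling for padded batches: zero out padding positions
# before averaging, then divide by the number of real tokens.
texts = [
    "A lingua galega é unha lingua romance.",
    "O portugués é a lingua máis próxima ao galego.",
]
batch = tokenizer(texts, padding=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state         # (batch, seq, hidden)
mask = batch["attention_mask"].unsqueeze(-1).float()  # (batch, seq, 1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
```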
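
### Fine-tuning for classification

A minimal fine-tuning sketch using the standard `AutoModelForSequenceClassification` head (requires the `datasets` and `accelerate` packages in addition to the install line above). The toy dataset, label count, and hyperparameters here are placeholders, not values used by Proxecto Nós:

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("proxectonos/MrBERT-nos-gl")
model = AutoModelForSequenceClassification.from_pretrained(
    "proxectonos/MrBERT-nos-gl",
    num_labels=2,  # placeholder label count
)

# Toy two-example dataset; replace with a real labeled corpus.
data = Dataset.from_dict({
    "text": ["Gústame moito este libro.", "Non me gustou nada a película."],
    "label": [1, 0],
})
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mrbert-nos-gl-clf", num_train_epochs=1),
    train_dataset=data,
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
```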
## Acknowledgements
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública and by the Recovery, Transformation and Resilience Plan, funded by the European Union (NextGenerationEU), within the framework of the project Desarrollo de Modelos ALIA.