# MrBERT-nos-gl

MrBERT-nos-gl is a domain-adapted encoder model obtained by continued pre-training of BSC-LT/MrBERT on CorpusNÓS, a large-scale Galician corpus (~1.9B tokens). It inherits MrBERT's ModernBERT architecture, which combines RoPE with sliding-window attention for efficient modeling of its 1,024-token context, and extends the base model's coverage of Galician and Portuguese, two closely related Iberian languages that are underrepresented in most multilingual encoders.

The model is designed as a general-purpose encoder suitable for fine-tuning on downstream tasks such as named entity recognition, part-of-speech tagging, text classification, semantic similarity, question answering, and cross-lingual retrieval. It is the foundation for the MrBERT-nos-gl model collection.

The model was developed as part of Proxecto Nós, an initiative to build language technology for the Galician language.

## Technical Description

MrBERT-nos-gl starts from the MrBERT base checkpoint and continues pre-training with a masked language modeling (MLM) objective on a combined Galician and Portuguese corpus. The architecture is identical to that of the base model; only the training regime and data distribution differ.

### Model Architecture

| Description | Value |
|---|---|
| Base Model | BSC-LT/MrBERT |
| Model Parameters | 308M |
| Tokenizer | SentencePiece (SPM) |
| Vocabulary Size | 256,000 |
| Hidden Layers | 22 |
| Hidden Size | 768 |
| Intermediate Size (FFN) | 1,152 |
| Attention Heads | 12 (head size: 64) |
| Attention Type | RoPE (Rotary Positional Embedding) |
| Sliding Window Size | 128 tokens |
| Global Attention Every N Layers | 3 |
| FFN Layer | Gated Linear Unit (GLU) |
| Normalization | Pre-norm LayerNorm (ε = 1e-5) |
| Activation Function | GeLU |
| Precision | bfloat16 (AMP) |
| Context Length | 1,024 tokens |
| Weight Initialisation | Full Megatron |
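
These values can be read back from the released checkpoint configuration. A minimal check (attribute names assume the standard Hugging Face ModernBERT config class; adjust if the checkpoint exposes different names):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("proxectonos/MrBERT-nos-gl")

# Each value should match the table above.
print(config.num_hidden_layers)        # hidden layers
print(config.hidden_size)              # hidden size
print(config.num_attention_heads)      # attention heads
print(config.max_position_embeddings)  # context length
print(config.vocab_size)               # vocabulary size
```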

### Continued Pre-training Hyperparameters

| Hyperparameter | Value |
|---|---|
| Objective | Masked Language Modeling (MLM) |
| Train MLM Mask Probability | 30% |
| Eval MLM Mask Probability | 15% |
| Peak Learning Rate | 1e-5 |
| LR Scheduler | Warmup–Stable–Decay (WSD) |
| Decay Budget | 0 tokens (no decay phase) |
| Final LR Factor (α_f) | 0.0 |
| Optimizer | Decoupled StableAdamW |
| Optimizer β1 / β2 | 0.9 / 0.98 |
| Optimizer ε | 1e-6 |
| Weight Decay | 1e-5 |
| Bias & Norm Weight Decay | Disabled (filter_bias_norm_wd: true) |
| Global Batch Size | 512 sequences |
| Device Microbatch Size (train) | 16 sequences |
| Device Batch Size (eval) | 32 sequences |
| Training Budget | 300,000,000 tokens |
| Sequence Packing | Disabled |
| Padding Strategy | Unpadded (Flash Attention compatible) |
| Count Padding Tokens in Budget | No |
| Batch Size Warmup | Ramped up from the microbatch size over the first 30M tokens |
| Attention Dropout | 0.0 (train) / 0.1 (output projection) |
| Seed | 17 |
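
The original run used the ModernBERT pre-training stack (Decoupled StableAdamW, WSD schedule, unpadded sequences), which is not reproduced here. As an illustration only, the masking setup can be sketched with standard Hugging Face utilities:

```python
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
)

tokenizer = AutoTokenizer.from_pretrained("proxectonos/MrBERT-nos-gl")
model = AutoModelForMaskedLM.from_pretrained("proxectonos/MrBERT-nos-gl")

# 30% of tokens are masked for training batches and 15% for evaluation
# batches, matching the probabilities in the table above.
train_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.30
)
eval_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
```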

## Usage

This model is a base encoder intended for fine-tuning, not for direct text generation. Load it with the fill-mask pipeline for masked language modeling, or use it as a backbone for downstream task fine-tuning.

### Installation

```bash
pip install transformers torch
```

### Masked language modeling

```python
from transformers import pipeline

mlm = pipeline("fill-mask", model="proxectonos/MrBERT-nos-gl")

# Build the prompt with the tokenizer's own mask token.
# "A lingua galega é unha das linguas [MASK] de Europa."
# ("Galician is one of the [MASK] languages of Europe.")
mask = mlm.tokenizer.mask_token
results = mlm(f"A lingua galega é unha das linguas {mask} de Europa.")
for r in results:
    print(f"{r['token_str']:<20} {r['score']*100:.1f}%")
```

### Feature extraction / embeddings

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("proxectonos/MrBERT-nos-gl")
model = AutoModel.from_pretrained("proxectonos/MrBERT-nos-gl")

inputs = tokenizer("A lingua galega é unha das linguas romances de Europa.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the last hidden state for a sentence embedding
embeddings = outputs.last_hidden_state.mean(dim=1)
```
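
Because the model covers both Galician and Portuguese, the mean-pooled embeddings can also be compared across the two languages. A small, illustrative extension of the snippet above (the sentence pair is only an example):

```python
import torch.nn.functional as F

sentences = [
    "A lingua galega é unha das linguas romances de Europa.",      # Galician
    "A língua portuguesa é uma das línguas românicas da Europa.",  # Portuguese
]
batch = tokenizer(sentences, padding=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state

# Mask-aware mean pooling, then cosine similarity between the two sentences.
mask = batch["attention_mask"].unsqueeze(-1)
pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(f"cosine similarity: {F.cosine_similarity(pooled[0], pooled[1], dim=0).item():.3f}")
```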

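### Fine-tuning for downstream tasks

The checkpoint can also serve as a backbone for the tasks listed in the introduction (NER, POS tagging, classification, and so on). A minimal, illustrative sketch for sequence classification; the dataset name, label count, and training hyperparameters below are placeholders, not part of the released model:

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer,
)

tokenizer = AutoTokenizer.from_pretrained("proxectonos/MrBERT-nos-gl")
model = AutoModelForSequenceClassification.from_pretrained(
    "proxectonos/MrBERT-nos-gl", num_labels=2  # placeholder label count
)

# Placeholder dataset with "text" and "label" columns and a "train" split.
dataset = load_dataset("my_galician_classification_dataset")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="mrbert-nos-gl-classifier",
        learning_rate=2e-5,
        per_device_train_batch_size=16,
        num_train_epochs=3,
    ),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
)
trainer.train()
```
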
## Acknowledgements

This work, part of the project Desarrollo de Modelos ALIA, is funded by the Ministerio para la Transformación Digital y de la Función Pública and by the Plan de Recuperación, Transformación y Resiliencia, funded by the European Union – NextGenerationEU.
