# MrBERT Model Card
MrBERT is a multilingual foundational encoder model based on the ModernBERT architecture. It is pre-trained from scratch on a large-scale corpus of 6.1 trillion tokens covering 35 European languages as well as code. By building on ModernBERT’s modernized BERT-style design, MrBERT combines strong bidirectional representations with efficient long-context modeling.
Designed as a general-purpose multilingual encoder, MrBERT is well suited for a wide range of downstream tasks such as retrieval, classification, semantic search, and cross-lingual understanding across diverse languages.
## Technical Description
Technical details of the MrBERT model.
| Description | Value |
|---|---|
| Model Parameters | 308M |
| Tokenizer Type | SPM |
| Vocabulary size | 256,000 |
| Precision | bfloat16 |
| Context length | 8192 |
## Training Hyperparameters
| Hyperparameter | Value |
|---|---|
| Pretraining Objective | Masked Language Modeling |
| Learning Rate | 1E-03 |
| Learning Rate Scheduler | WSD |
| Warmup | 3,000,000,000 tokens |
| Optimizer | decoupled_stableadamw |
| Optimizer Hyperparameters | AdamW (β1=0.9, β2=0.98, ε=1e-06) |
| Weight Decay | 1E-05 |
| Global Batch Size | 4096 (Short Context) / 512 (Long Context) |
| Dropout | 1E-01 |
| Activation Function | GeLU |
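The table names a WSD (warmup-stable-decay) learning-rate scheduler. As a rough sketch of how such a schedule behaves (the phase lengths below are illustrative, not the values used in training; only the 3B-token warmup is stated above):

```python
def wsd_lr(step: int, peak_lr: float, warmup_steps: int,
           stable_steps: int, decay_steps: int) -> float:
    """Warmup-Stable-Decay (WSD) learning-rate schedule (illustrative).

    Linear warmup to peak_lr, a long constant plateau, then a
    linear decay to zero.
    """
    if step < warmup_steps:          # linear warmup
        return peak_lr * (step + 1) / warmup_steps
    step -= warmup_steps
    if step < stable_steps:          # constant plateau
        return peak_lr
    step -= stable_steps
    if step < decay_steps:           # linear decay to zero
        return peak_lr * (1 - (step + 1) / decay_steps)
    return 0.0
```

The plateau is what distinguishes WSD from cosine schedules: most of training runs at the peak rate, and the decay can be restarted from any plateau checkpoint.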
## How to use
You can use the pipeline for masked language modeling:
```python
>>> from transformers import pipeline
>>> from pprint import pprint
>>> unmasker = pipeline('fill-mask', model='BSC-LT/MrBERT')
>>> pprint(unmasker("I love the<mask>of Barcelona.", top_k=3))
[{'score': 0.29333314299583435,
  'sequence': 'I love the city of Barcelona.',
  'token': 31489,
  'token_str': 'city'},
 {'score': 0.06682543456554413,
  'sequence': 'I love the capital of Barcelona.',
  'token': 10859,
  'token_str': 'capital'},
 {'score': 0.05594080686569214,
  'sequence': 'I love the streets of Barcelona.',
  'token': 178738,
  'token_str': 'streets'}]
>>> pprint(unmasker("Me encanta la<mask>de Barcelona.", top_k=3))
[{'score': 0.4422685205936432,
  'sequence': 'Me encanta la ciudad de Barcelona.',
  'token': 19587,
  'token_str': 'ciudad'},
 {'score': 0.059732843190431595,
  'sequence': 'Me encanta la capital de Barcelona.',
  'token': 10859,
  'token_str': 'capital'},
 {'score': 0.03484857454895973,
  'sequence': 'Me encanta la arquitectura de Barcelona.',
  'token': 83374,
  'token_str': 'arquitectura'}]
>>> pprint(unmasker("M'encanta la<mask>de Barcelona.", top_k=3))
[{'score': 0.45476993918418884,
  'sequence': "M'encanta la ciutat de Barcelona.",
  'token': 17128,
  'token_str': 'ciutat'},
 {'score': 0.05597861483693123,
  'sequence': "M'encanta la capital de Barcelona.",
  'token': 10859,
  'token_str': 'capital'},
 {'score': 0.04105329513549805,
  'sequence': "M'encanta la música de Barcelona.",
  'token': 16051,
  'token_str': 'música'}]
```
Alternatively, you can extract the logits yourself and decode the prediction by hand:
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

model = AutoModelForMaskedLM.from_pretrained("BSC-LT/MrBERT")
tokenizer = AutoTokenizer.from_pretrained("BSC-LT/MrBERT")

outputs = model(**tokenizer("The capital of Spain is<mask>", return_tensors="pt")).logits

# The "<mask>" token sits at index -2, since position -1 holds the EOS token "</s>".
predicted_token = tokenizer.decode(torch.argmax(outputs[0, -2, :]))
print(f'The decoded element is "{predicted_token}".')  # This will give "Madrid"
```
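Hard-coding the `-2` index works for this input, but looking the mask position up via `tokenizer.mask_token_id` is more robust when the mask is not the last real token. The helper below is a hypothetical illustration, not part of the model's API:

```python
def mask_positions(input_ids, mask_token_id):
    """Return the indices of every mask token in a 1-D list of token ids."""
    return [i for i, tok in enumerate(input_ids) if tok == mask_token_id]

# Hypothetical usage with the model above (requires downloading the model):
# inputs = tokenizer("The capital of Spain is<mask>", return_tensors="pt")
# (pos,) = mask_positions(inputs["input_ids"][0].tolist(), tokenizer.mask_token_id)
# predicted_token = tokenizer.decode(torch.argmax(outputs[0, pos, :]))
```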
## Data
### Pretraining Corpus
The pretraining corpus comprises 6.1 trillion tokens, covering 35 European languages and code. Training was conducted in three distinct phases to balance broad knowledge acquisition with long-context performance:
| Phase | Context Length | Token Count |
|---|---|---|
| Short Context | 1,024 tokens | 5.5T |
| Long Context | 8,192 tokens | 500B |
| Annealing | 8,192 tokens | 100B |
### Language Distribution & Sources
While English constitutes 67.4% of the total data, the remaining share spans a diverse set of European languages.
The full list of datasets used will be published soon.
## Multilingual Evaluation and Performance
To assess the models' multilingual capabilities, the following multilingual benchmarks have been considered:
| Benchmark | Description | Languages | Source |
|---|---|---|---|
| XTREME | Benchmark for the evaluation of the cross-lingual generalization ability of pre-trained multilingual models | bg,ca,de,el,en,es,et,eu,fi,fr,hu,it,lt,nl,pl,pt,ro,ru,uk | LINK |
| EvalES | Spanish benchmark for evaluating language understanding across multiple NLP tasks | es | LINK |
| CLUB | Human-Annotated Catalan Benchmark | ca | LINK |
| Biomedical Benchmark | NER Spanish tasks and Multilingual Text Embedding Benchmark for evaluating sentence and document embedding quality across retrieval tasks using ColBERT | en, es | LINK |
| Legal Benchmark | Text Classification tasks and Multilingual Text Embedding Benchmark for evaluating sentence and document embedding quality across retrieval tasks using ColBERT | en, es | LINK |
In addition to the MrBERT family, the following base foundation models were considered:
| Multilingual Foundational Model | Number of Parameters | Vocab Size | Description |
|---|---|---|---|
| RoBERTa-ca | 125M | 50K | RoBERTa-ca is a Catalan-specific language model obtained by using vocabulary adaptation from mRoBERTa. |
| xlm-roberta-base | 279M | 250K | Foundational RoBERTa model pretrained with CommonCrawl data containing 100 languages. |
| mRoBERTa | 283M | 256K | RoBERTa base model pretrained with 35 European languages and a larger vocabulary size. |
| mmBERT | 308M | 250K | Multilingual ModernBERT pre-trained with staged language learning. |
| mGTE | 306M | 250K | Multilingual encoder also adapted for retrieval tasks. |
| Clinical ModernBERT | 137M | 50K | Pre-trained model on biomedical data using the ModernBERT architecture. |
| BioClinical-ModernBERT | 150M | 50K | Domain adaptation of ModernBERT to bioclinical data. |
| legal-bert-base-uncased | 110M | 31K | BERT-base model pre-trained on legal corpora. |
## Results
This section presents results across various multilingual benchmarks, with the maximum values highlighted in bold and the second-highest values underlined.
### XTREME Benchmark
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is designed to assess the cross-lingual generalization capabilities of pre-trained multilingual models. It comprises nine tasks that collectively test reasoning across various levels of syntax and semantics. The results reported here are from the test set, using the learning rate chosen based on the best-performing model on the validation set.
Given that retrieval tasks generally achieve higher performance with multi-vector methods such as ColBERT, we evaluate these tasks separately using m-TEB.
The table below shows only the average results across languages:
| task | xlm-roberta-base (279M) | mRoBERTa (283M) | mmBERT (308M) | mGTE (306M) | MrBERT (308M) |
|---|---|---|---|---|---|
| xnli (TC) | 78.25 | 79.09 | 80.54 | 77.90 | 81.26 |
| pawsx (TC) | 89.50 | 90.36 | 92.34 | 89.55 | 91.32 |
| udpos (POS) | 85.55 | 85.36 | 84.33 | 82.07 | 83.74 |
| panx (NER) | 73.69 | 75.65 | 73.89 | 73.05 | 72.06 |
| tydiqa (QA) | 56.41 | 53.96 | 63.95 | 51.07 | 56.34 |
| mlqa (QA) | 68.91 | 68.67 | 71.48 | 68.05 | 70.67 |
| xquad (QA) | 75.61 | 75.45 | 77.79 | 74.37 | 77.91 |
| average | 75.42 | 75.50 | 77.76 | 73.72 | 76.19 |
Per-language results are provided in the full tables below:
🔵 Sentence Classification
🔵 XNLI
Metric used: Accuracy.
| langs | xlm-roberta-base (279M) | mRoBERTa (283M) | mmBERT (308M) | mGTE (306M) | MrBERT (308M) |
|---|---|---|---|---|---|
| bg | 77.60 | 78.20 | 79.24 | 76.79 | 80.04 |
| de | 77.05 | 77.68 | 80.16 | 76.03 | 80.16 |
| el | 75.63 | 76.53 | 77.17 | 74.97 | 79.74 |
| en | 84.49 | 85.63 | 86.73 | 84.87 | 88.02 |
| es | 79.00 | 80.32 | 81.88 | 79.08 | 82.53 |
| fr | 78.04 | 78.94 | 80.76 | 78.68 | 80.78 |
| ru | 75.91 | 76.31 | 77.88 | 74.87 | 77.56 |
🔵 PAWS-X
Metric used: Accuracy.
| langs | xlm-roberta-base (279M) | mRoBERTa (283M) | mmBERT (308M) | mGTE (306M) | MrBERT (308M) |
|---|---|---|---|---|---|
| de | 87.25 | 88.50 | 90.95 | 86.55 | 89.05 |
| en | 94.10 | 94.50 | 95.50 | 93.70 | 95.60 |
| es | 87.90 | 88.95 | 91.40 | 89.50 | 89.95 |
| fr | 88.75 | 89.50 | 91.50 | 88.45 | 90.70 |
🟣 Structured Prediction: POS
🟣 POS (UDPOS)
Metric used: F1.
| langs | xlm-roberta-base (279M) | mRoBERTa (283M) | mmBERT (308M) | mGTE (306M) | MrBERT (308M) |
|---|---|---|---|---|---|
| bg | 88.31 | 88.10 | 87.48 | 86.33 | 86.90 |
| de | 88.35 | 87.54 | 86.94 | 85.52 | 86.82 |
| el | 87.52 | 83.90 | 81.80 | 83.39 | 83.95 |
| en | 95.89 | 95.80 | 95.96 | 95.65 | 95.84 |
| es | 87.49 | 87.45 | 87.70 | 86.03 | 87.84 |
| et | 84.60 | 85.55 | 81.08 | 79.52 | 77.43 |
| eu | 66.98 | 67.43 | 64.67 | 64.32 | 64.22 |
| fi | 84.77 | 83.58 | 82.38 | 79.01 | 77.05 |
| fr | 85.93 | 86.84 | 86.55 | 82.76 | 85.86 |
| hu | 83.12 | 82.15 | 80.00 | 77.72 | 81.69 |
| it | 86.95 | 89.02 | 89.14 | 85.73 | 87.42 |
| lt | 83.10 | 81.03 | 82.00 | 75.81 | 80.42 |
| nl | 89.16 | 89.36 | 89.17 | 86.66 | 89.02 |
| pl | 83.90 | 84.33 | 82.80 | 81.22 | 85.00 |
| pt | 86.75 | 87.78 | 87.35 | 85.07 | 87.28 |
| ro | 83.47 | 81.39 | 82.54 | 76.21 | 77.47 |
| ru | 88.92 | 89.59 | 87.84 | 85.48 | 88.75 |
| uk | 84.72 | 85.55 | 82.54 | 80.77 | 84.44 |
🟣 NER (PANX)
Metric used: F1.
| langs | xlm-roberta-base (279M) | mRoBERTa (283M) | mmBERT (308M) | mGTE (306M) | MrBERT (308M) |
|---|---|---|---|---|---|
| bg | 76.42 | 77.92 | 74.29 | 73.79 | 74.97 |
| de | 74.60 | 78.36 | 75.98 | 75.59 | 76.11 |
| el | 73.32 | 76.31 | 68.62 | 68.95 | 70.88 |
| en | 82.39 | 83.43 | 82.82 | 83.99 | 82.05 |
| es | 71.04 | 81.19 | 79.93 | 78.14 | 77.46 |
| et | 71.85 | 72.87 | 70.32 | 69.92 | 67.04 |
| eu | 59.58 | 56.52 | 51.70 | 59.46 | 55.84 |
| fi | 75.80 | 76.07 | 74.67 | 74.78 | 70.54 |
| fr | 77.61 | 77.65 | 81.15 | 78.14 | 77.63 |
| hu | 76.66 | 73.42 | 73.03 | 74.14 | 73.26 |
| it | 76.73 | 80.02 | 80.49 | 79.65 | 76.98 |
| lt | 72.04 | 73.52 | 70.14 | 68.79 | 66.40 |
| nl | 79.85 | 81.91 | 80.10 | 78.87 | 79.37 |
| pl | 77.45 | 80.33 | 77.81 | 78.03 | 77.75 |
| pt | 75.93 | 79.21 | 80.36 | 78.86 | 77.36 |
| ro | 71.78 | 71.49 | 75.37 | 69.47 | 66.02 |
| ru | 63.39 | 67.31 | 64.13 | 61.73 | 60.89 |
| uk | 70.00 | 74.20 | 69.17 | 62.54 | 66.49 |
⚫ Question Answering
⚫ TyDiQA
Metric used: F1.
| langs | xlm-roberta-base (279M) | mRoBERTa (283M) | mmBERT (308M) | mGTE (306M) | MrBERT (308M) |
|---|---|---|---|---|---|
| en | 62.42 | 60.39 | 71.73 | 57.94 | 68.50 |
| fi | 54.19 | 49.51 | 60.89 | 44.85 | 48.08 |
| ru | 52.61 | 51.98 | 59.24 | 50.42 | 52.45 |
⚫ MLQA
Metric used: F1.
| langs | xlm-roberta-base (279M) | mRoBERTa (283M) | mmBERT (308M) | mGTE (306M) | MrBERT (308M) |
|---|---|---|---|---|---|
| de | 61.57 | 61.49 | 65.19 | 59.96 | 64.38 |
| en | 78.68 | 76.37 | 79.09 | 76.24 | 77.45 |
| es | 66.48 | 68.15 | 70.17 | 67.95 | 70.17 |
⚫ XQUAD
Metric used: F1.
| langs | xlm-roberta-base (279M) | mRoBERTa (283M) | mmBERT (308M) | mGTE (306M) | MrBERT (308M) |
|---|---|---|---|---|---|
| de | 74.31 | 72.93 | 77.68 | 70.62 | 75.94 |
| el | 71.68 | 71.70 | 68.81 | 71.93 | 74.92 |
| en | 82.26 | 82.37 | 85.67 | 82.29 | 84.44 |
| es | 75.20 | 78.18 | 79.58 | 76.08 | 79.23 |
| ru | 74.58 | 72.08 | 77.20 | 70.93 | 75.02 |
### EvalES Benchmark
The EvalES benchmark consists of 7 tasks: Named Entity Recognition and Classification (CoNLL-NERC), Part-of-Speech Tagging (UD-POS), Text Classification (MLDoc), Paraphrase Identification (PAWS-X), Semantic Textual Similarity (STS), Question Answering (SQAC), and Textual Entailment (XNLI). This benchmark evaluates the model's capabilities in the Spanish language.
| tasks | xlm-roberta-base (279M) | mRoBERTa (283M) | mmBERT (308M) | mGTE (306M) | MrBERT (308M) | MrBERT-es (150M) |
|---|---|---|---|---|---|---|
| pos (f1) | 99.01 | 99.03 | 99.09 | 98.92 | 99.06 | 99.08 |
| ner (f1) | 86.91 | 87.77 | 87.01 | 86.96 | 87.42 | 87.77 |
| sts (pearson) | 80.88 | 79.69 | 82.88 | 84.52 | 84.18 | 85.23 |
| tc - paws-x (acc) | 90.35 | 91.30 | 91.35 | 89.70 | 91.25 | 91.90 |
| tc - mldoc (acc) | 47.67 | 91.28 | 95.10 | 96.13 | 95.28 | 95.55 |
| tc - massivenew (acc) | 21.89 | 86.45 | 86.79 | 87.19 | 87.46 | 87.05 |
| qa (f1) | 74.48 | 77.03 | 79.79 | 76.78 | 81.96 | 82.19 |
| Average | 71.60 | 87.51 | 88.86 | 88.60 | 89.52 | 89.83 |
### CLUB Benchmark
The Catalan Language Understanding Benchmark consists of 6 tasks: Named Entity Recognition (NER), Part-of-Speech Tagging (POS), Semantic Textual Similarity (STS), Text Classification (TC), Textual Entailment (TE), and Question Answering (QA). This benchmark evaluates the model's capabilities in the Catalan language.
This comparison also includes RoBERTa-ca, a model derived from mRoBERTa by applying vocabulary adaptation and performing continual pre-training on a 95GB Catalan-only corpus. For further details, visit here.
| tasks | xlm-roberta-base (279M) | mRoBERTa (283M) | roberta-ca (125M) | mmBERT (308M) | mGTE (306M) | MrBERT (308M) | MrBERT-ca (150M) |
|---|---|---|---|---|---|---|---|
| ner (F1) | 87.61 | 88.33 | 89.70 | 88.14 | 87.20 | 87.32 | 88.04 |
| pos (F1) | 98.91 | 98.98 | 99.00 | 99.01 | 98.67 | 99.01 | 99.03 |
| sts (Pearson) | 74.67 | 79.52 | 82.99 | 83.16 | 78.65 | 83.00 | 85.42 |
| tc (Acc.) | 72.57 | 72.41 | 72.81 | 74.11 | 74.68 | 73.79 | 74.97 |
| te (Acc.) | 79.59 | 82.38 | 82.14 | 83.18 | 79.40 | 84.03 | 86.92 |
| viquiquad (F1) | 86.93 | 87.86 | 87.31 | 89.86 | 86.78 | 89.25 | 89.59 |
| xquad (F1) | 69.69 | 69.40 | 70.53 | 73.88 | 69.27 | 73.96 | 74.47 |
| Average | 81.42 | 82.70 | 83.50 | 84.48 | 82.09 | 84.34 | 85.49 |
### MTEB Benchmark
Models are trained on 810k MS-Marco samples using teacher scores from BGE-M3, with a batch size of 16 and a learning rate of 8e-5, leveraging the PyLate library. Evaluations are conducted separately across the legal, scientific, and medical domains.
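The ColBERT evaluation mentioned above relies on late-interaction (MaxSim) scoring: each query token embedding is matched against its most similar document token embedding, and these per-token maxima are summed. A minimal plain-Python sketch of the scoring rule (PyLate implements this with batched, normalized tensors; the function below is illustrative only):

```python
def maxsim(query_vecs, doc_vecs):
    """Late-interaction (MaxSim) score between a query and a document.

    query_vecs, doc_vecs: lists of equal-length embedding vectors,
    one per token. For each query token, take the maximum dot product
    over all document tokens, then sum over query tokens.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)
```

In practice the token embeddings are L2-normalized, so each dot product is a cosine similarity, and documents are ranked by this score.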
🩺 Biomedical
| Task Name | Task Type | mmBERT (308M) | MrBERT (308M) | MrBERT-es (150M) | BioClinical-MdnBERT (150M) | Clinical MdnBERT (137M) | MrBERT-biomed (308M) |
|---|---|---|---|---|---|---|---|
| bsc-bio-distemist-ner (ES) | NER | 78.00 | 77.84 | 78.07 | 75.45 | 70.22 | 77.93 |
| cantemist (ES) | NER | 78.03 | 68.73 | 73.40 | 66.68 | 30.91 | 70.78 |
| pharmaconer (ES) | NER | 89.66 | 88.58 | 88.97 | 87.66 | 81.69 | 89.92 |
| AbSanitas (ES) | Retrieval | 34.68 | 34.16 | 53.49 | 30.41 | 18.08 | 51.01 |
| r2med (EN) | Retrieval | 10.87 | 10.15 | 8.65 | 9.97 | 5.91 | 9.76 |
| SciDocs (EN) | Retrieval | 10.00 | 9.75 | 9.90 | 9.33 | 3.64 | 10.05 |
| SciFact (EN) | Retrieval | 32.35 | 31.08 | 31.46 | 32.07 | 20.34 | 30.25 |
| TREC-COVID (EN) | Retrieval | 30.77 | 49.53 | 37.51 | 46.08 | 23.88 | 48.76 |
| Average (EN) | All Tasks | 21.00 | 25.13 | 21.88 | 24.36 | 13.44 | 24.71 |
| Average (EN + ES) | All Tasks | 45.55 | 46.23 | 47.68 | 44.71 | 31.83 | 48.56 |
⚖️ Legal
| Task Name | Task Type | mmBERT (308M) | MrBERT (308M) | MrBERT-es (150M) | legal-bert-base-uncased (110M) | MrBERT-legal (308M) |
|---|---|---|---|---|---|---|
| LexBOE (ES) | Text Classification | 96.84 | 97.02 | 97.28 | 95.36 | 96.80 |
| small-spanish-legal-dataset (ES) | Retrieval | 42.58 | 40.78 | 46.92 | 19.79 | 38.75 |
| EURLEX (EN) | Text Classification | 97.43 | 97.40 | 97.41 | 97.42 | 97.33 |
| AILAStatutes (EN) | Retrieval | 14.31 | 13.90 | 12.28 | 13.49 | 16.33 |
| legal_summarization (EN) | Retrieval | 53.33 | 53.84 | 46.41 | 52.40 | 55.05 |
| LegalBench (EN) | Retrieval | 60.15 | 58.88 | 58.26 | 63.42 | 58.04 |
| NanoTouche2020 (EN) | Retrieval | 34.03 | 44.15 | 31.18 | 34.48 | 44.74 |
| Average (EN) | All Tasks | 51.85 | 53.63 | 49.11 | 52.24 | 54.30 |
| Average (EN + ES) | All Tasks | 56.95 | 58.00 | 55.68 | 53.77 | 58.15 |
## Additional information
### Author
The Language Technologies Lab from Barcelona Supercomputing Center.
### Contact
For further information, please send an email to langtech@bsc.es.
### Copyright
Copyright (c) 2026 by the Language Technologies Lab, Barcelona Supercomputing Center.
### Funding
This work has been supported and funded by the Ministerio para la Transformación Digital y de la Función Pública and the Plan de Recuperación, Transformación y Resiliencia – funded by the EU through NextGenerationEU, within the framework of the Modelos del Lenguaje project, as well as by the European Union – NextGenerationEU. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or European Commission. Neither the European Union nor the European Commission can be held responsible for them.
### Acknowledgements
This project has benefited from data contributions by numerous teams and institutions.
In Catalonia, many institutions have been involved in the project. Our thanks to Òmnium Cultural, Parlament de Catalunya, Institut d'Estudis Aranesos, Racó Català, Vilaweb, ACN, Nació Digital, El món and Aquí Berguedà.
At the national level, we are especially grateful to our ILENIA project partners CENID, HiTZ and CiTIUS for their participation. We also extend our genuine gratitude to the Spanish Senate and Congress, Fundación Dialnet, Fundación Elcano, the Instituto de Ingeniería del Conocimiento, and the Instituto Universitario de Sistemas Inteligentes y Aplicaciones Numéricas en Ingeniería (SIANI) of the University of Las Palmas de Gran Canaria.
At the international level, we thank the Welsh government, DFKI, Occiglot project, especially Malte Ostendorff, and The Common Crawl Foundation, especially Pedro Ortiz, for their collaboration.
Their valuable efforts have been instrumental in the development of this work.
### Disclaimer
Be aware that the model may contain biases or other unintended distortions. When third parties deploy systems or provide services based on this model, or use the model themselves, they bear the responsibility for mitigating any associated risks and ensuring compliance with applicable regulations, including those governing the use of Artificial Intelligence.
The Barcelona Supercomputing Center, as the owner and creator of the model, shall not be held liable for any outcomes resulting from third-party use.
### Citation
A detailed technical report will be released soon.
### License