MrBERT-science Model Card

MrBERT-science is a new foundational multilingual science model built on the ModernBERT architecture. It is obtained via domain adaptation from MrBERT: all weights are initialized from MrBERT, and training continues on a domain-specific scientific corpus of 3.6B tokens (38.1% Spanish, 61.9% English).

Technical Description

Technical details of the MrBERT-science model.

Description	Value
Model Parameters	308M
Tokenizer Type	SPM
Vocabulary size	256000
Precision	bfloat16
Context length	8192
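
As a rough guide to deployment size, the parameter count and precision in the table above give a lower bound on the memory needed for the weights. The sketch below is back-of-the-envelope arithmetic only: it uses the rounded 308M figure, counts weights alone (no activations, optimizer state, or framework overhead), and relies on bfloat16 storing each value in 2 bytes.

```python
# Weights-only memory estimate from the "Technical Description" table.
PARAMS = 308_000_000      # "Model Parameters" row (rounded figure)
BYTES_PER_PARAM = 2       # bfloat16 stores each value in 16 bits

weight_bytes = PARAMS * BYTES_PER_PARAM

print(f"{weight_bytes / 1e6:.0f} MB")      # 616 MB
print(f"{weight_bytes / 2**20:.1f} MiB")   # 587.5 MiB
```

Actual memory use at inference time will be higher and depends on batch size and sequence length, up to the 8192-token context.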

Training Hyperparameters

Hyperparameter	Value
Pretraining Objective	Masked Language Modeling
Learning Rate	6E-04
Learning Rate Scheduler	Cosine
Warmup	360,000,000 tokens
Optimizer	decoupled_stableadamw
Optimizer Hyperparameters	AdamW (β1=0.9, β2=0.98, ε=1e-06)
Weight Decay	1E-05
Global Batch Size	512
Dropout	1E-01
Activation Function	GeLU
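
To make the scheduler rows concrete, here is a minimal sketch of a token-based cosine schedule with warmup, using the peak learning rate and warmup length from the table. Several details are assumptions, not facts stated in this card: the warmup is taken to be linear, the schedule horizon is taken to be one pass over the 3.6B-token corpus (which would make the warmup 10% of training), and the learning rate is taken to decay to zero.

```python
import math

PEAK_LR = 6e-4                  # "Learning Rate" row
WARMUP_TOKENS = 360_000_000     # "Warmup" row
TOTAL_TOKENS = 3_600_000_000    # assumption: a single pass over the 3.6B-token corpus
FINAL_LR = 0.0                  # assumption: cosine decay ends at zero


def learning_rate(tokens_seen: int) -> float:
    """Learning rate after `tokens_seen` training tokens (illustrative only)."""
    if tokens_seen < WARMUP_TOKENS:
        # Assumed linear ramp from 0 to the peak learning rate.
        return PEAK_LR * tokens_seen / WARMUP_TOKENS
    # Cosine decay from the peak to FINAL_LR over the remaining tokens.
    decay_tokens = TOTAL_TOKENS - WARMUP_TOKENS
    progress = min(1.0, (tokens_seen - WARMUP_TOKENS) / decay_tokens)
    return FINAL_LR + 0.5 * (PEAK_LR - FINAL_LR) * (1.0 + math.cos(math.pi * progress))


for tokens in (0, 180_000_000, 360_000_000, 1_980_000_000, 3_600_000_000):
    print(f"{tokens:>13,d} tokens -> lr = {learning_rate(tokens):.2e}")
```

Under these assumptions the learning rate reaches 6E-04 at 360M tokens, falls to half of that midway through the decay phase, and ends at zero.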

How to use

>>> from transformers import pipeline
>>> from pprint import pprint

>>> unmasker = pipeline('fill-mask', model='BSC-LT/MrBERT-science')

>>> pprint(unmasker("Hubble's<mask>describes the expansion of the universe and the relationship between a galaxy's distance and its recessional velocity.", top_k=3))
[{'score': 0.8211672902107239,
  'sequence': "Hubble's law describes the expansion of the universe and the "
              "relationship between a galaxy's distance and its recessional "
              'velocity.',
  'token': 21673,
  'token_str': 'law'},
 {'score': 0.16654537618160248,
  'sequence': "Hubble's Law describes the expansion of the universe and the "
              "relationship between a galaxy's distance and its recessional "
              'velocity.',
  'token': 18573,
  'token_str': 'Law'},
 {'score': 0.0063100955449044704,
  'sequence': "Hubble's equation describes the expansion of the universe and "
              "the relationship between a galaxy's distance and its "
              'recessional velocity.',
  'token': 174396,
  'token_str': 'equation'}]
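
The pipeline returns a list of dictionaries, so predictions are easy to post-process. As an illustration, the sketch below merges case variants of the same token (here "law" and "Law") and sums their scores. The helper `merge_case_variants` is hypothetical and not part of the model or of transformers; the sample input is copied from the output above so that the snippet runs without downloading the model.

```python
from collections import defaultdict


def merge_case_variants(predictions):
    """Sum fill-mask scores of tokens that differ only by letter case."""
    merged = defaultdict(float)
    for prediction in predictions:
        key = prediction["token_str"].strip().casefold()
        merged[key] += prediction["score"]
    # Highest combined score first, rounded for readability.
    ranked = sorted(merged.items(), key=lambda item: item[1], reverse=True)
    return [(token, round(score, 4)) for token, score in ranked]


# Scores and tokens copied from the example output above.
sample = [
    {"score": 0.8211672902107239, "token": 21673, "token_str": "law"},
    {"score": 0.16654537618160248, "token": 18573, "token_str": "Law"},
    {"score": 0.0063100955449044704, "token": 174396, "token_str": "equation"},
]

print(merge_case_variants(sample))
# [('law', 0.9877), ('equation', 0.0063)]
```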

Evaluation

In addition to the MrBERT family, the following base foundation models were considered:

Multilingual Foundational Model	Number of Parameters	Vocab Size	Description
xlm-roberta-base	279M	250K	Foundational RoBERTa model pretrained with CommonCrawl data containing 100 languages.
mRoBERTa	283M	256K	RoBERTa base model pretrained with 35 European languages and a larger vocabulary size.
mmBERT	308M	250K	Multilingual ModernBERT pretrained with staged language learning.
INDUS	125M	50K	RoBERTa-based, encoder-only transformer model, domain-adapted for NASA Science Mission Directorate (SMD) applications.
SciBERT	110M	31K	BERT model trained on scientific text.

The benchmarks used for comparison are:

MTEB: A subset of science-related tasks.
AbScientia: A Spanish scientific abstracts dataset focused on STEM disciplines, organized into 24 unified categories to capture the specific characteristics of Spanish scientific language. The reported metric is accuracy.
Scientific-paragraphs-categorization ('en' split): A large-scale topic-classification dataset built from open-access scientific publications, classified into 26 scientific topics to capture the characteristics of standard English scientific language. The reported metric is accuracy.

Task Name	Task Type	mmBERT (308M)	MrBERT (308M)	MrBERT-es (150M)	INDUS (125M)	SciBERT (110M)	MrBERT-science (308M)
abscientia (ES)	Text Classification	82.66	81.98	82.39	79.24	80.80	82.34
scientific_paragraphs (EN)	Text Classification	60.22	61.02	60.82	63.95	62.77	62.20
chemhotpotqa (EN)	Retrieval	62.73	62.70	56.94	63.79	64.43	62.94
chemnq (EN)	Retrieval	34.87	39.70	29.37	39.78	39.09	36.65
climate_fever_v2 (EN)	Retrieval	23.32	23.21	21.52	23.91	23.06	23.04
litsearch (EN)	Retrieval	11.06	12.17	12.41	12.24	13.52	11.78
Average (EN)	All Tasks	38.44	39.76	36.21	40.73	40.57	39.32
Average (EN + ES)	All Tasks	45.81	46.80	43.91	47.15	47.28	46.49
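
The "Average (EN)" row can be reproduced as the unweighted mean of the five English tasks. The sketch below recomputes it from the per-task scores in the table; the dictionary layout and variable names are illustrative only.

```python
from statistics import mean

# Scores per model, in table order: scientific_paragraphs, chemhotpotqa,
# chemnq, climate_fever_v2, litsearch (all EN).
en_scores = {
    "mmBERT": [60.22, 62.73, 34.87, 23.32, 11.06],
    "MrBERT": [61.02, 62.70, 39.70, 23.21, 12.17],
    "MrBERT-es": [60.82, 56.94, 29.37, 21.52, 12.41],
    "INDUS": [63.95, 63.79, 39.78, 23.91, 12.24],
    "SciBERT": [62.77, 64.43, 39.09, 23.06, 13.52],
    "MrBERT-science": [62.20, 62.94, 36.65, 23.04, 11.78],
}

# Unweighted mean over the five English tasks, rounded as in the table.
en_average = {model: round(mean(scores), 2) for model, scores in en_scores.items()}

for model, average in en_average.items():
    print(f"{model:<15} {average:.2f}")
```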

Additional information

Author

The Language Technologies Lab from Barcelona Supercomputing Center.

Contact

For further information, please send an email to langtech@bsc.es.

Copyright

Funding

This work has been supported and funded by the Ministerio para la Transformación Digital y de la Función Pública and the Plan de Recuperación, Transformación y Resiliencia – funded by the EU through NextGenerationEU, within the framework of the Modelos del Lenguaje project, as well as by the European Union – NextGenerationEU. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or European Commission. Neither the European Union nor the European Commission can be held responsible for them.

Acknowledgements

This project has benefited from the contributions of numerous teams and institutions through data contributions.

In Catalonia, many institutions have been involved in the project. Our thanks to Òmnium Cultural, Parlament de Catalunya, Institut d'Estudis Aranesos, Racó Català, Vilaweb, ACN, Nació Digital, El món and Aquí Berguedà.

At the national level, we are especially grateful to our ILENIA project partners: CENID, HiTZ and CiTIUS for their participation. We also extend our genuine gratitude to the Spanish Senate and Congress, Fundación Dialnet, Fundación Elcano and the ‘Instituto Universitario de Sistemas Inteligentes y Aplicaciones Numéricas en Ingeniería (SIANI)’ of the University of Las Palmas de Gran Canaria.

At the international level, we thank the Welsh government, DFKI, Occiglot project, especially Malte Ostendorff, and The Common Crawl Foundation, especially Pedro Ortiz, for their collaboration.

Their valuable efforts have been instrumental in the development of this work.

Disclaimer

Be aware that the model may contain biases or other unintended distortions. When third parties deploy systems or provide services based on this model, or use the model themselves, they bear the responsibility for mitigating any associated risks and ensuring compliance with applicable regulations, including those governing the use of Artificial Intelligence.

The Barcelona Supercomputing Center, as the owner and creator of the model, shall not be held liable for any outcomes resulting from third-party use.