dtamayo committed
Commit 3e30cc4
0 Parent(s):

Add MrBERT-es

.gitattributes ADDED
@@ -0,0 +1,5 @@
+ *.json filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.png filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,181 @@
+ ---
+ language:
+ - es
+ - en
+ tags:
+ - fill-mask
+ - masked-lm
+ - long-context
+ - modernbert
+ license: apache-2.0
+ library_name: transformers
+ ---
+ # MrBERT-es Model Card
+
+ MrBERT-es is a new foundational Spanish language model built on the [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base/tree/main) architecture. It is initialized through vocabulary adaptation from [MrBERT](https://huggingface.co/BSC-LT/MrBERT): all weights are copied from MrBERT, while the embedding matrix receives a specialized treatment that carefully handles the differences between the two tokenizers.
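+
+ As a rough illustration of this style of vocabulary adaptation (a minimal sketch under our own assumptions, not the exact recipe used to build MrBERT-es), shared tokens can inherit their source embeddings directly, while tokens new to the target vocabulary can be initialized from the mean of their source-tokenizer subtoken embeddings:
+
+ ```python
+ import torch
+ from transformers import AutoModelForMaskedLM, AutoTokenizer
+
+ # Hypothetical sketch: build a target embedding matrix from a source model.
+ src_model = AutoModelForMaskedLM.from_pretrained("BSC-LT/MrBERT")
+ src_tok = AutoTokenizer.from_pretrained("BSC-LT/MrBERT")
+ tgt_tok = AutoTokenizer.from_pretrained("BSC-LT/MrBERT-es")
+
+ src_emb = src_model.get_input_embeddings().weight.detach()
+ new_emb = torch.empty(len(tgt_tok), src_emb.size(1))
+
+ src_vocab = src_tok.get_vocab()
+ for token, idx in tgt_tok.get_vocab().items():
+     if token in src_vocab:  # shared token: copy the source embedding
+         new_emb[idx] = src_emb[src_vocab[token]]
+     else:  # new token: average the embeddings of its source subtokens
+         pieces = src_tok(token, add_special_tokens=False)["input_ids"]
+         new_emb[idx] = src_emb[pieces].mean(dim=0) if pieces else src_emb.mean(dim=0)
+ ```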
+
+ Following initialization, the model is continually pretrained on a bilingual corpus of 615 billion tokens, evenly balanced between English and Spanish.
+
+ ## Technical Description
+
+ Technical details of the MrBERT-es model.
+
+ | Description | Value |
+ |-------------------------|:--------------|
+ | Model Parameters | 150M |
+ | Tokenizer Type | SPM |
+ | Vocabulary size | 51200 |
+ | Precision | bfloat16 |
+ | Context length | 8192 |
+
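+ A quick way to confirm the vocabulary size and context length from the released checkpoint (a hedged sketch; the attribute names follow the standard ModernBERT configuration in `transformers`):
+
+ ```python
+ from transformers import AutoConfig
+
+ cfg = AutoConfig.from_pretrained("BSC-LT/MrBERT-es")
+ print(cfg.vocab_size)               # expected: 51200
+ print(cfg.max_position_embeddings)  # expected: 8192
+ ```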
+
+ Training Hyperparameters
+
+ | Hyperparameter | Value |
+ |------------------------- |:-------------- |
+ | Pretraining Objective | Masked Language Modeling |
+ | Learning Rate | 4E-04 |
+ | Learning Rate Scheduler | WSD |
+ | Warmup | 3,000,000,000 |
+ | Optimizer | decoupled_stableadamw |
+ | Optimizer Hyperparameters | AdamW (β1=0.9, β2=0.98, ε=1e-06) |
+ | Weight Decay | 1E-05 |
+ | Global Batch Size | 4096 |
+ | Dropout | 1E-01 |
+ | Activation Function | GeLU |
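+
+ For reference, a warmup-stable-decay (WSD) schedule ramps linearly up to the peak learning rate, holds it constant for most of training, and then decays. The sketch below is purely illustrative: only the 4e-4 peak comes from the table above, and the phase lengths are made-up placeholders.
+
+ ```python
+ # Illustrative WSD (warmup-stable-decay) learning-rate schedule.
+ def wsd_lr(step, peak_lr=4e-4, warmup=1_000, stable=8_000, decay=1_000):
+     if step < warmup:                # linear warmup to the peak
+         return peak_lr * step / warmup
+     if step < warmup + stable:       # long constant (stable) phase
+         return peak_lr
+     done = step - warmup - stable    # final linear decay to zero
+     return peak_lr * max(0.0, 1.0 - done / decay)
+ ```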
+
+ ## How to use
+
+ ```python
+ >>> from transformers import pipeline
+ >>> from pprint import pprint
+
+ >>> unmasker = pipeline('fill-mask', model='BSC-LT/MrBERT-es')
+
+ >>> pprint(unmasker("Me encanta la<mask>de Barcelona.",top_k=3))
+ [{'score': 0.24022650718688965,
+ 'sequence': 'Me encanta la ciudad de Barcelona.',
+ 'token': 2634,
+ 'token_str': 'ciudad'},
+ {'score': 0.08937042951583862,
+ 'sequence': 'Me encanta la gastronomía de Barcelona.',
+ 'token': 18096,
+ 'token_str': 'gastronomía'},
+ {'score': 0.08782190084457397,
+ 'sequence': 'Me encanta la gente de Barcelona.',
+ 'token': 4475,
+ 'token_str': 'gente'}]
+ >>> pprint(unmasker("La ciencia engloba disciplinas como la<mask>y las matemáticas.",top_k=3))
+ [{'score': 0.8550629019737244,
+ 'sequence': 'La ciencia engloba disciplinas como la física y las '
+ 'matemáticas.',
+ 'token': 9204,
+ 'token_str': 'física'},
+ {'score': 0.06438734382390976,
+ 'sequence': 'La ciencia engloba disciplinas como la biología y las '
+ 'matemáticas.',
+ 'token': 40678,
+ 'token_str': 'biología'},
+ {'score': 0.044761642813682556,
+ 'sequence': 'La ciencia engloba disciplinas como la química y las '
+ 'matemáticas.',
+ 'token': 25047,
+ 'token_str': 'química'}]
+ >>> pprint(unmasker("The favourite food for Spaniards is<mask>.",top_k=3))
+ [{'score': 0.11592480540275574,
+ 'sequence': 'The favourite food for Spaniards is pizza .',
+ 'token': 22646,
+ 'token_str': 'pizza'},
+ {'score': 0.07638967037200928,
+ 'sequence': 'The favourite food for Spaniards is pasta .',
+ 'token': 20822,
+ 'token_str': 'pasta'},
+ {'score': 0.07300166040658951,
+ 'sequence': 'The favourite food for Spaniards is chicken .',
+ 'token': 16966,
+ 'token_str': 'chicken'}]
+ ```
+
+ Which is equivalent to the following PyTorch snippet:
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForMaskedLM
+ import torch
+
+ model = AutoModelForMaskedLM.from_pretrained("BSC-LT/MrBERT-es")
+ tokenizer = AutoTokenizer.from_pretrained("BSC-LT/MrBERT-es")
+
+ # The "<mask>" token sits at index -3: position -1 is the EOS token "</s>" and position -2 is the "." token.
+ outputs = model(**tokenizer("La capital de España es<mask>.", return_tensors="pt")).logits
+ predicted_token = tokenizer.decode(torch.argmax(outputs[0, -3, :]))
+
+ print(f"The prediction is \"{predicted_token}\".")  # The prediction is "Madrid"
+ ```
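+
+ Rather than hard-coding index -3, the mask position can also be located through the tokenizer's mask token id, and a softmax at that position recovers the pipeline-style scores. This is a small extension of the snippet above, under the same assumptions:
+
+ ```python
+ # Locate the mask by its token id instead of a fixed offset, then take top-3.
+ inputs = tokenizer("La capital de España es<mask>.", return_tensors="pt")
+ mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
+ probs = torch.softmax(model(**inputs).logits[0, mask_pos], dim=-1)
+ top = torch.topk(probs, k=3)
+ for p, tok_id in zip(top.values, top.indices):
+     print(f"{tokenizer.decode(tok_id)}: {p.item():.4f}")
+ ```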
+
+ In most of the evaluations presented below, the model is adapted to each use case by fine-tuning it with a task-specific head whose logits are used to score the text.
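+
+ For classification-style tasks this typically means loading the checkpoint with a freshly initialized classification head before fine-tuning (a hedged sketch; the exact evaluation setup may differ):
+
+ ```python
+ from transformers import AutoModelForSequenceClassification
+
+ # e.g. XNLI uses 3 labels (entailment / neutral / contradiction)
+ clf = AutoModelForSequenceClassification.from_pretrained(
+     "BSC-LT/MrBERT-es", num_labels=3
+ )
+ ```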
+
+ ## Evaluation: EvalES Benchmark
+
+ Model performance in Spanish is assessed with the [EvalES benchmark](https://benchmark.plantl.bsc.es/datasets.html), which consists of 7 tasks: Named Entity Recognition and Classification (CoNLL-NERC), Part-of-Speech Tagging (UD-POS), Text Classification (MLDoc), Paraphrase Identification (PAWS-X), Semantic Textual Similarity (STS), Question Answering (SQAC), and Textual Entailment (XNLI).
+
+ The following base foundational models were considered for the comparison:
+
+ | Multilingual Foundational Model | Number of Parameters | Vocab Size | Description |
+ |---------------------------------|----------------------|------------|-------------|
+ | [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) | 279M | 250K | Foundational RoBERTa model pretrained with CommonCrawl data containing 100 languages. |
+ | [mRoBERTa](https://huggingface.co/BSC-LT/mRoBERTa) | 283M | 256K | RoBERTa base model pretrained with 35 European languages and a larger vocabulary size. |
+ | [mmBERT](https://huggingface.co/jhu-clsp/mmBERT-base) | 308M | 250K | Multilingual ModernBERT pre-trained with staged language learning. |
+ | [MrBERT](https://huggingface.co/BSC-LT/MrBERT) | 308M | 250K | Multilingual ModernBERT pre-trained with 35 European languages. |
+
+ | tasks | xlm-roberta-base (279M) | mRoBERTa (283M) | mmBERT (308M) | MrBERT (308M) | MrBERT-es (150M) |
+ |--------------|---------------------------|-------------------|-----------------|-----------------|--------------------|
+ | pos (f1) | 99.01 | 99.03 | **99.09** | <u>99.06</u> | 99.04 |
+ | ner (f1) | 86.91 | **87.77** | 87.01 | <u>87.42</u> | 87.36 |
+ | sts (pearson) | 80.88 | 79.69 | 82.88 | <u>84.18</u> | **85.18** |
+ | tc - paws-x (acc) | 90.35 | 91.30 | <u>91.35</u> | 91.25 | **91.60** |
+ | tc - mldoc (acc) | 47.67 | 91.28 | 95.10 | <u>95.28</u> | **95.35** |
+ | tc - massivenew (acc) | 21.89 | 86.45 | 86.79 | **87.46** | <u>87.19</u> |
+ | qa (f1) | 74.48 | 77.03 | 79.79 | **81.96** | <u>80.33</u> |
+ | te (acc) | 33.33** | 33.33** | 79.98 | **84.69** | <u>82.14</u> |
+
+ ** The textual entailment task currently exhibits some degenerate evaluations; we are working on improving the framework to address this issue.
+
+ ## Additional information
+
+ ### Author
+ The Language Technologies Lab from Barcelona Supercomputing Center.
+
+ ### Contact
+ For further information, please send an email to <langtech@bsc.es>.
+
+ ### Copyright
+ Copyright (c) 2025 by Language Technologies Lab, Barcelona Supercomputing Center.
+
+ ### Funding
+
+ This work has been promoted and financed by the Ministerio para la Transformación Digital y de la Función Pública and funded by the EU through NextGenerationEU, within the framework of the project [ILENIA](https://proyectoilenia.es/) with reference 2022/TL22/00215337.
+
+ ### Acknowledgements
+
+ This project has benefited from data contributions by numerous teams and institutions.
+
+ In Catalonia, many institutions have been involved in the project. Our thanks to Òmnium Cultural, Parlament de Catalunya, Institut d'Estudis Aranesos, Racó Català, Vilaweb, ACN, Nació Digital, El món and Aquí Berguedà.
+
+ At the national level, we are especially grateful to our ILENIA project partners CENID, HiTZ and CiTIUS for their participation. We also extend our genuine gratitude to the Spanish Senate and Congress, Fundación Dialnet, Fundación Elcano and the ‘Instituto Universitario de Sistemas Inteligentes y Aplicaciones Numéricas en Ingeniería (SIANI)’ of the University of Las Palmas de Gran Canaria.
+
+ At the international level, we thank the Welsh government, DFKI, the Occiglot project (especially Malte Ostendorff) and The Common Crawl Foundation (especially Pedro Ortiz) for their collaboration.
+
+ Their valuable efforts have been instrumental in the development of this work.
+
+ ### Disclaimer
+ Be aware that the model may contain biases or other unintended distortions. When third parties deploy systems or provide services based on this model, or use the model themselves, they bear the responsibility for mitigating any associated risks and ensuring compliance with applicable regulations, including those governing the use of Artificial Intelligence.
+
+ The Barcelona Supercomputing Center, as the owner and creator of the model, shall not be held liable for any outcomes resulting from third-party use.
+
+ ### License
+ [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
config.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7e12262621db790c860f77ce1b49e10172ca40c5dd02f8c71d9b170feb907d3d
+ size 1344
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:984fb0d8ffcf6b33a6022612c094726215baefb4a87d2976a5f682f2faddf95a
+ size 601194264
special_tokens_map.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a807bb6e444852164fa9ba47fe55b4dfeaa9587112a7d719ed6ee6e3b2da1529
+ size 751
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:86c1dfbab59efebe083ddf7dfcec3c869f8315f3e6102c3bb7335f65fca7356f
+ size 6831096
tokenizer.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ed8dc3e139a6f2c6e1781996aabfef34c32241dcff263dbc66cf69b4760aeee9
+ size 1074422
tokenizer_config.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:097a71f527cde42de81af2131cf2beaa9d233283d4cbeccd03b55bc9e635aaf2
+ size 193668