dtamayo committed
Commit 3e30cc4
0 Parent(s):

Add MrBERT-es

.gitattributes ADDED
@@ -0,0 +1,5 @@
+ *.json filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.png filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,181 @@
+ ---
+ language:
+ - es
+ - en
+ tags:
+ - fill-mask
+ - masked-lm
+ - long-context
+ - modernbert
+ license: apache-2.0
+ library_name: transformers
+ ---
+ # MrBERT-es Model Card
+
+ MrBERT-es is a new foundational Spanish language model built on the [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base/tree/main) architecture. It is initialized through vocabulary adaptation from [MrBERT](https://huggingface.co/BSC-LT/MrBERT): all weights are copied from MrBERT, while the embedding matrix receives a specialized treatment that carefully handles the differences between the two tokenizers.
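+
+ As a rough illustration of this style of vocabulary adaptation (a minimal sketch under our own assumptions, not the exact recipe used to build MrBERT-es), shared tokens can inherit their source embeddings directly, while tokens new to the target vocabulary can be initialized from the mean of their source-tokenizer subtoken embeddings:
+
+ ```python
+ import torch
+ from transformers import AutoModelForMaskedLM, AutoTokenizer
+
+ # Hypothetical sketch: build a target embedding matrix from a source model.
+ src_model = AutoModelForMaskedLM.from_pretrained("BSC-LT/MrBERT")
+ src_tok = AutoTokenizer.from_pretrained("BSC-LT/MrBERT")
+ tgt_tok = AutoTokenizer.from_pretrained("BSC-LT/MrBERT-es")
+
+ src_emb = src_model.get_input_embeddings().weight.detach()
+ new_emb = torch.empty(len(tgt_tok), src_emb.size(1))
+
+ src_vocab = src_tok.get_vocab()
+ for token, idx in tgt_tok.get_vocab().items():
+     if token in src_vocab:  # shared token: copy the source embedding
+         new_emb[idx] = src_emb[src_vocab[token]]
+     else:  # new token: average the embeddings of its source subtokens
+         pieces = src_tok(token, add_special_tokens=False)["input_ids"]
+         new_emb[idx] = src_emb[pieces].mean(dim=0) if pieces else src_emb.mean(dim=0)
+ ```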
+
+ Following initialization, the model is continually pretrained on a bilingual corpus of 615 billion tokens, evenly balanced between English and Spanish.
+
+ ## Technical Description
+
+ Technical details of the MrBERT-es model.
+
+ | Description | Value |
+ |-------------------------|:--------------|
+ | Model Parameters | 150M |
+ | Tokenizer Type | SPM |
+ | Vocabulary size | 51200 |
+ | Precision | bfloat16 |
+ | Context length | 8192 |
+
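+ A quick way to confirm the vocabulary size and context length from the released checkpoint (a hedged sketch; the attribute names follow the standard ModernBERT configuration in `transformers`):
+
+ ```python
+ from transformers import AutoConfig
+
+ cfg = AutoConfig.from_pretrained("BSC-LT/MrBERT-es")
+ print(cfg.vocab_size)               # expected: 51200
+ print(cfg.max_position_embeddings)  # expected: 8192
+ ```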
+
+ Training Hyperparameters
+
+ | Hyperparameter | Value |
+ |------------------------- |:-------------- |
+ | Pretraining Objective | Masked Language Modeling |
+ | Learning Rate | 4E-04 |
+ | Learning Rate Scheduler | WSD |
+ | Warmup | 3,000,000,000 |
+ | Optimizer | decoupled_stableadamw |
+ | Optimizer Hyperparameters | AdamW (β1=0.9, β2=0.98, ε=1e-06) |
+ | Weight Decay | 1E-05 |
+ | Global Batch Size | 4096 |
+ | Dropout | 1E-01 |
+ | Activation Function | GeLU |
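+
+ For reference, a warmup-stable-decay (WSD) schedule ramps linearly up to the peak learning rate, holds it constant for most of training, and then decays. The sketch below is purely illustrative: only the 4e-4 peak comes from the table above, and the phase lengths are made-up placeholders.
+
+ ```python
+ # Illustrative WSD (warmup-stable-decay) learning-rate schedule.
+ def wsd_lr(step, peak_lr=4e-4, warmup=1_000, stable=8_000, decay=1_000):
+     if step < warmup:                # linear warmup to the peak
+         return peak_lr * step / warmup
+     if step < warmup + stable:       # long constant (stable) phase
+         return peak_lr
+     done = step - warmup - stable    # final linear decay to zero
+     return peak_lr * max(0.0, 1.0 - done / decay)
+ ```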
+
+ ## How to use
+
+ ```python
+ >>> from transformers import pipeline
+ >>> from pprint import pprint
+
+ >>> unmasker = pipeline('fill-mask', model='BSC-LT/MrBERT-es')
+
+ >>> pprint(unmasker("Me encanta la<mask>de Barcelona.",top_k=3))
+ [{'score': 0.24022650718688965,
+ 'sequence': 'Me encanta la ciudad de Barcelona.',
+ 'token': 2634,
+ 'token_str': 'ciudad'},
+ {'score': 0.08937042951583862,
+ 'sequence': 'Me encanta la gastronomía de Barcelona.',
+ 'token': 18096,
+ 'token_str': 'gastronomía'},
+ {'score': 0.08782190084457397,
+ 'sequence': 'Me encanta la gente de Barcelona.',
+ 'token': 4475,
+ 'token_str': 'gente'}]
+ >>> pprint(unmasker("La ciencia engloba disciplinas como la<mask>y las matemáticas.",top_k=3))
+ [{'score': 0.8550629019737244,
+ 'sequence': 'La ciencia engloba disciplinas como la física y las '
+ 'matemáticas.',
+ 'token': 9204,
+ 'token_str': 'física'},
+ {'score': 0.06438734382390976,
+ 'sequence': 'La ciencia engloba disciplinas como la biología y las '
+ 'matemáticas.',
+ 'token': 40678,
+ 'token_str': 'biología'},
+ {'score': 0.044761642813682556,
+ 'sequence': 'La ciencia engloba disciplinas como la química y las '
+ 'matemáticas.',
+ 'token': 25047,
+ 'token_str': 'química'}]
+ >>> pprint(unmasker("The favourite food for Spaniards is<mask>.",top_k=3))
+ [{'score': 0.11592480540275574,
+ 'sequence': 'The favourite food for Spaniards is pizza .',
+ 'token': 22646,
+ 'token_str': 'pizza'},
+ {'score': 0.07638967037200928,
+ 'sequence': 'The favourite food for Spaniards is pasta .',
+ 'token': 20822,
+ 'token_str': 'pasta'},
+ {'score': 0.07300166040658951,
+ 'sequence': 'The favourite food for Spaniards is chicken .',
+ 'token': 16966,
+ 'token_str': 'chicken'}]
+ ```
+
+ Which is equivalent to the following PyTorch snippet:
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForMaskedLM
+ import torch
+
+ model = AutoModelForMaskedLM.from_pretrained("BSC-LT/MrBERT-es")
+ tokenizer = AutoTokenizer.from_pretrained("BSC-LT/MrBERT-es")
+
+ # The "<mask>" token sits at index -3: position -1 is the EOS token "</s>" and position -2 is the "." token.
+ outputs = model(**tokenizer("La capital de España es<mask>.", return_tensors="pt")).logits
+ predicted_token = tokenizer.decode(torch.argmax(outputs[0, -3, :]))
+
+ print(f"The prediction is \"{predicted_token}\".")  # The prediction is "Madrid"
+ ```
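+
+ Rather than hard-coding index -3, the mask position can also be located through the tokenizer's mask token id, and a softmax at that position recovers the pipeline-style scores. This is a small extension of the snippet above, under the same assumptions:
+
+ ```python
+ # Locate the mask by its token id instead of a fixed offset, then take top-3.
+ inputs = tokenizer("La capital de España es<mask>.", return_tensors="pt")
+ mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
+ probs = torch.softmax(model(**inputs).logits[0, mask_pos], dim=-1)
+ top = torch.topk(probs, k=3)
+ for p, tok_id in zip(top.values, top.indices):
+     print(f"{tokenizer.decode(tok_id)}: {p.item():.4f}")
+ ```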
+
+ In most of the evaluations presented below, the model is adapted to each use case by fine-tuning it with a task-specific head whose logits are used to score the text.
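+
+ For classification-style tasks this typically means loading the checkpoint with a freshly initialized classification head before fine-tuning (a hedged sketch; the exact evaluation setup may differ):
+
+ ```python
+ from transformers import AutoModelForSequenceClassification
+
+ # e.g. XNLI uses 3 labels (entailment / neutral / contradiction)
+ clf = AutoModelForSequenceClassification.from_pretrained(
+     "BSC-LT/MrBERT-es", num_labels=3
+ )
+ ```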
+
+ ## Evaluation: EvalES Benchmark
+
+ Model performance in Spanish is assessed with the [EvalES benchmark](https://benchmark.plantl.bsc.es/datasets.html), which consists of 7 tasks: Named Entity Recognition and Classification (CoNLL-NERC), Part-of-Speech Tagging (UD-POS), Text Classification (MLDoc), Paraphrase Identification (PAWS-X), Semantic Textual Similarity (STS), Question Answering (SQAC), and Textual Entailment (XNLI).
+
+ The following base foundational models were considered for the comparison:
+
+ | Multilingual Foundational Model | Number of Parameters | Vocab Size | Description |
+ |---------------------------------|----------------------|------------|-------------|
+ | [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) | 279M | 250K | Foundational RoBERTa model pretrained with CommonCrawl data containing 100 languages. |
+ | [mRoBERTa](https://huggingface.co/BSC-LT/mRoBERTa) | 283M | 256K | RoBERTa base model pretrained with 35 European languages and a larger vocabulary size. |
+ | [mmBERT](https://huggingface.co/jhu-clsp/mmBERT-base) | 308M | 250K | Multilingual ModernBERT pre-trained with staged language learning. |
+ | [MrBERT](https://huggingface.co/BSC-LT/MrBERT) | 308M | 250K | Multilingual ModernBERT pre-trained with 35 European languages. |
+
+ | tasks | xlm-roberta-base (279M) | mRoBERTa (283M) | mmBERT (308M) | MrBERT (308M) | MrBERT-es (150M) |
+ |--------------|---------------------------|-------------------|-----------------|-----------------|--------------------|
+ | pos (f1) | 99.01 | 99.03 | **99.09** | <u>99.06</u> | 99.04 |
+ | ner (f1) | 86.91 | **87.77** | 87.01 | <u>87.42</u> | 87.36 |
+ | sts (pearson) | 80.88 | 79.69 | 82.88 | <u>84.18</u> | **85.18** |
+ | tc - paws-x (acc) | 90.35 | 91.30 | <u>91.35</u> | 91.25 | **91.60** |
+ | tc - mldoc (acc) | 47.67 | 91.28 | 95.10 | <u>95.28</u> | **95.35** |
+ | tc - massivenew (acc) | 21.89 | 86.45 | 86.79 | **87.46** | <u>87.19</u> |
+ | qa (f1) | 74.48 | 77.03 | 79.79 | **81.96** | <u>80.33</u> |
+ | te (acc) | 33.33** | 33.33** | 79.98 | **84.69** | <u>82.14</u> |
+
+ ** The textual entailment task currently exhibits some degenerate evaluations; we are working on improving the framework to address this issue.
+
+ ## Additional information
+
+ ### Author
+ The Language Technologies Lab from Barcelona Supercomputing Center.
+
+ ### Contact
+ For further information, please send an email to <langtech@bsc.es>.
+
+ ### Copyright
+ Copyright (c) 2025 by Language Technologies Lab, Barcelona Supercomputing Center.
+
+ ### Funding
+
+ This work has been promoted and financed by the Ministerio para la Transformación Digital y de la Función Pública and funded by the EU through NextGenerationEU, within the framework of the project [ILENIA](https://proyectoilenia.es/) with reference 2022/TL22/00215337.
+
+ ### Acknowledgements
+
+ This project has benefited from data contributions by numerous teams and institutions.
+
+ In Catalonia, many institutions have been involved in the project. Our thanks to Òmnium Cultural, Parlament de Catalunya, Institut d'Estudis Aranesos, Racó Català, Vilaweb, ACN, Nació Digital, El món and Aquí Berguedà.
+
+ At the national level, we are especially grateful to our ILENIA project partners CENID, HiTZ and CiTIUS for their participation. We also extend our genuine gratitude to the Spanish Senate and Congress, Fundación Dialnet, Fundación Elcano and the ‘Instituto Universitario de Sistemas Inteligentes y Aplicaciones Numéricas en Ingeniería (SIANI)’ of the University of Las Palmas de Gran Canaria.
+
+ At the international level, we thank the Welsh government, DFKI, the Occiglot project (especially Malte Ostendorff) and The Common Crawl Foundation (especially Pedro Ortiz) for their collaboration.
+
+ Their valuable efforts have been instrumental in the development of this work.
+
+ ### Disclaimer
+ Be aware that the model may contain biases or other unintended distortions. When third parties deploy systems or provide services based on this model, or use the model themselves, they bear the responsibility for mitigating any associated risks and ensuring compliance with applicable regulations, including those governing the use of Artificial Intelligence.
+
+ The Barcelona Supercomputing Center, as the owner and creator of the model, shall not be held liable for any outcomes resulting from third-party use.
+
+ ### License
+ [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
config.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7e12262621db790c860f77ce1b49e10172ca40c5dd02f8c71d9b170feb907d3d
+ size 1344
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:984fb0d8ffcf6b33a6022612c094726215baefb4a87d2976a5f682f2faddf95a
+ size 601194264
special_tokens_map.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a807bb6e444852164fa9ba47fe55b4dfeaa9587112a7d719ed6ee6e3b2da1529
+ size 751
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:86c1dfbab59efebe083ddf7dfcec3c869f8315f3e6102c3bb7335f65fca7356f
+ size 6831096
tokenizer.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ed8dc3e139a6f2c6e1781996aabfef34c32241dcff263dbc66cf69b4760aeee9
+ size 1074422
tokenizer_config.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:097a71f527cde42de81af2131cf2beaa9d233283d4cbeccd03b55bc9e635aaf2
+ size 193668