dtamayo committed on
Commit dd1acc0 · 1 Parent(s): 0466bcf

Update Model
README.md CHANGED
@@ -13,7 +13,7 @@ library_name: transformers
 
 # MrBERT-science Model Card
 
-MrBERT-science is a new foundational bilingual language model for Spanish and English, adapted to the scientific domain and built on the [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base/tree/main) architecture. The model is obtained via domain adaptation from [MrBERT-es](https://huggingface.co/BSC-LT/MrBERT-es), initializing all weights from MrBERT-es and further training on a domain-specific scientific corpus comprising 3.5B tokens (39.7% Spanish, 60.3% English).
+MrBERT-science is a new foundational bilingual language model for Spanish and English, adapted to the scientific domain and built on the [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base/tree/main) architecture. The model is obtained via domain adaptation from [MrBERT](https://huggingface.co/BSC-LT/MrBERT), initializing all weights from [MrBERT-es](https://huggingface.co/BSC-LT/MrBERT-es) and further training on a domain-specific scientific corpus comprising 3.6B tokens (38.1% Spanish, 61.9% English).
 
 ## Technical Description
 
@@ -22,9 +22,9 @@ Technical details of the MrBERT-science model.
 
 | Description             | Value         |
 |-------------------------|:--------------|
-| Model Parameters        | 150M          |
+| Model Parameters        | 308M          |
 | Tokenizer Type          | SPM           |
-| Vocabulary size         | 51200         |
+| Vocabulary size         | 256000        |
 | Precision               | bfloat16      |
 | Context length          | 8192          |
 
@@ -34,13 +34,13 @@ Training Hyperparemeters
 | Hyperparameter            | Value          |
 |---------------------------|:---------------|
 | Pretraining Objective     | Masked Language Modeling |
-| Learning Rate             | 1.8E-03        |
+| Learning Rate             | 6E-04          |
 | Learning Rate Scheduler   | Cosine         |
-| Warmup                    | 800,000,000 tokens |
+| Warmup                    | 360,000,000 tokens |
 | Optimizer                 | decoupled_stableadamw |
 | Optimizer Hyperparameters | AdamW (β1=0.9, β2=0.98, ε=1e-06) |
 | Weight Decay              | 1E-05          |
-| Global Batch Size         | 480            |
+| Global Batch Size         | 512            |
 | Dropout                   | 1E-01          |
 | Activation Function       | GeLU           |
 
@@ -53,66 +53,57 @@ Training Hyperparemeters
 
 >>> unmasker = pipeline('fill-mask', model='BSC-LT/MrBERT-science')
 
->>> pprint(unmasker("El principio de<mask>establece que no se puede determinar simultáneamente la posición y el momento de una partícula.", top_k=3))
-[{'score': 0.9929344654083252,
-  'sequence': 'El principio de incertidumbre establece que no se puede '
-              'determinar simultáneamente la posición y el momento de una '
-              'partícula.',
-  'token': 19304,
-  'token_str': 'incertidumbre'},
- {'score': 0.0035859679337590933,
-  'sequence': 'El principio de continuidad establece que no se puede '
-              'determinar simultáneamente la posición y el momento de una '
-              'partícula.',
-  'token': 15944,
-  'token_str': 'continuidad'},
- {'score': 0.0010854268912225962,
-  'sequence': 'El principio de diferencia establece que no se puede determinar '
-              'simultáneamente la posición y el momento de una partícula.',
-  'token': 7036,
-  'token_str': 'diferencia'}]
->>> pprint(unmasker("The principle of<mask>states that it is impossible to simultaneously determine the position and momentum of a particle.", top_k=3))
-[{'score': 0.8963900208473206,
-  'sequence': 'The principle of uncertainty states that it is impossible to '
-              'simultaneously determine the position and momentum of a '
-              'particle.',
-  'token': 23700,
-  'token_str': 'uncertainty'},
- {'score': 0.061987195163965225,
-  'sequence': 'The principle of incertidumbre states that it is impossible to '
-              'simultaneously determine the position and momentum of a '
-              'particle.',
-  'token': 19304,
-  'token_str': 'incertidumbre'},
- {'score': 0.008512359112501144,
-  'sequence': 'The principle of motion states that it is impossible to '
-              'simultaneously determine the position and momentum of a '
-              'particle.',
-  'token': 9148,
-  'token_str': 'motion'}]
+>>> pprint(unmasker("Hubble's<mask>describes the expansion of the universe and the relationship between a galaxy's distance and its recessional velocity.", top_k=3))
+[{'score': 0.8203125,
+  'sequence': "Hubble's law describes the expansion of the universe and the "
+              "relationship between a galaxy's distance and its recessional "
+              'velocity.',
+  'token': 21673,
+  'token_str': 'law'},
+ {'score': 0.1259765625,
+  'sequence': "Hubble's Law describes the expansion of the universe and the "
+              "relationship between a galaxy's distance and its recessional "
+              'velocity.',
+  'token': 18573,
+  'token_str': 'Law'},
+ {'score': 0.0247802734375,
+  'sequence': "Hubble's equation describes the expansion of the universe and "
+              "the relationship between a galaxy's distance and its "
+              'recessional velocity.',
+  'token': 174396,
+  'token_str': 'equation'}]
 ```
 
 ### EVALUATION: Text Classification
 
-The following base foundational models have been considered for the comparison:
+In addition to the MrBERT family, the following base foundation models were considered:
 
 | Multilingual Foundational Model | Number of Parameters | Vocab Size | Description |
 |---------------------------------|----------------------|------------|-------------|
 | [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) | 279M | 250K | Foundational RoBERTa model pretrained with CommonCrawl data containing 100 languages. |
 | [mRoBERTa](https://huggingface.co/BSC-LT/mRoBERTa) | 283M | 256K | RoBERTa base model pretrained with 35 European languages and a larger vocabulary size. |
 | [mmBERT](https://huggingface.co/jhu-clsp/mmBERT-base) | 308M | 250K | Multilingual ModernBERT pre-trained with staged language learning. |
-| [MrBERT](https://huggingface.co/BSC-LT/MrBERT) | 308M | 250K | Multilingual ModernBERT pre-trained with 35 European languages. |
-| [MrBERT-es](https://huggingface.co/BSC-LT/MrBERT-es) | 308M | 250K | Vocabulary adaptation of MrBERT for Spanish and English. |
+| [INDUS](https://huggingface.co/nasa-impact/nasa-smd-ibm-v0.1) | 125M | 50K | RoBERTa-based, encoder-only transformer model, domain-adapted for NASA Science Mission Directorate (SMD) applications. |
+| [SciBERT](https://huggingface.co/allenai/scibert_scivocab_uncased) | 110M | 31K | BERT model trained on scientific text. |
 
 The benchmarks used for comparison are:
-
+- **MTEB:** Subset of scientific-related tasks.
 - **AbScientia:** A Spanish scientific abstracts dataset focused on STEM disciplines, organized into 24 unified categories to capture the specificities of Spanish scientific language. The reported metric is accuracy.
 - **Scientific-paragraphs-categorization ('en' split):** A large-scale topic-classification dataset built from open-access scientific publications, classified into 26 scientific topics to capture the characteristics of standard English scientific language. The reported metric is accuracy.
 
-| tasks                      | xlm-roberta-base (279M) | mRoBERTa (283M) | mmBERT (308M) | MrBERT (308M) | MrBERT-es (150M) | MrBERT-science (150M) |
-|:---------------------------|:------------------------|:----------------|:--------------|:--------------|:-----------------|:----------------------|
-| abscientia (es)            | 80.87                   | 81.45           | **82.18**     | 81.98         | 81.79            | <u>82.00</u>          |
-| scientific_paragraphs (en) | 58.18                   | 58.34           | **61.48**     | 61.02         | <u>61.10</u>     | 60.65                 |
+|                   | **mmBERT (308M)** | **MrBERT (308M)** | **MrBERT-es (150M)** | **MrBERT-science (308M)** | **INDUS (125M)** | **SciBERT (110M)** |
+|-------------------|------------------:|------------------:|---------------------:|--------------------------:|-----------------:|-------------------:|
+| chemhotpotqa      | 62.73             | 62.70             | 60.36                | 63.14                     | <u>63.79</u>     | **64.43**          |
+| chemnq            | 34.87             | <u>39.70</u>      | 29.01                | 39.24                     | **39.78**        | 39.09              |
+| climate_fever_v2  | <u>23.32</u>      | 23.21             | 22.20                | 23.01                     | **23.91**        | 23.06              |
+| litsearch         | 11.06             | 12.17             | 11.49                | 12.24                     | <u>12.24</u>     | **13.52**          |
+| **Average**       | 32.99             | 34.44             | 30.77                | 34.41                     | <u>34.93</u>     | **35.02**          |
+
+| tasks                      | xlm-roberta-base (279M) | mRoBERTa (283M) | mmBERT (308M) | MrBERT (308M) | MrBERT-es (150M) | MrBERT-science (308M) | INDUS (125M) | SciBERT (110M) |
+|:---------------------------|:------------------------|:----------------|:--------------|:--------------|:-----------------|:----------------------|:-------------|:---------------|
+| abscientia (es)            | 80.87                   | 81.45           | **82.66**     | 81.98         | <u>82.39</u>     | 82.34                 | 79.24        | 80.80          |
+| scientific_paragraphs (en) | 58.18                   | 58.34           | 60.22         | 61.02         | 61.80            | 62.20                 | **63.95**    | <u>62.77</u>   |
 
 ## Additional information
 
@@ -123,7 +114,7 @@ The Language Technologies Lab from Barcelona Supercomputing Center.
 For further information, please send an email to <langtech@bsc.es>.
 
 ### Copyright
-Copyright (c) 2025 by Language Technologies Lab, Barcelona Supercomputing Center.
+Copyright (c) 2026 by Language Technologies Lab, Barcelona Supercomputing Center.
 
 ### Funding
 
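The updated hyperparameter table specifies warmup in tokens (360,000,000) rather than optimizer steps. As a rough illustration of how such a token-based schedule behaves, here is a minimal sketch; `lr_at` is a hypothetical helper, and the linear warmup shape and zero decay floor are assumptions not stated in the card:

```python
import math

def lr_at(tokens_seen, total_tokens, peak_lr=6e-4, warmup_tokens=360_000_000):
    """Token-based linear warmup to peak_lr, then cosine decay to zero.

    Sketch only: the peak LR (6E-04), cosine scheduler, and 360M-token
    warmup come from the card; the warmup shape and decay floor are
    assumptions.
    """
    if tokens_seen < warmup_tokens:
        return peak_lr * tokens_seen / warmup_tokens
    progress = (tokens_seen - warmup_tokens) / (total_tokens - warmup_tokens)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# Over the ~3.6B-token adaptation corpus the LR peaks 10% of the way in:
total = 3_600_000_000
print(lr_at(180_000_000, total))  # halfway through warmup, ~3e-04
print(lr_at(360_000_000, total))  # end of warmup, peak 6e-04
```

With these numbers warmup covers 10% of the corpus, in the same ballpark as common MLM pretraining recipes.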
config.json CHANGED
@@ -1,16 +1,15 @@
 {
-  "_name_or_path": "/gpfs/scratch/bsc88/bsc088070/Encoder_test/conversion/EngSpa_tok_science/ModernBERT_EngSpaTok_science_top1_lr18",
   "architectures": [
     "ModernBertForMaskedLM"
   ],
   "attention_bias": false,
   "attention_dropout": 0.0,
-  "bos_token_id": 0,
+  "bos_token_id": 1,
   "classifier_activation": "silu",
   "classifier_bias": false,
   "classifier_dropout": 0.0,
   "classifier_pooling": "mean",
-  "cls_token_id": 0,
+  "cls_token_id": 1,
   "decoder_bias": true,
   "deterministic_flash_attn": false,
   "embedding_dropout": 0.0,
@@ -34,14 +33,11 @@
   "norm_eps": 1e-05,
   "num_attention_heads": 12,
   "num_hidden_layers": 22,
-  "pad_token_id": 1,
+  "pad_token_id": 3,
   "position_embedding_type": "absolute",
-  "reference_compile": null,
-  "repad_logits_with_grad": false,
   "sep_token_id": 2,
-  "sparse_pred_ignore_index": -100,
-  "sparse_prediction": false,
-  "torch_dtype": "float32",
-  "transformers_version": "4.48.1",
-  "vocab_size": 51200
-}
+  "tie_word_embeddings": true,
+  "torch_dtype": "bfloat16",
+  "transformers_version": "4.48.0",
+  "vocab_size": 256128
+}
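The config change that dominates the checkpoint growth is `vocab_size` jumping from 51200 to 256128, since the embedding matrix scales with it. A back-of-the-envelope check, assuming the ModernBERT-base backbone dimensions (hidden size 768, GeGLU intermediate size 1152, 22 layers); these dimensions are not in this diff and are assumptions, as is the simplified per-layer breakdown, which ignores norms and the decoder bias:

```python
# Rough parameter count for the updated config.
# Assumptions (not in this diff): hidden=768, GeGLU intermediate=1152,
# as in ModernBERT-base; norms and small biases are ignored.
vocab, hidden, inter, layers = 256128, 768, 1152, 22

embeddings = vocab * hidden                  # tied input/output embeddings
attn = 4 * hidden * hidden                   # Q, K, V, O projections
mlp = hidden * (2 * inter) + inter * hidden  # GeGLU up (gate + value) + down
total = embeddings + layers * (attn + mlp)

print(f"~{total / 1e6:.0f}M parameters")  # -> ~307M parameters
```

This lands next to the 308M stated in the card, and at 4 bytes per float32 weight it is roughly consistent with the new `model.safetensors` size of 1,231,552,912 bytes.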
 
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:ab554f688a4b924c558439cb94cef445838dd1a3a2ee6d2f72e2be52176fec9f
-size 601194264
+oid sha256:481c17a5a9437d6079367583b7ad88683b6d85156bdeb4d19593ae833b8f6751
+size 1231552912
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e78abd24b7bf22d79684c8c8e6f5517829d39ec2005cf9060eaa28bdc31cf9be
+size 1231581870
special_tokens_map.json CHANGED
@@ -1,7 +1,4 @@
 {
-  "additional_special_tokens": [
-    "<|translation|>"
-  ],
   "bos_token": {
     "content": "<s>",
     "lstrip": false,
@@ -37,4 +34,4 @@
     "rstrip": false,
     "single_word": false
   }
-}
+}
tokenizer.json CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:86c1dfbab59efebe083ddf7dfcec3c869f8315f3e6102c3bb7335f65fca7356f
-size 6831096
+oid sha256:031e825b2072f555233d0d01abc2cde072183173a53967e7e842813b9673748c
+size 19092952
tokenizer.model CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:ed8dc3e139a6f2c6e1781996aabfef34c32241dcff263dbc66cf69b4760aeee9
-size 1074422
+oid sha256:5072e3209a04aa01dbf4db72b8fec52cf8cd06a042c9ba819678e084f7b665d5
+size 4813283
tokenizer_config.json CHANGED
The diff for this file is too large to render.