dtamayo committed on
Commit dd1acc0 · 1 Parent(s): 0466bcf

Update Model
README.md CHANGED
@@ -13,7 +13,7 @@ library_name: transformers
 
 # MrBERT-science Model Card
 
-MrBERT-science is a new foundational bilingual language model for Spanish and English, adapted to the scientific domain and built on the [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base/tree/main) architecture. The model is obtained via domain adaptation from [MrBERT-es](https://huggingface.co/BSC-LT/MrBERT-es), initializing all weights from MrBERT-es and further training on a domain-specific scientific corpus comprising 3.5B tokens (39.7% Spanish, 60.3% English).
+MrBERT-science is a new foundational bilingual language model for Spanish and English, adapted to the scientific domain and built on the [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base/tree/main) architecture. The model is obtained via domain adaptation from [MrBERT](https://huggingface.co/BSC-LT/MrBERT), initializing all weights from [MrBERT-es](https://huggingface.co/BSC-LT/MrBERT-es) and further training on a domain-specific scientific corpus comprising 3.6B tokens (38.1% Spanish, 61.9% English).
 
 ## Technical Description
 
@@ -22,9 +22,9 @@ Technical details of the MrBERT-science model.
 
 | Description             | Value         |
 |-------------------------|:--------------|
-| Model Parameters        | 150M          |
+| Model Parameters        | 308M          |
 | Tokenizer Type          | SPM           |
-| Vocabulary size         | 51200         |
+| Vocabulary size         | 256000        |
 | Precision               | bfloat16      |
 | Context length          | 8192          |
 
@@ -34,13 +34,13 @@ Training Hyperparemeters
 | Hyperparameter            | Value          |
 |---------------------------|:---------------|
 | Pretraining Objective     | Masked Language Modeling |
-| Learning Rate             | 1.8E-03        |
+| Learning Rate             | 6E-04          |
 | Learning Rate Scheduler   | Cosine         |
-| Warmup                    | 800,000,000 tokens |
+| Warmup                    | 360,000,000 tokens |
 | Optimizer                 | decoupled_stableadamw |
 | Optimizer Hyperparameters | AdamW (β1=0.9, β2=0.98, ε=1e-06) |
 | Weight Decay              | 1E-05          |
-| Global Batch Size         | 480            |
+| Global Batch Size         | 512            |
 | Dropout                   | 1E-01          |
 | Activation Function       | GeLU           |
 
@@ -53,66 +53,57 @@ Training Hyperparemeters
 
 >>> unmasker = pipeline('fill-mask', model='BSC-LT/MrBERT-science')
 
->>> pprint(unmasker("El principio de<mask>establece que no se puede determinar simultáneamente la posición y el momento de una partícula.", top_k=3))
-[{'score': 0.9929344654083252,
-  'sequence': 'El principio de incertidumbre establece que no se puede '
-              'determinar simultáneamente la posición y el momento de una '
-              'partícula.',
-  'token': 19304,
-  'token_str': 'incertidumbre'},
- {'score': 0.0035859679337590933,
-  'sequence': 'El principio de continuidad establece que no se puede '
-              'determinar simultáneamente la posición y el momento de una '
-              'partícula.',
-  'token': 15944,
-  'token_str': 'continuidad'},
- {'score': 0.0010854268912225962,
-  'sequence': 'El principio de diferencia establece que no se puede determinar '
-              'simultáneamente la posición y el momento de una partícula.',
-  'token': 7036,
-  'token_str': 'diferencia'}]
->>> pprint(unmasker("The principle of<mask>states that it is impossible to simultaneously determine the position and momentum of a particle.", top_k=3))
-[{'score': 0.8963900208473206,
-  'sequence': 'The principle of uncertainty states that it is impossible to '
-              'simultaneously determine the position and momentum of a '
-              'particle.',
-  'token': 23700,
-  'token_str': 'uncertainty'},
- {'score': 0.061987195163965225,
-  'sequence': 'The principle of incertidumbre states that it is impossible to '
-              'simultaneously determine the position and momentum of a '
-              'particle.',
-  'token': 19304,
-  'token_str': 'incertidumbre'},
- {'score': 0.008512359112501144,
-  'sequence': 'The principle of motion states that it is impossible to '
-              'simultaneously determine the position and momentum of a '
-              'particle.',
-  'token': 9148,
-  'token_str': 'motion'}]
+>>> pprint(unmasker("Hubble's<mask>describes the expansion of the universe and the relationship between a galaxy's distance and its recessional velocity.", top_k=3))
+[{'score': 0.8203125,
+  'sequence': "Hubble's law describes the expansion of the universe and the "
+              "relationship between a galaxy's distance and its recessional "
+              'velocity.',
+  'token': 21673,
+  'token_str': 'law'},
+ {'score': 0.1259765625,
+  'sequence': "Hubble's Law describes the expansion of the universe and the "
+              "relationship between a galaxy's distance and its recessional "
+              'velocity.',
+  'token': 18573,
+  'token_str': 'Law'},
+ {'score': 0.0247802734375,
+  'sequence': "Hubble's equation describes the expansion of the universe and "
+              "the relationship between a galaxy's distance and its "
+              'recessional velocity.',
+  'token': 174396,
+  'token_str': 'equation'}]
 ```
 
 ### EVALUATION: Text Classification
 
-The following base foundational models have been considered for the comparison:
+In addition to the MrBERT family, the following base foundation models were considered:
 
 | Multilingual Foundational Model | Number of Parameters | Vocab Size | Description |
 |---------------------------------|----------------------|------------|-------------|
 | [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) | 279M | 250K | Foundational RoBERTa model pretrained with CommonCrawl data containing 100 languages. |
 | [mRoBERTa](https://huggingface.co/BSC-LT/mRoBERTa) | 283M | 256K | RoBERTa base model pretrained with 35 European languages and a larger vocabulary size. |
 | [mmBERT](https://huggingface.co/jhu-clsp/mmBERT-base) | 308M | 250K | Multilingual ModernBERT pre-trained with staged language learning. |
-| [MrBERT](https://huggingface.co/BSC-LT/MrBERT) | 308M | 250K | Multilingual ModernBERT pre-trained with 35 European languages. |
-| [MrBERT-es](https://huggingface.co/BSC-LT/MrBERT-es) | 308M | 250K | Vocabulary adaptation of MrBERT for Spanish and English. |
+| [INDUS](https://huggingface.co/nasa-impact/nasa-smd-ibm-v0.1) | 125M | 50K | RoBERTa-based, encoder-only transformer model, domain-adapted for NASA Science Mission Directorate (SMD) applications. |
+| [SciBERT](https://huggingface.co/allenai/scibert_scivocab_uncased) | 110M | 31K | BERT model trained on scientific text. |
 
 The benchmarks used for comparison are:
-
+- **MTEB:** Subset of scientific-related tasks.
 - **AbScientia:** A Spanish scientific abstracts dataset focused on STEM disciplines, organized into 24 unified categories to capture the specificities of Spanish scientific language. The reported metric is accuracy.
 - **Scientific-paragraphs-categorization ('en' split):** A large-scale topic-classification dataset built from open-access scientific publications, classified into 26 scientific topics to capture the characteristics of standard English scientific language. The reported metric is accuracy.
 
-| tasks                      | xlm-roberta-base (279M) | mRoBERTa (283M) | mmBERT (308M) | MrBERT (308M) | MrBERT-es (150M) | MrBERT-science (150M) |
-|:---------------------------|:------------------------|:----------------|:--------------|:--------------|:-----------------|:----------------------|
-| abscientia (es)            | 80.87                   | 81.45           | **82.18**     | 81.98         | 81.79            | <u>82.00</u>          |
-| scientific_paragraphs (en) | 58.18                   | 58.34           | **61.48**     | 61.02         | <u>61.10</u>     | 60.65                 |
+|                   | **mmBERT (308M)** | **MrBERT (308M)** | **MrBERT-es (150M)** | **MrBERT-science (308M)** | **INDUS (125M)** | **SciBERT (110M)** |
+|-------------------|------------------:|------------------:|---------------------:|--------------------------:|-----------------:|-------------------:|
+| chemhotpotqa      | 62.73             | 62.70             | 60.36                | 63.14                     | <u>63.79</u>     | **64.43**          |
+| chemnq            | 34.87             | <u>39.70</u>      | 29.01                | 39.24                     | **39.78**        | 39.09              |
+| climate_fever_v2  | <u>23.32</u>      | 23.21             | 22.20                | 23.01                     | **23.91**        | 23.06              |
+| litsearch         | 11.06             | 12.17             | 11.49                | 12.24                     | <u>12.24</u>     | **13.52**          |
+| **Average**       | 32.99             | 34.44             | 30.77                | 34.41                     | <u>34.93</u>     | **35.02**          |
+
+| tasks                      | xlm-roberta-base (279M) | mRoBERTa (283M) | mmBERT (308M) | MrBERT (308M) | MrBERT-es (150M) | MrBERT-science (308M) | INDUS (125M) | SciBERT (110M) |
+|:---------------------------|:------------------------|:----------------|:--------------|:--------------|:-----------------|:----------------------|:-------------|:---------------|
+| abscientia (es)            | 80.87                   | 81.45           | **82.66**     | 81.98         | <u>82.39</u>     | 82.34                 | 79.24        | 80.80          |
+| scientific_paragraphs (en) | 58.18                   | 58.34           | 60.22         | 61.02         | 61.80            | 62.20                 | **63.95**    | <u>62.77</u>   |
 
 ## Additional information
 
@@ -123,7 +114,7 @@ The Language Technologies Lab from Barcelona Supercomputing Center.
 For further information, please send an email to <langtech@bsc.es>.
 
 ### Copyright
-Copyright (c) 2025 by Language Technologies Lab, Barcelona Supercomputing Center.
+Copyright (c) 2026 by Language Technologies Lab, Barcelona Supercomputing Center.
 
 ### Funding
 
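The updated hyperparameter table specifies warmup in tokens (360,000,000) rather than optimizer steps. As a rough illustration of how such a token-based schedule behaves, here is a minimal sketch; `lr_at` is a hypothetical helper, and the linear warmup shape and zero decay floor are assumptions not stated in the card:

```python
import math

def lr_at(tokens_seen, total_tokens, peak_lr=6e-4, warmup_tokens=360_000_000):
    """Token-based linear warmup to peak_lr, then cosine decay to zero.

    Sketch only: the peak LR (6E-04), cosine scheduler, and 360M-token
    warmup come from the card; the warmup shape and decay floor are
    assumptions.
    """
    if tokens_seen < warmup_tokens:
        return peak_lr * tokens_seen / warmup_tokens
    progress = (tokens_seen - warmup_tokens) / (total_tokens - warmup_tokens)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# Over the ~3.6B-token adaptation corpus the LR peaks 10% of the way in:
total = 3_600_000_000
print(lr_at(180_000_000, total))  # halfway through warmup, ~3e-04
print(lr_at(360_000_000, total))  # end of warmup, peak 6e-04
```

With these numbers warmup covers 10% of the corpus, in the same ballpark as common MLM pretraining recipes.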
config.json CHANGED
@@ -1,16 +1,15 @@
 {
-  "_name_or_path": "/gpfs/scratch/bsc88/bsc088070/Encoder_test/conversion/EngSpa_tok_science/ModernBERT_EngSpaTok_science_top1_lr18",
   "architectures": [
     "ModernBertForMaskedLM"
   ],
   "attention_bias": false,
   "attention_dropout": 0.0,
-  "bos_token_id": 0,
+  "bos_token_id": 1,
   "classifier_activation": "silu",
   "classifier_bias": false,
   "classifier_dropout": 0.0,
   "classifier_pooling": "mean",
-  "cls_token_id": 0,
+  "cls_token_id": 1,
   "decoder_bias": true,
   "deterministic_flash_attn": false,
   "embedding_dropout": 0.0,
@@ -34,14 +33,11 @@
   "norm_eps": 1e-05,
   "num_attention_heads": 12,
   "num_hidden_layers": 22,
-  "pad_token_id": 1,
+  "pad_token_id": 3,
   "position_embedding_type": "absolute",
-  "reference_compile": null,
-  "repad_logits_with_grad": false,
   "sep_token_id": 2,
-  "sparse_pred_ignore_index": -100,
-  "sparse_prediction": false,
-  "torch_dtype": "float32",
-  "transformers_version": "4.48.1",
-  "vocab_size": 51200
-}
+  "tie_word_embeddings": true,
+  "torch_dtype": "bfloat16",
+  "transformers_version": "4.48.0",
+  "vocab_size": 256128
+}
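The config change that dominates the checkpoint growth is `vocab_size` jumping from 51200 to 256128, since the embedding matrix scales with it. A back-of-the-envelope check, assuming the ModernBERT-base backbone dimensions (hidden size 768, GeGLU intermediate size 1152, 22 layers); these dimensions are not in this diff and are assumptions, as is the simplified per-layer breakdown, which ignores norms and the decoder bias:

```python
# Rough parameter count for the updated config.
# Assumptions (not in this diff): hidden=768, GeGLU intermediate=1152,
# as in ModernBERT-base; norms and small biases are ignored.
vocab, hidden, inter, layers = 256128, 768, 1152, 22

embeddings = vocab * hidden                  # tied input/output embeddings
attn = 4 * hidden * hidden                   # Q, K, V, O projections
mlp = hidden * (2 * inter) + inter * hidden  # GeGLU up (gate + value) + down
total = embeddings + layers * (attn + mlp)

print(f"~{total / 1e6:.0f}M parameters")  # -> ~307M parameters
```

This lands next to the 308M stated in the card, and at 4 bytes per float32 weight it is roughly consistent with the new `model.safetensors` size of 1,231,552,912 bytes.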
 
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:ab554f688a4b924c558439cb94cef445838dd1a3a2ee6d2f72e2be52176fec9f
-size 601194264
+oid sha256:481c17a5a9437d6079367583b7ad88683b6d85156bdeb4d19593ae833b8f6751
+size 1231552912
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e78abd24b7bf22d79684c8c8e6f5517829d39ec2005cf9060eaa28bdc31cf9be
+size 1231581870
special_tokens_map.json CHANGED
@@ -1,7 +1,4 @@
 {
-  "additional_special_tokens": [
-    "<|translation|>"
-  ],
   "bos_token": {
     "content": "<s>",
     "lstrip": false,
@@ -37,4 +34,4 @@
     "rstrip": false,
     "single_word": false
   }
-}
+}
tokenizer.json CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:86c1dfbab59efebe083ddf7dfcec3c869f8315f3e6102c3bb7335f65fca7356f
-size 6831096
+oid sha256:031e825b2072f555233d0d01abc2cde072183173a53967e7e842813b9673748c
+size 19092952
tokenizer.model CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:ed8dc3e139a6f2c6e1781996aabfef34c32241dcff263dbc66cf69b4760aeee9
-size 1074422
+oid sha256:5072e3209a04aa01dbf4db72b8fec52cf8cd06a042c9ba819678e084f7b665d5
+size 4813283
tokenizer_config.json CHANGED
The diff for this file is too large to render.