Update Model

- README.md +43 -52
- config.json +8 -12
- model.safetensors +2 -2
- pytorch_model.bin +3 -0
- special_tokens_map.json +1 -4
- tokenizer.json +2 -2
- tokenizer.model +2 -2
- tokenizer_config.json +0 -0
README.md
CHANGED

@@ -13,7 +13,7 @@ library_name: transformers
 
 # MrBERT-science Model Card
 
-MrBERT-science is a new foundational bilingual language model for Spanish and English, adapted to the scientific domain and built on the [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base/tree/main) architecture. The model is obtained via domain adaptation from [MrBERT
+MrBERT-science is a new foundational bilingual language model for Spanish and English, adapted to the scientific domain and built on the [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base/tree/main) architecture. The model is obtained via domain adaptation from [MrBERT](https://huggingface.co/BSC-LT/MrBERT), initializing all weights from MrBERT-es and further training on a domain-specific scientific corpus comprising 3.6B tokens (38.1% Spanish, 61.9% English).
 
 ## Technical Description

@@ -22,9 +22,9 @@ Technical details of the MrBERT-science model.
 
 | Description | Value |
 |-------------------------|:--------------|
-| Model Parameters |
+| Model Parameters | 308M |
 | Tokenizer Type | SPM |
-| Vocabulary size |
+| Vocabulary size | 256000 |
 | Precision | bfloat16 |
 | Context length | 8192 |

@@ -34,13 +34,13 @@ Training Hyperparemeters
 
 | Hyperparameter | Value |
 |------------------------- |:-------------- |
 | Pretraining Objective | Masked Language Modeling |
-| Learning Rate |
+| Learning Rate | 6E-04 |
 | Learning Rate Scheduler | Cosine |
-| Warmup |
+| Warmup | 360,000,000 tokens |
 | Optimizer | decoupled_stableadamw |
 | Optimizer Hyperparameters | AdamW (β1=0.9, β2=0.98, ε=1e-06) |
 | Weight Decay | 1E-05 |
-| Global Batch Size |
+| Global Batch Size | 512 |
 | Dropout | 1E-01 |
 | Activation Function | GeLU |

@@ -53,66 +53,57 @@ Training Hyperparemeters
 
 >>> unmasker = pipeline('fill-mask', model='BSC-LT/MrBERT-science')
 
->>> pprint(unmasker("
-[{'score': 0.
-  'sequence': '
-              '
-  'token':
-  'token_str': '
- {'score': 0.
-  'sequence': '
-              '
-  'token':
-  'token_str': '
- {'score': 0.
-  'sequence': '
-              '
-
-[{'score': 0.8963900208473206,
-  'sequence': 'The principle of uncertainty states that it is impossible to '
-              'simultaneously determine the position and momentum of a '
-              'particle.',
-  'token': 23700,
-  'token_str': 'uncertainty'},
- {'score': 0.061987195163965225,
-  'sequence': 'The principle of incertidumbre states that it is impossible to '
-              'simultaneously determine the position and momentum of a '
-              'particle.',
-  'token': 19304,
-  'token_str': 'incertidumbre'},
- {'score': 0.008512359112501144,
-  'sequence': 'The principle of motion states that it is impossible to '
-              'simultaneously determine the position and momentum of a '
-              'particle.',
-  'token': 9148,
-  'token_str': 'motion'}]
+>>> pprint(unmasker("Hubble's<mask> describes the expansion of the universe and the relationship between a galaxy's distance and its recessional velocity.", top_k=3))
+[{'score': 0.8203125,
+  'sequence': "Hubble's law describes the expansion of the universe and the "
+              "relationship between a galaxy's distance and its recessional "
+              'velocity.',
+  'token': 21673,
+  'token_str': 'law'},
+ {'score': 0.1259765625,
+  'sequence': "Hubble's Law describes the expansion of the universe and the "
+              "relationship between a galaxy's distance and its recessional "
+              'velocity.',
+  'token': 18573,
+  'token_str': 'Law'},
+ {'score': 0.0247802734375,
+  'sequence': "Hubble's equation describes the expansion of the universe and "
+              "the relationship between a galaxy's distance and its "
+              'recessional velocity.',
+  'token': 174396,
+  'token_str': 'equation'}]
 ```
 
 ### EVALUATION: Text Classification
 
-
+In addition to the MrBERT family, the following base foundation models were considered:
 
 | Multilingual Foundational Model | Number of Parameters | Vocab Size | Description |
 |---------------------------------|----------------------|------------|-------------|
 | [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) | 279M | 250K | Foundational RoBERTa model pretrained with CommonCrawl data containing 100 languages. |
 | [mRoBERTa](https://huggingface.co/BSC-LT/mRoBERTa) | 283M | 256K | RoBERTa base model pretrained with 35 European languages and a larger vocabulary size. |
 | [mmBERT](https://huggingface.co/jhu-clsp/mmBERT-base) | 308M | 250K | Multilingual ModernBERT pre-trained with staged language learning. |
-| [
-| [
+| [INDUS](https://huggingface.co/nasa-impact/nasa-smd-ibm-v0.1) | 125M | 50k | RoBERTa-based, encoder-only transformer model, domain-adapted for NASA Science Mission Directorate (SMD) applications. |
+| [SciBERT](https://huggingface.co/allenai/scibert_scivocab_uncased) | 110M | 31k | BERT model trained on scientific text. |
 
 The benchmarks used for comparison are:
-
+- **MTEB:** Subset of scientific-related tasks.
 - **AbScientia:** A Spanish scientific abstracts dataset focused on STEM disciplines, organized into 24 unified categories to capture the specifications of Spanish scientific language. The reported metric is accuracy.
 - **Scientific-paragraphs-categorization ('en' split):** A large-scale topic-classification dataset built from open-access scientific publications, classified into 26 scientific topics to capture the characteristics of standard English scientific language. The reported metric is accuracy.
 
-
-
-
-
+| | **mmBERT (308M)** | **MrBERT (308M)** | **MrBERT-es (150M)** | **MrBERT-science (308M)** | **INDUS (125M)** | **SciBERT (110M)** |
+|-------------------|------------------:|------------------:|---------------------:|--------------------------:|-----------------:|-------------------:|
+| chemhotpotqa | 62.73 | 62.70 | 60.36 | 63.14 | <u>63.79</u> | **64.43** |
+| chemnq | 34.87 | <u>39.70</u> | 29.01 | 39.24 | **39.78** | 39.09 |
+| climate_fever_v2 | <u>23.32</u> | 23.21 | 22.20 | 23.01 | **23.91** | 23.06 |
+| litsearch | 11.06 | 12.17 | 11.49 | 12.24 | <u>12.24</u> | **13.52** |
+| **Average** | 32.99 | 34.44 | 30.77 | 34.41 | <u>34.93</u> | **35.02** |
+
+| tasks | xlm-roberta-base (279M) | mRoBERTa (283M) | mmBERT (308M) | MrBERT (308M) | MrBERT-es (150M) | MrBERT-science (150M) | INDUS (125M) | SciBERT (110M) |
+|:---------------------------|:--------------------------|:------------------|:----------------|:----------------|:-------------------|:------------------------|:-------------|:----------------|
+| abscientia (es) | 80.87 | 81.45 | **82.66** | 81.98 | <u>82.39</u> | 82.34 | 79.24 | 80.80 |
+| scientific_paragraphs (en) | 58.18 | 58.34 | 60.22 | 61.02 | 61.80 | 62.20 | **63.95** | <u>62.77</u> |
 
 ## Additional information

@@ -123,7 +114,7 @@ The Language Technologies Lab from Barcelona Supercomputing Center.
 
 For further information, please send an email to <langtech@bsc.es>.
 
 ### Copyright
-Copyright(c)
+Copyright(c) 2026 by Language Technologies Lab, Barcelona Supercomputing Center.
 
 ### Funding
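The `score` values in the new fill-mask example are softmax probabilities computed over the model's logits at the masked position. A minimal pure-Python sketch of that scoring step, using a toy vocabulary and made-up logits (not the model's real values):

```python
import math

def fill_mask_scores(logits, vocab, top_k=3):
    """Softmax over the logits at the masked position, then keep the top-k
    tokens -- mirrors what a fill-mask pipeline reports as 'score'."""
    m = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(vocab, probs), key=lambda p: p[1], reverse=True)
    return [{"token_str": t, "score": round(s, 4)} for t, s in ranked[:top_k]]

# Toy candidates for the masked slot in "Hubble's <mask> describes ..." (illustrative only)
vocab = ["law", "Law", "equation", "constant"]
logits = [6.0, 4.1, 2.5, 1.0]
top = fill_mask_scores(logits, vocab)
```

The real pipeline does the same normalization over the full 256K-token vocabulary, which is why the reported scores for the top candidates sum to slightly less than 1.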
config.json
CHANGED

@@ -1,16 +1,15 @@
 {
-  "_name_or_path": "/gpfs/scratch/bsc88/bsc088070/Encoder_test/conversion/EngSpa_tok_science/ModernBERT_EngSpaTok_science_top1_lr18",
   "architectures": [
     "ModernBertForMaskedLM"
   ],
   "attention_bias": false,
   "attention_dropout": 0.0,
-  "bos_token_id":
+  "bos_token_id": 1,
   "classifier_activation": "silu",
   "classifier_bias": false,
   "classifier_dropout": 0.0,
   "classifier_pooling": "mean",
-  "cls_token_id":
+  "cls_token_id": 1,
   "decoder_bias": true,
   "deterministic_flash_attn": false,
   "embedding_dropout": 0.0,

@@ -34,14 +33,11 @@
   "norm_eps": 1e-05,
   "num_attention_heads": 12,
   "num_hidden_layers": 22,
-  "pad_token_id":
+  "pad_token_id": 3,
   "position_embedding_type": "absolute",
-  "reference_compile": null,
-  "repad_logits_with_grad": false,
   "sep_token_id": 2,
-  "
-  "
-  "
-  "
-
-}
+  "tie_word_embeddings": true,
+  "torch_dtype": "bfloat16",
+  "transformers_version": "4.48.0",
+  "vocab_size": 256128
+}
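The token-id fields in the updated config have to be mutually consistent. A quick sketch checking that, against a fragment of the config hand-copied from the diff above (only the fields shown there):

```python
import json

# Hand-copied fragment of the updated config.json from the diff above.
config = json.loads("""
{
  "bos_token_id": 1,
  "cls_token_id": 1,
  "sep_token_id": 2,
  "pad_token_id": 3,
  "vocab_size": 256128,
  "torch_dtype": "bfloat16"
}
""")

# In this checkpoint BOS and CLS share id 1, and every special-token id
# must fall inside the vocabulary range.
assert config["bos_token_id"] == config["cls_token_id"]
assert all(0 <= config[k] < config["vocab_size"]
           for k in ("bos_token_id", "cls_token_id", "sep_token_id", "pad_token_id"))
```

Note that `vocab_size` (256128) is slightly larger than the tokenizer's stated 256000 entries; padding the embedding matrix past the raw vocabulary is a common choice, though the exact reason here is not stated in the card.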
model.safetensors
CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:481c17a5a9437d6079367583b7ad88683b6d85156bdeb4d19593ae833b8f6751
+size 1231552912
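The weight and tokenizer files in this commit are Git LFS pointer files: three `key value` lines (`version`, `oid`, `size`) standing in for the real blob. A small sketch parsing that format, using the pointer contents shown above:

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into a dict of its key/value lines."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")   # split on the first space only
        fields[key] = value
    return fields

# Pointer contents of model.safetensors after this commit.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:481c17a5a9437d6079367583b7ad88683b6d85156bdeb4d19593ae833b8f6751
size 1231552912
"""

info = parse_lfs_pointer(pointer)
algo, digest = info["oid"].split(":")   # hash algorithm and hex digest
size_bytes = int(info["size"])          # ~1.23 GB checkpoint
```

The `size` line is the byte count of the actual object, which is how the hub can show file sizes without downloading the blob.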
pytorch_model.bin
ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e78abd24b7bf22d79684c8c8e6f5517829d39ec2005cf9060eaa28bdc31cf9be
+size 1231581870
special_tokens_map.json
CHANGED

@@ -1,7 +1,4 @@
 {
-  "additional_special_tokens": [
-    "<|translation|>"
-  ],
   "bos_token": {
     "content": "<s>",
     "lstrip": false,

@@ -37,4 +34,4 @@
   "rstrip": false,
   "single_word": false
   }
-}
+}
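Each entry in `special_tokens_map.json` is an AddedToken-style object. A sketch validating the shape after this commit; the `bos_token` fragment is copied from the diff, and pairing the trailing `rstrip`/`single_word` flags with it is an assumption:

```python
import json

# "bos_token" entry as shown in the diff; the exact placement of the
# trailing rstrip/single_word flags is assumed for illustration.
special_tokens_map = json.loads("""
{
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "rstrip": false,
    "single_word": false
  }
}
""")

bos = special_tokens_map["bos_token"]
# After this commit the "additional_special_tokens" list
# (the "<|translation|>" tag) is gone:
assert "additional_special_tokens" not in special_tokens_map
```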
tokenizer.json
CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:031e825b2072f555233d0d01abc2cde072183173a53967e7e842813b9673748c
+size 19092952
tokenizer.model
CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:5072e3209a04aa01dbf4db72b8fec52cf8cd06a042c9ba819678e084f7b665d5
+size 4813283
tokenizer_config.json
CHANGED

The diff for this file is too large to render.