Fix bugs for bias
#3 by Katsumata420, opened
Files changed:
- README.md +20 -15
- model.safetensors +1 -1
README.md
CHANGED
@@ -10,6 +10,11 @@ language:
 **RetrievaBERT** is a pre-trained Transformer encoder built using Megatron-LM.
 It is designed for use in Japanese.
 
+## What's New
+
+- November 2024 (`v1.0.1`): Bug fix for the model parameters.
+  - The bias of `up_proj` had been initialized with the bias of the gate projection. This bug is now fixed.
+
 ## Model Details
 
 ### Model Description
@@ -19,12 +24,12 @@ **RetrievaBERT** is a pre-trained Transformer encoder built using Megatron-LM.
 It is designed for use in Japanese.
 
 This model offers several advanced features compared to traditional BERT models:
-- **PreNorm**: Improved stability during training.
-- **SwiGLU**: Enhanced activation function for better performance.
-- **Grouped-Query Attention (Multi-Query Attention)**: Efficient attention mechanism.
-- **Max Sequence Length**: 2048 tokens, allowing for longer context.
-- **Parameters**: 1.3 billion parameters.
-- **Pre-training Objective**: Only Masked Language Modeling (MLM), not Next Sentence Prediction (NSP).
+- **PreNorm**: Improved stability during training.
+- **SwiGLU**: Enhanced activation function for better performance.
+- **Grouped-Query Attention (Multi-Query Attention)**: Efficient attention mechanism.
+- **Max Sequence Length**: 2048 tokens, allowing for longer context.
+- **Parameters**: 1.3 billion parameters.
+- **Pre-training Objective**: Only Masked Language Modeling (MLM), not Next Sentence Prediction (NSP).
 - **Token Type IDs**: Not used in this model.
 
 ### Model Sources
@@ -44,9 +49,9 @@ Depending on your use case, follow the appropriate section below.
 
 This model is pre-trained using Masked Language Modeling.
 The mask token used is `<MASK|LLM-jp>`.
-Note that you need to set `trust_remote_code` to `True` because RetrievaBERT uses a custom model implementation.
-
-Example code for direct use:
+Note that you need to set `trust_remote_code` to `True` because RetrievaBERT uses a custom model implementation.
+
+Example code for direct use:
 
 ```python
 from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline
@@ -98,7 +103,7 @@ The model was trained on the following hyperparameters.
 - Floating point expression: BF16
 
 ## Evaluation
-We fine-tuned the following models and evaluated them on the [JGLUE](https://github.com/yahoojapan/JGLUE) development set.
+We fine-tuned the following models and evaluated them on the [JGLUE](https://github.com/yahoojapan/JGLUE) development set.
 We adjusted the learning rate and training epochs for each model and task in accordance with [the JGLUE paper](https://www.jstage.jst.go.jp/article/jnlp/30/1/30_63/_pdf/-char/ja).
 
 | Model | MARC-ja/acc | JSTS/pearson | JSTS/spearman | JNLI/acc | JSQuAD/EM | JSQuAD/F1 | JComQA/acc |
@@ -106,7 +111,7 @@ We adjusted the learning rate and training epochs for each model and task in acc
 | tohoku-nlp/bert-base-japanese-v3 | 0.957 | 0.914 | 0.876 | 0.906 | 0.878 | 0.946 | 0.849 |
 | tohoku-nlp/bert-large-japanese-v2 | 0.959 | 0.916 | 0.877 | 0.901 | 0.884 | 0.951 | 0.867 |
 | ku-nlp/deberta-v3-base-japanese | 0.958 | 0.925 | 0.890 | 0.902 | 0.925 | 0.910 | 0.882 |
-| retrieva-jp/bert-1.3b | 0.
+| retrieva-jp/bert-1.3b | 0.959 | 0.917 | 0.881 | 0.898 | 0.875 | 0.874 | 0.827 |
 
 
 ## Technical Specifications
@@ -121,9 +126,9 @@ The RetrievaBERT model is based on BERT with the following hyperparameters:
 - Maximum length of position embeddings: 2048
 
 As mentioned earlier, the main differences from the original BERT are:
-- PreNorm: Improved stability during training.
-- SwiGLU: Enhanced activation function for better performance.
-- Grouped-Query Attention (Multi-Query Attention): Efficient attention mechanism.
+- PreNorm: Improved stability during training.
+- SwiGLU: Enhanced activation function for better performance.
+- Grouped-Query Attention (Multi-Query Attention): Efficient attention mechanism.
 
 
 ### Compute Infrastructure
@@ -145,4 +150,4 @@ https://note.com/retrieva/n/n715bea2c2cd1 (in Japanese)
 Satoru Katsumata, Daisuke Kimura, Jiro Nishitoba
 
 ## Model Card Contact
-pr@retrieva.jp
+pr@retrieva.jp
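The README's usage snippet is cut off at the import line by the hunk boundary. A minimal sketch of how the rest might look, assuming the repository id `retrieva-jp/bert-1.3b` (the name used in the evaluation table; check the model card for the canonical id) and the documented mask token `<MASK|LLM-jp>`:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

# Assumed repository id; confirm against the model card.
model_id = "retrieva-jp/bert-1.3b"

# `trust_remote_code=True` is required because RetrievaBERT uses a custom
# model implementation shipped alongside its weights.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)

# The mask token is `<MASK|LLM-jp>`, not BERT's usual `[MASK]`; using
# `tokenizer.mask_token` avoids hard-coding it.
fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
preds = fill(f"こんにちは！私の名前は{tokenizer.mask_token}です。")
for p in preds:
    print(p["token_str"], p["score"])
```

Note that the first call downloads roughly 2.6 GB of weights (see the `model.safetensors` size below, which predates this sketch).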
model.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:994bd099f4bb0c9bab36ed16e1a8271f46f637de6b06e32fa1f29643d7b528c9
 size 2602880000
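The updated `model.safetensors` carries the corrected parameters for the bug described in What's New: the bias of the up projection in the SwiGLU feed-forward block had been filled from the gate projection's bias. A toy sketch of the mistake and the fix, using hypothetical parameter names (`gate_proj`, `up_proj`) that mirror the PR description rather than the actual Megatron-LM module layout:

```python
# Hypothetical checkpoint fragment: each SwiGLU projection has its own bias.
checkpoint = {
    "mlp.gate_proj.bias": [0.1, 0.2, 0.3, 0.4],
    "mlp.up_proj.bias":   [0.5, 0.6, 0.7, 0.8],
}

# Buggy load (pre-v1.0.1): up_proj's bias was taken from the gate's entry,
# so both halves of the SwiGLU block shared one bias vector.
buggy_up_bias = checkpoint["mlp.gate_proj.bias"]

# Fixed load (v1.0.1): each projection reads its own checkpoint entry.
fixed_up_bias = checkpoint["mlp.up_proj.bias"]

print(buggy_up_bias == checkpoint["mlp.gate_proj.bias"])  # True: the bug
print(fixed_up_bias == checkpoint["mlp.up_proj.bias"])    # True: the fix
```

Only the bias values change, which is why the file size in the LFS pointer stays the same while the sha256 hash differs.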