Update README.md
README.md
CHANGED
|
@@ -11,12 +11,12 @@ license: apache-2.0
|
|
| 11 |
# Tiny BERT December 2022
|
| 12 |
|
| 13 |
This is a more up-to-date version of the [original tiny BERT](https://huggingface.co/google/bert_uncased_L-2_H-128_A-2) referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962) (English only, uncased, trained with WordPiece masking).
|
| 14 |
-
In addition to being more up-to-date, it is more CPU friendly than its base version.
|
| 15 |
|
| 16 |
-
We think it is fair to directly compare our model to the original tiny BERT because our model was trained with about the same level of compute as the original tiny BERT.
|
| 17 |
-
Our model was trained on a cleaned December 2022 snapshot of Common Crawl and Wikipedia.
|
| 18 |
|
| 19 |
-
|
| 20 |
This is important because we want our models to know about events like COVID or
|
| 21 |
a presidential election right after they happen.
|
| 22 |
|
|
@@ -25,6 +25,27 @@ a presidential election right after they happen.
|
|
| 25 |
You can use the raw model for masked language modeling, but it's mostly intended to
|
| 26 |
be fine-tuned on a downstream task, such as sequence classification, token classification or question answering.
|
| 27 |
|
| 28 |
## Dataset
|
| 29 |
|
| 30 |
The model and tokenizer were trained with this [December 2022 cleaned Common Crawl dataset](https://huggingface.co/datasets/olm/olm-CC-MAIN-2022-49-sampling-ratio-olm-0.15114822547) plus this [December 2022 cleaned Wikipedia dataset](https://huggingface.co/datasets/olm/olm-wikipedia-20221220).\
|
| 11 |
# Tiny BERT December 2022
|
| 12 |
|
| 13 |
This is a more up-to-date version of the [original tiny BERT](https://huggingface.co/google/bert_uncased_L-2_H-128_A-2) referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962) (English only, uncased, trained with WordPiece masking).
|
| 14 |
+
In addition to being more up-to-date, it is more CPU-friendly than its base version, but this is its first version and it is by no means perfect.
|
| 15 |
|
| 16 |
|
| 17 |
+
The model was trained on a cleaned December 2022 snapshot of Common Crawl and Wikipedia.
|
| 18 |
+
|
| 19 |
+
This model was intended to be part of the OLM project, which has the goal of continuously training and releasing models that are up-to-date and comparable in standard language model performance to their static counterparts.
|
| 20 |
This is important because we want our models to know about events like COVID or
|
| 21 |
a presidential election right after they happen.
|
| 22 |
|
| 25 |
You can use the raw model for masked language modeling, but it's mostly intended to
|
| 26 |
be fine-tuned on a downstream task, such as sequence classification, token classification or question answering.
|
| 27 |
|
| 28 |
+
## Special note
|
| 29 |
+
|
| 30 |
+
Based on a quick GLUE fine-tuning run and dev-set evaluation, the OLM tiny BERT appears to underperform the original:
|
| 31 |
+
|
| 32 |
+
Original
|
| 33 |
+
```python
{'cola_mcc': 0.0, 'sst2_acc': 0.7981651376146789, 'mrpc_acc': 0.6838235294117647, 'mrpc_f1': 0.8122270742358079,
 'stsb_pear': 0.672082873279731, 'stsb_spear': 0.6933378278505834, 'qqp_acc': 0.7766420762598881,
 'mnli_acc': 0.6542027508914926, 'mnli_acc_mm': 0.6670056956875509, 'qnli_acc': 0.774665934468241,
 'rte_acc': 0.5776173285198556, 'wnli_acc': 0.49295774647887325}
```
|
| 38 |
+
|
| 39 |
+
OLM
|
| 40 |
+
```python
{'cola_mcc': 0.0, 'sst2_acc': 0.7970183486238532, 'mrpc_acc': 0.6838235294117647, 'mrpc_f1': 0.8122270742358079,
 'stsb_pear': -0.15978233085015087, 'stsb_spear': -0.13638650127051932, 'qqp_acc': 0.6292213609628794,
 'mnli_acc': 0.5323484462557311, 'mnli_acc_mm': 0.5465825874694874, 'qnli_acc': 0.6199890170236134,
 'rte_acc': 0.5595667870036101, 'wnli_acc': 0.5352112676056338}
```
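
The gap between the two runs is easier to read as per-metric differences. The following is a minimal sketch, not part of the original evaluation: the scores are copied from the two result blocks above, and the helper name `score_deltas` is hypothetical.

```python
# GLUE dev scores copied from the "Original" and "OLM" result blocks above.
original = {
    'cola_mcc': 0.0, 'sst2_acc': 0.7981651376146789,
    'mrpc_acc': 0.6838235294117647, 'mrpc_f1': 0.8122270742358079,
    'stsb_pear': 0.672082873279731, 'stsb_spear': 0.6933378278505834,
    'qqp_acc': 0.7766420762598881, 'mnli_acc': 0.6542027508914926,
    'mnli_acc_mm': 0.6670056956875509, 'qnli_acc': 0.774665934468241,
    'rte_acc': 0.5776173285198556, 'wnli_acc': 0.49295774647887325,
}
olm = {
    'cola_mcc': 0.0, 'sst2_acc': 0.7970183486238532,
    'mrpc_acc': 0.6838235294117647, 'mrpc_f1': 0.8122270742358079,
    'stsb_pear': -0.15978233085015087, 'stsb_spear': -0.13638650127051932,
    'qqp_acc': 0.6292213609628794, 'mnli_acc': 0.5323484462557311,
    'mnli_acc_mm': 0.5465825874694874, 'qnli_acc': 0.6199890170236134,
    'rte_acc': 0.5595667870036101, 'wnli_acc': 0.5352112676056338,
}


def score_deltas(baseline, candidate):
    """Return {metric: candidate - baseline} for metrics present in both dicts.

    Hypothetical helper for illustration only.
    """
    return {
        name: candidate[name] - baseline[name]
        for name in baseline
        if name in candidate
    }


deltas = score_deltas(original, olm)

# Print metrics from the largest regression to the largest improvement.
print(f"{'metric':12s} {'original':>9s} {'olm':>9s} {'delta':>9s}")
for name, delta in sorted(deltas.items(), key=lambda item: item[1]):
    print(f"{name:12s} {original[name]:9.4f} {olm[name]:9.4f} {delta:+9.4f}")
```

Sorting by delta puts the largest regressions first, which makes it easy to see which tasks account for most of the gap.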
|
| 45 |
+
|
| 46 |
+
The hyperparameters and tokenizer were probably misconfigured, unfortunately. Stay tuned for version 2 🚀🚀🚀
|
| 47 |
+
|
| 48 |
+
|
| 49 |
## Dataset
|
| 50 |
|
| 51 |
The model and tokenizer were trained with this [December 2022 cleaned Common Crawl dataset](https://huggingface.co/datasets/olm/olm-CC-MAIN-2022-49-sampling-ratio-olm-0.15114822547) plus this [December 2022 cleaned Wikipedia dataset](https://huggingface.co/datasets/olm/olm-wikipedia-20221220).\
|