---
language:
- es
thumbnail: "url to a thumbnail used in social sharing"
license: apache-2.0
datasets:
- oscar
---

# SELECTRA: A Spanish ELECTRA

SELECTRA is a Spanish pre-trained language model based on [ELECTRA](https://github.com/google-research/electra).
We release a `small` and a `medium` version with the following configurations:

| Model | Layers | Embedding/Hidden Size | Params | Vocab Size | Max Sequence Length | Cased |
| --- | --- | --- | --- | --- | --- | --- |
| SELECTRA small | 12 | 256 | 22M | 50k | 512 | True |
| **SELECTRA medium** | **12** | **384** | **41M** | **50k** | **512** | **True** |

SELECTRA small is about 5 times smaller than BETO, and SELECTRA medium about 3 times smaller, while both achieve comparable results (see the Metrics section below).
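
The size ratios above follow directly from the parameter counts (BETO has 110M parameters, as listed in the Metrics table below):

```python
# Size ratios relative to BETO (parameter counts in millions, from the Metrics table)
beto, small, medium = 110, 22, 41

print(f"BETO / SELECTRA small:  {beto / small:.1f}x")   # 5.0x
print(f"BETO / SELECTRA medium: {beto / medium:.1f}x")  # 2.7x, i.e. roughly 3x
```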

## Usage

From the original [ELECTRA model card](https://huggingface.co/google/electra-small-discriminator): "ELECTRA models are trained to distinguish "real" input tokens vs "fake" input tokens generated by another neural network, similar to the discriminator of a GAN."
The discriminator should therefore assign a high logit to the fake input token, as the following example demonstrates:

```python
from transformers import ElectraForPreTraining, ElectraTokenizerFast

discriminator = ElectraForPreTraining.from_pretrained("Recognai/selectra_small")
tokenizer = ElectraTokenizerFast.from_pretrained("Recognai/selectra_small")

# "rosa" is the fake token in this sentence
sentence_with_fake_token = "Estamos desayunando pan rosa con tomate y aceite de oliva."

inputs = tokenizer.encode(sentence_with_fake_token, return_tensors="pt")
logits = discriminator(inputs).logits.tolist()[0]

# Print each token next to its logit (higher means more likely fake);
# the [CLS] and [SEP] logits are dropped.
print("\t".join(tokenizer.tokenize(sentence_with_fake_token)))
print("\t".join(map(lambda x: str(x)[:4], logits[1:-1])))
"""Output:
Estamos desayun ##ando pan rosa con tomate y aceite de oliva .
-3.1 -3.6 -6.9 -3.0 0.19 -4.5 -3.3 -5.1 -5.7 -7.7 -4.4 -4.2
"""
```

However, you probably want to fine-tune this model on a downstream task.

- Links to our zero-shot-classifiers

## Metrics

We fine-tune our models on 4 different downstream tasks:

- [XNLI](https://huggingface.co/datasets/xnli)
- [PAWS-X](https://huggingface.co/datasets/paws-x)
- [CoNLL2002 - POS](https://huggingface.co/datasets/conll2002)
- [CoNLL2002 - NER](https://huggingface.co/datasets/conll2002)

For each task, we conduct 5 trials and report the mean and standard deviation of the metrics in the table below.
To compare our results to other Spanish language models, we provide the same metrics taken from [Table 4](https://huggingface.co/bertin-project/bertin-roberta-base-spanish#results) of the Bertin project model card.

| Model | CoNLL2002 - POS (acc) | CoNLL2002 - NER (f1) | PAWS-X (acc) | XNLI (acc) | Params |
| --- | --- | --- | --- | --- | --- |
| SELECTRA small | 0.9653 ± 0.0007 | 0.863 ± 0.004 | 0.896 ± 0.002 | 0.784 ± 0.002 | **22M** |
| SELECTRA medium | 0.9677 ± 0.0004 | 0.870 ± 0.003 | 0.896 ± 0.002 | **0.804 ± 0.002** | 41M |
| [mBERT](https://huggingface.co/bert-base-multilingual-cased) | 0.9689 | 0.8616 | 0.8895 | 0.7606 | 178M |
| [BETO](https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased) | 0.9693 | 0.8596 | 0.8720 | 0.8012 | 110M |
| [BSC-BNE](https://huggingface.co/BSC-TeMU/roberta-base-bne) | **0.9706** | **0.8764** | 0.8815 | 0.7771 | 125M |
| [Bertin](https://huggingface.co/bertin-project/bertin-roberta-base-spanish/tree/v1-512) | 0.9697 | 0.8707 | **0.8965** | 0.7843 | 125M |

Some details of our fine-tuning runs:
- epochs: 5
- batch size: 32
- learning rate: 1e-4
- warmup proportion: 0.1
- linear learning rate decay
- layerwise learning rate decay
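
Layerwise learning rate decay means layers closer to the embeddings are trained with a smaller learning rate than the top layers. A minimal sketch of the idea (the decay factor of 0.9 is our assumption for illustration; the source does not state it):

```python
# Sketch of layerwise learning rate decay.
# NOTE: the decay factor (0.9) is an assumed value for illustration only.
def layerwise_lrs(base_lr=1e-4, num_layers=12, decay=0.9):
    """Return one learning rate per layer: the top layer gets the full
    base rate, each layer below it gets `decay` times the rate above."""
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]

lrs = layerwise_lrs()
print(lrs[-1])  # top layer trains at the full base rate: 0.0001
```

In practice this is implemented by building per-layer parameter groups for the optimizer, each with its own learning rate.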

For all the details, check out our [selectra repo](https://github.com/recognai/selectra).

## Training

We pre-trained our SELECTRA models on the Spanish portion of the [Oscar](https://huggingface.co/datasets/oscar) dataset, which is about 150GB in size.
Each model version is trained for 300k steps, with a warm restart of the learning rate after the first 150k steps.
Some details of the training:
- steps: 300k
- batch size: 128
- learning rate: 5e-4
- warmup steps: 10k
- linear learning rate decay
- TPU cores: 8 (v2-8)
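
Putting the schedule details together, a minimal sketch of the learning rate as a function of the training step (the exact restart behavior, e.g. whether the warmup phase is repeated after 150k steps, is our assumption):

```python
# Sketch of the pre-training LR schedule: linear warmup to the base rate,
# linear decay toward zero, restarted after 150k steps.
# NOTE: repeating the warmup after the restart is our assumption.
def pretrain_lr(step, base_lr=5e-4, warmup=10_000, cycle=150_000):
    s = step % cycle  # warm restart: the schedule repeats every `cycle` steps
    if s < warmup:
        return base_lr * s / warmup  # linear warmup
    # linear decay from base_lr down to 0 at the end of the cycle
    return base_lr * (1 - (s - warmup) / (cycle - warmup))

print(pretrain_lr(10_000))   # peak after warmup: 0.0005
print(pretrain_lr(150_000))  # warm restart: back to the start of the schedule
```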

For all details, check out our [selectra repo](https://github.com/recognai/selectra).

**Note:** Due to a misconfiguration in the pre-training scripts, the embeddings of vocabulary tokens containing accents were not optimized. If you fine-tune this model on a downstream task, you might consider using a tokenizer that does not strip the accents:
```python
tokenizer = ElectraTokenizerFast.from_pretrained("Recognai/selectra_small", strip_accents=False)
```

## Motivation

Despite the abundance of excellent Spanish language models (BETO, BSC-BNE, Bertin, ELECTRICIDAD, etc.), we felt there was still a lack of distilled or compact Spanish language models, and a lack of comparisons between these and their bigger siblings.

## Acknowledgment

This research was supported by the Google TPU Research Cloud (TRC) program.

## Authors

- David Fidalgo ([GitHub](https://github.com/dcfidalgo))
- Javier Lopez ([GitHub](https://github.com/javispp))
- Daniel Vila ([GitHub](https://github.com/dvsrepo))
- Francisco Aranda ([GitHub](https://github.com/frascuchon))