Update README.md
README.md
CHANGED
|
@@ -9,162 +9,19 @@ datasets:
|
|
| 9 |
|
| 10 |
Pretrained mbart-large-cc25 model finetuned on ERRnews Estonian news story dataset.
|
| 11 |
|
| 12 |
-
##
|
| 13 |
-
|
| 14 |
-
BERT is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it
|
| 15 |
-
was pretrained on the raw texts only, with no humans labeling them in any way (which is why it can use lots of
|
| 16 |
-
publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it
|
| 17 |
-
was pretrained with two objectives:
|
| 18 |
-
|
| 19 |
-
- Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input, then runs
|
| 20 |
-
the entire masked sentence through the model and has to predict the masked words. This is different from traditional
|
| 21 |
-
recurrent neural networks (RNNs) that usually see the words one after the other, or from autoregressive models like
|
| 22 |
-
GPT which internally masks the future tokens. It allows the model to learn a bidirectional representation of the
|
| 23 |
-
sentence.
|
| 24 |
-
- Next sentence prediction (NSP): the model concatenates two masked sentences as inputs during pretraining. Sometimes
|
| 25 |
-
they correspond to sentences that were next to each other in the original text, sometimes not. The model then has to
|
| 26 |
-
predict if the two sentences were following each other or not.
|
| 27 |
-
|
| 28 |
-
This way, the model learns an inner representation of the English language that can then be used to extract features
|
| 29 |
-
useful for downstream tasks: if you have a dataset of labeled sentences, for instance, you can train a standard
|
| 30 |
-
classifier using the features produced by the BERT model as inputs.
|
| 31 |
-
|
| 32 |
-
## Model variations
|
| 33 |
-
|
| 34 |
-
BERT was originally released in base and large variations, for cased and uncased input text. The uncased models also strip out accent markers.
|
| 35 |
-
Chinese and multilingual uncased and cased versions followed shortly after.
|
| 36 |
-
A follow-up work replaced subpiece masking with whole word masking in preprocessing and released two models.
|
| 37 |
-
24 other smaller models were released afterward.
|
| 38 |
-
|
| 39 |
-
The detailed release history can be found in the [google-research/bert readme](https://github.com/google-research/bert/blob/master/README.md) on GitHub.
|
| 40 |
-
|
| 41 |
-
| Model | #params | Language |
|
| 42 |
-
|------------------------|--------------------------------|-------|
|
| 43 |
-
| [`bert-base-uncased`](https://huggingface.co/bert-base-uncased) | 110M | English |
|
| 44 |
-
| [`bert-large-uncased`](https://huggingface.co/bert-large-uncased) | 340M | English |
|
| 45 |
-
| [`bert-base-cased`](https://huggingface.co/bert-base-cased) | 110M | English |
|
| 46 |
-
| [`bert-large-cased`](https://huggingface.co/bert-large-cased) | 340M | English |
|
| 47 |
-
| [`bert-base-chinese`](https://huggingface.co/bert-base-chinese) | 110M | Chinese |
|
| 48 |
-
| [`bert-base-multilingual-cased`](https://huggingface.co/bert-base-multilingual-cased) | 110M | Multiple |
|
| 49 |
-
| [`bert-large-uncased-whole-word-masking`](https://huggingface.co/bert-large-uncased-whole-word-masking) | 340M | English |
|
| 50 |
-
| [`bert-large-cased-whole-word-masking`](https://huggingface.co/bert-large-cased-whole-word-masking) | 340M | English |
|
| 51 |
-
|
| 52 |
-
## Intended uses & limitations
|
| 53 |
-
|
| 54 |
-
You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to
|
| 55 |
-
be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=bert) to look for
|
| 56 |
-
fine-tuned versions of a task that interests you.
|
| 57 |
-
|
| 58 |
-
Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
|
| 59 |
-
to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
|
| 60 |
-
generation you should look at a model like GPT2.
|
| 61 |
-
|
| 62 |
-
### How to use
|
| 63 |
-
|
| 64 |
-
You can use this model directly with a pipeline for masked language modeling:
|
| 65 |
|
| 66 |
```python
|
| 67 |
-
|
| 68 |
-
|
| 69 |
-
|
| 70 |
-
|
| 71 |
-
|
| 72 |
-
|
| 73 |
-
|
| 74 |
-
{'sequence': "[CLS] hello i'm a role model. [SEP]",
|
| 75 |
-
'score': 0.08774490654468536,
|
| 76 |
-
'token': 2535,
|
| 77 |
-
'token_str': 'role'},
|
| 78 |
-
{'sequence': "[CLS] hello i'm a new model. [SEP]",
|
| 79 |
-
'score': 0.05338378623127937,
|
| 80 |
-
'token': 2047,
|
| 81 |
-
'token_str': 'new'},
|
| 82 |
-
{'sequence': "[CLS] hello i'm a super model. [SEP]",
|
| 83 |
-
'score': 0.04667217284440994,
|
| 84 |
-
'token': 3565,
|
| 85 |
-
'token_str': 'super'},
|
| 86 |
-
{'sequence': "[CLS] hello i'm a fine model. [SEP]",
|
| 87 |
-
'score': 0.027095865458250046,
|
| 88 |
-
'token': 2986,
|
| 89 |
-
'token_str': 'fine'}]
|
| 90 |
```
|
| 91 |
|
| 92 |
-
Here is how to use this model to get the features of a given text in PyTorch:
|
| 93 |
-
|
| 94 |
-
```python
|
| 95 |
-
from transformers import BertTokenizer, BertModel
|
| 96 |
-
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
|
| 97 |
-
model = BertModel.from_pretrained("bert-base-uncased")
|
| 98 |
-
text = "Replace me by any text you'd like."
|
| 99 |
-
encoded_input = tokenizer(text, return_tensors='pt')
|
| 100 |
-
output = model(**encoded_input)
|
| 101 |
-
```
|
| 102 |
-
|
| 103 |
-
and in TensorFlow:
|
| 104 |
-
|
| 105 |
-
```python
|
| 106 |
-
from transformers import BertTokenizer, TFBertModel
|
| 107 |
-
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
|
| 108 |
-
model = TFBertModel.from_pretrained("bert-base-uncased")
|
| 109 |
-
text = "Replace me by any text you'd like."
|
| 110 |
-
encoded_input = tokenizer(text, return_tensors='tf')
|
| 111 |
-
output = model(encoded_input)
|
| 112 |
-
```
|
| 113 |
-
|
| 114 |
-
### Limitations and bias
|
| 115 |
-
|
| 116 |
-
Even if the training data used for this model could be characterized as fairly neutral, this model can have biased
|
| 117 |
-
predictions:
|
| 118 |
-
|
| 119 |
-
```python
|
| 120 |
-
>>> from transformers import pipeline
|
| 121 |
-
>>> unmasker = pipeline('fill-mask', model='bert-base-uncased')
|
| 122 |
-
>>> unmasker("The man worked as a [MASK].")
|
| 123 |
-
[{'sequence': '[CLS] the man worked as a carpenter. [SEP]',
|
| 124 |
-
'score': 0.09747550636529922,
|
| 125 |
-
'token': 10533,
|
| 126 |
-
'token_str': 'carpenter'},
|
| 127 |
-
{'sequence': '[CLS] the man worked as a waiter. [SEP]',
|
| 128 |
-
'score': 0.0523831807076931,
|
| 129 |
-
'token': 15610,
|
| 130 |
-
'token_str': 'waiter'},
|
| 131 |
-
{'sequence': '[CLS] the man worked as a barber. [SEP]',
|
| 132 |
-
'score': 0.04962705448269844,
|
| 133 |
-
'token': 13362,
|
| 134 |
-
'token_str': 'barber'},
|
| 135 |
-
{'sequence': '[CLS] the man worked as a mechanic. [SEP]',
|
| 136 |
-
'score': 0.03788609802722931,
|
| 137 |
-
'token': 15893,
|
| 138 |
-
'token_str': 'mechanic'},
|
| 139 |
-
{'sequence': '[CLS] the man worked as a salesman. [SEP]',
|
| 140 |
-
'score': 0.037680890411138535,
|
| 141 |
-
'token': 18968,
|
| 142 |
-
'token_str': 'salesman'}]
|
| 143 |
-
>>> unmasker("The woman worked as a [MASK].")
|
| 144 |
-
[{'sequence': '[CLS] the woman worked as a nurse. [SEP]',
|
| 145 |
-
'score': 0.21981462836265564,
|
| 146 |
-
'token': 6821,
|
| 147 |
-
'token_str': 'nurse'},
|
| 148 |
-
{'sequence': '[CLS] the woman worked as a waitress. [SEP]',
|
| 149 |
-
'score': 0.1597415804862976,
|
| 150 |
-
'token': 13877,
|
| 151 |
-
'token_str': 'waitress'},
|
| 152 |
-
{'sequence': '[CLS] the woman worked as a maid. [SEP]',
|
| 153 |
-
'score': 0.1154729500412941,
|
| 154 |
-
'token': 10850,
|
| 155 |
-
'token_str': 'maid'},
|
| 156 |
-
{'sequence': '[CLS] the woman worked as a prostitute. [SEP]',
|
| 157 |
-
'score': 0.037968918681144714,
|
| 158 |
-
'token': 19215,
|
| 159 |
-
'token_str': 'prostitute'},
|
| 160 |
-
{'sequence': '[CLS] the woman worked as a cook. [SEP]',
|
| 161 |
-
'score': 0.03042375110089779,
|
| 162 |
-
'token': 5660,
|
| 163 |
-
'token_str': 'cook'}]
|
| 164 |
-
```
|
| 165 |
-
|
| 166 |
-
This bias will also affect all fine-tuned versions of this model.
|
| 167 |
-
|
| 168 |
## Training data
|
| 169 |
|
| 170 |
The mBART model was finetuned on [ERRnews](https://huggingface.co/datasets/TalTechNLP/ERRnews), a dataset consisting of 10 420
|
|
@@ -179,6 +36,7 @@ used is Adam with a learning rate of 5e-05, \\(\beta_{1} = 0.9\\) and \\(\beta_{
|
|
| 179 |
|
| 180 |
This model achieves the following results:
|
| 181 |
|
| 182 |
| Dataset | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-L-SUM |
|
| 183 |
|:-------:|:-------:|:-------:|:-------:|:-----------:|
|
| 184 |
| ERRnews | 19.2 | 6.7 | 16.1 | 17.4 |
|
| 9 |
|
| 10 |
Pretrained mbart-large-cc25 model finetuned on ERRnews Estonian news story dataset.
|
| 11 |
|
| 12 |
+
## How to use
|
| 13 |
+
Here is how to use this model to get a summary of a given text in PyTorch:
|
| 14 |
|
| 15 |
```python
|
| 16 |
+
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
|
| 17 |
+
tokenizer = AutoTokenizer.from_pretrained("TalTechNLP/mBART-ERRnews")
|
| 18 |
+
model = AutoModelForSeq2SeqLM.from_pretrained("TalTechNLP/mBART-ERRnews")
|
| 19 |
+
text = "Riigikogu rahanduskomisjon võttis esmaspäeval maha riigieelarvesse esitatud investeeringuettepanekutest siseministeeriumi investeeringud koolidele ja lasteaedadele, sest komisjoni hinnangul ei peaks siseministeerium tegelema investeeringutega väljaspoole oma vastutusala. Komisjoni esimees Aivar Kokk ütles, et komisjon lähtus otsuse tegemisel riigikontrolör Janar Holmi soovitusest ja seadustest."
|
| 20 |
+
inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=1024)
|
| 21 |
+
summary_ids = model.generate(inputs['input_ids'])
|
| 22 |
+
summary = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids]
|
| 23 |
```
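
The calls above can also be wrapped in a small reusable function. What follows is only a sketch: the helper name `summarize`, its parameters, and the generation settings in the usage comment are assumptions made for illustration and do not come from the model card. It takes the tokenizer and model as arguments, so it loads nothing itself, and it sets `truncation=True` so that `max_length` is enforced explicitly on long articles.

```python
def summarize(text, tokenizer, model, max_input_tokens=1024, **generate_kwargs):
    """Return a list of decoded summaries for `text` (hypothetical helper)."""
    # truncation=True explicitly cuts inputs longer than max_length.
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=max_input_tokens,
    )
    # The attention mask matters once several padded inputs are batched.
    summary_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        **generate_kwargs,
    )
    # batch_decode turns every generated id sequence into a string.
    return tokenizer.batch_decode(
        summary_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False,
    )


# Usage with the tokenizer, model and text defined above; the generation
# settings are illustrative assumptions, not values from the model card:
# summaries = summarize(text, tokenizer, model, num_beams=4, max_new_tokens=128)
```

Any keyword accepted by `model.generate` can be passed through `generate_kwargs`, which keeps decoding choices out of the helper itself.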
|
| 24 |
|
| 25 |
## Training data
|
| 26 |
|
| 27 |
The mBART model was finetuned on [ERRnews](https://huggingface.co/datasets/TalTechNLP/ERRnews), a dataset consisting of 10 420
|
| 36 |
|
| 37 |
This model achieves the following results:
|
| 38 |
|
| 39 |
+
|
| 40 |
| Dataset | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-L-SUM |
|
| 41 |
|:-------:|:-------:|:-------:|:-------:|:-----------:|
|
| 42 |
| ERRnews | 19.2 | 6.7 | 16.1 | 17.4 |
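
For readers unfamiliar with the metric, ROUGE-N measures n-gram overlap between a generated summary and a reference summary. The sketch below is a simplified illustration, not the evaluation script behind the table: the function names are hypothetical, tokenization is plain lowercased whitespace splitting, and the returned F1 score lies in the range 0 to 1 rather than being scaled by 100. Standard ROUGE packages may apply different tokenization or stemming, so their numbers can differ.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of a token list (empty if the list is too short)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(reference, candidate, n=1):
    """Simplified ROUGE-N F1 based on clipped n-gram overlap."""
    ref = ngram_counts(reference.lower().split(), n)
    cand = ngram_counts(candidate.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Calling `rouge_n_f1(reference, candidate, n=2)` gives the bigram variant that corresponds to the ROUGE-2 column.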
|