henryharm committed
Commit 066d8b5 · 1 Parent(s): c3ba182

Update README.md

Files changed (1)
  1. README.md +10 -152
README.md CHANGED
@@ -9,162 +9,19 @@ datasets:

Pretrained mbart-large-cc25 model finetuned on ERRnews Estonian news story dataset.

- ## Model description
-
- BERT is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it
- was pretrained on the raw texts only, with no humans labeling them in any way (which is why it can use lots of
- publicly available data), with an automatic process to generate inputs and labels from those texts. More precisely, it
- was pretrained with two objectives:
-
- Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input, then runs
- the entire masked sentence through the model and has to predict the masked words. This is different from traditional
- recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models like
- GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of the
- sentence.
- Next sentence prediction (NSP): the model concatenates two masked sentences as inputs during pretraining. Sometimes
- they correspond to sentences that were next to each other in the original text, sometimes not. The model then has to
- predict whether the two sentences were following each other or not.
-
- This way, the model learns an inner representation of the English language that can then be used to extract features
- useful for downstream tasks: if you have a dataset of labeled sentences, for instance, you can train a standard
- classifier using the features produced by the BERT model as inputs.
-
- ## Model variations
-
- BERT was originally released in base and large variations, for cased and uncased input text. The uncased models also strip out accent markers.
- Chinese and multilingual uncased and cased versions followed shortly after.
- Modified preprocessing with whole word masking replaced subpiece masking in a follow-up work, with the release of two models.
- Another 24 smaller models were released afterward.
-
- The detailed release history can be found on the [google-research/bert readme](https://github.com/google-research/bert/blob/master/README.md) on GitHub.
-
- | Model | #params | Language |
- |------------------------|--------------------------------|-------|
- | [`bert-base-uncased`](https://huggingface.co/bert-base-uncased) | 110M | English |
- | [`bert-large-uncased`](https://huggingface.co/bert-large-uncased) | 340M | English |
- | [`bert-base-cased`](https://huggingface.co/bert-base-cased) | 110M | English |
- | [`bert-large-cased`](https://huggingface.co/bert-large-cased) | 340M | English |
- | [`bert-base-chinese`](https://huggingface.co/bert-base-chinese) | 110M | Chinese |
- | [`bert-base-multilingual-cased`](https://huggingface.co/bert-base-multilingual-cased) | 110M | Multiple |
- | [`bert-large-uncased-whole-word-masking`](https://huggingface.co/bert-large-uncased-whole-word-masking) | 340M | English |
- | [`bert-large-cased-whole-word-masking`](https://huggingface.co/bert-large-cased-whole-word-masking) | 340M | English |
-
- ## Intended uses & limitations
-
- You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to
- be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=bert) to look for
- fine-tuned versions on a task that interests you.
-
- Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
- to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
- generation you should look at a model like GPT2.
-
- ### How to use
-
- You can use this model directly with a pipeline for masked language modeling:
+ ## How to use
+ Here is how to use this model to get a summary of a given text in PyTorch:

  ```python
- >>> from transformers import pipeline
- >>> unmasker = pipeline('fill-mask', model='bert-base-uncased')
- >>> unmasker("Hello I'm a [MASK] model.")
- [{'sequence': "[CLS] hello i'm a fashion model. [SEP]",
- 'score': 0.1073106899857521,
- 'token': 4827,
- 'token_str': 'fashion'},
- {'sequence': "[CLS] hello i'm a role model. [SEP]",
- 'score': 0.08774490654468536,
- 'token': 2535,
- 'token_str': 'role'},
- {'sequence': "[CLS] hello i'm a new model. [SEP]",
- 'score': 0.05338378623127937,
- 'token': 2047,
- 'token_str': 'new'},
- {'sequence': "[CLS] hello i'm a super model. [SEP]",
- 'score': 0.04667217284440994,
- 'token': 3565,
- 'token_str': 'super'},
- {'sequence': "[CLS] hello i'm a fine model. [SEP]",
- 'score': 0.027095865458250046,
- 'token': 2986,
- 'token_str': 'fine'}]
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+ tokenizer = AutoTokenizer.from_pretrained("TalTechNLP/mBART-ERRnews")
+ model = AutoModelForSeq2SeqLM.from_pretrained("TalTechNLP/mBART-ERRnews")
+ text = "Riigikogu rahanduskomisjon võttis esmaspäeval maha riigieelarvesse esitatud investeeringuettepanekutest siseministeeriumi investeeringud koolidele ja lasteaedadele, sest komisjoni hinnangul ei peaks siseministeerium tegelema investeeringutega väljaspoole oma vastutusala. Komisjoni esimees Aivar Kokk ütles, et komisjon lähtus otsuse tegemisel riigikontrolör Janar Holmi soovitusest ja seadustest."
+ # truncation=True is needed for max_length to actually cap the input length
+ inputs = tokenizer(text, return_tensors='pt', max_length=1024, truncation=True)
+ summary_ids = model.generate(inputs['input_ids'])
+ summary = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids]
  ```
 
- Here is how to use this model to get the features of a given text in PyTorch:
-
- ```python
- from transformers import BertTokenizer, BertModel
- tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
- model = BertModel.from_pretrained("bert-base-uncased")
- text = "Replace me by any text you'd like."
- encoded_input = tokenizer(text, return_tensors='pt')
- output = model(**encoded_input)
- ```
-
- and in TensorFlow:
-
- ```python
- from transformers import BertTokenizer, TFBertModel
- tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
- model = TFBertModel.from_pretrained("bert-base-uncased")
- text = "Replace me by any text you'd like."
- encoded_input = tokenizer(text, return_tensors='tf')
- output = model(encoded_input)
- ```
-
- ### Limitations and bias
-
- Even if the training data used for this model could be characterized as fairly neutral, this model can have biased
- predictions:
-
- ```python
- >>> from transformers import pipeline
- >>> unmasker = pipeline('fill-mask', model='bert-base-uncased')
- >>> unmasker("The man worked as a [MASK].")
- [{'sequence': '[CLS] the man worked as a carpenter. [SEP]',
- 'score': 0.09747550636529922,
- 'token': 10533,
- 'token_str': 'carpenter'},
- {'sequence': '[CLS] the man worked as a waiter. [SEP]',
- 'score': 0.0523831807076931,
- 'token': 15610,
- 'token_str': 'waiter'},
- {'sequence': '[CLS] the man worked as a barber. [SEP]',
- 'score': 0.04962705448269844,
- 'token': 13362,
- 'token_str': 'barber'},
- {'sequence': '[CLS] the man worked as a mechanic. [SEP]',
- 'score': 0.03788609802722931,
- 'token': 15893,
- 'token_str': 'mechanic'},
- {'sequence': '[CLS] the man worked as a salesman. [SEP]',
- 'score': 0.037680890411138535,
- 'token': 18968,
- 'token_str': 'salesman'}]
- >>> unmasker("The woman worked as a [MASK].")
- [{'sequence': '[CLS] the woman worked as a nurse. [SEP]',
- 'score': 0.21981462836265564,
- 'token': 6821,
- 'token_str': 'nurse'},
- {'sequence': '[CLS] the woman worked as a waitress. [SEP]',
- 'score': 0.1597415804862976,
- 'token': 13877,
- 'token_str': 'waitress'},
- {'sequence': '[CLS] the woman worked as a maid. [SEP]',
- 'score': 0.1154729500412941,
- 'token': 10850,
- 'token_str': 'maid'},
- {'sequence': '[CLS] the woman worked as a prostitute. [SEP]',
- 'score': 0.037968918681144714,
- 'token': 19215,
- 'token_str': 'prostitute'},
- {'sequence': '[CLS] the woman worked as a cook. [SEP]',
- 'score': 0.03042375110089779,
- 'token': 5660,
- 'token_str': 'cook'}]
- ```
-
- This bias will also affect all fine-tuned versions of this model.
-
## Training data

The mBART model was finetuned on [ERRnews](https://huggingface.co/datasets/TalTechNLP/ERRnews), a dataset consisting of 10 420
@@ -179,6 +36,7 @@ used is Adam with a learning rate of 5e-05, \\(\beta_{1} = 0.9\\) and \\(\beta_{

This model achieves the following results:

+
| Dataset | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-L-SUM |
|:-------:|:-------:|:-------:|:-------:|:-----------:|
| ERRnews | 19.2 | 6.7 | 16.1 | 17.4 |
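
A minimal end-to-end sketch of the usage example added in this commit, for reference. The generation settings (`num_beams`, `max_length`) are illustrative assumptions, not values specified in the model card:

```python
# Sketch: run the newly added summarization example end to end.
# Assumption: num_beams/max_length are illustrative defaults, not
# settings documented by the model authors.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("TalTechNLP/mBART-ERRnews")
model = AutoModelForSeq2SeqLM.from_pretrained("TalTechNLP/mBART-ERRnews")

text = "Riigikogu rahanduskomisjon võttis esmaspäeval maha ..."  # an Estonian news story
inputs = tokenizer(text, return_tensors="pt", max_length=1024, truncation=True)

# Beam search typically yields better summaries than greedy decoding.
summary_ids = model.generate(
    inputs["input_ids"],
    num_beams=4,
    max_length=128,
    early_stopping=True,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```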
 
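The ROUGE columns in the results table can be checked with a standard scorer. A minimal sketch, assuming the Hugging Face `evaluate` package (not necessarily the exact evaluation script behind the reported numbers):

```python
# Sketch: score generated summaries against references with ROUGE.
# Assumption: predictions/references are placeholders, not ERRnews test data.
import evaluate

rouge = evaluate.load("rouge")
predictions = ["a model-generated summary"]
references = ["the corresponding reference summary"]
scores = rouge.compute(predictions=predictions, references=references)
print(scores)  # keys: rouge1, rouge2, rougeL, rougeLsum
```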