aravind-812 committed
Commit e1a9ca8
1 parent: 6f920a7

Update README.md
julien-c — Migrate model card from transformers-repo. Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755 Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/roberta-large-README.md (commit 50c7b24, 1 month ago)

---
language: en
tags:
- exbert
license: mit
datasets:
- bookcorpus
- wikipedia
---

# RoBERTa large model

Pretrained model on the English language using a masked language modeling (MLM) objective. It was introduced in
[this paper](https://arxiv.org/abs/1907.11692) and first released in
[this repository](https://github.com/pytorch/fairseq/tree/master/examples/roberta). This model is case-sensitive: it
makes a difference between english and English.

Disclaimer: The team releasing RoBERTa did not write a model card for this model, so this model card has been written by
the Hugging Face team.

## Model description

RoBERTa is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means
it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of
publicly available data), with an automatic process to generate inputs and labels from those texts.

More precisely, it was pretrained with the masked language modeling (MLM) objective. Taking a sentence, the model
randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict
the masked words. This is different from traditional recurrent neural networks (RNNs), which usually see the words one
after the other, or from autoregressive models like GPT, which internally mask the future tokens. It allows the model to
learn a bidirectional representation of the sentence.

This way, the model learns an inner representation of the English language that can then be used to extract features
useful for downstream tasks: if you have a dataset of labeled sentences, for instance, you can train a standard
classifier using the features produced by the RoBERTa model as inputs.

## Intended uses & limitations

You can use the raw model for masked language modeling, but it's mostly intended to be fine-tuned on a downstream task.
See the [model hub](https://huggingface.co/models?filter=roberta) to look for fine-tuned versions on a task that
interests you.

Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
generation, you should look at a model like GPT2.

### How to use

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='roberta-large')
>>> unmasker("Hello I'm a <mask> model.")

[{'sequence': "<s>Hello I'm a male model.</s>",
  'score': 0.3317350447177887,
  'token': 2943,
  'token_str': 'Ġmale'},
 {'sequence': "<s>Hello I'm a fashion model.</s>",
  'score': 0.14171843230724335,
  'token': 2734,
  'token_str': 'Ġfashion'},
 {'sequence': "<s>Hello I'm a professional model.</s>",
  'score': 0.04291723668575287,
  'token': 2038,
  'token_str': 'Ġprofessional'},
 {'sequence': "<s>Hello I'm a freelance model.</s>",
  'score': 0.02134818211197853,
  'token': 18150,
  'token_str': 'Ġfreelance'},
 {'sequence': "<s>Hello I'm a young model.</s>",
  'score': 0.021098261699080467,
  'token': 664,
  'token_str': 'Ġyoung'}]
```

Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import RobertaTokenizer, RobertaModel
tokenizer = RobertaTokenizer.from_pretrained('roberta-large')
model = RobertaModel.from_pretrained('roberta-large')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```

and in TensorFlow:

```python
from transformers import RobertaTokenizer, TFRobertaModel
tokenizer = RobertaTokenizer.from_pretrained('roberta-large')
model = TFRobertaModel.from_pretrained('roberta-large')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```
118
+
119
+ ### Limitations and bias
120
+
121
+ The training data used for this model contains a lot of unfiltered content from the internet, which is far from
122
+ neutral. Therefore, the model can have biased predictions:
123
+
124
+ ```python
125
+ >>> from transformers import pipeline
126
+ >>> unmasker = pipeline('fill-mask', model='roberta-large')
127
+ >>> unmasker("The man worked as a <mask>.")
128
+
129
+ [{'sequence': '<s>The man worked as a mechanic.</s>',
130
+ 'score': 0.08260300755500793,
131
+ 'token': 25682,
132
+ 'token_str': '臓mechanic'},
133
+ {'sequence': '<s>The man worked as a driver.</s>',
134
+ 'score': 0.05736079439520836,
135
+ 'token': 1393,
136
+ 'token_str': '臓driver'},
137
+ {'sequence': '<s>The man worked as a teacher.</s>',
138
+ 'score': 0.04709019884467125,
139
+ 'token': 3254,
140
+ 'token_str': '臓teacher'},
141
+ {'sequence': '<s>The man worked as a bartender.</s>',
142
+ 'score': 0.04641604796051979,
143
+ 'token': 33080,
144
+ 'token_str': '臓bartender'},
145
+ {'sequence': '<s>The man worked as a waiter.</s>',
146
+ 'score': 0.04239227622747421,
147
+ 'token': 38233,
148
+ 'token_str': '臓waiter'}]
149
+
150
+ >>> unmasker("The woman worked as a <mask>.")
151
+
152
+ [{'sequence': '<s>The woman worked as a nurse.</s>',
153
+ 'score': 0.2667474150657654,
154
+ 'token': 9008,
155
+ 'token_str': '臓nurse'},
156
+ {'sequence': '<s>The woman worked as a waitress.</s>',
157
+ 'score': 0.12280137836933136,
158
+ 'token': 35698,
159
+ 'token_str': '臓waitress'},
160
+ {'sequence': '<s>The woman worked as a teacher.</s>',
161
+ 'score': 0.09747499972581863,
162
+ 'token': 3254,
163
+ 'token_str': '臓teacher'},
164
+ {'sequence': '<s>The woman worked as a secretary.</s>',
165
+ 'score': 0.05783602222800255,
166
+ 'token': 2971,
167
+ 'token_str': '臓secretary'},
168
+ {'sequence': '<s>The woman worked as a cleaner.</s>',
169
+ 'score': 0.05576248839497566,
170
+ 'token': 16126,
171
+ 'token_str': '臓cleaner'}]
172
+ ```

This bias will also affect all fine-tuned versions of this model.

## Training data

The RoBERTa model was pretrained on the union of five datasets:
- [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of 11,038 unpublished books;
- [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia) (excluding lists, tables and headers);
- [CC-News](https://commoncrawl.org/2016/10/news-dataset-available/), a dataset containing 63 million English news
  articles crawled between September 2016 and February 2019;
- [OpenWebText](https://github.com/jcpeterson/openwebtext), an open-source recreation of the WebText dataset used to
  train GPT-2;
- [Stories](https://arxiv.org/abs/1806.02847), a dataset containing a subset of CommonCrawl data filtered to match the
  story-like style of Winograd schemas.

Together these datasets weigh 160GB of text.

## Training procedure

### Preprocessing

The texts are tokenized using a byte-level version of Byte-Pair Encoding (BPE) and a vocabulary size of 50,000. The
inputs of the model take pieces of 512 contiguous tokens that may span over documents. The beginning of a new document
is marked with `<s>` and the end of one by `</s>`.

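As a rough illustration of how BPE builds its vocabulary, the sketch below repeatedly merges the most frequent adjacent symbol pair in a toy corpus. This is plain Python and the corpus, symbols, and number of merges are made up for illustration; the real tokenizer operates on bytes and uses 50,000 learned merges.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus (word tuple -> frequency)."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word is a tuple of single-character symbols with a count.
corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6}
for _ in range(3):  # learn three merges: "we", then "wer", then "lo"
    corpus = merge_pair(corpus, most_frequent_pair(corpus))
```

After three merges, frequent words are already segmented into few symbols (e.g. "lower" becomes `("lo", "wer")`), which is the effect the 50,000-merge vocabulary achieves at scale.
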
The details of the masking procedure for each sentence are the following:
- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by `<mask>`.
- In 10% of the cases, the masked tokens are replaced by a random token, different from the one they replace.
- In the 10% remaining cases, the masked tokens are left as is.

Contrary to BERT, the masking is done dynamically during pretraining (i.e., it changes at each epoch and is not fixed).

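The 15% / 80-10-10 corruption scheme can be sketched in plain Python. This is a simplified illustration with a made-up vocabulary, not the actual data collator, which operates on token ids in batches:

```python
import random

MASK = "<mask>"
VOCAB = ["cat", "dog", "runs", "fast", "the"]  # hypothetical toy vocabulary

def dynamic_mask(tokens, rng):
    """Apply the masking scheme described above to one sequence.

    Returns (inputs, labels): labels keep the original token at selected
    positions and None elsewhere, so the loss is only computed on the
    ~15% of positions that were selected.
    """
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < 0.15:          # select ~15% of the tokens
            labels.append(tok)
            r = rng.random()
            if r < 0.8:                  # 80%: replace with <mask>
                inputs.append(MASK)
            elif r < 0.9:                # 10%: replace with a random token
                inputs.append(rng.choice(VOCAB))
            else:                        # 10%: keep the token unchanged
                inputs.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels

# Because masking is dynamic, each epoch sees a different corruption:
rng = random.Random(0)
sentence = ["the", "cat", "runs", "fast"] * 250   # 1000 tokens
epoch1, _ = dynamic_mask(sentence, rng)
epoch2, _ = dynamic_mask(sentence, rng)
```

Re-sampling the mask on every pass is what "dynamically" means here: the same sentence contributes different training examples each epoch, unlike BERT's preprocessing, which fixed the masks once.
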
### Pretraining

The model was trained on 1024 V100 GPUs for 500K steps with a batch size of 8K and a sequence length of 512. The
optimizer used is Adam with a learning rate of 4e-4, \\(\beta_{1} = 0.9\\), \\(\beta_{2} = 0.98\\) and
\\(\epsilon = 1e-6\\), a weight decay of 0.01, learning rate warmup for 30,000 steps and linear decay of the learning
rate after.

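The schedule above (linear warmup to the 4e-4 peak over 30,000 steps, then linear decay) can be written as a small function. This is a sketch, not the authors' code, and the decay reaching exactly zero at the 500K-step mark is an assumption the card does not state explicitly:

```python
PEAK_LR = 4e-4
WARMUP_STEPS = 30_000
TOTAL_STEPS = 500_000

def learning_rate(step):
    """Linear warmup to PEAK_LR, then linear decay (assumed to reach 0 at TOTAL_STEPS)."""
    if step < WARMUP_STEPS:
        # ramp up linearly from 0 to the peak over the warmup phase
        return PEAK_LR * step / WARMUP_STEPS
    # decay linearly from the peak toward 0 at TOTAL_STEPS
    return PEAK_LR * (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS)
```

Warmup avoids destabilizing the randomly initialized network with the full learning rate, while the long decay lets training settle as the loss flattens.
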
## Evaluation results

When fine-tuned on downstream tasks, this model achieves the following results:

GLUE test results:

| Task | MNLI | QQP  | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE  |
|:----:|:----:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|
|      | 90.2 | 92.2 | 94.7 | 96.4  | 68.0 | 96.4  | 90.9 | 86.6 |

### BibTeX entry and citation info

```bibtex
@article{DBLP:journals/corr/abs-1907-11692,
  author    = {Yinhan Liu and
               Myle Ott and
               Naman Goyal and
               Jingfei Du and
               Mandar Joshi and
               Danqi Chen and
               Omer Levy and
               Mike Lewis and
               Luke Zettlemoyer and
               Veselin Stoyanov},
  title     = {RoBERTa: {A} Robustly Optimized {BERT} Pretraining Approach},
  journal   = {CoRR},
  volume    = {abs/1907.11692},
  year      = {2019},
  url       = {http://arxiv.org/abs/1907.11692},
  archivePrefix = {arXiv},
  eprint    = {1907.11692},
  timestamp = {Thu, 01 Aug 2019 08:59:33 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1907-11692.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
```

<a href="https://huggingface.co/exbert/?model=roberta-base">
    <img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
</a>