Create README.md
README.md
ADDED
|
@@ -0,0 +1,108 @@
| 1 |
+
---
|
| 2 |
+
license: apache-2.0
|
| 3 |
+
tags:
|
| 4 |
+
datasets:
|
| 5 |
+
- wikipedia
|
| 6 |
+
- c4
|
| 7 |
+
---
|
| 8 |
+
|
| 9 |
+
# Perceiver IO for language
|
| 10 |
+
|
| 11 |
+
Perceiver IO model pre-trained on the Masked Language Modeling (MLM) task proposed in [BERT](https://arxiv.org/abs/1810.04805) using a large text corpus obtained by combining [English Wikipedia](https://huggingface.co/datasets/wikipedia) and [C4](https://huggingface.co/datasets/c4). It was introduced in the paper [Perceiver IO: A General Architecture for Structured Inputs & Outputs](https://arxiv.org/abs/2107.14795) by Jaegle et al. and first released in [this repository](https://github.com/deepmind/deepmind-research/tree/master/perceiver).
|
| 12 |
+
|
| 13 |
+
Disclaimer: The team releasing Perceiver IO did not write a model card for this model, so this model card has been written by the Hugging Face team.
|
| 14 |
+
|
| 15 |
+
## Model description
|
| 16 |
+
|
| 17 |
+
Perceiver IO is a transformer encoder model that can be applied to any modality (text, images, audio, video, ...). The core idea is to apply the self-attention mechanism to a relatively small set of latent vectors, and to use the inputs only in cross-attention with those latents. As a result, the time and memory requirements of the self-attention mechanism do not depend on the size of the inputs.
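
The scaling argument can be made concrete with a rough count of attention-score computations. This is a minimal sketch, not the authors' code: the function name `attention_score_counts` and the example sizes are illustrative assumptions, and the count ignores feature dimensions, attention heads and the decoder.

```python
def attention_score_counts(num_inputs: int, num_latents: int, num_layers: int) -> dict:
    """Rough count of query-key score computations (illustrative only)."""
    # One cross-attention from the latents to the inputs: linear in input size.
    cross = num_latents * num_inputs
    # Self-attention among the latents in every layer: independent of input size.
    latent_self = num_layers * num_latents ** 2
    # For comparison: self-attention applied directly to the inputs.
    direct_self = num_layers * num_inputs ** 2
    return {"cross": cross, "latent_self": latent_self, "direct_self": direct_self}


# Hypothetical sizes, chosen only to show how each term scales.
short = attention_score_counts(num_inputs=512, num_latents=256, num_layers=26)
long = attention_score_counts(num_inputs=2048, num_latents=256, num_layers=26)
print(short)
print(long)
```

Quadrupling the input length leaves the latent self-attention term unchanged and quadruples the cross-attention term, whereas self-attention applied directly to the inputs would grow sixteenfold.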
|
| 18 |
+
|
| 19 |
+
To decode, the authors employ so-called decoder queries, which make it possible to flexibly decode the final hidden states of the latents into outputs of arbitrary size and semantics. For masked language modeling, the output is a tensor containing the prediction scores of the language modeling head, of shape (batch_size, seq_length, vocab_size).
|
| 20 |
+
|
| 21 |
+
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/perceiver_architecture.jpg" alt="drawing" width="600"/>
|
| 22 |
+
|
| 23 |
+
<small> Perceiver IO architecture.</small>
|
| 24 |
+
|
| 25 |
+
As the time and memory requirements of the self-attention mechanism don't depend on the size of the inputs, the Perceiver IO authors train the model directly on raw UTF-8 bytes, rather than on subwords as is done in models like BERT, RoBERTa and GPT-2. This has several benefits: there is no need to train a tokenizer before training the model or to maintain a (fixed) vocabulary file, and model performance is not hurt, as shown by [Bostrom et al., 2020](https://arxiv.org/abs/2004.03720).
|
| 26 |
+
|
| 27 |
+
Through pre-training, the model learns an inner representation of language that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard
|
| 28 |
+
classifier using the features produced by the Perceiver model as inputs.
|
| 29 |
+
|
| 30 |
+
## Intended uses & limitations
|
| 31 |
+
|
| 32 |
+
You can use the raw model for masked language modeling, but the model is intended to be fine-tuned on a labeled dataset. See the [model hub](https://huggingface.co/models?search=deepmind/perceiver) to look for fine-tuned versions on a task that interests you.
|
| 33 |
+
|
| 34 |
+
### How to use
|
| 35 |
+
|
| 36 |
+
Here is how to use this model in PyTorch:
|
| 37 |
+
|
| 38 |
+
```python
|
| 39 |
+
from transformers import PerceiverTokenizer, PerceiverForMaskedLM
|
| 40 |
+
|
| 41 |
+
tokenizer = PerceiverTokenizer.from_pretrained("deepmind/language-perceiver")
|
| 42 |
+
model = PerceiverForMaskedLM.from_pretrained("deepmind/language-perceiver")
|
| 43 |
+
|
| 44 |
+
text = "This is an incomplete sentence where some words are missing."
|
| 45 |
+
# prepare input
|
| 46 |
+
encoding = tokenizer(text, padding="max_length", return_tensors="pt")
|
| 47 |
+
# mask " missing.". Note that the model performs much better if the masked span starts with a space.
|
| 48 |
+
encoding.input_ids[0, 52:61] = tokenizer.mask_token_id
|
| 49 |
+
inputs, input_mask = encoding.input_ids.to(model.device), encoding.attention_mask.to(model.device)
|
| 50 |
+
|
| 51 |
+
# forward pass
|
| 52 |
+
outputs = model(inputs=inputs, attention_mask=input_mask)
|
| 53 |
+
logits = outputs.logits
|
| 54 |
+
masked_tokens_predictions = logits[0, 52:61].argmax(dim=-1)
|
| 55 |
+
print(tokenizer.decode(masked_tokens_predictions))
|
| 56 |
+
# should print " missing."
|
| 57 |
+
```
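
The hard-coded slice `52:61` in the snippet above can be derived rather than counted by hand. The following is a minimal sketch under two assumptions that are not stated in this card: the tokenizer emits one token ID per UTF-8 byte, and it prepends exactly one special token. The helper name `byte_span_to_token_slice` is hypothetical.

```python
def byte_span_to_token_slice(text: str, span: str, num_prefix_tokens: int = 1) -> slice:
    """Map the first occurrence of `span` in `text` to token positions.

    Assumes one token ID per UTF-8 byte, preceded by `num_prefix_tokens`
    special tokens (an assumption, not read from the tokenizer).
    """
    # Number of bytes that come before the span.
    prefix_bytes = len(text[: text.index(span)].encode("utf-8"))
    start = num_prefix_tokens + prefix_bytes
    return slice(start, start + len(span.encode("utf-8")))


text = "This is an incomplete sentence where some words are missing."
span = byte_span_to_token_slice(text, " missing.")
print(span.start, span.stop)  # 52 61
```

Under those assumptions the result matches the slice used for masking above, and the same slice should be used to read the predictions back from the logits.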
|
| 58 |
+
|
| 59 |
+
## Training data
|
| 60 |
+
|
| 61 |
+
This model was pretrained on a combination of [English Wikipedia](https://huggingface.co/datasets/wikipedia) and [C4](https://huggingface.co/datasets/c4). 70% of the training tokens were sampled from the C4 dataset and the remaining 30% from Wikipedia. The authors concatenate 10 documents before splitting into crops to reduce wasteful computation on padding tokens.
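
The packing step can be sketched as follows. This is an illustration of the idea only, not the authors' pipeline: the function names are hypothetical, documents are joined without a separator (the card does not say how they are delimited), and the crop length of 2048 bytes is assumed here for the example.

```python
from typing import Iterable, Iterator, List


def _crops(buffer: bytes, crop_len: int) -> Iterator[bytes]:
    # Split one concatenated buffer into consecutive fixed-size crops.
    for start in range(0, len(buffer), crop_len):
        yield buffer[start:start + crop_len]


def pack_and_crop(docs: Iterable[bytes], docs_per_group: int = 10,
                  crop_len: int = 2048) -> Iterator[bytes]:
    """Concatenate documents in groups, then split each group into crops."""
    group: List[bytes] = []
    for doc in docs:
        group.append(doc)
        if len(group) == docs_per_group:
            yield from _crops(b"".join(group), crop_len)
            group = []
    if group:  # leftover documents that do not fill a whole group
        yield from _crops(b"".join(group), crop_len)


# Ten short documents of 300 bytes each.
docs = [b"a" * 300 for _ in range(10)]
crops = list(pack_and_crop(docs))
print([len(c) for c in crops])  # [2048, 952]
```

Without packing, each 300-byte document would occupy its own 2048-byte crop and need 1748 padding positions; after packing, only the last crop needs padding.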
|
| 62 |
+
|
| 63 |
+
## Training procedure
|
| 64 |
+
|
| 65 |
+
### Preprocessing
|
| 66 |
+
|
| 67 |
+
Text preprocessing is trivial: it only involves encoding text into UTF-8 bytes, and padding them up to the same length (2048).
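
A minimal sketch of this preprocessing in plain Python is shown below. It is not the `PerceiverTokenizer` implementation: the function name, the padding ID of 0 and the truncation of over-long inputs are assumptions made for illustration, and special tokens are omitted entirely.

```python
from typing import List, Tuple


def encode_and_pad(text: str, max_length: int = 2048,
                   pad_id: int = 0) -> Tuple[List[int], List[int]]:
    """Encode text as UTF-8 byte values and pad to a fixed length."""
    # One integer in [0, 255] per byte; truncate if the text is too long.
    byte_ids = list(text.encode("utf-8"))[:max_length]
    num_pad = max_length - len(byte_ids)
    # The attention mask marks real bytes with 1 and padding with 0.
    attention_mask = [1] * len(byte_ids) + [0] * num_pad
    input_ids = byte_ids + [pad_id] * num_pad
    return input_ids, attention_mask


input_ids, attention_mask = encode_and_pad("héllo")
print(input_ids[:6])        # [104, 195, 169, 108, 108, 111]
print(sum(attention_mask))  # 6
```

Note that "é" occupies two bytes, so the five-character string yields six positions. In this simplified sketch the padding ID coincides with the byte value 0, so the attention mask is what distinguishes padding from real input.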
|
| 68 |
+
|
| 69 |
+
### Pretraining
|
| 70 |
+
|
| 71 |
+
Hyperparameter details can be found in Table 9 of the [paper](https://arxiv.org/abs/2107.14795).
|
| 72 |
+
|
| 73 |
+
## Evaluation results
|
| 74 |
+
|
| 75 |
+
This model achieves an average score of 81.8 on GLUE. For more details, see Table 3 of the original paper.
|
| 76 |
+
|
| 77 |
+
### BibTeX entry and citation info
|
| 78 |
+
|
| 79 |
+
```bibtex
|
| 80 |
+
@article{DBLP:journals/corr/abs-2107-14795,
|
| 81 |
+
author = {Andrew Jaegle and
|
| 82 |
+
Sebastian Borgeaud and
|
| 83 |
+
Jean{-}Baptiste Alayrac and
|
| 84 |
+
Carl Doersch and
|
| 85 |
+
Catalin Ionescu and
|
| 86 |
+
David Ding and
|
| 87 |
+
Skanda Koppula and
|
| 88 |
+
Daniel Zoran and
|
| 89 |
+
Andrew Brock and
|
| 90 |
+
Evan Shelhamer and
|
| 91 |
+
Olivier J. H{\'{e}}naff and
|
| 92 |
+
Matthew M. Botvinick and
|
| 93 |
+
Andrew Zisserman and
|
| 94 |
+
Oriol Vinyals and
|
| 95 |
+
Jo{\~{a}}o Carreira},
|
| 96 |
+
title = {Perceiver {IO:} {A} General Architecture for Structured Inputs {\&}
|
| 97 |
+
Outputs},
|
| 98 |
+
journal = {CoRR},
|
| 99 |
+
volume = {abs/2107.14795},
|
| 100 |
+
year = {2021},
|
| 101 |
+
url = {https://arxiv.org/abs/2107.14795},
|
| 102 |
+
eprinttype = {arXiv},
|
| 103 |
+
eprint = {2107.14795},
|
| 104 |
+
timestamp = {Tue, 03 Aug 2021 14:53:34 +0200},
|
| 105 |
+
biburl = {https://dblp.org/rec/journals/corr/abs-2107-14795.bib},
|
| 106 |
+
bibsource = {dblp computer science bibliography, https://dblp.org}
|
| 107 |
+
}
|
| 108 |
+
```
|