| language: en | |
| tags: | |
| - fnet | |
| license: apache-2.0 | |
| datasets: | |
| - c4 | |
| # FNet base model | |
| Pretrained model on the English language using masked language modeling (MLM) and next sentence prediction (NSP) objectives. It was | |
| introduced in [this paper](https://arxiv.org/abs/2105.03824) and first released in [this repository](https://github.com/google-research/google-research/tree/master/f_net). | |
| This model is cased: it distinguishes between english and English. The model achieves 0.58 accuracy on the MLM objective and 0.80 on the NSP objective. | |
| Disclaimer: This model card has been written by [gchhablani](https://huggingface.co/gchhablani). | |
| ## Model description | |
| FNet is a transformer model in which the self-attention sublayers are replaced with Fourier transforms. Hence, the inputs do not contain an `attention_mask`. It is pretrained on a large corpus of | |
| English data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labelling | |
| them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and | |
| labels from those texts. More precisely, it was pretrained with two objectives: | |
| - Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input, then runs | |
| the entire masked sentence through the model and has to predict the masked words. This is different from traditional | |
| recurrent neural networks (RNNs) that usually see the words one after the other, or from autoregressive models like | |
| GPT which internally mask the future tokens. It allows the model to learn a bidirectional representation of the | |
| sentence. | |
| - Next sentence prediction (NSP): the model concatenates two masked sentences as inputs during pretraining. Sometimes | |
| they correspond to sentences that were next to each other in the original text, sometimes not. The model then has to | |
| predict if the two sentences were following each other or not. | |
| This way, the model learns an inner representation of the English language that can then be used to extract features | |
| useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard | |
| classifier using the features produced by the FNet model as inputs. | |
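The Fourier mixing mentioned at the start of this section can be illustrated with a small sketch. As we read the paper, each mixing sublayer applies a discrete Fourier transform along the hidden dimension and another along the sequence dimension, and keeps only the real part. The code below is a pure-Python illustration with a naive O(n^2) DFT; the function names `dft` and `fourier_mix` are made up for this example and are not part of the `transformers` API, and real implementations use an FFT on tensors instead.

```python
import cmath


def dft(values):
    """Naive O(n^2) discrete Fourier transform of a 1D sequence."""
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * cmath.pi * k * t / n) for t, v in enumerate(values))
        for k in range(n)
    ]


def fourier_mix(x):
    """Mix a (seq_len x hidden) matrix: DFT over hidden, then over sequence, keep the real part."""
    # Transform each token's hidden vector (one row per token).
    rows = [dft(row) for row in x]
    # Transform along the sequence dimension (one column per hidden index).
    cols = [dft([row[j] for row in rows]) for j in range(len(rows[0]))]
    # cols[j][i] is the value for position i and hidden index j: transpose back, drop the imaginary part.
    return [[cols[j][i].real for j in range(len(cols))] for i in range(len(rows))]


# Toy input: 3 tokens with hidden size 2.
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
mixed = fourier_mix(x)
print(mixed[0])
```

Every output entry depends on every input entry, which is how tokens get mixed without attention weights and without an `attention_mask`.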
| ## Intended uses & limitations | |
| You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to | |
| be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=fnet) to look for | |
| fine-tuned versions on a task that interests you. | |
| Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) | |
| to make decisions, such as sequence classification, token classification or question answering. For tasks such as text | |
| generation, you should look at a model like GPT-2. | |
| ## Training data | |
| The FNet model was pretrained on [C4](https://huggingface.co/datasets/c4), a cleaned version of the Common Crawl dataset. | |
| ## Training procedure | |
| ### Preprocessing | |
| The texts are lowercased and tokenized using SentencePiece and a vocabulary size of 32,000. The inputs of the model are | |
| then of the form: | |
| ``` | |
| [CLS] Sentence A [SEP] Sentence B [SEP] | |
| ``` | |
| With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus; in | |
| the other cases, sentence B is another random sentence from the corpus. Note that what is considered a sentence here is a | |
| consecutive span of text usually longer than a single sentence. The only constraint is that the two | |
| "sentences" have a combined length of less than 512 tokens. | |
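The pair construction described above can be sketched as follows. This is a minimal illustration, not the original preprocessing code: `make_nsp_example` is a hypothetical helper, the corpus is a plain list of text spans, and the sketch assumes that the random span is drawn from spans other than A and its true successor. The 512-token length constraint is not enforced here.

```python
import random


def make_nsp_example(segments, index, rng=None):
    """Build one '[CLS] A [SEP] B [SEP]' input and its is-next label.

    Assumes index < len(segments) - 1 and at least three segments.
    """
    rng = rng or random.Random()
    first = segments[index]
    if rng.random() < 0.5:
        # Positive example: B is the span that really follows A.
        second, is_next = segments[index + 1], True
    else:
        # Negative example: B is a random span (assumption: never A or its true successor).
        candidates = [j for j in range(len(segments)) if j not in (index, index + 1)]
        second, is_next = segments[rng.choice(candidates)], False
    return f"[CLS] {first} [SEP] {second} [SEP]", is_next


segments = ["The cat sat down.", "It fell asleep.", "Stocks rose today.", "Rain is expected."]
text, is_next = make_nsp_example(segments, 0, rng=random.Random(0))
print(text, is_next)
```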
| The details of the masking procedure for each sentence are the following: | |
| - 15% of the tokens are masked. | |
| - In 80% of the cases, the masked tokens are replaced by `[MASK]`. | |
| - In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace). | |
| - In the 10% remaining cases, the masked tokens are left as is. | |
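The masking rules listed above can be sketched in a few lines. This is an illustrative approximation rather than the original pipeline: `mask_tokens` is a hypothetical helper, each token is selected independently with probability 0.15 (so the masked fraction is 15% only on average), and skipping `[CLS]` and `[SEP]` is an assumption.

```python
import random


def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_prob=0.15, rng=None):
    """Return (masked_tokens, labels); labels hold the original token where one was selected."""
    rng = rng or random.Random()
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Assumption: special tokens are never selected for masking.
        if tok in ("[CLS]", "[SEP]") or rng.random() >= mask_prob:
            continue
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            # 80% of selected tokens: replace with the mask token.
            masked[i] = mask_token
        elif roll < 0.9:
            # 10%: replace with a random token different from the original.
            masked[i] = rng.choice([v for v in vocab if v != tok])
        # Remaining 10%: leave the token unchanged.
    return masked, labels


tokens = ["[CLS]", "the", "cat", "sat", "[SEP]", "on", "the", "mat", "[SEP]"]
vocab = ["the", "cat", "sat", "on", "mat", "dog"]
masked, labels = mask_tokens(tokens, vocab, rng=random.Random(0))
print(masked)
```

The model is trained to predict the original token at every position where `labels` is not `None`, including the positions that were left unchanged.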
| ### Pretraining | |
| FNet-base was trained on 4 cloud TPUs in Pod configuration (16 TPU chips total) for one million steps with a batch size | |
| of 256. The sequence length was limited to 512 tokens. The optimizer | |
| used is Adam with a learning rate of 1e-4, \\(\beta_{1} = 0.9\\) and \\(\beta_{2} = 0.999\\), a weight decay of 0.01, | |
| learning rate warmup for 10,000 steps and linear decay of the learning rate after. | |
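The learning rate schedule described above can be sketched as a function of the training step. The peak rate, warmup length and total step count come from the text; the linear shape of the warmup and the decay ending at exactly zero on the final step are assumptions, and `learning_rate` is a hypothetical helper rather than the original training code.

```python
def learning_rate(step, peak_lr=1e-4, warmup_steps=10_000, total_steps=1_000_000):
    """Warmup to peak_lr, then linear decay (assumed linear warmup, decay to zero)."""
    if step < warmup_steps:
        # Assumption: the warmup rises linearly from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    # Assumption: the decay reaches 0 at the last training step.
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


for step in (0, 5_000, 10_000, 505_000, 1_000_000):
    print(step, learning_rate(step))
```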
| ## Evaluation results | |
| FNet-base was fine-tuned and evaluated on the validation data of the [GLUE benchmark](https://huggingface.co/datasets/glue). The results of the official model (written in Flax) can be seen in Table 1 on page 7 of [the official paper](https://arxiv.org/abs/2105.03824). | |
| For comparison, this model (ported to PyTorch) was fine-tuned and evaluated using the [official Hugging Face GLUE evaluation scripts](https://github.com/huggingface/transformers/tree/master/examples/pytorch/text-classification#glue-tasks) alongside [bert-base-cased](https://hf.co/models/bert-base-cased). | |
| The training was done on a single 16GB NVIDIA Tesla V100 GPU. For MRPC/WNLI, the models were trained for 5 epochs, while for other tasks, the models were trained for 3 epochs. A sequence length of 512 was used with batch size 16 and learning rate 2e-5. | |
| The following table summarizes the results for [fnet-base](https://huggingface.co/google/fnet-base) (called *FNet (PyTorch) - Reproduced*) and [bert-base-cased](https://hf.co/models/bert-base-cased) (called *Bert (PyTorch) - Reproduced*) in terms of **fine-tuning** speed. The format is *hours:minutes:seconds*. **Note** that the authors compared **pre-training** speed in [the official paper](https://arxiv.org/abs/2105.03824) instead. | |
| | Task/Model | FNet-base (PyTorch) |Bert-base (PyTorch)| | |
| |:----:|:-----------:|:----:| | |
| | MNLI-(m/mm) | [06:40:55](https://huggingface.co/gchhablani/fnet-base-finetuned-mnli) | [09:52:33](https://huggingface.co/gchhablani/bert-base-cased-finetuned-mnli)| | |
| | QQP | [06:21:16](https://huggingface.co/gchhablani/fnet-base-finetuned-qqp) | [09:25:01](https://huggingface.co/gchhablani/bert-base-cased-finetuned-qqp) | | |
| | QNLI | [01:48:22](https://huggingface.co/gchhablani/fnet-base-finetuned-qnli) | [02:40:22](https://huggingface.co/gchhablani/bert-base-cased-finetuned-qnli)| | |
| | SST-2 | [01:09:27](https://huggingface.co/gchhablani/fnet-base-finetuned-sst2) | [01:42:17](https://huggingface.co/gchhablani/bert-base-cased-finetuned-sst2)| | |
| | CoLA | [00:09:47](https://huggingface.co/gchhablani/fnet-base-finetuned-cola) | [00:14:20](https://huggingface.co/gchhablani/bert-base-cased-finetuned-cola)| | |
| | STS-B | [00:07:09](https://huggingface.co/gchhablani/fnet-base-finetuned-stsb) | [00:10:24](https://huggingface.co/gchhablani/bert-base-cased-finetuned-stsb)| | |
| | MRPC | [00:07:48](https://huggingface.co/gchhablani/fnet-base-finetuned-mrpc) | [00:11:12](https://huggingface.co/gchhablani/bert-base-cased-finetuned-mrpc)| | |
| | RTE | [00:03:24](https://huggingface.co/gchhablani/fnet-base-finetuned-rte) | [00:04:51](https://huggingface.co/gchhablani/bert-base-cased-finetuned-rte)| | |
| | WNLI | [00:02:37](https://huggingface.co/gchhablani/fnet-base-finetuned-wnli) | [00:03:23](https://huggingface.co/gchhablani/bert-base-cased-finetuned-wnli)| | |
| | SUM | 16:30:45 | 24:23:56 | | |
| On average the PyTorch version of FNet-base requires *ca.* 32% less time for GLUE fine-tuning on GPU. | |
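The figure of roughly 32% can be checked from the SUM row of the table above. This is a small arithmetic sketch; `to_seconds` is a helper written for this example.

```python
def to_seconds(hms):
    """Convert an 'hours:minutes:seconds' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Totals taken from the SUM row of the fine-tuning time table.
fnet_total = to_seconds("16:30:45")
bert_total = to_seconds("24:23:56")

# Fraction of BERT's fine-tuning time that FNet saves.
saving = 1 - fnet_total / bert_total
print(f"FNet-base needs {saving:.1%} less time than BERT-base")
```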
| The following table summarizes the results for [fnet-base](https://huggingface.co/google/fnet-base) (called *FNet (PyTorch) - Reproduced*) and [bert-base-cased](https://hf.co/models/bert-base-cased) (called *Bert (PyTorch) - Reproduced*) in terms of performance and compares them with the reported performance of the official FNet-base model (called *FNet (Flax) - Official*). Note that the training hyperparameters of the reproduced models were not the same as those of the official model, so the performance may differ significantly for some tasks (for example, CoLA). | |
| | Task/Model | Metric | FNet-base (PyTorch) | Bert-base (PyTorch) | FNet-Base (Flax - official) | | |
| |:----:|:-----------:|:----:|:-----------:|:----:| | |
| | MNLI-(m/mm) | Accuracy or Match/Mismatch | [76.75](https://huggingface.co/gchhablani/fnet-base-finetuned-mnli) | [84.10](https://huggingface.co/gchhablani/bert-base-cased-finetuned-mnli) | 72/73 | | |
| | QQP | mean(Accuracy,F1) | [86.5](https://huggingface.co/gchhablani/fnet-base-finetuned-qqp) | [89.26](https://huggingface.co/gchhablani/bert-base-cased-finetuned-qqp) | 83 | | |
| | QNLI | Accuracy | [84.39](https://huggingface.co/gchhablani/fnet-base-finetuned-qnli) | [90.99](https://huggingface.co/gchhablani/bert-base-cased-finetuned-qnli) | 80 | | |
| | SST-2 | Accuracy | [89.45](https://huggingface.co/gchhablani/fnet-base-finetuned-sst2) | [92.32](https://huggingface.co/gchhablani/bert-base-cased-finetuned-sst2) | 95 | | |
| | CoLA | Matthews corr or Accuracy | [35.94](https://huggingface.co/gchhablani/fnet-base-finetuned-cola) | [59.57](https://huggingface.co/gchhablani/bert-base-cased-finetuned-cola) | 69 | | |
| | STS-B | Spearman corr. | [82.19](https://huggingface.co/gchhablani/fnet-base-finetuned-stsb) | [88.98](https://huggingface.co/gchhablani/bert-base-cased-finetuned-stsb) | 79 | | |
| | MRPC | mean(F1/Accuracy) | [81.15](https://huggingface.co/gchhablani/fnet-base-finetuned-mrpc) | [88.15](https://huggingface.co/gchhablani/bert-base-cased-finetuned-mrpc) | 76 | | |
| | RTE | Accuracy | [62.82](https://huggingface.co/gchhablani/fnet-base-finetuned-rte) | [67.15](https://huggingface.co/gchhablani/bert-base-cased-finetuned-rte) | 63 | | |
| | WNLI | Accuracy | [54.93](https://huggingface.co/gchhablani/fnet-base-finetuned-wnli) | [46.48](https://huggingface.co/gchhablani/bert-base-cased-finetuned-wnli) | - | | |
| | Avg | - | 72.7 | 78.6 | 76.7 | | |
| We can see that FNet-base achieves around 93% of BERT-base's performance on average. | |
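The averages in the last row and the ratio quoted above can be reproduced from the per-task scores of the two PyTorch models. This is a plain arithmetic check using the numbers from the table.

```python
# Per-task scores from the table, in row order: MNLI, QQP, QNLI, SST-2, CoLA, STS-B, MRPC, RTE, WNLI.
fnet = [76.75, 86.5, 84.39, 89.45, 35.94, 82.19, 81.15, 62.82, 54.93]
bert = [84.10, 89.26, 90.99, 92.32, 59.57, 88.98, 88.15, 67.15, 46.48]

fnet_avg = sum(fnet) / len(fnet)
bert_avg = sum(bert) / len(bert)

# Share of BERT-base's average score that FNet-base reaches.
ratio = fnet_avg / bert_avg
print(f"FNet avg: {fnet_avg:.1f}, BERT avg: {bert_avg:.1f}, ratio: {ratio:.1%}")
```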
| For more details, please refer to the checkpoints linked with the scores. An overview of all fine-tuned checkpoints in the table above can be accessed [here](https://huggingface.co/models?other=fnet-bert-base-comparison). | |
| ### How to use | |
| You can use this model directly with a pipeline for masked language modeling: | |
| **Note: The fill-mask pipeline does not reproduce the original behavior exactly, because the original model performs masking after the text has been converted to tokens. In the pipeline, an additional space is added after the [MASK].** | |
| ```python | |
| >>> from transformers import FNetForMaskedLM, FNetTokenizer, pipeline | |
| >>> tokenizer = FNetTokenizer.from_pretrained("google/fnet-base") | |
| >>> model = FNetForMaskedLM.from_pretrained("google/fnet-base") | |
| >>> unmasker = pipeline('fill-mask', model=model, tokenizer=tokenizer) | |
| >>> unmasker("Hello I'm a [MASK] model.") | |
| [ | |
| {"sequence": "hello i'm a new model.", "score": 0.12073223292827606, "token": 351, "token_str": "new"}, | |
| {"sequence": "hello i'm a first model.", "score": 0.08501081168651581, "token": 478, "token_str": "first"}, | |
| {"sequence": "hello i'm a next model.", "score": 0.060546260327100754, "token": 1037, "token_str": "next"}, | |
| {"sequence": "hello i'm a last model.", "score": 0.038265593349933624, "token": 813, "token_str": "last"}, | |
| {"sequence": "hello i'm a sister model.", "score": 0.033868927508592606, "token": 6232, "token_str": "sister"}, | |
| ] | |
| ``` | |
| Here is how to use this model to get the features of a given text in PyTorch: | |
| **Note: You must set the maximum sequence length to 512 and truncate/pad to that length, because the original model has no attention mask and considers all the hidden states during the forward pass.** | |
| ```python | |
| from transformers import FNetTokenizer, FNetModel | |
| tokenizer = FNetTokenizer.from_pretrained("google/fnet-base") | |
| model = FNetModel.from_pretrained("google/fnet-base") | |
| text = "Replace me by any text you'd like." | |
| encoded_input = tokenizer(text, return_tensors='pt', padding='max_length', truncation=True, max_length=512) | |
| output = model(**encoded_input) | |
| ``` | |
| ### BibTeX entry and citation info | |
| ```bibtex | |
| @article{DBLP:journals/corr/abs-2105-03824, | |
| author = {James Lee{-}Thorp and | |
| Joshua Ainslie and | |
| Ilya Eckstein and | |
| Santiago Onta{\~{n}}{\'{o}}n}, | |
| title = {FNet: Mixing Tokens with Fourier Transforms}, | |
| journal = {CoRR}, | |
| volume = {abs/2105.03824}, | |
| year = {2021}, | |
| url = {https://arxiv.org/abs/2105.03824}, | |
| archivePrefix = {arXiv}, | |
| eprint = {2105.03824}, | |
| timestamp = {Fri, 14 May 2021 12:13:30 +0200}, | |
| biburl = {https://dblp.org/rec/journals/corr/abs-2105-03824.bib}, | |
| bibsource = {dblp computer science bibliography, https://dblp.org} | |
| } | |
| ``` | |
| ## Contributions | |
| Thanks to [@gchhablani](https://huggingface.co/gchhablani) for adding this model. |