| # BERT | |
| **\*\*\*\*\* New March 11th, 2020: Smaller BERT Models \*\*\*\*\*** | |
| This is a release of 24 smaller BERT models (English only, uncased, trained with WordPiece masking) referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). | |
| We have shown that the standard BERT recipe (including model architecture and training objective) is effective on a wide range of model sizes, beyond BERT-Base and BERT-Large. The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. | |
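As a generic illustration of that distillation setup (not the exact recipe from the paper), the sketch below computes a soft-label distillation loss in which the teacher's temperature-softened predictions serve as fine-tuning targets for the student; the logits and the temperature value are hypothetical stand-ins for real model outputs.

```python
# Generic soft-label distillation sketch; `teacher_logits` / `student_logits`
# stand in for the per-example outputs of a large fine-tuned teacher and a
# compact student on the same batch.
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    # Cross-entropy between the teacher's soft labels and the student's predictions.
    teacher_probs = softmax(teacher_logits, temperature)
    student_log_probs = np.log(softmax(student_logits, temperature))
    return -(teacher_probs * student_log_probs).sum(axis=-1).mean()

# Toy batch: 2 examples, 3 classes.
teacher_logits = np.array([[4.0, 1.0, 0.5], [0.2, 3.5, 0.1]])
student_logits = np.array([[2.0, 1.5, 0.5], [0.5, 2.0, 0.3]])
print(distillation_loss(student_logits, teacher_logits, temperature=2.0))
```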
| Our goal is to enable research in institutions with fewer computational resources and encourage the community to seek directions of innovation alternative to increasing model capacity. | |
| You can download all 24 from [here][all], or individually from the table below: | |
| | |H=128|H=256|H=512|H=768| | |
| |---|:---:|:---:|:---:|:---:| | |
| | **L=2** |[**2/128 (BERT-Tiny)**][2_128]|[2/256][2_256]|[2/512][2_512]|[2/768][2_768]| | |
| | **L=4** |[4/128][4_128]|[**4/256 (BERT-Mini)**][4_256]|[**4/512 (BERT-Small)**][4_512]|[4/768][4_768]| | |
| | **L=6** |[6/128][6_128]|[6/256][6_256]|[6/512][6_512]|[6/768][6_768]| | |
| | **L=8** |[8/128][8_128]|[8/256][8_256]|[**8/512 (BERT-Medium)**][8_512]|[8/768][8_768]| | |
| | **L=10** |[10/128][10_128]|[10/256][10_256]|[10/512][10_512]|[10/768][10_768]| | |
| | **L=12** |[12/128][12_128]|[12/256][12_256]|[12/512][12_512]|[**12/768 (BERT-Base)**][12_768]| | |
| Note that the BERT-Base model in this release is included for completeness only; it was re-trained under the same regime as the original model. | |
| Here are the corresponding GLUE scores on the test set: | |
| |Model|Score|CoLA|SST-2|MRPC|STS-B|QQP|MNLI-m|MNLI-mm|QNLI(v2)|RTE|WNLI|AX| | |
| |---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:| | |
| |BERT-Tiny|64.2|0.0|83.2|81.1/71.1|74.3/73.6|62.2/83.4|70.2|70.3|81.5|57.2|62.3|21.0| | |
| |BERT-Mini|65.8|0.0|85.9|81.1/71.8|75.4/73.3|66.4/86.2|74.8|74.3|84.1|57.9|62.3|26.1| | |
| |BERT-Small|71.2|27.8|89.7|83.4/76.2|78.8/77.0|68.1/87.0|77.6|77.0|86.4|61.8|62.3|28.6| | |
| |BERT-Medium|73.5|38.0|89.6|86.6/81.6|80.4/78.4|69.6/87.9|80.0|79.1|87.7|62.2|62.3|30.5| | |
| For each task, we selected the best fine-tuning hyperparameters from the lists below, and trained for 4 epochs: | |
| - batch sizes: 8, 16, 32, 64, 128 | |
| - learning rates: 3e-4, 1e-4, 5e-5, 3e-5 | |
| If you use these models, please cite the following paper: | |
| ``` | |
| @article{turc2019, | |
| title={Well-Read Students Learn Better: On the Importance of Pre-training Compact Models}, | |
| author={Turc, Iulia and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina}, | |
  journal={arXiv preprint arXiv:1908.08962v2},
| year={2019} | |
| } | |
| ``` | |
| [2_128]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-2_H-128_A-2.zip | |
| [2_256]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-2_H-256_A-4.zip | |
| [2_512]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-2_H-512_A-8.zip | |
| [2_768]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-2_H-768_A-12.zip | |
| [4_128]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-4_H-128_A-2.zip | |
| [4_256]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-4_H-256_A-4.zip | |
| [4_512]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-4_H-512_A-8.zip | |
| [4_768]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-4_H-768_A-12.zip | |
| [6_128]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-6_H-128_A-2.zip | |
| [6_256]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-6_H-256_A-4.zip | |
| [6_512]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-6_H-512_A-8.zip | |
| [6_768]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-6_H-768_A-12.zip | |
| [8_128]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-8_H-128_A-2.zip | |
| [8_256]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-8_H-256_A-4.zip | |
| [8_512]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-8_H-512_A-8.zip | |
| [8_768]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-8_H-768_A-12.zip | |
| [10_128]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-10_H-128_A-2.zip | |
| [10_256]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-10_H-256_A-4.zip | |
| [10_512]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-10_H-512_A-8.zip | |
| [10_768]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-10_H-768_A-12.zip | |
| [12_128]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-12_H-128_A-2.zip | |
| [12_256]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-12_H-256_A-4.zip | |
| [12_512]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-12_H-512_A-8.zip | |
| [12_768]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-12_H-768_A-12.zip | |
| [all]: https://storage.googleapis.com/bert_models/2020_02_20/all_bert_models.zip | |
| **\*\*\*\*\* New May 31st, 2019: Whole Word Masking Models \*\*\*\*\*** | |
This is a release of several new models which were the result of an improvement
to the pre-processing code.
| In the original pre-processing code, we randomly select WordPiece tokens to | |
| mask. For example: | |
| `Input Text: the man jumped up , put his basket on phil ##am ##mon ' s head` | |
| `Original Masked Input: [MASK] man [MASK] up , put his [MASK] on phil | |
| [MASK] ##mon ' s head` | |
The new technique is called Whole Word Masking. In this case, we always mask
*all* of the tokens corresponding to a word at once. The overall masking
rate remains the same.
| `Whole Word Masked Input: the man [MASK] up , put his basket on [MASK] [MASK] | |
| [MASK] ' s head` | |
| The training is identical -- we still predict each masked WordPiece token | |
| independently. The improvement comes from the fact that the original prediction | |
| task was too 'easy' for words that had been split into multiple WordPieces. | |
| This can be enabled during data generation by passing the flag | |
| `--do_whole_word_mask=True` to `create_pretraining_data.py`. | |
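For illustration only (this is not the repository's implementation), here is a minimal sketch of the idea: WordPieces starting with `##` are grouped with the piece before them, and all pieces of a selected word are masked together.

```python
# Toy whole word masking: group WordPieces into words, then mask whole words
# until roughly `mask_prob` of the tokens are masked.
import random

def whole_word_mask(tokens, mask_prob=0.15, seed=0):
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)   # continuation piece of the previous word
        else:
            words.append([i])     # start of a new word
    rng = random.Random(seed)
    num_to_mask = max(1, round(mask_prob * len(tokens)))
    masked, n_masked = list(tokens), 0
    for word in rng.sample(words, len(words)):   # visit words in random order
        if n_masked >= num_to_mask:
            break
        for i in word:
            masked[i] = "[MASK]"
        n_masked += len(word)
    return masked

tokens = "the man jumped up , put his basket on phil ##am ##mon ' s head".split()
print(" ".join(whole_word_mask(tokens)))
```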
| Pre-trained models with Whole Word Masking are linked below. The data and | |
| training were otherwise identical, and the models have identical structure and | |
| vocab to the original models. We only include BERT-Large models. When using | |
| these models, please make it clear in the paper that you are using the Whole | |
| Word Masking variant of BERT-Large. | |
| * **[`BERT-Large, Uncased (Whole Word Masking)`](https://storage.googleapis.com/bert_models/2019_05_30/wwm_uncased_L-24_H-1024_A-16.zip)**: | |
| 24-layer, 1024-hidden, 16-heads, 340M parameters | |
| * **[`BERT-Large, Cased (Whole Word Masking)`](https://storage.googleapis.com/bert_models/2019_05_30/wwm_cased_L-24_H-1024_A-16.zip)**: | |
| 24-layer, 1024-hidden, 16-heads, 340M parameters | |
Model | SQuAD 1.1 F1/EM | MultiNLI Accuracy
| ---------------------------------------- | :-------------: | :----------------: | |
| BERT-Large, Uncased (Original) | 91.0/84.3 | 86.05 | |
| BERT-Large, Uncased (Whole Word Masking) | 92.8/86.7 | 87.07 | |
| BERT-Large, Cased (Original) | 91.5/84.8 | 86.09 | |
| BERT-Large, Cased (Whole Word Masking) | 92.9/86.7 | 86.46 | |
| **\*\*\*\*\* New February 7th, 2019: TfHub Module \*\*\*\*\*** | |
| BERT has been uploaded to [TensorFlow Hub](https://tfhub.dev). See | |
| `run_classifier_with_tfhub.py` for an example of how to use the TF Hub module, | |
| or run an example in the browser on | |
| [Colab](https://colab.sandbox.google.com/github/google-research/bert/blob/master/predicting_movie_reviews_with_bert_on_tf_hub.ipynb). | |
| **\*\*\*\*\* New November 23rd, 2018: Un-normalized multilingual model + Thai + | |
| Mongolian \*\*\*\*\*** | |
| We uploaded a new multilingual model which does *not* perform any normalization | |
| on the input (no lower casing, accent stripping, or Unicode normalization), and | |
additionally includes Thai and Mongolian.
| **It is recommended to use this version for developing multilingual models, | |
| especially on languages with non-Latin alphabets.** | |
| This does not require any code changes, and can be downloaded here: | |
| * **[`BERT-Base, Multilingual Cased`](https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip)**: | |
| 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters | |
| **\*\*\*\*\* New November 15th, 2018: SOTA SQuAD 2.0 System \*\*\*\*\*** | |
| We released code changes to reproduce our 83% F1 SQuAD 2.0 system, which is | |
| currently 1st place on the leaderboard by 3%. See the SQuAD 2.0 section of the | |
| README for details. | |
| **\*\*\*\*\* New November 5th, 2018: Third-party PyTorch and Chainer versions of | |
| BERT available \*\*\*\*\*** | |
| NLP researchers from HuggingFace made a | |
| [PyTorch version of BERT available](https://github.com/huggingface/pytorch-pretrained-BERT) | |
| which is compatible with our pre-trained checkpoints and is able to reproduce | |
| our results. Sosuke Kobayashi also made a | |
| [Chainer version of BERT available](https://github.com/soskek/bert-chainer) | |
| (Thanks!) We were not involved in the creation or maintenance of the PyTorch | |
| implementation so please direct any questions towards the authors of that | |
| repository. | |
| **\*\*\*\*\* New November 3rd, 2018: Multilingual and Chinese models available | |
| \*\*\*\*\*** | |
| We have made two new BERT models available: | |
| * **[`BERT-Base, Multilingual`](https://storage.googleapis.com/bert_models/2018_11_03/multilingual_L-12_H-768_A-12.zip) | |
| (Not recommended, use `Multilingual Cased` instead)**: 102 languages, | |
| 12-layer, 768-hidden, 12-heads, 110M parameters | |
| * **[`BERT-Base, Chinese`](https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip)**: | |
| Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M | |
| parameters | |
| We use character-based tokenization for Chinese, and WordPiece tokenization for | |
| all other languages. Both models should work out-of-the-box without any code | |
| changes. We did update the implementation of `BasicTokenizer` in | |
| `tokenization.py` to support Chinese character tokenization, so please update if | |
| you forked it. However, we did not change the tokenization API. | |
| For more, see the | |
| [Multilingual README](https://github.com/google-research/bert/blob/master/multilingual.md). | |
| **\*\*\*\*\* End new information \*\*\*\*\*** | |
| ## Introduction | |
| **BERT**, or **B**idirectional **E**ncoder **R**epresentations from | |
| **T**ransformers, is a new method of pre-training language representations which | |
| obtains state-of-the-art results on a wide array of Natural Language Processing | |
| (NLP) tasks. | |
| Our academic paper which describes BERT in detail and provides full results on a | |
| number of tasks can be found here: | |
| [https://arxiv.org/abs/1810.04805](https://arxiv.org/abs/1810.04805). | |
| To give a few numbers, here are the results on the | |
| [SQuAD v1.1](https://rajpurkar.github.io/SQuAD-explorer/) question answering | |
| task: | |
| SQuAD v1.1 Leaderboard (Oct 8th 2018) | Test EM | Test F1 | |
| ------------------------------------- | :------: | :------: | |
| 1st Place Ensemble - BERT | **87.4** | **93.2** | |
| 2nd Place Ensemble - nlnet | 86.0 | 91.7 | |
| 1st Place Single Model - BERT | **85.1** | **91.8** | |
| 2nd Place Single Model - nlnet | 83.5 | 90.1 | |
| And several natural language inference tasks: | |
| System | MultiNLI | Question NLI | SWAG | |
| ----------------------- | :------: | :----------: | :------: | |
| BERT | **86.7** | **91.1** | **86.3** | |
| OpenAI GPT (Prev. SOTA) | 82.2 | 88.1 | 75.0 | |
| Plus many other tasks. | |
| Moreover, these results were all obtained with almost no task-specific neural | |
| network architecture design. | |
| If you already know what BERT is and you just want to get started, you can | |
| [download the pre-trained models](#pre-trained-models) and | |
| [run a state-of-the-art fine-tuning](#fine-tuning-with-bert) in only a few | |
| minutes. | |
| ## What is BERT? | |
| BERT is a method of pre-training language representations, meaning that we train | |
| a general-purpose "language understanding" model on a large text corpus (like | |
| Wikipedia), and then use that model for downstream NLP tasks that we care about | |
| (like question answering). BERT outperforms previous methods because it is the | |
| first *unsupervised*, *deeply bidirectional* system for pre-training NLP. | |
| *Unsupervised* means that BERT was trained using only a plain text corpus, which | |
| is important because an enormous amount of plain text data is publicly available | |
| on the web in many languages. | |
| Pre-trained representations can also either be *context-free* or *contextual*, | |
| and contextual representations can further be *unidirectional* or | |
| *bidirectional*. Context-free models such as | |
| [word2vec](https://www.tensorflow.org/tutorials/representation/word2vec) or | |
| [GloVe](https://nlp.stanford.edu/projects/glove/) generate a single "word | |
| embedding" representation for each word in the vocabulary, so `bank` would have | |
| the same representation in `bank deposit` and `river bank`. Contextual models | |
| instead generate a representation of each word that is based on the other words | |
| in the sentence. | |
| BERT was built upon recent work in pre-training contextual representations — | |
| including [Semi-supervised Sequence Learning](https://arxiv.org/abs/1511.01432), | |
| [Generative Pre-Training](https://blog.openai.com/language-unsupervised/), | |
| [ELMo](https://allennlp.org/elmo), and | |
| [ULMFit](http://nlp.fast.ai/classification/2018/05/15/introducting-ulmfit.html) | |
| — but crucially these models are all *unidirectional* or *shallowly | |
| bidirectional*. This means that each word is only contextualized using the words | |
| to its left (or right). For example, in the sentence `I made a bank deposit` the | |
| unidirectional representation of `bank` is only based on `I made a` but not | |
| `deposit`. Some previous work does combine the representations from separate | |
| left-context and right-context models, but only in a "shallow" manner. BERT | |
| represents "bank" using both its left and right context — `I made a ... deposit` | |
| — starting from the very bottom of a deep neural network, so it is *deeply | |
| bidirectional*. | |
| BERT uses a simple approach for this: We mask out 15% of the words in the input, | |
| run the entire sequence through a deep bidirectional | |
| [Transformer](https://arxiv.org/abs/1706.03762) encoder, and then predict only | |
| the masked words. For example: | |
| ``` | |
| Input: the man went to the [MASK1] . he bought a [MASK2] of milk. | |
| Labels: [MASK1] = store; [MASK2] = gallon | |
| ``` | |
| In order to learn relationships between sentences, we also train on a simple | |
| task which can be generated from any monolingual corpus: Given two sentences `A` | |
| and `B`, is `B` the actual next sentence that comes after `A`, or just a random | |
| sentence from the corpus? | |
| ``` | |
| Sentence A: the man went to the store . | |
| Sentence B: he bought a gallon of milk . | |
| Label: IsNextSentence | |
| ``` | |
| ``` | |
| Sentence A: the man went to the store . | |
| Sentence B: penguins are flightless . | |
| Label: NotNextSentence | |
| ``` | |
| We then train a large model (12-layer to 24-layer Transformer) on a large corpus | |
| (Wikipedia + [BookCorpus](http://yknzhu.wixsite.com/mbweb)) for a long time (1M | |
| update steps), and that's BERT. | |
| Using BERT has two stages: *Pre-training* and *fine-tuning*. | |
| **Pre-training** is fairly expensive (four days on 4 to 16 Cloud TPUs), but is a | |
| one-time procedure for each language (current models are English-only, but | |
| multilingual models will be released in the near future). We are releasing a | |
| number of pre-trained models from the paper which were pre-trained at Google. | |
| Most NLP researchers will never need to pre-train their own model from scratch. | |
| **Fine-tuning** is inexpensive. All of the results in the paper can be | |
| replicated in at most 1 hour on a single Cloud TPU, or a few hours on a GPU, | |
| starting from the exact same pre-trained model. SQuAD, for example, can be | |
| trained in around 30 minutes on a single Cloud TPU to achieve a Dev F1 score of | |
| 91.0%, which is the single system state-of-the-art. | |
| The other important aspect of BERT is that it can be adapted to many types of | |
| NLP tasks very easily. In the paper, we demonstrate state-of-the-art results on | |
| sentence-level (e.g., SST-2), sentence-pair-level (e.g., MultiNLI), word-level | |
| (e.g., NER), and span-level (e.g., SQuAD) tasks with almost no task-specific | |
| modifications. | |
| ## What has been released in this repository? | |
| We are releasing the following: | |
| * TensorFlow code for the BERT model architecture (which is mostly a standard | |
| [Transformer](https://arxiv.org/abs/1706.03762) architecture). | |
| * Pre-trained checkpoints for both the lowercase and cased version of | |
| `BERT-Base` and `BERT-Large` from the paper. | |
| * TensorFlow code for push-button replication of the most important | |
| fine-tuning experiments from the paper, including SQuAD, MultiNLI, and MRPC. | |
| All of the code in this repository works out-of-the-box with CPU, GPU, and Cloud | |
| TPU. | |
| ## Pre-trained models | |
| We are releasing the `BERT-Base` and `BERT-Large` models from the paper. | |
| `Uncased` means that the text has been lowercased before WordPiece tokenization, | |
| e.g., `John Smith` becomes `john smith`. The `Uncased` model also strips out any | |
| accent markers. `Cased` means that the true case and accent markers are | |
| preserved. Typically, the `Uncased` model is better unless you know that case | |
| information is important for your task (e.g., Named Entity Recognition or | |
| Part-of-Speech tagging). | |
| These models are all released under the same license as the source code (Apache | |
| 2.0). | |
| For information about the Multilingual and Chinese model, see the | |
| [Multilingual README](https://github.com/google-research/bert/blob/master/multilingual.md). | |
**When using a cased model, make sure to pass `--do_lower_case=False` to the training
| scripts. (Or pass `do_lower_case=False` directly to `FullTokenizer` if you're | |
| using your own script.)** | |
| The links to the models are here (right-click, 'Save link as...' on the name): | |
| * **[`BERT-Large, Uncased (Whole Word Masking)`](https://storage.googleapis.com/bert_models/2019_05_30/wwm_uncased_L-24_H-1024_A-16.zip)**: | |
| 24-layer, 1024-hidden, 16-heads, 340M parameters | |
| * **[`BERT-Large, Cased (Whole Word Masking)`](https://storage.googleapis.com/bert_models/2019_05_30/wwm_cased_L-24_H-1024_A-16.zip)**: | |
| 24-layer, 1024-hidden, 16-heads, 340M parameters | |
| * **[`BERT-Base, Uncased`](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip)**: | |
| 12-layer, 768-hidden, 12-heads, 110M parameters | |
| * **[`BERT-Large, Uncased`](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-24_H-1024_A-16.zip)**: | |
| 24-layer, 1024-hidden, 16-heads, 340M parameters | |
| * **[`BERT-Base, Cased`](https://storage.googleapis.com/bert_models/2018_10_18/cased_L-12_H-768_A-12.zip)**: | |
    12-layer, 768-hidden, 12-heads, 110M parameters
| * **[`BERT-Large, Cased`](https://storage.googleapis.com/bert_models/2018_10_18/cased_L-24_H-1024_A-16.zip)**: | |
| 24-layer, 1024-hidden, 16-heads, 340M parameters | |
| * **[`BERT-Base, Multilingual Cased (New, recommended)`](https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip)**: | |
| 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters | |
| * **[`BERT-Base, Multilingual Uncased (Orig, not recommended)`](https://storage.googleapis.com/bert_models/2018_11_03/multilingual_L-12_H-768_A-12.zip) | |
| (Not recommended, use `Multilingual Cased` instead)**: 102 languages, | |
| 12-layer, 768-hidden, 12-heads, 110M parameters | |
| * **[`BERT-Base, Chinese`](https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip)**: | |
| Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M | |
| parameters | |
| Each .zip file contains three items: | |
| * A TensorFlow checkpoint (`bert_model.ckpt`) containing the pre-trained | |
| weights (which is actually 3 files). | |
| * A vocab file (`vocab.txt`) to map WordPiece to word id. | |
| * A config file (`bert_config.json`) which specifies the hyperparameters of | |
| the model. | |
| ## Fine-tuning with BERT | |
| **Important**: All results on the paper were fine-tuned on a single Cloud TPU, | |
which has 64GB of RAM. It is currently not possible to reproduce most of the
| `BERT-Large` results on the paper using a GPU with 12GB - 16GB of RAM, because | |
| the maximum batch size that can fit in memory is too small. We are working on | |
| adding code to this repository which allows for much larger effective batch size | |
| on the GPU. See the section on [out-of-memory issues](#out-of-memory-issues) for | |
| more details. | |
| This code was tested with TensorFlow 1.11.0. It was tested with Python2 and | |
| Python3 (but more thoroughly with Python2, since this is what's used internally | |
| in Google). | |
| The fine-tuning examples which use `BERT-Base` should be able to run on a GPU | |
| that has at least 12GB of RAM using the hyperparameters given. | |
| ### Fine-tuning with Cloud TPUs | |
Most of the examples below assume that you will be running training/evaluation
| on your local machine, using a GPU like a Titan X or GTX 1080. | |
| However, if you have access to a Cloud TPU that you want to train on, just add | |
| the following flags to `run_classifier.py` or `run_squad.py`: | |
| ``` | |
| --use_tpu=True \ | |
| --tpu_name=$TPU_NAME | |
| ``` | |
| Please see the | |
| [Google Cloud TPU tutorial](https://cloud.google.com/tpu/docs/tutorials/mnist) | |
| for how to use Cloud TPUs. Alternatively, you can use the Google Colab notebook | |
| "[BERT FineTuning with Cloud TPUs](https://colab.research.google.com/github/tensorflow/tpu/blob/master/tools/colab/bert_finetuning_with_cloud_tpus.ipynb)". | |
| On Cloud TPUs, the pretrained model and the output directory will need to be on | |
| Google Cloud Storage. For example, if you have a bucket named `some_bucket`, you | |
| might use the following flags instead: | |
| ``` | |
| --output_dir=gs://some_bucket/my_output_dir/ | |
| ``` | |
| The unzipped pre-trained model files can also be found in the Google Cloud | |
| Storage folder `gs://bert_models/2018_10_18`. For example: | |
| ``` | |
| export BERT_BASE_DIR=gs://bert_models/2018_10_18/uncased_L-12_H-768_A-12 | |
| ``` | |
| ### Sentence (and sentence-pair) classification tasks | |
| Before running this example you must download the | |
| [GLUE data](https://gluebenchmark.com/tasks) by running | |
| [this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e) | |
| and unpack it to some directory `$GLUE_DIR`. Next, download the `BERT-Base` | |
| checkpoint and unzip it to some directory `$BERT_BASE_DIR`. | |
This example code fine-tunes `BERT-Base` on the Microsoft Research Paraphrase
Corpus (MRPC), which contains only 3,600 examples and can be fine-tuned in a
few minutes on most GPUs.
| ```shell | |
| export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12 | |
| export GLUE_DIR=/path/to/glue | |
| python run_classifier.py \ | |
| --task_name=MRPC \ | |
| --do_train=true \ | |
| --do_eval=true \ | |
| --data_dir=$GLUE_DIR/MRPC \ | |
| --vocab_file=$BERT_BASE_DIR/vocab.txt \ | |
| --bert_config_file=$BERT_BASE_DIR/bert_config.json \ | |
| --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \ | |
| --max_seq_length=128 \ | |
| --train_batch_size=32 \ | |
| --learning_rate=2e-5 \ | |
| --num_train_epochs=3.0 \ | |
| --output_dir=/tmp/mrpc_output/ | |
| ``` | |
| You should see output like this: | |
| ``` | |
| ***** Eval results ***** | |
| eval_accuracy = 0.845588 | |
| eval_loss = 0.505248 | |
| global_step = 343 | |
| loss = 0.505248 | |
| ``` | |
| This means that the Dev set accuracy was 84.55%. Small sets like MRPC have a | |
| high variance in the Dev set accuracy, even when starting from the same | |
| pre-training checkpoint. If you re-run multiple times (making sure to point to | |
| different `output_dir`), you should see results between 84% and 88%. | |
A few other tasks are supported off-the-shelf in `run_classifier.py`, so it
should be straightforward to follow those examples to use BERT for any
single-sentence or sentence-pair classification task.
| Note: You might see a message `Running train on CPU`. This really just means | |
| that it's running on something other than a Cloud TPU, which includes a GPU. | |
| #### Prediction from classifier | |
Once you have trained your classifier, you can use it in inference mode by
passing the `--do_predict=true` flag. You need a file named `test.tsv` in the
input folder. Output will be created in a file called `test_results.tsv` in the
output folder. Each line contains the output for one sample, and the columns
are the class probabilities.
| ```shell | |
| export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12 | |
| export GLUE_DIR=/path/to/glue | |
| export TRAINED_CLASSIFIER=/path/to/fine/tuned/classifier | |
| python run_classifier.py \ | |
| --task_name=MRPC \ | |
| --do_predict=true \ | |
| --data_dir=$GLUE_DIR/MRPC \ | |
| --vocab_file=$BERT_BASE_DIR/vocab.txt \ | |
| --bert_config_file=$BERT_BASE_DIR/bert_config.json \ | |
| --init_checkpoint=$TRAINED_CLASSIFIER \ | |
| --max_seq_length=128 \ | |
| --output_dir=/tmp/mrpc_output/ | |
| ``` | |
| ### SQuAD 1.1 | |
| The Stanford Question Answering Dataset (SQuAD) is a popular question answering | |
| benchmark dataset. BERT (at the time of the release) obtains state-of-the-art | |
| results on SQuAD with almost no task-specific network architecture modifications | |
| or data augmentation. However, it does require semi-complex data pre-processing | |
| and post-processing to deal with (a) the variable-length nature of SQuAD context | |
| paragraphs, and (b) the character-level answer annotations which are used for | |
| SQuAD training. This processing is implemented and documented in `run_squad.py`. | |
| To run on SQuAD, you will first need to download the dataset. The | |
| [SQuAD website](https://rajpurkar.github.io/SQuAD-explorer/) does not seem to | |
| link to the v1.1 datasets any longer, but the necessary files can be found here: | |
| * [train-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json) | |
| * [dev-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json) | |
| * [evaluate-v1.1.py](https://github.com/allenai/bi-att-flow/blob/master/squad/evaluate-v1.1.py) | |
| Download these to some directory `$SQUAD_DIR`. | |
| The state-of-the-art SQuAD results from the paper currently cannot be reproduced | |
| on a 12GB-16GB GPU due to memory constraints (in fact, even batch size 1 does | |
| not seem to fit on a 12GB GPU using `BERT-Large`). However, a reasonably strong | |
| `BERT-Base` model can be trained on the GPU with these hyperparameters: | |
| ```shell | |
| python run_squad.py \ | |
| --vocab_file=$BERT_BASE_DIR/vocab.txt \ | |
| --bert_config_file=$BERT_BASE_DIR/bert_config.json \ | |
| --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \ | |
| --do_train=True \ | |
| --train_file=$SQUAD_DIR/train-v1.1.json \ | |
| --do_predict=True \ | |
| --predict_file=$SQUAD_DIR/dev-v1.1.json \ | |
| --train_batch_size=12 \ | |
| --learning_rate=3e-5 \ | |
| --num_train_epochs=2.0 \ | |
| --max_seq_length=384 \ | |
| --doc_stride=128 \ | |
| --output_dir=/tmp/squad_base/ | |
| ``` | |
| The dev set predictions will be saved into a file called `predictions.json` in | |
| the `output_dir`: | |
| ```shell | |
| python $SQUAD_DIR/evaluate-v1.1.py $SQUAD_DIR/dev-v1.1.json ./squad/predictions.json | |
| ``` | |
This should produce an output like this:
| ```shell | |
| {"f1": 88.41249612335034, "exact_match": 81.2488174077578} | |
| ``` | |
| You should see a result similar to the 88.5% reported in the paper for | |
| `BERT-Base`. | |
| If you have access to a Cloud TPU, you can train with `BERT-Large`. Here is a | |
| set of hyperparameters (slightly different than the paper) which consistently | |
| obtain around 90.5%-91.0% F1 single-system trained only on SQuAD: | |
| ```shell | |
| python run_squad.py \ | |
| --vocab_file=$BERT_LARGE_DIR/vocab.txt \ | |
| --bert_config_file=$BERT_LARGE_DIR/bert_config.json \ | |
| --init_checkpoint=$BERT_LARGE_DIR/bert_model.ckpt \ | |
| --do_train=True \ | |
| --train_file=$SQUAD_DIR/train-v1.1.json \ | |
| --do_predict=True \ | |
| --predict_file=$SQUAD_DIR/dev-v1.1.json \ | |
| --train_batch_size=24 \ | |
| --learning_rate=3e-5 \ | |
| --num_train_epochs=2.0 \ | |
| --max_seq_length=384 \ | |
| --doc_stride=128 \ | |
| --output_dir=gs://some_bucket/squad_large/ \ | |
| --use_tpu=True \ | |
| --tpu_name=$TPU_NAME | |
| ``` | |
| For example, one random run with these parameters produces the following Dev | |
| scores: | |
| ```shell | |
| {"f1": 90.87081895814865, "exact_match": 84.38978240302744} | |
| ``` | |
| If you fine-tune for one epoch on | |
| [TriviaQA](http://nlp.cs.washington.edu/triviaqa/) before this the results will | |
| be even better, but you will need to convert TriviaQA into the SQuAD json | |
| format. | |
| ### SQuAD 2.0 | |
| This model is also implemented and documented in `run_squad.py`. | |
| To run on SQuAD 2.0, you will first need to download the dataset. The necessary | |
| files can be found here: | |
| * [train-v2.0.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json) | |
| * [dev-v2.0.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json) | |
| * [evaluate-v2.0.py](https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/) | |
| Download these to some directory `$SQUAD_DIR`. | |
| On Cloud TPU you can run with BERT-Large as follows: | |
| ```shell | |
| python run_squad.py \ | |
| --vocab_file=$BERT_LARGE_DIR/vocab.txt \ | |
| --bert_config_file=$BERT_LARGE_DIR/bert_config.json \ | |
| --init_checkpoint=$BERT_LARGE_DIR/bert_model.ckpt \ | |
| --do_train=True \ | |
| --train_file=$SQUAD_DIR/train-v2.0.json \ | |
| --do_predict=True \ | |
| --predict_file=$SQUAD_DIR/dev-v2.0.json \ | |
| --train_batch_size=24 \ | |
| --learning_rate=3e-5 \ | |
| --num_train_epochs=2.0 \ | |
| --max_seq_length=384 \ | |
| --doc_stride=128 \ | |
| --output_dir=gs://some_bucket/squad_large/ \ | |
| --use_tpu=True \ | |
| --tpu_name=$TPU_NAME \ | |
| --version_2_with_negative=True | |
| ``` | |
We assume you have copied everything from the output directory to a local
directory called `./squad/`. The initial dev set predictions will be at
`./squad/predictions.json`, and the differences between the score of no answer
("") and the best non-null answer for each question will be in the file
`./squad/null_odds.json`.

Run this script to tune a threshold for predicting null versus non-null answers:

```shell
python $SQUAD_DIR/evaluate-v2.0.py $SQUAD_DIR/dev-v2.0.json \
  ./squad/predictions.json --na-prob-file ./squad/null_odds.json
```

Assume the script outputs `best_f1_thresh` THRESH. (Typical values are between
-1.0 and -5.0.) You can now re-run the model to generate predictions with the
derived threshold, or alternatively you can extract the appropriate answers from
`./squad/nbest_predictions.json`.
| ```shell | |
| python run_squad.py \ | |
| --vocab_file=$BERT_LARGE_DIR/vocab.txt \ | |
| --bert_config_file=$BERT_LARGE_DIR/bert_config.json \ | |
| --init_checkpoint=$BERT_LARGE_DIR/bert_model.ckpt \ | |
| --do_train=False \ | |
| --train_file=$SQUAD_DIR/train-v2.0.json \ | |
| --do_predict=True \ | |
| --predict_file=$SQUAD_DIR/dev-v2.0.json \ | |
| --train_batch_size=24 \ | |
| --learning_rate=3e-5 \ | |
| --num_train_epochs=2.0 \ | |
| --max_seq_length=384 \ | |
| --doc_stride=128 \ | |
| --output_dir=gs://some_bucket/squad_large/ \ | |
| --use_tpu=True \ | |
| --tpu_name=$TPU_NAME \ | |
| --version_2_with_negative=True \ | |
| --null_score_diff_threshold=$THRESH | |
| ``` | |
| ### Out-of-memory issues | |
| All experiments in the paper were fine-tuned on a Cloud TPU, which has 64GB of | |
| device RAM. Therefore, when using a GPU with 12GB - 16GB of RAM, you are likely | |
| to encounter out-of-memory issues if you use the same hyperparameters described | |
| in the paper. | |
| The factors that affect memory usage are: | |
| * **`max_seq_length`**: The released models were trained with sequence lengths | |
| up to 512, but you can fine-tune with a shorter max sequence length to save | |
| substantial memory. This is controlled by the `max_seq_length` flag in our | |
| example code. | |
| * **`train_batch_size`**: The memory usage is also directly proportional to | |
| the batch size. | |
| * **Model type, `BERT-Base` vs. `BERT-Large`**: The `BERT-Large` model | |
| requires significantly more memory than `BERT-Base`. | |
| * **Optimizer**: The default optimizer for BERT is Adam, which requires a lot | |
    of extra memory to store the `m` and `v` vectors. Switching to a more
    memory-efficient optimizer can reduce memory usage, but can also affect the
| results. We have not experimented with other optimizers for fine-tuning. | |
| Using the default training scripts (`run_classifier.py` and `run_squad.py`), we | |
benchmarked the maximum batch size on a single Titan X GPU (12GB RAM) with
| TensorFlow 1.11.0: | |
| System | Seq Length | Max Batch Size | |
| ------------ | ---------- | -------------- | |
| `BERT-Base` | 64 | 64 | |
| ... | 128 | 32 | |
| ... | 256 | 16 | |
| ... | 320 | 14 | |
| ... | 384 | 12 | |
| ... | 512 | 6 | |
| `BERT-Large` | 64 | 12 | |
| ... | 128 | 6 | |
| ... | 256 | 2 | |
| ... | 320 | 1 | |
| ... | 384 | 0 | |
| ... | 512 | 0 | |
| Unfortunately, these max batch sizes for `BERT-Large` are so small that they | |
| will actually harm the model accuracy, regardless of the learning rate used. We | |
| are working on adding code to this repository which will allow much larger | |
| effective batch sizes to be used on the GPU. The code will be based on one (or | |
| both) of the following techniques: | |
| * **Gradient accumulation**: The samples in a minibatch are typically | |
| independent with respect to gradient computation (excluding batch | |
| normalization, which is not used here). This means that the gradients of | |
| multiple smaller minibatches can be accumulated before performing the weight | |
| update, and this will be exactly equivalent to a single larger update. | |
| * [**Gradient checkpointing**](https://github.com/openai/gradient-checkpointing): | |
| The major use of GPU/TPU memory during DNN training is caching the | |
| intermediate activations in the forward pass that are necessary for | |
| efficient computation in the backward pass. "Gradient checkpointing" trades | |
| memory for compute time by re-computing the activations in an intelligent | |
| way. | |
| **However, this is not implemented in the current release.** | |
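To illustrate the gradient accumulation point, here is a framework-agnostic toy check (not the planned repository code) that accumulating gradients over smaller minibatches reproduces the gradient of one larger batch, using a simple mean-squared-error loss:

```python
import numpy as np

rng = np.random.RandomState(0)
X, y, w = rng.randn(32, 4), rng.randn(32), rng.randn(4)  # one "large" batch

def grad(X, y, w):
    # Gradient of 0.5 * mean((Xw - y)^2) with respect to w.
    return X.T.dot(X.dot(w) - y) / len(y)

full_grad = grad(X, y, w)

# Accumulate over 4 micro-batches of 8 examples, weighting by micro-batch size.
accum = np.zeros_like(w)
for i in range(0, 32, 8):
    accum += grad(X[i:i + 8], y[i:i + 8], w) * 8
accum /= 32

print(np.allclose(full_grad, accum))  # True: one large update is equivalent
```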
| ## Using BERT to extract fixed feature vectors (like ELMo) | |
| In certain cases, rather than fine-tuning the entire pre-trained model | |
end-to-end, it can be beneficial to obtain *pre-trained contextual
| embeddings*, which are fixed contextual representations of each input token | |
| generated from the hidden layers of the pre-trained model. This should also | |
| mitigate most of the out-of-memory issues. | |
| As an example, we include the script `extract_features.py` which can be used | |
| like this: | |
| ```shell | |
| # Sentence A and Sentence B are separated by the ||| delimiter for sentence | |
| # pair tasks like question answering and entailment. | |
| # For single sentence inputs, put one sentence per line and DON'T use the | |
| # delimiter. | |
| echo 'Who was Jim Henson ? ||| Jim Henson was a puppeteer' > /tmp/input.txt | |
| python extract_features.py \ | |
| --input_file=/tmp/input.txt \ | |
| --output_file=/tmp/output.jsonl \ | |
| --vocab_file=$BERT_BASE_DIR/vocab.txt \ | |
| --bert_config_file=$BERT_BASE_DIR/bert_config.json \ | |
| --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \ | |
| --layers=-1,-2,-3,-4 \ | |
| --max_seq_length=128 \ | |
| --batch_size=8 | |
| ``` | |
| This will create a JSON file (one line per line of input) containing the BERT | |
| activations from each Transformer layer specified by `layers` (-1 is the final | |
| hidden layer of the Transformer, etc.) | |
| Note that this script will produce very large output files (by default, around | |
| 15kb for every input token). | |
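Here is a short sketch for consuming that output, under the assumption that each line is a JSON object whose `features` list carries a `token` and per-layer `layers` entries with `values`; double-check the schema against the script in your copy of the repository:

```python
import json
import numpy as np

with open("/tmp/output.jsonl") as f:
    for line in f:
        example = json.loads(line)
        for feature in example["features"]:
            # Stack the requested layers (-1,-2,-3,-4 above) into one vector per token.
            vec = np.concatenate([np.array(layer["values"]) for layer in feature["layers"]])
            print(feature["token"], vec.shape)
```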
| If you need to maintain alignment between the original and tokenized words (for | |
| projecting training labels), see the [Tokenization](#tokenization) section | |
| below. | |
| **Note:** You may see a message like `Could not find trained model in model_dir: | |
| /tmp/tmpuB5g5c, running initialization to predict.` This message is expected, it | |
| just means that we are using the `init_from_checkpoint()` API rather than the | |
| saved model API. If you don't specify a checkpoint or specify an invalid | |
| checkpoint, this script will complain. | |
| ## Tokenization | |
For sentence-level (or sentence-pair) tasks, tokenization is very simple. Just
follow the example code in `run_classifier.py` and `extract_features.py`. The
basic procedure for sentence-level tasks is (a minimal code sketch follows the
list):
| 1. Instantiate an instance of `tokenizer = tokenization.FullTokenizer` | |
| 2. Tokenize the raw text with `tokens = tokenizer.tokenize(raw_text)`. | |
| 3. Truncate to the maximum sequence length. (You can use up to 512, but you | |
| probably want to use shorter if possible for memory and speed reasons.) | |
| 4. Add the `[CLS]` and `[SEP]` tokens in the right place. | |
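A minimal sketch of those four steps for a single sentence (the vocab path is a placeholder for wherever you unpacked the checkpoint):

```python
import tokenization  # tokenization.py from this repository

max_seq_length = 128
vocab_file = "/path/to/bert/uncased_L-12_H-768_A-12/vocab.txt"

tokenizer = tokenization.FullTokenizer(vocab_file=vocab_file, do_lower_case=True)  # step 1

tokens = tokenizer.tokenize("John Johanson's house")  # step 2
tokens = tokens[:max_seq_length - 2]                  # step 3: leave room for the special tokens
tokens = ["[CLS]"] + tokens + ["[SEP]"]               # step 4
input_ids = tokenizer.convert_tokens_to_ids(tokens)
```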
| Word-level and span-level tasks (e.g., SQuAD and NER) are more complex, since | |
| you need to maintain alignment between your input text and output text so that | |
| you can project your training labels. SQuAD is a particularly complex example | |
| because the input labels are *character*-based, and SQuAD paragraphs are often | |
longer than our maximum sequence length. See the code in `run_squad.py` for how
we handle this.
| Before we describe the general recipe for handling word-level tasks, it's | |
| important to understand what exactly our tokenizer is doing. It has three main | |
| steps: | |
| 1. **Text normalization**: Convert all whitespace characters to spaces, and | |
| (for the `Uncased` model) lowercase the input and strip out accent markers. | |
| E.g., `John Johanson's, → john johanson's,`. | |
| 2. **Punctuation splitting**: Split *all* punctuation characters on both sides | |
| (i.e., add whitespace around all punctuation characters). Punctuation | |
| characters are defined as (a) Anything with a `P*` Unicode class, (b) any | |
| non-letter/number/space ASCII character (e.g., characters like `$` which are | |
| technically not punctuation). E.g., `john johanson's, → john johanson ' s ,` | |
| 3. **WordPiece tokenization**: Apply whitespace tokenization to the output of | |
| the above procedure, and apply | |
| [WordPiece](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/text_encoder.py) | |
| tokenization to each token separately. (Our implementation is directly based | |
| on the one from `tensor2tensor`, which is linked). E.g., `john johanson ' s | |
| , → john johan ##son ' s ,` | |
| The advantage of this scheme is that it is "compatible" with most existing | |
| English tokenizers. For example, imagine that you have a part-of-speech tagging | |
| task which looks like this: | |
| ``` | |
| Input: John Johanson 's house | |
| Labels: NNP NNP POS NN | |
| ``` | |
| The tokenized output will look like this: | |
| ``` | |
| Tokens: john johan ##son ' s house | |
| ``` | |
| Crucially, this would be the same output as if the raw text were `John | |
| Johanson's house` (with no space before the `'s`). | |
| If you have a pre-tokenized representation with word-level annotations, you can | |
| simply tokenize each input word independently, and deterministically maintain an | |
| original-to-tokenized alignment: | |
```python
import tokenization  # tokenization.py from this repository

vocab_file = "/path/to/bert/uncased_L-12_H-768_A-12/vocab.txt"  # released vocab

### Input
orig_tokens = ["John", "Johanson", "'s", "house"]
| labels = ["NNP", "NNP", "POS", "NN"] | |
| ### Output | |
| bert_tokens = [] | |
| # Token map will be an int -> int mapping between the `orig_tokens` index and | |
| # the `bert_tokens` index. | |
| orig_to_tok_map = [] | |
| tokenizer = tokenization.FullTokenizer( | |
| vocab_file=vocab_file, do_lower_case=True) | |
| bert_tokens.append("[CLS]") | |
| for orig_token in orig_tokens: | |
| orig_to_tok_map.append(len(bert_tokens)) | |
| bert_tokens.extend(tokenizer.tokenize(orig_token)) | |
| bert_tokens.append("[SEP]") | |
| # bert_tokens == ["[CLS]", "john", "johan", "##son", "'", "s", "house", "[SEP]"] | |
| # orig_to_tok_map == [1, 2, 4, 6] | |
| ``` | |
| Now `orig_to_tok_map` can be used to project `labels` to the tokenized | |
| representation. | |
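For example, a common convention is to assign each word-level label to the first WordPiece of that word:

```python
# Continuing the example above (one common convention; others are possible).
orig_to_tok_map = [1, 2, 4, 6]
labels = ["NNP", "NNP", "POS", "NN"]
token_labels = {tok_index: label for tok_index, label in zip(orig_to_tok_map, labels)}
# token_labels == {1: 'NNP', 2: 'NNP', 4: 'POS', 6: 'NN'}
```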
There are common English tokenization schemes which will cause a slight mismatch
with how BERT was pre-trained. For example, if your input tokenization splits
| off contractions like `do n't`, this will cause a mismatch. If it is possible to | |
| do so, you should pre-process your data to convert these back to raw-looking | |
| text, but if it's not possible, this mismatch is likely not a big deal. | |
| ## Pre-training with BERT | |
| We are releasing code to do "masked LM" and "next sentence prediction" on an | |
| arbitrary text corpus. Note that this is *not* the exact code that was used for | |
| the paper (the original code was written in C++, and had some additional | |
| complexity), but this code does generate pre-training data as described in the | |
| paper. | |
| Here's how to run the data generation. The input is a plain text file, with one | |
| sentence per line. (It is important that these be actual sentences for the "next | |
| sentence prediction" task). Documents are delimited by empty lines. The output | |
| is a set of `tf.train.Example`s serialized into `TFRecord` file format. | |
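For example, a toy input file with two documents (one sentence per line, a blank line between documents) might look like this:

```
the man went to the store .
he bought a gallon of milk .

penguins are flightless birds .
they live mostly in the southern hemisphere .
```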
| You can perform sentence segmentation with an off-the-shelf NLP toolkit such as | |
| [spaCy](https://spacy.io/). The `create_pretraining_data.py` script will | |
| concatenate segments until they reach the maximum sequence length to minimize | |
| computational waste from padding (see the script for more details). However, you | |
| may want to intentionally add a slight amount of noise to your input data (e.g., | |
| randomly truncate 2% of input segments) to make it more robust to non-sentential | |
| input during fine-tuning. | |
| This script stores all of the examples for the entire input file in memory, so | |
| for large data files you should shard the input file and call the script | |
| multiple times. (You can pass in a file glob to `run_pretraining.py`, e.g., | |
| `tf_examples.tf_record*`.) | |
The `max_predictions_per_seq` is the maximum number of masked LM predictions per
sequence. You should set this to around `max_seq_length` * `masked_lm_prob` -- for
example, 128 * 0.15 ≈ 19, which the command below rounds up to 20. (The script
doesn't do this automatically because the exact value needs to be passed to both
scripts.)
| ```shell | |
| python create_pretraining_data.py \ | |
| --input_file=./sample_text.txt \ | |
| --output_file=/tmp/tf_examples.tfrecord \ | |
| --vocab_file=$BERT_BASE_DIR/vocab.txt \ | |
| --do_lower_case=True \ | |
| --max_seq_length=128 \ | |
| --max_predictions_per_seq=20 \ | |
| --masked_lm_prob=0.15 \ | |
| --random_seed=12345 \ | |
| --dupe_factor=5 | |
| ``` | |
| Here's how to run the pre-training. Do not include `init_checkpoint` if you are | |
| pre-training from scratch. The model configuration (including vocab size) is | |
| specified in `bert_config_file`. This demo code only pre-trains for a small | |
| number of steps (20), but in practice you will probably want to set | |
| `num_train_steps` to 10000 steps or more. The `max_seq_length` and | |
| `max_predictions_per_seq` parameters passed to `run_pretraining.py` must be the | |
same as the values passed to `create_pretraining_data.py`.
| ```shell | |
| python run_pretraining.py \ | |
| --input_file=/tmp/tf_examples.tfrecord \ | |
| --output_dir=/tmp/pretraining_output \ | |
| --do_train=True \ | |
| --do_eval=True \ | |
| --bert_config_file=$BERT_BASE_DIR/bert_config.json \ | |
| --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \ | |
| --train_batch_size=32 \ | |
| --max_seq_length=128 \ | |
| --max_predictions_per_seq=20 \ | |
| --num_train_steps=20 \ | |
| --num_warmup_steps=10 \ | |
| --learning_rate=2e-5 | |
| ``` | |
| This will produce an output like this: | |
| ``` | |
| ***** Eval results ***** | |
| global_step = 20 | |
| loss = 0.0979674 | |
| masked_lm_accuracy = 0.985479 | |
| masked_lm_loss = 0.0979328 | |
| next_sentence_accuracy = 1.0 | |
| next_sentence_loss = 3.45724e-05 | |
| ``` | |
| Note that since our `sample_text.txt` file is very small, this example training | |
| will overfit that data in only a few steps and produce unrealistically high | |
| accuracy numbers. | |
| ### Pre-training tips and caveats | |
| * **If using your own vocabulary, make sure to change `vocab_size` in | |
| `bert_config.json`. If you use a larger vocabulary without changing this, | |
| you will likely get NaNs when training on GPU or TPU due to unchecked | |
| out-of-bounds access.** | |
| * If your task has a large domain-specific corpus available (e.g., "movie | |
| reviews" or "scientific papers"), it will likely be beneficial to run | |
| additional steps of pre-training on your corpus, starting from the BERT | |
| checkpoint. | |
| * The learning rate we used in the paper was 1e-4. However, if you are doing | |
| additional steps of pre-training starting from an existing BERT checkpoint, | |
| you should use a smaller learning rate (e.g., 2e-5). | |
| * Current BERT models are English-only, but we do plan to release a | |
| multilingual model which has been pre-trained on a lot of languages in the | |
| near future (hopefully by the end of November 2018). | |
| * Longer sequences are disproportionately expensive because attention is | |
| quadratic to the sequence length. In other words, a batch of 64 sequences of | |
| length 512 is much more expensive than a batch of 256 sequences of | |
| length 128. The fully-connected/convolutional cost is the same, but the | |
| attention cost is far greater for the 512-length sequences. Therefore, one | |
| good recipe is to pre-train for, say, 90,000 steps with a sequence length of | |
| 128 and then for 10,000 additional steps with a sequence length of 512. The | |
| very long sequences are mostly needed to learn positional embeddings, which | |
| can be learned fairly quickly. Note that this does require generating the | |
| data twice with different values of `max_seq_length`. | |
| * If you are pre-training from scratch, be prepared that pre-training is | |
| computationally expensive, especially on GPUs. If you are pre-training from | |
| scratch, our recommended recipe is to pre-train a `BERT-Base` on a single | |
| [preemptible Cloud TPU v2](https://cloud.google.com/tpu/docs/pricing), which | |
| takes about 2 weeks at a cost of about $500 USD (based on the pricing in | |
| October 2018). You will have to scale down the batch size when only training | |
| on a single Cloud TPU, compared to what was used in the paper. It is | |
| recommended to use the largest batch size that fits into TPU memory. | |
| ### Pre-training data | |
| We will **not** be able to release the pre-processed datasets used in the paper. | |
| For Wikipedia, the recommended pre-processing is to download | |
| [the latest dump](https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2), | |
| extract the text with | |
| [`WikiExtractor.py`](https://github.com/attardi/wikiextractor), and then apply | |
| any necessary cleanup to convert it into plain text. | |
| Unfortunately the researchers who collected the | |
| [BookCorpus](http://yknzhu.wixsite.com/mbweb) no longer have it available for | |
| public download. The | |
[Project Gutenberg Dataset](https://web.eecs.umich.edu/~lahiri/gutenberg_dataset.html)
| is a somewhat smaller (200M word) collection of older books that are public | |
| domain. | |
| [Common Crawl](http://commoncrawl.org/) is another very large collection of | |
| text, but you will likely have to do substantial pre-processing and cleanup to | |
| extract a usable corpus for pre-training BERT. | |
| ### Learning a new WordPiece vocabulary | |
| This repository does not include code for *learning* a new WordPiece vocabulary. | |
| The reason is that the code used in the paper was implemented in C++ with | |
| dependencies on Google's internal libraries. For English, it is almost always | |
| better to just start with our vocabulary and pre-trained models. For learning | |
| vocabularies of other languages, there are a number of open source options | |
| available. However, keep in mind that these are not compatible with our | |
| `tokenization.py` library: | |
| * [Google's SentencePiece library](https://github.com/google/sentencepiece) | |
| * [tensor2tensor's WordPiece generation script](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/text_encoder_build_subword.py) | |
| * [Rico Sennrich's Byte Pair Encoding library](https://github.com/rsennrich/subword-nmt) | |
| ## Using BERT in Colab | |
| If you want to use BERT with [Colab](https://colab.research.google.com), you can | |
| get started with the notebook | |
| "[BERT FineTuning with Cloud TPUs](https://colab.research.google.com/github/tensorflow/tpu/blob/master/tools/colab/bert_finetuning_with_cloud_tpus.ipynb)". | |
| **At the time of this writing (October 31st, 2018), Colab users can access a | |
| Cloud TPU completely for free.** Note: One per user, availability limited, | |
| requires a Google Cloud Platform account with storage (although storage may be | |
purchased with free credit for signing up with GCP), and this capability may no
longer be available in the future. Click on the BERT Colab that was just linked
| for more information. | |
| ## FAQ | |
| #### Is this code compatible with Cloud TPUs? What about GPUs? | |
| Yes, all of the code in this repository works out-of-the-box with CPU, GPU, and | |
| Cloud TPU. However, GPU training is single-GPU only. | |
| #### I am getting out-of-memory errors, what is wrong? | |
| See the section on [out-of-memory issues](#out-of-memory-issues) for more | |
| information. | |
| #### Is there a PyTorch version available? | |
| There is no official PyTorch implementation. However, NLP researchers from | |
| HuggingFace made a | |
| [PyTorch version of BERT available](https://github.com/huggingface/pytorch-pretrained-BERT) | |
| which is compatible with our pre-trained checkpoints and is able to reproduce | |
| our results. We were not involved in the creation or maintenance of the PyTorch | |
| implementation so please direct any questions towards the authors of that | |
| repository. | |
| #### Is there a Chainer version available? | |
| There is no official Chainer implementation. However, Sosuke Kobayashi made a | |
| [Chainer version of BERT available](https://github.com/soskek/bert-chainer) | |
| which is compatible with our pre-trained checkpoints and is able to reproduce | |
| our results. We were not involved in the creation or maintenance of the Chainer | |
| implementation so please direct any questions towards the authors of that | |
| repository. | |
| #### Will models in other languages be released? | |
| Yes, we plan to release a multi-lingual BERT model in the near future. We cannot | |
| make promises about exactly which languages will be included, but it will likely | |
| be a single model which includes *most* of the languages which have a | |
| significantly-sized Wikipedia. | |
| #### Will models larger than `BERT-Large` be released? | |
| So far we have not attempted to train anything larger than `BERT-Large`. It is | |
| possible that we will release larger models if we are able to obtain significant | |
| improvements. | |
| #### What license is this library released under? | |
| All code *and* models are released under the Apache 2.0 license. See the | |
| `LICENSE` file for more information. | |
| #### How do I cite BERT? | |
| For now, cite [the Arxiv paper](https://arxiv.org/abs/1810.04805): | |
| ``` | |
| @article{devlin2018bert, | |
| title={BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding}, | |
| author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina}, | |
| journal={arXiv preprint arXiv:1810.04805}, | |
| year={2018} | |
| } | |
| ``` | |
| If we submit the paper to a conference or journal, we will update the BibTeX. | |
| ## Disclaimer | |
| This is not an official Google product. | |
| ## Contact information | |
| For help or issues using BERT, please submit a GitHub issue. | |
| For personal communication related to BERT, please contact Jacob Devlin | |
| (`jacobdevlin@google.com`), Ming-Wei Chang (`mingweichang@google.com`), or | |
| Kenton Lee (`kentonl@google.com`). | |