# Pretraining RoBERTa using your own data

This tutorial will walk you through pretraining RoBERTa over your own data.
### 1) Preprocess the data

Data should be preprocessed following the [language modeling format](/examples/language_model), i.e. each document should be separated by an empty line (only useful with `--sample-break-mode complete_doc`). Lines will be concatenated as a 1D text stream during training.
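For instance, a tiny corpus in this format (two documents separated by an empty line; the text itself is purely illustrative) would look like:

```
First document, first sentence. Another sentence on the same line.
Second line of the first document.

The second document starts after the blank line.
```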
We'll use the [WikiText-103 dataset](https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/)
to demonstrate how to preprocess raw text data with the GPT-2 BPE. Of course
this dataset is quite small, so the resulting pretrained model will perform
poorly, but it gives the general idea.

First download the dataset:
```bash
wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
unzip wikitext-103-raw-v1.zip
```
Next encode it with the GPT-2 BPE:
```bash
mkdir -p gpt2_bpe
wget -O gpt2_bpe/encoder.json https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json
wget -O gpt2_bpe/vocab.bpe https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe
for SPLIT in train valid test; do \
    python -m examples.roberta.multiprocessing_bpe_encoder \
        --encoder-json gpt2_bpe/encoder.json \
        --vocab-bpe gpt2_bpe/vocab.bpe \
        --inputs wikitext-103-raw/wiki.${SPLIT}.raw \
        --outputs wikitext-103-raw/wiki.${SPLIT}.bpe \
        --keep-empty \
        --workers 60; \
done
```
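The encoder script maps the corpus line by line, in parallel, to space-separated token ids, preserving empty lines (because of `--keep-empty`) so document boundaries survive encoding. The same pattern can be sketched with a stand-in tokenizer; `toy_encode` below is purely illustrative and is *not* the GPT-2 BPE:

```python
from multiprocessing import Pool

def toy_encode(line):
    # stand-in for BPE: map each whitespace token to a fake "id" (its length)
    return " ".join(str(len(tok)) for tok in line.split())

lines = ["Hello world", "", "fairseq pretraining"]  # note the empty line
with Pool(2) as pool:
    encoded = pool.map(toy_encode, lines)

print(encoded)  # ['5 5', '', '7 11'] -- empty lines kept, mirroring --keep-empty
```

Because line counts are preserved, the `.raw` and `.bpe` files for each split should have the same number of lines, which makes a quick `wc -l` comparison a useful sanity check.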
Finally preprocess/binarize the data using the GPT-2 fairseq dictionary:
```bash
wget -O gpt2_bpe/dict.txt https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt
fairseq-preprocess \
    --only-source \
    --srcdict gpt2_bpe/dict.txt \
    --trainpref wikitext-103-raw/wiki.train.bpe \
    --validpref wikitext-103-raw/wiki.valid.bpe \
    --testpref wikitext-103-raw/wiki.test.bpe \
    --destdir data-bin/wikitext-103 \
    --workers 60
```
### 2) Train RoBERTa base

```bash
DATA_DIR=data-bin/wikitext-103

fairseq-hydra-train -m --config-dir examples/roberta/config/pretraining \
    --config-name base task.data=$DATA_DIR
```
**Note:** You can optionally resume training the released RoBERTa base model by
adding `checkpoint.restore_file=/path/to/roberta.base/model.pt`.

**Note:** The above command assumes training on 8x32GB V100 GPUs. Each GPU uses
a batch size of 16 sequences (`dataset.batch_size`) and accumulates gradients to
further increase the batch size by 16x (`optimization.update_freq`), for a total batch size
of 2048 sequences. If you have fewer GPUs or GPUs with less memory you may need
to reduce `dataset.batch_size` and increase `optimization.update_freq` to compensate.
Alternatively if you have more GPUs you can decrease `optimization.update_freq` accordingly
to increase training speed.
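The effective batch size is simply the product of GPU count, per-GPU batch size, and gradient accumulation steps, which makes the compensation arithmetic easy to check (the numbers below match the note above; the helper function is just for illustration):

```python
def effective_batch_size(num_gpus, batch_size, update_freq):
    """Total sequences consumed per optimizer step."""
    return num_gpus * batch_size * update_freq

# 8 GPUs x 16 sequences x 16 accumulation steps = 2048 sequences
assert effective_batch_size(num_gpus=8, batch_size=16, update_freq=16) == 2048

# with only 2 GPUs, raising update_freq to 64 keeps the same effective batch size
assert effective_batch_size(num_gpus=2, batch_size=16, update_freq=64) == 2048
```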
**Note:** The learning rate and batch size are tightly connected and need to be
adjusted together. We generally recommend increasing the learning rate as you
increase the batch size according to the following table (although it's also
dataset dependent, so don't rely on the following values too closely):

batch size | peak learning rate
---|---
256 | 0.0001
2048 | 0.0005
8192 | 0.0007
### 3) Load your pretrained model

```python
import torch
from fairseq.models.roberta import RobertaModel

roberta = RobertaModel.from_pretrained('checkpoints', 'checkpoint_best.pt', 'path/to/data')
assert isinstance(roberta.model, torch.nn.Module)
```
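From here, the returned hub interface exposes the usual RoBERTa helpers. A short sketch continuing from the snippet above (it requires the checkpoint produced in step 2, so treat it as illustrative rather than directly runnable):

```python
roberta.eval()  # disable dropout for deterministic features

tokens = roberta.encode('Hello world!')      # GPT-2 BPE encode + special tokens
features = roberta.extract_features(tokens)  # shape: (1, num_tokens, hidden_dim)
print(features.shape)
```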