BERT
A JAX implementation of BERT that follows the official TF implementation and reproduces its numbers.
The code currently supports pretraining and finetuning/few-shot evaluation on GLUE. We hope to find time to also add support for finetuning on SuperGLUE, SQuAD, and XTREME, based on the code in the TensorFlow models repository.
Additional Requirements:
The following command will install the required packages for BERT.
$ pip install -r scenic/projects/baselines/bert/requirements.txt
Process Datasets
The code here consumes data in the same format as the official implementation. To generate the data, follow the instructions below, which are also explained in the official BERT repo.
First, get the preprocessing code:
$ git clone https://github.com/tensorflow/models.git
Pre-training
To generate the processed pre-training data, use the create_pretraining_data script
(which is essentially branched from the BERT research repo).
Running the data-creation script requires input and output paths, as well
as a vocab file. Note that max_seq_length must match the sequence
length parameter you specify when you run pre-training.
Example shell script to call create_pretraining_data.py
$ export WORKING_DIR='local disk or cloud location'
$ export BERT_DIR='local disk or cloud location'
$ python models/official/nlp/data/create_pretraining_data.py \
--input_file=$WORKING_DIR/input/input.txt \
--output_file=$WORKING_DIR/output/tf_examples.tfrecord \
--vocab_file=$BERT_DIR/wwm_uncased_L-24_H-1024_A-16/vocab.txt \
--do_lower_case=True \
--max_seq_length=512 \
--max_predictions_per_seq=76 \
--masked_lm_prob=0.15 \
--random_seed=12345 \
--dupe_factor=5
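The values of max_seq_length, masked_lm_prob, and max_predictions_per_seq in the command above are related: 512 * 0.15 is about 76.8, and the cap is set to 76. The following is a minimal sketch of that relationship, assuming the number of masked positions per sequence is the rounded product of sequence length and masking probability, capped by max_predictions_per_seq, as in the original BERT data script. The function name `num_masked_positions` is hypothetical and is not part of this repo.

```python
def num_masked_positions(num_tokens, masked_lm_prob, max_predictions_per_seq):
    """Estimate how many positions are masked in one sequence.

    Assumption: the count is round(num_tokens * masked_lm_prob), at least 1,
    and never more than max_predictions_per_seq.
    """
    wanted = max(1, int(round(num_tokens * masked_lm_prob)))
    return min(max_predictions_per_seq, wanted)


# A full-length sequence hits the cap; a shorter one stays below it.
full = num_masked_positions(512, 0.15, 76)
short = num_masked_positions(128, 0.15, 76)
print(full, short)  # prints: 76 19
```

If you change max_seq_length or masked_lm_prob, a cap close to their product keeps the masking rate near the intended probability without padding every example with many unused prediction slots.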
Fine-tuning
To prepare the fine-tuning data for final model training, use the
create_finetuning_data.py script.
The resulting datasets in tf_record format and the training metadata should later be
passed to the training or evaluation scripts. The task-specific arguments are
described in the following sections:
GLUE
Users can download the
GLUE data by running
this script
and unpack it to some directory $GLUE_DIR.
$ export GLUE_DIR=~/glue
$ export BERT_DIR=gs://cloud-tpu-checkpoints/bert/keras_bert/uncased_L-24_H-1024_A-16
$ export TASK_NAME=MNLI
$ export OUTPUT_DIR=gs://some_bucket/datasets
$ python models/official/nlp/data/create_finetuning_data.py \
--input_data_dir=${GLUE_DIR}/${TASK_NAME}/ \
--vocab_file=${BERT_DIR}/vocab.txt \
--train_data_output_path=${OUTPUT_DIR}/${TASK_NAME}_train.tf_record \
--eval_data_output_path=${OUTPUT_DIR}/${TASK_NAME}_eval.tf_record \
--meta_data_file_path=${OUTPUT_DIR}/${TASK_NAME}_meta_data \
--fine_tuning_task_type=classification --max_seq_length=128 \
--classification_task_name=${TASK_NAME}
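To process several GLUE tasks with the same settings, the command above can be generated per task. The following is a rough sketch that only builds and prints the command lines (a dry run), using the flags shown above. The helper name `finetuning_data_command`, the `script` default, and the choice of MNLI and MRPC as example tasks are assumptions for illustration; adjust the script path to where create_finetuning_data.py lives in your checkout.

```python
import os
import shlex


def finetuning_data_command(task_name, glue_dir, bert_dir, output_dir,
                            script="create_finetuning_data.py",
                            max_seq_length=128):
    """Build the argument list for one GLUE classification task."""
    prefix = f"{output_dir}/{task_name}"
    return [
        "python", script,
        f"--input_data_dir={glue_dir}/{task_name}/",
        f"--vocab_file={bert_dir}/vocab.txt",
        f"--train_data_output_path={prefix}_train.tf_record",
        f"--eval_data_output_path={prefix}_eval.tf_record",
        f"--meta_data_file_path={prefix}_meta_data",
        "--fine_tuning_task_type=classification",
        f"--max_seq_length={max_seq_length}",
        f"--classification_task_name={task_name}",
    ]


glue_dir = os.path.expanduser("~/glue")
bert_dir = "gs://cloud-tpu-checkpoints/bert/keras_bert/uncased_L-24_H-1024_A-16"
output_dir = "gs://some_bucket/datasets"

# Print one command per task; nothing is executed here.
for task in ["MNLI", "MRPC"]:
    print(shlex.join(finetuning_data_command(task, glue_dir, bert_dir, output_dir)))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once the printed commands look right.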
Pretrained checkpoints
We will release BERT checkpoints that are pretrained using this code and can be used with no specific modification or weight surgery.
Acknowledgment
We would like to thank Valerii Likhosherstov and Yi Tay for their contributions to the BERT implementation in Scenic.