license: mit
datasets:
  - wikipedia
  - bookcorpus
language:
  - en
metrics:
  - glue
library_name: transformers

This is our reproduction of a medium-sized model built on the official Hugging Face RoBERTa architecture. Architecturally, RoBERTa is the same as BERT except for its larger vocabulary.

According to Google's BERT releases, which include BERT-Medium, a medium-sized model has a config of Layer=8, Hidden=512, #AttnHeads=8, and IntermediateSize=2048. We follow this config to pre-train a RoBERTa-medium model for reproduction.
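
As a minimal sketch, the medium config above can be written as keyword arguments in the style of `transformers.RobertaConfig`. The `vocab_size` and `max_position_embeddings` values are assumptions borrowed from the public `roberta-base` checkpoint; this card does not state them. `head_dim` is a hypothetical helper name.

```python
# Medium-size settings stated in this model card.
medium_config = {
    "num_hidden_layers": 8,
    "hidden_size": 512,
    "num_attention_heads": 8,
    "intermediate_size": 2048,
    # Assumptions: roberta-base values, not confirmed by this card.
    "vocab_size": 50265,
    "max_position_embeddings": 514,
}


def head_dim(config):
    """Return the per-head dimension, checking that the heads divide the hidden size."""
    hidden = config["hidden_size"]
    heads = config["num_attention_heads"]
    if hidden % heads != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")
    return hidden // heads


print(head_dim(medium_config))  # 64
```

With `transformers` installed, `RobertaConfig(**medium_config)` should build the corresponding config object.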

We pre-train on the same datasets as BERT (English Wikipedia and BookCorpus) for 30k steps with a batch size of 8,192. We have also released our reproduction of this dataset on Hugging Face.
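
For a sense of scale, the step count and batch size above fix the number of training sequences processed. This is plain arithmetic on the stated numbers. The sequence length is not given here, so no token count is derived. `sequences_seen` is a hypothetical helper name.

```python
TOTAL_STEPS = 30_000  # pre-training steps stated above
BATCH_SIZE = 8_192    # sequences per step stated above


def sequences_seen(steps=TOTAL_STEPS, batch_size=BATCH_SIZE):
    """Total number of training sequences processed over the run."""
    return steps * batch_size


print(f"{sequences_seen():,}")  # 245,760,000
```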

We used DeepSpeed ZeRO-2 to optimize training performance.
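
A minimal sketch of a DeepSpeed config that enables ZeRO stage 2 with the global batch size from this card. The helper name `make_ds_config` and the GPU count, micro-batch size, and accumulation steps in the usage line are hypothetical, since the actual hardware layout is not stated here. DeepSpeed expects `train_batch_size` to equal the per-GPU micro-batch size times the gradient accumulation steps times the number of GPUs.

```python
import json

GLOBAL_BATCH_SIZE = 8192  # stated in this card


def make_ds_config(micro_batch_per_gpu, grad_accum_steps, num_gpus):
    """Build a ZeRO-2 config dict; the three arguments are hypothetical layout choices."""
    if micro_batch_per_gpu * grad_accum_steps * num_gpus != GLOBAL_BATCH_SIZE:
        raise ValueError("layout does not multiply out to the global batch size")
    return {
        "train_batch_size": GLOBAL_BATCH_SIZE,
        "train_micro_batch_size_per_gpu": micro_batch_per_gpu,
        "gradient_accumulation_steps": grad_accum_steps,
        "zero_optimization": {"stage": 2},
    }


# Hypothetical layout: 64 sequences per GPU * 16 accumulation steps * 8 GPUs = 8192.
print(json.dumps(make_ds_config(64, 16, 8), indent=2))
```

The resulting dict can be saved as JSON and passed to DeepSpeed as its config file. Precision and optimizer sections are left out because this card does not specify them.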

Other training configuration:

| Parameter | Value |
| --- | --- |
| WARMUP_STEPS | 1800 |
| LR_DECAY | linear |
| ADAM_EPS | 1e-6 |
| ADAM_BETA1 | 0.9 |
| ADAM_BETA2 | 0.98 |
| ADAM_WEIGHT_DECAY | 0.01 |
| PEAK_LR | 1e-3 |
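
The warmup and decay settings above can be sketched as a learning-rate function. This assumes the linear decay reaches zero on the final step (30k, from the pre-training description), which this card does not state. `lr_at` is a hypothetical helper name.

```python
PEAK_LR = 1e-3
WARMUP_STEPS = 1800
TOTAL_STEPS = 30_000  # from the pre-training description


def lr_at(step):
    """Linear warmup to PEAK_LR, then linear decay (assumed to reach 0 at TOTAL_STEPS)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(TOTAL_STEPS - step, 0)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


# Print the learning rate at a few points along the schedule.
for step in (0, 900, 1800, 15_900, 30_000):
    print(step, lr_at(step))
```

Under these assumptions, warmup covers 1800 of 30,000 steps, or 6% of training. The Adam values in the table would map onto an optimizer such as `torch.optim.AdamW(params, lr=1e-3, betas=(0.9, 0.98), eps=1e-6, weight_decay=0.01)`, though the card does not say which Adam variant was used.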