roberta-medium / README.md
---
license: mit
datasets:
  - wikipedia
  - bookcorpus
language:
  - en
metrics:
  - glue
library_name: transformers
---

This is our reproduction of RoBERTa at medium size, built on the official Hugging Face RoBERTa architecture. On the architecture side, RoBERTa is the same as BERT except for its larger vocabulary.

According to Google's BERT releases and BERT-Medium, a medium-sized model should have a config of Layer=8, Hidden=512, #AttnHeads=8, and IntermediateSize=2048. We follow this config to pre-train a RoBERTa-medium model for reproduction.
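As a quick sanity check on the config above, here is a minimal sketch in plain Python. The dict keys are chosen to mirror the argument names of `RobertaConfig` in `transformers`; the helper functions `head_dim` and `encoder_params` are hypothetical and not part of this repository. The parameter count covers only the Transformer encoder layers and excludes embeddings and any task head.

```python
# Hypothetical dict; the keys mirror RobertaConfig argument names.
medium_config = {
    "num_hidden_layers": 8,      # Layer=8
    "hidden_size": 512,          # Hidden=512
    "num_attention_heads": 8,    # #AttnHeads=8
    "intermediate_size": 2048,   # IntermediateSize=2048
}


def head_dim(cfg):
    """Per-head dimension; hidden_size must divide evenly across heads."""
    if cfg["hidden_size"] % cfg["num_attention_heads"] != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")
    return cfg["hidden_size"] // cfg["num_attention_heads"]


def encoder_params(cfg):
    """Count weights and biases in the encoder layers (embeddings excluded)."""
    h = cfg["hidden_size"]
    i = cfg["intermediate_size"]
    attention = 4 * (h * h + h)            # query, key, value, output projections
    feed_forward = (h * i + i) + (i * h + h)
    layer_norms = 2 * (2 * h)              # two LayerNorms, each with scale and bias
    return cfg["num_hidden_layers"] * (attention + feed_forward + layer_norms)


print(head_dim(medium_config))
print(encoder_params(medium_config))
```

With these values the head dimension is 64 and the encoder layers hold roughly 25.2M parameters. The embedding matrix adds to that, and its size depends on the vocabulary, which this README does not state.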

We use the same datasets as BERT (English Wikipedia and BookCorpus) to pre-train for 30k steps with a batch size of 8,192. We also released our reproduction of this dataset on Hugging Face.

We used DeepSpeed ZeRO-2 to optimize training performance.
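For readers unfamiliar with DeepSpeed, the sketch below shows what a ZeRO-2 config could look like for this run. It is an assumption, not the config file we used: only the ZeRO stage, the global batch size, and the Adam hyperparameters come from this README. The per-GPU micro-batch size, gradient accumulation steps, and precision settings are not stated here and are left out.

```python
import json

# Hypothetical DeepSpeed config sketch, not the actual training config.
ds_config = {
    # Global batch size = micro-batch per GPU x grad accumulation x number of GPUs.
    "train_batch_size": 8192,
    # ZeRO stage 2 partitions optimizer states and gradients across data-parallel ranks.
    "zero_optimization": {"stage": 2},
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 1e-3,             # PEAK_LR
            "betas": [0.9, 0.98],   # ADAM_BETA1, ADAM_BETA2
            "eps": 1e-6,            # ADAM_EPS
            "weight_decay": 0.01,   # ADAM_WEIGHT_DECAY
        },
    },
}

# DeepSpeed reads its config as JSON, so serialize the dict that way.
ds_config_json = json.dumps(ds_config, indent=2)
print(ds_config_json)
```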

Other training configuration:

| Parameter | Value |
|---|---|
| WARMUP_STEPS | 1800 |
| LR_DECAY | linear |
| ADAM_EPS | 1e-6 |
| ADAM_BETA1 | 0.9 |
| ADAM_BETA2 | 0.98 |
| ADAM_WEIGHT_DECAY | 0.01 |
| PEAK_LR | 1e-3 |
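
The warm-up and decay settings above imply a learning-rate curve like the one sketched below. This is an illustration under two assumptions that this README does not confirm: the warm-up rises linearly from zero, and the linear decay reaches zero at the final step (30k). The function name `lr_at` is hypothetical.

```python
WARMUP_STEPS = 1800
TOTAL_STEPS = 30_000   # from "pre-train for 30k steps"
PEAK_LR = 1e-3


def lr_at(step):
    """Learning rate at a given step: linear warm-up, then linear decay.

    Assumes warm-up starts at 0 and decay ends at 0 on the last step.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


for step in (0, 900, 1800, 15_900, 30_000):
    print(step, lr_at(step))
```

Under these assumptions the rate peaks at step 1800 and is back to half the peak at step 15,900, the midpoint of the decay phase.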