Multilingual pretraining RoBERTa

This tutorial will walk you through pretraining multilingual RoBERTa.

1) Preprocess the data

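# These paths are the original author's; point them at your own vocabulary and data.
DICTIONARY="/private/home/namangoyal/dataset/XLM/wiki/17/175k/vocab"
DATA_LOCATION="/private/home/namangoyal/dataset/XLM/wiki/17/175k"

# Binarize each language separately against the shared dictionary.
# The loop variable is lowercase so it does not overwrite the LANG locale variable.
for lang in en es it
do
  fairseq-preprocess \
      --only-source \
      --srcdict "$DICTIONARY" \
      --trainpref "$DATA_LOCATION/train.$lang" \
      --validpref "$DATA_LOCATION/valid.$lang" \
      --testpref "$DATA_LOCATION/test.$lang" \
      --destdir "wiki_17-bin/$lang" \
      --workers 60
done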
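
Before moving on to training, it can help to confirm that every language directory was fully binarized. The following is a minimal sketch, not part of fairseq: `check_binarized_dir` is a hypothetical helper, and the file list assumes the usual `fairseq-preprocess --only-source` output layout (`dict.txt` plus a `.bin`/`.idx` pair per split).

```shell
#!/usr/bin/env bash

# Hypothetical helper: report which expected output files are absent from
# one binarized language directory. The file names are an assumption based
# on the usual --only-source output layout.
check_binarized_dir() {
  local dir="$1"
  local status=0
  local f
  for f in dict.txt train.bin train.idx valid.bin valid.idx test.bin test.idx; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f"
      status=1
    fi
  done
  # Return 0 only if every expected file was found.
  return "$status"
}
```

For the three languages above, this could be run as `for l in en es it; do check_binarized_dir "wiki_17-bin/$l"; done`, which prints one line per missing file and nothing when a directory is complete.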

2) Train RoBERTa base

[COMING UP...]