Multilingual RoBERTa pretraining
This tutorial walks you through pretraining a multilingual RoBERTa model.
1) Preprocess the data
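# Example paths from the author's environment; point these at your own data.
DICTIONARY="/private/home/namangoyal/dataset/XLM/wiki/17/175k/vocab"
DATA_LOCATION="/private/home/namangoyal/dataset/XLM/wiki/17/175k"

# Binarize each language separately, reusing the same shared dictionary.
for LANG in en es it
do
    # --only-source: the data is monolingual text, so there is no target side.
    fairseq-preprocess \
        --only-source \
        --srcdict "$DICTIONARY" \
        --trainpref "$DATA_LOCATION/train.$LANG" \
        --validpref "$DATA_LOCATION/valid.$LANG" \
        --testpref "$DATA_LOCATION/test.$LANG" \
        --destdir "wiki_17-bin/$LANG" \
        --workers 60
done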
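
The loop above expects a train, valid, and test file for every language under the data directory. Because binarization can take a while, it may be worth checking that all of those files exist before starting. The following is a minimal sketch of such a check; `check_splits` is a hypothetical helper, not part of fairseq, and it only assumes the `SPLIT.LANG` file naming used above.

```shell
# Hypothetical helper (not part of fairseq): report every missing raw-text
# split so problems surface before preprocessing starts.
# Usage: check_splits DATA_DIR LANG [LANG ...]
check_splits() {
    local data_dir="$1"
    shift
    local missing=0
    local lang split
    for lang in "$@"; do
        for split in train valid test; do
            # Assumes the SPLIT.LANG naming used in the preprocessing loop.
            if [ ! -f "$data_dir/$split.$lang" ]; then
                echo "missing: $data_dir/$split.$lang" >&2
                missing=1
            fi
        done
    done
    # Returns 0 when every file exists, 1 otherwise.
    return "$missing"
}
```

One way to use it is to call `check_splits "$DATA_LOCATION" en es it || exit 1` just before the `for` loop, so the script stops with a list of missing files instead of failing partway through.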
2) Train RoBERTa base
[COMING UP...]