File size: 411 Bytes
e02c7f5 |
1 2 3 4 5 6 7 |
# Prepare data for backtranslation
1. Download data from [CC-100](https://data.statmt.org/cc-100/) website
2. Run `head -500000 <language>.txt > <language>_500K.txt`
3. (Optional) For random sampling from `<language>.txt`, consider using `shuf <language>.txt > | head -500000 > <language>_500K.txt`. If the file is too large to fit in memory, consider using [terashuf](https://github.com/alexandres/terashuf)
|