Training script for cosmo-1b?
#6 by vdmbrsv - opened
Is there training code for cosmo-1b?
We used an internal wrapper around the nanotron library https://github.com/huggingface/nanotron/ — you can adapt this config: https://github.com/loubnabnl/nanotron-smol-cluster/blob/main/brrr/cosmopedia/cosmo_1b.yaml
Hi @loubnabnl, thanks for pointing to the yaml file. I have two questions regarding the data preprocessing part.
- Cosmopedia data was in `prompt`-`text` format. For pretraining, do you simply concatenate prompt and text together to form a document?
- I noticed the datasets in the yaml file have different folder names: `tokenized_text_document`, `tokenized_completion_document`, `tokenized_train_prompt_document`, `tokenized_script_document`. Does this mean different data preparation methods were used for different subsets?
Thanks a lot!
- We only train on the `text` column; the prompts are not used.
- No, we didn't do any special data preparation per subset. The folder names differ only because the target columns had different names at the time; they were all renamed to `text` in Cosmopedia.
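Since only the `text` column is trained on, the preprocessing reduces to tokenizing each document and packing the token stream into fixed-length sequences. Below is a minimal sketch of that packing step; the toy whitespace tokenizer and the `eos_id`/`seq_len` values are illustrative assumptions, not the actual nanotron pipeline.

```python
# Sketch: pack `text`-column documents into fixed-length training sequences,
# separating documents with an EOS token. The tokenizer here is a toy
# whitespace tokenizer for illustration only.

def pack_documents(docs, tokenize, eos_id, seq_len):
    """Concatenate tokenized docs (EOS-separated) and split into seq_len chunks."""
    stream = []
    for doc in docs:
        stream.extend(tokenize(doc))
        stream.append(eos_id)  # mark the document boundary
    # Drop the trailing partial chunk, as most pretraining loaders do.
    n_full = len(stream) // seq_len
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]


vocab = {}

def toy_tokenize(text):
    # Whitespace tokenizer; ids start at 1 (0 is reserved for EOS).
    return [vocab.setdefault(w, len(vocab) + 1) for w in text.split()]


docs = ["a short document", "another one", "the third synthetic document"]
chunks = pack_documents(docs, toy_tokenize, eos_id=0, seq_len=4)
# chunks -> [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 3, 0]]
```

Note that packing crosses document boundaries (a chunk can end with part of one document and start the next), which is standard for causal-LM pretraining; the EOS token is what tells the model where one document ends.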
Thanks @loubnabnl