Training script for cosmo-1b?
#6 by vdmbrsv - opened
Is there training code for cosmo-1b?
We used an internal wrapper around the nanotron library https://github.com/huggingface/nanotron/ — you can adapt this config: https://github.com/loubnabnl/nanotron-smol-cluster/blob/main/brrr/cosmopedia/cosmo_1b.yaml
Hi @loubnabnl, thanks for pointing to the yaml file. I have two questions regarding the data preprocessing part.
- Cosmopedia data was in `prompt`-`text` format. For pretraining, do you simply concatenate prompt and text together to form a document?
- I noticed the datasets in the yaml file have different folder names: `tokenized_text_document`, `tokenized_completion_document`, `tokenized_train_prompt_document`, `tokenized_script_document`. Does this mean different data preparation methods were used for different subsets?
Thanks a lot!
- We only train on the `text` column; the prompts are not used.
- No, we didn't do any special data preparation per subset. The folder names differ only because the target columns had different names at the time; they were all renamed to `text` in Cosmopedia.
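Since only the `text` column is trained on, the preprocessing reduces to tokenizing each document and packing the token stream into fixed-length sequences. Below is a minimal sketch of that packing step; the toy whitespace tokenizer and the `eos_id`/`seq_len` values are illustrative assumptions, not the actual nanotron pipeline.

```python
# Sketch: pack `text`-column documents into fixed-length training sequences,
# separating documents with an EOS token. The tokenizer here is a toy
# whitespace tokenizer for illustration only.

def pack_documents(docs, tokenize, eos_id, seq_len):
    """Concatenate tokenized docs (EOS-separated) and split into seq_len chunks."""
    stream = []
    for doc in docs:
        stream.extend(tokenize(doc))
        stream.append(eos_id)  # mark the document boundary
    # Drop the trailing partial chunk, as most pretraining loaders do.
    n_full = len(stream) // seq_len
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]


vocab = {}

def toy_tokenize(text):
    # Whitespace tokenizer; ids start at 1 (0 is reserved for EOS).
    return [vocab.setdefault(w, len(vocab) + 1) for w in text.split()]


docs = ["a short document", "another one", "the third synthetic document"]
chunks = pack_documents(docs, toy_tokenize, eos_id=0, seq_len=4)
# chunks -> [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 3, 0]]
```

Note that packing crosses document boundaries (a chunk can end with part of one document and start the next), which is standard for causal-LM pretraining; the EOS token is what tells the model where one document ends.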
Thanks @loubnabnl