HuggingFaceTB/cosmo-1b


loubnabnl (HF Staff) committed on Feb 20, 2024

Commit a4c852f · verified · 1 Parent(s): 68d9f23

Update README.md

Files changed (1)

README.md +1 -1


@@ -13,7 +13,7 @@ This is a 1.8B model trained on [Cosmopedia](https://huggingface.co/datasets/Hug
 The training corpus consisted of 30B tokens, 25B of which are synthetic from Cosmopedia. Since we didn't explore the synthetic generation of code, we augmented the dataset with 5B tokens of non-synthetic sources like the `code-python-0.60-to-1.00` and `web-0.50-to-1.00` subsets of [AutoMathText](https://huggingface.co/datasets/math-ai/AutoMathText). We also added 1M files from [The Stack](https://huggingface.co/datasets/bigcode/the-stack)'s Jupyter Notebooks, converted to script. They tend to have educational code interleaved with text.
 We also included [ultrachat](https://huggingface.co/datasets/stingning/ultrachat) formatted in the chat format of `LlaMa` models, so we don't have to instruction-tune the model after the pre-training. Additionally, we upsampled twice the data from these seed sources twice to help with commonsense and reasoning: stories, AutoMathText & KhanAcademy.
-We trained for 6 epochs, resulting in a model trained on 180B tokens with a sequence length of 2k, a global batch size of 1.3M tokens and a learning rate of 3e-4 with a cosine schedule for 14àk steps.
+We trained for 6 epochs, resulting in a model trained on 180B tokens with a sequence length of 2k, a global batch size of 1.3M tokens and a learning rate of 3e-4 with a cosine schedule for 140k steps.
 We used the tokenizer from [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1/).
 # How to use
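
Note on the changed line: the corrected step count can be cross-checked against the other figures in the same sentence. The sketch below is only an arithmetic sanity check, assuming the quoted values (30B-token corpus, 6 epochs, 1.3M-token global batch, 140k steps) are rounded as written. The variable names are illustrative and do not come from the training code.

```python
# Arithmetic cross-check of the figures quoted in the README line.
# All inputs are the rounded values from the text, so only approximate
# agreement is expected.
corpus_tokens = 30e9          # "30B tokens"
epochs = 6                    # "6 epochs"
global_batch_tokens = 1.3e6   # "global batch size of 1.3M tokens"
steps = 140_000               # "140k steps" (the corrected value)

tokens_from_epochs = corpus_tokens * epochs
tokens_from_steps = global_batch_tokens * steps
relative_gap = abs(tokens_from_steps - tokens_from_epochs) / tokens_from_epochs

print(f"epochs x corpus: {tokens_from_epochs / 1e9:.0f}B tokens")
print(f"steps x batch:   {tokens_from_steps / 1e9:.0f}B tokens")
print(f"relative gap:    {relative_gap:.1%}")
```

Both products land close to the 180B tokens stated in the sentence, which is consistent with `140k` being the intended value for the garbled `14àk`.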