---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
library_name: transformers
tags:
- nlp
- llm
---
# K2: a Fully Transparent OSS Language Model at Llama 2 Performance Using 35% Less Compute

With K2, LLM360 demystifies the training recipe used for Llama 2 70B. K2 reaches performance comparable to Llama 2 70B with 65B parameters, trained on approximately 1.4T tokens, resulting in a recipe that uses approximately 35% less compute.
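The compute saving follows from the common back-of-the-envelope approximation that training FLOPs scale with parameter count times training tokens (a rough sketch; actual FLOPs also depend on architecture and sequence length):

```python
# Rough compute comparison: training FLOPs ~ parameters x training tokens.
llama2_70b = 70e9 * 2.0e12   # 70B params, ~2T training tokens
k2 = 65e9 * 1.4e12           # 65B params, ~1.4T training tokens

savings = 1 - k2 / llama2_70b
print(f"{savings:.0%}")  # -> 35%
```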

## Evaluations
<center><img src="eval_table_temp.png" alt="eval table"/></center>

## Datasets and Mix

The following data mix was used to train K2 and achieve results in line with Llama 2 70B. The full data sequence will be available soon.

| Dataset | Starting Tokens | Multiplier | Total Tokens | % of Total |
| ----------- | ----------- | ----------- | ----------- | ----------- |
| dm-math | 4.33B | 3x | 13B | 1% |
| pubmed-abstracts | 4.77B | 3x | 14.3B | 1.1% |
| uspto | 4.77B | 3x | 14.3B | 1.1% |
| pubmed-central | 26B | 1x | 26B | 2% |
| [redpajama.arxiv](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | 27.3B | 1x | 27.3B | 2.1% |
| [starcoder.spm](https://huggingface.co/datasets/bigcode/starcoderdata) | 67.6B | 0.5x | 33.8B | 2.6% |
| [starcoder.fim](https://huggingface.co/datasets/bigcode/starcoderdata) | 67.6B | 0.5x | 33.8B | 2.6% |
| [redpajama.stackexchange](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | 61.1B | 1x | 61.1B | 4.7% |
| [starcoder](https://huggingface.co/datasets/bigcode/starcoderdata) | 132.6B | 0.5x | 66.3B | 5.1% |
| [pile-of-law](https://huggingface.co/datasets/pile-of-law/pile-of-law) | 76.7B | 1x | 76.7B | 5.9% |
| [redpajama.book](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | 80.6B | 1x | 80.6B | 6.2% |
| s2orc | 107.9B | 1x | 107.9B | 8.3% |
| [redpajama.wikipedia](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | 22.1B | 6x | 132.6B | 10.2% |
| [refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) | 612.3B | 1x | 612.3B | 47.1% |
| Totals | - | - | 1.3T | 100% |
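The Total Tokens and % of Total columns can be reproduced from the starting token counts and multipliers (a quick sanity-check script; the names and figures are copied from the table above):

```python
# Recompute each dataset's total tokens (in billions) as starting tokens x multiplier.
mix = {
    "dm-math": (4.33, 3),
    "pubmed-abstracts": (4.77, 3),
    "uspto": (4.77, 3),
    "pubmed-central": (26.0, 1),
    "redpajama.arxiv": (27.3, 1),
    "starcoder.spm": (67.6, 0.5),
    "starcoder.fim": (67.6, 0.5),
    "redpajama.stackexchange": (61.1, 1),
    "starcoder": (132.6, 0.5),
    "pile-of-law": (76.7, 1),
    "redpajama.book": (80.6, 1),
    "s2orc": (107.9, 1),
    "redpajama.wikipedia": (22.1, 6),
    "refinedweb": (612.3, 1),
}

totals = {name: start * mult for name, (start, mult) in mix.items()}
grand_total = sum(totals.values())  # ~1300B tokens, i.e. ~1.3T
shares = {name: t / grand_total for name, t in totals.items()}

print(f"total: {grand_total:.1f}B tokens")
print(f"refinedweb share: {shares['refinedweb']:.1%}")
```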

## First 10 Checkpoints
| Checkpoints | |
| ----------- | ----------- |
| [Checkpoint 360](https://huggingface.co/LLM360/K2/tree/ckpt_360) | [Checkpoint 355](https://huggingface.co/LLM360/K2/tree/ckpt_355) |
| [Checkpoint 359](https://huggingface.co/LLM360/K2/tree/ckpt_359) | [Checkpoint 354](https://huggingface.co/LLM360/K2/tree/ckpt_354) |
| [Checkpoint 358](https://huggingface.co/LLM360/K2/tree/ckpt_358) | [Checkpoint 353](https://huggingface.co/LLM360/K2/tree/ckpt_353) |
| [Checkpoint 357](https://huggingface.co/LLM360/K2/tree/ckpt_357) | [Checkpoint 352](https://huggingface.co/LLM360/K2/tree/ckpt_352) |
| [Checkpoint 356](https://huggingface.co/LLM360/K2/tree/ckpt_356) | [Checkpoint 351](https://huggingface.co/LLM360/K2/tree/ckpt_351) |

To list all checkpoint branches: `git branch -a`
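Since each checkpoint is stored as a branch of the model repository, a specific one can be loaded by passing the branch name as the `revision` argument to `from_pretrained` (a minimal sketch, assuming the `transformers` library is installed; the download calls are left commented out because the 65B-parameter weights are very large):

```python
def checkpoint_revision(step: int) -> str:
    # Intermediate checkpoints live on git branches named "ckpt_<step>".
    return f"ckpt_{step}"

# Uncomment to download a specific intermediate checkpoint (large download):
# from transformers import AutoModelForCausalLM, AutoTokenizer
# model = AutoModelForCausalLM.from_pretrained("LLM360/K2", revision=checkpoint_revision(360))
# tokenizer = AutoTokenizer.from_pretrained("LLM360/K2", revision=checkpoint_revision(360))

print(checkpoint_revision(360))  # -> ckpt_360
```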

## Additional Artifacts
We are working on release-caliber artifacts for the dataset, code, and analysis, which will be released over the next few weeks.

|
## Model Description

- **Model type:** 65 billion parameter language model with the same architecture as LLaMA.
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Resources for more information:**
  - Training Code: TBD
  - Data Preparation: TBD
  - Metrics: TBD
  - Fully processed K2 pretraining dataset: TBD
## About LLM360
LLM360 is an initiative for comprehensive and fully open-sourced LLMs, where all training details, model checkpoints, intermediate results, and additional analyses are made available to the community. Our goal is to advance the field by inviting the community to deepen the understanding of LLMs together. As the first step of the LLM360 project, we release all intermediate model checkpoints, our fully prepared pre-training dataset, all source code and configurations, and training details. We are committed to continually pushing the boundaries of LLMs through this open-source effort.

[Visit us](https://www.llm360.ai/)