Update Readme.md
#4
by michaelfeil - opened
README.md
CHANGED
@@ -9,17 +9,17 @@ tags:
 
 
 
-This model extends LLama-3 8B's context length from 8k to > 160K, developed by Gradient, sponsored by compute from Crusoe Energy. It demonstrates that SOTA LLMs can learn to operate on long context with minimal training (< 200M tokens) by appropriately adjusting RoPE theta.
+This model extends LLama-3 8B's context length from 8k to > 160K, developed by Gradient, sponsored by compute from [Crusoe Energy](https://huggingface.co/crusoeai). It demonstrates that SOTA LLMs can learn to operate on long context with minimal training (< 200M tokens) by appropriately adjusting RoPE theta.
 
 **Approach:**
 
 - [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) as the base
 - NTK-aware interpolation [1] to initialize an optimal schedule for RoPE theta, followed by a new data-driven RoPE theta optimization technique
--
+- Progressive training on increasing context lengths similar to the [Large World Model](https://huggingface.co/LargeWorldModel) [2] (See details below)
 
 **Infra:**
 
-We build on top of the EasyContext Blockwise RingAttention library [3] to scalably and efficiently train on contexts up to
+We build on top of the EasyContext Blockwise RingAttention library [3] to scalably and efficiently train on contexts up to 262144 tokens on [Crusoe Energy](https://huggingface.co/crusoeai) high performance L40S cluster.
 
 **Data:**
 
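As a side note on the RoPE theta change described in this diff, below is a minimal, hypothetical sketch of overriding `rope_theta` when loading the base model with `transformers`. The value `rope_theta=4_000_000` is an illustrative assumption and not the optimized schedule from this PR; `262144` simply matches the context length mentioned above, and Gradient's data-driven theta optimization and the EasyContext RingAttention training setup are not reproduced here.

```python
# Illustrative sketch only (assumed values, not this PR's trained configuration):
# load Meta-Llama-3-8B-Instruct with a larger RoPE base frequency to stretch the
# usable context. Llama-3-8B ships with rope_theta=500_000 and an 8192-token window.
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

config = AutoConfig.from_pretrained(model_id)
config.rope_theta = 4_000_000             # assumed value; raises the RoPE base frequency
config.max_position_embeddings = 262144   # advertise the longer context window

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, config=config, torch_dtype="auto")
```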