Update README.md
README.md
CHANGED
@@ -15,6 +15,8 @@ tags:
 - Small
 - Huggingface
 - Allenai
+- SFT
+- DPO
 pipeline_tag: text-generation
 ---
 
@@ -22,17 +24,17 @@ pipeline_tag: text-generation
 
 
-SmolTulu-v0 is the first model in a series of models meant to leverage [AllenAI's Tulu 3 post-training pipeline](https://allenai.org/blog/tulu-3-technical) to tune the [base version of Huggingface's SmolLM2-1.7b](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B)! The post training pipeline AllenAI came up with seemed like something perfect to apply here.
+SmolTulu-v0.1 is the first model in a series meant to leverage [AllenAI's Tulu 3 post-training pipeline](https://allenai.org/blog/tulu-3-technical) to tune the [base version of Huggingface's SmolLM2-1.7b](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B)! AllenAI's post-training pipeline seemed like a perfect fit to apply here.
 
 This model achieves the highest current IFEval score while maintaining the extremely low contamination levels of Tulu 3 and SmolLM2! I've listed the datasets used for both the SFT (supervised finetuning) and DPO (direct preference optimization) stages.
 
-## Why v0?
+## Why v0.1?
 
-There's a few reasons on why I
+There are a few reasons why I call this model v0.1:
 
-1. The model still lags behind the instruction tuned version of SmolLM2 in
+1. The model still lags behind the instruction-tuned version of SmolLM2 in some other metrics.
 2. This model has only undergone SFT and DPO; the RLVR (reinforcement learning with verifiable rewards) stage was too computationally expensive to run on a model that could be better.
-3. Initial hyperparameter choice was naive, through some napkin math I've been able to find a much better learning rate that scales the one found in the Tulu 3 paper according to my computational resources better.
+3. Initial hyperparameter choice during training was naive; through some napkin math I've found a much better learning rate that scales the one from the Tulu 3 paper to better match my computational resources.
 
 # Evaluation
 
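
The learning-rate scaling mentioned in point 3 isn't spelled out in the diff; a common heuristic it may be referring to is the linear scaling rule, where the learning rate is scaled in proportion to the change in effective batch size. A minimal sketch — the function name and all numeric values below are placeholders for illustration, not the actual Tulu 3 or SmolTulu hyperparameters:

```python
def scale_lr(base_lr: float, base_batch_size: int, new_batch_size: int) -> float:
    """Linear scaling rule: scale the learning rate in proportion
    to the ratio of the new effective batch size to the original one."""
    return base_lr * (new_batch_size / base_batch_size)


# Hypothetical example: a paper's recipe uses lr=5e-6 at an effective
# batch size of 128, but local compute only allows a batch size of 8.
lr = scale_lr(base_lr=5e-6, base_batch_size=128, new_batch_size=8)
print(lr)  # 3.125e-07
```

Whether the napkin math here used linear scaling or another heuristic (such as square-root scaling) isn't stated in the diff.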