Update README.md

README.md CHANGED

@@ -118,19 +118,13 @@ model-index:

# SmolLM2 1.7b Instruction Tuned & DPO Aligned through Tulu 3!

SmolTulu-v0.1 is the first model in a series of models meant to leverage [AllenAI's Tulu 3 post-training pipeline](https://allenai.org/blog/tulu-3-technical) to tune the [base version of Huggingface's SmolLM2-1.7b](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B)! The post-training pipeline AllenAI came up with seemed like a perfect fit to apply here.

This model achieves the highest current scores on both IFEval and GSM8k while maintaining the extremely low contamination levels of Tulu 3 and SmolLM2! I've listed the datasets used for both the SFT (supervised finetuning) and DPO (direct preference optimization) stages.

-There are a few reasons why I like calling this model v0.1:
-
-1. The model still lags behind the instruction-tuned version of SmolLM2 on some other metrics.
-2. This model has only undergone SFT and DPO; the RLVR (reinforcement learning with verifiable rewards) stage was too computationally expensive to run on a model that could be better.
-3. The initial hyperparameter choice during training was naive; through some napkin math I've found a much better learning rate by scaling the one from the Tulu 3 paper to my computational resources.
+Something important to note: this model has only undergone SFT and DPO; the RLVR (reinforcement learning with verifiable rewards) stage was too computationally expensive to run properly.
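The learning-rate adjustment mentioned above, scaling a reference learning rate from the Tulu 3 paper down to a smaller compute budget, is commonly approximated as linear scaling with effective batch size. The sketch below is only illustrative: the function name and every number in it are placeholders, not the actual hyperparameters from the Tulu 3 paper or from this model's training run.

```python
# Illustrative sketch of linear learning-rate scaling with effective batch
# size. All values are placeholders, NOT the hyperparameters actually used
# for Tulu 3 or for SmolTulu-v0.1.

def scale_lr(reference_lr: float, reference_batch: int, local_batch: int) -> float:
    """Scale a reference learning rate linearly with effective batch size."""
    return reference_lr * (local_batch / reference_batch)

# e.g. if a reference run used lr=5e-6 at an effective batch size of 128,
# a run at an effective batch size of 32 would use:
local_lr = scale_lr(5e-6, 128, 32)
print(local_lr)  # 1.25e-06
```

Whether linear or square-root scaling is the better heuristic depends on the optimizer and batch-size regime; this card does not spell out which rule the napkin math used.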
# Evaluation