---
license: apache-2.0
new_version: Sweaterdog/Smol-reason2.1-LoRA
---

# 🧠 Smol-reason, a 3B model test for future models 🧠

## Why?

While making the Andy series of models, I had been using PPO techniques to train them.

But as the bleeding edge of small models becomes clearer, reasoning models are emerging as the winners.

So, to learn the nuances of training reasoning models, I decided to train a small 3B model using GRPO techniques instead of PPO.

---

The base model was Qwen2.5 3B; it is very capable as-is, and even smarter with reasoning.

This model uses the following format when responding:

```
<think>
--reasoning content here--
</think>
<answer>
--answer content here--
</answer>
```

This is similar to the XML reasoning format, but changed to use DeepSeek-R1 / QwQ-style thinking blocks.
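Since downstream code needs to separate the reasoning from the final answer, here is a minimal sketch of how a response in this format could be parsed. The function name `parse_response` and the sample text are illustrative, not part of the model's release:

```python
import re

def parse_response(text):
    """Split a model response into (reasoning, answer) strings.

    Returns None for either part if its tag block is missing.
    """
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return (
        think.group(1).strip() if think else None,
        answer.group(1).strip() if answer else None,
    )

# Hypothetical model output in the documented format
sample = "<think>\n2 + 2 is 4.\n</think>\n<answer>\n4\n</answer>"
reasoning, final = parse_response(sample)
print(reasoning)  # 2 + 2 is 4.
print(final)      # 4
```

The non-greedy match with `re.DOTALL` keeps multi-line reasoning intact while stopping at the first closing tag.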