Disclaimer: The model was trained on AWS on a g4dn.4xlarge instance
(Tesla T4, 1 GPU, max memory 14.563 GB; Platform: Linux; Torch: 2.6.0+cu124; CUDA compute capability: 7.5; CUDA Toolkit: 12.4; Triton: 3.2.0).
Model Details
This is a math reasoning model, built by converting the standard Qwen2.5 3B Instruct model using GRPO (Group Relative Policy Optimization),
a reinforcement learning algorithm that optimizes responses against reward functions.
We defined reward functions to teach the model to reason step by step, and fine-tuned Qwen2.5 3B Instruct on OpenAI's GSM8K dataset,
which contains grade-school math word problems.
Training took 10 hours on a Tesla T4. You can find the code used to train the model here.
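GRPO samples a group of completions per prompt and scores each one with the reward functions. A minimal sketch of one possible correctness reward for GSM8K-style answers is below; the function name and the exact-match rule are illustrative assumptions, not the actual rewards used in training:

```python
import re

def correctness_reward(completions, answers):
    """Return 1.0 for each completion whose final number matches the gold
    answer, else 0.0.

    Illustrative sketch: assumes the model states its numeric answer last,
    and that `answers` holds the gold GSM8K answers as strings (e.g. "72").
    """
    rewards = []
    for completion, answer in zip(completions, answers):
        # Find every number in the completion, ignoring thousands separators.
        numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
        # Exact match on the last number produced.
        rewards.append(1.0 if numbers and numbers[-1] == answer.strip() else 0.0)
    return rewards
```

TRL's `GRPOTrainer` accepts a list of such callables via its `reward_funcs` argument; in practice several rewards (answer correctness, output format, reasoning length) are usually combined.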
Uploaded model
- Developed by: ugriffo
- License: apache-2.0
- Finetuned from model: unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit
This Qwen2.5 model was trained 2x faster with Unsloth and Hugging Face's TRL library.
