Akshint47 committed · verified
Commit 808df0f · 1 Parent(s): edd71ab

Update README.md
Files changed (1): README.md (+25 −0)
README.md CHANGED
@@ -1,3 +1,28 @@
+ ---
+ license: apache-2.0
+ language:
+ - en
+ base_model:
+ - unsloth/Qwen2.5-3B-Instruct-unsloth-bnb-4bit
+ library_name: adapter-transformers
+ tags:
+ - text-generation-inference
+ - transformers
+ - unsloth
+ - qwen2
+ - trl
+ - grpo
+ ---
+
+ # Uploaded model
+
+ - **Developed by:** Akshint47
+ - **License:** apache-2.0
+ - **Finetuned from model:** unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit
+
+ This qwen2 model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.
+
+ [<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)
  # Fine-Tuning Qwen2.5-3B-Instruct with GRPO for GSM8K Dataset

  This notebook demonstrates the process of fine-tuning the **Qwen2.5-3B-Instruct** model using **GRPO (Group Relative Policy Optimization)** on the **GSM8K** dataset. The goal is to improve the model's ability to solve mathematical reasoning problems by leveraging reinforcement learning with custom reward functions.
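The custom reward functions mentioned above can be sketched as a minimal correctness reward for GSM8K-style completions. This is an illustrative assumption, not the notebook's actual code: the `<answer>...</answer>` tag format, the function name `correctness_reward`, and the `answers` parameter are hypothetical; the signature follows TRL's reward-function convention (a list of `completions` plus extra dataset columns as keyword arguments).

```python
import re

# Hypothetical correctness reward for GSM8K-style outputs (assumptions:
# completions wrap the final numeric answer in <answer>...</answer> tags,
# and reference answers arrive as a parallel list of strings).
def correctness_reward(completions, answers, **kwargs):
    """Return 1.0 per completion whose extracted answer matches the reference."""
    rewards = []
    for completion, reference in zip(completions, answers):
        match = re.search(r"<answer>\s*(-?[\d,.]+)\s*</answer>", completion)
        # Strip thousands separators so "1,000" matches "1000".
        predicted = match.group(1).replace(",", "") if match else None
        rewards.append(1.0 if predicted == str(reference) else 0.0)
    return rewards
```

In TRL, a function like this would typically be passed to `GRPOTrainer` through its `reward_funcs` argument alongside a `GRPOConfig`; the exact wiring and reward shaping depend on the notebook.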