Update model card: base model + SFT + GRPO adapter details

d33f2b6 verified about 1 month ago

1.82 kB

	---
	base_model: Qwen/Qwen3-4B-Instruct-2507
	library_name: peft
	tags:
	- lora
	- sft
	- grpo
	- reinforcement-learning
	- math
	- tool-use
	---

	# Qwen3-4B-Instruct-2507 — Capstone MathRL

	Fine-tuned from [Qwen/Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) using a two-stage SFT → GRPO pipeline for mathematical reasoning with calculator tool use.

	Author: Mohammad Rafi

	---

	## Base Model

	- Model: `Qwen/Qwen3-4B-Instruct-2507`
	- Parameters: 4B
	- Context length: 32k tokens

	---

	## SFT Adapter — `sft_adapter/`

	\| Parameter \| Value \|
	\|-----------\|-------\|
	\| Method \| LoRA (Supervised Fine-Tuning) \|
	\| LoRA rank \| 32 \|
	\| Epochs \| 2 \|
	\| Training samples \| 500 \|
	\| Task \| Math reasoning (GSM8K + NuminaMath) \|
	\| Size \| 270.92 MB \|

	---

	## GRPO Adapter — `grpo_adapter/`

	\| Parameter \| Value \|
	\|-----------\|-------\|
	\| Method \| GRPO (Group Relative Policy Optimization) \|
	\| Training samples \| 400 \|
	\| Group size \| 8 \|
	\| Learning rate \| 3e-6 \|
	\| Substeps \| 1 \|
	\| Curriculum \| easy → intermediate → hard \|
	\| Size \| 270.92 MB \|

	> Recommended: Use `grpo_adapter/` — trained through the full SFT + GRPO pipeline.

	---

	## Usage

	```python
	from peft import PeftModel
	from transformers import AutoModelForCausalLM, AutoTokenizer

	base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")
	tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")

	# Load GRPO adapter (recommended)
	model = PeftModel.from_pretrained(base, "MohammadRafiML/Qwen3-4B-Instruct-2507-Capstone-MathRL", subfolder="grpo_adapter")
	model = model.merge_and_unload()

	# Load SFT adapter only
	# model = PeftModel.from_pretrained(base, "MohammadRafiML/Qwen3-4B-Instruct-2507-Capstone-MathRL", subfolder="sft_adapter")
	# model = model.merge_and_unload()
	```