	---
	license: apache-2.0
	base_model: Qwen/Qwen2.5-7B-Instruct
	tags:
	- qwen2.5
	- grpo
	- rlhf
	- math
	- reasoning
	- ms-swift
	datasets:
	- AI-MO/NuminaMath-TIR
	language:
	- en
	library_name: transformers
	pipeline_tag: text-generation
	---

	# Qwen2.5-7B-Instruct-GRPO-Math

	This model is a LoRA fine-tune of [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct), trained with GRPO (Group Relative Policy Optimization) on mathematical reasoning tasks. The repository holds the adapter weights, which are loaded on top of the base model.

	## Model Description

	- Base Model: Qwen2.5-7B-Instruct
	- Training Method: GRPO (Reinforcement Learning)
	- Training Framework: [ms-swift](https://github.com/modelscope/ms-swift)
	- Training Data: [AI-MO/NuminaMath-TIR](https://huggingface.co/datasets/AI-MO/NuminaMath-TIR) (500 samples)
	- Hardware: 1x NVIDIA H100 PCIe (80GB)
	- Training Time: ~2.5 hours

	## Training Details

	### Training Configuration

	```bash
	CUDA_VISIBLE_DEVICES=0 \
	swift rlhf \
	    --rlhf_type grpo \
	    --model Qwen/Qwen2.5-7B-Instruct \
	    --reward_funcs accuracy format \
	    --train_type lora \
	    --lora_rank 8 \
	    --lora_alpha 32 \
	    --target_modules all-linear \
	    --torch_dtype bfloat16 \
	    --dataset 'AI-MO/NuminaMath-TIR#500' \
	    --num_train_epochs 1 \
	    --per_device_train_batch_size 2 \
	    --learning_rate 5e-5 \
	    --num_generations 2
	```
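For intuition about what `--num_generations 2` means for GRPO, the sketch below computes group-relative advantages: each completion's reward is normalized against the mean and standard deviation of the rewards in its own group. This is a minimal illustration, not ms-swift's implementation; the function name `group_relative_advantages` and the `eps` value are assumptions made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against its own group's mean and std.

    `eps` is a hypothetical stabilizer to avoid division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, two sampled completions (matching --num_generations 2).
# One correct and one incorrect completion: advantages have opposite signs.
print(group_relative_advantages([1.0, 0.0]))

# Both completions earn the same reward: every advantage is zero,
# so this group contributes no learning signal under this formula.
print(group_relative_advantages([1.0, 1.0]))
```

Under this formulation, a group of two only produces a gradient when the two completions receive different rewards, which is one reason larger group sizes are often used when memory allows.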

	### Training Metrics

	- Final Loss: 0.00011567
	- Math Accuracy: 70%
	- Reward: 0.7
	- Training Steps: 500
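Figures like the accuracy and reward above are averages of per-completion scores. The sketch below shows what accuracy-style and format-style reward functions could look like. It is a hypothetical illustration only: the function names, the `\boxed{...}` answer convention, and the exact-string comparison are assumptions, and the `accuracy` and `format` rewards built into ms-swift may behave differently.

```python
import re
from statistics import mean

# Assumed convention: the final answer appears as \boxed{...}.
BOXED = re.compile(r"\\boxed\{([^{}]+)\}")


def format_reward(completion: str) -> float:
    """Return 1.0 if the completion contains a boxed answer, else 0.0."""
    return 1.0 if BOXED.search(completion) else 0.0


def accuracy_reward(completion: str, reference: str) -> float:
    """Return 1.0 if the boxed answer equals the reference string, else 0.0."""
    match = BOXED.search(completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference.strip() else 0.0


# Three made-up completions for a problem whose reference answer is "4".
completions = ["2 + 2 = \\boxed{4}", "The answer is 4.", "\\boxed{5}"]
print("mean accuracy:", mean(accuracy_reward(c, "4") for c in completions))
print("mean format:  ", mean(format_reward(c) for c in completions))
```

A real accuracy reward would usually compare answers mathematically (so that `1/2` and `0.5` match) rather than as strings.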

	## Usage

	### Using with Transformers

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	from peft import PeftModel

	# Load the base model
	base_model = AutoModelForCausalLM.from_pretrained(
	    "Qwen/Qwen2.5-7B-Instruct",
	    torch_dtype="auto",
	    device_map="auto",
	)

	# Attach the LoRA adapter from this repository
	model = PeftModel.from_pretrained(
	    base_model,
	    "FutureMa/Qwen2.5-7B-Instruct-GRPO-Math",
	)

	tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

	# Build the chat prompt
	messages = [
	    {"role": "user", "content": "Solve for x: 2x^2 - 3x + 1 = 0"}
	]
	text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
	inputs = tokenizer([text], return_tensors="pt").to(model.device)

	# Generate, then decode only the newly generated tokens (not the prompt)
	outputs = model.generate(**inputs, max_new_tokens=512)
	new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
	print(tokenizer.decode(new_tokens, skip_special_tokens=True))
	```
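Generated solutions are worth checking independently. For the example prompt, the quadratic factors as (2x - 1)(x - 1), so the expected roots are x = 1 and x = 1/2. The sketch below substitutes candidate answers back into the equation using exact rational arithmetic; the helper name `residual` is an assumption made for this example.

```python
from fractions import Fraction


def residual(x, a=2, b=-3, c=1):
    """Evaluate a*x^2 + b*x + c; a root gives exactly zero."""
    return a * x * x + b * x + c


# Candidate answers, e.g. parsed by hand from the model's output.
candidates = [Fraction(1), Fraction(1, 2), Fraction(2)]
for x in candidates:
    status = "root" if residual(x) == 0 else "not a root"
    print(f"x = {x}: {status}")
```

Using `Fraction` avoids floating-point rounding, so the zero test is exact.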

	### Using with ms-swift

	```bash
	# Inference
	swift infer \
	    --ckpt_dir FutureMa/Qwen2.5-7B-Instruct-GRPO-Math \
	    --eval_human false
	```

	## Intended Use

	This model is intended for:
	- ✅ Mathematical reasoning and problem-solving
	- ✅ Step-by-step solution generation
	- ✅ Algebraic equation solving
	- ✅ Arithmetic calculations

	## Limitations

	- Trained on a relatively small dataset (500 samples)
	- May not generalize well to very complex mathematical problems
	- LoRA fine-tuning (rank 8) updates far fewer parameters than full fine-tuning, which may limit how much the model's behavior can change
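To put the LoRA capacity point in numbers, the sketch below estimates the trainable parameter count of a rank-8 adapter. A LoRA pair on a linear layer with shape `d_in x d_out` adds `rank * (d_in + d_out)` parameters. The layer shapes are assumptions based on the published Qwen2.5-7B architecture and should be checked against the model's `config.json`; the sketch also assumes `all-linear` covers the seven projection matrices in each transformer block and excludes the embeddings and output head.

```python
# Assumed Qwen2.5-7B shapes; verify against config.json before relying on them.
HIDDEN = 3584          # hidden_size
INTERMEDIATE = 18944   # intermediate_size (MLP)
KV_DIM = 512           # 4 key/value heads x 128 head dim
NUM_LAYERS = 28        # num_hidden_layers
RANK = 8               # --lora_rank from the training configuration

# (d_in, d_out) for each linear layer in one transformer block.
LINEAR_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in, d_out, rank):
    """Parameters in one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


per_layer = sum(lora_params(i, o, RANK) for i, o in LINEAR_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable LoRA parameters")  # 20,185,088 under these assumptions
```

Under these assumptions the adapter trains about 20 million parameters, well under 1% of the base model's roughly 7 billion.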

	## Citation

	```bibtex
	@misc{qwen2.5-grpo-math,
	    author = {FutureMa},
	    title = {Qwen2.5-7B-Instruct Fine-tuned with GRPO on Math Tasks},
	    year = {2025},
	    publisher = {HuggingFace},
	    howpublished = {\url{https://huggingface.co/FutureMa/Qwen2.5-7B-Instruct-GRPO-Math}}
	}
	```

	## Acknowledgments

	- Base model: [Qwen Team](https://huggingface.co/Qwen)
	- Training framework: [ms-swift](https://github.com/modelscope/ms-swift)
	- Dataset: [AI-MO/NuminaMath-TIR](https://huggingface.co/datasets/AI-MO/NuminaMath-TIR)