---
license: apache-2.0
base_model: Qwen/Qwen2.5-7B-Instruct
tags:
- qwen2.5
- grpo
- rlhf
- math
- reasoning
- ms-swift
datasets:
- AI-MO/NuminaMath-TIR
language:
- en
library_name: transformers
pipeline_tag: text-generation
---

# Qwen2.5-7B-Instruct-GRPO-Math

This model is a LoRA fine-tune of [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct), trained with **GRPO (Group Relative Policy Optimization)** on mathematical reasoning tasks.

## Model Description

- **Base Model**: Qwen2.5-7B-Instruct
- **Training Method**: GRPO (Reinforcement Learning)
- **Training Framework**: [ms-swift](https://github.com/modelscope/ms-swift)
- **Training Data**: [AI-MO/NuminaMath-TIR](https://huggingface.co/datasets/AI-MO/NuminaMath-TIR) (500 samples)
- **Hardware**: 1x NVIDIA H100 PCIe (80GB)
- **Training Time**: ~2.5 hours

## Training Details

### Training Configuration

```bash
CUDA_VISIBLE_DEVICES=0 \
swift rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen2.5-7B-Instruct \
    --reward_funcs accuracy format \
    --train_type lora \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --torch_dtype bfloat16 \
    --dataset 'AI-MO/NuminaMath-TIR#500' \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --learning_rate 5e-5 \
    --num_generations 2
```

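The `--reward_funcs accuracy format` flag combines a correctness signal with a structural one. The sketch below is only an illustration of that idea, not ms-swift's implementation: the `<think>`/`<answer>` tag layout, the exact-string answer comparison, the equal-weight sum, and the names `format_reward`, `accuracy_reward`, and `total_reward` are all assumptions made for this example.

```python
import re

# Hypothetical stand-ins for the `accuracy` and `format` rewards.
# The tag layout and matching rules here are assumptions for illustration;
# the actual ms-swift reward functions may behave differently.
FORMAT_PATTERN = re.compile(
    r"^<think>.*?</think>\s*<answer>.*?</answer>\s*$", re.DOTALL
)
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def format_reward(completion: str) -> float:
    """Return 1.0 if the completion follows the assumed tag layout, else 0.0."""
    return 1.0 if FORMAT_PATTERN.match(completion.strip()) else 0.0


def accuracy_reward(completion: str, reference: str) -> float:
    """Return 1.0 if the extracted answer equals the reference string, else 0.0."""
    found = ANSWER_PATTERN.search(completion)
    if found is None:
        return 0.0
    return 1.0 if found.group(1).strip() == reference.strip() else 0.0


def total_reward(completion: str, reference: str) -> float:
    """Sum both rewards with equal weight (an assumption of this sketch)."""
    return accuracy_reward(completion, reference) + format_reward(completion)
```

A production accuracy check would normally compare answers mathematically rather than as strings, so that `1/2` and `0.5` count as the same result.
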
### Training Metrics

- **Final Loss**: 0.00011567
- **Math Accuracy**: 70%
- **Reward**: 0.7
- **Training Steps**: 500

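GRPO does not use the raw reward directly; it scores each completion relative to the other completions sampled for the same prompt. The following is a minimal sketch of that normalisation, assuming a population standard deviation and a small `eps` for numerical stability (implementations differ on both details, and the function name is hypothetical).

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalise each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some trainers use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# With --num_generations 2, each prompt yields a group of two completions.
mixed = group_relative_advantages([1.0, 0.0])  # one correct, one wrong
tied = group_relative_advantages([1.0, 1.0])   # both correct
print(mixed)
print(tied)
```

When both completions in a group receive the same reward, every advantage is zero and that prompt contributes no learning signal. With groups of only two this happens often, which is one reason larger group sizes are commonly used when memory allows.
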
## Usage

### Using with Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    torch_dtype="auto",
    device_map="auto"
)

# Load LoRA adapter on top of the base model
model = PeftModel.from_pretrained(
    base_model,
    "FutureMa/Qwen2.5-7B-Instruct-GRPO-Math"
)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Build the chat prompt
messages = [
    {"role": "user", "content": "Solve for x: 2x^2 - 3x + 1 = 0"}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate
outputs = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, not the echoed prompt
generated = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(generated, skip_special_tokens=True))
```

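To sanity-check the model's response to the sample prompt above, the expected roots can be computed directly with the quadratic formula. This is a small standalone helper written for this card (the name `solve_quadratic` is hypothetical) and it assumes `a` is non-zero.

```python
import math


def solve_quadratic(a, b, c):
    """Return the real roots of a*x^2 + b*x + c = 0 in ascending order.

    Assumes a != 0. Returns an empty list when there are no real roots.
    """
    disc = b * b - 4 * a * c
    if disc < 0:
        return []
    root = math.sqrt(disc)
    # A set collapses the repeated root when the discriminant is zero.
    return sorted({(-b - root) / (2 * a), (-b + root) / (2 * a)})


roots = solve_quadratic(2, -3, 1)
print(roots)  # [0.5, 1.0]
```

A correct model response for `2x^2 - 3x + 1 = 0` should therefore arrive at `x = 1/2` and `x = 1`.
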
### Using with ms-swift

```bash
# Inference
swift infer \
    --ckpt_dir FutureMa/Qwen2.5-7B-Instruct-GRPO-Math \
    --eval_human false
```

## Intended Use

This model is optimized for:
- ✅ Mathematical reasoning and problem-solving
- ✅ Step-by-step solution generation
- ✅ Algebraic equation solving
- ✅ Arithmetic calculations

## Limitations

- Trained on a relatively small dataset (500 samples)
- May not generalize well to very complex mathematical problems
- LoRA fine-tuning may have limited capacity compared to full fine-tuning

## Citation

```bibtex
@misc{qwen2.5-grpo-math,
  author = {FutureMa},
  title = {Qwen2.5-7B-Instruct Fine-tuned with GRPO on Math Tasks},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/FutureMa/Qwen2.5-7B-Instruct-GRPO-Math}}
}
```

## Acknowledgments

- Base model: [Qwen Team](https://huggingface.co/Qwen)
- Training framework: [ms-swift](https://github.com/modelscope/ms-swift)
- Dataset: [AI-MO/NuminaMath-TIR](https://huggingface.co/datasets/AI-MO/NuminaMath-TIR)