DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Paper • 2501.12948
A Qwen2.5-1.5B-Instruct model fine-tuned with GRPO (Group Relative Policy Optimization) on the GSM8K math dataset. The model learned to solve grade-school math word problems through reinforcement learning: no human-labeled reasoning chains, just a binary correctness signal.
GRPO is the same RL technique used to train DeepSeek-R1. Instead of learning from labeled step-by-step solutions, the model:

1. Samples a group of completions for each prompt (here, 4 per problem).
2. Scores each completion with a reward function (1.0 for a correct final answer, 0.0 otherwise).
3. Computes each completion's advantage relative to its group's mean reward.
4. Updates the policy to make above-average completions more likely.
This teaches the model to reason and self-verify without explicit reasoning supervision.
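The group-relative part of the steps above can be sketched in a few lines. This is an illustrative reduction, not TRL's internals: each completion's reward is normalized against its own group's mean and standard deviation, so a correct answer in a mostly-wrong group earns a large positive advantage.

```python
def group_relative_advantages(rewards, eps=1e-4):
    """GRPO-style advantage: normalize each reward by its group's mean and std."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# One prompt, a group of 4 sampled completions: three correct, one wrong.
advantages = group_relative_advantages([1.0, 1.0, 0.0, 1.0])
# Correct completions get a positive advantage, the wrong one a negative one.
```

Note that a group where every completion gets the same reward yields zero advantage for all of them: with a binary reward, the learning signal comes entirely from groups that mix correct and incorrect answers.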
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen2.5-1.5B-Instruct |
| Method | GRPO with LoRA (r=16, alpha=32) |
| Dataset | openai/gsm8k (7,473 math problems) |
| Hardware | NVIDIA RTX 5090 (32GB VRAM) |
| Training time | ~8.4 hours |
| Epochs | 1 |
| Batch size | 1 (gradient accumulation: 8, effective: 8) |
| Learning rate | 5e-6 (cosine schedule, 20 warmup steps) |
| GRPO group size | 4 completions per prompt |
| Max completion length | 512 tokens |
| Precision | bf16 |
| Framework | TRL 0.29.1 + Transformers 5.3.0 |
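In TRL, the table above maps roughly onto `GRPOConfig` and `GRPOTrainer`. The sketch below is a reconstruction under assumptions, not the actual training script: parameter names follow TRL's GRPO API, while `output_dir`, the dataset wiring, and the reward function variable are placeholders.

```python
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

config = GRPOConfig(
    output_dir="qwen-1.5b-grpo-math",  # placeholder
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
    warmup_steps=20,
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_generations=4,                 # GRPO group size
    max_completion_length=512,
    bf16=True,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    args=config,
    train_dataset=dataset,             # prompts built from openai/gsm8k
    reward_funcs=correctness_reward,   # the reward function described below
    peft_config=LoraConfig(r=16, lora_alpha=32),
)
trainer.train()
```

This is a configuration fragment rather than a runnable script; TRL also places constraints on how the batch size relates to `num_generations`, so the exact batch wiring may have differed in practice.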
| Metric | Value |
|---|---|
| Starting accuracy | 25% (base model, before RL) |
| Final accuracy (MA-50, 50-step moving average) | 83.3% |
| Peak batch accuracy | 100% |
The model went from solving 1 in 4 math problems to more than 4 in 5 through pure RL training.
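Per-batch accuracy is noisy with a batch this small, which is why the headline number is a moving average over recent batches. A minimal sketch of tracking such a metric, with the 50-batch window taken from the table's naming and the update pattern assumed:

```python
from collections import deque

class MovingAccuracy:
    """Running mean of per-batch accuracy over the last `window` batches."""
    def __init__(self, window=50):
        self.history = deque(maxlen=window)

    def update(self, batch_accuracy):
        self.history.append(batch_accuracy)
        return sum(self.history) / len(self.history)

ma = MovingAccuracy(window=50)
for acc in [0.25, 0.5, 0.75, 1.0]:  # illustrative per-batch accuracies
    smoothed = ma.update(acc)
print(round(smoothed, 3))  # mean of the four values: 0.625
```

The `deque(maxlen=...)` automatically drops the oldest batch once the window is full, so old accuracies stop influencing the reported number.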
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model, then attach the GRPO-trained LoRA adapter
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct", torch_dtype="auto", device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "usama10/qwen-1.5b-grpo-math")
tokenizer = AutoTokenizer.from_pretrained("usama10/qwen-1.5b-grpo-math")

messages = [
    {"role": "system", "content": "You are a math tutor. Solve the problem step by step, then give the final numerical answer after ####."},
    {"role": "user", "content": "A store sells apples for $2 each and oranges for $3 each. If Sarah buys 5 apples and 3 oranges, how much does she spend?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
The reward function extracts the final numerical answer after `####` from the model's output and compares it to the ground truth:
```python
import re

def extract_number_after_hashes(text):
    # Number following the "####" marker (GSM8K answer convention); None if absent
    m = re.search(r"####\s*(-?[\d,.]+)", text)
    return m.group(1).replace(",", "") if m else None

def correctness_reward(completion, ground_truth):
    predicted = extract_number_after_hashes(completion)
    expected = extract_number_after_hashes(ground_truth)
    return 1.0 if predicted is not None and predicted == expected else 0.0
```