# GameTheory-Reasoner (GRPO Phase 2)
A game theory reasoning model trained with Group Relative Policy Optimization (GRPO) and verifiable reward functions.
This is a LoRA adapter trained on top of the Phase 1 Solver (which itself is fine-tuned from Qwen/Qwen2.5-7B-Instruct). It represents Phase 2 of a two-phase training pipeline designed to build a strong game theory problem solver with enhanced reasoning capabilities.
## Training Pipeline

```
Qwen2.5-7B-Instruct (base)
 |
 +-- Phase 1: Supervised Fine-Tuning (QLoRA)
 |     +-- GameTheory-Solver adapter
 |     +-- Merged into: phase1_merged/
 |
 +-- Phase 2: GRPO Reinforcement Learning
       +-- GameTheory-Reasoner adapter (this model)
             Trained on top of phase1_merged
```
## Benchmark Results (GameTheory-Bench, n=50)

### Overall Performance

| Metric | Base (Qwen2.5-7B) | Solver (Phase 1) | Reasoner (Phase 2) |
|---|---|---|---|
| Exact Accuracy | 82.0% | 94.0% | 94.0% |
| Partial Accuracy | 82.0% | 94.0% | 94.0% |
| Format Quality | 0.92 | 0.70 | 0.70 |
| Reasoning Quality | 0.53 | 0.51 | 0.54 |
| Avg Response Length | 523 words | 169 words | 181 words |
### Performance by Difficulty

| Difficulty | Base | Solver | Reasoner |
|---|---|---|---|
| Easy (n=9) | 100.0% | 88.9% | 88.9% |
| Medium (n=23) | 87.0% | 95.7% | 95.7% |
| Hard (n=18) | 66.7% | 94.4% | 94.4% |
### Performance by Category

| Category | Base | Solver | Reasoner |
|---|---|---|---|
| normal_form_2x2 | 100.0% | 80.0% | 80.0% |
| normal_form_3x3 | 80.0% | 60.0% | 60.0% |
| normal_form_3x4 | 100.0% | 100.0% | 100.0% |
| normal_form_4x4 | 100.0% | 100.0% | 100.0% |
| zero_sum | 100.0% | 100.0% | 100.0% |
| sequential_game | 100.0% | 100.0% | 100.0% |
| auction_theory | 80.0% | 100.0% | 100.0% |
| bayesian_game | 0.0% | 100.0% | 100.0% |
| cooperative_game | 100.0% | 100.0% | 100.0% |
| mechanism_design | 60.0% | 100.0% | 100.0% |
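The normal-form categories above all come down to checking best responses cell by cell, which is what makes the answers mechanically verifiable. As a reference for what the simplest category (normal_form_2x2) asks, here is a minimal pure-strategy Nash equilibrium finder; the game encoding and the Prisoner's Dilemma example are illustrative, not taken from the benchmark:

```python
from itertools import product

def pure_nash_equilibria(payoffs):
    """Find all pure-strategy Nash equilibria of a 2-player normal-form game.

    payoffs[(i, j)] = (u1, u2): utilities when player 1 plays row i
    and player 2 plays column j.
    """
    rows = sorted({i for i, _ in payoffs})
    cols = sorted({j for _, j in payoffs})
    equilibria = []
    for i, j in product(rows, cols):
        u1, u2 = payoffs[(i, j)]
        # Player 1 must have no profitable deviation in the same column...
        best_for_1 = all(payoffs[(k, j)][0] <= u1 for k in rows)
        # ...and player 2 none in the same row.
        best_for_2 = all(payoffs[(i, k)][1] <= u2 for k in cols)
        if best_for_1 and best_for_2:
            equilibria.append((i, j))
    return equilibria

# Prisoner's Dilemma: (Defect, Defect) is the unique pure equilibrium.
pd = {
    ("C", "C"): (3, 3), ("C", "D"): (0, 5),
    ("D", "C"): (5, 0), ("D", "D"): (1, 1),
}
print(pure_nash_equilibria(pd))  # [('D', 'D')]
```

The same exhaustive check scales directly to the 3x3, 3x4, and 4x4 categories, since it only enumerates the strategy grid.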
## Key Findings

- +12 points exact accuracy over base Qwen2.5-7B-Instruct (82% to 94%)
- Large gains on hard problems: 66.7% to 94.4% (+27.7 points)
- Bayesian games: 0% to 100% (the most dramatic improvement)
- Mechanism design: 60% to 100%
- GRPO improved reasoning quality: 0.51 (Solver) to 0.54 (Reasoner)
- Concise outputs: ~65% shorter than the base model while more accurate
## Training Details

### GRPO Configuration

| Parameter | Value |
|---|---|
| Method | Group Relative Policy Optimization (GRPO) |
| Steps | 750 |
| Training Time | ~8 hours on RTX 3090 |
| LoRA Rank (r) | 32 |
| LoRA Alpha | 64 |
| Learning Rate | 5e-6 |
| KL Beta | 0.04 |
| Num Generations | 4 |
| Max Completion Length | 1024 tokens |
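The training script is not published with this card. Assuming the run used TRL's `GRPOTrainer` with a PEFT adapter (a natural choice for GRPO + LoRA, but an assumption), the table above roughly corresponds to a configuration like the following sketch; argument names follow recent TRL versions and may differ:

```python
# Hypothetical reconstruction of the run's configuration, not the
# actual training script. All repo/path names are placeholders.
from peft import LoraConfig
from trl import GRPOConfig

peft_config = LoraConfig(r=32, lora_alpha=64, task_type="CAUSAL_LM")

training_args = GRPOConfig(
    output_dir="grpo-reasoner",
    learning_rate=5e-6,
    max_steps=750,
    beta=0.04,                  # KL penalty toward the reference model
    num_generations=4,          # completions sampled per prompt (the "group")
    max_completion_length=1024,
    bf16=True,
)
```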
### Reward Functions (3 verifiable rewards)

| Reward | Range | Description |
|---|---|---|
| Accuracy | 0.85 to 1.0 | Verifies correctness against gold answers using domain-specific comparators |
| Format | 0.64 to 0.82 | Checks structured output format (think/answer tags) |
| Reasoning | 0.55 to 0.79 | Evaluates reasoning chain quality and mathematical notation |
| Total | 2.36 to 2.55 | Combined reward signal |
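The concrete reward implementations are not published here. As an illustration of what a verifiable format reward like the one in the table could look like, a minimal sketch (the think/answer tag names come from the table; the score tiers are assumptions):

```python
import re

def format_reward(completion: str) -> float:
    """Illustrative format reward: full credit for a well-formed
    <think>...</think> block followed by <answer>...</answer>,
    partial credit for partial structure. Tier values are made up."""
    think = re.search(r"<think>.*?</think>", completion, re.DOTALL)
    answer = re.search(r"<answer>.*?</answer>", completion, re.DOTALL)
    if think and answer and think.start() < answer.start():
        return 1.0   # complete, correctly ordered structure
    if think or answer:
        return 0.5   # only one tag pair present
    return 0.0       # no recognizable structure
```

Because each reward is computed by a deterministic checker rather than a judge model, the GRPO signal stays verifiable end to end.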
### Training Dynamics

| Metric | Value |
|---|---|
| Final Loss | ~0.0002 |
| KL Divergence | 0.004 to 0.015 |
## Usage

### Loading the Model

This adapter requires a two-step loading process, since it was trained on top of the Phase 1 merged model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Step 1: load the Phase 1 merged model as the base.
base_model = AutoModelForCausalLM.from_pretrained(
    "Alogotron/GameTheory-Solver",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Step 2: attach the Phase 2 GRPO adapter on top.
model = PeftModel.from_pretrained(base_model, "Alogotron/GameTheory-Reasoner")
model.eval()

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
```
### Inference

```python
system_prompt = (
    "You are a game theory expert. Solve the following problem step by step. "
    "Show your reasoning clearly, then provide your final answer."
)

problem = (
    "Consider a 2-player game with the following payoff matrix: "
    "L: (3,2) (1,4), R: (2,3) (4,1). Find all Nash Equilibria."
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": problem},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=1024, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```
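Since training rewarded think/answer tags, the final answer can usually be pulled out of the decoded response with a small helper. A sketch, assuming the tag format described in the reward table (responses without tags fall back to the raw text):

```python
import re

def extract_answer(response: str) -> str:
    """Return the content of the last <answer>...</answer> block,
    or the whole response stripped if no tags are present."""
    matches = re.findall(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return matches[-1].strip() if matches else response.strip()

print(extract_answer("<think>compare best responses cell by cell</think>"
                     "<answer>(D, D) is the unique Nash equilibrium</answer>"))
# (D, D) is the unique Nash equilibrium
```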
## Related Resources

## License

Apache-2.0