A game theory reasoning model trained with Group Relative Policy Optimization (GRPO) and verifiable reward functions.
This is a LoRA adapter trained on top of the Phase 1 Solver (which itself is fine-tuned from Qwen/Qwen2.5-7B-Instruct). It represents Phase 2 of a two-phase training pipeline designed to build a strong game theory problem solver with enhanced reasoning capabilities.
- +12 percentage points accuracy over base Qwen2.5-7B-Instruct (82% → 94%)
- Large gains on hard problems: 66.7% → 94.4% (+27.7 points)
- Bayesian games: 0% → 100% (the most dramatic improvement)
- Mechanism design: 60% → 100%
- Reasoning quality improved by GRPO: 0.51 (Solver) → 0.54 (Reasoner)
- Concise outputs: ~65% shorter than the base model while being more accurate
Training Details
GRPO Configuration
| Parameter | Value |
|---|---|
| Method | Group Relative Policy Optimization (GRPO) |
| Steps | 750 |
| Training Time | ~8 hours on RTX 3090 |
| LoRA Rank (r) | 32 |
| LoRA Alpha | 64 |
| Learning Rate | 5e-6 |
| KL Beta | 0.04 |
| Num Generations | 4 |
| Max Completion Length | 1024 |
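GRPO's core idea is that advantages are computed relative to a group of sampled completions for the same prompt (here, `Num Generations = 4`) instead of a learned value baseline. A minimal sketch of that group-relative normalization, using made-up reward values (this is an illustration, not the training implementation):

```python
import statistics

def group_relative_advantages(rewards):
    """Normalize each completion's reward against the group's
    mean and standard deviation (the GRPO baseline)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Four completions for one prompt; their total rewards
# (accuracy + format + reasoning) are compared within the group.
rewards = [2.55, 2.40, 2.36, 2.48]
advantages = group_relative_advantages(rewards)
```

Completions scoring above the group mean get positive advantage (and are reinforced); those below get negative advantage, with no critic network required.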
Reward Functions (3 verifiable rewards)
| Reward | Range | Description |
|---|---|---|
| Accuracy | 0.85 to 1.0 | Verifies correctness against gold answers using domain-specific comparators |
| Format | 0.64 to 0.82 | Checks structured output format (think/answer tags) |
| Reasoning | 0.55 to 0.79 | Evaluates reasoning chain quality and mathematical notation |
| Total | 2.36 to 2.55 | Combined reward signal |
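To illustrate what a verifiable format reward can look like, here is a hypothetical checker for the think/answer tag structure (a sketch; the actual reward functions used in training are not reproduced here, and the partial-credit weights are invented):

```python
import re

def format_reward(completion: str) -> float:
    """Score how closely a completion follows the expected
    <think>...</think><answer>...</answer> structure,
    giving partial credit for each well-formed tag pair."""
    score = 0.0
    if re.search(r"<think>.*?</think>", completion, re.DOTALL):
        score += 0.5
    if re.search(r"<answer>.*?</answer>", completion, re.DOTALL):
        score += 0.5
    return score

good = "<think>Check best responses.</think><answer>(R, L)</answer>"
bad = "The answer is (R, L)."
```

Because such rewards are computed mechanically from the completion text (or from a gold answer), they are verifiable: no learned reward model is involved.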
Training Dynamics
| Metric | Value |
|---|---|
| Final Loss | ~0.0002 |
| KL Divergence | 0.004 to 0.015 |
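The KL Beta of 0.04 penalizes drift from the reference (Phase 1) policy. As a simplified scalar sketch (in the actual GRPO loss the KL term is applied per token, not subtracted from a scalar reward), the observed KL range shows the penalty stayed tiny relative to total rewards of ~2.4-2.55:

```python
def kl_penalized_reward(total_reward: float, kl: float, beta: float = 0.04) -> float:
    """Reward after subtracting the KL penalty that keeps the
    policy close to the reference model."""
    return total_reward - beta * kl

# Worst case at the top of the observed KL range (0.015):
# the penalty is 0.04 * 0.015 = 0.0006 -- negligible next to ~2.5.
penalized = kl_penalized_reward(2.5, 0.015)
```

The small, stable KL range suggests the policy improved its rewards without wandering far from the Phase 1 Solver.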
Usage
Loading the Model
This adapter requires a two-step loading process since it was trained on top of the Phase 1 merged model:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Step 1: Load the Phase 1 merged model as base
base_model = AutoModelForCausalLM.from_pretrained(
    "Alogotron/GameTheory-Solver",  # or your local phase1_merged path
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Step 2: Apply the GRPO Reasoner adapter
model = PeftModel.from_pretrained(base_model, "Alogotron/GameTheory-Reasoner")
model.eval()

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
```
Inference
```python
system_prompt = (
    "You are a game theory expert. Solve the following problem step by step. "
    "Show your reasoning clearly, then provide your final answer."
)

problem = (
    "Consider a 2-player game with the following payoff matrix: "
    "L: (3,2) (1,4), R: (2,3) (4,1). Find all Nash Equilibria."
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": problem},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=1024, do_sample=False)

response = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```
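Since the format reward trains the model to wrap its reasoning in think tags and the final answer in answer tags, you may want to extract just the answer programmatically. A small helper for that (the exact tag convention is an assumption; check your actual outputs):

```python
import re

def extract_answer(response: str) -> str:
    """Return the contents of the <answer> tag, or the full
    response stripped if no tag is found."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return match.group(1).strip() if match else response.strip()

example = "<think>Find best responses in each cell.</think><answer>(T, R)</answer>"
final = extract_answer(example)
```

Falling back to the whole response when no tag is present keeps the helper safe for the occasional completion that breaks format.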