# GameTheory-Reasoner (GRPO Phase 2)
A game theory reasoning model trained with Group Relative Policy Optimization (GRPO) and verifiable reward functions.
This is a LoRA adapter trained on top of the Phase 1 Solver (which itself is fine-tuned from Qwen/Qwen2.5-7B-Instruct). It represents Phase 2 of a two-phase training pipeline designed to build a strong game theory problem solver with enhanced reasoning capabilities.
## Training Pipeline

```
Qwen2.5-7B-Instruct (base)
 |
 +-- Phase 1: Supervised Fine-Tuning (QLoRA)
 |     +-- GameTheory-Solver adapter
 |     +-- Merged into: phase1_merged/
 |
 +-- Phase 2: GRPO Reinforcement Learning
       +-- GameTheory-Reasoner adapter (this model)
             Trained on top of phase1_merged
```
## Benchmark Results (GameTheory-Bench, n=50)

### Overall Performance

| Metric | Base (Qwen2.5-7B) | Solver (Phase 1) | Reasoner (Phase 2) |
|---|---|---|---|
| Exact Accuracy | 82.0% | 94.0% | 94.0% |
| Partial Accuracy | 82.0% | 94.0% | 94.0% |
| Format Quality | 0.92 | 0.70 | 0.70 |
| Reasoning Quality | 0.53 | 0.51 | 0.54 |
| Avg Response Length | 523 words | 169 words | 181 words |
### Performance by Difficulty

| Difficulty | Base | Solver | Reasoner |
|---|---|---|---|
| Easy (n=9) | 100.0% | 88.9% | 88.9% |
| Medium (n=23) | 87.0% | 95.7% | 95.7% |
| Hard (n=18) | 66.7% | 94.4% | 94.4% |
### Performance by Category

| Category | Base | Solver | Reasoner |
|---|---|---|---|
| normal_form_2x2 | 100.0% | 80.0% | 80.0% |
| normal_form_3x3 | 80.0% | 60.0% | 60.0% |
| normal_form_3x4 | 100.0% | 100.0% | 100.0% |
| normal_form_4x4 | 100.0% | 100.0% | 100.0% |
| zero_sum | 100.0% | 100.0% | 100.0% |
| sequential_game | 100.0% | 100.0% | 100.0% |
| auction_theory | 80.0% | 100.0% | 100.0% |
| bayesian_game | 0.0% | 100.0% | 100.0% |
| cooperative_game | 100.0% | 100.0% | 100.0% |
| mechanism_design | 60.0% | 100.0% | 100.0% |
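The normal-form categories above all come down to checking best responses cell by cell, which is what makes the answers mechanically verifiable. As a reference for what the simplest category (normal_form_2x2) asks, here is a minimal pure-strategy Nash equilibrium finder; the game encoding and the Prisoner's Dilemma example are illustrative, not taken from the benchmark:

```python
from itertools import product

def pure_nash_equilibria(payoffs):
    """Find all pure-strategy Nash equilibria of a 2-player normal-form game.

    payoffs[(i, j)] = (u1, u2): utilities when player 1 plays row i
    and player 2 plays column j.
    """
    rows = sorted({i for i, _ in payoffs})
    cols = sorted({j for _, j in payoffs})
    equilibria = []
    for i, j in product(rows, cols):
        u1, u2 = payoffs[(i, j)]
        # Player 1 must have no profitable deviation in the same column...
        best_for_1 = all(payoffs[(k, j)][0] <= u1 for k in rows)
        # ...and player 2 none in the same row.
        best_for_2 = all(payoffs[(i, k)][1] <= u2 for k in cols)
        if best_for_1 and best_for_2:
            equilibria.append((i, j))
    return equilibria

# Prisoner's Dilemma: (Defect, Defect) is the unique pure equilibrium.
pd = {
    ("C", "C"): (3, 3), ("C", "D"): (0, 5),
    ("D", "C"): (5, 0), ("D", "D"): (1, 1),
}
print(pure_nash_equilibria(pd))  # [('D', 'D')]
```

The same exhaustive check scales directly to the 3x3, 3x4, and 4x4 categories, since it only enumerates the strategy grid.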
## Key Findings

- +12 points exact accuracy over base Qwen2.5-7B-Instruct (82% to 94%)
- Large gains on hard problems: 66.7% to 94.4% (+27.7 points)
- Bayesian games: 0% to 100% (the most dramatic improvement)
- Mechanism design: 60% to 100%
- GRPO improved reasoning quality: 0.51 (Solver) to 0.54 (Reasoner)
- Concise outputs: ~65% shorter than the base model while more accurate
## Training Details

### GRPO Configuration

| Parameter | Value |
|---|---|
| Method | Group Relative Policy Optimization (GRPO) |
| Steps | 750 |
| Training Time | ~8 hours on RTX 3090 |
| LoRA Rank (r) | 32 |
| LoRA Alpha | 64 |
| Learning Rate | 5e-6 |
| KL Beta | 0.04 |
| Num Generations | 4 |
| Max Completion Length | 1024 tokens |
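The training script is not published with this card. Assuming the run used TRL's `GRPOTrainer` with a PEFT adapter (a natural choice for GRPO + LoRA, but an assumption), the table above roughly corresponds to a configuration like the following sketch; argument names follow recent TRL versions and may differ:

```python
# Hypothetical reconstruction of the run's configuration, not the
# actual training script. All repo/path names are placeholders.
from peft import LoraConfig
from trl import GRPOConfig

peft_config = LoraConfig(r=32, lora_alpha=64, task_type="CAUSAL_LM")

training_args = GRPOConfig(
    output_dir="grpo-reasoner",
    learning_rate=5e-6,
    max_steps=750,
    beta=0.04,                  # KL penalty toward the reference model
    num_generations=4,          # completions sampled per prompt (the "group")
    max_completion_length=1024,
    bf16=True,
)
```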
### Reward Functions (3 verifiable rewards)

| Reward | Range | Description |
|---|---|---|
| Accuracy | 0.85 to 1.0 | Verifies correctness against gold answers using domain-specific comparators |
| Format | 0.64 to 0.82 | Checks structured output format (think/answer tags) |
| Reasoning | 0.55 to 0.79 | Evaluates reasoning chain quality and mathematical notation |
| Total | 2.36 to 2.55 | Combined reward signal |
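The concrete reward implementations are not published here. As an illustration of what a verifiable format reward like the one in the table could look like, a minimal sketch (the think/answer tag names come from the table; the score tiers are assumptions):

```python
import re

def format_reward(completion: str) -> float:
    """Illustrative format reward: full credit for a well-formed
    <think>...</think> block followed by <answer>...</answer>,
    partial credit for partial structure. Tier values are made up."""
    think = re.search(r"<think>.*?</think>", completion, re.DOTALL)
    answer = re.search(r"<answer>.*?</answer>", completion, re.DOTALL)
    if think and answer and think.start() < answer.start():
        return 1.0   # complete, correctly ordered structure
    if think or answer:
        return 0.5   # only one tag pair present
    return 0.0       # no recognizable structure
```

Because each reward is computed by a deterministic checker rather than a judge model, the GRPO signal stays verifiable end to end.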
### Training Dynamics

| Metric | Value |
|---|---|
| Final Loss | ~0.0002 |
| KL Divergence | 0.004 to 0.015 |
## Usage

### Loading the Model

This adapter requires a two-step loading process, since it was trained on top of the Phase 1 merged model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Step 1: load the Phase 1 merged model as the base.
base_model = AutoModelForCausalLM.from_pretrained(
    "Alogotron/GameTheory-Solver",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Step 2: attach the Phase 2 GRPO adapter on top.
model = PeftModel.from_pretrained(base_model, "Alogotron/GameTheory-Reasoner")
model.eval()

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
```
### Inference

```python
system_prompt = (
    "You are a game theory expert. Solve the following problem step by step. "
    "Show your reasoning clearly, then provide your final answer."
)

problem = (
    "Consider a 2-player game with the following payoff matrix: "
    "L: (3,2) (1,4), R: (2,3) (4,1). Find all Nash Equilibria."
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": problem},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=1024, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```
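Since training rewarded think/answer tags, the final answer can usually be pulled out of the decoded response with a small helper. A sketch, assuming the tag format described in the reward table (responses without tags fall back to the raw text):

```python
import re

def extract_answer(response: str) -> str:
    """Return the content of the last <answer>...</answer> block,
    or the whole response stripped if no tags are present."""
    matches = re.findall(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return matches[-1].strip() if matches else response.strip()

print(extract_answer("<think>compare best responses cell by cell</think>"
                     "<answer>(D, D) is the unique Nash equilibrium</answer>"))
# (D, D) is the unique Nash equilibrium
```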
## Related Resources

## License

Apache-2.0