---
library_name: peft
base_model: Qwen/Qwen2.5-7B-Instruct
tags:
- game-theory
- grpo
- reinforcement-learning
- reasoning
- qwen2.5
- lora
- peft
license: apache-2.0
datasets:
- Alogotron/GameTheory-Bench
metrics:
- accuracy
pipeline_tag: text-generation
model-index:
- name: GameTheory-Reasoner
  results:
  - task:
      type: text-generation
      name: Game Theory Problem Solving
    dataset:
      name: GameTheory-Bench
      type: Alogotron/GameTheory-Bench
    metrics:
    - name: Exact Accuracy
      type: accuracy
      value: 94.0
      verified: true
---

# GameTheory-Reasoner (GRPO Phase 2)

**A game theory reasoning model trained with Group Relative Policy Optimization (GRPO) and verifiable reward functions.**

This is a LoRA adapter trained on top of the [Phase 1 Solver](https://huggingface.co/Alogotron/GameTheory-Solver) (which itself is fine-tuned from [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)). It represents Phase 2 of a two-phase training pipeline designed to build a strong game theory problem solver with enhanced reasoning capabilities.
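The card does not publish the exact reward implementations, but a "verifiable" reward in the GRPO sense is simply a deterministic function of the completion. A minimal sketch of a format-style reward that checks for structured `<think>`/`<answer>` output is shown below; the tag names and the partial-credit values are illustrative assumptions, not the model's actual reward code:

```python
import re

def format_reward(completion: str) -> float:
    """Toy verifiable format reward: full credit when the completion wraps
    its reasoning in <think>...</think> and its result in <answer>...</answer>.
    Tag names and score values are illustrative assumptions only."""
    has_think = re.search(r"<think>.*?</think>", completion, re.DOTALL) is not None
    has_answer = re.search(r"<answer>.*?</answer>", completion, re.DOTALL) is not None
    if has_think and has_answer:
        return 1.0
    if has_think or has_answer:
        return 0.5  # partial credit: only one of the two tags present
    return 0.0
```

Because the reward is a pure function of the generated text, it can be checked (and unit-tested) independently of the model, which is what makes it usable as a GRPO training signal.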
## Training Pipeline

```
Qwen2.5-7B-Instruct (base)
|
+-- Phase 1: Supervised Fine-Tuning (QLoRA)
|     +-- GameTheory-Solver adapter
|     +-- Merged into: phase1_merged/
|
+-- Phase 2: GRPO Reinforcement Learning
      +-- GameTheory-Reasoner adapter (this model)
          Trained on top of phase1_merged
```

## Benchmark Results (GameTheory-Bench, n=50)

### Overall Performance

| Metric | Base (Qwen2.5-7B) | Solver (Phase 1) | **Reasoner (Phase 2)** |
|---|---|---|---|
| **Exact Accuracy** | 82.0% | 94.0% | **94.0%** |
| **Partial Accuracy** | 82.0% | 94.0% | **94.0%** |
| Format Quality | 0.92 | 0.70 | 0.70 |
| **Reasoning Quality** | 0.53 | 0.51 | **0.54** |
| Avg Response Length | 523 words | 169 words | 181 words |

### Performance by Difficulty

| Difficulty | Base | Solver | **Reasoner** |
|---|---|---|---|
| Easy (n=9) | 100.0% | 88.9% | 88.9% |
| Medium (n=23) | 87.0% | 95.7% | 95.7% |
| Hard (n=18) | 66.7% | 94.4% | **94.4%** |

### Performance by Category

| Category | Base | Solver | **Reasoner** |
|---|---|---|---|
| normal_form_2x2 | 100.0% | 80.0% | 80.0% |
| normal_form_3x3 | 80.0% | 60.0% | 60.0% |
| normal_form_3x4 | 100.0% | 100.0% | 100.0% |
| normal_form_4x4 | 100.0% | 100.0% | 100.0% |
| zero_sum | 100.0% | 100.0% | 100.0% |
| sequential_game | 100.0% | 100.0% | 100.0% |
| auction_theory | 80.0% | 100.0% | 100.0% |
| bayesian_game | **0.0%** | **100.0%** | **100.0%** |
| cooperative_game | 100.0% | 100.0% | 100.0% |
| mechanism_design | 60.0% | 100.0% | 100.0% |

### Key Findings

- **+12 points exact accuracy** over base Qwen2.5-7B-Instruct (82% to 94%)
- **Large gains on hard problems**: 66.7% to 94.4% (+27.7 points)
- **Bayesian games**: 0% to 100% (the most dramatic improvement)
- **Mechanism design**: 60% to 100%
- **Reasoning quality improved** by GRPO: 0.51 (Solver) to 0.54 (Reasoner)
- **Concise outputs**: ~65% shorter than the base model while being more accurate

## Training Details

### GRPO Configuration

| Parameter | Value |
|---|---|
| Method | Group Relative Policy Optimization (GRPO) |
| Steps | 750 |
| Training Time | ~8 hours on RTX 3090 |
| LoRA Rank (r) | 32 |
| LoRA Alpha | 64 |
| Learning Rate | 5e-6 |
| KL Beta | 0.04 |
| Num Generations | 4 |
| Max Completion Length | 1024 |

### Reward Functions (3 verifiable rewards)

| Reward | Range | Description |
|---|---|---|
| **Accuracy** | 0.85 to 1.0 | Verifies correctness against gold answers using domain-specific comparators |
| **Format** | 0.64 to 0.82 | Checks structured output format (think/answer tags) |
| **Reasoning** | 0.55 to 0.79 | Evaluates reasoning chain quality and mathematical notation |
| **Total** | 2.36 to 2.55 | Combined reward signal |

### Training Dynamics

| Metric | Value |
|---|---|
| Final Loss | ~0.0002 |
| KL Divergence | 0.004 to 0.015 |

## Usage

### Loading the Model

This adapter requires a two-step loading process, since it was trained on top of the Phase 1 merged model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Step 1: Load the Phase 1 merged model as the base
base_model = AutoModelForCausalLM.from_pretrained(
    "Alogotron/GameTheory-Solver",  # or your local phase1_merged path
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Step 2: Apply the GRPO Reasoner adapter
model = PeftModel.from_pretrained(base_model, "Alogotron/GameTheory-Reasoner")
model.eval()

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
```

### Inference

```python
system_prompt = (
    "You are a game theory expert. Solve the following problem step by step. "
    "Show your reasoning clearly, then provide your final answer."
)

# Parentheses are required so the adjacent string literals concatenate
problem = (
    "Consider a 2-player game with the following payoff matrix: "
    "L: (3,2) (1,4), R: (2,3) (4,1). Find all Nash Equilibria."
)
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": problem},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=1024, do_sample=False)

response = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```

## Related Resources

- **Dataset**: [Alogotron/GameTheory-Bench](https://huggingface.co/datasets/Alogotron/GameTheory-Bench) - 2,913 game theory problems
- **Phase 1 Model**: [Alogotron/GameTheory-Solver](https://huggingface.co/Alogotron/GameTheory-Solver) - SFT fine-tuned solver
- **Demo**: [Game Theory Solver Space](https://huggingface.co/spaces/Alogotron/GameTheory-Solver)

## License

Apache-2.0
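The Accuracy reward described above verifies answers "using domain-specific comparators". To give a flavor of what makes game theory answers machine-checkable, here is a minimal sketch (not the card's actual verifier) of a brute-force pure-strategy Nash equilibrium checker for small normal-form games:

```python
from itertools import product

def pure_nash_equilibria(payoffs):
    """Enumerate pure-strategy Nash equilibria of a 2-player normal-form game.
    `payoffs[(i, j)] = (u1, u2)` gives both players' utilities when player 1
    plays row i and player 2 plays column j. Illustrative sketch only."""
    rows = sorted({i for i, _ in payoffs})
    cols = sorted({j for _, j in payoffs})
    equilibria = []
    for i, j in product(rows, cols):
        u1, u2 = payoffs[(i, j)]
        # (i, j) is an equilibrium iff neither player has a profitable
        # unilateral deviation
        row_best = all(payoffs[(k, j)][0] <= u1 for k in rows)
        col_best = all(payoffs[(i, k)][1] <= u2 for k in cols)
        if row_best and col_best:
            equilibria.append((i, j))
    return equilibria
```

For a Prisoner's Dilemma `{("C","C"): (3,3), ("C","D"): (0,5), ("D","C"): (5,0), ("D","D"): (1,1)}` this returns `[("D", "D")]`. Note that under one reading of the matrix in the inference example (rows L/R), that game has no pure-strategy equilibrium at all, so a real comparator would also need to handle mixed strategies.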