---
library_name: peft
base_model: Qwen/Qwen2.5-7B-Instruct
tags:
- game-theory
- grpo
- reinforcement-learning
- reasoning
- qwen2.5
- lora
- peft
license: apache-2.0
datasets:
- Alogotron/GameTheory-Bench
metrics:
- accuracy
pipeline_tag: text-generation
model-index:
- name: GameTheory-Reasoner
  results:
  - task:
      type: text-generation
      name: Game Theory Problem Solving
    dataset:
      name: GameTheory-Bench
      type: Alogotron/GameTheory-Bench
    metrics:
    - name: Exact Accuracy
      type: accuracy
      value: 94.0
      verified: true
---

# GameTheory-Reasoner (GRPO Phase 2)

**A game theory reasoning model trained with Group Relative Policy Optimization (GRPO) and verifiable reward functions.**

This is a LoRA adapter trained on top of the [Phase 1 Solver](https://huggingface.co/Alogotron/GameTheory-Solver), which is itself fine-tuned from [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct). It is Phase 2 of a two-phase training pipeline designed to build a strong game theory problem solver with enhanced reasoning capabilities.

## Training Pipeline

```
Qwen2.5-7B-Instruct (base)
    |
    +-- Phase 1: Supervised Fine-Tuning (QLoRA)
    |       +-- GameTheory-Solver adapter
    |       +-- Merged into: phase1_merged/
    |
    +-- Phase 2: GRPO Reinforcement Learning
            +-- GameTheory-Reasoner adapter (this model)
                Trained on top of phase1_merged
```

## Benchmark Results (GameTheory-Bench, n=50)

### Overall Performance

| Metric | Base (Qwen2.5-7B) | Solver (Phase 1) | **Reasoner (Phase 2)** |
|---|---|---|---|
| **Exact Accuracy** | 82.0% | 94.0% | **94.0%** |
| **Partial Accuracy** | 82.0% | 94.0% | **94.0%** |
| Format Quality | 0.92 | 0.70 | 0.70 |
| **Reasoning Quality** | 0.53 | 0.51 | **0.54** |
| Avg Response Length | 523 words | 169 words | 181 words |

### Performance by Difficulty

| Difficulty | Base | Solver | **Reasoner** |
|---|---|---|---|
| Easy (n=9) | 100.0% | 88.9% | 88.9% |
| Medium (n=23) | 87.0% | 95.7% | 95.7% |
| Hard (n=18) | 66.7% | 94.4% | **94.4%** |

### Performance by Category

| Category | Base | Solver | **Reasoner** |
|---|---|---|---|
| normal_form_2x2 | 100.0% | 80.0% | 80.0% |
| normal_form_3x3 | 80.0% | 60.0% | 60.0% |
| normal_form_3x4 | 100.0% | 100.0% | 100.0% |
| normal_form_4x4 | 100.0% | 100.0% | 100.0% |
| zero_sum | 100.0% | 100.0% | 100.0% |
| sequential_game | 100.0% | 100.0% | 100.0% |
| auction_theory | 80.0% | 100.0% | 100.0% |
| bayesian_game | **0.0%** | **100.0%** | **100.0%** |
| cooperative_game | 100.0% | 100.0% | 100.0% |
| mechanism_design | 60.0% | 100.0% | 100.0% |

### Key Findings

- **+12 points exact accuracy** over base Qwen2.5-7B-Instruct (82% → 94%)
- **Large gains on hard problems**: 66.7% → 94.4% (+27.7 points)
- **Bayesian games**: 0% → 100%, the most dramatic improvement
- **Mechanism design**: 60% → 100%
- **Reasoning quality improved** by GRPO: 0.51 (Solver) → 0.54 (Reasoner)
- **Concise outputs**: ~65% shorter than the base model while more accurate

## Training Details

### GRPO Configuration

| Parameter | Value |
|---|---|
| Method | Group Relative Policy Optimization (GRPO) |
| Steps | 750 |
| Training Time | ~8 hours on an RTX 3090 |
| LoRA Rank (r) | 32 |
| LoRA Alpha | 64 |
| Learning Rate | 5e-6 |
| KL Beta | 0.04 |
| Num Generations | 4 |
| Max Completion Length | 1024 |
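
If the run used TRL's `GRPOTrainer`, the table above maps onto a trainer configuration roughly like the sketch below. This is an illustrative assumption, not the actual training script; `output_dir` is hypothetical, and argument names follow recent `trl` releases and may differ in your version.

```python
from trl import GRPOConfig

# Hedged sketch: the hyperparameters above expressed as a TRL GRPOConfig.
# Not the actual training configuration used for this model.
config = GRPOConfig(
    output_dir="gametheory-reasoner-grpo",  # hypothetical path
    max_steps=750,
    learning_rate=5e-6,
    beta=0.04,                   # KL penalty coefficient
    num_generations=4,           # completions sampled per prompt (the "group")
    max_completion_length=1024,
    bf16=True,
)
```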

### Reward Functions (3 verifiable rewards)

| Reward | Range | Description |
|---|---|---|
| **Accuracy** | 0.85 to 1.0 | Verifies correctness against gold answers using domain-specific comparators |
| **Format** | 0.64 to 0.82 | Checks structured output format (think/answer tags) |
| **Reasoning** | 0.55 to 0.79 | Evaluates reasoning-chain quality and mathematical notation |
| **Total** | 2.36 to 2.55 | Combined reward signal |
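
The reward implementations themselves aren't published here. As an illustration only, a verifiable format reward over think/answer tags might look like the sketch below; the exact tag names, weights, and scaling are assumptions, not the actual training code.

```python
import re

def format_reward(completion: str) -> float:
    """Hypothetical format reward: credit structured output that wraps
    the reasoning and the final answer in dedicated tags."""
    score = 0.0
    if re.search(r"<think>.+?</think>", completion, re.DOTALL):
        score += 0.5  # reasoning section present
    if re.search(r"<answer>.+?</answer>", completion, re.DOTALL):
        score += 0.5  # final answer section present
    return score

print(format_reward("<think>Check best responses.</think><answer>(R, r)</answer>"))  # 1.0
print(format_reward("plain unstructured text"))  # 0.0
```

A reward like this is "verifiable" in the sense that it is a deterministic check on the completion text, requiring no learned reward model.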

### Training Dynamics

| Metric | Value |
|---|---|
| Final Loss | ~0.0002 |
| KL Divergence | 0.004 to 0.015 |

## Usage

### Loading the Model

This adapter requires a two-step loading process since it was trained on top of the Phase 1 merged model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Step 1: Load the Phase 1 merged model as the base
base_model = AutoModelForCausalLM.from_pretrained(
    "Alogotron/GameTheory-Solver",  # or your local phase1_merged path
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Step 2: Apply the GRPO Reasoner adapter
model = PeftModel.from_pretrained(base_model, "Alogotron/GameTheory-Reasoner")
model.eval()

# Load the tokenizer from the original base model
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
```

### Inference

```python
system_prompt = (
    "You are a game theory expert. Solve the following problem step by step. "
    "Show your reasoning clearly, then provide your final answer."
)

problem = (
    "Consider a 2-player game with the following payoff matrix: "
    "L: (3,2) (1,4), R: (2,3) (4,1). Find all Nash Equilibria."
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": problem},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=1024, do_sample=False)

response = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```
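
Answers to problems like the one above can be checked mechanically, which is what makes reward verification possible. Below is a minimal pure-strategy Nash equilibrium finder for a 2-player bimatrix game; the concrete matrix is one plausible reading of the prompt (rows L/R, two column actions), assumed here for illustration.

```python
def pure_nash(payoffs):
    """Return all pure-strategy Nash equilibria of a 2-player bimatrix game.

    payoffs[i][j] = (row player's payoff, column player's payoff)
    when the row player picks row i and the column player picks column j.
    """
    n_rows, n_cols = len(payoffs), len(payoffs[0])
    equilibria = []
    for i in range(n_rows):
        for j in range(n_cols):
            row_u, col_u = payoffs[i][j]
            # Row player must have no profitable deviation to another row...
            row_ok = all(payoffs[k][j][0] <= row_u for k in range(n_rows))
            # ...and the column player none to another column.
            col_ok = all(payoffs[i][k][1] <= col_u for k in range(n_cols))
            if row_ok and col_ok:
                equilibria.append((i, j))
    return equilibria

# One reading of the prompt's matrix: rows L/R, columns l/r.
game = [[(3, 2), (1, 4)],
        [(2, 3), (4, 1)]]
print(pure_nash(game))  # [] -- no pure equilibrium; this game has only a mixed NE

# Sanity check on the Prisoner's Dilemma: mutual defection is the unique NE.
pd = [[(3, 3), (0, 5)],
      [(5, 0), (1, 1)]]
print(pure_nash(pd))  # [(1, 1)]
```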

## Related Resources

- **Dataset**: [Alogotron/GameTheory-Bench](https://huggingface.co/datasets/Alogotron/GameTheory-Bench) - 2,913 game theory problems
- **Phase 1 Model**: [Alogotron/GameTheory-Solver](https://huggingface.co/Alogotron/GameTheory-Solver) - SFT fine-tuned solver
- **Demo**: [Game Theory Solver Space](https://huggingface.co/spaces/Alogotron/GameTheory-Solver)

## License

Apache-2.0