---
library_name: peft
base_model: Qwen/Qwen2.5-7B-Instruct
tags:
- game-theory
- grpo
- reinforcement-learning
- reasoning
- qwen2.5
- lora
- peft
license: apache-2.0
datasets:
- Alogotron/GameTheory-Bench
metrics:
- accuracy
pipeline_tag: text-generation
model-index:
- name: GameTheory-Reasoner
  results:
  - task:
      type: text-generation
      name: Game Theory Problem Solving
    dataset:
      name: GameTheory-Bench
      type: Alogotron/GameTheory-Bench
    metrics:
    - name: Exact Accuracy
      type: accuracy
      value: 94.0
      verified: true
---

# GameTheory-Reasoner (GRPO Phase 2)

**A game theory reasoning model trained with Group Relative Policy Optimization (GRPO) and verifiable reward functions.**

This is a LoRA adapter trained on top of the [Phase 1 Solver](https://huggingface.co/Alogotron/GameTheory-Solver), which is itself fine-tuned from [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct). It is Phase 2 of a two-phase training pipeline designed to build a strong game theory problem solver with enhanced reasoning capabilities.

## Training Pipeline

```
Qwen2.5-7B-Instruct (base)
    |
    +-- Phase 1: Supervised Fine-Tuning (QLoRA)
    |       +-- GameTheory-Solver adapter
    |       +-- Merged into: phase1_merged/
    |
    +-- Phase 2: GRPO Reinforcement Learning
            +-- GameTheory-Reasoner adapter (this model)
                Trained on top of phase1_merged
```

## Benchmark Results (GameTheory-Bench, n=50)

### Overall Performance

| Metric | Base (Qwen2.5-7B) | Solver (Phase 1) | **Reasoner (Phase 2)** |
|---|---|---|---|
| **Exact Accuracy** | 82.0% | 94.0% | **94.0%** |
| **Partial Accuracy** | 82.0% | 94.0% | **94.0%** |
| Format Quality | 0.92 | 0.70 | 0.70 |
| **Reasoning Quality** | 0.53 | 0.51 | **0.54** |
| Avg Response Length | 523 words | 169 words | 181 words |

### Performance by Difficulty

| Difficulty | Base | Solver | **Reasoner** |
|---|---|---|---|
| Easy (n=9) | 100.0% | 88.9% | 88.9% |
| Medium (n=23) | 87.0% | 95.7% | 95.7% |
| Hard (n=18) | 66.7% | 94.4% | **94.4%** |

### Performance by Category

| Category | Base | Solver | **Reasoner** |
|---|---|---|---|
| normal_form_2x2 | 100.0% | 80.0% | 80.0% |
| normal_form_3x3 | 80.0% | 60.0% | 60.0% |
| normal_form_3x4 | 100.0% | 100.0% | 100.0% |
| normal_form_4x4 | 100.0% | 100.0% | 100.0% |
| zero_sum | 100.0% | 100.0% | 100.0% |
| sequential_game | 100.0% | 100.0% | 100.0% |
| auction_theory | 80.0% | 100.0% | 100.0% |
| bayesian_game | **0.0%** | **100.0%** | **100.0%** |
| cooperative_game | 100.0% | 100.0% | 100.0% |
| mechanism_design | 60.0% | 100.0% | 100.0% |

### Key Findings

- **+12 points exact accuracy** over base Qwen2.5-7B-Instruct (82% → 94%)
- **Large gains on hard problems**: 66.7% → 94.4% (+27.7 points)
- **Bayesian games**: 0% → 100%, the most dramatic improvement
- **Mechanism design**: 60% → 100%
- **Reasoning quality improved** by GRPO: 0.51 (Solver) → 0.54 (Reasoner)
- **Concise outputs**: ~65% shorter than the base model while more accurate

## Training Details

### GRPO Configuration

| Parameter | Value |
|---|---|
| Method | Group Relative Policy Optimization (GRPO) |
| Steps | 750 |
| Training Time | ~8 hours on an RTX 3090 |
| LoRA Rank (r) | 32 |
| LoRA Alpha | 64 |
| Learning Rate | 5e-6 |
| KL Beta | 0.04 |
| Num Generations | 4 |
| Max Completion Length | 1024 |
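
If the run used TRL's `GRPOTrainer`, the table above maps onto a trainer configuration roughly like the sketch below. This is an illustrative assumption, not the actual training script; `output_dir` is hypothetical, and argument names follow recent `trl` releases and may differ in your version.

```python
from trl import GRPOConfig

# Hedged sketch: the hyperparameters above expressed as a TRL GRPOConfig.
# Not the actual training configuration used for this model.
config = GRPOConfig(
    output_dir="gametheory-reasoner-grpo",  # hypothetical path
    max_steps=750,
    learning_rate=5e-6,
    beta=0.04,                   # KL penalty coefficient
    num_generations=4,           # completions sampled per prompt (the "group")
    max_completion_length=1024,
    bf16=True,
)
```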

### Reward Functions (3 verifiable rewards)

| Reward | Range | Description |
|---|---|---|
| **Accuracy** | 0.85 to 1.0 | Verifies correctness against gold answers using domain-specific comparators |
| **Format** | 0.64 to 0.82 | Checks structured output format (think/answer tags) |
| **Reasoning** | 0.55 to 0.79 | Evaluates reasoning-chain quality and mathematical notation |
| **Total** | 2.36 to 2.55 | Combined reward signal |
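
The reward implementations themselves aren't published here. As an illustration only, a verifiable format reward over think/answer tags might look like the sketch below; the exact tag names, weights, and scaling are assumptions, not the actual training code.

```python
import re

def format_reward(completion: str) -> float:
    """Hypothetical format reward: credit structured output that wraps
    the reasoning and the final answer in dedicated tags."""
    score = 0.0
    if re.search(r"<think>.+?</think>", completion, re.DOTALL):
        score += 0.5  # reasoning section present
    if re.search(r"<answer>.+?</answer>", completion, re.DOTALL):
        score += 0.5  # final answer section present
    return score

print(format_reward("<think>Check best responses.</think><answer>(R, r)</answer>"))  # 1.0
print(format_reward("plain unstructured text"))  # 0.0
```

A reward like this is "verifiable" in the sense that it is a deterministic check on the completion text, requiring no learned reward model.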

### Training Dynamics

| Metric | Value |
|---|---|
| Final Loss | ~0.0002 |
| KL Divergence | 0.004 to 0.015 |

## Usage

### Loading the Model

This adapter requires a two-step loading process since it was trained on top of the Phase 1 merged model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Step 1: Load the Phase 1 merged model as the base
base_model = AutoModelForCausalLM.from_pretrained(
    "Alogotron/GameTheory-Solver",  # or your local phase1_merged path
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Step 2: Apply the GRPO Reasoner adapter
model = PeftModel.from_pretrained(base_model, "Alogotron/GameTheory-Reasoner")
model.eval()

# Load the tokenizer from the original base model
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
```

### Inference

```python
system_prompt = (
    "You are a game theory expert. Solve the following problem step by step. "
    "Show your reasoning clearly, then provide your final answer."
)

problem = (
    "Consider a 2-player game with the following payoff matrix: "
    "L: (3,2) (1,4), R: (2,3) (4,1). Find all Nash Equilibria."
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": problem},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=1024, do_sample=False)

response = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```
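
Answers to problems like the one above can be checked mechanically, which is what makes reward verification possible. Below is a minimal pure-strategy Nash equilibrium finder for a 2-player bimatrix game; the concrete matrix is one plausible reading of the prompt (rows L/R, two column actions), assumed here for illustration.

```python
def pure_nash(payoffs):
    """Return all pure-strategy Nash equilibria of a 2-player bimatrix game.

    payoffs[i][j] = (row player's payoff, column player's payoff)
    when the row player picks row i and the column player picks column j.
    """
    n_rows, n_cols = len(payoffs), len(payoffs[0])
    equilibria = []
    for i in range(n_rows):
        for j in range(n_cols):
            row_u, col_u = payoffs[i][j]
            # Row player must have no profitable deviation to another row...
            row_ok = all(payoffs[k][j][0] <= row_u for k in range(n_rows))
            # ...and the column player none to another column.
            col_ok = all(payoffs[i][k][1] <= col_u for k in range(n_cols))
            if row_ok and col_ok:
                equilibria.append((i, j))
    return equilibria

# One reading of the prompt's matrix: rows L/R, columns l/r.
game = [[(3, 2), (1, 4)],
        [(2, 3), (4, 1)]]
print(pure_nash(game))  # [] -- no pure equilibrium; this game has only a mixed NE

# Sanity check on the Prisoner's Dilemma: mutual defection is the unique NE.
pd = [[(3, 3), (0, 5)],
      [(5, 0), (1, 1)]]
print(pure_nash(pd))  # [(1, 1)]
```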

## Related Resources

- **Dataset**: [Alogotron/GameTheory-Bench](https://huggingface.co/datasets/Alogotron/GameTheory-Bench) - 2,913 game theory problems
- **Phase 1 Model**: [Alogotron/GameTheory-Solver](https://huggingface.co/Alogotron/GameTheory-Solver) - SFT fine-tuned solver
- **Demo**: [Game Theory Solver Space](https://huggingface.co/spaces/Alogotron/GameTheory-Solver)

## License

Apache-2.0