Alogotron committed on
Commit 2a6d34f · verified · 1 Parent(s): 3e2c059

Upload folder using huggingface_hub

Browse files
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,186 @@
+ ---
+ library_name: peft
+ base_model: Qwen/Qwen2.5-7B-Instruct
+ tags:
+ - game-theory
+ - grpo
+ - reinforcement-learning
+ - reasoning
+ - qwen2.5
+ - lora
+ - peft
+ license: apache-2.0
+ datasets:
+ - Alogotron/GameTheory-Bench
+ metrics:
+ - accuracy
+ pipeline_tag: text-generation
+ model-index:
+ - name: GameTheory-Reasoner
+   results:
+   - task:
+       type: text-generation
+       name: Game Theory Problem Solving
+     dataset:
+       name: GameTheory-Bench
+       type: Alogotron/GameTheory-Bench
+     metrics:
+     - name: Exact Accuracy
+       type: accuracy
+       value: 94.0
+       verified: true
+ ---
+
+ # GameTheory-Reasoner (GRPO Phase 2)
+
+ **A game theory reasoning model trained with Group Relative Policy Optimization (GRPO) and verifiable reward functions.**
+
+ This is a LoRA adapter trained on top of the [Phase 1 Solver](https://huggingface.co/Alogotron/GameTheory-Solver) (which itself is fine-tuned from [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)). It is Phase 2 of a two-phase training pipeline designed to build a strong game theory problem solver with enhanced reasoning capabilities.
+
+ ## Training Pipeline
+
+ ```
+ Qwen2.5-7B-Instruct (base)
+ |
+ +-- Phase 1: Supervised Fine-Tuning (QLoRA)
+ |   +-- GameTheory-Solver adapter
+ |   +-- Merged into: phase1_merged/
+ |
+ +-- Phase 2: GRPO Reinforcement Learning
+     +-- GameTheory-Reasoner adapter (this model)
+         Trained on top of phase1_merged
+ ```
+
+ ## Benchmark Results (GameTheory-Bench, n=50)
+
+ ### Overall Performance
+
+ | Metric | Base (Qwen2.5-7B) | Solver (Phase 1) | **Reasoner (Phase 2)** |
+ |---|---|---|---|
+ | **Exact Accuracy** | 82.0% | 94.0% | **94.0%** |
+ | **Partial Accuracy** | 82.0% | 94.0% | **94.0%** |
+ | Format Quality | 0.92 | 0.70 | 0.70 |
+ | **Reasoning Quality** | 0.53 | 0.51 | **0.54** |
+ | Avg Response Length | 523 words | 169 words | 181 words |
+
+ ### Performance by Difficulty
+
+ | Difficulty | Base | Solver | **Reasoner** |
+ |---|---|---|---|
+ | Easy (n=9) | 100.0% | 88.9% | 88.9% |
+ | Medium (n=23) | 87.0% | 95.7% | 95.7% |
+ | Hard (n=18) | 66.7% | 94.4% | **94.4%** |
+
+ ### Performance by Category
+
+ | Category | Base | Solver | **Reasoner** |
+ |---|---|---|---|
+ | normal_form_2x2 | 100.0% | 80.0% | 80.0% |
+ | normal_form_3x3 | 80.0% | 60.0% | 60.0% |
+ | normal_form_3x4 | 100.0% | 100.0% | 100.0% |
+ | normal_form_4x4 | 100.0% | 100.0% | 100.0% |
+ | zero_sum | 100.0% | 100.0% | 100.0% |
+ | sequential_game | 100.0% | 100.0% | 100.0% |
+ | auction_theory | 80.0% | 100.0% | 100.0% |
+ | bayesian_game | **0.0%** | **100.0%** | **100.0%** |
+ | cooperative_game | 100.0% | 100.0% | 100.0% |
+ | mechanism_design | 60.0% | 100.0% | 100.0% |
+
+ ### Key Findings
+
+ - **+12 points exact accuracy** over base Qwen2.5-7B-Instruct (82% → 94%)
+ - **Large gains on hard problems**: 66.7% → 94.4% (+27.7 points)
+ - **Bayesian games**: 0% → 100% (the most dramatic improvement)
+ - **Mechanism design**: 60% → 100%
+ - **Reasoning quality improved** by GRPO: 0.51 (Solver) → 0.54 (Reasoner)
+ - **Concise outputs**: ~65% shorter than the base model while more accurate
+
+ ## Training Details
+
+ ### GRPO Configuration
+ | Parameter | Value |
+ |---|---|
+ | Method | Group Relative Policy Optimization (GRPO) |
+ | Steps | 750 |
+ | Training Time | ~8 hours on RTX 3090 |
+ | LoRA Rank (r) | 32 |
+ | LoRA Alpha | 64 |
+ | Learning Rate | 5e-6 |
+ | KL Beta | 0.04 |
+ | Num Generations | 4 |
+ | Max Completion Length | 1024 |
+
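The hyperparameters above map naturally onto TRL-style GRPO plus PEFT LoRA configs. The sketch below is a hypothetical reconstruction, not the actual training script (the card does not name the training framework, and `output_dir` is a placeholder):

```python
# Hypothetical sketch: assumes TRL's GRPO implementation and PEFT's LoraConfig,
# with values taken from the tables in this card.
from trl import GRPOConfig
from peft import LoraConfig

# GRPO hyperparameters from the table above
grpo_args = GRPOConfig(
    output_dir="gametheory-reasoner",  # placeholder
    learning_rate=5e-6,
    beta=0.04,                   # KL coefficient
    num_generations=4,           # completions sampled per prompt (group size)
    max_completion_length=1024,
    max_steps=750,
)

# LoRA settings (r=32, alpha=64), matching this repo's adapter_config.json
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```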
+ ### Reward Functions (3 verifiable rewards)
+ | Reward | Range | Description |
+ |---|---|---|
+ | **Accuracy** | 0.85 to 1.0 | Verifies correctness against gold answers using domain-specific comparators |
+ | **Format** | 0.64 to 0.82 | Checks structured output format (think/answer tags) |
+ | **Reasoning** | 0.55 to 0.79 | Evaluates reasoning chain quality and mathematical notation |
+ | **Total** | 2.36 to 2.55 | Combined reward signal |
+
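The exact reward implementations are not included in this repo. As an illustration, a minimal format/accuracy reward pair in the style described above could look like the following; the tag names, score values, and string-based comparator are assumptions, not the trained versions:

```python
import re

def format_reward(completion: str) -> float:
    """Reward well-formed <think>...</think> and <answer>...</answer> output.

    Hypothetical sketch: tag names and score weights are assumptions.
    """
    has_think = bool(re.search(r"<think>.*?</think>", completion, re.DOTALL))
    has_answer = bool(re.search(r"<answer>.*?</answer>", completion, re.DOTALL))
    return 0.5 * has_think + 0.5 * has_answer

def accuracy_reward(completion: str, gold: str) -> float:
    """Compare the extracted answer against the gold label.

    A real comparator would be domain-specific (e.g. set equality over Nash
    equilibria); this sketch uses normalized string matching.
    """
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if m is None:
        return 0.0
    norm = lambda s: re.sub(r"\s+", " ", s).strip().lower()
    return float(norm(m.group(1)) == norm(gold))

out = "<think>Check each cell for profitable deviations.</think><answer>(T, R)</answer>"
print(format_reward(out))              # 1.0
print(accuracy_reward(out, "(T, R)"))  # 1.0
```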
+ ### Training Dynamics
+ | Metric | Value |
+ |---|---|
+ | Final Loss | ~0.0002 |
+ | KL Divergence | 0.004 to 0.015 |
+
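For reference, GRPO-style trainers commonly estimate the per-token KL divergence against the reference policy with the unbiased "k3" estimator; assuming that formulation here (the card does not specify the estimator used):

```python
import math

def kl_estimate(logp_policy: float, logp_ref: float) -> float:
    """Per-token k3 KL estimator, common in GRPO-style trainers:
    KL ≈ exp(logp_ref - logp_policy) - (logp_ref - logp_policy) - 1.
    Non-negative, and exactly zero when the policy matches the reference.
    """
    d = logp_ref - logp_policy
    return math.exp(d) - d - 1.0

print(kl_estimate(-1.0, -1.0))  # 0.0 (identical distributions)
```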
+ ## Usage
+
+ ### Loading the Model
+
+ This adapter requires a two-step loading process since it was trained on top of the Phase 1 merged model:
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ from peft import PeftModel
+ import torch
+
+ # Step 1: Load the Phase 1 merged model as base
+ base_model = AutoModelForCausalLM.from_pretrained(
+     "Alogotron/GameTheory-Solver",  # or your local phase1_merged path
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+ )
+
+ # Step 2: Apply the GRPO Reasoner adapter
+ model = PeftModel.from_pretrained(base_model, "Alogotron/GameTheory-Reasoner")
+ model.eval()
+
+ # Load tokenizer
+ tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
+ ```
+
+ ### Inference
+
+ ```python
+ system_prompt = (
+     "You are a game theory expert. Solve the following problem step by step. "
+     "Show your reasoning clearly, then provide your final answer."
+ )
+
+ problem = (
+     "Consider a 2-player game with the following payoff matrix: "
+     "L: (3,2) (1,4), R: (2,3) (4,1). Find all Nash Equilibria."
+ )
+
+ messages = [
+     {"role": "system", "content": system_prompt},
+     {"role": "user", "content": problem},
+ ]
+
+ prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+
+ with torch.no_grad():
+     output = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
+
+ response = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
+ print(response)
+ ```
+
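Model answers to prompts like the one above can be sanity-checked mechanically. Below is a brute-force pure-strategy Nash equilibrium finder; reading the example matrix as row strategies L/R against two column strategies (that reading is an assumption), this particular game turns out to have no pure-strategy equilibrium, only a mixed one:

```python
from itertools import product

def pure_nash_equilibria(row_payoffs, col_payoffs):
    """Return all pure-strategy Nash equilibria of a bimatrix game.

    A cell (i, j) is an equilibrium when neither player can gain by
    unilaterally deviating to another strategy.
    """
    n_rows, n_cols = len(row_payoffs), len(row_payoffs[0])
    equilibria = []
    for i, j in product(range(n_rows), range(n_cols)):
        row_best = all(row_payoffs[i][j] >= row_payoffs[k][j] for k in range(n_rows))
        col_best = all(col_payoffs[i][j] >= col_payoffs[i][k] for k in range(n_cols))
        if row_best and col_best:
            equilibria.append((i, j))
    return equilibria

# The game from the prompt, read as row strategies L/R vs. two column strategies:
#   L: (3,2) (1,4)
#   R: (2,3) (4,1)
row_u = [[3, 1], [2, 4]]
col_u = [[2, 4], [3, 1]]
print(pure_nash_equilibria(row_u, col_u))  # [] - no pure-strategy NE; the equilibrium is mixed
```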
+ ## Related Resources
+
+ - **Dataset**: [Alogotron/GameTheory-Bench](https://huggingface.co/datasets/Alogotron/GameTheory-Bench) - 2,913 game theory problems
+ - **Phase 1 Model**: [Alogotron/GameTheory-Solver](https://huggingface.co/Alogotron/GameTheory-Solver) - SFT fine-tuned solver
+ - **Demo**: [Game Theory Solver Space](https://huggingface.co/spaces/Alogotron/GameTheory-Solver)
+
+ ## License
+
+ Apache-2.0
adapter_config.json ADDED
@@ -0,0 +1,46 @@
+ {
+   "alora_invocation_tokens": null,
+   "alpha_pattern": {},
+   "arrow_config": null,
+   "auto_mapping": null,
+   "base_model_name_or_path": "/home/beta1/gt-training/phase1_merged",
+   "bias": "none",
+   "corda_config": null,
+   "ensure_weight_tying": false,
+   "eva_config": null,
+   "exclude_modules": null,
+   "fan_in_fan_out": false,
+   "inference_mode": true,
+   "init_lora_weights": true,
+   "layer_replication": null,
+   "layers_pattern": null,
+   "layers_to_transform": null,
+   "loftq_config": {},
+   "lora_alpha": 64,
+   "lora_bias": false,
+   "lora_dropout": 0.05,
+   "megatron_config": null,
+   "megatron_core": "megatron.core",
+   "modules_to_save": null,
+   "peft_type": "LORA",
+   "peft_version": "0.18.1",
+   "qalora_group_size": 16,
+   "r": 32,
+   "rank_pattern": {},
+   "revision": null,
+   "target_modules": [
+     "k_proj",
+     "q_proj",
+     "v_proj",
+     "up_proj",
+     "gate_proj",
+     "down_proj",
+     "o_proj"
+   ],
+   "target_parameters": null,
+   "task_type": "CAUSAL_LM",
+   "trainable_token_indices": null,
+   "use_dora": false,
+   "use_qalora": false,
+   "use_rslora": false
+ }
adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fb0735fbf4192c1cabbe826b3c72c40f20d9e55c71f0f26328c3b8e3d9960b20
+ size 161533584
chat_template.jinja ADDED
@@ -0,0 +1,54 @@
+ {%- if tools %}
+     {{- '<|im_start|>system\n' }}
+     {%- if messages[0]['role'] == 'system' %}
+         {{- messages[0]['content'] }}
+     {%- else %}
+         {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}
+     {%- endif %}
+     {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
+     {%- for tool in tools %}
+         {{- "\n" }}
+         {{- tool | tojson }}
+     {%- endfor %}
+     {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
+ {%- else %}
+     {%- if messages[0]['role'] == 'system' %}
+         {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}
+     {%- else %}
+         {{- '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n' }}
+     {%- endif %}
+ {%- endif %}
+ {%- for message in messages %}
+     {%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}
+         {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
+     {%- elif message.role == "assistant" %}
+         {{- '<|im_start|>' + message.role }}
+         {%- if message.content %}
+             {{- '\n' + message.content }}
+         {%- endif %}
+         {%- for tool_call in message.tool_calls %}
+             {%- if tool_call.function is defined %}
+                 {%- set tool_call = tool_call.function %}
+             {%- endif %}
+             {{- '\n<tool_call>\n{"name": "' }}
+             {{- tool_call.name }}
+             {{- '", "arguments": ' }}
+             {{- tool_call.arguments | tojson }}
+             {{- '}\n</tool_call>' }}
+         {%- endfor %}
+         {{- '<|im_end|>\n' }}
+     {%- elif message.role == "tool" %}
+         {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}
+             {{- '<|im_start|>user' }}
+         {%- endif %}
+         {{- '\n<tool_response>\n' }}
+         {{- message.content }}
+         {{- '\n</tool_response>' }}
+         {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
+             {{- '<|im_end|>\n' }}
+         {%- endif %}
+     {%- endif %}
+ {%- endfor %}
+ {%- if add_generation_prompt %}
+     {{- '<|im_start|>assistant\n' }}
+ {%- endif %}
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f7f96da3a872b5e901575b2067c744ad336c3a3d77a21584d20024557b1bd7f0
+ size 11422059
tokenizer_config.json ADDED
@@ -0,0 +1,29 @@
+ {
+   "add_prefix_space": false,
+   "backend": "tokenizers",
+   "bos_token": null,
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "<|im_end|>",
+   "errors": "replace",
+   "extra_special_tokens": [
+     "<|im_start|>",
+     "<|im_end|>",
+     "<|object_ref_start|>",
+     "<|object_ref_end|>",
+     "<|box_start|>",
+     "<|box_end|>",
+     "<|quad_start|>",
+     "<|quad_end|>",
+     "<|vision_start|>",
+     "<|vision_end|>",
+     "<|vision_pad|>",
+     "<|image_pad|>",
+     "<|video_pad|>"
+   ],
+   "is_local": true,
+   "model_max_length": 131072,
+   "pad_token": "<|endoftext|>",
+   "split_special_tokens": false,
+   "tokenizer_class": "Qwen2Tokenizer",
+   "unk_token": null
+ }