Alogotron committed on
Commit 2a6d34f · verified · 1 Parent(s): 3e2c059

Upload folder using huggingface_hub

Browse files
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,186 @@
+ ---
+ library_name: peft
+ base_model: Qwen/Qwen2.5-7B-Instruct
+ tags:
+ - game-theory
+ - grpo
+ - reinforcement-learning
+ - reasoning
+ - qwen2.5
+ - lora
+ - peft
+ license: apache-2.0
+ datasets:
+ - Alogotron/GameTheory-Bench
+ metrics:
+ - accuracy
+ pipeline_tag: text-generation
+ model-index:
+ - name: GameTheory-Reasoner
+   results:
+   - task:
+       type: text-generation
+       name: Game Theory Problem Solving
+     dataset:
+       name: GameTheory-Bench
+       type: Alogotron/GameTheory-Bench
+     metrics:
+     - name: Exact Accuracy
+       type: accuracy
+       value: 94.0
+       verified: true
+ ---
+
+ # GameTheory-Reasoner (GRPO Phase 2)
+
+ **A game theory reasoning model trained with Group Relative Policy Optimization (GRPO) and verifiable reward functions.**
+
+ This is a LoRA adapter trained on top of the [Phase 1 Solver](https://huggingface.co/Alogotron/GameTheory-Solver) (which itself is fine-tuned from [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)). It is Phase 2 of a two-phase training pipeline designed to build a strong game theory problem solver with enhanced reasoning capabilities.
+
+ ## Training Pipeline
+
+ ```
+ Qwen2.5-7B-Instruct (base)
+ |
+ +-- Phase 1: Supervised Fine-Tuning (QLoRA)
+ |   +-- GameTheory-Solver adapter
+ |   +-- Merged into: phase1_merged/
+ |
+ +-- Phase 2: GRPO Reinforcement Learning
+     +-- GameTheory-Reasoner adapter (this model)
+         Trained on top of phase1_merged
+ ```
+
+ ## Benchmark Results (GameTheory-Bench, n=50)
+
+ ### Overall Performance
+
+ | Metric | Base (Qwen2.5-7B) | Solver (Phase 1) | **Reasoner (Phase 2)** |
+ |---|---|---|---|
+ | **Exact Accuracy** | 82.0% | 94.0% | **94.0%** |
+ | **Partial Accuracy** | 82.0% | 94.0% | **94.0%** |
+ | Format Quality | 0.92 | 0.70 | 0.70 |
+ | **Reasoning Quality** | 0.53 | 0.51 | **0.54** |
+ | Avg Response Length | 523 words | 169 words | 181 words |
+
+ ### Performance by Difficulty
+
+ | Difficulty | Base | Solver | **Reasoner** |
+ |---|---|---|---|
+ | Easy (n=9) | 100.0% | 88.9% | 88.9% |
+ | Medium (n=23) | 87.0% | 95.7% | 95.7% |
+ | Hard (n=18) | 66.7% | 94.4% | **94.4%** |
+
+ ### Performance by Category
+
+ | Category | Base | Solver | **Reasoner** |
+ |---|---|---|---|
+ | normal_form_2x2 | 100.0% | 80.0% | 80.0% |
+ | normal_form_3x3 | 80.0% | 60.0% | 60.0% |
+ | normal_form_3x4 | 100.0% | 100.0% | 100.0% |
+ | normal_form_4x4 | 100.0% | 100.0% | 100.0% |
+ | zero_sum | 100.0% | 100.0% | 100.0% |
+ | sequential_game | 100.0% | 100.0% | 100.0% |
+ | auction_theory | 80.0% | 100.0% | 100.0% |
+ | bayesian_game | **0.0%** | **100.0%** | **100.0%** |
+ | cooperative_game | 100.0% | 100.0% | 100.0% |
+ | mechanism_design | 60.0% | 100.0% | 100.0% |
+
+ ### Key Findings
+
+ - **+12 points exact accuracy** over base Qwen2.5-7B-Instruct (82% → 94%)
+ - **Large gains on hard problems**: 66.7% → 94.4% (+27.7 points)
+ - **Bayesian games**: 0% → 100% (the most dramatic improvement)
+ - **Mechanism design**: 60% → 100%
+ - **Reasoning quality improved** by GRPO: 0.51 (Solver) → 0.54 (Reasoner)
+ - **Concise outputs**: ~65% shorter than the base model while more accurate
+
+ ## Training Details
+
+ ### GRPO Configuration
+ | Parameter | Value |
+ |---|---|
+ | Method | Group Relative Policy Optimization (GRPO) |
+ | Steps | 750 |
+ | Training Time | ~8 hours on RTX 3090 |
+ | LoRA Rank (r) | 32 |
+ | LoRA Alpha | 64 |
+ | Learning Rate | 5e-6 |
+ | KL Beta | 0.04 |
+ | Num Generations | 4 |
+ | Max Completion Length | 1024 |
+
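The hyperparameters above map naturally onto TRL-style GRPO plus PEFT LoRA configs. The sketch below is a hypothetical reconstruction, not the actual training script (the card does not name the training framework, and `output_dir` is a placeholder):

```python
# Hypothetical sketch: assumes TRL's GRPO implementation and PEFT's LoraConfig,
# with values taken from the tables in this card.
from trl import GRPOConfig
from peft import LoraConfig

# GRPO hyperparameters from the table above
grpo_args = GRPOConfig(
    output_dir="gametheory-reasoner",  # placeholder
    learning_rate=5e-6,
    beta=0.04,                   # KL coefficient
    num_generations=4,           # completions sampled per prompt (group size)
    max_completion_length=1024,
    max_steps=750,
)

# LoRA settings (r=32, alpha=64), matching this repo's adapter_config.json
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```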
+ ### Reward Functions (3 verifiable rewards)
+ | Reward | Range | Description |
+ |---|---|---|
+ | **Accuracy** | 0.85 to 1.0 | Verifies correctness against gold answers using domain-specific comparators |
+ | **Format** | 0.64 to 0.82 | Checks structured output format (think/answer tags) |
+ | **Reasoning** | 0.55 to 0.79 | Evaluates reasoning chain quality and mathematical notation |
+ | **Total** | 2.36 to 2.55 | Combined reward signal |
+
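The exact reward implementations are not included in this repo. As an illustration, a minimal format/accuracy reward pair in the style described above could look like the following; the tag names, score values, and string-based comparator are assumptions, not the trained versions:

```python
import re

def format_reward(completion: str) -> float:
    """Reward well-formed <think>...</think> and <answer>...</answer> output.

    Hypothetical sketch: tag names and score weights are assumptions.
    """
    has_think = bool(re.search(r"<think>.*?</think>", completion, re.DOTALL))
    has_answer = bool(re.search(r"<answer>.*?</answer>", completion, re.DOTALL))
    return 0.5 * has_think + 0.5 * has_answer

def accuracy_reward(completion: str, gold: str) -> float:
    """Compare the extracted answer against the gold label.

    A real comparator would be domain-specific (e.g. set equality over Nash
    equilibria); this sketch uses normalized string matching.
    """
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if m is None:
        return 0.0
    norm = lambda s: re.sub(r"\s+", " ", s).strip().lower()
    return float(norm(m.group(1)) == norm(gold))

out = "<think>Check each cell for profitable deviations.</think><answer>(T, R)</answer>"
print(format_reward(out))              # 1.0
print(accuracy_reward(out, "(T, R)"))  # 1.0
```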
+ ### Training Dynamics
+ | Metric | Value |
+ |---|---|
+ | Final Loss | ~0.0002 |
+ | KL Divergence | 0.004 to 0.015 |
+
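For reference, GRPO-style trainers commonly estimate the per-token KL divergence against the reference policy with the unbiased "k3" estimator; assuming that formulation here (the card does not specify the estimator used):

```python
import math

def kl_estimate(logp_policy: float, logp_ref: float) -> float:
    """Per-token k3 KL estimator, common in GRPO-style trainers:
    KL ≈ exp(logp_ref - logp_policy) - (logp_ref - logp_policy) - 1.
    Non-negative, and exactly zero when the policy matches the reference.
    """
    d = logp_ref - logp_policy
    return math.exp(d) - d - 1.0

print(kl_estimate(-1.0, -1.0))  # 0.0 (identical distributions)
```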
+ ## Usage
+
+ ### Loading the Model
+
+ This adapter requires a two-step loading process since it was trained on top of the Phase 1 merged model:
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ from peft import PeftModel
+ import torch
+
+ # Step 1: Load the Phase 1 merged model as base
+ base_model = AutoModelForCausalLM.from_pretrained(
+     "Alogotron/GameTheory-Solver",  # or your local phase1_merged path
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+ )
+
+ # Step 2: Apply the GRPO Reasoner adapter
+ model = PeftModel.from_pretrained(base_model, "Alogotron/GameTheory-Reasoner")
+ model.eval()
+
+ # Load tokenizer
+ tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
+ ```
+
+ ### Inference
+
+ ```python
+ system_prompt = (
+     "You are a game theory expert. Solve the following problem step by step. "
+     "Show your reasoning clearly, then provide your final answer."
+ )
+
+ problem = (
+     "Consider a 2-player game with the following payoff matrix: "
+     "L: (3,2) (1,4), R: (2,3) (4,1). Find all Nash Equilibria."
+ )
+
+ messages = [
+     {"role": "system", "content": system_prompt},
+     {"role": "user", "content": problem},
+ ]
+
+ prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+
+ with torch.no_grad():
+     output = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
+
+ response = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
+ print(response)
+ ```
+
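Model answers to prompts like the one above can be sanity-checked mechanically. Below is a brute-force pure-strategy Nash equilibrium finder; reading the example matrix as row strategies L/R against two column strategies (that reading is an assumption), this particular game turns out to have no pure-strategy equilibrium, only a mixed one:

```python
from itertools import product

def pure_nash_equilibria(row_payoffs, col_payoffs):
    """Return all pure-strategy Nash equilibria of a bimatrix game.

    A cell (i, j) is an equilibrium when neither player can gain by
    unilaterally deviating to another strategy.
    """
    n_rows, n_cols = len(row_payoffs), len(row_payoffs[0])
    equilibria = []
    for i, j in product(range(n_rows), range(n_cols)):
        row_best = all(row_payoffs[i][j] >= row_payoffs[k][j] for k in range(n_rows))
        col_best = all(col_payoffs[i][j] >= col_payoffs[i][k] for k in range(n_cols))
        if row_best and col_best:
            equilibria.append((i, j))
    return equilibria

# The game from the prompt, read as row strategies L/R vs. two column strategies:
#   L: (3,2) (1,4)
#   R: (2,3) (4,1)
row_u = [[3, 1], [2, 4]]
col_u = [[2, 4], [3, 1]]
print(pure_nash_equilibria(row_u, col_u))  # [] - no pure-strategy NE; the equilibrium is mixed
```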
+ ## Related Resources
+
+ - **Dataset**: [Alogotron/GameTheory-Bench](https://huggingface.co/datasets/Alogotron/GameTheory-Bench) - 2,913 game theory problems
+ - **Phase 1 Model**: [Alogotron/GameTheory-Solver](https://huggingface.co/Alogotron/GameTheory-Solver) - SFT fine-tuned solver
+ - **Demo**: [Game Theory Solver Space](https://huggingface.co/spaces/Alogotron/GameTheory-Solver)
+
+ ## License
+
+ Apache-2.0
adapter_config.json ADDED
@@ -0,0 +1,46 @@
+ {
+   "alora_invocation_tokens": null,
+   "alpha_pattern": {},
+   "arrow_config": null,
+   "auto_mapping": null,
+   "base_model_name_or_path": "/home/beta1/gt-training/phase1_merged",
+   "bias": "none",
+   "corda_config": null,
+   "ensure_weight_tying": false,
+   "eva_config": null,
+   "exclude_modules": null,
+   "fan_in_fan_out": false,
+   "inference_mode": true,
+   "init_lora_weights": true,
+   "layer_replication": null,
+   "layers_pattern": null,
+   "layers_to_transform": null,
+   "loftq_config": {},
+   "lora_alpha": 64,
+   "lora_bias": false,
+   "lora_dropout": 0.05,
+   "megatron_config": null,
+   "megatron_core": "megatron.core",
+   "modules_to_save": null,
+   "peft_type": "LORA",
+   "peft_version": "0.18.1",
+   "qalora_group_size": 16,
+   "r": 32,
+   "rank_pattern": {},
+   "revision": null,
+   "target_modules": [
+     "k_proj",
+     "q_proj",
+     "v_proj",
+     "up_proj",
+     "gate_proj",
+     "down_proj",
+     "o_proj"
+   ],
+   "target_parameters": null,
+   "task_type": "CAUSAL_LM",
+   "trainable_token_indices": null,
+   "use_dora": false,
+   "use_qalora": false,
+   "use_rslora": false
+ }
adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fb0735fbf4192c1cabbe826b3c72c40f20d9e55c71f0f26328c3b8e3d9960b20
+ size 161533584
chat_template.jinja ADDED
@@ -0,0 +1,54 @@
+ {%- if tools %}
+     {{- '<|im_start|>system\n' }}
+     {%- if messages[0]['role'] == 'system' %}
+         {{- messages[0]['content'] }}
+     {%- else %}
+         {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}
+     {%- endif %}
+     {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
+     {%- for tool in tools %}
+         {{- "\n" }}
+         {{- tool | tojson }}
+     {%- endfor %}
+     {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
+ {%- else %}
+     {%- if messages[0]['role'] == 'system' %}
+         {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}
+     {%- else %}
+         {{- '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n' }}
+     {%- endif %}
+ {%- endif %}
+ {%- for message in messages %}
+     {%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}
+         {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
+     {%- elif message.role == "assistant" %}
+         {{- '<|im_start|>' + message.role }}
+         {%- if message.content %}
+             {{- '\n' + message.content }}
+         {%- endif %}
+         {%- for tool_call in message.tool_calls %}
+             {%- if tool_call.function is defined %}
+                 {%- set tool_call = tool_call.function %}
+             {%- endif %}
+             {{- '\n<tool_call>\n{"name": "' }}
+             {{- tool_call.name }}
+             {{- '", "arguments": ' }}
+             {{- tool_call.arguments | tojson }}
+             {{- '}\n</tool_call>' }}
+         {%- endfor %}
+         {{- '<|im_end|>\n' }}
+     {%- elif message.role == "tool" %}
+         {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}
+             {{- '<|im_start|>user' }}
+         {%- endif %}
+         {{- '\n<tool_response>\n' }}
+         {{- message.content }}
+         {{- '\n</tool_response>' }}
+         {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
+             {{- '<|im_end|>\n' }}
+         {%- endif %}
+     {%- endif %}
+ {%- endfor %}
+ {%- if add_generation_prompt %}
+     {{- '<|im_start|>assistant\n' }}
+ {%- endif %}
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f7f96da3a872b5e901575b2067c744ad336c3a3d77a21584d20024557b1bd7f0
+ size 11422059
tokenizer_config.json ADDED
@@ -0,0 +1,29 @@
+ {
+   "add_prefix_space": false,
+   "backend": "tokenizers",
+   "bos_token": null,
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "<|im_end|>",
+   "errors": "replace",
+   "extra_special_tokens": [
+     "<|im_start|>",
+     "<|im_end|>",
+     "<|object_ref_start|>",
+     "<|object_ref_end|>",
+     "<|box_start|>",
+     "<|box_end|>",
+     "<|quad_start|>",
+     "<|quad_end|>",
+     "<|vision_start|>",
+     "<|vision_end|>",
+     "<|vision_pad|>",
+     "<|image_pad|>",
+     "<|video_pad|>"
+   ],
+   "is_local": true,
+   "model_max_length": 131072,
+   "pad_token": "<|endoftext|>",
+   "split_special_tokens": false,
+   "tokenizer_class": "Qwen2Tokenizer",
+   "unk_token": null
+ }