Commit: aed711f
Parent(s): b3123f8
Update references: L40S -> A100 across codebase
- PRD.md +2 -2
- README.md +8 -8
- blog.md +2 -2
- docs/TRAINING_GUIDE.md +3 -3
- docs/architecture.md +2 -2
- {hf_space_l40s → hf_space_a100}/.python-version +0 -0
- {hf_space_l40s → hf_space_a100}/Dockerfile +0 -0
- {hf_space_l40s → hf_space_a100}/README.md +2 -2
- {hf_space_l40s → hf_space_a100}/app.py +2 -2
- {hf_space_l40s → hf_space_a100}/requirements.txt +0 -0
- {hf_space_l40s → hf_space_a100}/train_script.py +2 -2
- openenv.yaml +3 -3
PRD.md
CHANGED
@@ -170,7 +170,7 @@ ScenarioGenerator → ConflictBenchEnv → Verifier → GRPO Trainer
  | Epochs | 2 | Reward peaks around epoch 2; more risks KL drift |
  | Learning rate | 3e-6 | Conservative; preserves instruction following |
  | β (KL penalty) | 0.04 | Prevents excessive drift; 0.02 was insufficient |
- | num_generations | 4–6 | 4 minimum; 6 preferred on
+ | num_generations | 4–6 | 4 minimum; 6 preferred on A100 |

  ---

@@ -215,7 +215,7 @@ All targets exceeded.
  | Reward hacking via length | Low | Medium | Efficiency rubric penalises over-inclusion |
  | JSON format memorisation without semantic understanding | Medium | High | Conflict ID F1 requires correct instruction IDs |
  | KL divergence runaway | Medium | Medium | β=0.04 provides sufficient penalty |
- | Unsloth kernel dtype bug on full-precision path | High (
+ | Unsloth kernel dtype bug on full-precision path | High (A100) | Critical | Always use 4-bit quantisation |
  | Kaggle session timeout mid-training | High | Medium | Checkpoint-every-50-steps + resume support |

  ---
README.md
CHANGED
@@ -5,7 +5,7 @@ colorFrom: indigo
  colorTo: purple
  sdk: gradio
  python_version: "3.10"
- app_file:
+ app_file: hf_space_a100/app.py
  pinned: true
  license: mit
  ---
@@ -138,7 +138,7 @@ Partial credit via F1 scoring gives GRPO a dense, informative gradient signal at
  |---|---|---|---|
  | Baseline (no training) | — | 0.14 | Qwen2.5-3B zero-shot |
  | Run 1 warmup | 0–120 | 0.14 → 0.22 | Colab T4, 600 scenarios, 4-bit |
- | Run 2 (production) | 0–500 | 0.37 → 0.48 |
+ | Run 2 (production) | 0–500 | 0.37 → 0.48 | A100 48GB, 400 scenarios, 2 epochs |
  | Run 2 peak | ~step 250 | 0.50 | Best checkpoint |

  ---
@@ -191,7 +191,7 @@ Partial credit via F1 scoring gives GRPO a dense, informative gradient signal at
  │ Max tokens: 768 completion / 3200 prompt │
  │ Epochs: 2–3 │
  │ LR: 3 × 10⁻⁶ | Warmup: 5% | β (KL): 0.04 │
- │ Hardware:
+ │ Hardware: A100 48GB (~8h) / L4 24GB (~14h) │
  └─────────────────────────────────────────────────────────────┘
  ```

@@ -199,7 +199,7 @@ Partial credit via F1 scoring gives GRPO a dense, informative gradient signal at

  ## Results

- > All results from Run 2 (
+ > All results from Run 2 (A100 48GB, 400 scenarios, 2 epochs, 4-bit quantised Qwen2.5-3B).

  ### Training Curves (Run 2)

@@ -233,9 +233,9 @@ This repository contains two training scripts serving different purposes:
  | Script | Location | Purpose |
  |---|---|---|
  | `train_grpo.py` | Root directory | Local training, full configurability, research use |
- | `train.py` | `
+ | `train.py` | `hf_space_a100/` | HF Spaces + Colab training, Gradio UI integration, production use |

- The HF Spaces script (`
+ The HF Spaces script (`hf_space_a100/train.py`) is pre-configured for the A100 GPU and includes the Gradio dashboard for live monitoring. The root script (`train_grpo.py`) exposes all hyperparameters directly and is intended for local or Kaggle training with more manual control.

  ---

@@ -354,7 +354,7 @@ print("Patched for quick eval run (60 scenarios, 1 epoch, 4 generations)")
  print("Expected runtime: ~30 minutes on T4, ~15 minutes on L4/A10G")
  print()
  print("NOTE: Reward values from this run will be lower than reported results.")
- print("Full results used 400 scenarios × 2 epochs on
+ print("Full results used 400 scenarios × 2 epochs on A100 48GB.")
  ```

  Then run normally:
@@ -378,7 +378,7 @@ Conflict_Bench/
  ├── openenv.yaml # OpenEnv manifest
  ├── requirements.txt # Python dependencies
  │
- ├──
+ ├── hf_space_a100/ # HF Spaces deployment package
  │ ├── app.py # Gradio training dashboard (live log streaming)
  │ ├── train.py # Training script optimised for HF Spaces / Colab
  │ └── README.md # HF Spaces card
blog.md
CHANGED
@@ -46,7 +46,7 @@ GRPO works by generating multiple completions for the same prompt, scoring each,

  This is exactly right for authority resolution. The model does not need to know what the correct answer looks like. It needs to learn to generate better answers than its own average. The reward function provides the signal that defines "better," and because the reward is fully deterministic and rule-based, the gradient is clean — there is no ambiguity about whether a completion deserved its score.

- We used Unsloth's optimised GRPO implementation with TRL's GRPOTrainer, applied to Qwen2.5-3B-Instruct with 4-bit quantisation and LoRA (rank 32, targeting all attention and MLP projections). The full training run on an
+ We used Unsloth's optimised GRPO implementation with TRL's GRPOTrainer, applied to Qwen2.5-3B-Instruct with 4-bit quantisation and LoRA (rank 32, targeting all attention and MLP projections). The full training run on an A100 48GB GPU takes approximately 8 hours for 400 scenarios across 2 epochs.

  ---

@@ -70,7 +70,7 @@ Three common reward-gaming strategies all fail against this rubric set. Always f

  ## What the Training Curves Show

- The results from Run 2 — 400 training scenarios, 2 epochs,
+ The results from Run 2 — 400 training scenarios, 2 epochs, A100 48GB — are instructive.

  The baseline Qwen2.5-3B-Instruct achieves a composite reward of 0.14 on this task with no training. This is low, but not zero: the model can produce valid JSON and occasionally resolves conflicts correctly by chance.

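The blog.md context above says the model only needs to "generate better answers than its own average". A minimal sketch of that group-relative scoring follows, assuming the usual GRPO formulation in which each reward is centred on the group mean and scaled by the group standard deviation. The function name `group_relative_advantages` and the sample rewards are hypothetical; TRL's GRPOTrainer computes its advantages internally, and this is not its code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each completion relative to the average of its own group.

    `rewards` holds one scalar reward per completion sampled for the same
    prompt (the group size corresponds to num_generations in the diff).
    `eps` is an illustrative guard against division by zero.
    """
    baseline = mean(rewards)   # the group's own average
    spread = pstdev(rewards)   # population std within the group
    return [(r - baseline) / (spread + eps) for r in rewards]


# Hypothetical rewards for a group of 4 completions of one prompt.
rewards = [0.10, 0.30, 0.50, 0.30]
advantages = group_relative_advantages(rewards)

# Completions above the group mean get a positive advantage, those below
# get a negative one, and a group with identical rewards yields all zeros.
for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f} advantage={a:+.3f}")
```

One consequence of this design is that a group whose completions all score the same contributes no learning signal, which may be part of why a partial-credit reward is useful here.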
docs/TRAINING_GUIDE.md
CHANGED
@@ -4,7 +4,7 @@

  | Platform | GPU | VRAM | Est. Time | Cost | Recommended For |
  |---|---|---|---|---|---|
- | HF Spaces (
+ | HF Spaces (A100) | A100 | 48GB | ~8h | ~$14 | Production runs |
  | HF Spaces (L4) | L4 | 24GB | ~14h | ~$11 | Production runs |
  | Google Colab Pro | L4 | 24GB | ~14h | ~$10 | Experimentation |
  | Google Colab Pro+ | A100 | 40GB | ~6h | ~$20 | Fast iteration |
@@ -17,7 +17,7 @@

  This repository has two training entry points:

- **`
+ **`hf_space_a100/train.py`** — for HF Spaces and Colab. Integrates with the Gradio dashboard, streams live logs, auto-uploads to HF Hub. Pre-configured for A100 but auto-detects GPU.

  **`train_grpo.py`** — for local machines, Kaggle, and research use. Exposes all hyperparameters directly. More verbose logging. Supports checkpoint resume.

@@ -33,7 +33,7 @@ EVAL_SCENARIOS = 60 # held-out evaluation scenarios
  NUM_EPOCHS = 2 # full passes over the training set
  LEARNING_RATE = 3e-6 # conservative; prevents catastrophic forgetting
  BETA = 0.04 # KL penalty; 0.02 causes excessive drift (use 0.04+)
- num_generations = 4 # GRPO group size; 6 recommended on
+ num_generations = 4 # GRPO group size; 6 recommended on A100
  MAX_NEW_TOKENS = 768 # completion budget; average ~300 tokens in practice
  MAX_PROMPT_LENGTH = 3200 # prompt budget; generator enforces 4000 char limit
  SAVE_STEPS = 50 # checkpoint frequency
docs/architecture.md
CHANGED
@@ -192,7 +192,7 @@ Next Episode
  | `generator.py` | ~500 | Scenario generation with all templates |
  | `verifier.py` | ~300 | Five rubric functions + composite scorer |
  | `train_grpo.py` | ~350 | Root training script (local/Kaggle) |
- | `
- | `
+ | `hf_space_a100/train.py` | ~400 | HF Spaces/Colab training with Gradio integration |
+ | `hf_space_a100/app.py` | ~130 | Gradio dashboard |
  | `diagnose_tokens.py` | ~80 | Token budget analysis utility |
  | `app.py` | ~100 | Local demo (base vs fine-tuned comparison) |
{hf_space_l40s → hf_space_a100}/.python-version
RENAMED
|
File without changes
|
{hf_space_l40s → hf_space_a100}/Dockerfile
RENAMED
|
File without changes
|
{hf_space_l40s → hf_space_a100}/README.md
RENAMED
@@ -1,5 +1,5 @@
  ---
- title: ConflictBench GRPO (
+ title: ConflictBench GRPO (A100)
  emoji: ⚔️
  colorFrom: indigo
  colorTo: purple
@@ -9,5 +9,5 @@ pinned: false

  # ConflictBench GRPO Training Pipeline

- One-click automated GRPO training pipeline tailored for an **NVIDIA
+ One-click automated GRPO training pipeline tailored for an **NVIDIA A100** GPU.
  This pipeline automatically clones the ConflictBench environment, generates scenarios based on a defined curriculum, runs GRPO fine-tuning using Unsloth, generates loss/reward plots, and pushes the final LoRA weights to the Hugging Face Hub.
{hf_space_l40s → hf_space_a100}/app.py
RENAMED
@@ -70,8 +70,8 @@ CSS = """
  color: #c9d1d9 !important; }
  """

- with gr.Blocks(title="ConflictBench GRPO Trainer (
- gr.Markdown("# ⚔️ ConflictBench — GRPO Training Dashboard (
+ with gr.Blocks(title="ConflictBench GRPO Trainer (A100)") as demo:
+ gr.Markdown("# ⚔️ ConflictBench — GRPO Training Dashboard (A100 Target)", elem_classes="main-title")
  gr.Markdown("**One-click** production GRPO training script mapped to Run 2 parameters. "
  "Automatically streams logs and generates plots.")

{hf_space_l40s → hf_space_a100}/requirements.txt
RENAMED
|
File without changes
|
{hf_space_l40s → hf_space_a100}/train_script.py
RENAMED
@@ -1,6 +1,6 @@
  """
- ConflictBench — GRPO Training Script for HF Space (
- Tailored for NVIDIA
+ ConflictBench — GRPO Training Script for HF Space (A100 Target)
+ Tailored for NVIDIA A100 (48GB VRAM)
  """

  import os
openenv.yaml
CHANGED
@@ -11,10 +11,10 @@ description: >
  runtime: python-3.10

  # HF Spaces entry point (Gradio dashboard + training UI)
- entry_point:
+ entry_point: hf_space_a100/app.py

  # HF Spaces / Colab training script
- training_entry:
+ training_entry: hf_space_a100/train.py

  # Root training script (local / Kaggle / research — full configurability)
  local_training_entry: train_grpo.py
@@ -66,7 +66,7 @@ training:
  max_completion_tokens: 768
  max_prompt_tokens: 3200
  hardware:
- -
+ - A100-48GB # ~8h
  - L4-24GB # ~14h
  - A10G-24GB # ~12h
  - T4-16GB # ~28h (2 Kaggle sessions)
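A rename that touches this many files can leave stale references behind, so a quick scan after the commit may be worthwhile. The sketch below is one possible check and is not part of the repository: `find_stale_references`, `missing_paths`, the `l40s` pattern, and the suffix list are all assumptions made for illustration. The second helper is motivated by the file list above, which shows a rename to `hf_space_a100/train_script.py` while `openenv.yaml` and the docs name `hf_space_a100/train.py`. Whether a `train.py` also exists in that directory is not visible in this diff.

```python
import re
from pathlib import Path

# Assumed pattern: any leftover mention of the old GPU name or directory.
STALE = re.compile(r"l40s", re.IGNORECASE)


def find_stale_references(root, pattern=STALE,
                          suffixes=(".md", ".py", ".yaml", ".txt")):
    """Return (path, line_number, line) for every line matching `pattern`."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits


def missing_paths(root, relative_paths):
    """Return the entries of `relative_paths` that do not exist under `root`."""
    return [p for p in relative_paths if not (Path(root) / p).exists()]


# Example usage from a repository checkout (paths taken from the diff above):
#   find_stale_references(".")
#   missing_paths(".", ["hf_space_a100/app.py", "hf_space_a100/train.py"])
```

An empty result from both calls would suggest the rename was applied consistently; any hit is only a candidate for review, since a match can also be a legitimate historical mention.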