Harsh-9209 committed on
Commit aed711f · 1 Parent(s): b3123f8

Update references: L40S -> A100 across codebase

PRD.md CHANGED
@@ -170,7 +170,7 @@ ScenarioGenerator → ConflictBenchEnv → Verifier → GRPO Trainer
  | Epochs | 2 | Reward peaks around epoch 2; more risks KL drift |
  | Learning rate | 3e-6 | Conservative; preserves instruction following |
  | β (KL penalty) | 0.04 | Prevents excessive drift; 0.02 was insufficient |
- | num_generations | 4–6 | 4 minimum; 6 preferred on L40S |
+ | num_generations | 4–6 | 4 minimum; 6 preferred on A100 |
 
  ---
 
@@ -215,7 +215,7 @@ All targets exceeded.
  | Reward hacking via length | Low | Medium | Efficiency rubric penalises over-inclusion |
  | JSON format memorisation without semantic understanding | Medium | High | Conflict ID F1 requires correct instruction IDs |
  | KL divergence runaway | Medium | Medium | β=0.04 provides sufficient penalty |
- | Unsloth kernel dtype bug on full-precision path | High (L40S) | Critical | Always use 4-bit quantisation |
+ | Unsloth kernel dtype bug on full-precision path | High (A100) | Critical | Always use 4-bit quantisation |
  | Kaggle session timeout mid-training | High | Medium | Checkpoint-every-50-steps + resume support |
 
  ---
README.md CHANGED
@@ -5,7 +5,7 @@ colorFrom: indigo
  colorTo: purple
  sdk: gradio
  python_version: "3.10"
- app_file: hf_space_l40s/app.py
+ app_file: hf_space_a100/app.py
  pinned: true
  license: mit
  ---
@@ -138,7 +138,7 @@ Partial credit via F1 scoring gives GRPO a dense, informative gradient signal at
  |---|---|---|---|
  | Baseline (no training) | — | 0.14 | Qwen2.5-3B zero-shot |
  | Run 1 warmup | 0–120 | 0.14 → 0.22 | Colab T4, 600 scenarios, 4-bit |
- | Run 2 (production) | 0–500 | 0.37 → 0.48 | L40S 48GB, 400 scenarios, 2 epochs |
+ | Run 2 (production) | 0–500 | 0.37 → 0.48 | A100 48GB, 400 scenarios, 2 epochs |
  | Run 2 peak | ~step 250 | 0.50 | Best checkpoint |
 
  ---
@@ -191,7 +191,7 @@ Partial credit via F1 scoring gives GRPO a dense, informative gradient signal at
  │ Max tokens: 768 completion / 3200 prompt │
  │ Epochs: 2–3 │
  │ LR: 3 × 10⁻⁶ | Warmup: 5% | β (KL): 0.04 │
- │ Hardware: L40S 48GB (~8h) / L4 24GB (~14h) │
+ │ Hardware: A100 48GB (~8h) / L4 24GB (~14h) │
  └─────────────────────────────────────────────────────────────┘
  ```
 
@@ -199,7 +199,7 @@ Partial credit via F1 scoring gives GRPO a dense, informative gradient signal at
 
  ## Results
 
- > All results from Run 2 (L40S 48GB, 400 scenarios, 2 epochs, 4-bit quantised Qwen2.5-3B).
+ > All results from Run 2 (A100 48GB, 400 scenarios, 2 epochs, 4-bit quantised Qwen2.5-3B).
 
  ### Training Curves (Run 2)
 
@@ -233,9 +233,9 @@ This repository contains two training scripts serving different purposes:
  | Script | Location | Purpose |
  |---|---|---|
  | `train_grpo.py` | Root directory | Local training, full configurability, research use |
- | `train.py` | `hf_space_l40s/` | HF Spaces + Colab training, Gradio UI integration, production use |
+ | `train.py` | `hf_space_a100/` | HF Spaces + Colab training, Gradio UI integration, production use |
 
- The HF Spaces script (`hf_space_l40s/train.py`) is pre-configured for the L40S GPU and includes the Gradio dashboard for live monitoring. The root script (`train_grpo.py`) exposes all hyperparameters directly and is intended for local or Kaggle training with more manual control.
+ The HF Spaces script (`hf_space_a100/train.py`) is pre-configured for the A100 GPU and includes the Gradio dashboard for live monitoring. The root script (`train_grpo.py`) exposes all hyperparameters directly and is intended for local or Kaggle training with more manual control.
 
  ---
 
@@ -354,7 +354,7 @@ print("Patched for quick eval run (60 scenarios, 1 epoch, 4 generations)")
  print("Expected runtime: ~30 minutes on T4, ~15 minutes on L4/A10G")
  print()
  print("NOTE: Reward values from this run will be lower than reported results.")
- print("Full results used 400 scenarios × 2 epochs on L40S 48GB.")
+ print("Full results used 400 scenarios × 2 epochs on A100 48GB.")
  ```
 
  Then run normally:
@@ -378,7 +378,7 @@ Conflict_Bench/
  ├── openenv.yaml # OpenEnv manifest
  ├── requirements.txt # Python dependencies
 
- ├── hf_space_l40s/ # HF Spaces deployment package
+ ├── hf_space_a100/ # HF Spaces deployment package
  │ ├── app.py # Gradio training dashboard (live log streaming)
  │ ├── train.py # Training script optimised for HF Spaces / Colab
  │ └── README.md # HF Spaces card
blog.md CHANGED
@@ -46,7 +46,7 @@ GRPO works by generating multiple completions for the same prompt, scoring each,
 
  This is exactly right for authority resolution. The model does not need to know what the correct answer looks like. It needs to learn to generate better answers than its own average. The reward function provides the signal that defines "better," and because the reward is fully deterministic and rule-based, the gradient is clean — there is no ambiguity about whether a completion deserved its score.
 
- We used Unsloth's optimised GRPO implementation with TRL's GRPOTrainer, applied to Qwen2.5-3B-Instruct with 4-bit quantisation and LoRA (rank 32, targeting all attention and MLP projections). The full training run on an L40S 48GB GPU takes approximately 8 hours for 400 scenarios across 2 epochs.
+ We used Unsloth's optimised GRPO implementation with TRL's GRPOTrainer, applied to Qwen2.5-3B-Instruct with 4-bit quantisation and LoRA (rank 32, targeting all attention and MLP projections). The full training run on an A100 48GB GPU takes approximately 8 hours for 400 scenarios across 2 epochs.
 
  ---
 
@@ -70,7 +70,7 @@ Three common reward-gaming strategies all fail against this rubric set. Always f
 
  ## What the Training Curves Show
 
- The results from Run 2 — 400 training scenarios, 2 epochs, L40S 48GB — are instructive.
+ The results from Run 2 — 400 training scenarios, 2 epochs, A100 48GB — are instructive.
 
  The baseline Qwen2.5-3B-Instruct achieves a composite reward of 0.14 on this task with no training. This is low, but not zero: the model can produce valid JSON and occasionally resolves conflicts correctly by chance.
 
docs/TRAINING_GUIDE.md CHANGED
@@ -4,7 +4,7 @@
 
  | Platform | GPU | VRAM | Est. Time | Cost | Recommended For |
  |---|---|---|---|---|---|
- | HF Spaces (L40S) | L40S | 48GB | ~8h | ~$14 | Production runs |
+ | HF Spaces (A100) | A100 | 48GB | ~8h | ~$14 | Production runs |
  | HF Spaces (L4) | L4 | 24GB | ~14h | ~$11 | Production runs |
  | Google Colab Pro | L4 | 24GB | ~14h | ~$10 | Experimentation |
  | Google Colab Pro+ | A100 | 40GB | ~6h | ~$20 | Fast iteration |
@@ -17,7 +17,7 @@
 
  This repository has two training entry points:
 
- **`hf_space_l40s/train.py`** — for HF Spaces and Colab. Integrates with the Gradio dashboard, streams live logs, auto-uploads to HF Hub. Pre-configured for L40S but auto-detects GPU.
+ **`hf_space_a100/train.py`** — for HF Spaces and Colab. Integrates with the Gradio dashboard, streams live logs, auto-uploads to HF Hub. Pre-configured for A100 but auto-detects GPU.
 
  **`train_grpo.py`** — for local machines, Kaggle, and research use. Exposes all hyperparameters directly. More verbose logging. Supports checkpoint resume.
 
@@ -33,7 +33,7 @@ EVAL_SCENARIOS = 60 # held-out evaluation scenarios
  NUM_EPOCHS = 2 # full passes over the training set
  LEARNING_RATE = 3e-6 # conservative; prevents catastrophic forgetting
  BETA = 0.04 # KL penalty; 0.02 causes excessive drift (use 0.04+)
- num_generations = 4 # GRPO group size; 6 recommended on L40S
+ num_generations = 4 # GRPO group size; 6 recommended on A100
  MAX_NEW_TOKENS = 768 # completion budget; average ~300 tokens in practice
  MAX_PROMPT_LENGTH = 3200 # prompt budget; generator enforces 4000 char limit
  SAVE_STEPS = 50 # checkpoint frequency
docs/architecture.md CHANGED
@@ -192,7 +192,7 @@ Next Episode
  | `generator.py` | ~500 | Scenario generation with all templates |
  | `verifier.py` | ~300 | Five rubric functions + composite scorer |
  | `train_grpo.py` | ~350 | Root training script (local/Kaggle) |
- | `hf_space_l40s/train.py` | ~400 | HF Spaces/Colab training with Gradio integration |
- | `hf_space_l40s/app.py` | ~130 | Gradio dashboard |
+ | `hf_space_a100/train.py` | ~400 | HF Spaces/Colab training with Gradio integration |
+ | `hf_space_a100/app.py` | ~130 | Gradio dashboard |
  | `diagnose_tokens.py` | ~80 | Token budget analysis utility |
  | `app.py` | ~100 | Local demo (base vs fine-tuned comparison) |
{hf_space_l40s → hf_space_a100}/.python-version RENAMED
File without changes
{hf_space_l40s → hf_space_a100}/Dockerfile RENAMED
File without changes
{hf_space_l40s → hf_space_a100}/README.md RENAMED
@@ -1,5 +1,5 @@
  ---
- title: ConflictBench GRPO (L40S)
+ title: ConflictBench GRPO (A100)
  emoji: ⚔️
  colorFrom: indigo
  colorTo: purple
@@ -9,5 +9,5 @@ pinned: false
 
  # ConflictBench GRPO Training Pipeline
 
- One-click automated GRPO training pipeline tailored for an **NVIDIA L40S** GPU.
+ One-click automated GRPO training pipeline tailored for an **NVIDIA A100** GPU.
  This pipeline automatically clones the ConflictBench environment, generates scenarios based on a defined curriculum, runs GRPO fine-tuning using Unsloth, generates loss/reward plots, and pushes the final LoRA weights to the Hugging Face Hub.
{hf_space_l40s → hf_space_a100}/app.py RENAMED
@@ -70,8 +70,8 @@ CSS = """
  color: #c9d1d9 !important; }
  """
 
- with gr.Blocks(title="ConflictBench GRPO Trainer (L40S)") as demo:
- gr.Markdown("# ⚔️ ConflictBench — GRPO Training Dashboard (L40S Target)", elem_classes="main-title")
+ with gr.Blocks(title="ConflictBench GRPO Trainer (A100)") as demo:
+ gr.Markdown("# ⚔️ ConflictBench — GRPO Training Dashboard (A100 Target)", elem_classes="main-title")
  gr.Markdown("**One-click** production GRPO training script mapped to Run 2 parameters. "
  "Automatically streams logs and generates plots.")
 
{hf_space_l40s → hf_space_a100}/requirements.txt RENAMED
File without changes
{hf_space_l40s → hf_space_a100}/train_script.py RENAMED
@@ -1,6 +1,6 @@
  """
- ConflictBench — GRPO Training Script for HF Space (L40S Target)
- Tailored for NVIDIA L40S (48GB VRAM)
+ ConflictBench — GRPO Training Script for HF Space (A100 Target)
+ Tailored for NVIDIA A100 (48GB VRAM)
  """
 
  import os
openenv.yaml CHANGED
@@ -11,10 +11,10 @@ description: >
  runtime: python-3.10
 
  # HF Spaces entry point (Gradio dashboard + training UI)
- entry_point: hf_space_l40s/app.py
+ entry_point: hf_space_a100/app.py
 
  # HF Spaces / Colab training script
- training_entry: hf_space_l40s/train.py
+ training_entry: hf_space_a100/train.py
 
  # Root training script (local / Kaggle / research — full configurability)
  local_training_entry: train_grpo.py
@@ -66,7 +66,7 @@ training:
  max_completion_tokens: 768
  max_prompt_tokens: 3200
  hardware:
- - L40S-48GB # ~8h
+ - A100-48GB # ~8h
  - L4-24GB # ~14h
  - A10G-24GB # ~12h
  - T4-16GB # ~28h (2 Kaggle sessions)
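
Since this commit is a bulk `L40S -> A100` rename spanning docs, config and a directory, a reviewer may want to confirm that no stale references remain. The following is a minimal sketch of such a check, not part of the repository: the helper name `find_stale_references`, the `l40s` pattern and the list of file suffixes are all assumptions chosen for illustration.

```python
import re
from pathlib import Path

# Hypothetical review helper (not in the repository): reports lines and
# path names that still mention the old GPU name after the rename.
STALE = re.compile(r"l40s", re.IGNORECASE)

# Assumed set of text file types worth scanning; extend as needed.
TEXT_SUFFIXES = {".md", ".py", ".yaml", ".yml", ".txt"}


def find_stale_references(root):
    """Return (relative_path, line_number, text) tuples for stale mentions.

    A line_number of 0 marks a file or directory whose own name is stale.
    """
    root = Path(root)
    hits = []
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        # Skip anything under a hidden directory such as .git.
        if any(part.startswith(".") for part in rel.parts[:-1]):
            continue
        # A directory or file name can itself be stale (e.g. hf_space_l40s/).
        if STALE.search(path.name):
            hits.append((rel.as_posix(), 0, "<path name>"))
        if not path.is_file() or path.suffix not in TEXT_SUFFIXES:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if STALE.search(line):
                hits.append((rel.as_posix(), number, line.strip()))
    return hits
```

Calling `find_stale_references(".")` from the repository root and printing each tuple would list any remaining mentions. Files without a listed suffix (for example `Dockerfile`) are skipped in this sketch, so the suffix set would need widening to cover them.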