Harsh-9209/Conflict_Bench

Harsh-9209 committed on Apr 26

Commit aed711f (1 parent: b3123f8)

Update references: L40S -> A100 across codebase
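A rename that touches this many files is easy to leave half-finished. The following is a minimal sketch, assuming a local checkout, of a check that lists any lines still mentioning the old name. The function name `find_stale_references`, the `STALE` pattern, and the suffix list are hypothetical and not part of this commit.

```python
import re
from pathlib import Path

# Hypothetical pattern for the old GPU name (matches "L40S" and "l40s").
STALE = re.compile(r"l40s", re.IGNORECASE)


def find_stale_references(root, suffixes=(".md", ".py", ".yaml", ".txt")):
    """Return (relative_path, line_number, line) for each line matching STALE."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        relative = path.relative_to(root)
        # Skip directories, version-control internals, and unlisted file types.
        if not path.is_file() or ".git" in relative.parts:
            continue
        if path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if STALE.search(line):
                hits.append((str(relative), number, line.strip()))
    return hits
```

Running `find_stale_references(".")` from the repository root after the rename should return an empty list if every reference was updated. Files without a listed suffix, such as `Dockerfile`, are skipped by this sketch.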


Files changed (12)

PRD.md +2 -2
README.md +8 -8
blog.md +2 -2
docs/TRAINING_GUIDE.md +3 -3
docs/architecture.md +2 -2
{hf_space_l40s → hf_space_a100}/.python-version +0 -0
{hf_space_l40s → hf_space_a100}/Dockerfile +0 -0
{hf_space_l40s → hf_space_a100}/README.md +2 -2
{hf_space_l40s → hf_space_a100}/app.py +2 -2
{hf_space_l40s → hf_space_a100}/requirements.txt +0 -0
{hf_space_l40s → hf_space_a100}/train_script.py +2 -2
openenv.yaml +3 -3

PRD.md CHANGED

@@ -170,7 +170,7 @@ ScenarioGenerator → ConflictBenchEnv → Verifier → GRPO Trainer
 | Epochs | 2 | Reward peaks around epoch 2; more risks KL drift |
 | Learning rate | 3e-6 | Conservative; preserves instruction following |
 | β (KL penalty) | 0.04 | Prevents excessive drift; 0.02 was insufficient |
-| num_generations | 4–6 | 4 minimum; 6 preferred on L40S |
+| num_generations | 4–6 | 4 minimum; 6 preferred on A100 |
 ---
@@ -215,7 +215,7 @@ All targets exceeded.
 | Reward hacking via length | Low | Medium | Efficiency rubric penalises over-inclusion |
 | JSON format memorisation without semantic understanding | Medium | High | Conflict ID F1 requires correct instruction IDs |
 | KL divergence runaway | Medium | Medium | β=0.04 provides sufficient penalty |
-| Unsloth kernel dtype bug on full-precision path | High (L40S) | Critical | Always use 4-bit quantisation |
+| Unsloth kernel dtype bug on full-precision path | High (A100) | Critical | Always use 4-bit quantisation |
 | Kaggle session timeout mid-training | High | Medium | Checkpoint-every-50-steps + resume support |
 ---
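The risk table's mitigation for session timeouts is checkpointing every 50 steps with resume support. Below is a minimal sketch of the resume half, assuming checkpoints are written as `checkpoint-<step>` directories, which is the Hugging Face `Trainer` naming convention. The helper name `latest_checkpoint` is hypothetical, and the repository's own resume logic may differ.

```python
import re
from pathlib import Path

# Matches directory names such as "checkpoint-50" or "checkpoint-250".
_CHECKPOINT = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir):
    """Return the highest-numbered checkpoint-<step> directory, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    best_step, best_path = -1, None
    for child in root.iterdir():
        match = _CHECKPOINT.match(child.name)
        # Compare steps numerically so checkpoint-100 beats checkpoint-50.
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path
```

If the training script builds on `Trainer`, the returned path can be passed as `trainer.train(resume_from_checkpoint=str(path))`. When it is `None`, training starts from scratch.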

README.md CHANGED

@@ -5,7 +5,7 @@ colorFrom: indigo
 colorTo: purple
 sdk: gradio
 python_version: "3.10"
-app_file: hf_space_l40s/app.py
+app_file: hf_space_a100/app.py
 pinned: true
 license: mit
 ---
@@ -138,7 +138,7 @@ Partial credit via F1 scoring gives GRPO a dense, informative gradient signal at
 |---|---|---|---|
 | Baseline (no training) | — | 0.14 | Qwen2.5-3B zero-shot |
 | Run 1 warmup | 0–120 | 0.14 → 0.22 | Colab T4, 600 scenarios, 4-bit |
-| Run 2 (production) | 0–500 | 0.37 → 0.48 | L40S 48GB, 400 scenarios, 2 epochs |
+| Run 2 (production) | 0–500 | 0.37 → 0.48 | A100 48GB, 400 scenarios, 2 epochs |
 | Run 2 peak | ~step 250 | 0.50 | Best checkpoint |
 ---
@@ -191,7 +191,7 @@ Partial credit via F1 scoring gives GRPO a dense, informative gradient signal at
 │  Max tokens:   768 completion / 3200 prompt                 │
 │  Epochs:       2–3                                          │
 │  LR:           3 × 10⁻⁶  |  Warmup: 5%  |  β (KL): 0.04   │
-│  Hardware:     L40S 48GB (~8h) / L4 24GB (~14h)            │
+│  Hardware:     A100 48GB (~8h) / L4 24GB (~14h)            │
 └─────────────────────────────────────────────────────────────┘
 ```
@@ -199,7 +199,7 @@ Partial credit via F1 scoring gives GRPO a dense, informative gradient signal at
 ## Results
-> All results from Run 2 (L40S 48GB, 400 scenarios, 2 epochs, 4-bit quantised Qwen2.5-3B).
+> All results from Run 2 (A100 48GB, 400 scenarios, 2 epochs, 4-bit quantised Qwen2.5-3B).
 ### Training Curves (Run 2)
@@ -233,9 +233,9 @@ This repository contains two training scripts serving different purposes:
 | Script | Location | Purpose |
 |---|---|---|
 | `train_grpo.py` | Root directory | Local training, full configurability, research use |
-| `train.py` | `hf_space_l40s/` | HF Spaces + Colab training, Gradio UI integration, production use |
+| `train.py` | `hf_space_a100/` | HF Spaces + Colab training, Gradio UI integration, production use |
-The HF Spaces script (`hf_space_l40s/train.py`) is pre-configured for the L40S GPU and includes the Gradio dashboard for live monitoring. The root script (`train_grpo.py`) exposes all hyperparameters directly and is intended for local or Kaggle training with more manual control.
+The HF Spaces script (`hf_space_a100/train.py`) is pre-configured for the A100 GPU and includes the Gradio dashboard for live monitoring. The root script (`train_grpo.py`) exposes all hyperparameters directly and is intended for local or Kaggle training with more manual control.
 ---
@@ -354,7 +354,7 @@ print("Patched for quick eval run (60 scenarios, 1 epoch, 4 generations)")
 print("Expected runtime: ~30 minutes on T4, ~15 minutes on L4/A10G")
 print()
 print("NOTE: Reward values from this run will be lower than reported results.")
-print("Full results used 400 scenarios × 2 epochs on L40S 48GB.")
+print("Full results used 400 scenarios × 2 epochs on A100 48GB.")
 ```
 Then run normally:
@@ -378,7 +378,7 @@ Conflict_Bench/
 ├── openenv.yaml               # OpenEnv manifest
 ├── requirements.txt           # Python dependencies
 │
-├── hf_space_l40s/             # HF Spaces deployment package
+├── hf_space_a100/             # HF Spaces deployment package
 │   ├── app.py                 # Gradio training dashboard (live log streaming)
 │   ├── train.py               # Training script optimised for HF Spaces / Colab
 │   └── README.md              # HF Spaces card

blog.md CHANGED

@@ -46,7 +46,7 @@ GRPO works by generating multiple completions for the same prompt, scoring each,
 This is exactly right for authority resolution. The model does not need to know what the correct answer looks like. It needs to learn to generate better answers than its own average. The reward function provides the signal that defines "better," and because the reward is fully deterministic and rule-based, the gradient is clean — there is no ambiguity about whether a completion deserved its score.
-We used Unsloth's optimised GRPO implementation with TRL's GRPOTrainer, applied to Qwen2.5-3B-Instruct with 4-bit quantisation and LoRA (rank 32, targeting all attention and MLP projections). The full training run on an L40S 48GB GPU takes approximately 8 hours for 400 scenarios across 2 epochs.
+We used Unsloth's optimised GRPO implementation with TRL's GRPOTrainer, applied to Qwen2.5-3B-Instruct with 4-bit quantisation and LoRA (rank 32, targeting all attention and MLP projections). The full training run on an A100 48GB GPU takes approximately 8 hours for 400 scenarios across 2 epochs.
 ---
@@ -70,7 +70,7 @@ Three common reward-gaming strategies all fail against this rubric set. Always f
 ## What the Training Curves Show
-The results from Run 2 — 400 training scenarios, 2 epochs, L40S 48GB — are instructive.
+The results from Run 2 — 400 training scenarios, 2 epochs, A100 48GB — are instructive.
 The baseline Qwen2.5-3B-Instruct achieves a composite reward of 0.14 on this task with no training. This is low, but not zero: the model can produce valid JSON and occasionally resolves conflicts correctly by chance.

docs/TRAINING_GUIDE.md CHANGED

@@ -4,7 +4,7 @@
 | Platform | GPU | VRAM | Est. Time | Cost | Recommended For |
 |---|---|---|---|---|---|
-| HF Spaces (L40S) | L40S | 48GB | ~8h | ~$14 | Production runs |
+| HF Spaces (A100) | A100 | 48GB | ~8h | ~$14 | Production runs |
 | HF Spaces (L4) | L4 | 24GB | ~14h | ~$11 | Production runs |
 | Google Colab Pro | L4 | 24GB | ~14h | ~$10 | Experimentation |
 | Google Colab Pro+ | A100 | 40GB | ~6h | ~$20 | Fast iteration |
@@ -17,7 +17,7 @@
 This repository has two training entry points:
-**`hf_space_l40s/train.py`** — for HF Spaces and Colab. Integrates with the Gradio dashboard, streams live logs, auto-uploads to HF Hub. Pre-configured for L40S but auto-detects GPU.
+**`hf_space_a100/train.py`** — for HF Spaces and Colab. Integrates with the Gradio dashboard, streams live logs, auto-uploads to HF Hub. Pre-configured for A100 but auto-detects GPU.
 **`train_grpo.py`** — for local machines, Kaggle, and research use. Exposes all hyperparameters directly. More verbose logging. Supports checkpoint resume.
@@ -33,7 +33,7 @@ EVAL_SCENARIOS  = 60        # held-out evaluation scenarios
 NUM_EPOCHS      = 2         # full passes over the training set
 LEARNING_RATE   = 3e-6      # conservative; prevents catastrophic forgetting
 BETA            = 0.04      # KL penalty; 0.02 causes excessive drift (use 0.04+)
-num_generations = 4         # GRPO group size; 6 recommended on L40S
+num_generations = 4         # GRPO group size; 6 recommended on A100
 MAX_NEW_TOKENS  = 768       # completion budget; average ~300 tokens in practice
 MAX_PROMPT_LENGTH = 3200    # prompt budget; generator enforces 4000 char limit
 SAVE_STEPS      = 50        # checkpoint frequency
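The guide states that the Spaces script is pre-configured for one GPU but auto-detects others, and that a group size of 6 is recommended on the larger card with 4 as the default. Here is a minimal sketch of how that choice could be derived from available memory. The names `pick_num_generations` and `detect_vram_gb` and the 32 GB threshold are assumptions for illustration, not the repository's actual logic.

```python
def pick_num_generations(vram_gb, default=4, large=6, threshold_gb=32.0):
    """Return a GRPO group size for a GPU with vram_gb of memory.

    The 32 GB threshold is an assumption made for this sketch: it separates
    24 GB-class cards (group size 4) from 40 GB and larger cards (group size 6).
    """
    if vram_gb < 0:
        raise ValueError("vram_gb must be non-negative")
    return large if vram_gb >= threshold_gb else default


def detect_vram_gb():
    """Return total memory of CUDA device 0 in GiB, or 0.0 if unavailable."""
    try:
        import torch
    except ImportError:
        return 0.0
    if not torch.cuda.is_available():
        return 0.0
    return torch.cuda.get_device_properties(0).total_memory / 1024**3
```

A script could then set `num_generations = pick_num_generations(detect_vram_gb())`, falling back to 4 when no GPU is visible.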

docs/architecture.md CHANGED

@@ -192,7 +192,7 @@ Next Episode
 | `generator.py` | ~500 | Scenario generation with all templates |
 | `verifier.py` | ~300 | Five rubric functions + composite scorer |
 | `train_grpo.py` | ~350 | Root training script (local/Kaggle) |
-| `hf_space_l40s/train.py` | ~400 | HF Spaces/Colab training with Gradio integration |
-| `hf_space_l40s/app.py` | ~130 | Gradio dashboard |
+| `hf_space_a100/train.py` | ~400 | HF Spaces/Colab training with Gradio integration |
+| `hf_space_a100/app.py` | ~130 | Gradio dashboard |
 | `diagnose_tokens.py` | ~80 | Token budget analysis utility |
 | `app.py` | ~100 | Local demo (base vs fine-tuned comparison) |

{hf_space_l40s → hf_space_a100}/.python-version RENAMED

File without changes

{hf_space_l40s → hf_space_a100}/Dockerfile RENAMED

File without changes

{hf_space_l40s → hf_space_a100}/README.md RENAMED

@@ -1,5 +1,5 @@
 ---
-title: ConflictBench GRPO (L40S)
+title: ConflictBench GRPO (A100)
 emoji: ⚔️
 colorFrom: indigo
 colorTo: purple
@@ -9,5 +9,5 @@ pinned: false
 # ConflictBench GRPO Training Pipeline
-One-click automated GRPO training pipeline tailored for an **NVIDIA L40S** GPU.
+One-click automated GRPO training pipeline tailored for an **NVIDIA A100** GPU.
 This pipeline automatically clones the ConflictBench environment, generates scenarios based on a defined curriculum, runs GRPO fine-tuning using Unsloth, generates loss/reward plots, and pushes the final LoRA weights to the Hugging Face Hub.

{hf_space_l40s → hf_space_a100}/app.py RENAMED

@@ -70,8 +70,8 @@ CSS = """
                      color: #c9d1d9 !important; }
 """
-with gr.Blocks(title="ConflictBench GRPO Trainer (L40S)") as demo:
-    gr.Markdown("# ⚔️ ConflictBench — GRPO Training Dashboard (L40S Target)", elem_classes="main-title")
+with gr.Blocks(title="ConflictBench GRPO Trainer (A100)") as demo:
+    gr.Markdown("# ⚔️ ConflictBench — GRPO Training Dashboard (A100 Target)", elem_classes="main-title")
     gr.Markdown("**One-click** production GRPO training script mapped to Run 2 parameters. "
                 "Automatically streams logs and generates plots.")

{hf_space_l40s → hf_space_a100}/requirements.txt RENAMED

File without changes

{hf_space_l40s → hf_space_a100}/train_script.py RENAMED

@@ -1,6 +1,6 @@
 """
-ConflictBench — GRPO Training Script for HF Space (L40S Target)
-Tailored for NVIDIA L40S (48GB VRAM)
+ConflictBench — GRPO Training Script for HF Space (A100 Target)
+Tailored for NVIDIA A100 (48GB VRAM)
 """
 import os

openenv.yaml CHANGED

@@ -11,10 +11,10 @@ description: >
 runtime: python-3.10
 # HF Spaces entry point (Gradio dashboard + training UI)
-entry_point: hf_space_l40s/app.py
+entry_point: hf_space_a100/app.py
 # HF Spaces / Colab training script
-training_entry: hf_space_l40s/train.py
+training_entry: hf_space_a100/train.py
 # Root training script (local / Kaggle / research — full configurability)
 local_training_entry: train_grpo.py
@@ -66,7 +66,7 @@ training:
   max_completion_tokens: 768
   max_prompt_tokens: 3200
   hardware:
-    - L40S-48GB  # ~8h
+    - A100-48GB  # ~8h
     - L4-24GB    # ~14h
     - A10G-24GB  # ~12h
     - T4-16GB    # ~28h (2 Kaggle sessions)