Space: agentDebugger/AgentDebugger-training-v3

shank committed on May 18

Commit e160aa1 (1 parent: 3eb8edc)

Fix checkpoint persistence, add leaderboard and update HF links


Files changed (4)

README.md +3 -2
app.py +1 -0
leaderboard/index.html +275 -0
training/train_grpo.py +47 -1

README.md CHANGED

@@ -11,9 +11,10 @@ pinned: false
 # AgentDebuggerEnv
 **Hackathon Links:**
-- 🌌 **[Live Hugging Face Space](https://huggingface.co/spaces/shashaank0707/AgentDebugger-training-v3)**
+- 🌌 **[Live Hugging Face Space](https://huggingface.co/spaces/agentDebugger/AgentDebugger-training-v3)**
+- 📊 **[Model Leaderboard Space](https://huggingface.co/spaces/agentDebugger/AgentDebugger-leaderboard)** *(coming soon)*
 - 📹 **[Watch the 2-Minute Demo](#)** *(Replace with YouTube Link)*
-- 📝 **[Read the Technical Writeup](#)** *(Replace with HF Blog Link)*
+- 📝 **[Read the Technical Writeup](./Blog.md)**
 ### 🚀 One-Line Pitch
 An OpenEnv-backed reinforcement learning environment that trains LLMs to debug code systematically via Group Relative Policy Optimization (GRPO) and secure sandbox execution.

app.py CHANGED

@@ -89,6 +89,7 @@ Training **Qwen2.5-Coder-7B-Instruct** on structured hypothesis-driven debugging
 - Algorithm: GRPO (same as DeepSeek-R1)
 - Dataset: 90 hand-validated bugs across 3 difficulty tiers
 - Curriculum: Tier 1 (steps 0–150) → Tier 1+2 (150–350) → All tiers (350–500)
+- 📊 **[View Model Leaderboard](https://huggingface.co/spaces/agentDebugger/AgentDebugger-leaderboard)**
         """
     )
     status_box = gr.Textbox(
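
The curriculum bullet in this hunk maps training steps to difficulty tiers. A minimal sketch of that mapping follows. It assumes half-open step ranges, since the text does not say which tier set applies at exactly step 150 or 350. The helper name `tiers_for_step` is hypothetical and is not the repository's `CurriculumCallback`.

```python
def tiers_for_step(step: int) -> list[int]:
    """Return the difficulty tiers sampled at a given training step.

    Assumption: boundaries are half-open, so step 150 already includes
    Tier 2 and step 350 already includes Tier 3.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if step < 150:
        return [1]          # Tier 1 only
    if step < 350:
        return [1, 2]       # Tier 1 + Tier 2
    return [1, 2, 3]        # all tiers


if __name__ == "__main__":
    for s in (0, 149, 150, 349, 350, 500):
        print(s, tiers_for_step(s))
```

If the real callback treats the boundaries as inclusive on the lower tier, only the two comparison operators would change.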

leaderboard/index.html ADDED

@@ -0,0 +1,275 @@

+<!DOCTYPE html>
+<html lang="en">
+<head>
+    <meta charset="UTF-8">
+    <meta name="viewport" content="width=device-width, initial-scale=1.0">
+    <title>AgentDebuggerEnv Benchmark Leaderboard</title>
+    <style>
+        :root {
+            --bg-color: #0f172a;
+            --glass-bg: rgba(30, 41, 59, 0.7);
+            --glass-border: rgba(255, 255, 255, 0.1);
+            --text-primary: #f8fafc;
+            --text-secondary: #94a3b8;
+            --accent-primary: #8b5cf6;
+            --accent-secondary: #6366f1;
+            --success: #10b981;
+            --warning: #f59e0b;
+            --danger: #ef4444;
+        }
+        body {
+            margin: 0;
+            padding: 0;
+            font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen, Ubuntu, Cantarell, 'Open Sans', 'Helvetica Neue', sans-serif;
+            background-color: var(--bg-color);
+            background-image:
+                radial-gradient(circle at 15% 50%, rgba(99, 102, 241, 0.15) 0%, transparent 50%),
+                radial-gradient(circle at 85% 30%, rgba(139, 92, 246, 0.15) 0%, transparent 50%);
+            color: var(--text-primary);
+            min-height: 100vh;
+        }
+        .container {
+            max-width: 1200px;
+            margin: 0 auto;
+            padding: 2rem;
+        }
+        header {
+            text-align: center;
+            margin-bottom: 3rem;
+        }
+        h1 {
+            font-size: 3rem;
+            margin-bottom: 0.5rem;
+            background: linear-gradient(to right, #a78bfa, #818cf8);
+            -webkit-background-clip: text;
+            -webkit-text-fill-color: transparent;
+        }
+        .subtitle {
+            color: var(--text-secondary);
+            font-size: 1.2rem;
+        }
+        .glass-panel {
+            background: var(--glass-bg);
+            backdrop-filter: blur(12px);
+            -webkit-backdrop-filter: blur(12px);
+            border: 1px solid var(--glass-border);
+            border-radius: 16px;
+            padding: 2rem;
+            box-shadow: 0 4px 30px rgba(0, 0, 0, 0.1);
+            margin-bottom: 2rem;
+        }
+        table {
+            width: 100%;
+            border-collapse: collapse;
+            margin-top: 1rem;
+        }
+        th, td {
+            padding: 1rem;
+            text-align: left;
+            border-bottom: 1px solid var(--glass-border);
+        }
+        th {
+            color: var(--text-secondary);
+            font-weight: 600;
+            text-transform: uppercase;
+            font-size: 0.875rem;
+            letter-spacing: 0.05em;
+        }
+        tr:last-child td {
+            border-bottom: none;
+        }
+        tr:hover td {
+            background: rgba(255, 255, 255, 0.03);
+        }
+        .model-name {
+            font-weight: 600;
+            display: flex;
+            align-items: center;
+            gap: 0.5rem;
+        }
+        .badge {
+            background: linear-gradient(135deg, var(--accent-primary), var(--accent-secondary));
+            padding: 0.25rem 0.5rem;
+            border-radius: 4px;
+            font-size: 0.75rem;
+            font-weight: 600;
+        }
+        .score-bar-container {
+            width: 100%;
+            background: rgba(255, 255, 255, 0.1);
+            border-radius: 8px;
+            height: 8px;
+            overflow: hidden;
+            margin-top: 0.5rem;
+        }
+        .score-bar {
+            height: 100%;
+            background: linear-gradient(90deg, var(--accent-secondary), var(--accent-primary));
+            border-radius: 8px;
+        }
+        .score-value {
+            font-weight: 700;
+            font-size: 1.1rem;
+        }
+        .tier-score {
+            font-variant-numeric: tabular-nums;
+        }
+        .cta-container {
+            text-align: center;
+            margin-top: 3rem;
+        }
+        .btn {
+            display: inline-block;
+            background: linear-gradient(135deg, var(--accent-secondary), var(--accent-primary));
+            color: white;
+            text-decoration: none;
+            padding: 0.75rem 1.5rem;
+            border-radius: 8px;
+            font-weight: 600;
+            transition: transform 0.2s, box-shadow 0.2s;
+        }
+        .btn:hover {
+            transform: translateY(-2px);
+            box-shadow: 0 4px 15px rgba(139, 92, 246, 0.4);
+        }
+        .info-grid {
+            display: grid;
+            grid-template-columns: repeat(auto-fit, minmax(300px, 1fr));
+            gap: 1.5rem;
+            margin-top: 2rem;
+        }
+        .info-card {
+            background: rgba(255, 255, 255, 0.03);
+            border: 1px solid var(--glass-border);
+            border-radius: 12px;
+            padding: 1.5rem;
+        }
+        .info-card h3 {
+            margin-top: 0;
+            color: #a78bfa;
+        }
+        .info-card p {
+            color: var(--text-secondary);
+            line-height: 1.6;
+            margin-bottom: 0;
+        }
+    </style>
+</head>
+<body>
+    <div class="container">
+        <header>
+            <h1>AgentDebuggerEnv</h1>
+            <p class="subtitle">Ranking LLMs on Hypothesis-Driven Debugging</p>
+        </header>
+        <div class="glass-panel">
+            <table>
+                <thead>
+                    <tr>
+                        <th>Rank</th>
+                        <th>Model</th>
+                        <th>Tier 1 (Easy)</th>
+                        <th>Tier 2 (Med)</th>
+                        <th>Tier 3 (Hard)</th>
+                        <th>Mean Score</th>
+                    </tr>
+                </thead>
+                <tbody>
+                    <tr>
+                        <td>🥇 1</td>
+                        <td>
+                            <div class="model-name">
+                                GPT-4o
+                            </div>
+                        </td>
+                        <td class="tier-score">89.0%</td>
+                        <td class="tier-score">71.0%</td>
+                        <td class="tier-score">38.0%</td>
+                        <td>
+                            <div class="score-value">0.742</div>
+                            <div class="score-bar-container">
+                                <div class="score-bar" style="width: 74.2%"></div>
+                            </div>
+                        </td>
+                    </tr>
+                    <tr>
+                        <td>🥈 2</td>
+                        <td>
+                            <div class="model-name">
+                                Llama-3.1-70B-Instruct
+                                <span class="badge">Baseline</span>
+                            </div>
+                        </td>
+                        <td class="tier-score">21.0%</td>
+                        <td class="tier-score">21.5%</td>
+                        <td class="tier-score">21.5%</td>
+                        <td>
+                            <div class="score-value">0.210</div>
+                            <div class="score-bar-container">
+                                <div class="score-bar" style="width: 21.0%"></div>
+                            </div>
+                        </td>
+                    </tr>
+                    <tr>
+                        <td>⏳ -</td>
+                        <td>
+                            <div class="model-name">
+                                AgentDebugger-Qwen2.5-7B
+                                <span class="badge" style="background: var(--warning)">Training</span>
+                            </div>
+                        </td>
+                        <td class="tier-score">-</td>
+                        <td class="tier-score">-</td>
+                        <td class="tier-score">-</td>
+                        <td>
+                            <div class="score-value" style="color: var(--text-secondary)">TBD</div>
+                            <div class="score-bar-container">
+                                <div class="score-bar" style="width: 0%; background: var(--text-secondary)"></div>
+                            </div>
+                        </td>
+                    </tr>
+                </tbody>
+            </table>
+        </div>
+        <div class="info-grid">
+            <div class="info-card">
+                <h3>🧪 The Benchmark</h3>
+                <p>Models are evaluated on 90 hand-validated Python bugs across 3 difficulty tiers. They must formulate a specific hypothesis before proposing a fix. Blind guessing is heavily penalized by the grading environment.</p>
+            </div>
+            <div class="info-card">
+                <h3>⚖️ The Grading</h3>
+                <p>A hybrid deterministic/semantic grader evaluates the quality of the hypothesis (via Llama-3.1-70B), format compliance, bug localization, and execution correctness inside a secure sandbox.</p>
+            </div>
+        </div>
+        <div class="cta-container">
+            <a href="https://github.com/shasshaank/meta_hackthon" class="btn" target="_blank">View GitHub Repository</a>
+        </div>
+    </div>
+</body>
+</html>

training/train_grpo.py CHANGED

@@ -489,7 +489,8 @@ config = GRPOConfig(
     max_completion_length=_max_comp,
     temperature=0.9,
     logging_steps=5,
-    save_steps=50,
+    save_steps=25,               # Save to local disk every 25 steps
+    save_strategy="steps",
     report_to="wandb" if WANDB_API_KEY else "none",
 )
@@ -513,6 +514,51 @@ class CurriculumCallback(TrainerCallback):
 trainer.add_callback(CurriculumCallback())
+# ── HF Hub checkpoint push callback (CRITICAL: survives container restarts) ────
+# Pushes LoRA adapter weights to HF Hub every HUB_PUSH_EVERY steps.
+# This is the fix for the original problem: ephemeral Space storage meant that
+# checkpoints saved to ./checkpoints/ were lost when the Space stopped.
+# Now even if training is interrupted, the latest adapter weights are on HF Hub.
+HUB_PUSH_EVERY = 50  # push every 50 steps — ~15min on T4, ~5min on A100
+class CheckpointPushCallback(TrainerCallback):
+    """Push LoRA adapter to HF Hub every HUB_PUSH_EVERY steps."""
+    def on_step_end(self, args, state, control, **kwargs):
+        step = state.global_step
+        if not HF_TOKEN or step == 0 or step % HUB_PUSH_EVERY != 0:
+            return
+        try:
+            push_repo = HF_REPO + "-checkpoints"
+            print(f"\n[HubPush] Pushing checkpoint at step {step} to {push_repo}...", flush=True)
+            model.push_to_hub(
+                push_repo,
+                token=HF_TOKEN,
+                private=True,
+                commit_message=f"checkpoint-step-{step}",
+            )
+            tokenizer.push_to_hub(
+                push_repo,
+                token=HF_TOKEN,
+                private=True,
+                commit_message=f"tokenizer checkpoint-step-{step}",
+            )
+            # Write a step marker file so we know the latest pushed step
+            with open("./last_hub_push.txt", "w") as _f:
+                _f.write(str(step))
+            print(f"[HubPush] ✓ Step {step} pushed to HF Hub.", flush=True)
+            if WANDB_API_KEY:
+                wandb.log({"hub/last_pushed_step": step})
+        except Exception as e:
+            # Never crash training because of a push failure
+            print(f"[HubPush] WARNING: push failed at step {step}: {e}", flush=True)
+if not args.test:  # Don't push during 10-step test runs
+    trainer.add_callback(CheckpointPushCallback())
+    print(f"HF Hub checkpoint push enabled every {HUB_PUSH_EVERY} steps → {HF_REPO}-checkpoints")
+else:
+    print("[TEST MODE] Hub checkpoint push disabled.")
 # ── Train ─────────────────────────────────────────────────────────────────────
 print(f"\nStarting GRPO training. Max steps: {MAX_STEPS}")
 print(f"Baseline solve rate: {baseline['solve_rate']:.1%} — target: >60% after training")
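
The guard at the top of `on_step_end` decides when a push happens. The sketch below isolates that condition as a pure function so the resulting push schedule can be inspected without a trainer. `should_push` and `push_steps` are hypothetical helper names, and the 500-step run length is an assumption taken from the curriculum text in `app.py`, not from `MAX_STEPS` itself.

```python
HUB_PUSH_EVERY = 50  # same value as in the diff


def should_push(step: int, has_token: bool, every: int = HUB_PUSH_EVERY) -> bool:
    """Mirror the early-return guard in CheckpointPushCallback.on_step_end."""
    return has_token and step != 0 and step % every == 0


def push_steps(max_steps: int, has_token: bool = True) -> list[int]:
    """List every global step at which a Hub push would be attempted."""
    return [s for s in range(max_steps + 1) if should_push(s, has_token)]


if __name__ == "__main__":
    # Assumed 500-step run: pushes at 50, 100, ..., 500.
    print(push_steps(500))
    # Without a token the callback never pushes.
    print(push_steps(500, has_token=False))
```

Under this schedule the local saves every 25 steps sit between pushes, so a container restart can lose up to `HUB_PUSH_EVERY - 1` steps of progress if local storage is wiped, as the comment in the diff describes.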