shank committed on
Commit
e160aa1
·
1 Parent(s): 3eb8edc

Fix checkpoint persistence, add leaderboard and update HF links

Files changed (4)
  1. README.md +3 -2
  2. app.py +1 -0
  3. leaderboard/index.html +275 -0
  4. training/train_grpo.py +47 -1
README.md CHANGED
@@ -11,9 +11,10 @@ pinned: false
  # AgentDebuggerEnv

  **Hackathon Links:**
- - 🌌 **[Live Hugging Face Space](https://huggingface.co/spaces/shashaank0707/AgentDebugger-training-v3)**
+ - 🌌 **[Live Hugging Face Space](https://huggingface.co/spaces/agentDebugger/AgentDebugger-training-v3)**
+ - 📊 **[Model Leaderboard Space](https://huggingface.co/spaces/agentDebugger/AgentDebugger-leaderboard)** *(coming soon)*
  - 📹 **[Watch the 2-Minute Demo](#)** *(Replace with YouTube Link)*
- - 📝 **[Read the Technical Writeup](#)** *(Replace with HF Blog Link)*
+ - 📝 **[Read the Technical Writeup](./Blog.md)**

  ### 🚀 One-Line Pitch
  An OpenEnv-backed reinforcement learning environment that trains LLMs to debug code systematically via Group Relative Policy Optimization (GRPO) and secure sandbox execution.
app.py CHANGED
@@ -89,6 +89,7 @@ Training **Qwen2.5-Coder-7B-Instruct** on structured hypothesis-driven debugging
  - Algorithm: GRPO (same as DeepSeek-R1)
  - Dataset: 90 hand-validated bugs across 3 difficulty tiers
  - Curriculum: Tier 1 (steps 0–150) → Tier 1+2 (150–350) → All tiers (350–500)
+ - 📊 **[View Model Leaderboard](https://huggingface.co/spaces/agentDebugger/AgentDebugger-leaderboard)**
  """
  )
  status_box = gr.Textbox(
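
The curriculum line in the `app.py` description maps training steps to difficulty tiers. A minimal sketch of that schedule follows. It assumes the boundaries are half-open (step 150 already belongs to the Tier 1+2 phase), which the diff does not confirm, and `tiers_for_step` is a hypothetical helper name that does not appear in the commit.

```python
from typing import List


def tiers_for_step(step: int) -> List[int]:
    """Return the difficulty tiers sampled at a given training step.

    The step ranges come from the description in app.py. Treating them as
    half-open intervals is an assumption made for this sketch.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if step < 150:
        return [1]          # Tier 1 only
    if step < 350:
        return [1, 2]       # Tier 1 + Tier 2
    return [1, 2, 3]        # all tiers


if __name__ == "__main__":
    for s in (0, 149, 150, 349, 350, 500):
        print(s, tiers_for_step(s))
```

The actual `CurriculumCallback` in `training/train_grpo.py` is not shown in this diff, so it may handle the boundary steps differently.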
leaderboard/index.html ADDED
@@ -0,0 +1,275 @@
+ <!DOCTYPE html>
+ <html lang="en">
+ <head>
+     <meta charset="UTF-8">
+     <meta name="viewport" content="width=device-width, initial-scale=1.0">
+     <title>AgentDebuggerEnv Benchmark Leaderboard</title>
+     <style>
+         :root {
+             --bg-color: #0f172a;
+             --glass-bg: rgba(30, 41, 59, 0.7);
+             --glass-border: rgba(255, 255, 255, 0.1);
+             --text-primary: #f8fafc;
+             --text-secondary: #94a3b8;
+             --accent-primary: #8b5cf6;
+             --accent-secondary: #6366f1;
+             --success: #10b981;
+             --warning: #f59e0b;
+             --danger: #ef4444;
+         }
+
+         body {
+             margin: 0;
+             padding: 0;
+             font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen, Ubuntu, Cantarell, 'Open Sans', 'Helvetica Neue', sans-serif;
+             background-color: var(--bg-color);
+             background-image:
+                 radial-gradient(circle at 15% 50%, rgba(99, 102, 241, 0.15) 0%, transparent 50%),
+                 radial-gradient(circle at 85% 30%, rgba(139, 92, 246, 0.15) 0%, transparent 50%);
+             color: var(--text-primary);
+             min-height: 100vh;
+         }
+
+         .container {
+             max-width: 1200px;
+             margin: 0 auto;
+             padding: 2rem;
+         }
+
+         header {
+             text-align: center;
+             margin-bottom: 3rem;
+         }
+
+         h1 {
+             font-size: 3rem;
+             margin-bottom: 0.5rem;
+             background: linear-gradient(to right, #a78bfa, #818cf8);
+             -webkit-background-clip: text;
+             -webkit-text-fill-color: transparent;
+         }
+
+         .subtitle {
+             color: var(--text-secondary);
+             font-size: 1.2rem;
+         }
+
+         .glass-panel {
+             background: var(--glass-bg);
+             backdrop-filter: blur(12px);
+             -webkit-backdrop-filter: blur(12px);
+             border: 1px solid var(--glass-border);
+             border-radius: 16px;
+             padding: 2rem;
+             box-shadow: 0 4px 30px rgba(0, 0, 0, 0.1);
+             margin-bottom: 2rem;
+         }
+
+         table {
+             width: 100%;
+             border-collapse: collapse;
+             margin-top: 1rem;
+         }
+
+         th, td {
+             padding: 1rem;
+             text-align: left;
+             border-bottom: 1px solid var(--glass-border);
+         }
+
+         th {
+             color: var(--text-secondary);
+             font-weight: 600;
+             text-transform: uppercase;
+             font-size: 0.875rem;
+             letter-spacing: 0.05em;
+         }
+
+         tr:last-child td {
+             border-bottom: none;
+         }
+
+         tr:hover td {
+             background: rgba(255, 255, 255, 0.03);
+         }
+
+         .model-name {
+             font-weight: 600;
+             display: flex;
+             align-items: center;
+             gap: 0.5rem;
+         }
+
+         .badge {
+             background: linear-gradient(135deg, var(--accent-primary), var(--accent-secondary));
+             padding: 0.25rem 0.5rem;
+             border-radius: 4px;
+             font-size: 0.75rem;
+             font-weight: 600;
+         }
+
+         .score-bar-container {
+             width: 100%;
+             background: rgba(255, 255, 255, 0.1);
+             border-radius: 8px;
+             height: 8px;
+             overflow: hidden;
+             margin-top: 0.5rem;
+         }
+
+         .score-bar {
+             height: 100%;
+             background: linear-gradient(90deg, var(--accent-secondary), var(--accent-primary));
+             border-radius: 8px;
+         }
+
+         .score-value {
+             font-weight: 700;
+             font-size: 1.1rem;
+         }
+
+         .tier-score {
+             font-variant-numeric: tabular-nums;
+         }
+
+         .cta-container {
+             text-align: center;
+             margin-top: 3rem;
+         }
+
+         .btn {
+             display: inline-block;
+             background: linear-gradient(135deg, var(--accent-secondary), var(--accent-primary));
+             color: white;
+             text-decoration: none;
+             padding: 0.75rem 1.5rem;
+             border-radius: 8px;
+             font-weight: 600;
+             transition: transform 0.2s, box-shadow 0.2s;
+         }
+
+         .btn:hover {
+             transform: translateY(-2px);
+             box-shadow: 0 4px 15px rgba(139, 92, 246, 0.4);
+         }
+
+         .info-grid {
+             display: grid;
+             grid-template-columns: repeat(auto-fit, minmax(300px, 1fr));
+             gap: 1.5rem;
+             margin-top: 2rem;
+         }
+
+         .info-card {
+             background: rgba(255, 255, 255, 0.03);
+             border: 1px solid var(--glass-border);
+             border-radius: 12px;
+             padding: 1.5rem;
+         }
+
+         .info-card h3 {
+             margin-top: 0;
+             color: #a78bfa;
+         }
+
+         .info-card p {
+             color: var(--text-secondary);
+             line-height: 1.6;
+             margin-bottom: 0;
+         }
+     </style>
+ </head>
+ <body>
+     <div class="container">
+         <header>
+             <h1>AgentDebuggerEnv</h1>
+             <p class="subtitle">Ranking LLMs on Hypothesis-Driven Debugging</p>
+         </header>
+
+         <div class="glass-panel">
+             <table>
+                 <thead>
+                     <tr>
+                         <th>Rank</th>
+                         <th>Model</th>
+                         <th>Tier 1 (Easy)</th>
+                         <th>Tier 2 (Med)</th>
+                         <th>Tier 3 (Hard)</th>
+                         <th>Mean Score</th>
+                     </tr>
+                 </thead>
+                 <tbody>
+                     <tr>
+                         <td>🥇 1</td>
+                         <td>
+                             <div class="model-name">
+                                 GPT-4o
+                             </div>
+                         </td>
+                         <td class="tier-score">89.0%</td>
+                         <td class="tier-score">71.0%</td>
+                         <td class="tier-score">38.0%</td>
+                         <td>
+                             <div class="score-value">0.742</div>
+                             <div class="score-bar-container">
+                                 <div class="score-bar" style="width: 74.2%"></div>
+                             </div>
+                         </td>
+                     </tr>
+                     <tr>
+                         <td>🥈 2</td>
+                         <td>
+                             <div class="model-name">
+                                 Llama-3.1-70B-Instruct
+                                 <span class="badge">Baseline</span>
+                             </div>
+                         </td>
+                         <td class="tier-score">21.0%</td>
+                         <td class="tier-score">21.5%</td>
+                         <td class="tier-score">21.5%</td>
+                         <td>
+                             <div class="score-value">0.210</div>
+                             <div class="score-bar-container">
+                                 <div class="score-bar" style="width: 21.0%"></div>
+                             </div>
+                         </td>
+                     </tr>
+                     <tr>
+                         <td>⏳ -</td>
+                         <td>
+                             <div class="model-name">
+                                 AgentDebugger-Qwen2.5-7B
+                                 <span class="badge" style="background: var(--warning)">Training</span>
+                             </div>
+                         </td>
+                         <td class="tier-score">-</td>
+                         <td class="tier-score">-</td>
+                         <td class="tier-score">-</td>
+                         <td>
+                             <div class="score-value" style="color: var(--text-secondary)">TBD</div>
+                             <div class="score-bar-container">
+                                 <div class="score-bar" style="width: 0%; background: var(--text-secondary)"></div>
+                             </div>
+                         </td>
+                     </tr>
+                 </tbody>
+             </table>
+         </div>
+
+         <div class="info-grid">
+             <div class="info-card">
+                 <h3>🧪 The Benchmark</h3>
+                 <p>Models are evaluated on 90 hand-validated Python bugs across 3 difficulty tiers. They must formulate a specific hypothesis before proposing a fix. Blind guessing is heavily penalized by the grading environment.</p>
+             </div>
+             <div class="info-card">
+                 <h3>⚖️ The Grading</h3>
+                 <p>A hybrid deterministic/semantic grader evaluates the quality of the hypothesis (via Llama-3.1-70B), format compliance, bug localization, and execution correctness inside a secure sandbox.</p>
+             </div>
+         </div>
+
+         <div class="cta-container">
+             <a href="https://github.com/shasshaank/meta_hackthon" class="btn" target="_blank">View GitHub Repository</a>
+         </div>
+     </div>
+ </body>
+ </html>
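
Each leaderboard row pairs a mean score between 0 and 1 with a bar whose CSS width is that score expressed as a percentage (0.742 is drawn as `width: 74.2%`). The rows above are hand-written. As a rough sketch, the same cell could be generated from data so the label and the bar cannot drift apart. The function name `score_cell` is hypothetical and not part of the commit; the class names are taken from the stylesheet in the diff.

```python
from html import escape
from typing import Optional


def score_cell(mean_score: Optional[float]) -> str:
    """Render the 'Mean Score' table cell for one leaderboard row.

    mean_score is expected in [0, 1]; None stands for a model that has
    not been evaluated yet and is shown as 'TBD' with an empty bar.
    """
    if mean_score is None:
        label, width = "TBD", 0.0
    else:
        if not 0.0 <= mean_score <= 1.0:
            raise ValueError("mean_score must be within [0, 1]")
        label, width = f"{mean_score:.3f}", mean_score * 100
    return (
        "<td>"
        f'<div class="score-value">{escape(label)}</div>'
        '<div class="score-bar-container">'
        f'<div class="score-bar" style="width: {width:.1f}%"></div>'
        "</div>"
        "</td>"
    )


if __name__ == "__main__":
    print(score_cell(0.742))
    print(score_cell(None))
```

This sketch omits the inline colour overrides that the pending row uses for its "TBD" label and bar.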
training/train_grpo.py CHANGED
@@ -489,7 +489,8 @@ config = GRPOConfig(
      max_completion_length=_max_comp,
      temperature=0.9,
      logging_steps=5,
-     save_steps=50,
+     save_steps=25, # Save to local disk every 25 steps
+     save_strategy="steps",
      report_to="wandb" if WANDB_API_KEY else "none",
  )

@@ -513,6 +514,51 @@ class CurriculumCallback(TrainerCallback):

  trainer.add_callback(CurriculumCallback())

+ # ── HF Hub checkpoint push callback (CRITICAL: survives container restarts) ────
+ # Pushes LoRA adapter weights to HF Hub every HUB_PUSH_EVERY steps.
+ # This is the fix for the original problem: ephemeral Space storage meant that
+ # checkpoints saved to ./checkpoints/ were lost when the Space stopped.
+ # Now even if training is interrupted, the latest adapter weights are on HF Hub.
+ HUB_PUSH_EVERY = 50 # push every 50 steps — ~15min on T4, ~5min on A100
+
+ class CheckpointPushCallback(TrainerCallback):
+     """Push LoRA adapter to HF Hub every HUB_PUSH_EVERY steps."""
+
+     def on_step_end(self, args, state, control, **kwargs):
+         step = state.global_step
+         if not HF_TOKEN or step == 0 or step % HUB_PUSH_EVERY != 0:
+             return
+         try:
+             push_repo = HF_REPO + "-checkpoints"
+             print(f"\n[HubPush] Pushing checkpoint at step {step} to {push_repo}...", flush=True)
+             model.push_to_hub(
+                 push_repo,
+                 token=HF_TOKEN,
+                 private=True,
+                 commit_message=f"checkpoint-step-{step}",
+             )
+             tokenizer.push_to_hub(
+                 push_repo,
+                 token=HF_TOKEN,
+                 private=True,
+                 commit_message=f"tokenizer checkpoint-step-{step}",
+             )
+             # Write a step marker file so we know the latest pushed step
+             with open("./last_hub_push.txt", "w") as _f:
+                 _f.write(str(step))
+             print(f"[HubPush] ✓ Step {step} pushed to HF Hub.", flush=True)
+             if WANDB_API_KEY:
+                 wandb.log({"hub/last_pushed_step": step})
+         except Exception as e:
+             # Never crash training because of a push failure
+             print(f"[HubPush] WARNING: push failed at step {step}: {e}", flush=True)
+
+ if not args.test: # Don't push during 10-step test runs
+     trainer.add_callback(CheckpointPushCallback())
+     print(f"HF Hub checkpoint push enabled every {HUB_PUSH_EVERY} steps → {HF_REPO}-checkpoints")
+ else:
+     print("[TEST MODE] Hub checkpoint push disabled.")
+
  # ── Train ─────────────────────────────────────────────────────────────────────
  print(f"\nStarting GRPO training. Max steps: {MAX_STEPS}")
  print(f"Baseline solve rate: {baseline['solve_rate']:.1%} — target: >60% after training")
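
The callback's guard and its step marker can be looked at in isolation. The sketch below mirrors the condition in `on_step_end` and reads back a marker file like `last_hub_push.txt`, using only the standard library. The names `should_push` and `read_last_pushed_step` are hypothetical helpers introduced here for illustration; the two interval values are copied from the diff.

```python
from pathlib import Path

HUB_PUSH_EVERY = 50   # Hub push interval, as set in the diff
SAVE_STEPS = 25       # local save interval from GRPOConfig in the diff


def should_push(step: int, has_token: bool, every: int = HUB_PUSH_EVERY) -> bool:
    """Mirror the guard in CheckpointPushCallback.on_step_end.

    A push happens only when a token is available, the step is non-zero,
    and the step is a multiple of the push interval.
    """
    return has_token and step != 0 and step % every == 0


def read_last_pushed_step(marker: Path) -> int:
    """Return the step recorded in the marker file, or 0 if it is
    missing or does not contain an integer."""
    try:
        return int(marker.read_text().strip())
    except (FileNotFoundError, ValueError):
        return 0


if __name__ == "__main__":
    local_saves = [s for s in range(1, 101) if s % SAVE_STEPS == 0]
    hub_pushes = [s for s in range(0, 101) if should_push(s, has_token=True)]
    print("local saves:", local_saves)
    print("hub pushes:", hub_pushes)
```

With these intervals every second local save coincides with a Hub push. Note that the marker file is written to the working directory, so on a Space with ephemeral storage it would presumably be lost on restart along with `./checkpoints/`; the `checkpoint-step-N` commit messages on the Hub repository look like the more durable record of the last pushed step.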