<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>AgentDebuggerEnv Benchmark Leaderboard</title>
<style>
:root {
--bg-color: #0f172a;
--glass-bg: rgba(30, 41, 59, 0.7);
--glass-border: rgba(255, 255, 255, 0.1);
--text-primary: #f8fafc;
--text-secondary: #94a3b8;
--accent-primary: #8b5cf6;
--accent-secondary: #6366f1;
--success: #10b981;
--warning: #f59e0b;
--danger: #ef4444;
}
body {
margin: 0;
padding: 0;
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen, Ubuntu, Cantarell, 'Open Sans', 'Helvetica Neue', sans-serif;
background-color: var(--bg-color);
background-image:
radial-gradient(circle at 15% 50%, rgba(99, 102, 241, 0.15) 0%, transparent 50%),
radial-gradient(circle at 85% 30%, rgba(139, 92, 246, 0.15) 0%, transparent 50%);
color: var(--text-primary);
min-height: 100vh;
}
.container {
max-width: 1200px;
margin: 0 auto;
padding: 2rem;
}
header {
text-align: center;
margin-bottom: 3rem;
}
h1 {
font-size: 3rem;
margin-bottom: 0.5rem;
background: linear-gradient(to right, #a78bfa, #818cf8);
-webkit-background-clip: text;
background-clip: text;
-webkit-text-fill-color: transparent;
}
.subtitle {
color: var(--text-secondary);
font-size: 1.2rem;
}
.glass-panel {
background: var(--glass-bg);
backdrop-filter: blur(12px);
-webkit-backdrop-filter: blur(12px);
border: 1px solid var(--glass-border);
border-radius: 16px;
padding: 2rem;
box-shadow: 0 4px 30px rgba(0, 0, 0, 0.1);
margin-bottom: 2rem;
}
table {
width: 100%;
border-collapse: collapse;
margin-top: 1rem;
}
th, td {
padding: 1rem;
text-align: left;
border-bottom: 1px solid var(--glass-border);
}
th {
color: var(--text-secondary);
font-weight: 600;
text-transform: uppercase;
font-size: 0.875rem;
letter-spacing: 0.05em;
}
tr:last-child td {
border-bottom: none;
}
tr:hover td {
background: rgba(255, 255, 255, 0.03);
}
.model-name {
font-weight: 600;
display: flex;
align-items: center;
gap: 0.5rem;
}
.badge {
background: linear-gradient(135deg, var(--accent-primary), var(--accent-secondary));
padding: 0.25rem 0.5rem;
border-radius: 4px;
font-size: 0.75rem;
font-weight: 600;
}
.score-bar-container {
width: 100%;
background: rgba(255, 255, 255, 0.1);
border-radius: 8px;
height: 8px;
overflow: hidden;
margin-top: 0.5rem;
}
.score-bar {
height: 100%;
background: linear-gradient(90deg, var(--accent-secondary), var(--accent-primary));
border-radius: 8px;
}
.score-value {
font-weight: 700;
font-size: 1.1rem;
}
.tier-score {
font-variant-numeric: tabular-nums;
}
.cta-container {
text-align: center;
margin-top: 3rem;
}
.btn {
display: inline-block;
background: linear-gradient(135deg, var(--accent-secondary), var(--accent-primary));
color: white;
text-decoration: none;
padding: 0.75rem 1.5rem;
border-radius: 8px;
font-weight: 600;
transition: transform 0.2s, box-shadow 0.2s;
}
.btn:hover {
transform: translateY(-2px);
box-shadow: 0 4px 15px rgba(139, 92, 246, 0.4);
}
.info-grid {
display: grid;
grid-template-columns: repeat(auto-fit, minmax(300px, 1fr));
gap: 1.5rem;
margin-top: 2rem;
}
.info-card {
background: rgba(255, 255, 255, 0.03);
border: 1px solid var(--glass-border);
border-radius: 12px;
padding: 1.5rem;
}
.info-card h3 {
margin-top: 0;
color: #a78bfa;
}
.info-card p {
color: var(--text-secondary);
line-height: 1.6;
margin-bottom: 0;
}
</style>
</head>
<body>
<div class="container">
<header>
<h1>AgentDebuggerEnv</h1>
<p class="subtitle">Ranking LLMs on Hypothesis-Driven Debugging</p>
</header>
<div class="glass-panel">
<table>
<thead>
<tr>
<th>Rank</th>
<th>Model</th>
<th>Tier 1 (Easy)</th>
<th>Tier 2 (Medium)</th>
<th>Tier 3 (Hard)</th>
<th>Mean Score</th>
</tr>
</thead>
<tbody>
<tr>
<td>🥇 1</td>
<td>
<div class="model-name">
GPT-4o
</div>
</td>
<td class="tier-score">89.0%</td>
<td class="tier-score">71.0%</td>
<td class="tier-score">38.0%</td>
<td>
<div class="score-value">0.742</div>
<div class="score-bar-container">
<div class="score-bar" style="width: 74.2%"></div>
</div>
</td>
</tr>
<tr>
<td>🥈 2</td>
<td>
<div class="model-name">
Llama-3.1-70B-Instruct
<span class="badge">Baseline</span>
</div>
</td>
<td class="tier-score">21.0%</td>
<td class="tier-score">21.5%</td>
<td class="tier-score">21.5%</td>
<td>
<div class="score-value">0.210</div>
<div class="score-bar-container">
<div class="score-bar" style="width: 21.0%"></div>
</div>
</td>
</tr>
<tr>
<td>⏳ -</td>
<td>
<div class="model-name">
AgentDebugger-Qwen2.5-7B
<span class="badge" style="background: var(--warning)">Training</span>
</div>
</td>
<td class="tier-score">-</td>
<td class="tier-score">-</td>
<td class="tier-score">-</td>
<td>
<div class="score-value" style="color: var(--text-secondary)">TBD</div>
<div class="score-bar-container">
<div class="score-bar" style="width: 0%; background: var(--text-secondary)"></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
<div class="info-grid">
<div class="info-card">
<h3>🧪 The Benchmark</h3>
<p>Models are evaluated on 90 hand-validated Python bugs across three difficulty tiers. Each model must state a specific hypothesis before proposing a fix. The grading environment heavily penalizes blind guessing.</p>
</div>
<div class="info-card">
<h3>⚖️ The Grading</h3>
<p>A hybrid deterministic/semantic grader scores four things: hypothesis quality (judged by Llama-3.1-70B), format compliance, bug localization, and execution correctness inside a secure sandbox.</p>
</div>
</div>
<div class="cta-container">
<a href="https://github.com/shasshaank/meta_hackthon" class="btn" target="_blank" rel="noopener noreferrer">View GitHub Repository</a>
</div>
</div>
</body>
</html>