openhands committed
Commit · d8b77db
Parent(s): 3b71657
Add methodology section and benchmark instance counts to About page
- Add instance/task counts for all five benchmarks
- Add new Methodology section explaining:
  - Per-benchmark scores are percentages (comparable across datasets)
  - Average score is macro-average with equal weighting
  - Cost and runtime are per-instance averages
  - All evals use identical OpenHands Agent SDK configuration
Co-authored-by: openhands <openhands@all-hands.dev>
about.py CHANGED
@@ -20,15 +20,24 @@ def build_page():
             <h2>Benchmark Details</h2>
             <p>We evaluate agents across five categories:</p>
             <ul class="info-list">
-                <li><strong>Issue Resolution:</strong> <a href="https://www.swebench.com/" target="_blank">SWE-bench</a></li>
-                <li><strong>Frontend:</strong> <a href="https://github.com/OpenHands/SWE-bench-multimodal" target="_blank">SWE-bench Multimodal</a></li>
-                <li><strong>Greenfield:</strong> <a href="https://github.com/commit-0/commit0" target="_blank">Commit0</a></li>
-                <li><strong>Testing:</strong> <a href="https://github.com/logic-star-ai/swt-bench" target="_blank">SWT-bench</a></li>
-                <li><strong>Information Gathering:</strong> <a href="https://huggingface.co/gaia-benchmark" target="_blank">GAIA</a></li>
+                <li><strong>Issue Resolution:</strong> <a href="https://www.swebench.com/" target="_blank">SWE-bench Verified</a> — 500 instances</li>
+                <li><strong>Frontend:</strong> <a href="https://github.com/OpenHands/SWE-bench-multimodal" target="_blank">SWE-bench Multimodal</a> — 617 instances</li>
+                <li><strong>Greenfield:</strong> <a href="https://github.com/commit-0/commit0" target="_blank">Commit0</a> — 16 libraries (lite split)</li>
+                <li><strong>Testing:</strong> <a href="https://github.com/logic-star-ai/swt-bench" target="_blank">SWT-bench Verified</a> — 433 instances</li>
+                <li><strong>Information Gathering:</strong> <a href="https://huggingface.co/gaia-benchmark" target="_blank">GAIA</a> — 165 questions (validation split)</li>
             </ul>
-
-
-
+            """
+        )
+        gr.Markdown("---", elem_classes="divider-line")
+
+        # --- Section 3: Methodology ---
+        gr.HTML(
+            """
+            <h2>Methodology</h2>
+            <p><strong>Per-benchmark scores:</strong> Each benchmark reports a percentage metric (resolve rate, accuracy, or test pass rate), making scores comparable regardless of dataset size.</p>
+            <p><strong>Average score:</strong> Macro-average across all five categories with equal weighting.</p>
+            <p><strong>Cost & Runtime:</strong> Average USD and seconds per task instance. Agents without cost/runtime data are shown separately in Pareto plots.</p>
+            <p>All evaluations use the <a href="https://github.com/OpenHands/software-agent-sdk" target="_blank">OpenHands Agent SDK</a> with identical configurations per model.</p>
             """
         )
        gr.Markdown("---", elem_classes="divider-line")