openhands committed on
Commit d8b77db · 1 Parent(s): 3b71657

Add methodology section and benchmark instance counts to About page


- Add instance/task counts for all five benchmarks
- Add new Methodology section explaining:
  - Per-benchmark scores are percentages (comparable across datasets)
  - Average score is macro-average with equal weighting
  - Cost and runtime are per-instance averages
  - All evals use identical OpenHands Agent SDK configuration

Co-authored-by: openhands <openhands@all-hands.dev>

Files changed (1)
  1. about.py +17 -8
about.py CHANGED
@@ -20,15 +20,24 @@ def build_page():
  <h2>Benchmark Details</h2>
  <p>We evaluate agents across five categories:</p>
  <ul class="info-list">
- <li><strong>Issue Resolution:</strong> <a href="https://www.swebench.com/" target="_blank">SWE-bench</a></li>
- <li><strong>Frontend:</strong> <a href="https://github.com/OpenHands/SWE-bench-multimodal" target="_blank">SWE-bench Multimodal</a></li>
- <li><strong>Greenfield:</strong> <a href="https://github.com/commit-0/commit0" target="_blank">Commit0</a></li>
- <li><strong>Testing:</strong> <a href="https://github.com/logic-star-ai/swt-bench" target="_blank">SWT-bench</a></li>
- <li><strong>Information Gathering:</strong> <a href="https://huggingface.co/gaia-benchmark" target="_blank">GAIA</a></li>
+ <li><strong>Issue Resolution:</strong> <a href="https://www.swebench.com/" target="_blank">SWE-bench Verified</a> — 500 instances</li>
+ <li><strong>Frontend:</strong> <a href="https://github.com/OpenHands/SWE-bench-multimodal" target="_blank">SWE-bench Multimodal</a> — 617 instances</li>
+ <li><strong>Greenfield:</strong> <a href="https://github.com/commit-0/commit0" target="_blank">Commit0</a> — 16 libraries (lite split)</li>
+ <li><strong>Testing:</strong> <a href="https://github.com/logic-star-ai/swt-bench" target="_blank">SWT-bench Verified</a> — 433 instances</li>
+ <li><strong>Information Gathering:</strong> <a href="https://huggingface.co/gaia-benchmark" target="_blank">GAIA</a> — 165 questions (validation split)</li>
  </ul>
- <p>
- <strong>Scoring:</strong> Average score is a macro-average across benchmarks (equal weighting). Cost is USD per task; agents without cost data are shown separately in plots.
- </p>
+ """
+ )
+ gr.Markdown("---", elem_classes="divider-line")
+
+ # --- Section 3: Methodology ---
+ gr.HTML(
+ """
+ <h2>Methodology</h2>
+ <p><strong>Per-benchmark scores:</strong> Each benchmark reports a percentage metric (resolve rate, accuracy, or test pass rate), making scores comparable regardless of dataset size.</p>
+ <p><strong>Average score:</strong> Macro-average across all five categories with equal weighting.</p>
+ <p><strong>Cost &amp; Runtime:</strong> Average USD and seconds per task instance. Agents without cost/runtime data are shown separately in Pareto plots.</p>
+ <p>All evaluations use the <a href="https://github.com/OpenHands/software-agent-sdk" target="_blank">OpenHands Agent SDK</a> with identical configurations per model.</p>
  """
  )
  gr.Markdown("---", elem_classes="divider-line")
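The equal-weight macro-average described in the new Methodology section can be sketched as below. This is an illustrative helper, not code from `about.py`, and the per-benchmark scores are hypothetical:

```python
def macro_average(scores: dict[str, float]) -> float:
    """Equal-weight macro-average: the plain mean of per-benchmark percentages.

    Because every benchmark already reports a percentage metric (resolve
    rate, accuracy, or test pass rate), no normalization by dataset size
    is needed before averaging.
    """
    return sum(scores.values()) / len(scores)


# Hypothetical percentage scores for one agent across the five categories.
scores = {
    "issue_resolution": 60.0,   # SWE-bench Verified (500 instances)
    "frontend": 30.0,           # SWE-bench Multimodal (617 instances)
    "greenfield": 25.0,         # Commit0 lite (16 libraries)
    "testing": 45.0,            # SWT-bench Verified (433 instances)
    "info_gathering": 50.0,     # GAIA validation (165 questions)
}

print(macro_average(scores))  # 42.0
```

Note the equal weighting: GAIA's 165 questions count as much toward the average as SWE-bench Multimodal's 617 instances, since each category contributes one score.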