openhands committed on
Commit d8b77db · 1 Parent(s): 3b71657

Add methodology section and benchmark instance counts to About page


- Add instance/task counts for all five benchmarks
- Add new Methodology section explaining:
  - Per-benchmark scores are percentages (comparable across datasets)
  - Average score is macro-average with equal weighting
  - Cost and runtime are per-instance averages
  - All evals use identical OpenHands Agent SDK configuration

Co-authored-by: openhands <openhands@all-hands.dev>

Files changed (1)
  1. about.py +17 -8
about.py CHANGED
@@ -20,15 +20,24 @@ def build_page():
  <h2>Benchmark Details</h2>
  <p>We evaluate agents across five categories:</p>
  <ul class="info-list">
- <li><strong>Issue Resolution:</strong> <a href="https://www.swebench.com/" target="_blank">SWE-bench</a></li>
- <li><strong>Frontend:</strong> <a href="https://github.com/OpenHands/SWE-bench-multimodal" target="_blank">SWE-bench Multimodal</a></li>
- <li><strong>Greenfield:</strong> <a href="https://github.com/commit-0/commit0" target="_blank">Commit0</a></li>
- <li><strong>Testing:</strong> <a href="https://github.com/logic-star-ai/swt-bench" target="_blank">SWT-bench</a></li>
- <li><strong>Information Gathering:</strong> <a href="https://huggingface.co/gaia-benchmark" target="_blank">GAIA</a></li>
+ <li><strong>Issue Resolution:</strong> <a href="https://www.swebench.com/" target="_blank">SWE-bench Verified</a> — 500 instances</li>
+ <li><strong>Frontend:</strong> <a href="https://github.com/OpenHands/SWE-bench-multimodal" target="_blank">SWE-bench Multimodal</a> — 617 instances</li>
+ <li><strong>Greenfield:</strong> <a href="https://github.com/commit-0/commit0" target="_blank">Commit0</a> — 16 libraries (lite split)</li>
+ <li><strong>Testing:</strong> <a href="https://github.com/logic-star-ai/swt-bench" target="_blank">SWT-bench Verified</a> — 433 instances</li>
+ <li><strong>Information Gathering:</strong> <a href="https://huggingface.co/gaia-benchmark" target="_blank">GAIA</a> — 165 questions (validation split)</li>
  </ul>
- <p>
- <strong>Scoring:</strong> Average score is a macro-average across benchmarks (equal weighting). Cost is USD per task; agents without cost data are shown separately in plots.
- </p>
+ """
+ )
+ gr.Markdown("---", elem_classes="divider-line")
+
+ # --- Section 3: Methodology ---
+ gr.HTML(
+ """
+ <h2>Methodology</h2>
+ <p><strong>Per-benchmark scores:</strong> Each benchmark reports a percentage metric (resolve rate, accuracy, or test pass rate), making scores comparable regardless of dataset size.</p>
+ <p><strong>Average score:</strong> Macro-average across all five categories with equal weighting.</p>
+ <p><strong>Cost &amp; Runtime:</strong> Average USD and seconds per task instance. Agents without cost/runtime data are shown separately in Pareto plots.</p>
+ <p>All evaluations use the <a href="https://github.com/OpenHands/software-agent-sdk" target="_blank">OpenHands Agent SDK</a> with identical configurations per model.</p>
  """
  )
  gr.Markdown("---", elem_classes="divider-line")
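The equal-weight macro-average described in the new Methodology section can be sketched as below. This is an illustrative helper, not code from `about.py`, and the per-benchmark scores are hypothetical:

```python
def macro_average(scores: dict[str, float]) -> float:
    """Equal-weight macro-average: the plain mean of per-benchmark percentages.

    Because every benchmark already reports a percentage metric (resolve
    rate, accuracy, or test pass rate), no normalization by dataset size
    is needed before averaging.
    """
    return sum(scores.values()) / len(scores)


# Hypothetical percentage scores for one agent across the five categories.
scores = {
    "issue_resolution": 60.0,   # SWE-bench Verified (500 instances)
    "frontend": 30.0,           # SWE-bench Multimodal (617 instances)
    "greenfield": 25.0,         # Commit0 lite (16 libraries)
    "testing": 45.0,            # SWT-bench Verified (433 instances)
    "info_gathering": 50.0,     # GAIA validation (165 questions)
}

print(macro_average(scores))  # 42.0
```

Note the equal weighting: GAIA's 165 questions count as much toward the average as SWE-bench Multimodal's 617 instances, since each category contributes one score.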