openhands committed · Commit cfd4f2a · 1 parent: 35aa299

Simplify and update About page

- Made content more concise
- Updated benchmark list to match actual categories
- Added links to all benchmark sources
- Removed 'Submitting Results' section (not accepting external submissions)
- Added Slack link for contact
- Updated citation author to 'OpenHands Team'
about.py CHANGED
@@ -3,186 +3,91 @@ import gradio as gr

Removed version (several lines are truncated in the rendered diff and are kept as-is):

```python
def build_page():
    with gr.Column(elem_id="about-page-content-wrapper"):
        # --- Section 1: About
        gr.HTML(
            """
            <h2>About
            <p>
            OpenHands Index
            </p>
            """
        )
        gr.Markdown("---", elem_classes="divider-line")

        # --- Section 2:
        gr.HTML(
            """
            <h2>
            <p>
            Software engineering benchmarks are scattered across different platforms and evaluation frameworks, making it difficult to compare agent performance holistically. Agents may excel at one type of task but struggle with others. Understanding the true capabilities of coding agents requires comprehensive evaluation across multiple dimensions.
            </p>
            <br>
            <p>
            OpenHands Index fills this gap by providing a unified leaderboard aggregating results from diverse software engineering benchmarks. It helps developers and researchers identify which agents best suit their needs, while providing standardized metrics for comparing agent performance across tasks like repository-level editing, multimodal understanding, and commit message generation.
            </p>
            """
        )
        gr.Markdown("---", elem_classes="divider-line")

        # --- Section 3: What Does OpenHands Index Include? ---
        gr.HTML(
            """
            <h2>What Does OpenHands Index Include?</h2>
            <p>
            OpenHands Index aggregates results from 5 key benchmarks for evaluating AI coding agents:
            </p>
            <ul class="info-list">
                <li><strong>SWE-bench</
                <li><strong>
                <li><strong>
                <li><strong>
                <li><strong>
            </ul>
            <p>
            </p>
            <p>
            🔍 Learn more at <a href="https://github.com/OpenHands/OpenHands" target="_blank" class="primary-link-button">github.com/OpenHands/OpenHands</a>
            </p>
            """
        )
        gr.Markdown("---", elem_classes="divider-line")

        # --- Section
        gr.HTML(
            """
            <h2>
            <p>
            The OpenHands Index Overall Leaderboard provides a high-level view of agent performance and efficiency:
            </p>
            <ul class="info-list">
                <li><strong>Average score
                <li><strong>
            </ul>
            <p>
            Individual benchmark pages provide:
            </p>
            <ul class="info-list">
                <li>Detailed scores and metrics for that specific benchmark</li>
                <li>Cost breakdowns per agent</li>
                <li>Links to submission details and logs</li>
            </ul>
            """
        )
        gr.Markdown("---", elem_classes="divider-line")

        # --- Section
        gr.HTML(
            """
            <h2>
            <p>
            OpenHands Index provides transparent, standardized evaluation metrics:
            </p>

            <h3>Scores</h3>
            <ul class="info-list">
                <li>Each benchmark returns an average score based on per-task performance</li>
                <li>All scores are aggregated using macro-averaging (equal weight per benchmark)</li>
                <li>Metrics vary by benchmark (e.g., resolve rate, pass@1, accuracy)</li>
            </ul>

            <h3>Cost</h3>
            <ul class="info-list">
                <li>Costs are reported in USD per task</li>
                <li>Benchmarks without cost data are excluded from cost averages</li>
                <li>In scatter plots, agents without cost data are clearly marked</li>
            </ul>
            <p>
            </p>
            """
        )
        gr.Markdown("---", elem_classes="divider-line")

        # --- Section
        gr.HTML(
            """
            <h2>
            <h3>How to Submit Your Agent Results</h3>
            <p>
            To submit your agent's evaluation results to the OpenHands Index:
            </p>
            <ol class="info-list">
                <li>Run your agent on the supported benchmarks (SWE-bench, SWE-bench Multimodal, SWT-bench, Commit0, GAIA)</li>
                <li>Format your results according to the data structure documented in the repository</li>
                <li>Submit a pull request to <a href="https://github.com/OpenHands/openhands-index-results" target="_blank" class="primary-link-button">github.com/OpenHands/openhands-index-results</a></li>
                <li>Your submission should include:
                    <ul>
                        <li><code>metadata.json</code> with agent information, model used, and evaluation details</li>
                        <li><code>scores.json</code> with benchmark results and scores</li>
                    </ul>
                </li>
            </ol>

            <h3>Accessing Raw Results</h3>
            <p>
            All raw evaluation results displayed on this leaderboard are publicly available at:
            </p>
            <p>
            📊 <a href="https://github.com/OpenHands/openhands-index-results" target="_blank" class="primary-link-button">github.com/OpenHands/openhands-index-results</a>
            </p>
            <p>
            </p>
            <ul class="info-list">
                <li>Complete metadata for each agent submission</li>
                <li>Detailed benchmark scores and metrics</li>
                <li>Evaluation dates and configurations</li>
                <li>Model and cost information</li>
            </ul>
            <p>
            You can clone the repository, analyze the data, or use it for your own research and comparisons.
            </p>
            """
        )
        gr.Markdown("---", elem_classes="divider-line")

        # --- Section
        gr.HTML(
            """
            <h2>Acknowledgements</h2>
            <p>
            The
            <a href="https://huggingface.co/spaces/allenai/asta-bench-leaderboard" target="_blank"
            </p>
            <p>
            We thank the teams behind the component benchmarks used in OpenHands Index:
            </p>
            <ul class="info-list">
                <li><a href="https://www.swebench.com/" target="_blank">SWE-bench</a> - Princeton NLP Group</li>
                <li><a href="https://github.com/OpenHands/SWE-bench-multimodal" target="_blank">SWE-bench Multimodal</a> - OpenHands Team</li>
                <li><a href="https://github.com/logic-star-ai/swt-bench" target="_blank">SWT-bench</a> - Logic Star AI</li>
                <li><a href="https://github.com/commit-0/commit0" target="_blank">Commit0</a> - Commit0 Team</li>
                <li><a href="https://huggingface.co/gaia-benchmark" target="_blank">GAIA</a> - Hugging Face & Meta AI</li>
            </ul>
            """
        )
        gr.Markdown("---", elem_classes="divider-line")

        # --- Section
        gr.HTML(
            """
            <h2>Citation</h2>
            <p>
            If you reference the OpenHands Index in your work, please cite:
            </p>
            <pre class="citation-block">
@misc{openhandsindex2025,
  title={OpenHands Index: A Comprehensive Leaderboard for AI Coding Agents},
  author={
  year={2025},
  howpublished={https://huggingface.co/spaces/OpenHands/openhands-index}
}</pre>
```
New version:

```python
def build_page():
    with gr.Column(elem_id="about-page-content-wrapper"):
        # --- Section 1: About ---
        gr.HTML(
            """
            <h2>About</h2>
            <p>
            OpenHands Index tracks AI coding agent performance across software engineering benchmarks, providing a unified view of both accuracy and cost efficiency.
            </p>
            """
        )
        gr.Markdown("---", elem_classes="divider-line")

        # --- Section 2: Benchmarks ---
        gr.HTML(
            """
            <h2>Benchmarks</h2>
            <p>We evaluate agents across five categories:</p>
            <ul class="info-list">
                <li><strong>Issue Resolution:</strong> <a href="https://www.swebench.com/" target="_blank">SWE-bench</a>, <a href="https://github.com/OpenHands/SWE-bench-multimodal" target="_blank">SWE-bench Multimodal</a></li>
                <li><strong>Greenfield:</strong> <a href="https://github.com/commit-0/commit0" target="_blank">Commit0</a></li>
                <li><strong>Frontend:</strong> <a href="https://github.com/pwnslinger/multi-swe-bench" target="_blank">Multi-SWE-bench</a></li>
                <li><strong>Testing:</strong> <a href="https://github.com/logic-star-ai/swt-bench" target="_blank">SWT-bench</a></li>
                <li><strong>Information Gathering:</strong> <a href="https://huggingface.co/gaia-benchmark" target="_blank">GAIA</a></li>
            </ul>
            <p>
            All models are evaluated using the <a href="https://github.com/OpenHands/software-agent-sdk" target="_blank">OpenHands Software Agent SDK</a>.
            </p>
            """
        )
        gr.Markdown("---", elem_classes="divider-line")

        # --- Section 3: Scoring ---
        gr.HTML(
            """
            <h2>Scoring</h2>
            <ul class="info-list">
                <li><strong>Average score:</strong> Macro-average across benchmarks (equal weighting)</li>
                <li><strong>Cost:</strong> USD per task; agents without cost data shown separately in plots</li>
            </ul>
            """
        )
        gr.Markdown("---", elem_classes="divider-line")

        # --- Section 4: Raw Data ---
        gr.HTML(
            """
            <h2>Raw Data</h2>
            <p>
            All evaluation results are available at <a href="https://github.com/OpenHands/openhands-index-results" target="_blank">github.com/OpenHands/openhands-index-results</a>.
            </p>
            """
        )
        gr.Markdown("---", elem_classes="divider-line")

        # --- Section 5: Contact ---
        gr.HTML(
            """
            <h2>Contact</h2>
            <p>
            Questions or feedback? Join us on <a href="https://dub.sh/openhands" target="_blank">Slack</a>.
            </p>
            """
        )
        gr.Markdown("---", elem_classes="divider-line")

        # --- Section 6: Acknowledgements ---
        gr.HTML(
            """
            <h2>Acknowledgements</h2>
            <p>
            The leaderboard interface is adapted from the
            <a href="https://huggingface.co/spaces/allenai/asta-bench-leaderboard" target="_blank">AstaBench Leaderboard</a>
            by Allen Institute for AI.
            </p>
            """
        )
        gr.Markdown("---", elem_classes="divider-line")

        # --- Section 7: Citation ---
        gr.HTML(
            """
            <h2>Citation</h2>
            <pre class="citation-block">
@misc{openhandsindex2025,
  title={OpenHands Index: A Comprehensive Leaderboard for AI Coding Agents},
  author={OpenHands Team},
  year={2025},
  howpublished={https://huggingface.co/spaces/OpenHands/openhands-index}
}</pre>
            """
        )
```