openhands committed on
Commit 7904c2d · 1 Parent(s): aa07520

Replace all AstaBench references with OpenHands Index


- Updated branding across all files (about.py, content.py, etc.)
- Rewrote about.py to focus on software engineering benchmarks
- Updated descriptions to reflect OpenHands' focus on coding agents
- Maintained original design and layout structure

about.py CHANGED
@@ -3,50 +3,52 @@ import gradio as gr
 
 def build_page():
     with gr.Column(elem_id="about-page-content-wrapper"):
-        # --- Section 1: About AstaBench ---
+        # --- Section 1: About OpenHands Index ---
         gr.HTML(
             """
-            <h2>About AstaBench</h2>
+            <h2>About OpenHands Index</h2>
             <p>
-            AstaBench is a novel AI agents evaluation framework, providing a challenging new test for AI agents: the first benchmark challenge that evaluates agents’ scientific abilities on a broad spectrum of research skills, including literature understanding, data analysis, planning, tool use, coding, and search. Asta’s set of standard tools makes it easy to build general-purpose science agents and to compare their performance in an apples-to-apples manner.
+            OpenHands Index is a comprehensive leaderboard that tracks the performance of AI coding agents across multiple software engineering benchmarks. It provides a unified view of agent capabilities in areas like code generation, bug fixing, repository-level tasks, and complex reasoning challenges. The index makes it easy to compare agents' performance in an apples-to-apples manner across diverse evaluation scenarios.
             </p>
             """
         )
         gr.Markdown("---", elem_classes="divider-line")
 
-        # --- Section 2: Why AstaBench? ---
+        # --- Section 2: Why OpenHands Index? ---
         gr.HTML(
             """
-            <h2>Why AstaBench?</h2>
+            <h2>Why OpenHands Index?</h2>
             <p>
-            Most current benchmarks test agentic AI and isolated aspects of scientific reasoning, but rarely evaluate AI agentic behavior rigorously or capture the full skill set scientific research requires. Agents can appear effective despite inconsistent results and high compute use, often outperforming others by consuming more resources. Advancing scientific AI requires evaluations that emphasize reproducibility, efficiency, and the real complexity of research.
+            Software engineering benchmarks are scattered across different platforms and evaluation frameworks, making it difficult to compare agent performance holistically. Agents may excel at one type of task but struggle with others. Understanding the true capabilities of coding agents requires comprehensive evaluation across multiple dimensions.
             </p>
             <br>
             <p>
-            AstaBench fills this gap: an agents evaluation framework and suite of open benchmarks for evaluating scientific AI assistants on core scientific tasks that require novel reasoning. AstaBench helps scientists identify which agents best support their needs through task-relevant leaderboards, while giving AI developers a standard execution environment and tools to test the scientific reasoning capabilities of their agents compared to well-known baselines from the literature, including both open and closed LLM foundation models.
+            OpenHands Index fills this gap by providing a unified leaderboard aggregating results from diverse software engineering benchmarks. It helps developers and researchers identify which agents best suit their needs, while providing standardized metrics for comparing agent performance across tasks like repository-level editing, multimodal understanding, and commit message generation.
             </p>
             """
         )
         gr.Markdown("---", elem_classes="divider-line")
 
-        # --- Section 3: What Does AstaBench Include? ---
+        # --- Section 3: What Does OpenHands Index Include? ---
         gr.HTML(
             """
-            <h2>What Does AstaBench Include?</h2>
+            <h2>What Does OpenHands Index Include?</h2>
             <p>
-            AstaBench includes a rigorous agents evaluation framework and a suite of benchmarks consisting of over 2,400 problems across 11 benchmarks, organized into four core categories:
+            OpenHands Index aggregates results from 6 key benchmarks for evaluating AI coding agents:
             </p>
             <ul class="info-list">
-                <li>Literature Understanding</li>
-                <li>Code & Execution</li>
-                <li>Data Analysis</li>
-                <li>End-to-End Discovery</li>
+                <li><strong>SWE-bench</strong>: Repository-level bug fixing from real GitHub issues</li>
+                <li><strong>Multi-SWE-bench</strong>: Multi-repository software engineering tasks</li>
+                <li><strong>SWE-bench Multimodal</strong>: Bug fixing with visual context</li>
+                <li><strong>SWT-bench</strong>: Web development and testing tasks</li>
+                <li><strong>Commit0</strong>: Commit message generation and code understanding</li>
+                <li><strong>GAIA</strong>: General AI assistant tasks requiring reasoning and tool use</li>
             </ul>
             <p>
-            Plus: a large suite of integrated agents and leaderboards with results from extensive evaluation of agents and models.
+            Plus: comprehensive leaderboards showing performance across models, agents, and configurations.
             </p>
             <p>
-            🔍 Learn more in the <a href="https://allenai.org/blog/astabench" target="_blank" class="primary-link-button">AstaBench technical blog post</a>
+            🔍 Learn more at <a href="https://github.com/OpenHands/OpenHands" target="_blank" class="primary-link-button">github.com/OpenHands/OpenHands</a>
             </p>
             """
         )
@@ -57,18 +59,19 @@ def build_page():
             """
             <h2>Understanding the Leaderboards</h2>
             <p>
-            The AstaBench Overall Leaderboard provides a high-level view of overall agent performance and efficiency:
+            The OpenHands Index Overall Leaderboard provides a high-level view of agent performance and efficiency:
             </p>
             <ul class="info-list">
-                <li>Overall score: A macro-average of the four category-level averages (equal weighting)</li>
-                <li>Overall cost: Average cost per task, aggregated only across benchmarks with reported cost</li>
+                <li><strong>Overall score</strong>: A macro-average across all benchmarks (equal weighting)</li>
+                <li><strong>Overall cost</strong>: Average cost per task in USD, aggregated across benchmarks with reported cost</li>
             </ul>
             <p>
-            Each category leaderboard provides:
+            Individual benchmark pages provide:
             </p>
             <ul class="info-list">
-                <li>Average score and cost for that category (macro-averaged across the benchmarks in the category)</li>
-                <li>A breakdown by individual benchmarks</li>
+                <li>Detailed scores and metrics for that specific benchmark</li>
+                <li>Cost breakdowns per agent</li>
+                <li>Links to submission details and logs</li>
             </ul>
             """
         )
@@ -79,66 +82,43 @@ def build_page():
             """
             <h2>Scoring & Aggregation</h2>
             <p>
-            AstaBench encourages careful, transparent evaluation. Here's how we handle scoring, cost, and partial results:
+            OpenHands Index provides transparent, standardized evaluation metrics:
             </p>
 
             <h3>Scores</h3>
             <ul class="info-list">
-                <li>Each benchmark returns an average score based on per-problem scores</li>
-                <li>All scores are aggregated upward using macro-averaging</li>
-                <li>Partial completions are included (even with poor performance)</li>
+                <li>Each benchmark returns an average score based on per-task performance</li>
+                <li>All scores are aggregated using macro-averaging (equal weight per benchmark)</li>
+                <li>Metrics vary by benchmark (e.g., resolve rate, pass@1, accuracy)</li>
             </ul>
 
             <h3>Cost</h3>
             <ul class="info-list">
-                <li>Costs are reported in USD per task.</li>
+                <li>Costs are reported in USD per task</li>
                 <li>Benchmarks without cost data are excluded from cost averages</li>
-                <li>In scatter plots, agents without cost are plotted to the far right and clearly marked.</li>
+                <li>In scatter plots, agents without cost data are clearly marked</li>
             </ul>
 
             <p>
-            <em>Note: Cost values reflect pricing and infrastructure conditions at a fixed point in time. We recognize that compute costs may change over time and vary by provider, and are actively working on methods to keep costs up-to-date and normalized for fair comparisons.</em>
-            </p>
-
-            <h3>Coverage</h3>
-            <ul class="info-list">
-                <li>Main leaderboard: category coverage (X/4)</li>
-                <li>Category view: benchmark coverage (X/Y)</li>
-                <li>Incomplete coverage is flagged visually</li>
-            </ul>
-
-            <p>
-            These design choices ensure fair comparison while penalizing cherry-picking and omissions.
+            <em>Note: Cost values reflect API pricing at evaluation time and may vary based on provider, infrastructure, and usage patterns.</em>
             </p>
             """
         )
         gr.Markdown("---", elem_classes="divider-line")
 
-        # --- Section 6: Learn More ---
+        # --- Section 6: Citation ---
         gr.HTML(
             """
-            <div class="learn-more-section">
-                <h2>Learn More</h2>
-                <div class="link-buttons-container">
-
-                    <a href="https://allenai.org/blog/astabench" target="_blank" class="link-button">
-                        <span style="color:#0fcb8c;">AstaBench technical blog post</span>
-                        <span class="external-link-icon">↗</span>
-                    </a>
-
-                    <a href="/submit" target="_blank" class="link-button">
-                        <span style="color:#0fcb8c;">Submit an agent for evaluation</span>
-                        <span class="external-link-icon">↗</span>
-                    </a>
-
-                </div>
-            </div>
+            <h2>Citation</h2>
+            <p>
+            If you use OpenHands or reference the OpenHands Index in your work, please cite:
+            </p>
+            <pre class="citation-block">
+            @misc{openhands2024,
+              title={OpenHands: An Open Platform for AI Software Developers as Generalist Agents},
+              author={OpenHands Team},
+              year={2024},
+              howpublished={https://github.com/OpenHands/OpenHands}
+            }</pre>
             """
         )
-        # Floating feedback button
-        floating_feedback_button_html = """
-        <div>
-            <a id="feedback-button" href="https://docs.google.com/forms/d/e/1FAIpQLSfJdVkD62aPYh8XehN2FrSeHUWt488Ejc-QdtuZn5NZ3eNoxA/viewform">Have feedback?</a>
-        </div>
-        """
-        gr.HTML(floating_feedback_button_html)
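As a side note, the aggregation rules described in the about.py copy above (an equal-weight macro-average over benchmark scores, and a cost average that skips benchmarks without reported cost) can be illustrated with a small sketch. This is not code from the commit; the function name and data layout below are hypothetical.

# Hypothetical sketch (not from this repository) of the aggregation described above:
# an equal-weight macro-average over benchmark scores, plus a cost average that
# excludes benchmarks with no reported cost.
from statistics import mean

def aggregate(results: dict) -> tuple:
    """results maps benchmark name -> {"score": float, "cost": float or None}."""
    overall_score = mean(r["score"] for r in results.values())  # each benchmark weighted equally
    costs = [r["cost"] for r in results.values() if r["cost"] is not None]
    overall_cost = mean(costs) if costs else None  # benchmarks without cost are excluded
    return overall_score, overall_cost

example = {
    "SWE-bench": {"score": 0.42, "cost": 1.10},
    "SWT-bench": {"score": 0.31, "cost": None},  # no reported cost
    "GAIA": {"score": 0.55, "cost": 0.80},
}
print(aggregate(example))  # -> (0.4266..., 0.95)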
 
 
 
 
 
 
 
category_page_builder.py CHANGED
@@ -17,7 +17,7 @@ def build_category_page(CATEGORY_NAME, PAGE_DESCRIPTION):
     with gr.Row(elem_id="intro-row"):
 
         with gr.Column(scale=1):
-            gr.HTML(f'<h2>AstaBench {CATEGORY_NAME} Leaderboard <span style="font-weight: normal; color: inherit;">(Aggregate)</span></h2>', elem_id="main-header")
+            gr.HTML(f'<h2>OpenHands Index {CATEGORY_NAME} Leaderboard <span style="font-weight: normal; color: inherit;">(Aggregate)</span></h2>', elem_id="main-header")
         with gr.Column(elem_id="validation_nav_container", visible=False) as validation_nav_container:
             create_sub_navigation_bar(validation_tag_map, CATEGORY_NAME, validation=True)
 
content.py CHANGED
@@ -13,11 +13,11 @@ def create_gradio_anchor_id(text: str, validation) -> str:
     return f"h-{text}-leaderboard"
 
 
-TITLE = """<h1 align="left" id="space-title">AstaBench Leaderboard</h1>"""
+TITLE = """<h1 align="left" id="space-title">OpenHands Index</h1>"""
 
 INTRO_PARAGRAPH = """
 <p>
-  <strong>AstaBench</strong> provides an aggregated view of agent performance and efficiency across all benchmarks in all four categories. We report:
+  <strong>OpenHands Index</strong> provides an aggregated view of agent performance and efficiency across all benchmarks in all categories. We report:
 </p>
 
 <ul class="info-list">
@@ -48,7 +48,7 @@ For detailed results, use the links above to explore individual benchmarks.
 <br>
 """
 CODE_EXECUTION_DESCRIPTION = """
-The **Code & Execution** category in AstaBench includes tasks that evaluate an agent’s ability to write, modify, and run code in realistic research scenarios. Unlike literature tasks—which only require read-only tools and can sometimes even be solved by a language model alone—these problems often require the agent to manipulate a machine environment with tools: reading input files, executing code, and writing outputs to specific files in the required format.
+The **Code & Execution** category in OpenHands Index includes tasks that evaluate an agent’s ability to write, modify, and run code in realistic research scenarios. Unlike literature tasks—which only require read-only tools and can sometimes even be solved by a language model alone—these problems often require the agent to manipulate a machine environment with tools: reading input files, executing code, and writing outputs to specific files in the required format.
 <br><br>
 The scores in this category are aggregated from three distinct benchmarks, each targeting different facets of scientific coding and execution. Together, these benchmarks evaluate whether an agent can function as a hands-on scientific assistant—not just by reasoning about code, but by running it in real-world contexts.
 <br><br>
@@ -68,7 +68,7 @@ Scores in this category are aggregated from two benchmarks, providing the first
 <br>
 """
 SUBMISSION_CONFIRMATION = """
-**Your agent has been submitted to AstaBench for evaluation.**
+**Your agent has been submitted to OpenHands Index for evaluation.**
 <br><br>
 🙏 Thanks for contributing!
 <br><br>
@@ -172,8 +172,8 @@ def get_benchmark_description(benchmark_name, validation):
 
 CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
 CITATION_BUTTON_TEXT = r"""@article{asta-bench,
-  title={AstaBench},
-  author={AstaBench folks},
+  title={OpenHands Index},
+  author={OpenHands Index folks},
   year={2025},
   eprint={TBD.TBD},
   archivePrefix={arXiv},
@@ -184,11 +184,11 @@ CITATION_BUTTON_TEXT = r"""@article{asta-bench,
 LEGAL_DISCLAIMER_TEXT = """
 <h2>Terms and Conditions</h2>
 <p>
-The Allen Institute for Artificial Intelligence (Ai2) maintains this repository for agent evaluation submissions to AstaBench. To keep AstaBench fair and auditable, all evaluation logs and associated submission files will be made publicly available. This includes your benchmark inputs, model output responses, and other data and information related to your submission as needed to verify the results.
+The Allen Institute for Artificial Intelligence (Ai2) maintains this repository for agent evaluation submissions to OpenHands Index. To keep OpenHands Index fair and auditable, all evaluation logs and associated submission files will be made publicly available. This includes your benchmark inputs, model output responses, and other data and information related to your submission as needed to verify the results.
 </p>
 <br>
 <p>
-Your submissions to AstaBench will be posted, scored, and ranked on the leaderboard at <a href="https://huggingface.co/spaces/allenai/asta-bench-leaderboard" target="_blank" rel="noopener noreferrer">https://huggingface.co/spaces/allenai/asta-bench-leaderboard</a>. You agree you have the rights to the materials you submit and that you will not share any personal, sensitive, proprietary, or confidential information.
+Your submissions to OpenHands Index will be posted, scored, and ranked on the leaderboard at <a href="https://huggingface.co/spaces/allenai/asta-bench-leaderboard" target="_blank" rel="noopener noreferrer">https://huggingface.co/spaces/allenai/asta-bench-leaderboard</a>. You agree you have the rights to the materials you submit and that you will not share any personal, sensitive, proprietary, or confidential information.
 </p>
 """
 
leaderboard_transformer.py CHANGED
@@ -535,7 +535,7 @@ def _plot_scatter_plotly(
 
     fig.update_layout(
         template="plotly_white",
-        title=f"AstaBench {name} Leaderboard",
+        title=f"OpenHands Index {name} Leaderboard",
         xaxis=xaxis_config,  # Use the updated config
         yaxis=dict(title="Average (mean) score", range=[-0.2, None]),
         legend=dict(
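For context on the _plot_scatter_plotly hunk above, here is a minimal, self-contained sketch of the kind of score-vs-cost scatter that layout produces. The data and the x-axis title are invented for illustration; only the update_layout fields mirror the hunk (the real code passes a prebuilt xaxis_config that is not shown here).

# Illustrative only: hypothetical agents, scores, and costs.
import plotly.graph_objects as go

name = "Overall"
agents = ["agent-a", "agent-b", "agent-c"]
scores = [0.42, 0.31, 0.55]
costs = [1.10, 0.80, 0.95]

fig = go.Figure(
    go.Scatter(x=costs, y=scores, mode="markers+text", text=agents, textposition="top center")
)
fig.update_layout(
    template="plotly_white",
    title=f"OpenHands Index {name} Leaderboard",
    xaxis=dict(title="Average cost per task (USD)"),  # assumed label; not from the repo
    yaxis=dict(title="Average (mean) score", range=[-0.2, None]),
)
fig.show()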
main_page.py CHANGED
@@ -34,7 +34,7 @@ def build_page():
     # --- Leaderboard Display Section ---
     gr.Markdown("---")
     CATEGORY_NAME = "Overall"
-    gr.HTML(f'<h2>AstaBench {CATEGORY_NAME} Leaderboard <span style="font-weight: normal; color: inherit;">(Aggregate)</span></h2>', elem_id="main-header")
+    gr.HTML(f'<h2>OpenHands Index {CATEGORY_NAME} Leaderboard <span style="font-weight: normal; color: inherit;">(Aggregate)</span></h2>', elem_id="main-header")
 
     with gr.Tabs() as tabs:
         with gr.Tab("Results: Test Set") as test_tab:
submission.py CHANGED
@@ -357,7 +357,7 @@ agent_tooling_label_html = f"""<div>
 
 heading_html = """
 <h2>🚀 Submit an agent for evaluation</h2>
-<p>Submit your agent to AstaBench for evaluation on real-world scientific tasks. Once submitted, your run will be reviewed by our team. If there are any issues, we’ll reach out within 5–7 business days. We’re working toward full automation, but in the meantime, human review helps ensure quality and trust.</p>
+<p>Submit your agent to OpenHands Index for evaluation on real-world scientific tasks. Once submitted, your run will be reviewed by our team. If there are any issues, we’ll reach out within 5–7 business days. We’re working toward full automation, but in the meantime, human review helps ensure quality and trust.</p>
 <h3>How to run an evaluation</h3>
 <p>Please follow the steps in our <a href="https://github.com/allenai/asta-bench?tab=readme-ov-file#usage" target="_blank">README</a>. You’ll upload your run file at the end of this form.</p>
 """
@@ -372,7 +372,7 @@ def build_page():
         gr.HTML(value="""<h3>Username</h3>""", elem_classes="form-label")
         username_tb = gr.Textbox(label="This will show on the leaderboard. By default, we’ll use your Hugging Face username; but you can enter your organization name instead (e.g., university, company, or lab).")
         gr.HTML(value="""<h3>Role</h3>""", elem_classes="form-label")
-        role = gr.Dropdown(label="Please select the role that most closely matches your current position. Helps us improve AstaBench for different user types. Not displayed on the leaderboard.",
+        role = gr.Dropdown(label="Please select the role that most closely matches your current position. Helps us improve OpenHands Index for different user types. Not displayed on the leaderboard.",
                            interactive=True,
                            choices=[
                                "Undergraduate Student",