Commit 35aa299 · 1 Parent(s): 7abbbe1
openhands committed

Remove asta/astabench references


- Update utm_source from asta_leaderboard to openhands_index
- Remove 'Asta' from toolset description
- Update citation text to use openhands-index
- Update legal disclaimer with OpenHands URL
- Simplify README Hugging Face integration section
- Delete unused agenteval_backup.json file

Keep acknowledgement to AstaBench in about.py as it credits the source.

Files changed (3)
  1. README.md +5 -19
  2. content.py +6 -6
  3. data/1.0.0-dev1/agenteval_backup.json +0 -308
README.md CHANGED
@@ -37,29 +37,15 @@ python app.py
 This will start a local server that you can access in your web browser at `http://localhost:7860`.
 
 ## Hugging Face Integration
-The repo backs two Hugging Face leaderboard spaces:
-- https://huggingface.co/spaces/allenai/asta-bench-internal-leaderboard
-- https://huggingface.co/spaces/allenai/asta-bench-leaderboard
+The repo backs the Hugging Face space at https://huggingface.co/spaces/OpenHands/openhands-index
 
-Please follow the steps below to push changes to the leaderboards on Hugging Face.
+Please follow the steps below to push changes to the leaderboard on Hugging Face.
 
-Before pushing, make sure to merge your changes to the `main` branch of this repository. (following the standard GitHub workflow of creating a branch, making changes, and then merging it back to `main`).
+Before pushing, make sure to merge your changes to the `main` branch of this repository (following the standard GitHub workflow of creating a branch, making changes, and then merging it back to `main`).
 
-Before pushing for the first time, you'll need to add the Hugging Face remote repositories if you haven't done so already. You can do this by running the following commands:
-
-```bash
-git remote add huggingface https://huggingface.co/spaces/allenai/asta-bench-internal-leaderboard
-git remote add huggingface-public https://huggingface.co/spaces/allenai/asta-bench-leaderboard
-```
-You can verify that the remotes have been added by running:
-
-```bash
-git remote -v
-```
-Then, to push the changes to the Hugging Face leaderboards, you can use the following commands:
-
-```bash
-git push huggingface main:main
-git push huggingface-public main:main
+Then, to push the changes to the Hugging Face leaderboard:
+
+```bash
+git push origin main
 ```
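The simplified README flow (branch, merge to `main`, push) can be exercised end-to-end in a throwaway repository. The sketch below is illustrative: the branch name `my-change` is hypothetical, and the final push to the Space is left as a comment since it requires real Hugging Face credentials.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q -b main
git config user.email demo@example.com
git config user.name demo
echo "# demo" > README.md
git add README.md && git commit -qm "initial"

git checkout -qb my-change        # feature branch (name is illustrative)
echo "update" >> README.md
git commit -qam "make a change"

git checkout -q main
git merge -q my-change            # merge the branch back into main
# git push origin main            # then push main to the Hugging Face space
git log --oneline | wc -l
```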
content.py CHANGED
@@ -92,7 +92,7 @@ DISCOVERY_BENCH_URL = "https://www.semanticscholar.org/paper/DiscoveryBench%3A-T
 
 # Helper function to create external links
 def external_link(url, text, is_s2_url=False):
-    url = f"{url}?utm_source=asta_leaderboard" if is_s2_url else url
+    url = f"{url}?utm_source=openhands_index" if is_s2_url else url
     return f"<a href='{url}' target='_blank' rel='noopener noreferrer'>{text}</a>"
 
 def internal_leaderboard_link(text, validation):
@@ -122,7 +122,7 @@ def get_benchmark_description(benchmark_name, validation):
             f"{external_link(LITQA2_URL, 'LitQA2', is_s2_url=True)}, a benchmark introduced by FutureHouse, gauges a model's ability to answer questions that require document retrieval from the scientific literature. "
             "It consists of multiple-choice questions that necessitate finding a unique paper and analyzing its detailed full text to spot precise information; these questions cannot be answered from a paper’s abstract. "
             "While the original version of the benchmark provided for each question the title of the paper in which the answer can be found, it did not specify the overall collection to search over. In our version, "
-            "we search over the index we provide as part of the Asta standard toolset. The “-FullText” suffix indicates we consider only the subset of LitQA2 questions for which "
+            "we search over the index we provide as part of the standard toolset. The “-FullText” suffix indicates we consider only the subset of LitQA2 questions for which "
             "the full-text version of the answering paper is open source and available in our index."
         ),
         'ArxivDIGESTables-Clean': (
@@ -175,9 +175,9 @@ def get_benchmark_description(benchmark_name, validation):
     return descriptions.get(benchmark_name, "")
 
 CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
-CITATION_BUTTON_TEXT = r"""@article{asta-bench,
+CITATION_BUTTON_TEXT = r"""@article{openhands-index,
 title={OpenHands Index},
-author={OpenHands Index folks},
+author={OpenHands Team},
 year={2025},
 eprint={TBD.TBD},
 archivePrefix={arXiv},
@@ -188,11 +188,11 @@ CITATION_BUTTON_TEXT = r"""@article{asta-bench,
 LEGAL_DISCLAIMER_TEXT = """
 <h2>Terms and Conditions</h2>
 <p>
-The Allen Institute for Artificial Intelligence (Ai2) maintains this repository for agent evaluation submissions to OpenHands Index. To keep OpenHands Index fair and auditable, all evaluation logs and associated submission files will be made publicly available. This includes your benchmark inputs, model output responses, and other data and information related to your submission as needed to verify the results.
+OpenHands maintains this repository for agent evaluation submissions to OpenHands Index. To keep OpenHands Index fair and auditable, all evaluation logs and associated submission files will be made publicly available. This includes your benchmark inputs, model output responses, and other data and information related to your submission as needed to verify the results.
 </p>
 <br>
 <p>
-Your submissions to OpenHands Index will be posted, scored, and ranked on the leaderboard at <a href="https://huggingface.co/spaces/allenai/asta-bench-leaderboard" target="_blank" rel="noopener noreferrer">https://huggingface.co/spaces/allenai/asta-bench-leaderboard</a>. You agree you have the rights to the materials you submit and that you will not share any personal, sensitive, proprietary, or confidential information.
+Your submissions to OpenHands Index will be posted, scored, and ranked on the leaderboard at <a href="https://huggingface.co/spaces/OpenHands/openhands-index" target="_blank" rel="noopener noreferrer">https://huggingface.co/spaces/OpenHands/openhands-index</a>. You agree you have the rights to the materials you submit and that you will not share any personal, sensitive, proprietary, or confidential information.
 </p>
 """
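The `external_link` helper touched by this commit appends a UTM parameter only for Semantic Scholar URLs. It can be exercised in isolation; the snippet below copies the function body as it reads after the change:

```python
# Copy of content.py's external_link helper after this commit.
def external_link(url, text, is_s2_url=False):
    # Append the UTM source only for Semantic Scholar URLs
    url = f"{url}?utm_source=openhands_index" if is_s2_url else url
    return f"<a href='{url}' target='_blank' rel='noopener noreferrer'>{text}</a>"

print(external_link("https://www.semanticscholar.org/paper/x", "LitQA2", is_s2_url=True))
print(external_link("https://example.com", "docs"))
```

One thing worth noting: the helper always joins with `?`, so a Semantic Scholar URL that already carries a query string would end up with two `?` characters; the current call sites apparently pass bare URLs.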
data/1.0.0-dev1/agenteval_backup.json DELETED
@@ -1,308 +0,0 @@
-{
-  "suite_config": {
-    "name": "openhands-index",
-    "version": "1.0.0-dev1",
-    "splits": [
-      {
-        "name": "validation",
-        "tasks": [
-          {
-            "name": "swe-bench",
-            "path": "openhands/swe-bench",
-            "primary_metric": "resolved/mean",
-            "tags": [
-              "swe-bench"
-            ]
-          },
-          {
-            "name": "multi-swe-bench",
-            "path": "openhands/multi-swe-bench",
-            "primary_metric": "resolved/mean",
-            "tags": [
-              "multi-swe-bench"
-            ]
-          },
-          {
-            "name": "swe-bench-multimodal",
-            "path": "openhands/swe-bench-multimodal",
-            "primary_metric": "resolved/mean",
-            "tags": [
-              "swe-bench-multimodal"
-            ]
-          },
-          {
-            "name": "swt-bench",
-            "path": "openhands/swt-bench",
-            "primary_metric": "generated/mean",
-            "tags": [
-              "swt-bench"
-            ]
-          },
-          {
-            "name": "commit0",
-            "path": "openhands/commit0",
-            "primary_metric": "tests_passed/mean",
-            "tags": [
-              "commit0"
-            ]
-          },
-          {
-            "name": "gaia",
-            "path": "openhands/gaia",
-            "primary_metric": "correct/mean",
-            "tags": [
-              "gaia"
-            ]
-          }
-        ]
-      },
-      {
-        "name": "test",
-        "tasks": [
-          {
-            "name": "swe-bench",
-            "path": "openhands/swe-bench",
-            "primary_metric": "resolved/mean",
-            "tags": [
-              "swe-bench"
-            ]
-          },
-          {
-            "name": "multi-swe-bench",
-            "path": "openhands/multi-swe-bench",
-            "primary_metric": "resolved/mean",
-            "tags": [
-              "multi-swe-bench"
-            ]
-          },
-          {
-            "name": "arxivdigestables_test",
-            "path": "astabench/arxivdigestables_test",
-            "primary_metric": "score_tables/mean",
-            "tags": [
-              "lit"
-            ]
-          },
-          {
-            "name": "litqa2_test",
-            "path": "astabench/litqa2_test",
-            "primary_metric": "is_correct/accuracy",
-            "tags": [
-              "lit"
-            ]
-          },
-          {
-            "name": "discoverybench_test",
-            "path": "astabench/discoverybench_test",
-            "primary_metric": "score_discoverybench/mean",
-            "tags": [
-              "data"
-            ]
-          },
-          {
-            "name": "core_bench_test",
-            "path": "astabench/core_bench_test",
-            "primary_metric": "evaluate_task_questions/accuracy",
-            "tags": [
-              "code"
-            ]
-          },
-          {
-            "name": "ds1000_test",
-            "path": "astabench/ds1000_test",
-            "primary_metric": "ds1000_scorer/accuracy",
-            "tags": [
-              "code"
-            ]
-          },
-          {
-            "name": "e2e_discovery_test",
-            "path": "astabench/e2e_discovery_test",
-            "primary_metric": "score_rubric/accuracy",
-            "tags": [
-              "discovery"
-            ]
-          },
-          {
-            "name": "super_test",
-            "path": "astabench/super_test",
-            "primary_metric": "check_super_execution/entrypoints",
-            "tags": [
-              "code"
-            ]
-          }
-        ]
-      }
-    ]
-  },
-  "split": "validation",
-  "results": [
-    {
-      "task_name": "sqa_dev",
-      "metrics": [
-        {
-          "name": "global_avg/mean",
-          "value": 0.6215245045241414
-        },
-        {
-          "name": "global_avg/stderr",
-          "value": 0.02088486499225903
-        },
-        {
-          "name": "ingredient_recall/mean",
-          "value": 0.6029178145087237
-        },
-        {
-          "name": "ingredient_recall/stderr",
-          "value": 0.026215888361291618
-        },
-        {
-          "name": "answer_precision/mean",
-          "value": 0.7960436785436785
-        },
-        {
-          "name": "answer_precision/stderr",
-          "value": 0.027692773517249983
-        },
-        {
-          "name": "citation_precision/mean",
-          "value": 0.697849041353826
-        },
-        {
-          "name": "citation_precision/stderr",
-          "value": 0.026784164936602798
-        },
-        {
-          "name": "citation_recall/mean",
-          "value": 0.3892874836903378
-        },
-        {
-          "name": "citation_recall/stderr",
-          "value": 0.015094770200171756
-        }
-      ],
-      "model_costs": [
-        1.3829150000000001,
-        0.9759700000000001,
-        2.2324650000000004,
-        0.76631,
-        0.9277900000000001,
-        2.6388600000000006,
-        0.8114100000000002,
-        2.3263174999999996,
-        2.5423725,
-        1.2398675000000001,
-        1.7387300000000003,
-        1.2176599999999997,
-        0.564655,
-        0.9726750000000001,
-        0.7675700000000001,
-        1.5198850000000002,
-        1.4726625000000002,
-        2.1937650000000004,
-        0.6907700000000001,
-        1.39835,
-        1.2598175,
-        2.5373550000000002,
-        2.19239,
-        1.2508875000000006,
-        2.2650550000000007,
-        1.6047725,
-        0.6525125000000003,
-        1.4262200000000003,
-        1.0533299999999999,
-        1.7252375,
-        1.407145,
-        1.5408700000000004,
-        2.8073224999999993,
-        1.0448125000000006,
-        1.7037300000000004,
-        0.8650500000000001,
-        1.0171225000000002,
-        0.5697925000000001,
-        2.7851025,
-        1.0551425,
-        2.9213775,
-        1.7772975000000004,
-        1.2753225000000001,
-        0.8108325000000001,
-        0.6958375000000001,
-        0.8840950000000003,
-        1.2028724999999998,
-        1.2490475000000003,
-        2.4272,
-        1.95026,
-        1.5352475,
-        2.11181,
-        2.3612249999999997,
-        1.8619225000000004,
-        0.7431075000000001,
-        1.5189675000000002,
-        1.089575,
-        1.6103700000000003,
-        1.4201450000000002,
-        2.397835,
-        1.469175,
-        1.0723550000000004,
-        0.7964050000000003,
-        3.3733175,
-        4.197085,
-        4.2637675,
-        1.2982124999999998,
-        0.66146,
-        1.1130475000000002,
-        2.4393974999999997,
-        2.582,
-        1.7381725000000001,
-        0.415025,
-        1.6777325,
-        1.0507825000000002,
-        2.4627125000000003,
-        1.017005,
-        1.9210250000000002,
-        1.5009025000000003,
-        0.8283125000000001,
-        2.9854425,
-        0.4633375000000001,
-        0.397685,
-        1.2803425,
-        3.0388200000000003,
-        1.2610875000000004,
-        1.798365,
-        3.427287500000001,
-        0.29307750000000005,
-        0.37101249999999997,
-        2.8046925000000003,
-        0.35557000000000005,
-        3.5481700000000007,
-        1.1073975,
-        1.5280825,
-        1.1714900000000001,
-        3.1791275000000003,
-        3.8214725000000005,
-        1.8440275,
-        1.730515,
-        1.9350675000000002,
-        1.6592125000000002,
-        1.9227124999999998,
-        1.202885,
-        1.2688150000000002,
-        0.8819875000000001,
-        0.6989325,
-        1.965635,
-        1.7467800000000002,
-        1.6940625000000002
-      ]
-    }
-  ],
-  "submission": {
-    "submit_time": "2025-06-09T20:55:35.869831Z",
-    "username": "miked-ai",
-    "agent_name": "Basic ReAct",
-    "agent_description": null,
-    "agent_url": null,
-    "logs_url": "hf://datasets/allenai/asta-bench-internal-submissions/1.0.0-dev1/validation/miked-ai_Basic_ReAct__task_tools__report_editor__2025-06-09T20-55-35",
-    "logs_url_public": null,
-    "summary_url": null
-  }
-}
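The deleted backup file pairs a `suite_config` (tasks and their primary metrics, grouped by split) with per-task `results`. A minimal sketch of reading that shape follows; `primary_metrics` is a hypothetical helper, and the embedded JSON is a truncated excerpt of the deleted file:

```python
import json

# Truncated excerpt of the deleted agenteval_backup.json (same field names,
# most entries omitted for brevity).
backup = json.loads("""
{
  "suite_config": {
    "name": "openhands-index",
    "version": "1.0.0-dev1",
    "splits": [
      {"name": "validation",
       "tasks": [{"name": "swe-bench", "path": "openhands/swe-bench",
                  "primary_metric": "resolved/mean", "tags": ["swe-bench"]}]}
    ]
  },
  "split": "validation",
  "results": [
    {"task_name": "sqa_dev",
     "metrics": [{"name": "global_avg/mean", "value": 0.6215245045241414}],
     "model_costs": [1.3829150000000001, 0.9759700000000001]}
  ]
}
""")

def primary_metrics(cfg, split):
    # Map task name -> primary metric for the chosen split (hypothetical helper)
    for s in cfg["splits"]:
        if s["name"] == split:
            return {t["name"]: t["primary_metric"] for t in s["tasks"]}
    return {}

print(primary_metrics(backup["suite_config"], backup["split"]))
costs = backup["results"][0]["model_costs"]
print(sum(costs) / len(costs))  # average per-run model cost
```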