Commit 35aa299 · 1 Parent(s): 7abbbe1
openhands committed

Remove asta/astabench references


- Update utm_source from asta_leaderboard to openhands_index
- Remove 'Asta' from toolset description
- Update citation text to use openhands-index
- Update legal disclaimer with OpenHands URL
- Simplify README Hugging Face integration section
- Delete unused agenteval_backup.json file

Keep acknowledgement to AstaBench in about.py as it credits the source.

Files changed (3)
  1. README.md +5 -19
  2. content.py +6 -6
  3. data/1.0.0-dev1/agenteval_backup.json +0 -308
README.md CHANGED
@@ -37,29 +37,15 @@ python app.py
 This will start a local server that you can access in your web browser at `http://localhost:7860`.
 
 ## Hugging Face Integration
-The repo backs two Hugging Face leaderboard spaces:
-- https://huggingface.co/spaces/allenai/asta-bench-internal-leaderboard
-- https://huggingface.co/spaces/allenai/asta-bench-leaderboard
+The repo backs the Hugging Face space at https://huggingface.co/spaces/OpenHands/openhands-index
 
-Please follow the steps below to push changes to the leaderboards on Hugging Face.
+Please follow the steps below to push changes to the leaderboard on Hugging Face.
 
-Before pushing, make sure to merge your changes to the `main` branch of this repository. (following the standard GitHub workflow of creating a branch, making changes, and then merging it back to `main`).
+Before pushing, make sure to merge your changes to the `main` branch of this repository (following the standard GitHub workflow of creating a branch, making changes, and then merging it back to `main`).
 
-Before pushing for the first time, you'll need to add the Hugging Face remote repositories if you haven't done so already. You can do this by running the following commands:
-
-```bash
-git remote add huggingface https://huggingface.co/spaces/allenai/asta-bench-internal-leaderboard
-git remote add huggingface-public https://huggingface.co/spaces/allenai/asta-bench-leaderboard
-```
-You can verify that the remotes have been added by running:
-
-```bash
-git remote -v
-```
-Then, to push the changes to the Hugging Face leaderboards, you can use the following commands:
-
-```bash
-git push huggingface main:main
-git push huggingface-public main:main
+Then, to push the changes to the Hugging Face leaderboard:
+
+```bash
+git push origin main
 ```
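The simplified README flow (branch, merge to `main`, push) can be exercised end-to-end in a throwaway repository. The sketch below is illustrative: the branch name `my-change` is hypothetical, and the final push to the Space is left as a comment since it requires real Hugging Face credentials.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q -b main
git config user.email demo@example.com
git config user.name demo
echo "# demo" > README.md
git add README.md && git commit -qm "initial"

git checkout -qb my-change        # feature branch (name is illustrative)
echo "update" >> README.md
git commit -qam "make a change"

git checkout -q main
git merge -q my-change            # merge the branch back into main
# git push origin main            # then push main to the Hugging Face space
git log --oneline | wc -l
```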
content.py CHANGED
@@ -92,7 +92,7 @@ DISCOVERY_BENCH_URL = "https://www.semanticscholar.org/paper/DiscoveryBench%3A-T
 
 # Helper function to create external links
 def external_link(url, text, is_s2_url=False):
-    url = f"{url}?utm_source=asta_leaderboard" if is_s2_url else url
+    url = f"{url}?utm_source=openhands_index" if is_s2_url else url
     return f"<a href='{url}' target='_blank' rel='noopener noreferrer'>{text}</a>"
 
 def internal_leaderboard_link(text, validation):
@@ -122,7 +122,7 @@ def get_benchmark_description(benchmark_name, validation):
             f"{external_link(LITQA2_URL, 'LitQA2', is_s2_url=True)}, a benchmark introduced by FutureHouse, gauges a model's ability to answer questions that require document retrieval from the scientific literature. "
             "It consists of multiple-choice questions that necessitate finding a unique paper and analyzing its detailed full text to spot precise information; these questions cannot be answered from a paper’s abstract. "
             "While the original version of the benchmark provided for each question the title of the paper in which the answer can be found, it did not specify the overall collection to search over. In our version, "
-            "we search over the index we provide as part of the Asta standard toolset. The “-FullText” suffix indicates we consider only the subset of LitQA2 questions for which "
+            "we search over the index we provide as part of the standard toolset. The “-FullText” suffix indicates we consider only the subset of LitQA2 questions for which "
             "the full-text version of the answering paper is open source and available in our index."
         ),
         'ArxivDIGESTables-Clean': (
@@ -175,9 +175,9 @@ def get_benchmark_description(benchmark_name, validation):
     return descriptions.get(benchmark_name, "")
 
 CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
-CITATION_BUTTON_TEXT = r"""@article{asta-bench,
+CITATION_BUTTON_TEXT = r"""@article{openhands-index,
 title={OpenHands Index},
-author={OpenHands Index folks},
+author={OpenHands Team},
 year={2025},
 eprint={TBD.TBD},
 archivePrefix={arXiv},
@@ -188,11 +188,11 @@ CITATION_BUTTON_TEXT = r"""@article{asta-bench,
 LEGAL_DISCLAIMER_TEXT = """
 <h2>Terms and Conditions</h2>
 <p>
-The Allen Institute for Artificial Intelligence (Ai2) maintains this repository for agent evaluation submissions to OpenHands Index. To keep OpenHands Index fair and auditable, all evaluation logs and associated submission files will be made publicly available. This includes your benchmark inputs, model output responses, and other data and information related to your submission as needed to verify the results.
+OpenHands maintains this repository for agent evaluation submissions to OpenHands Index. To keep OpenHands Index fair and auditable, all evaluation logs and associated submission files will be made publicly available. This includes your benchmark inputs, model output responses, and other data and information related to your submission as needed to verify the results.
 </p>
 <br>
 <p>
-Your submissions to OpenHands Index will be posted, scored, and ranked on the leaderboard at <a href="https://huggingface.co/spaces/allenai/asta-bench-leaderboard" target="_blank" rel="noopener noreferrer">https://huggingface.co/spaces/allenai/asta-bench-leaderboard</a>. You agree you have the rights to the materials you submit and that you will not share any personal, sensitive, proprietary, or confidential information.
+Your submissions to OpenHands Index will be posted, scored, and ranked on the leaderboard at <a href="https://huggingface.co/spaces/OpenHands/openhands-index" target="_blank" rel="noopener noreferrer">https://huggingface.co/spaces/OpenHands/openhands-index</a>. You agree you have the rights to the materials you submit and that you will not share any personal, sensitive, proprietary, or confidential information.
 </p>
 """
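The `external_link` helper touched by this commit appends a UTM parameter only for Semantic Scholar URLs. It can be exercised in isolation; the snippet below copies the function body as it reads after the change:

```python
# Copy of content.py's external_link helper after this commit.
def external_link(url, text, is_s2_url=False):
    # Append the UTM source only for Semantic Scholar URLs
    url = f"{url}?utm_source=openhands_index" if is_s2_url else url
    return f"<a href='{url}' target='_blank' rel='noopener noreferrer'>{text}</a>"

print(external_link("https://www.semanticscholar.org/paper/x", "LitQA2", is_s2_url=True))
print(external_link("https://example.com", "docs"))
```

One thing worth noting: the helper always joins with `?`, so a Semantic Scholar URL that already carries a query string would end up with two `?` characters; the current call sites apparently pass bare URLs.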
data/1.0.0-dev1/agenteval_backup.json DELETED
@@ -1,308 +0,0 @@
-{
-  "suite_config": {
-    "name": "openhands-index",
-    "version": "1.0.0-dev1",
-    "splits": [
-      {
-        "name": "validation",
-        "tasks": [
-          {
-            "name": "swe-bench",
-            "path": "openhands/swe-bench",
-            "primary_metric": "resolved/mean",
-            "tags": [
-              "swe-bench"
-            ]
-          },
-          {
-            "name": "multi-swe-bench",
-            "path": "openhands/multi-swe-bench",
-            "primary_metric": "resolved/mean",
-            "tags": [
-              "multi-swe-bench"
-            ]
-          },
-          {
-            "name": "swe-bench-multimodal",
-            "path": "openhands/swe-bench-multimodal",
-            "primary_metric": "resolved/mean",
-            "tags": [
-              "swe-bench-multimodal"
-            ]
-          },
-          {
-            "name": "swt-bench",
-            "path": "openhands/swt-bench",
-            "primary_metric": "generated/mean",
-            "tags": [
-              "swt-bench"
-            ]
-          },
-          {
-            "name": "commit0",
-            "path": "openhands/commit0",
-            "primary_metric": "tests_passed/mean",
-            "tags": [
-              "commit0"
-            ]
-          },
-          {
-            "name": "gaia",
-            "path": "openhands/gaia",
-            "primary_metric": "correct/mean",
-            "tags": [
-              "gaia"
-            ]
-          }
-        ]
-      },
-      {
-        "name": "test",
-        "tasks": [
-          {
-            "name": "swe-bench",
-            "path": "openhands/swe-bench",
-            "primary_metric": "resolved/mean",
-            "tags": [
-              "swe-bench"
-            ]
-          },
-          {
-            "name": "multi-swe-bench",
-            "path": "openhands/multi-swe-bench",
-            "primary_metric": "resolved/mean",
-            "tags": [
-              "multi-swe-bench"
-            ]
-          },
-          {
-            "name": "arxivdigestables_test",
-            "path": "astabench/arxivdigestables_test",
-            "primary_metric": "score_tables/mean",
-            "tags": [
-              "lit"
-            ]
-          },
-          {
-            "name": "litqa2_test",
-            "path": "astabench/litqa2_test",
-            "primary_metric": "is_correct/accuracy",
-            "tags": [
-              "lit"
-            ]
-          },
-          {
-            "name": "discoverybench_test",
-            "path": "astabench/discoverybench_test",
-            "primary_metric": "score_discoverybench/mean",
-            "tags": [
-              "data"
-            ]
-          },
-          {
-            "name": "core_bench_test",
-            "path": "astabench/core_bench_test",
-            "primary_metric": "evaluate_task_questions/accuracy",
-            "tags": [
-              "code"
-            ]
-          },
-          {
-            "name": "ds1000_test",
-            "path": "astabench/ds1000_test",
-            "primary_metric": "ds1000_scorer/accuracy",
-            "tags": [
-              "code"
-            ]
-          },
-          {
-            "name": "e2e_discovery_test",
-            "path": "astabench/e2e_discovery_test",
-            "primary_metric": "score_rubric/accuracy",
-            "tags": [
-              "discovery"
-            ]
-          },
-          {
-            "name": "super_test",
-            "path": "astabench/super_test",
-            "primary_metric": "check_super_execution/entrypoints",
-            "tags": [
-              "code"
-            ]
-          }
-        ]
-      }
-    ]
-  },
-  "split": "validation",
-  "results": [
-    {
-      "task_name": "sqa_dev",
-      "metrics": [
-        {
-          "name": "global_avg/mean",
-          "value": 0.6215245045241414
-        },
-        {
-          "name": "global_avg/stderr",
-          "value": 0.02088486499225903
-        },
-        {
-          "name": "ingredient_recall/mean",
-          "value": 0.6029178145087237
-        },
-        {
-          "name": "ingredient_recall/stderr",
-          "value": 0.026215888361291618
-        },
-        {
-          "name": "answer_precision/mean",
-          "value": 0.7960436785436785
-        },
-        {
-          "name": "answer_precision/stderr",
-          "value": 0.027692773517249983
-        },
-        {
-          "name": "citation_precision/mean",
-          "value": 0.697849041353826
-        },
-        {
-          "name": "citation_precision/stderr",
-          "value": 0.026784164936602798
-        },
-        {
-          "name": "citation_recall/mean",
-          "value": 0.3892874836903378
-        },
-        {
-          "name": "citation_recall/stderr",
-          "value": 0.015094770200171756
-        }
-      ],
-      "model_costs": [
-        1.3829150000000001,
-        0.9759700000000001,
-        2.2324650000000004,
-        0.76631,
-        0.9277900000000001,
-        2.6388600000000006,
-        0.8114100000000002,
-        2.3263174999999996,
-        2.5423725,
-        1.2398675000000001,
-        1.7387300000000003,
-        1.2176599999999997,
-        0.564655,
-        0.9726750000000001,
-        0.7675700000000001,
-        1.5198850000000002,
-        1.4726625000000002,
-        2.1937650000000004,
-        0.6907700000000001,
-        1.39835,
-        1.2598175,
-        2.5373550000000002,
-        2.19239,
-        1.2508875000000006,
-        2.2650550000000007,
-        1.6047725,
-        0.6525125000000003,
-        1.4262200000000003,
-        1.0533299999999999,
-        1.7252375,
-        1.407145,
-        1.5408700000000004,
-        2.8073224999999993,
-        1.0448125000000006,
-        1.7037300000000004,
-        0.8650500000000001,
-        1.0171225000000002,
-        0.5697925000000001,
-        2.7851025,
-        1.0551425,
-        2.9213775,
-        1.7772975000000004,
-        1.2753225000000001,
-        0.8108325000000001,
-        0.6958375000000001,
-        0.8840950000000003,
-        1.2028724999999998,
-        1.2490475000000003,
-        2.4272,
-        1.95026,
-        1.5352475,
-        2.11181,
-        2.3612249999999997,
-        1.8619225000000004,
-        0.7431075000000001,
-        1.5189675000000002,
-        1.089575,
-        1.6103700000000003,
-        1.4201450000000002,
-        2.397835,
-        1.469175,
-        1.0723550000000004,
-        0.7964050000000003,
-        3.3733175,
-        4.197085,
-        4.2637675,
-        1.2982124999999998,
-        0.66146,
-        1.1130475000000002,
-        2.4393974999999997,
-        2.582,
-        1.7381725000000001,
-        0.415025,
-        1.6777325,
-        1.0507825000000002,
-        2.4627125000000003,
-        1.017005,
-        1.9210250000000002,
-        1.5009025000000003,
-        0.8283125000000001,
-        2.9854425,
-        0.4633375000000001,
-        0.397685,
-        1.2803425,
-        3.0388200000000003,
-        1.2610875000000004,
-        1.798365,
-        3.427287500000001,
-        0.29307750000000005,
-        0.37101249999999997,
-        2.8046925000000003,
-        0.35557000000000005,
-        3.5481700000000007,
-        1.1073975,
-        1.5280825,
-        1.1714900000000001,
-        3.1791275000000003,
-        3.8214725000000005,
-        1.8440275,
-        1.730515,
-        1.9350675000000002,
-        1.6592125000000002,
-        1.9227124999999998,
-        1.202885,
-        1.2688150000000002,
-        0.8819875000000001,
-        0.6989325,
-        1.965635,
-        1.7467800000000002,
-        1.6940625000000002
-      ]
-    }
-  ],
-  "submission": {
-    "submit_time": "2025-06-09T20:55:35.869831Z",
-    "username": "miked-ai",
-    "agent_name": "Basic ReAct",
-    "agent_description": null,
-    "agent_url": null,
-    "logs_url": "hf://datasets/allenai/asta-bench-internal-submissions/1.0.0-dev1/validation/miked-ai_Basic_ReAct__task_tools__report_editor__2025-06-09T20-55-35",
-    "logs_url_public": null,
-    "summary_url": null
-  }
-}
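The deleted backup file pairs a `suite_config` (tasks and their primary metrics, grouped by split) with per-task `results`. A minimal sketch of reading that shape follows; `primary_metrics` is a hypothetical helper, and the embedded JSON is a truncated excerpt of the deleted file:

```python
import json

# Truncated excerpt of the deleted agenteval_backup.json (same field names,
# most entries omitted for brevity).
backup = json.loads("""
{
  "suite_config": {
    "name": "openhands-index",
    "version": "1.0.0-dev1",
    "splits": [
      {"name": "validation",
       "tasks": [{"name": "swe-bench", "path": "openhands/swe-bench",
                  "primary_metric": "resolved/mean", "tags": ["swe-bench"]}]}
    ]
  },
  "split": "validation",
  "results": [
    {"task_name": "sqa_dev",
     "metrics": [{"name": "global_avg/mean", "value": 0.6215245045241414}],
     "model_costs": [1.3829150000000001, 0.9759700000000001]}
  ]
}
""")

def primary_metrics(cfg, split):
    # Map task name -> primary metric for the chosen split (hypothetical helper)
    for s in cfg["splits"]:
        if s["name"] == split:
            return {t["name"]: t["primary_metric"] for t in s["tasks"]}
    return {}

print(primary_metrics(backup["suite_config"], backup["split"]))
costs = backup["results"][0]["model_costs"]
print(sum(costs) / len(costs))  # average per-run model cost
```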