Remove hardcoded stats: load all config dynamically from test.raw.json
- Include test.raw.json in Space deployment (was excluded by .gitignore)
- Replace hardcoded fallbacks with FileNotFoundError in both schema.py files
- Replace hardcoded "3 web applications" with len(WEB_APPLICATIONS)
- Update README stats to match actual data (375 tasks, 3005 policies)
- README.md +2 -2
- app.py +1 -1
- data/test.raw.json +0 -0
- validation/schema.py +4 -2
README.md
CHANGED
|
@@ -23,7 +23,7 @@ short_description: "Safety & Trustworthiness Leaderboard for Web Agents"
|
|
| 23 |
|
| 24 |
**Evaluating Safety & Trustworthiness in Web Agents — ICLR 2025**
|
| 25 |
|
| 26 |
-
|
| 27 |
|
| 28 |
## Key Metrics
|
| 29 |
|
|
@@ -37,7 +37,7 @@ short_description: "Safety & Trustworthiness Leaderboard for Web Agents"
|
|
| 37 |
|
| 38 |
## How to Submit
|
| 39 |
|
| 40 |
-
1. Run the full benchmark on all
|
| 41 |
2. Generate your submission:
|
| 42 |
|
| 43 |
```bash
|
|
|
|
| 23 |
|
| 24 |
**Evaluating Safety & Trustworthiness in Web Agents — ICLR 2025**
|
| 25 |
|
| 26 |
+
375 tasks | 3,005 policies | 6 safety dimensions | 3 web applications
|
| 27 |
|
| 28 |
## Key Metrics
|
| 29 |
|
|
|
|
| 37 |
|
| 38 |
## How to Submit
|
| 39 |
|
| 40 |
+
1. Run the full benchmark on all 375 tasks
|
| 41 |
2. Generate your submission:
|
| 42 |
|
| 43 |
```bash
|
app.py
CHANGED
|
@@ -2114,7 +2114,7 @@ contact details.
|
|
| 2114 |
gr.Markdown(
|
| 2115 |
f"## About ST-WebAgentBench\n\n"
|
| 2116 |
f"**{EXPECTED_TASK_COUNT} tasks** | **{EXPECTED_POLICY_COUNT:,} policies** "
|
| 2117 |
-
f"| **{len(SAFETY_DIMENSIONS)} safety dimensions** | **
|
| 2118 |
"**Accepted at ICLR 2025** — ST-WebAgentBench evaluates web agents on both "
|
| 2119 |
"task completion **and** safety policy adherence — the first benchmark to "
|
| 2120 |
"systematically measure the safety-performance tradeoff in autonomous web agents.\n\n"
|
|
|
|
| 2114 |
gr.Markdown(
|
| 2115 |
f"## About ST-WebAgentBench\n\n"
|
| 2116 |
f"**{EXPECTED_TASK_COUNT} tasks** | **{EXPECTED_POLICY_COUNT:,} policies** "
|
| 2117 |
+
f"| **{len(SAFETY_DIMENSIONS)} safety dimensions** | **{len(WEB_APPLICATIONS)} web applications**\n\n"
|
| 2118 |
"**Accepted at ICLR 2025** — ST-WebAgentBench evaluates web agents on both "
|
| 2119 |
"task completion **and** safety policy adherence — the first benchmark to "
|
| 2120 |
"systematically measure the safety-performance tradeoff in autonomous web agents.\n\n"
|
data/test.raw.json
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
validation/schema.py
CHANGED
|
@@ -34,8 +34,10 @@ def _load_benchmark_config() -> tuple:
|
|
| 34 |
web_applications, tier_config).
|
| 35 |
"""
|
| 36 |
if not _TASKS_DATA_PATH.exists():
|
| 37 |
-
|
| 38 |
-
|
|
|
|
|
|
|
| 39 |
|
| 40 |
with open(_TASKS_DATA_PATH) as f:
|
| 41 |
tasks = json.load(f)
|
|
|
|
| 34 |
web_applications, tier_config).
|
| 35 |
"""
|
| 36 |
if not _TASKS_DATA_PATH.exists():
|
| 37 |
+
raise FileNotFoundError(
|
| 38 |
+
f"test.raw.json not found at {_TASKS_DATA_PATH}. "
|
| 39 |
+
"This file must be included in the Space deployment."
|
| 40 |
+
)
|
| 41 |
|
| 42 |
with open(_TASKS_DATA_PATH) as f:
|
| 43 |
tasks = json.load(f)
|