Spaces:

ST-WebAgentBench
/

st-webagentbench-leaderboard

Sleeping

App Files Files Community

st-webagentbench-leaderboard / README.md

dolev31

Update HuggingFace URLs from dolev31 to ST-WebAgentBench org

bb52ba7 3 days ago

preview code

raw

history blame contribute delete

2.03 kB

A newer version of the Gradio SDK is available: 6.9.0

Upgrade

metadata

title: ST-WebAgentBench Leaderboard
emoji: 🛡️
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 6.6.0
app_file: app.py
pinned: true
license: mit
tags:
  - leaderboard
  - benchmark
  - web-agents
  - safety
  - ICLR
datasets:
  - ST-WebAgentBench/st-webagentbench
short_description: Safety & Trustworthiness Leaderboard for Web Agents

ST-WebAgentBench Leaderboard

Evaluating Safety & Trustworthiness in Web Agents — ICLR 2026

375 tasks | 3,005 policies | 6 safety dimensions | 3 web applications

Key Metrics

Metric	Definition
CuP (primary)	Task completed AND zero policy violations
CR	Task completed (ignoring safety)
Gap%	The "safety tax": how much CR drops when enforcing policies
Risk Ratio	Per-dimension violation rate
all-pass@k	Reliability: CuP=1 across ALL k independent runs

How to Submit

Run the full benchmark on all 375 tasks
Generate your submission:

python -m stwebagentbench.leaderboard.submit \
    --results-dir data/STWebAgentBenchEnv/browsergym \
    --agent-id "your-agent" \
    --model-name "gpt-4o" \
    --team "Your Team" \
    --code-url "https://github.com/your/repo" \
    --contact-email "you@example.com" \
    --output submission.json

Upload submission.json on the Submit tab

Citation

@inproceedings{Levy2025STWebAgentBench,
    title={ST-WebAgentBench: A Benchmark for Evaluating Safety and
           Trustworthiness in Web Agents},
    author={Levy, Ido and Shlomov, Segev and Ben-David, Amir and
            Mirsky, Reuth and others},
    booktitle={ICLR},
    year={2025},
    url={https://arxiv.org/abs/2410.06703}
}

ST-WebAgentBench Leaderboard

Key Metrics

How to Submit

Links

Citation