title: ST-WebAgentBench Leaderboard
emoji: 🛡️
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 6.6.0
app_file: app.py
pinned: true
license: mit
tags:
- leaderboard
- benchmark
- web-agents
- safety
- ICLR
datasets:
- ST-WebAgentBench/st-webagentbench
short_description: Safety & Trustworthiness Leaderboard for Web Agents
ST-WebAgentBench Leaderboard
Evaluating Safety & Trustworthiness in Web Agents — ICLR 2026
375 tasks | 3,005 policies | 6 safety dimensions | 3 web applications
Key Metrics
| Metric | Definition |
|---|---|
| CuP (primary) | Task completed AND zero policy violations |
| CR | Task completed (ignoring safety) |
| Gap% | The "safety tax": how far CuP falls below CR once policy compliance is required |
| Risk Ratio | Per-dimension violation rate |
| all-pass@k | Reliability: CuP=1 across ALL k independent runs |
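
The definitions in the table can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own scoring code: the `RunRecord` layout and the function names are hypothetical, and the Gap% formula (relative drop from CR to CuP) is an assumption based on the description above. Risk Ratio is omitted because it needs per-dimension violation data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RunRecord:
    """One run of one task (hypothetical layout, not the benchmark's schema)."""
    task_id: int
    completed: bool   # the task goal was reached
    violations: int   # number of policy violations observed in this run


def is_cup(run):
    """CuP for a single run: task completed AND zero policy violations."""
    return run.completed and run.violations == 0


def completion_rate(runs):
    """CR: fraction of runs that completed the task, ignoring safety."""
    return sum(r.completed for r in runs) / len(runs)


def cup_rate(runs):
    """CuP: fraction of runs that completed the task with no violations."""
    return sum(is_cup(r) for r in runs) / len(runs)


def gap_percent(cr, cup):
    """Assumed formula: relative drop from CR to CuP, in percent."""
    return 0.0 if cr == 0 else 100.0 * (cr - cup) / cr


def all_pass_at_k(runs, k):
    """Fraction of tasks whose first k runs all satisfy CuP."""
    by_task = defaultdict(list)
    for r in runs:
        by_task[r.task_id].append(is_cup(r))
    passed = sum(len(v) >= k and all(v[:k]) for v in by_task.values())
    return passed / len(by_task)


# Toy data: three tasks, two runs each.
runs = [
    RunRecord(1, True, 0), RunRecord(1, True, 0),
    RunRecord(2, True, 1), RunRecord(2, True, 0),
    RunRecord(3, False, 0), RunRecord(3, False, 0),
]
cr, cup = completion_rate(runs), cup_rate(runs)
print(cr, cup, gap_percent(cr, cup), all_pass_at_k(runs, k=2))
```

In the toy data, task 2 completes in both runs but violates a policy in one of them, so it counts toward CR twice, toward CuP once, and fails all-pass@2.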
How to Submit
- Run the full benchmark on all 375 tasks
- Generate your submission:
  python -m stwebagentbench.leaderboard.submit \
      --results-dir data/STWebAgentBenchEnv/browsergym \
      --agent-id "your-agent" \
      --model-name "gpt-4o" \
      --team "Your Team" \
      --code-url "https://github.com/your/repo" \
      --contact-email "you@example.com" \
      --output submission.json
- Upload submission.json on the Submit tab
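
Before uploading, it can help to confirm that the generated file exists and parses as JSON. The sketch below checks only that; it assumes nothing about the submission schema, and `load_submission` is a hypothetical helper, not part of the `stwebagentbench` package.

```python
import json
from pathlib import Path


def load_submission(path):
    """Parse a submission file, raising ValueError if it is missing, empty, or malformed."""
    p = Path(path)
    if not p.is_file():
        raise ValueError(f"{p} does not exist")
    text = p.read_text(encoding="utf-8")
    if not text.strip():
        raise ValueError(f"{p} is empty")
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"{p} is not valid JSON: {exc}") from exc
```

Calling `load_submission("submission.json")` returns the parsed content, so a quick look at its top-level structure is possible before using the Submit tab.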
Links
Citation
@inproceedings{Levy2025STWebAgentBench,
  title={ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents},
  author={Levy, Ido and Shlomov, Segev and Ben-David, Amir and Mirsky, Reuth and others},
  booktitle={ICLR},
  year={2025},
  url={https://arxiv.org/abs/2410.06703}
}