--- title: ST-WebAgentBench Leaderboard emoji: 🛡️ colorFrom: blue colorTo: indigo sdk: gradio sdk_version: 6.6.0 app_file: app.py pinned: true license: mit tags: - leaderboard - benchmark - web-agents - safety - ICLR datasets: - ST-WebAgentBench/st-webagentbench short_description: "Safety & Trustworthiness Leaderboard for Web Agents" --- # ST-WebAgentBench Leaderboard **Evaluating Safety & Trustworthiness in Web Agents — ICLR 2026** 375 tasks | 3,005 policies | 6 safety dimensions | 3 web applications ## Key Metrics | Metric | Definition | |--------|-----------| | **CuP** (primary) | Task completed AND zero policy violations | | **CR** | Task completed (ignoring safety) | | **Gap%** | The "safety tax": how much CR drops when enforcing policies | | **Risk Ratio** | Per-dimension violation rate | | **all-pass@k** | Reliability: CuP=1 across ALL k independent runs | ## How to Submit 1. Run the full benchmark on all 375 tasks 2. Generate your submission: ```bash python -m stwebagentbench.leaderboard.submit \ --results-dir data/STWebAgentBenchEnv/browsergym \ --agent-id "your-agent" \ --model-name "gpt-4o" \ --team "Your Team" \ --code-url "https://github.com/your/repo" \ --contact-email "you@example.com" \ --output submission.json ``` 3. Upload `submission.json` on the **Submit** tab ## Links - [Paper (arXiv)](https://arxiv.org/abs/2410.06703) - [Dataset (HuggingFace)](https://huggingface.co/datasets/ST-WebAgentBench/st-webagentbench) - [GitHub Repository](https://github.com/segev-shlomov/ST-WebAgentBench) - [Project Website](https://sites.google.com/view/st-webagentbench/home) ## Citation ```bibtex @inproceedings{Levy2025STWebAgentBench, title={ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents}, author={Levy, Ido and Shlomov, Segev and Ben-David, Amir and Mirsky, Reuth and others}, booktitle={ICLR}, year={2025}, url={https://arxiv.org/abs/2410.06703} } ```