---
title: ST-WebAgentBench Leaderboard
emoji: 🛡️
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 6.6.0
app_file: app.py
pinned: true
license: mit
tags:
  - leaderboard
  - benchmark
  - web-agents
  - safety
  - ICLR
datasets:
  - ST-WebAgentBench/st-webagentbench
short_description: "Safety & Trustworthiness Leaderboard for Web Agents"
---

# ST-WebAgentBench Leaderboard

**Evaluating Safety & Trustworthiness in Web Agents — ICLR 2026**

375 tasks | 3,005 policies | 6 safety dimensions | 3 web applications

## Key Metrics

| Metric | Definition |
|--------|-----------|
| **CuP** (Completion under Policy, primary) | Task completed AND zero policy violations |
| **CR** (Completion Rate) | Task completed, ignoring policy violations |
| **Gap%** | The "safety tax": how much the success rate drops from CR to CuP once policies are enforced |
| **Risk Ratio** | Policy violation rate, reported separately for each safety dimension |
| **all-pass@k** | Reliability: CuP = 1 on ALL k independent runs of a task |
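
The definitions above can be illustrated with a small Python sketch. This is not the benchmark's scoring code: the `RunResult` record, its field names, and the example data are hypothetical, and the Gap% formula assumes the gap is the relative drop `(CR - CuP) / CR` expressed as a positive percentage. The official leaderboard may use a different sign or normalization.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RunResult:
    """One run of one task (hypothetical record, not the benchmark's schema)."""
    task_id: int
    completed: bool   # the task goal was reached
    violations: int   # number of policy violations observed in this run


def summarize(runs: list[RunResult]) -> dict[str, float]:
    """Compute CR, CuP, and Gap% over a flat list of runs."""
    n = len(runs)
    cr = sum(r.completed for r in runs) / n
    cup = sum(r.completed and r.violations == 0 for r in runs) / n
    # Assumption: Gap% is the relative drop from CR to CuP.
    gap_pct = 100.0 * (cr - cup) / cr if cr > 0 else 0.0
    return {"CR": cr, "CuP": cup, "Gap%": gap_pct}


def all_pass_at_k(runs: list[RunResult], k: int) -> float:
    """Fraction of tasks whose first k runs all have CuP = 1."""
    by_task: dict[int, list[bool]] = defaultdict(list)
    for r in runs:
        by_task[r.task_id].append(r.completed and r.violations == 0)
    # Only tasks with at least k recorded runs are counted.
    eligible = [passes[:k] for passes in by_task.values() if len(passes) >= k]
    if not eligible:
        return 0.0
    return sum(all(passes) for passes in eligible) / len(eligible)


# Made-up data: four tasks, two runs each.
example_runs = [
    RunResult(1, True, 0), RunResult(1, True, 0),
    RunResult(2, True, 1), RunResult(2, True, 0),
    RunResult(3, False, 0), RunResult(3, False, 0),
    RunResult(4, True, 0), RunResult(4, True, 0),
]
summary = summarize(example_runs)
reliability = all_pass_at_k(example_runs, k=2)
```

In this toy data, task 2 is completed on both runs but violates a policy once, so it counts toward CR on both runs, toward CuP on only one, and fails all-pass@2.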

## How to Submit

1. Run the full benchmark on all 375 tasks
2. Generate your submission:

```bash
python -m stwebagentbench.leaderboard.submit \
    --results-dir data/STWebAgentBenchEnv/browsergym \
    --agent-id "your-agent" \
    --model-name "gpt-4o" \
    --team "Your Team" \
    --code-url "https://github.com/your/repo" \
    --contact-email "you@example.com" \
    --output submission.json
```

3. Upload `submission.json` on the **Submit** tab
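
Before uploading, it can be worth confirming that the generated file is well-formed JSON. The sketch below is a generic check and makes no assumption about the submission schema, which is not documented here. The function name `load_submission` is hypothetical and is not part of the `stwebagentbench` package.

```python
import json
from pathlib import Path


def load_submission(path: str):
    """Parse the file as JSON and return its content.

    Raises FileNotFoundError if the file is missing and
    json.JSONDecodeError if the content is not valid JSON.
    """
    text = Path(path).read_text(encoding="utf-8")
    return json.loads(text)


# Usage (assumes submission.json is in the current directory):
#   data = load_submission("submission.json")
#   print(type(data).__name__)
```

This only catches truncated or malformed files. Any validation of the actual fields is left to the leaderboard itself.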

## Links

- [Paper (arXiv)](https://arxiv.org/abs/2410.06703)
- [Dataset (HuggingFace)](https://huggingface.co/datasets/ST-WebAgentBench/st-webagentbench)
- [GitHub Repository](https://github.com/segev-shlomov/ST-WebAgentBench)
- [Project Website](https://sites.google.com/view/st-webagentbench/home)

## Citation

```bibtex
@inproceedings{Levy2025STWebAgentBench,
    title={ST-WebAgentBench: A Benchmark for Evaluating Safety and
           Trustworthiness in Web Agents},
    author={Levy, Ido and Shlomov, Segev and Ben-David, Amir and
            Mirsky, Reuth and others},
    booktitle={ICLR},
    year={2025},
    url={https://arxiv.org/abs/2410.06703}
}
```