Sync Space INTRO+TABLE_INTRO+citation: Intercepted is sort key, full author list in BibTeX
app.py
@@ -16,22 +16,24 @@ RESULTS_URL = (
 )
 
 
-CITATION = """@
-title
-author
-year
-eprint
+CITATION = """@misc{zhang2026clawbench,
+title = {ClawBench: Can AI Agents Complete Everyday Online Tasks?},
+author = {Yuxuan Zhang and Yubo Wang and Yipeng Zhu and Penghui Du and Junwen Miao and Xuan Lu and Wendong Xu and Yunzhuo Hao and Songcheng Cai and Xiaochen Wang and Huaisong Zhang and Xian Wu and Yi Lu and Minyi Lei and Kai Zou and Huifeng Yin and Ping Nie and Liang Chen and Dongfu Jiang and Wenhu Chen and Kelsey R. Allen},
+year = {2026},
+eprint = {2604.08523},
 archivePrefix = {arXiv},
+primaryClass = {cs.AI},
+url = {https://arxiv.org/abs/2604.08523}
 }"""
 
 INTRO = """# ClawBench - Web Agent Benchmark
 
-**Can AI agents complete everyday online tasks?** ClawBench scores agents on real, live websites (booking flights, ordering groceries, submitting job applications). Two corpora: **V1** - 153 tasks across 144 websites · **V2** - 130 newer tasks across 63 platforms. Every run is graded twice: a deterministic HTTP-request *interception* check (Stage 1
+**Can AI agents complete everyday online tasks?** ClawBench scores agents on real, live websites (booking flights, ordering groceries, submitting job applications). Two corpora: **V1** - 153 tasks across 144 websites · **V2** - 130 newer tasks across 63 platforms. Every run is graded twice: a deterministic HTTP-request *interception* check (Stage 1, the sort key) - then an LLM *judge* on the intercepted payload (Stage 2 = `Reward`).
 
 [**Paper**](https://arxiv.org/abs/2604.08523) · [**GitHub**](https://github.com/reacher-z/ClawBench) · [**Dataset**](https://huggingface.co/datasets/TIGER-Lab/ClawBench) · [**Traces V1**](https://huggingface.co/datasets/NAIL-Group/ClawBenchV1Trace) · [**Traces V2**](https://huggingface.co/datasets/TIGER-Lab/ClawBenchV2Trace) · [**Site**](https://claw-bench.com)
 """
 
-TABLE_INTRO = """**Intercepted** = agent's final HTTP request matched the task's URL/method schema. **Reward** =
+TABLE_INTRO = """**Intercepted** (sort key) = agent's final HTTP request matched the task's URL/method schema - Stage 1, deterministic, no judge. **Reward** = additionally requires the LLM judge (default `deepseek/deepseek-v4-pro`) to confirm the payload fulfilled the instruction - Stage 2. Rows are ranked by Intercepted (corpus-normalized: `intercepted / 130` for V2 so partials don't outrank complete batches) with Reward as tiebreak. `-` = no Stage-2 data yet."""
 
 ABOUT = """## About ClawBench
 
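
Note on the ranking rule in the new TABLE_INTRO: the sketch below is one way to read "ranked by Intercepted (corpus-normalized) with Reward as tiebreak". It is a minimal illustration, not the Space's actual code. The `Row` class, `CORPUS_SIZE`, `sort_key`, and `rank` are hypothetical names, and three points are assumptions: that V1 is normalized by its 153 tasks the same way V2 is by 130, that Reward is also a task count, and that rows with no Stage-2 data lose ties.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Task counts per corpus, taken from the INTRO text (V1: 153, V2: 130).
# Normalizing V1 by 153 is an assumption; the diff only states `intercepted / 130` for V2.
CORPUS_SIZE = {"V1": 153, "V2": 130}


@dataclass
class Row:
    """Hypothetical leaderboard row; field names are illustrative only."""
    model: str
    corpus: str
    intercepted: int               # Stage 1: tasks whose final request matched the schema
    reward: Optional[int] = None   # Stage 2: tasks the judge confirmed; None = no data yet


def sort_key(row: Row) -> Tuple[float, float]:
    """Primary key: Intercepted divided by the full corpus size.
    Tiebreak: Reward divided by the same size, with missing Reward sorting last."""
    size = CORPUS_SIZE[row.corpus]
    intercepted_rate = row.intercepted / size
    reward_rate = -1.0 if row.reward is None else row.reward / size
    return (intercepted_rate, reward_rate)


def rank(rows: List[Row]) -> List[Row]:
    """Return rows ordered best first."""
    return sorted(rows, key=sort_key, reverse=True)


if __name__ == "__main__":
    rows = [
        Row("partial-run", "V2", intercepted=40),            # only part of the batch was run
        Row("full-run", "V2", intercepted=65, reward=30),
        Row("full-run-unjudged", "V2", intercepted=65),      # no Stage-2 data yet
    ]
    for r in rank(rows):
        print(r.model, sort_key(r))
```

Dividing by the full corpus size rather than by the number of attempted tasks is what keeps a partial batch from outranking a complete one: a run that intercepted 40 of 50 attempted V2 tasks still scores 40 / 130.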