AgPerry committed
Commit a478c75 · verified · 1 Parent(s): e77a483

Sync Space INTRO+TABLE_INTRO+citation: Intercepted is sort key, full author list in BibTeX

Files changed (1)
  1. app.py +9 -7
app.py CHANGED
@@ -16,22 +16,24 @@ RESULTS_URL = (
  )


- CITATION = """@article{zhang2026clawbench,
- title = {ClawBench: Can AI Agents Complete Everyday Online Tasks?},
- author = {Zhang, Xiaochen and others},
- year = {2026},
- eprint = {2604.08523},
+ CITATION = """@misc{zhang2026clawbench,
+ title = {ClawBench: Can AI Agents Complete Everyday Online Tasks?},
+ author = {Yuxuan Zhang and Yubo Wang and Yipeng Zhu and Penghui Du and Junwen Miao and Xuan Lu and Wendong Xu and Yunzhuo Hao and Songcheng Cai and Xiaochen Wang and Huaisong Zhang and Xian Wu and Yi Lu and Minyi Lei and Kai Zou and Huifeng Yin and Ping Nie and Liang Chen and Dongfu Jiang and Wenhu Chen and Kelsey R. Allen},
+ year = {2026},
+ eprint = {2604.08523},
  archivePrefix = {arXiv},
+ primaryClass = {cs.AI},
+ url = {https://arxiv.org/abs/2604.08523}
  }"""

  INTRO = """# 🏆 ClawBench — Web Agent Benchmark

- **Can AI agents complete everyday online tasks?** ClawBench scores agents on real, live websites (booking flights, ordering groceries, submitting job applications). Two corpora: **V1** — 153 tasks across 144 websites · **V2** — 130 newer tasks across 63 platforms. Every run is graded twice: a deterministic HTTP-request *interception* check (Stage 1), then an LLM *judge* on the intercepted payload (Stage 2 — the headline `Reward`).
+ **Can AI agents complete everyday online tasks?** ClawBench scores agents on real, live websites (booking flights, ordering groceries, submitting job applications). Two corpora: **V1** — 153 tasks across 144 websites · **V2** — 130 newer tasks across 63 platforms. Every run is graded twice: a deterministic HTTP-request *interception* check (Stage 1, the sort key) — then an LLM *judge* on the intercepted payload (Stage 2 = `Reward`).

  [**📖 Paper**](https://arxiv.org/abs/2604.08523) · [**💻 GitHub**](https://github.com/reacher-z/ClawBench) · [**🗂 Dataset**](https://huggingface.co/datasets/TIGER-Lab/ClawBench) · [**🎞 Traces V1**](https://huggingface.co/datasets/NAIL-Group/ClawBenchV1Trace) · [**🎞 Traces V2**](https://huggingface.co/datasets/TIGER-Lab/ClawBenchV2Trace) · [**🌐 Site**](https://claw-bench.com)
  """

- TABLE_INTRO = """**Intercepted** = agent's final HTTP request matched the task's URL/method schema. **Reward** = AND passed the LLM judge on the payload (default judge: `deepseek/deepseek-v4-pro`). Rows are ranked by Reward, then Intercepted as tiebreak. `—` means no Stage-2 data available."""
+ TABLE_INTRO = """**Intercepted** (sort key) = agent's final HTTP request matched the task's URL/method schema — Stage 1, deterministic, no judge. **Reward** = additionally requires the LLM judge (default `deepseek/deepseek-v4-pro`) to confirm the payload fulfilled the instruction — Stage 2. Rows are ranked by Intercepted (corpus-normalized: `intercepted / 130` for V2 so partials don't outrank complete batches) with Reward as tiebreak. `—` = no Stage-2 data yet."""

  ABOUT = """## About ClawBench

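
Note on the ranking rule: the new TABLE_INTRO text says rows are ordered by Intercepted, normalized by corpus size, with Reward as the tiebreak. The diff does not show the sorting code itself, so the following is only a minimal sketch of how such a key could look. The `Row` class, its field names, `CORPUS_SIZE`, `sort_key`, and `rank` are all hypothetical, and two details are assumptions: that both columns are stored as task counts, and that a row with no Stage 2 data loses ties.

```python
from dataclasses import dataclass
from typing import Optional

# Corpus sizes quoted in the INTRO string: V1 has 153 tasks, V2 has 130.
CORPUS_SIZE = {"V1": 153, "V2": 130}


@dataclass
class Row:
    """Hypothetical leaderboard row; field names are illustrative, not from app.py."""
    model: str
    corpus: str              # "V1" or "V2"
    intercepted: int         # tasks that passed the Stage 1 interception check
    reward: Optional[int]    # tasks that also passed the Stage 2 judge; None if no data


def sort_key(row: Row) -> tuple[float, float]:
    """Primary key: Intercepted over the full corpus size. Tiebreak: Reward."""
    size = CORPUS_SIZE[row.corpus]
    # Dividing by the full corpus size (not by tasks attempted) keeps a partial
    # run with a high pass rate from outranking a complete batch.
    intercepted_rate = row.intercepted / size
    # Assumption: rows without Stage 2 data sort below rows that have it.
    reward_rate = -1.0 if row.reward is None else row.reward / size
    return (intercepted_rate, reward_rate)


def rank(rows: list[Row]) -> list[Row]:
    """Return rows ordered best first."""
    return sorted(rows, key=sort_key, reverse=True)


if __name__ == "__main__":
    rows = [
        Row("B", "V2", intercepted=10, reward=8),     # partial run, few tasks attempted
        Row("A", "V2", intercepted=40, reward=20),
        Row("D", "V2", intercepted=40, reward=None),  # Stage 2 not run yet
        Row("C", "V2", intercepted=40, reward=25),
    ]
    for row in rank(rows):
        print(row.model, sort_key(row))
```

In this sketch, A, C, and D tie on Intercepted (40 of 130), so Reward decides their order, while the partial run B stays last because its 10 intercepted tasks are divided by 130 rather than by the number of tasks it attempted.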