Spaces: TIGER-Lab / ClawBench

AgPerry committed 15 days ago

Commit 41e181d (verified) · 1 parent: dbb3bdb

Upload folder using huggingface_hub

Files changed (3)

README.md +27 -8
app.py +195 -0
requirements.txt +2 -0

README.md CHANGED

@@ -1,13 +1,32 @@
 ---
-title: ClawBench
-emoji: 😻
-colorFrom: yellow
-colorTo: indigo
+title: ClawBench Leaderboard
+emoji: 🦀
+colorFrom: indigo
+colorTo: purple
 sdk: gradio
-sdk_version: 6.14.0
-python_version: '3.13'
+sdk_version: 5.15.0
 app_file: app.py
-pinned: false
+pinned: true
+license: apache-2.0
+short_description: Can AI agents complete everyday online tasks?
+tags:
+  - leaderboard
+  - benchmark
+  - web-agents
+  - browser-automation
+  - agent-evaluation
+  - llm-evaluation
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+# ClawBench — Leaderboard
+Live results for the [ClawBench](https://huggingface.co/datasets/TIGER-Lab/ClawBench) web-agent benchmark — backed by [`leaderboard/results.csv`](https://huggingface.co/datasets/TIGER-Lab/ClawBench/blob/main/leaderboard/results.csv) in the dataset repo. Submit your model by opening a PR there.
+| Resource | Link |
+|---|---|
+| 📖 Paper | https://arxiv.org/abs/2604.08523 |
+| 💻 GitHub | https://github.com/reacher-z/ClawBench |
+| 🗂 Dataset | https://huggingface.co/datasets/TIGER-Lab/ClawBench |
+| 🎞 Traces (V1) | https://huggingface.co/datasets/NAIL-Group/ClawBenchV1Trace |
+| 🎞 Traces (V2) | https://huggingface.co/datasets/TIGER-Lab/ClawBenchV2Trace |
+| 🌐 Website | https://claw-bench.com |
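
Review note: the README asks submitters to add rows to `leaderboard/results.csv`. As a rough sketch of how one row could be derived from per-run outcomes, the code below assumes each run carries an `intercepted` flag and a `match` value of true, false, or null, that `passed` counts intercepted runs, that rates are stored as percentages, and that a null `match` earns no reward. `RunOutcome` and `build_row` are hypothetical names, not part of the ClawBench tooling; the column order is the one listed in the Space's submission text.

```python
from dataclasses import dataclass
from typing import List, Optional

# Column order as listed in the Space's submission instructions.
COLUMNS = ["model", "harness", "dataset", "passed", "total",
           "pass_rate", "reward_rate", "wall_hours"]


@dataclass
class RunOutcome:
    """Hypothetical per-run record; not a ClawBench type."""
    intercepted: bool        # Stage 1: final request matched the task schema
    match: Optional[bool]    # Stage 2: judge verdict, None if unavailable


def build_row(model: str, harness: str, dataset: str,
              runs: List[RunOutcome], wall_hours: float) -> dict:
    """Aggregate runs into one CSV row (assumption: rates are percentages)."""
    total = len(runs)
    if total == 0:
        raise ValueError("at least one run is required")
    passed = sum(r.intercepted for r in runs)
    # Reward requires both stages; a null verdict is treated as no reward.
    rewarded = sum(r.intercepted and r.match is True for r in runs)
    return {
        "model": model,
        "harness": harness,
        "dataset": dataset,
        "passed": passed,
        "total": total,
        "pass_rate": round(100 * passed / total, 2),
        "reward_rate": round(100 * rewarded / total, 2),
        "wall_hours": wall_hours,
    }


runs = [RunOutcome(True, True), RunOutcome(True, False),
        RunOutcome(True, None), RunOutcome(False, None)]
row = build_row("example-model", "hermes", "v2", runs, wall_hours=1.5)
```

The official `rescore-summary.json` remains the source of truth for the actual cell values; this only illustrates how the two stages combine.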

app.py ADDED

@@ -0,0 +1,195 @@
+"""ClawBench leaderboard Space — reads results.csv from the TIGER-Lab/ClawBench dataset.
+Two-stage scoring per https://github.com/reacher-z/ClawBench/blob/main/eval/scoring.md:
+  - Intercepted (Stage 1) = fraction of runs whose final HTTP request hit the per-task URL/method schema.
+  - Reward     (Stage 2) = fraction that also passed an LLM judge on the intercepted payload. Headline metric.
+"""
+import io
+import urllib.request
+import gradio as gr
+import pandas as pd
+RESULTS_URL = (
+    "https://huggingface.co/datasets/TIGER-Lab/ClawBench/resolve/main/leaderboard/results.csv"
+)
+CITATION = """@article{zhang2026clawbench,
+  title  = {ClawBench: Can AI Agents Complete Everyday Online Tasks?},
+  author = {Zhang, Xiaochen and others},
+  year   = {2026},
+  eprint = {2604.08523},
+  archivePrefix = {arXiv},
+}"""
+INTRO = """# 🏆 ClawBench — Web Agent Benchmark
+**Can AI agents complete everyday online tasks?** ClawBench scores agents on real, live websites (booking flights, ordering groceries, submitting job applications). Two corpora: **V1** — 153 tasks across 144 websites · **V2** — 130 newer tasks across 63 platforms. Every run is graded twice: a deterministic HTTP-request *interception* check (Stage 1), then an LLM *judge* on the intercepted payload (Stage 2 — the headline `Reward`).
+[**📖 Paper**](https://arxiv.org/abs/2604.08523) · [**💻 GitHub**](https://github.com/reacher-z/ClawBench) · [**🗂 Dataset**](https://huggingface.co/datasets/TIGER-Lab/ClawBench) · [**🎞 Traces V1**](https://huggingface.co/datasets/NAIL-Group/ClawBenchV1Trace) · [**🎞 Traces V2**](https://huggingface.co/datasets/TIGER-Lab/ClawBenchV2Trace) · [**🌐 Site**](https://claw-bench.com)
+"""
+TABLE_INTRO = """**Intercepted** = the agent's final HTTP request matched the task's URL/method schema. **Reward** = Intercepted and also passed the LLM judge on the payload (default judge: `deepseek/deepseek-v4-pro`). Rows are ranked by Reward, with Intercepted as the tiebreaker. `—` means no Stage-2 data is available."""
+ABOUT = """## About ClawBench
+### Why a new benchmark?
+Existing browser-agent benchmarks either run on synthetic / sandboxed websites (WebArena, VisualWebArena) or only check whether the agent *reached* the endpoint (WebVoyager). ClawBench runs on **live, real-world websites** and verifies the *payload* the agent submitted — so an agent that types the wrong delivery address into Uber Eats fails, even if its last HTTP request hit the correct endpoint.
+### Two corpora
+- **V1** — 153 tasks across 144 real websites (the paper).
+- **V2** — 130 newer everyday tasks across 63 platforms, expanded coverage of e-commerce / form-filling / authentication-walled flows.
+### Two-stage scoring
+| Stage | What it checks | Output |
+|---|---|---|
+| 1. **Interception** | Did the final HTTP request match the task's URL + method + canonical body schema? | `intercepted ∈ {true, false}` |
+| 2. **Judge** | Given the natural-language instruction and the intercepted payload, did the agent submit the *right* thing? | `match ∈ {true, false, null}` |
+`Reward = Intercepted ∧ Match`. Full prompt + judge model details: [eval/scoring.md ↗](https://github.com/reacher-z/ClawBench/blob/main/eval/scoring.md)
+### What ships with every run
+A **5-layer trace bundle** (downloadable from the Traces datasets above):
+- `recording.mp4` — full browser session video
+- `actions.jsonl` — every click / type / scroll
+- `agent-messages.jsonl` — model inputs & outputs (incl. reasoning)
+- `requests.jsonl` — every HTTP request the page made
+- `interception.json` — graded final request
+- `run-meta.json` — model, harness, scores, timing
+### Reproducing
+```bash
+pip install clawbench-eval
+clawbench run --model <your-model> --harness hermes --corpus v2
+python scripts/clawbench_rescore.py --judge-model deepseek-v4-pro --only-batch <your-batch-dir>
+```
+"""
+SUBMIT = """## 🚀 Submit your model
+Submissions are accepted as **PRs to the leaderboard CSV** in the dataset repo:
+[**Open the CSV in the dataset repo ↗**](https://huggingface.co/datasets/TIGER-Lab/ClawBench/blob/main/leaderboard/results.csv)
+### Required steps
+1. **Run the benchmark** — install with `pip install clawbench-eval`, then run `clawbench run --model <your-model> --harness hermes --corpus v2` (or `v1`). Use the included harnesses (hermes / openclaw) so traces follow the standard 5-layer format.
+2. **Score** — `python scripts/clawbench_rescore.py --judge-model deepseek-v4-pro --only-batch <your-batch-dir>` produces `rescore-summary.json` with the cells you'll need.
+3. **Upload traces** (recommended) — push the 5-layer run bundles to `TIGER-Lab/ClawBenchV2Trace` (or `NAIL-Group/ClawBenchV1Trace`) so others can audit.
+4. **Open a PR** — add one row per `(model, harness, corpus)` to `leaderboard/results.csv` with columns: `model,harness,dataset,passed,total,pass_rate,reward_rate,wall_hours`. Link the trace bundle in the PR description.
+We re-run a sample of your submitted traces with our judge before merging — to keep the table honest.
+For step-by-step instructions, see [`eval/scoring.md`](https://github.com/reacher-z/ClawBench/blob/main/eval/scoring.md).
+"""
+def _format_pct(v) -> str:
+    return "—" if pd.isna(v) else f"{v:.2f}%"
+def _format_wall(v) -> str:
+    return "—" if pd.isna(v) else f"{v:.2f}"
+def load_results() -> pd.DataFrame:
+    raw = urllib.request.urlopen(RESULTS_URL, timeout=30).read()
+    df = pd.read_csv(io.BytesIO(raw))
+    if "reward_rate" not in df.columns:
+        df["reward_rate"] = pd.NA
+    df = df.sort_values(
+        ["dataset", "reward_rate", "pass_rate"],
+        ascending=[True, False, False],
+        na_position="last",
+    ).reset_index(drop=True)
+    df.insert(0, "rank", df.groupby("dataset").cumcount() + 1)
+    df["pass_rate"] = df["pass_rate"].map(_format_pct)
+    df["reward_rate"] = df["reward_rate"].map(_format_pct)
+    df["wall_hours"] = df["wall_hours"].map(_format_wall)
+    df.rename(
+        columns={
+            "model": "Model",
+            "harness": "Harness",
+            "dataset": "Corpus",
+            "passed": "Pass",
+            "total": "Total",
+            "pass_rate": "Intercepted",
+            "reward_rate": "Reward",
+            "wall_hours": "Wall (h)",
+            "rank": "Rank",
+        },
+        inplace=True,
+    )
+    return df[["Rank", "Model", "Harness", "Corpus", "Intercepted", "Reward", "Pass", "Total", "Wall (h)"]]
+def filter_df(query: str, corpus: str, harness_filter: list[str]):
+    df = load_results()
+    if corpus and corpus != "all":
+        df = df[df["Corpus"].str.lower() == corpus.lower()]
+    if harness_filter:
+        df = df[df["Harness"].isin(harness_filter)]
+    if query:
+        q = query.strip().lower()
+        df = df[df["Model"].str.lower().str.contains(q, na=False)]
+    return df.reset_index(drop=True)
+def all_harnesses() -> list[str]:
+    try:
+        df = load_results()
+        return sorted(df["Harness"].dropna().unique().tolist())
+    except Exception:
+        return ["hermes", "openclaw"]
+with gr.Blocks(title="ClawBench Leaderboard", theme=gr.themes.Soft()) as demo:
+    gr.Markdown(INTRO)
+    with gr.Tabs():
+        with gr.TabItem("📊 Leaderboard"):
+            with gr.Row():
+                with gr.Accordion("Citation", open=False):
+                    gr.Textbox(value=CITATION, label="BibTeX", lines=8, interactive=False)
+            gr.Markdown(TABLE_INTRO)
+            with gr.Row():
+                search_bar = gr.Textbox(placeholder="Search models…", show_label=False, scale=3)
+                corpus_choice = gr.Radio(choices=["all", "v2", "v1"], value="v2", label="Corpus", scale=2)
+            harness_choice = gr.CheckboxGroup(
+                choices=all_harnesses(),
+                value=all_harnesses(),
+                label="Harness",
+            )
+            df_init = filter_df("", "v2", all_harnesses())
+            table = gr.Dataframe(
+                value=df_init,
+                interactive=False,
+                wrap=True,
+                column_widths=["60px", "260px", "100px", "70px", "110px", "100px", "60px", "60px", "80px"],
+            )
+            refresh = gr.Button("🔄 Refresh from dataset")
+            for control in (search_bar, corpus_choice, harness_choice):
+                control.change(
+                    fn=filter_df,
+                    inputs=[search_bar, corpus_choice, harness_choice],
+                    outputs=table,
+                )
+            refresh.click(fn=filter_df, inputs=[search_bar, corpus_choice, harness_choice], outputs=table)
+        with gr.TabItem("📝 About"):
+            gr.Markdown(ABOUT)
+        with gr.TabItem("🚀 Submit here"):
+            gr.Markdown(SUBMIT)
+if __name__ == "__main__":
+    demo.launch()
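
Review note: the ranking in `load_results` can be checked offline. The sketch below repeats the same `sort_values` and `groupby(...).cumcount()` calls on a small in-memory frame instead of the downloaded CSV. `rank_rows` is a hypothetical helper and the sample rows are invented for illustration; they are not leaderboard results.

```python
import pandas as pd


def rank_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Rank within each dataset by reward_rate, then pass_rate; NaN sorts last."""
    out = df.sort_values(
        ["dataset", "reward_rate", "pass_rate"],
        ascending=[True, False, False],
        na_position="last",
    ).reset_index(drop=True)
    # cumcount restarts at 0 for each dataset, so ranks are per corpus.
    out.insert(0, "rank", out.groupby("dataset").cumcount() + 1)
    return out


# Invented sample data: "a" and "b" tie on reward, "c" has no Stage-2 score.
sample = pd.DataFrame({
    "model": ["a", "b", "c", "d"],
    "dataset": ["v2", "v2", "v2", "v1"],
    "pass_rate": [50.0, 60.0, 70.0, 40.0],
    "reward_rate": [30.0, 30.0, float("nan"), 10.0],
})
ranked = rank_rows(sample)
```

In this sample, "b" ranks ahead of "a" on the `pass_rate` tiebreak, and "c" lands last in v2 despite the highest `pass_rate` because its `reward_rate` is missing.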

requirements.txt ADDED

@@ -0,0 +1,2 @@
+gradio==5.15.0
+pandas
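
Review note: in `app.py`, every control change and each of the three `all_harnesses()` calls at startup goes through `load_results`, which downloads the CSV again. One possible mitigation, sketched here under assumptions, is a small time-based cache around the fetch. `TTLCache`, the 300-second default, and the injected `fetch` and `clock` callables are hypothetical and do not appear in the commit.

```python
import time
from typing import Callable, Optional


class TTLCache:
    """Hypothetical helper: reuse a fetched payload until it is ttl_seconds old."""

    def __init__(self, fetch: Callable[[], bytes], ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[bytes] = None
        self._expires_at = 0.0

    def get(self, force: bool = False) -> bytes:
        """Return the cached payload, refetching if stale, empty, or forced."""
        now = self._clock()
        if force or self._value is None or now >= self._expires_at:
            self._value = self._fetch()
            self._expires_at = now + self._ttl
        return self._value
```

If adopted, `fetch` could wrap the existing `urllib.request.urlopen(RESULTS_URL, timeout=30).read()` call, and the Refresh button could call `get(force=True)` so a manual refresh still bypasses the cache.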