Zhu Jiajun (jz28583) committed on
Commit
0425205
·
1 Parent(s): 0309359

deploy: overlay server/space/{README,Dockerfile} at root

Files changed (2)
  1. Dockerfile +38 -0
  2. README.md +50 -218
Dockerfile ADDED
@@ -0,0 +1,38 @@
+ FROM python:3.11-slim
+
+ ENV PYTHONUNBUFFERED=1 \
+     PYTHONDONTWRITEBYTECODE=1 \
+     PIP_NO_CACHE_DIR=1
+
+ WORKDIR /app
+
+ RUN apt-get update && \
+     apt-get install -y --no-install-recommends git && \
+     rm -rf /var/lib/apt/lists/*
+
+ # Install deps first so the layer caches across code-only changes.
+ COPY server/requirements.txt /app/server/requirements.txt
+ RUN pip install -r /app/server/requirements.txt "huggingface_hub>=0.20"
+
+ # Install the graphtestbed package itself so server/api.py can
+ # `from graphtestbed._manifest import ...`.
+ COPY pyproject.toml /app/
+ COPY graphtestbed /app/graphtestbed
+ COPY datasets /app/datasets
+ COPY server /app/server
+ RUN pip install --no-deps -e /app
+
+ # HF Spaces mounts /data on Persistent Storage tier; on free tier it's
+ # just an in-container path that the dataset-repo backup loop preserves.
+ ENV GT_DATA_ROOT=/data \
+     GT_DIR=/data/gt \
+     GT_DB=/data/leaderboard.db \
+     GT_ARCHIVE_DIR=/data/submissions \
+     GT_DATASET_REPO=lanczos/graphtestbed-gt \
+     GT_BACKUP_INTERVAL=60 \
+     GT_QUOTA=5 \
+     PORT=7860
+ RUN mkdir -p /data && chmod 777 /data
+
+ EXPOSE 7860
+ CMD ["python", "/app/server/space/space_entry.py"]
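The second `ENV` block sets the defaults the scoring server is expected to pick up at start-up. A minimal sketch of how a Python entry point could read them, assuming plain `os.environ` lookups; `SpaceConfig` and `load_config` are hypothetical names for illustration and are not taken from `server/`:

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class SpaceConfig:
    """Hypothetical container for the settings the Dockerfile exports."""
    gt_dir: Path
    db_path: Path
    archive_dir: Path
    dataset_repo: str
    backup_interval_s: int
    quota_per_day: int
    port: int


def load_config(env=None) -> SpaceConfig:
    """Read settings from a mapping, falling back to the Dockerfile defaults."""
    env = os.environ if env is None else env
    root = env.get("GT_DATA_ROOT", "/data")
    return SpaceConfig(
        gt_dir=Path(env.get("GT_DIR", f"{root}/gt")),
        db_path=Path(env.get("GT_DB", f"{root}/leaderboard.db")),
        archive_dir=Path(env.get("GT_ARCHIVE_DIR", f"{root}/submissions")),
        dataset_repo=env.get("GT_DATASET_REPO", "lanczos/graphtestbed-gt"),
        # Environment values are strings, so numeric settings are converted here.
        backup_interval_s=int(env.get("GT_BACKUP_INTERVAL", "60")),
        quota_per_day=int(env.get("GT_QUOTA", "5")),
        port=int(env.get("PORT", "7860")),
    )
```

Passing a plain dict instead of `os.environ` keeps the loader easy to test; a Space secret such as `GT_QUOTA` would simply override the baked-in default.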
README.md CHANGED
@@ -1,224 +1,56 @@
- # GraphTestbed
-
- > A Kaggle-style scoring server for benchmarking ML/AI agent harnesses on heterogeneous graph datasets.
-
- [![Leaderboard](https://img.shields.io/badge/%F0%9F%A4%97%20Leaderboard-lanczos%2Fgraphtestbed-yellow)](https://huggingface.co/spaces/lanczos/graphtestbed)
- [![Dataset](https://img.shields.io/badge/%F0%9F%A4%97%20Dataset-lanczos%2Fgraphtestbed--data-yellow)](https://huggingface.co/datasets/lanczos/graphtestbed-data)
- [![Python](https://img.shields.io/badge/python-3.10%2B-blue)](https://www.python.org/)
- [![License: MIT](https://img.shields.io/badge/license-MIT-green)](LICENSE)
- [![Status](https://img.shields.io/badge/status-pre--launch-orange)](#status)
-
- Build an agent. Submit predictions. Get a score. Test labels live on a server, not in this repo, so your agent can't accidentally cheat. Designed for non-adversarial benchmarking of agent harnesses on real graph data.
-
- ## Status
-
- **Pre-launch.** The code runs end-to-end. Pieces that aren't fully live yet:
-
- - The package isn't on PyPI yet → install from git (see below)
- - HuggingFace dataset repos aren't published yet → use your own `train/val/test_features.csv` files for now
-
- The hosted scoring API at <https://lanczos-graphtestbed.hf.space/> is the
- default `gtb submit` target. Set `GRAPHTESTBED_API` to point at a local
- server if you'd rather self-host (instructions below).
-
- ## Submit to the hosted leaderboard
-
- ```bash
- pip install git+https://github.com/zhuconv/GraphTestbed
- gtb submit figraph --file preds.csv --agent my-agent-v1
- # ✓ Scored primary (auc_roc): 0.689 rank: #3
- gtb leaderboard figraph
- ```
-
- The hosted server is a Docker-SDK HF Space that holds GT files in a private
- companion dataset and never logs prediction CSVs (it does archive them in
- the same private repo for reproducibility; see [`server/space/DEPLOY.md`](server/space/DEPLOY.md)).
- Trust model: non-adversarial, 5 submissions/day/IP/task, score bucketed to
- 3 decimals, the same as if you ran the server yourself.
-
- ## Run the server locally (alternative)
-
- ```bash
- git clone https://github.com/zhuconv/GraphTestbed
- cd GraphTestbed
- GT_DIR=~/path/to/your/ground_truth ./server/run_local.sh
- # → Running on http://localhost:8080
-
- # point the client at it
- export GRAPHTESTBED_API=http://localhost:8080
- gtb submit figraph --file preds.csv --agent my-agent-v1
- ```
-
- You provide the `ground_truth/<task>.csv` files yourself (one row per test entity, columns `<id_col>,Label`). The CLI never needs to see them.
-
- ## Tasks
-
- | Task | Type | Metric | Test rows | Backend |
- | --- | --- | --- | --- | --- |
- | `ieee-fraud-detection` | binary classification | AUC-ROC | 506,691 | Kaggle |
- | `arxiv-citation` | binary classification | AUC-ROC | 193,696 | local GT |
- | `figraph` | binary anomaly detection | AUC-ROC | 3,596 | local GT |
- | `ibm-aml` | binary anomaly detection | F1 (minority) | 863,900 | local GT |
-
- `ieee-fraud-detection` forwards your CSV to the Kaggle competition and
- returns the official private-leaderboard score. The hosted server uses its
- own Kaggle credentials for that, so no setup is needed on your side just to
- submit. (You only need `KAGGLE_USERNAME`/`KAGGLE_KEY` if you self-host.)
-
- ## How it works
-
- ```
- agent ──► gtb fetch <task>   ──► (HuggingFace, when published)
- agent ──► gtb submit <file>  ──► POST /submit
-                                        │
-                                        ▼
-                     ┌─────────────────────────────────────┐
-                     │  server (Flask, ~250 LOC)           │
-                     │   1. quota: 5/day/IP/task           │
-                     │   2. schema check                   │
-                     │   3. join with GT (local fs only)   │
-                     │   4. metric → round to 3dp          │
-                     │   5. sqlite leaderboard → JSON      │
-                     └─────────────────────────────────────┘
- ```
-
- Ground truth never enters git. It sits on the server's local filesystem, populated separately at deploy time.
-
- ## CLI
-
- ```bash
- gtb fetch <task>                               # download from HF (when published)
- gtb submit <task> --file <csv> --agent <name>  # POST to scoring API
- gtb submit ... --dry-run                       # validate schema, no quota burn
- gtb leaderboard <task>                         # GET /leaderboard/<task>
- gtb score-local <task> --file <csv>            # score against val (no GT needed)
- ```
-
- `GRAPHTESTBED_API` overrides the server URL. Defaults to `http://localhost:8080`.
-
  ---
 
- <details>
- <summary><b>Submission format</b></summary>
-
- Per-task schema in `datasets/manifest.yaml`:
-
- ```yaml
- arxiv-citation:
-   hf_repo: graphtestbed/arxiv-citation
-   files:
-     train_features: { sha256: ... }
-     val_features: { sha256: ... }
-     test_features: { sha256: ... } # no Label column
-   submission_schema:
-     id_col: Paper_ID
-     pred_col: Label # float in [0, 1]
-     n_rows: 19394
-   metric:
-     primary: auc_roc
-     secondary: [auc_pr, f1]
- ```
-
- CSV must contain **exactly** `[id_col, pred_col]`, exactly `n_rows` rows, IDs matching `test_features.csv` 100%. Otherwise schema check fails before scoring (no quota burn).
-
- Successful output:
-
- ```
- ✓ Schema OK (rows=19394, sha256=abcdef012345...)
- ✓ Scored (run_id=ab12cd34ef56)
- primary (auc_roc): 0.689
- secondary: {'auc_pr': 0.661, 'f1': 0.591}
- leaderboard rank: #3
- quota remaining today: 4
- ```
- </details>
-
- <details>
- <summary><b>Trust model</b></summary>
-
- This is a **non-adversarial** benchmark. We trust agent authors not to circumvent the contract. The API enforces:
-
- - **Test labels not in the repo.** No accidental leak via `pd.concat([train, val, test])`.
- - **Rate limit.** 5 submissions / day / IP / task: enough to debug, not to hill-climb.
- - **Score bucketing.** Metrics rounded to 3dp; on a 19k test set this is ~5σ per bit, making probing statistically uninformative.
- - **No per-row feedback.** Aggregates only.
- - **Audit trail.** Every submission's sha256 + agent + timestamp lives in sqlite.
-
- **What we do _not_ defend against:** anyone reading the `server/` code (it's public, but contains no GT); anyone re-downloading upstream Kaggle / FiGraph data and trying to recompute labels.
-
- If you need adversarial guarantees, use a fully sandboxed eval (Docker + `network=none` + private CDN). Wrong tool here.
- </details>
-
- <details>
- <summary><b>Repo layout</b></summary>
-
- ```
- GraphTestbed/
- ├── graphtestbed/             # the gtb CLI package (~400 LOC)
- ├── server/                   # the Flask scoring API (~250 LOC)
- │   ├── api.py
- │   ├── run_local.sh
- │   └── deploy.md
- ├── datasets/manifest.yaml    # task → HF location + schema + metric
- ├── PROTOCOL.md               # full submission contract
- └── README.md
- ```
-
- When the project goes live the `server/` directory will be promoted to its own `server` branch (kept separate so deploy hosts only checkout what they run). For now everything is on `main`.
-
- See [`PROTOCOL.md`](PROTOCOL.md) for the full submission contract. See [`server/deploy.md`](server/deploy.md) for hosted deploy.
- </details>
-
- <details>
- <summary><b>Adding a new task</b></summary>
-
- 1. Prepare `train/val/test_features.csv` + `ground_truth.csv` locally. Strip target column from `test_features`.
- 2. Push agent-visible files to `huggingface.co/datasets/graphtestbed/<task>` (when HF org is set up).
- 3. Place `ground_truth.csv` at `<server>/path/to/gt/<task>.csv`.
- 4. Add the task to `datasets/manifest.yaml` with sha256 + schema + metric.
-
- CLI and API auto-discover from the manifest. No code changes.
- </details>
-
- <details>
- <summary><b>Adding a new agent</b></summary>
-
- You don't modify GraphTestbed. You:
-
- 1. Get data (currently: bring-your-own `test_features.csv`)
- 2. Train, predict, write CSV with `<id_col>, <pred_col>` columns
- 3. `gtb submit <task> --file preds.csv --agent <name>`
-
- That's it. See [`PROTOCOL.md`](PROTOCOL.md) for edge cases.
- </details>
 
- ## Reference agent integrations (`agents/`)
 
- Two third-party harnesses ship pre-wired to the testbed; both route LLM
- traffic through one local [CLIProxyAPI](https://github.com/router-for-me/CLIProxyAPI):
 
- | package | upstream | default model |
  | --- | --- | --- |
- | [`agents.ai_build_ai`](agents/ai_build_ai/README.md) | [aibuildai/AI-Build-AI](https://github.com/aibuildai/AI-Build-AI) | `claude-sonnet-4-6` |
- | [`agents.mlevolve`](agents/mlevolve/README.md) | [InternScience/MLEvolve](https://github.com/InternScience/MLEvolve) | `gpt-5.3-codex-spark` |
-
- The proxy integration itself is generic; see
- [`agents/cliproxyapi/README.md`](agents/cliproxyapi/README.md) for the
- 3-function shim (`anthropic_env` / `openai_env` / `gemini_env` /
- `openai_yaml_block`) that any future agent can reuse.
-
- ```bash
- # One-time
- export CLIPROXYAPI_KEY=<from your ~/.cli-proxy-api/config.yaml api-keys list>
- bash agents/ai_build_ai/install.sh # or agents/mlevolve/install.sh
-
- # Per task
- gtb fetch figraph
- python -m agents.ai_build_ai.runner --task figraph
- # → prints path to runs/ai_build_ai/figraph/<ts>/submission.csv
- gtb submit figraph --file <printed-path> --agent aibuildai-sonnet-4-6
- ```
-
- ## License
-
- [MIT](LICENSE). Data: subject to upstream licenses (Kaggle competition rules, FiGraph CC BY-NC 4.0, etc.).
+ ---
+ title: GraphTestbed Scoring API
+ emoji: 📊
+ colorFrom: indigo
+ colorTo: green
+ sdk: docker
+ app_port: 7860
+ pinned: false
  ---
 
+ # GraphTestbed Scoring API
 
+ Public scoring server for the [GraphTestbed](https://github.com/zhuconv/GraphTestbed)
+ benchmark. Anyone can `gtb submit <task> --file preds.csv --agent <name>` from
+ anywhere; the scored entry lands on a single shared leaderboard.
 
+ ## Endpoints
 
+ | method | path | purpose |
  | --- | --- | --- |
+ | POST | `/submit` | multipart `task=…&agent=…&file=preds.csv` → JSON with primary metric, secondary metrics, leaderboard rank, quota_remaining |
+ | GET | `/leaderboard/<task>` | best-per-agent JSON, sorted by primary desc |
+ | GET | `/healthz` | tasks list + which have GT loaded + quota |
+
+ Full contract: [PROTOCOL.md](https://github.com/zhuconv/GraphTestbed/blob/main/PROTOCOL.md).
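For readers who want to call `/submit` without the `gtb` CLI, the request in the table can be sketched with the standard library alone. The field names `task`, `agent`, and `file` come from the table; the helper name `build_submit_request`, the `text/csv` part type, and the example base URL are assumptions, and the exact response keys should be checked against PROTOCOL.md:

```python
import uuid
import urllib.request


def build_submit_request(base_url, task, agent, csv_bytes, filename="preds.csv"):
    """Build (but do not send) a multipart/form-data POST for /submit."""
    boundary = uuid.uuid4().hex
    parts = []
    # Plain text fields: task and agent.
    for name, value in (("task", task), ("agent", agent)):
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode()
        )
    # File field carrying the predictions CSV.
    parts.append(
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: text/csv\r\n\r\n".encode()
        + csv_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return urllib.request.Request(
        base_url.rstrip("/") + "/submit",
        data=b"".join(parts),
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
```

Sending it is then a matter of `urllib.request.urlopen(req)` and decoding the JSON body. Building the request separately from sending it makes the payload easy to inspect before any quota is spent.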
+
+ ## Trust model
+
+ Non-adversarial benchmark. The API enforces:
+ - 5 submissions / day / IP / task
+ - Schema check before scoring (malformed CSVs don't burn quota)
+ - Score bucketing (round to 3 dp)
+ - Audit trail in sqlite + per-submission CSV archive
+
+ Test labels live only in the companion private dataset repo
+ (`lanczos/graphtestbed-gt`) and never enter the Space's git history.
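The quota and bucketing rules listed above can be illustrated with a small in-memory sketch. This is not the server's implementation (which is not shown in this commit); `DailyQuota` and `bucket` are hypothetical names, and a real deployment would presumably persist the counters rather than hold them in process memory:

```python
from collections import Counter
from datetime import date


class DailyQuota:
    """In-memory counter keyed by (day, IP, task); illustration only."""

    def __init__(self, limit=5):
        self.limit = limit
        self._used = Counter()

    def remaining(self, ip, task, today=None):
        key = (today or date.today(), ip, task)
        return max(self.limit - self._used[key], 0)

    def consume(self, ip, task, today=None):
        """Record one scored submission; return False if the quota is exhausted."""
        key = (today or date.today(), ip, task)
        if self._used[key] >= self.limit:
            return False
        self._used[key] += 1
        return True


def bucket(score, decimals=3):
    """Round a metric so that tiny probing changes do not show up in the score."""
    return round(score, decimals)
```

Calling `consume` only after the schema check has passed matches the rule that malformed CSVs do not burn quota.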
+
+ ## Configuration (Space secrets)
+
+ | name | required | default | notes |
+ | --- | --- | --- | --- |
+ | `HF_TOKEN` | yes | - | write scope on `GT_DATASET_REPO` |
+ | `GT_DATASET_REPO` | no | `lanczos/graphtestbed-gt` | private dataset holding GT + leaderboard backups |
+ | `GT_BACKUP_INTERVAL` | no | `60` | seconds between sqlite → dataset-repo pushes |
+ | `GT_QUOTA` | no | `5` | submissions/day/IP/task |
+ | `GT_BYPASS_KEY` | no | - | shared secret; clients sending it as `X-Bypass-Key` header skip quota and may pass `dry=1` to score without inserting |
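The `GT_BYPASS_KEY` row describes a shared-secret header check. A minimal sketch of such a check, assuming the headers arrive as a dict-like mapping; `is_quota_exempt` is a hypothetical helper, and the constant-time comparison is a design choice of this sketch rather than something the source states:

```python
import hmac


def is_quota_exempt(headers, bypass_key):
    """Return True when the request carries the configured shared secret."""
    if not bypass_key:
        # GT_BYPASS_KEY unset or empty: nobody is exempt.
        return False
    supplied = headers.get("X-Bypass-Key", "")
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(supplied.encode(), bypass_key.encode())
```

Treating an unset key as "no bypass" prevents an empty header from matching an empty secret.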
+
+ ## Persistence
+
+ - On boot: `snapshot_download` pulls `gt/*.csv`, `leaderboard.db`, and any
+   archived `submissions/**/*.csv` from the dataset repo into `/data`.
+ - Every 60 s: if `SELECT COUNT(*) FROM submissions` grew, a daemon thread
+   uses `sqlite3.Connection.backup()` to copy the DB atomically and
+   `upload_file`s it back. New submission CSVs in `/data/submissions/` are
+   pushed via `upload_folder` (content-hash diff; unchanged files are skipped).
+ - Worst-case loss on Space crash: 60 s of submissions.
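The "copy only if the row count grew" step of the backup loop can be sketched as follows. Only the `submissions` table name and the use of `sqlite3.Connection.backup()` come from the text above; the function names and the snapshot path are assumptions, and the upload to the dataset repo is deliberately left out:

```python
import sqlite3


def submission_count(db_path):
    """Number of rows currently in the submissions table."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute("SELECT COUNT(*) FROM submissions").fetchone()[0]
    finally:
        conn.close()


def snapshot_if_grown(db_path, snapshot_path, last_count):
    """Copy the live DB to snapshot_path when new submissions have arrived.

    Returns the current row count so the caller can pass it back next time.
    """
    count = submission_count(db_path)
    if count > last_count:
        src = sqlite3.connect(db_path)
        dst = sqlite3.connect(snapshot_path)
        try:
            # backup() produces a consistent copy even if writers are active.
            src.backup(dst)
        finally:
            dst.close()
            src.close()
    return count
```

A daemon thread would call `snapshot_if_grown` once per `GT_BACKUP_INTERVAL` seconds and, when the count changed, hand the snapshot file to `huggingface_hub.upload_file`. Uploading a snapshot rather than the live file avoids pushing a half-written database.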