Zhu Jiajun (jz28583) committed
Commit 0425205
Parent(s): 0309359
deploy: overlay server/space/{README,Dockerfile} at root
- Dockerfile +38 -0
- README.md +50 -218
Dockerfile
ADDED
@@ -0,0 +1,38 @@
+FROM python:3.11-slim
+
+ENV PYTHONUNBUFFERED=1 \
+    PYTHONDONTWRITEBYTECODE=1 \
+    PIP_NO_CACHE_DIR=1
+
+WORKDIR /app
+
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends git && \
+    rm -rf /var/lib/apt/lists/*
+
+# Install deps first so the layer caches across code-only changes.
+COPY server/requirements.txt /app/server/requirements.txt
+RUN pip install -r /app/server/requirements.txt "huggingface_hub>=0.20"
+
+# Install the graphtestbed package itself so server/api.py can
+# `from graphtestbed._manifest import ...`.
+COPY pyproject.toml /app/
+COPY graphtestbed /app/graphtestbed
+COPY datasets /app/datasets
+COPY server /app/server
+RUN pip install --no-deps -e /app
+
+# HF Spaces mounts /data on Persistent Storage tier; on free tier it's
+# just an in-container path that the dataset-repo backup loop preserves.
+ENV GT_DATA_ROOT=/data \
+    GT_DIR=/data/gt \
+    GT_DB=/data/leaderboard.db \
+    GT_ARCHIVE_DIR=/data/submissions \
+    GT_DATASET_REPO=lanczos/graphtestbed-gt \
+    GT_BACKUP_INTERVAL=60 \
+    GT_QUOTA=5 \
+    PORT=7860
+RUN mkdir -p /data && chmod 777 /data
+
+EXPOSE 7860
+CMD ["python", "/app/server/space/space_entry.py"]
README.md
CHANGED
@@ -1,224 +1,56 @@
-
-
-
-
-
-
-
-
-[](#status)
-
-Build an agent. Submit predictions. Get a score. Test labels live on a server, not in this repo, so your agent can't accidentally cheat. Designed for non-adversarial benchmarking of agent harnesses on real graph data.
-
-## Status
-
-**Pre-launch.** The code runs end-to-end. Pieces that aren't fully live yet:
-
-- The package isn't on PyPI yet; install from git (see below)
-- HuggingFace dataset repos aren't published yet; use your own `train/val/test_features.csv` files for now
-
-The hosted scoring API at <https://lanczos-graphtestbed.hf.space/> is the
-default `gtb submit` target. Set `GRAPHTESTBED_API` to point at a local
-server if you'd rather self-host (instructions below).
-
-## Submit to the hosted leaderboard
-
-```bash
-pip install git+https://github.com/zhuconv/GraphTestbed
-gtb submit figraph --file preds.csv --agent my-agent-v1
-# ✓ Scored primary (auc_roc): 0.689 rank: #3
-gtb leaderboard figraph
-```
-
-The hosted server is a Docker-SDK HF Space that holds GT files in a private
-companion dataset and never logs prediction CSVs (it does archive them in
-the same private repo for reproducibility; see [`server/space/DEPLOY.md`](server/space/DEPLOY.md)).
-Trust model: non-adversarial, 5 submissions/day/IP/task, score bucketed to
-3 decimals, same as if you ran the server yourself.
-
-## Run the server locally (alternative)
-
-```bash
-git clone https://github.com/zhuconv/GraphTestbed
-cd GraphTestbed
-GT_DIR=~/path/to/your/ground_truth ./server/run_local.sh
-# → Running on http://localhost:8080
-
-# point the client at it
-export GRAPHTESTBED_API=http://localhost:8080
-gtb submit figraph --file preds.csv --agent my-agent-v1
-```
-
-You provide the `ground_truth/<task>.csv` files yourself (one row per test entity, columns `<id_col>,Label`). The CLI never needs to see them.
-
-## Tasks
-
-| Task | Type | Metric | Test rows | Backend |
-| --- | --- | --- | --- | --- |
-| `ieee-fraud-detection` | binary classification | AUC-ROC | 506,691 | Kaggle |
-| `arxiv-citation` | binary classification | AUC-ROC | 193,696 | local GT |
-| `figraph` | binary anomaly detection | AUC-ROC | 3,596 | local GT |
-| `ibm-aml` | binary anomaly detection | F1 (minority) | 863,900 | local GT |
-
-`ieee-fraud-detection` forwards your CSV to the Kaggle competition and
-returns the official private-leaderboard score. The hosted server uses its
-own Kaggle credentials for that, so no setup is needed on your side just to
-submit. (You only need `KAGGLE_USERNAME`/`KAGGLE_KEY` if you self-host.)
-
-## How it works
-
-```
-agent ──► gtb fetch <task>   ──► (HuggingFace, when published)
-agent ──► gtb submit <file>  ──► POST /submit
-                                   │
-                                   ▼
-                ┌─────────────────────────────────────┐
-                │  server (Flask, ~250 LOC)           │
-                │   1. quota: 5/day/IP/task           │
-                │   2. schema check                   │
-                │   3. join with GT (local fs only)   │
-                │   4. metric → round to 3dp          │
-                │   5. sqlite leaderboard → JSON      │
-                └─────────────────────────────────────┘
-```
-
-Ground truth never enters git. It sits on the server's local filesystem, populated separately at deploy time.
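Steps 3 and 4 of the pipeline (join with ground truth, compute the metric, round to 3 dp) can be sketched in standard-library Python. This is an illustrative sketch, not the server's actual code: the function names `auc_roc` and `score_submission` are hypothetical, and the rank-based AUC below stands in for whatever metric implementation `server/api.py` really uses. The only detail taken from the README is the ground-truth layout, `<id_col>,Label`.

```python
import csv
import io


def auc_roc(labels, scores):
    """Rank-based AUC-ROC (Mann-Whitney U), using average ranks for ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs both classes in the ground truth")
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def score_submission(pred_csv, gt_csv, id_col, pred_col="Label"):
    """Join predictions with ground truth on id_col; return AUC rounded to 3 dp."""
    gt = {r[id_col]: int(r["Label"]) for r in csv.DictReader(io.StringIO(gt_csv))}
    preds = {r[id_col]: float(r[pred_col]) for r in csv.DictReader(io.StringIO(pred_csv))}
    ids = sorted(gt)
    labels = [gt[i] for i in ids]
    scores = [preds[i] for i in ids]  # KeyError if a test ID is missing
    return round(auc_roc(labels, scores), 3)


gt_csv = "Paper_ID,Label\na,0\nb,1\nc,0\nd,1\n"
pred_csv = "Paper_ID,Label\na,0.1\nb,0.9\nc,0.8\nd,0.7\n"
print(score_submission(pred_csv, gt_csv, "Paper_ID"))  # 0.75
```

Three of the four positive/negative pairs are ordered correctly, hence 0.75. Rounding happens once, at the very end, so the caller only ever sees the bucketed value.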
-
-## CLI
-
-```bash
-gtb fetch <task>                                # download from HF (when published)
-gtb submit <task> --file <csv> --agent <name>   # POST to scoring API
-gtb submit ... --dry-run                        # validate schema, no quota burn
-gtb leaderboard <task>                          # GET /leaderboard/<task>
-gtb score-local <task> --file <csv>             # score against val (no GT needed)
-```
-
-`GRAPHTESTBED_API` overrides the server URL. Defaults to `http://localhost:8080`.
-
 ---

-<details>
-<summary><b>Submission format</b></summary>
-
-Per-task schema in `datasets/manifest.yaml`:
-
-```yaml
-arxiv-citation:
-  hf_repo: graphtestbed/arxiv-citation
-  files:
-    train_features: { sha256: ... }
-    val_features: { sha256: ... }
-    test_features: { sha256: ... } # no Label column
-  submission_schema:
-    id_col: Paper_ID
-    pred_col: Label # float in [0, 1]
-    n_rows: 19394
-  metric:
-    primary: auc_roc
-    secondary: [auc_pr, f1]
-```
-
-CSV must contain **exactly** `[id_col, pred_col]`, exactly `n_rows` rows, IDs matching `test_features.csv` 100%. Otherwise schema check fails before scoring (no quota burn).
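The three schema rules (exact columns, exact row count, IDs identical to `test_features.csv`) can be sketched as a small validator. This is a hedged illustration, not the project's real checker: `check_schema` and its signature are hypothetical, the requirement that columns appear in the order `[id_col, pred_col]` is an assumption, and the `[0, 1]` range check is taken from the `pred_col` comment in the manifest excerpt.

```python
import csv
import io


def check_schema(pred_csv, id_col, pred_col, expected_ids):
    """Return a list of schema errors; an empty list means the CSV passes."""
    reader = csv.DictReader(io.StringIO(pred_csv))
    if reader.fieldnames != [id_col, pred_col]:
        return [f"columns must be exactly [{id_col}, {pred_col}], got {reader.fieldnames}"]
    rows = list(reader)
    errors = []
    if len(rows) != len(expected_ids):
        errors.append(f"expected {len(expected_ids)} rows, got {len(rows)}")
    ids = [r[id_col] for r in rows]
    if len(set(ids)) != len(ids):
        errors.append("duplicate IDs")
    if set(ids) != set(expected_ids):
        errors.append("IDs do not match test_features.csv")
    for r in rows:
        try:
            p = float(r[pred_col])
        except (TypeError, ValueError):
            errors.append(f"non-numeric prediction for {r[id_col]}")
            break
        if not 0.0 <= p <= 1.0:  # also rejects NaN
            errors.append(f"prediction out of [0, 1] for {r[id_col]}")
            break
    return errors
```

Returning a list rather than raising on the first problem lets a dry run report everything wrong with a file in one pass, before any quota is consumed.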
-
-Successful output:
-
-```
-✓ Schema OK (rows=19394, sha256=abcdef012345...)
-✓ Scored (run_id=ab12cd34ef56)
-  primary (auc_roc): 0.689
-  secondary: {'auc_pr': 0.661, 'f1': 0.591}
-  leaderboard rank: #3
-  quota remaining today: 4
-```
-</details>
-
-<details>
-<summary><b>Trust model</b></summary>
-
-This is a **non-adversarial** benchmark. We trust agent authors not to circumvent the contract. The API enforces:
-
-- **Test labels not in the repo.** No accidental leak via `pd.concat([train, val, test])`.
-- **Rate limit.** 5 submissions / day / IP / task: enough to debug, not to hill-climb.
-- **Score bucketing.** Metrics rounded to 3dp; on a 19k test set this is ~5σ per bit, making probing statistically uninformative.
-- **No per-row feedback.** Aggregates only.
-- **Audit trail.** Every submission's sha256 + agent + timestamp lives in sqlite.
-
-**What we do _not_ defend against:** anyone reading the `server/` code (it's public, but contains no GT); anyone re-downloading upstream Kaggle / FiGraph data and trying to recompute labels.
-
-If you need adversarial guarantees, use a fully sandboxed eval (Docker + `network=none` + private CDN). Wrong tool here.
-</details>
-
-<details>
-<summary><b>Repo layout</b></summary>
-
-```
-GraphTestbed/
-├── graphtestbed/              # the gtb CLI package (~400 LOC)
-├── server/                    # the Flask scoring API (~250 LOC)
-│   ├── api.py
-│   ├── run_local.sh
-│   └── deploy.md
-├── datasets/manifest.yaml     # task → HF location + schema + metric
-├── PROTOCOL.md                # full submission contract
-└── README.md
-```
-
-When the project goes live the `server/` directory will be promoted to its own `server` branch (kept separate so deploy hosts only check out what they run). For now everything is on `main`.
-
-See [`PROTOCOL.md`](PROTOCOL.md) for the full submission contract. See [`server/deploy.md`](server/deploy.md) for hosted deploy.
-</details>
-
-<details>
-<summary><b>Adding a new task</b></summary>
-
-1. Prepare `train/val/test_features.csv` + `ground_truth.csv` locally. Strip target column from `test_features`.
-2. Push agent-visible files to `huggingface.co/datasets/graphtestbed/<task>` (when HF org is set up).
-3. Place `ground_truth.csv` at `<server>/path/to/gt/<task>.csv`.
-4. Add the task to `datasets/manifest.yaml` with sha256 + schema + metric.
-
-CLI and API auto-discover from the manifest. No code changes.
-</details>
-
-<details>
-<summary><b>Adding a new agent</b></summary>
-
-You don't modify GraphTestbed. You:
-
-1. Get data (currently: bring-your-own `test_features.csv`)
-2. Train, predict, write CSV with `<id_col>, <pred_col>` columns
-3. `gtb submit <task> --file preds.csv --agent <name>`
-
-That's it. See [`PROTOCOL.md`](PROTOCOL.md) for edge cases.
-</details>

-

-
-traffic through one local [CLIProxyAPI](https://github.com/router-for-me/CLIProxyAPI):

-
 | --- | --- | --- |
-
-
-
-
-[
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+---
+title: GraphTestbed Scoring API
+emoji: π
+colorFrom: indigo
+colorTo: green
+sdk: docker
+app_port: 7860
+pinned: false
 ---

+# GraphTestbed Scoring API

+Public scoring server for the [GraphTestbed](https://github.com/zhuconv/GraphTestbed)
+benchmark. Anyone can `gtb submit <task> --file preds.csv --agent <name>` from
+anywhere; the scored entry lands on a single shared leaderboard.

+## Endpoints

+| method | path | purpose |
 | --- | --- | --- |
+| POST | `/submit` | multipart `task=…&agent=…&file=preds.csv` → JSON with primary metric, secondary metrics, leaderboard rank, quota_remaining |
+| GET | `/leaderboard/<task>` | best-per-agent JSON, sorted by primary desc |
+| GET | `/healthz` | tasks list + which have GT loaded + quota |
+
+Full contract: [PROTOCOL.md](https://github.com/zhuconv/GraphTestbed/blob/main/PROTOCOL.md).
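For readers who want to call `/submit` without the `gtb` CLI, the multipart request can be assembled with the standard library alone. The sketch below only builds the request and does not send it. The helper name `build_submit_request` is hypothetical, and the form field names `task`, `agent`, and `file` are taken from the endpoint table above; the exact wire format the server accepts is defined in PROTOCOL.md, so treat this as an approximation.

```python
import urllib.request
import uuid


def build_submit_request(base_url, task, agent, csv_bytes, filename="preds.csv"):
    """Build (but do not send) a multipart/form-data POST for /submit."""
    boundary = uuid.uuid4().hex
    parts = []
    # Plain text fields: task and agent.
    for name, value in (("task", task), ("agent", agent)):
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode()
        )
    # File field carrying the prediction CSV.
    parts.append(
        (
            f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
            f'filename="{filename}"\r\nContent-Type: text/csv\r\n\r\n'
        ).encode()
        + csv_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return urllib.request.Request(
        base_url.rstrip("/") + "/submit",
        data=b"".join(parts),
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


req = build_submit_request(
    "https://lanczos-graphtestbed.hf.space/", "figraph", "my-agent-v1",
    b"id,Label\n1,0.5\n",
)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would perform the submission and consume one unit of quota, which is why the sketch stops short of sending.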
+
+## Trust model
+
+Non-adversarial benchmark. The API enforces:
+- 5 submissions / day / IP / task
+- Schema check before scoring (malformed CSVs don't burn quota)
+- Score bucketing (round to 3 dp)
+- Audit trail in sqlite + per-submission CSV archive
+
+Test labels live only in the companion private dataset repo
+(`lanczos/graphtestbed-gt`) and never enter the Space's git history.
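The per-day, per-IP, per-task quota from the list above can be illustrated with an in-memory counter. This is a minimal sketch under stated assumptions: the class `DailyQuota` is hypothetical, the day boundary is assumed to be UTC midnight, and the real server presumably counts rows in its sqlite audit table rather than holding state in memory.

```python
import datetime
from collections import Counter


class DailyQuota:
    """In-memory submission counter keyed by (UTC date, IP, task)."""

    def __init__(self, limit=5):
        self.limit = limit
        self._used = Counter()

    @staticmethod
    def _today():
        return datetime.datetime.now(datetime.timezone.utc).date()

    def remaining(self, ip, task, today=None):
        key = (today or self._today(), ip, task)
        return max(self.limit - self._used[key], 0)

    def consume(self, ip, task, today=None):
        """Record one submission; return False if the quota is exhausted."""
        key = (today or self._today(), ip, task)
        if self._used[key] >= self.limit:
            return False
        self._used[key] += 1
        return True
```

To match the "malformed CSVs don't burn quota" rule, `consume` would be called only after the schema check has passed.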
+
+## Configuration (Space secrets)
+
+| name | required | default | notes |
+| --- | --- | --- | --- |
+| `HF_TOKEN` | yes | none | write scope on `GT_DATASET_REPO` |
+| `GT_DATASET_REPO` | no | `lanczos/graphtestbed-gt` | private dataset holding GT + leaderboard backups |
+| `GT_BACKUP_INTERVAL` | no | `60` | seconds between sqlite → dataset-repo pushes |
+| `GT_QUOTA` | no | `5` | submissions/day/IP/task |
+| `GT_BYPASS_KEY` | no | none | shared secret; clients sending it as `X-Bypass-Key` header skip quota and may pass `dry=1` to score without inserting |
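One way to read these settings at startup is a small typed loader that applies the defaults from the table and fails fast when `HF_TOKEN` is missing. The names `SpaceConfig` and `load_config` are hypothetical; this is a sketch of the pattern, not the contents of `space_entry.py`.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class SpaceConfig:
    hf_token: str
    dataset_repo: str
    backup_interval: int  # seconds between backup pushes
    quota: int            # submissions/day/IP/task
    bypass_key: Optional[str]


def load_config(env: Mapping[str, str] = os.environ) -> SpaceConfig:
    """Read Space secrets from env, applying the documented defaults."""
    token = env.get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is required (write scope on GT_DATASET_REPO)")
    return SpaceConfig(
        hf_token=token,
        dataset_repo=env.get("GT_DATASET_REPO", "lanczos/graphtestbed-gt"),
        backup_interval=int(env.get("GT_BACKUP_INTERVAL", "60")),
        quota=int(env.get("GT_QUOTA", "5")),
        bypass_key=env.get("GT_BYPASS_KEY") or None,  # empty string means unset
    )
```

Taking the environment as a parameter keeps the loader testable: a plain dict can stand in for `os.environ`.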
+
+## Persistence
+
+- On boot: `snapshot_download` pulls `gt/*.csv`, `leaderboard.db`, and any
+  archived `submissions/**/*.csv` from the dataset repo into `/data`.
+- Every 60 s: if `SELECT COUNT(*) FROM submissions` grew, a daemon thread
+  uses `sqlite3.Connection.backup()` to copy the DB atomically and
+  `upload_file`s it back. New submission CSVs in `/data/submissions/` are
+  pushed via `upload_folder` (content-hash diff; unchanged files skipped).
+- Worst-case loss on Space crash: 60 s of submissions.
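The "copy only if the table grew" step of the backup loop can be sketched with the standard `sqlite3` module. The function name `snapshot_if_grown` and its arguments are hypothetical, and the upload to the dataset repo is deliberately left out; only the `SELECT COUNT(*) FROM submissions` check and the use of `sqlite3.Connection.backup()` come from the description above.

```python
import sqlite3


def snapshot_if_grown(db_path, snapshot_path, last_count):
    """Copy the DB to snapshot_path if submissions grew; return the current count."""
    src = sqlite3.connect(db_path)
    try:
        count = src.execute("SELECT COUNT(*) FROM submissions").fetchone()[0]
        if count > last_count:
            dst = sqlite3.connect(snapshot_path)
            try:
                # backup() yields a consistent copy even if writers are active.
                src.backup(dst)
            finally:
                dst.close()
        return count
    finally:
        src.close()
```

A daemon thread would call this every `GT_BACKUP_INTERVAL` seconds, remember the returned count for the next round, and upload the snapshot file whenever the count increased.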