Zhu Jiajun (jz28583) committed on Apr 21

Commit

0425205

1 Parent(s): 0309359

deploy: overlay server/space/{README,Dockerfile} at root

Files changed (2)

Dockerfile +38 -0
README.md +50 -218

Dockerfile ADDED

@@ -0,0 +1,38 @@

+FROM python:3.11-slim
+ENV PYTHONUNBUFFERED=1 \
+    PYTHONDONTWRITEBYTECODE=1 \
+    PIP_NO_CACHE_DIR=1
+WORKDIR /app
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends git && \
+    rm -rf /var/lib/apt/lists/*
+# Install deps first so the layer caches across code-only changes.
+COPY server/requirements.txt /app/server/requirements.txt
+RUN pip install -r /app/server/requirements.txt "huggingface_hub>=0.20"
+# Install the graphtestbed package itself so server/api.py can
+# `from graphtestbed._manifest import ...`.
+COPY pyproject.toml /app/
+COPY graphtestbed /app/graphtestbed
+COPY datasets /app/datasets
+COPY server /app/server
+RUN pip install --no-deps -e /app
+# HF Spaces mounts /data on Persistent Storage tier; on free tier it's
+# just an in-container path that the dataset-repo backup loop preserves.
+ENV GT_DATA_ROOT=/data \
+    GT_DIR=/data/gt \
+    GT_DB=/data/leaderboard.db \
+    GT_ARCHIVE_DIR=/data/submissions \
+    GT_DATASET_REPO=lanczos/graphtestbed-gt \
+    GT_BACKUP_INTERVAL=60 \
+    GT_QUOTA=5 \
+    PORT=7860
+RUN mkdir -p /data && chmod 777 /data
+EXPOSE 7860
+CMD ["python", "/app/server/space/space_entry.py"]
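
The `ENV` block above fixes the server's runtime defaults. `server/space/space_entry.py` is not part of this diff, so the following is only a sketch of how an entrypoint might read those variables. `Settings` and `load_settings` are hypothetical names, and the fallback values simply mirror the Dockerfile.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the GT_* variables set in the Dockerfile."""
    data_root: Path
    gt_dir: Path
    db_path: Path
    archive_dir: Path
    dataset_repo: str
    backup_interval: int  # seconds between backup attempts
    quota: int            # submissions per day per IP per task
    port: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read configuration from `env`, falling back to the Dockerfile defaults."""
    data_root = Path(env.get("GT_DATA_ROOT", "/data"))
    return Settings(
        data_root=data_root,
        gt_dir=Path(env.get("GT_DIR", str(data_root / "gt"))),
        db_path=Path(env.get("GT_DB", str(data_root / "leaderboard.db"))),
        archive_dir=Path(env.get("GT_ARCHIVE_DIR", str(data_root / "submissions"))),
        dataset_repo=env.get("GT_DATASET_REPO", "lanczos/graphtestbed-gt"),
        backup_interval=int(env.get("GT_BACKUP_INTERVAL", "60")),
        quota=int(env.get("GT_QUOTA", "5")),
        port=int(env.get("PORT", "7860")),
    )
```

Passing a plain dict instead of `os.environ` (for example `load_settings({"GT_QUOTA": "10"})`) makes the overrides easy to exercise without touching the real environment.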

README.md CHANGED

@@ -1,224 +1,56 @@
-# GraphTestbed
-> A Kaggle-style scoring server for benchmarking ML/AI agent harnesses on heterogeneous graph datasets.
-[![Leaderboard](https://img.shields.io/badge/%F0%9F%A4%97%20Leaderboard-lanczos%2Fgraphtestbed-yellow)](https://huggingface.co/spaces/lanczos/graphtestbed)
-[![Dataset](https://img.shields.io/badge/%F0%9F%A4%97%20Dataset-lanczos%2Fgraphtestbed--data-yellow)](https://huggingface.co/datasets/lanczos/graphtestbed-data)
-[![Python](https://img.shields.io/badge/python-3.10%2B-blue)](https://www.python.org/)
-[![License: MIT](https://img.shields.io/badge/license-MIT-green)](LICENSE)
-[![Status](https://img.shields.io/badge/status-pre--launch-orange)](#status)
-Build an agent. Submit predictions. Get a score. Test labels live on a server, not in this repo, so your agent can't accidentally cheat. Designed for non-adversarial benchmarking of agent harnesses on real graph data.
-## Status
-**Pre-launch.** The code runs end-to-end. Pieces that aren't fully live yet:
-- The package isn't on PyPI yet → install from git (see below)
-- HuggingFace dataset repos aren't published yet → use your own `train/val/test_features.csv` files for now
-The hosted scoring API at <https://lanczos-graphtestbed.hf.space/> is the
-default `gtb submit` target. Set `GRAPHTESTBED_API` to point at a local
-server if you'd rather self-host (instructions below).
-## Submit to the hosted leaderboard
-```bash
-pip install git+https://github.com/zhuconv/GraphTestbed
-gtb submit figraph --file preds.csv --agent my-agent-v1
-# ✓ Scored  primary (auc_roc): 0.689  rank: #3
-gtb leaderboard figraph
-```
-The hosted server is a Docker-SDK HF Space that holds GT files in a private
-companion dataset and never logs prediction CSVs (it does archive them in
-the same private repo for reproducibility — see [`server/space/DEPLOY.md`](server/space/DEPLOY.md)).
-Trust model: non-adversarial, 5 submissions/day/IP/task, score bucketed to
-3 decimals — same as if you ran the server yourself.
-## Run the server locally (alternative)
-```bash
-git clone https://github.com/zhuconv/GraphTestbed
-cd GraphTestbed
-GT_DIR=~/path/to/your/ground_truth ./server/run_local.sh
-# → Running on http://localhost:8080
-# point the client at it
-export GRAPHTESTBED_API=http://localhost:8080
-gtb submit figraph --file preds.csv --agent my-agent-v1
-```
-You provide the `ground_truth/<task>.csv` files yourself (one row per test entity, columns `<id_col>,Label`). The CLI never needs to see them.
-## Tasks
-| Task | Type | Metric | Test rows | Backend |
-| --- | --- | --- | --- | --- |
-| `ieee-fraud-detection` | binary classification | AUC-ROC | 506,691 | Kaggle |
-| `arxiv-citation` | binary classification | AUC-ROC | 193,696 | local GT |
-| `figraph` | binary anomaly detection | AUC-ROC | 3,596 | local GT |
-| `ibm-aml` | binary anomaly detection | F1 (minority) | 863,900 | local GT |
-`ieee-fraud-detection` forwards your CSV to the Kaggle competition and
-returns the official private-leaderboard score. The hosted server uses its
-own Kaggle credentials for that — no setup needed on your side just to
-submit. (You only need `KAGGLE_USERNAME`/`KAGGLE_KEY` if you self-host.)
-## How it works
-```
-agent ──► gtb fetch <task>   ──► (HuggingFace, when published)
-agent ──► gtb submit <file>  ──► POST /submit
-                                   │
-                                   ▼
-              ┌─────────────────────────────────────┐
-              │ server (Flask, ~250 LOC)            │
-              │  1. quota: 5/day/IP/task            │
-              │  2. schema check                    │
-              │  3. join with GT (local fs only)    │
-              │  4. metric → round to 3dp           │
-              │  5. sqlite leaderboard → JSON       │
-              └─────────────────────────────────────┘
-```
-Ground truth never enters git. It sits on the server's local filesystem, populated separately at deploy time.
-## CLI
-```bash
-gtb fetch <task>                                  # download from HF (when published)
-gtb submit <task> --file <csv> --agent <name>     # POST to scoring API
-gtb submit ... --dry-run                          # validate schema, no quota burn
-gtb leaderboard <task>                            # GET /leaderboard/<task>
-gtb score-local <task> --file <csv>               # score against val (no GT needed)
-```
-`GRAPHTESTBED_API` overrides the server URL. Defaults to `http://localhost:8080`.
 ---
-<details>
-<summary><b>Submission format</b></summary>
-Per-task schema in `datasets/manifest.yaml`:
-```yaml
-arxiv-citation:
-  hf_repo: graphtestbed/arxiv-citation
-  files:
-    train_features: { sha256: ... }
-    val_features:   { sha256: ... }
-    test_features:  { sha256: ... }       # no Label column
-  submission_schema:
-    id_col: Paper_ID
-    pred_col: Label                       # float in [0, 1]
-    n_rows: 19394
-  metric:
-    primary: auc_roc
-    secondary: [auc_pr, f1]
-```
-CSV must contain **exactly** `[id_col, pred_col]`, exactly `n_rows` rows, IDs matching `test_features.csv` 100%. Otherwise schema check fails before scoring (no quota burn).
-Successful output:
-```
-✓ Schema OK   (rows=19394, sha256=abcdef012345...)
-✓ Scored      (run_id=ab12cd34ef56)
-  primary (auc_roc): 0.689
-  secondary: {'auc_pr': 0.661, 'f1': 0.591}
-  leaderboard rank: #3
-  quota remaining today: 4
-```
-</details>
-<details>
-<summary><b>Trust model</b></summary>
-This is a **non-adversarial** benchmark. We trust agent authors not to circumvent the contract. The API enforces:
-- **Test labels not in the repo.** No accidental leak via `pd.concat([train, val, test])`.
-- **Rate limit.** 5 submissions / day / IP / task — enough to debug, not to hill-climb.
-- **Score bucketing.** Metrics rounded to 3dp; on a 19k test set this is ~5σ per bit, making probing statistically uninformative.
-- **No per-row feedback.** Aggregates only.
-- **Audit trail.** Every submission's sha256 + agent + timestamp lives in sqlite.
-**What we do _not_ defend against:** anyone reading the `server/` code (it's public, but contains no GT); anyone re-downloading upstream Kaggle / FiGraph data and trying to recompute labels.
-If you need adversarial guarantees, use a fully sandboxed eval (Docker + `network=none` + private CDN). Wrong tool here.
-</details>
-<details>
-<summary><b>Repo layout</b></summary>
-```
-GraphTestbed/
-├── graphtestbed/         # the gtb CLI package (~400 LOC)
-├── server/               # the Flask scoring API (~250 LOC)
-│   ├── api.py
-│   ├── run_local.sh
-│   └── deploy.md
-├── datasets/manifest.yaml  # task → HF location + schema + metric
-├── PROTOCOL.md           # full submission contract
-└── README.md
-```
-When the project goes live the `server/` directory will be promoted to its own `server` branch (kept separate so deploy hosts only checkout what they run). For now everything is on `main`.
-See [`PROTOCOL.md`](PROTOCOL.md) for the full submission contract. See [`server/deploy.md`](server/deploy.md) for hosted deploy.
-</details>
-<details>
-<summary><b>Adding a new task</b></summary>
-1. Prepare `train/val/test_features.csv` + `ground_truth.csv` locally. Strip target column from `test_features`.
-2. Push agent-visible files to `huggingface.co/datasets/graphtestbed/<task>` (when HF org is set up).
-3. Place `ground_truth.csv` at `<server>/path/to/gt/<task>.csv`.
-4. Add the task to `datasets/manifest.yaml` with sha256 + schema + metric.
-CLI and API auto-discover from the manifest. No code changes.
-</details>
-<details>
-<summary><b>Adding a new agent</b></summary>
-You don't modify GraphTestbed. You:
-1. Get data (currently: bring-your-own `test_features.csv`)
-2. Train, predict, write CSV with `<id_col>, <pred_col>` columns
-3. `gtb submit <task> --file preds.csv --agent <name>`
-That's it. See [`PROTOCOL.md`](PROTOCOL.md) for edge cases.
-</details>
-## Reference agent integrations (`agents/`)
-Two third-party harnesses ship pre-wired to the testbed; both route LLM
-traffic through one local [CLIProxyAPI](https://github.com/router-for-me/CLIProxyAPI):
-| package | upstream | default model |
 | --- | --- | --- |
-| [`agents.ai_build_ai`](agents/ai_build_ai/README.md) | [aibuildai/AI-Build-AI](https://github.com/aibuildai/AI-Build-AI) | `claude-sonnet-4-6` |
-| [`agents.mlevolve`](agents/mlevolve/README.md) | [InternScience/MLEvolve](https://github.com/InternScience/MLEvolve) | `gpt-5.3-codex-spark` |
-The proxy integration itself is generic — see
-[`agents/cliproxyapi/README.md`](agents/cliproxyapi/README.md) for the
-3-function shim (`anthropic_env` / `openai_env` / `gemini_env` /
-`openai_yaml_block`) that any future agent can reuse.
-```bash
-# One-time
-export CLIPROXYAPI_KEY=<from your ~/.cli-proxy-api/config.yaml api-keys list>
-bash agents/ai_build_ai/install.sh        # or agents/mlevolve/install.sh
-# Per task
-gtb fetch figraph
-python -m agents.ai_build_ai.runner --task figraph
-# → prints path to runs/ai_build_ai/figraph/<ts>/submission.csv
-gtb submit figraph --file <printed-path> --agent aibuildai-sonnet-4-6
-```
-## License
-[MIT](LICENSE). Data: subject to upstream licenses (Kaggle competition rules, FiGraph CC BY-NC 4.0, etc.).

+---
+title: GraphTestbed Scoring API
+emoji: 📊
+colorFrom: indigo
+colorTo: green
+sdk: docker
+app_port: 7860
+pinned: false
 ---
+# GraphTestbed Scoring API
+Public scoring server for the [GraphTestbed](https://github.com/zhuconv/GraphTestbed)
+benchmark. Anyone can `gtb submit <task> --file preds.csv --agent <name>` from
+anywhere; the scored entry lands on a single shared leaderboard.
+## Endpoints
+| method | path | purpose |
 | --- | --- | --- |
+| POST | `/submit` | multipart `task=…&agent=…&file=preds.csv` → JSON with primary metric, secondary metrics, leaderboard rank, quota_remaining |
+| GET | `/leaderboard/<task>` | best-per-agent JSON, sorted by primary desc |
+| GET | `/healthz` | tasks list + which have GT loaded + quota |
+Full contract: [PROTOCOL.md](https://github.com/zhuconv/GraphTestbed/blob/main/PROTOCOL.md).
+## Trust model
+Non-adversarial benchmark. The API enforces:
+- 5 submissions / day / IP / task
+- Schema check before scoring (malformed CSVs don't burn quota)
+- Score bucketing (round to 3 dp)
+- Audit trail in sqlite + per-submission CSV archive
+Test labels live only in the companion private dataset repo
+(`lanczos/graphtestbed-gt`) and never enter the Space's git history.
+## Configuration (Space secrets)
+| name | required | default | notes |
+| --- | --- | --- | --- |
+| `HF_TOKEN` | yes | — | write scope on `GT_DATASET_REPO` |
+| `GT_DATASET_REPO` | no | `lanczos/graphtestbed-gt` | private dataset holding GT + leaderboard backups |
+| `GT_BACKUP_INTERVAL` | no | `60` | seconds between sqlite → dataset-repo pushes |
+| `GT_QUOTA` | no | `5` | submissions/day/IP/task |
+| `GT_BYPASS_KEY` | no | — | shared secret; clients sending it as `X-Bypass-Key` header skip quota and may pass `dry=1` to score without inserting |
+## Persistence
+- On boot: `snapshot_download` pulls `gt/*.csv`, `leaderboard.db`, and any
+  archived `submissions/**/*.csv` from the dataset repo into `/data`.
+- Every 60 s: if `SELECT COUNT(*) FROM submissions` grew, a daemon thread
+  uses `sqlite3.Connection.backup()` to copy the DB atomically and
+  `upload_file`s it back. New submission CSVs in `/data/submissions/` are
+  pushed via `upload_folder` (content-hash diff — unchanged files skipped).
+- Worst-case loss on Space crash: 60 s of submissions.
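
The backup step described in the Persistence notes can be sketched with the standard library alone. This is a minimal illustration, not the Space's actual code: `count_submissions`, `snapshot_if_grown`, and the `upload` callback are hypothetical names, and the `submissions` table name is taken from the query quoted in the README. The copy itself uses `sqlite3.Connection.backup()`, which produces a consistent snapshot even if the source database is written to during the copy.

```python
import sqlite3
from pathlib import Path
from typing import Callable, Optional


def count_submissions(db_path: Path) -> int:
    """Return the number of rows in the `submissions` table."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute("SELECT COUNT(*) FROM submissions").fetchone()[0]
    finally:
        conn.close()


def snapshot_if_grown(
    db_path: Path,
    snapshot_path: Path,
    last_count: int,
    upload: Optional[Callable[[Path], None]] = None,
) -> int:
    """Copy the DB to `snapshot_path` only if new submissions arrived.

    Returns the row count to remember for the next call.
    """
    current = count_submissions(db_path)
    if current <= last_count:
        return last_count  # nothing new, skip the copy and the upload

    src = sqlite3.connect(db_path)
    dst = sqlite3.connect(snapshot_path)
    try:
        src.backup(dst)  # online backup: consistent copy of the whole DB
    finally:
        dst.close()
        src.close()

    if upload is not None:
        upload(snapshot_path)  # assumption: pushes the snapshot to the dataset repo
    return current
```

In a deployment like the one described, `upload` would presumably wrap `huggingface_hub`'s `upload_file`, and the function would be called from a daemon thread every `GT_BACKUP_INTERVAL` seconds, carrying the returned count from one call to the next.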