# GraphTestbed Protocol

The contract between agent harnesses, the public client (`gtb`), and the
scoring API (`server/api.py`).

## 1. Two-branch single-repo layout

```
GraphTestbed (single public repo)
├── main branch    ← what agents `git clone`
└── server branch  ← what the maintainer deploys
```

Branches are **not access control**; they're an organizational split. Both
are public; both can be cloned by anyone. The actual privacy boundary is:

- **Test labels never enter git** (any branch).
- The scoring server holds GT at `/var/graphtestbed/gt/<task>.csv`,
  populated separately from git at deploy time.
## 2. Data layout (per task)

### What the agent sees (HuggingFace public dataset)

```
graphtestbed/<task>:
    train_features.csv     # labeled
    val_features.csv       # labeled
    test_features.csv      # NO target column
    description.md
    sample_submission.csv
    <auxiliary>.csv        # optional: edges*.csv, train_text.csv, ...
```

### What the server holds privately

```
/var/graphtestbed/gt/<task>.csv
    <id_col>,Label         # one row per test entity, with the true label
```
The GT ID column matches the ID set of `test_features.csv[<id_col>]` exactly.
## 3. Submission contract

### Format

Per-task schema in `datasets/manifest.yaml`:

```yaml
<task>:
  submission_schema:
    id_col: <name>
    pred_col: <name>
    n_rows: <int>
    pred_dtype: float  # binary classification
```

The CSV must have:

- Exactly 2 columns: `<id_col>, <pred_col>`
- Exactly `n_rows` rows
- An ID set exactly matching `test_features.csv[<id_col>]`
- `<pred_col>` values: floats in [0, 1] for binary tasks
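These four checks are mechanical; a minimal sketch of a local validator (the function name and the shape of the `schema` dict are illustrative, not the actual `gtb` code):

```python
import csv

def validate_submission(path, schema, expected_ids):
    """Check a submission CSV against the manifest schema before upload.

    Returns None if valid, else a human-readable reason string.
    """
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if header != [schema["id_col"], schema["pred_col"]]:
            return f"expected header {schema['id_col']},{schema['pred_col']}"
        rows = [r for r in reader if r]  # ignore trailing blank lines
    if len(rows) != schema["n_rows"]:
        return f"expected {schema['n_rows']} rows, got {len(rows)}"
    for row in rows:
        if len(row) != 2:
            return f"expected 2 columns, got {len(row)}"
        try:
            pred = float(row[1])
        except ValueError:
            return f"non-numeric prediction: {row[1]!r}"
        if not 0.0 <= pred <= 1.0:
            return f"prediction {pred} outside [0, 1]"
    if {r[0] for r in rows} != set(expected_ids):
        return "ID set does not match test_features"
    return None  # valid
```

Returning a reason string rather than raising mirrors the 422 behavior in §4: the caller gets something printable either way.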
### Submit

```bash
gtb submit <task> --file <preds.csv> --agent <name>
```

This:

1. Validates the schema **locally** (no API call if malformed).
2. POSTs `multipart/form-data` to `$GRAPHTESTBED_API/submit` with
   `task=<task>&agent=<name>&file=<csv>`.
3. The server re-validates the schema, checks quota, scores against GT, and
   returns metrics JSON.
4. The client prints metrics, leaderboard rank, and remaining quota.
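The POST in step 2 can be reproduced without the client. A sketch that assembles the `multipart/form-data` body with the standard library only (field names follow the contract above; the helper itself is illustrative):

```python
import uuid

def build_multipart(task, agent, csv_bytes):
    """Assemble a multipart/form-data body matching the /submit contract."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in (("task", task), ("agent", agent)):
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n"
        )
    parts.append(
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="file"; filename="preds.csv"\r\n'
        "Content-Type: text/csv\r\n\r\n"
    )
    body = "".join(parts).encode() + csv_bytes + f"\r\n--{boundary}--\r\n".encode()
    content_type = f"multipart/form-data; boundary={boundary}"
    return body, content_type
```

Send it with `urllib.request.Request(url, data=body, headers={"Content-Type": content_type})`, or just use `requests` and let it build the body for you.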
## 4. API endpoints

### `POST /submit` (multipart)

Form fields:

- `task` (str, required)
- `agent` (str, required)
- `file` (CSV, required, ≤50 MB)

Response 200:

```json
{
  "run_id": "8a3f29bce4a1",
  "task": "arxiv-citation",
  "agent": "autopipe-v0.4",
  "primary": 0.689,
  "secondary": {"auc_pr": 0.661, "f1": 0.591},
  "n_rows": 19394,
  "leaderboard_rank": 3,
  "quota_remaining": 4,
  "submitted_at": "2026-04-18T14:23:00"
}
```

Error responses:

- 400: missing form fields, malformed CSV
- 404: unknown task
- 422: schema check failed (returns reason)
- 429: quota exceeded
- 503: GT not deployed for this task
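Server-side, scoring reduces to joining predictions against the GT file and computing the metric. A sketch with a tie-aware rank-based ROC AUC (the real server may just call sklearn; this version is illustrative and also applies the 3-decimal bucketing from §5):

```python
def roc_auc(labels, scores):
    """Rank-based ROC AUC, averaging 1-based ranks within tie groups."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank across the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    pos = [r for r, y in zip(ranks, labels) if y == 1]
    n_pos, n_neg = len(pos), len(labels) - len(pos)
    return (sum(pos) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def score(preds, gt):
    """preds, gt: dicts id -> value; ID sets already validated to match."""
    ids = sorted(gt)
    return round(roc_auc([gt[i] for i in ids], [preds[i] for i in ids]), 3)
```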
### `GET /leaderboard/<task>`

Returns an array sorted by `primary` desc, **best per agent** (not all
submissions):

```json
[
  {"agent": "human-baseline", "primary": 0.901, "n_submissions": 1, "first_seen": "..."},
  {"agent": "autopipe-v0.4", "primary": 0.689, "n_submissions": 4, "first_seen": "..."},
  ...
]
```

### `GET /healthz`

```json
{
  "status": "ok",
  "tasks": ["ieee-fraud-detection", ...],
  "gt_present": ["arxiv-citation", "figraph"],
  "quota_per_day": 5,
  "uptime_unix": 1745081234
}
```
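The best-per-agent semantics are one aggregation over the runs table (column names follow §6; the query itself is an illustration, not the server's actual SQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (task TEXT, agent TEXT, primary_metric REAL)")
conn.executemany(
    "INSERT INTO runs VALUES (?, ?, ?)",
    [("arxiv-citation", "autopipe-v0.4", 0.65),
     ("arxiv-citation", "autopipe-v0.4", 0.689),
     ("arxiv-citation", "human-baseline", 0.901)],
)

# Best score per agent for one task, sorted descending: one row per agent,
# not one row per submission.
leaderboard = conn.execute(
    """SELECT agent, MAX(primary_metric) AS best, COUNT(*) AS n_submissions
       FROM runs WHERE task = ?
       GROUP BY agent ORDER BY best DESC""",
    ("arxiv-citation",),
).fetchall()
```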
## 5. Anti-leakage rails (lightweight)

| Rail | Mechanism |
|---|---|
| Test labels not in git | `.gitignore` blocks `ground_truth*`, `private/`. GT lives only on the server filesystem. |
| Score bucketing | Server rounds to 3 decimals before returning |
| No per-row feedback | API returns aggregate metrics only |
| Quota | 5 submissions / day / IP / task by default |
| Schema check first | Malformed submissions don't burn quota and don't expose GT to scoring |
## 6. Reproducibility

Every server response includes a `run_id`. The server stores each run in
SQLite:

```
run_id, task, agent, primary_metric, secondary_json, submission_sha256,
n_rows, submitter_ip, submitted_at
```

To audit a leaderboard entry: the maintainer queries SQLite by `run_id`, then
fetches the submission CSV from `submissions/<task>/<agent>/<ts>.csv` (the
client doesn't push these to git automatically; see §8 if you want that).
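The audit step is just a hash comparison: recompute the SHA-256 of the archived CSV and check it against the value recorded at submission time. A sketch (the `runs` columns follow the list above; the `audit` helper is illustrative):

```python
import hashlib
import sqlite3

def audit(conn, run_id, csv_bytes):
    """Verify an archived submission CSV matches the hash stored for run_id.

    Returns None if it matches, else a reason string.
    """
    row = conn.execute(
        "SELECT submission_sha256 FROM runs WHERE run_id = ?", (run_id,)
    ).fetchone()
    if row is None:
        return "unknown run_id"
    actual = hashlib.sha256(csv_bytes).hexdigest()
    return None if actual == row[0] else f"hash mismatch: {actual} != {row[0]}"
```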
## 7. Versioning

### Dataset versions

`manifest.yaml` pins `(hf_repo, hf_revision, sha256)` per file. To bump:

1. Push new files to HF as a new revision.
2. Update `manifest.yaml`'s sha256 fields.
3. Existing leaderboard entries reference the old data via their `run_id`'s
   stored `submission_sha256`; new submissions use the new data.
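Concretely, a pinned entry might look like the following. Only the `hf_repo`, `hf_revision`, and `sha256` keys come from the text above; the nesting and all values are placeholders, not the real manifest:

```yaml
arxiv-citation:
  hf_repo: graphtestbed/arxiv-citation   # placeholder repo id
  hf_revision: <commit-sha>              # pinned HF dataset revision
  files:
    test_features.csv:
      sha256: <hex digest>               # verified after download
```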
For breaking changes (different metric, different test rows), add a v2
task entry: `arxiv-citation-v2`. Don't mutate v1's manifest.

### Metric versions

Same rule: don't mutate; add `<task>-v2` if the metric definition changes.
## 8. Optional: client-side submission archiving

If you want every submission CSV preserved publicly, add a step to
`graphtestbed/submit.py` that opens a PR with the CSV under
`submissions/<task>/<agent>/<ts>.csv` after the API returns. Off by default
to keep the client simple.
## 9. What the agent must NOT assume

- **No persistent state** between submissions. Each `/submit` is independent.
- **No mid-run feedback.** One scorer call per submit, full stop.
- **No private leaderboard.** Every submission is rate-counted by IP and
  scored; there is no shadow leaderboard for testing.
- **The `agent` field is your name.** The server takes you at your word; don't
  impersonate.
## 10. What the API explicitly does NOT defend against

- An agent that downloads upstream data (Kaggle, the FiGraph GitHub repo) and
  tries to recompute test labels. We re-split for time-forward eval, so
  upstream labels don't map directly, but a determined agent could try.
- An agent that checks out the `server` branch to read scoring code. The
  branch is public; reading the scoring logic doesn't leak GT.
- An agent that runs on a private machine and tries to OCR a maintainer's
  laptop screen. We are not your adversary.

If your threat model includes adversarial agents with arbitrary capability,
this design is wrong; use container-isolated evaluation instead.
## 11. Rationale (why this vs the prior GitHub-Actions design)

We considered using GitHub Actions + a private GT repo. Rejected because:

- Two repos (public + private GT) is more moving parts than one repo + one
  small server.
- A FastAPI/Flask app on a $5/month VM (or a free HF Space) gives us the same
  privacy boundary: GT lives on the server filesystem, not in git.
- Synchronous scoring (instant response) is a better UX than async PR
  comments.
- The trust model is "agents are honest within an academic benchmark," not
  "agents are adversaries." That doesn't need physical isolation; it needs an
  API contract that prevents accidental leaks.

The result: ~200 lines of server code, ~150 lines of client code, deploys
on free tiers, no maintenance beyond restarting the server if it crashes.