# GraphTestbed Protocol

The contract between agent harnesses, the public client (`gtb`), and the scoring API (`server/api.py`).

## 1. Two-branch single-repo layout

```
GraphTestbed (single public repo)
├── main branch   → what agents `git clone`
└── server branch → what the maintainer deploys
```

Branches are **not access control** — they're an organizational split. Both are public; both can be cloned by anyone. The actual privacy boundary is:

- **Test labels never enter git** (any branch).
- The scoring server holds GT at `/var/graphtestbed/gt/<task>.csv`, populated separately from git at deploy time.

## 2. Data layout (per task)

### What the agent sees (HuggingFace public dataset)

```
graphtestbed/<task>/
    train_features.csv      # labeled
    val_features.csv        # labeled
    test_features.csv       # NO target column
    description.md
    sample_submission.csv
    <extra>.csv             # optional: edges*.csv, train_text.csv, ...
```

### What the server holds privately

```
/var/graphtestbed/gt/<task>.csv

<id_col>,Label              # one row per test entity, with the true label
```

The ID column matches `test_features.csv[<id_col>]` 100%.

## 3. Submission contract

### Format

Per-task schema in `datasets/manifest.yaml`:

```yaml
<task>:
  submission_schema:
    id_col: <id_col>
    pred_col: <pred_col>
    n_rows: <n_rows>
    pred_dtype: float   # binary classification
```

The CSV must have:

- Exactly 2 columns: `<id_col>, <pred_col>`
- Exactly `n_rows` rows
- An ID set 100% matching `test_features.csv[<id_col>]`
- `<pred_col>` values: float in [0, 1] for binary

### Submit

```bash
gtb submit <task> --file <path> --agent <agent>
```

This:

1. Validates schema **locally** (no API call if malformed).
2. POSTs `multipart/form-data` to `$GRAPHTESTBED_API/submit` with `task=<task>&agent=<agent>&file=<csv>`.
3. Server re-validates schema, checks quota, scores against GT, returns metrics JSON.
4. Client prints metrics + leaderboard rank + remaining quota.
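The local validation in step 1 can be sketched roughly as follows. This is a minimal sketch assuming pandas; the function name, the shape of the `schema` dict, and the error-list return style are illustrative, not the actual `gtb` implementation.

```python
import pandas as pd

def validate_submission(csv_path: str, schema: dict, test_ids: set) -> list:
    """Check a submission CSV against the task's submission_schema.

    Returns a list of human-readable errors; an empty list means valid.
    """
    errors = []
    df = pd.read_csv(csv_path)

    # Exactly two columns, in the manifest's order.
    expected_cols = [schema["id_col"], schema["pred_col"]]
    if list(df.columns) != expected_cols:
        errors.append(f"columns {list(df.columns)} != {expected_cols}")
        return errors  # the remaining checks depend on the columns

    # Exactly n_rows rows.
    if len(df) != schema["n_rows"]:
        errors.append(f"{len(df)} rows, expected {schema['n_rows']}")

    # ID set must match test_features.csv exactly.
    if set(df[schema["id_col"]]) != test_ids:
        errors.append("ID set does not match test_features.csv")

    # Binary tasks: predictions must be floats in [0, 1].
    preds = df[schema["pred_col"]]
    if schema["pred_dtype"] == "float" and not preds.between(0.0, 1.0).all():
        errors.append("predictions outside [0, 1]")

    return errors
```

Running this before any network call is what keeps malformed files from ever reaching the API (and from burning quota, per §5).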
## 4. API endpoints

### `POST /submit` (multipart)

Form fields:

- `task` (str, required)
- `agent` (str, required)
- `file` (CSV, required, ≤50 MB)

Response 200:

```json
{
  "run_id": "8a3f29bce4a1",
  "task": "arxiv-citation",
  "agent": "autopipe-v0.4",
  "primary": 0.689,
  "secondary": {"auc_pr": 0.661, "f1": 0.591},
  "n_rows": 19394,
  "leaderboard_rank": 3,
  "quota_remaining": 4,
  "submitted_at": "2026-04-18T14:23:00"
}
```

Response 4xx:

- 400 — missing form fields, malformed CSV
- 404 — unknown task
- 422 — schema check failed (returns reason)
- 429 — quota exceeded
- 503 — GT not deployed for this task

### `GET /leaderboard/<task>`

Returns an array sorted by `primary` desc, **best per agent** (not all submissions):

```json
[
  {"agent": "human-baseline", "primary": 0.901, "n_submissions": 1, "first_seen": "..."},
  {"agent": "autopipe-v0.4", "primary": 0.689, "n_submissions": 4, "first_seen": "..."},
  ...
]
```

### `GET /healthz`

```json
{
  "status": "ok",
  "tasks": ["ieee-fraud-detection", ...],
  "gt_present": ["arxiv-citation", "figraph"],
  "quota_per_day": 5,
  "uptime_unix": 1745081234
}
```

## 5. Anti-leakage rails (lightweight)

| Rail | Mechanism |
|---|---|
| Test labels not in git | `.gitignore` blocks `ground_truth*`, `private/`. GT lives only on the server fs. |
| Score bucketing | Server rounds to 3 decimals before returning |
| No per-row feedback | API returns aggregate metrics only |
| Quota | 5 submissions / day / IP / task by default |
| Schema check first | Malformed submissions don't burn quota and don't expose GT to scoring |

## 6. Reproducibility

Every server response includes a `run_id`. The server stores in SQLite:

```
run_id, task, agent, primary_metric, secondary_json,
submission_sha256, n_rows, submitter_ip, submitted_at
```

To audit a leaderboard entry, the maintainer queries SQLite by `run_id` and pulls the submission CSV from `submissions/<task>/<agent>/<run_id>.csv` (the client doesn't push these to git automatically — see §8 if you want that).
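The `POST /submit` contract can be exercised from Python without the `gtb` client. A minimal sketch assuming the `requests` library; the endpoint path, form fields, and status codes come from §4, while the function name and error handling are illustrative.

```python
import os
import requests

API = os.environ.get("GRAPHTESTBED_API", "http://localhost:8000")

def submit(task: str, agent: str, csv_path: str) -> dict:
    """POST a submission and surface the contract's documented error codes."""
    with open(csv_path, "rb") as f:
        resp = requests.post(
            f"{API}/submit",
            data={"task": task, "agent": agent},
            files={"file": (os.path.basename(csv_path), f, "text/csv")},
        )
    if resp.status_code == 422:
        # Server-side schema check failed; the body explains why.
        raise ValueError(f"schema check failed: {resp.json()}")
    if resp.status_code == 429:
        raise RuntimeError("quota exceeded; retry after the daily window resets")
    resp.raise_for_status()  # 400 / 404 / 503 and friends
    return resp.json()  # run_id, primary, leaderboard_rank, quota_remaining, ...
```

Note that the aggregate-metrics-only response (no per-row feedback, per §5) is all a caller ever gets back.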
## 7. Versioning

### Dataset versions

`manifest.yaml` pins `(hf_repo, hf_revision, sha256)` per file. To bump:

1. Push new files to HF as a new revision.
2. Update `manifest.yaml`'s sha256 fields.
3. Existing leaderboard entries reference the old data via their `run_id`'s stored `submission_sha256`; new submissions use the new data.

For breaking changes (different metric, different test rows), add a v2 task entry: `arxiv-citation-v2`. Don't mutate v1's manifest.

### Metric versions

Same rule: don't mutate; add `-v2` if the metric definition changes.

## 8. Optional: client-side submission archiving

If you want every submission CSV preserved publicly, add a step to `graphtestbed/submit.py` that opens a PR with the CSV under `submissions/<task>/<agent>/<run_id>.csv` after the API returns. Off by default to keep the client simple.

## 9. What the agent must NOT assume

- **No persistent state** between submissions. Each `/submit` is independent.
- **No mid-run feedback.** One scorer call per submit, full stop.
- **No private leaderboard.** Every submission is rate-counted by IP and scored — no shadow leaderboard for testing.
- **`agent` field is your name.** The server takes you at your word; don't impersonate.

## 10. What the API explicitly does NOT defend against

- An agent that downloads upstream data (Kaggle, the FiGraph GitHub) and tries to recompute test labels. We re-split for time-forward eval, so upstream labels don't directly map, but a determined agent could try.
- An agent that checks out the `server` branch to read scoring code. The branch is public; reading the scoring logic doesn't leak GT.
- An agent that runs on a private machine and tries to OCR a maintainer's laptop screen. We are not your adversary.

If your threat model includes adversarial agents with arbitrary capability, this design is wrong; use container-isolated evaluation instead.

## 11. Rationale (why this vs the prior GitHub-Actions design)

We considered using GitHub Actions + a private GT repo.
Rejected because:

- Two repos (public + private GT) means more moving parts than one repo + one small server.
- A FastAPI/Flask app on a $5/month VM (or a free HF Space) gives us the same privacy boundary: GT lives on the server fs, not in git.
- Synchronous scoring (instant response) is a better UX than async PR comments.
- The trust model is "agents are honest within an academic benchmark," not "agents are adversaries." That doesn't need physical isolation; it needs an API contract that prevents accidental leaks.

The result: ~200 lines of server code, ~150 lines of client code, deploys on free tiers, no maintenance beyond restarting the server if it crashes.
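One closing sketch: the `(hf_repo, hf_revision, sha256)` pinning in §7 implies a client-side integrity check on downloaded dataset files. The function names and dict shapes below are illustrative, not the actual manifest parser.

```python
import hashlib

def file_sha256(path: str) -> str:
    """Hash a downloaded dataset file for comparison against manifest.yaml."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large feature files don't load into memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_manifest(files: dict, pinned: dict) -> list:
    """Return the filenames whose on-disk hash differs from the pinned sha256.

    `files` maps filename -> local path; `pinned` maps filename -> expected
    sha256 (as read from manifest.yaml). An empty list means everything matches.
    """
    return [
        name for name, path in files.items()
        if file_sha256(path) != pinned.get(name)
    ]
```

Running this after `git clone` + dataset download catches the silent-drift case: a locally stale `train_features.csv` scoring against a newer manifest revision.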