GraphTestbed Protocol
The contract between agent harnesses, the public client (gtb), and the
scoring API (server/api.py).
1. Two-branch single-repo layout
GraphTestbed (single public repo)
├── main branch ← what agents `git clone`
└── server branch ← what the maintainer deploys
Branches are not access control; they're an organizational split. Both are public; both can be cloned by anyone. The actual privacy boundary is:
- Test labels never enter git (any branch).
- The scoring server holds GT at
/var/graphtestbed/gt/<task>.csv, populated separately from git at deploy time.
2. Data layout (per task)
What the agent sees (HuggingFace public dataset)
graphtestbed/<task>:
train_features.csv # labeled
val_features.csv # labeled
test_features.csv # NO target column
description.md
sample_submission.csv
<auxiliary>.csv # optional: edges*.csv, train_text.csv, ...
What the server holds privately
/var/graphtestbed/gt/<task>.csv
<id_col>,Label # one row per test entity, with the true label
ID column matches test_features.csv[<id_col>] 100%.
3. Submission contract
Format
Per-task schema in datasets/manifest.yaml:
<task>:
submission_schema:
id_col: <name>
pred_col: <name>
n_rows: <int>
pred_dtype: float # binary classification
CSV must have:
- Exactly 2 columns: <id_col>, <pred_col>
- Exactly n_rows rows
- ID set 100% matching test_features.csv[<id_col>]
- <pred_col> values: float in [0, 1] for binary
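The local check the client runs before touching the API can be sketched as follows. This is illustrative, not the actual gtb implementation; the function name and signature are assumptions, with the schema fields taken from manifest.yaml above:

```python
import csv

def validate_submission(path, id_col, pred_col, n_rows, test_ids):
    """Check a submission CSV against the task's submission_schema.

    `test_ids` is the set of IDs from test_features.csv[<id_col>].
    Returns (ok, reason) so the client can print why validation failed.
    """
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        # Exactly 2 columns, in schema order
        if reader.fieldnames != [id_col, pred_col]:
            return False, f"expected columns [{id_col}, {pred_col}], got {reader.fieldnames}"
        seen, count = set(), 0
        for row in reader:
            count += 1
            seen.add(row[id_col])
            try:
                p = float(row[pred_col])
            except ValueError:
                return False, f"non-float prediction: {row[pred_col]!r}"
            if not 0.0 <= p <= 1.0:
                return False, f"prediction out of [0, 1]: {p}"
    if count != n_rows:
        return False, f"expected {n_rows} rows, got {count}"
    if seen != test_ids:
        return False, "ID set does not match test_features"
    return True, "ok"
```

A failed check means no POST is sent, so malformed files never reach the server or burn quota.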
Submit
gtb submit <task> --file <preds.csv> --agent <name>
This:
- Validates schema locally (no API call if malformed).
- POSTs multipart/form-data to $GRAPHTESTBED_API/submit with task=<task>&agent=<name>&file=<csv>.
- Server re-validates schema, checks quota, scores against GT, returns metrics JSON.
- Client prints metrics + leaderboard rank + remaining quota.
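The POST itself is a plain multipart upload. A minimal sketch of how the client could assemble it (the function name is hypothetical; the URL, form fields, and $GRAPHTESTBED_API fallback follow the contract above, with localhost as an assumed default):

```python
import os

def build_submit_request(task: str, agent: str, csv_bytes: bytes) -> dict:
    """Assemble the pieces of the multipart POST to the scoring API."""
    api = os.environ.get("GRAPHTESTBED_API", "http://localhost:8000")
    return {
        "url": f"{api}/submit",
        # Plain form fields
        "data": {"task": task, "agent": agent},
        # Multipart file part: (filename, bytes, content type)
        "files": {"file": ("preds.csv", csv_bytes, "text/csv")},
    }
```

With the `requests` library this maps directly onto `requests.post(req["url"], data=req["data"], files=req["files"], timeout=60)`, after which the 4xx codes from §4 surface as the response status.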
4. API endpoints
POST /submit (multipart)
Form fields:
- task (str, required)
- agent (str, required)
- file (CSV, required, ≤50 MB)
Response 200:
{
"run_id": "8a3f29bce4a1",
"task": "arxiv-citation",
"agent": "autopipe-v0.4",
"primary": 0.689,
"secondary": {"auc_pr": 0.661, "f1": 0.591},
"n_rows": 19394,
"leaderboard_rank": 3,
"quota_remaining": 4,
"submitted_at": "2026-04-18T14:23:00"
}
Response 4xx:
- 400: missing form fields, malformed CSV
- 404: unknown task
- 422: schema check failed (returns reason)
- 429: quota exceeded
- 503: GT not deployed for this task
GET /leaderboard/<task>
Returns array sorted by primary desc, best per agent (not all
submissions):
[
{"agent": "human-baseline", "primary": 0.901, "n_submissions": 1, "first_seen": "..."},
{"agent": "autopipe-v0.4", "primary": 0.689, "n_submissions": 4, "first_seen": "..."},
...
]
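The "best per agent" collapse the endpoint performs can be sketched like this, assuming each stored run is a dict shaped like the sqlite row in §6 (the function is illustrative, not the server's code):

```python
def leaderboard(runs):
    """Collapse all runs to each agent's best primary score, sorted desc."""
    best = {}
    for run in runs:
        agent = run["agent"]
        entry = best.get(agent)
        if entry is None:
            best[agent] = {"agent": agent, "primary": run["primary"],
                           "n_submissions": 1, "first_seen": run["submitted_at"]}
        else:
            entry["n_submissions"] += 1
            entry["primary"] = max(entry["primary"], run["primary"])
            entry["first_seen"] = min(entry["first_seen"], run["submitted_at"])
    return sorted(best.values(), key=lambda r: -r["primary"])
```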
GET /healthz
{
"status": "ok",
"tasks": ["ieee-fraud-detection", ...],
"gt_present": ["arxiv-citation", "figraph"],
"quota_per_day": 5,
"uptime_unix": 1745081234
}
5. Anti-leakage rails (lightweight)
| Rail | Mechanism |
|---|---|
| Test labels not in git | .gitignore blocks ground_truth*, private/. GT lives only on server fs. |
| Score bucketing | Server rounds to 3 decimals before returning |
| No per-row feedback | API returns aggregate metrics only |
| Quota | 5 submissions / day / IP / task by default |
| Schema check first | Malformed submissions don't burn quota and don't expose GT to scoring |
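The bucketing and quota rails are small enough to sketch inline. This version keeps counts in an in-memory dict for clarity; a real server would persist them (e.g. in the sqlite store of §6), and the names here are assumptions:

```python
from collections import defaultdict
from datetime import date

QUOTA_PER_DAY = 5
_counts = defaultdict(int)  # (ip, task, day) -> submissions consumed today

def check_quota(ip: str, task: str, today: date) -> bool:
    """Consume one submission slot; False once the daily quota is spent."""
    key = (ip, task, today.isoformat())
    if _counts[key] >= QUOTA_PER_DAY:
        return False  # -> 429 quota exceeded
    _counts[key] += 1
    return True

def bucket(score: float) -> float:
    """Round to 3 decimals so responses can't be used as a fine-grained oracle."""
    return round(score, 3)
```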
6. Reproducibility
Every server response includes run_id. Server stores in sqlite:
run_id, task, agent, primary_metric, secondary_json, submission_sha256,
n_rows, submitter_ip, submitted_at
To audit a leaderboard entry: maintainer queries sqlite by run_id, gets the
submission CSV from submissions/<task>/<agent>/<ts>.csv (the client
doesn't push these to git automatically; see §8 if you want that).
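A minimal sketch of the sqlite side, using the columns listed above (schema and helper names are illustrative; the run_id format is assumed to be a short random hex string matching the example response):

```python
import hashlib
import sqlite3
import uuid

def init_db(conn):
    conn.execute("""CREATE TABLE IF NOT EXISTS runs (
        run_id TEXT PRIMARY KEY, task TEXT, agent TEXT,
        primary_metric REAL, secondary_json TEXT,
        submission_sha256 TEXT, n_rows INTEGER,
        submitter_ip TEXT, submitted_at TEXT)""")

def record_run(conn, task, agent, primary, secondary_json,
               csv_bytes, n_rows, ip, ts):
    """Store one scored submission; returns the run_id echoed to the client."""
    run_id = uuid.uuid4().hex[:12]          # e.g. "8a3f29bce4a1"
    sha = hashlib.sha256(csv_bytes).hexdigest()
    conn.execute("INSERT INTO runs VALUES (?,?,?,?,?,?,?,?,?)",
                 (run_id, task, agent, primary, secondary_json,
                  sha, n_rows, ip, ts))
    return run_id

def audit(conn, run_id):
    """Maintainer-side lookup of a leaderboard entry by run_id."""
    return conn.execute("SELECT * FROM runs WHERE run_id = ?",
                        (run_id,)).fetchone()
```

The stored submission_sha256 is what lets an auditor confirm an archived CSV is byte-identical to what was scored.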
7. Versioning
Dataset versions
manifest.yaml pins (hf_repo, hf_revision, sha256) per file. To bump:
- Push new files to HF as a new revision.
- Update manifest.yaml's sha256 fields.
- Existing leaderboard entries reference the old data via their run_id's stored submission_sha256; new submissions use the new data.
For breaking changes (different metric, different test rows), add a v2
task entry: arxiv-citation-v2. Don't mutate v1's manifest.
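Checking a downloaded file against its manifest pin is one hash comparison; a sketch (the function name is an assumption):

```python
import hashlib

def verify_pin(local_path: str, expected_sha256: str) -> bool:
    """Check a downloaded dataset file against its manifest.yaml sha256 pin."""
    h = hashlib.sha256()
    with open(local_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            h.update(chunk)
    return h.hexdigest() == expected_sha256
```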
Metric versions
Same rule: don't mutate, add <task>-v2 if metric definition changes.
8. Optional: client-side submission archiving
If you want every submission CSV preserved publicly, add a step to
graphtestbed/submit.py that opens a PR with the CSV under
submissions/<task>/<agent>/<ts>.csv after the API returns. Off by default
to keep the client simple.
9. What the agent must NOT assume
- No persistent state between submissions. Each /submit is independent.
- No mid-run feedback. One scorer call per submit, full stop.
- No private leaderboard. Every submission is rate-counted by IP and scored; no shadow leaderboard for testing.
- The agent field is your name. The server takes you at your word; don't impersonate.
10. What the API explicitly does NOT defend against
- An agent that downloads upstream data (Kaggle, FiGraph github) and tries to recompute test labels. We re-split for time-forward eval, so upstream labels don't directly map, but a determined agent could try.
- An agent that checks out the server branch to read scoring code. The branch is public; reading the scoring logic doesn't leak GT.
- An agent that runs on a private machine and tries to OCR a maintainer's laptop screen. We are not your adversary.
If your threat model includes adversarial agents with arbitrary capability, this design is wrong; use container-isolated evaluation instead.
11. Rationale (why this vs the prior GitHub-Actions design)
We considered using GitHub Actions + a private GT repo. Rejected because:
- Two repos (public + private GT) is more moving parts than one repo + one small server.
- A FastAPI/Flask app on a $5/month VM (or free HF Space) gives us the same privacy boundary (GT lives on server fs, not in git).
- Synchronous scoring (instant response) is a better UX than async PR comments.
- The trust model is "agents are honest within an academic benchmark," not "agents are adversaries." This doesn't need physical isolation; it needs an API contract that prevents accidental leaks.
The result: ~200 lines of server code, ~150 lines of client code, deploys on free tiers, no maintenance beyond restarting the server if it crashes.