GraphTestbed Protocol
The contract between agent harnesses, the public client (gtb), and the
scoring API (server/api.py).
1. Two-branch single-repo layout
GraphTestbed (single public repo)
├── main branch ← what agents `git clone`
└── server branch ← what the maintainer deploys
Branches are not access control; they're an organizational split. Both are public; both can be cloned by anyone. The actual privacy boundary is:
- Test labels never enter git (any branch).
- The scoring server holds GT at
/var/graphtestbed/gt/<task>.csv, populated separately from git at deploy time.
2. Data layout (per task)
What the agent sees (HuggingFace public dataset)
graphtestbed/<task>:
train_features.csv # labeled
val_features.csv # labeled
test_features.csv # NO target column
description.md
sample_submission.csv
<auxiliary>.csv # optional: edges*.csv, train_text.csv, ...
What the server holds privately
/var/graphtestbed/gt/<task>.csv
<id_col>,Label # one row per test entity, with the true label
ID column matches test_features.csv[<id_col>] 100%.
3. Submission contract
Format
Per-task schema in datasets/manifest.yaml:
<task>:
submission_schema:
id_col: <name>
pred_col: <name>
n_rows: <int>
pred_dtype: float # binary classification
CSV must have:
- Exactly 2 columns: <id_col>, <pred_col>
- Exactly n_rows rows
- ID set 100% matching test_features.csv[<id_col>]
- <pred_col> values: float in [0, 1] for binary
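The local check the client runs before touching the API can be sketched as follows. This is illustrative, not the actual gtb implementation; the function name and signature are assumptions, with the schema fields taken from manifest.yaml above:

```python
import csv

def validate_submission(path, id_col, pred_col, n_rows, test_ids):
    """Check a submission CSV against the task's submission_schema.

    `test_ids` is the set of IDs from test_features.csv[<id_col>].
    Returns (ok, reason) so the client can print why validation failed.
    """
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        # Exactly 2 columns, in schema order
        if reader.fieldnames != [id_col, pred_col]:
            return False, f"expected columns [{id_col}, {pred_col}], got {reader.fieldnames}"
        seen, count = set(), 0
        for row in reader:
            count += 1
            seen.add(row[id_col])
            try:
                p = float(row[pred_col])
            except ValueError:
                return False, f"non-float prediction: {row[pred_col]!r}"
            if not 0.0 <= p <= 1.0:
                return False, f"prediction out of [0, 1]: {p}"
    if count != n_rows:
        return False, f"expected {n_rows} rows, got {count}"
    if seen != test_ids:
        return False, "ID set does not match test_features"
    return True, "ok"
```

A failed check means no POST is sent, so malformed files never reach the server or burn quota.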
Submit
gtb submit <task> --file <preds.csv> --agent <name>
This:
- Validates schema locally (no API call if malformed).
- POSTs multipart/form-data to $GRAPHTESTBED_API/submit with task=<task>&agent=<name>&file=<csv>.
- Server re-validates schema, checks quota, scores against GT, returns metrics JSON.
- Client prints metrics + leaderboard rank + remaining quota.
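The POST itself is a plain multipart upload. A minimal sketch of how the client could assemble it (the function name is hypothetical; the URL, form fields, and $GRAPHTESTBED_API fallback follow the contract above, with localhost as an assumed default):

```python
import os

def build_submit_request(task: str, agent: str, csv_bytes: bytes) -> dict:
    """Assemble the pieces of the multipart POST to the scoring API."""
    api = os.environ.get("GRAPHTESTBED_API", "http://localhost:8000")
    return {
        "url": f"{api}/submit",
        # Plain form fields
        "data": {"task": task, "agent": agent},
        # Multipart file part: (filename, bytes, content type)
        "files": {"file": ("preds.csv", csv_bytes, "text/csv")},
    }
```

With the `requests` library this maps directly onto `requests.post(req["url"], data=req["data"], files=req["files"], timeout=60)`, after which the 4xx codes from §4 surface as the response status.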
4. API endpoints
POST /submit (multipart)
Form fields:
- task (str, required)
- agent (str, required)
- file (CSV, required, ≤50 MB)
Response 200:
{
"run_id": "8a3f29bce4a1",
"task": "arxiv-citation",
"agent": "autopipe-v0.4",
"primary": 0.689,
"secondary": {"auc_pr": 0.661, "f1": 0.591},
"n_rows": 19394,
"leaderboard_rank": 3,
"quota_remaining": 4,
"submitted_at": "2026-04-18T14:23:00"
}
Response 4xx:
- 400: missing form fields, malformed CSV
- 404: unknown task
- 422: schema check failed (returns reason)
- 429: quota exceeded
- 503: GT not deployed for this task
GET /leaderboard/<task>
Returns array sorted by primary desc, best per agent (not all
submissions):
[
{"agent": "human-baseline", "primary": 0.901, "n_submissions": 1, "first_seen": "..."},
{"agent": "autopipe-v0.4", "primary": 0.689, "n_submissions": 4, "first_seen": "..."},
...
]
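The "best per agent" collapse the endpoint performs can be sketched like this, assuming each stored run is a dict shaped like the sqlite row in §6 (the function is illustrative, not the server's code):

```python
def leaderboard(runs):
    """Collapse all runs to each agent's best primary score, sorted desc."""
    best = {}
    for run in runs:
        agent = run["agent"]
        entry = best.get(agent)
        if entry is None:
            best[agent] = {"agent": agent, "primary": run["primary"],
                           "n_submissions": 1, "first_seen": run["submitted_at"]}
        else:
            entry["n_submissions"] += 1
            entry["primary"] = max(entry["primary"], run["primary"])
            entry["first_seen"] = min(entry["first_seen"], run["submitted_at"])
    return sorted(best.values(), key=lambda r: -r["primary"])
```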
GET /healthz
{
"status": "ok",
"tasks": ["ieee-fraud-detection", ...],
"gt_present": ["arxiv-citation", "figraph"],
"quota_per_day": 5,
"uptime_unix": 1745081234
}
5. Anti-leakage rails (lightweight)
| Rail | Mechanism |
|---|---|
| Test labels not in git | .gitignore blocks ground_truth*, private/. GT lives only on server fs. |
| Score bucketing | Server rounds to 3 decimals before returning |
| No per-row feedback | API returns aggregate metrics only |
| Quota | 5 submissions / day / IP / task by default |
| Schema check first | Malformed submissions don't burn quota and don't expose GT to scoring |
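The bucketing and quota rails are small enough to sketch inline. This version keeps counts in an in-memory dict for clarity; a real server would persist them (e.g. in the sqlite store of §6), and the names here are assumptions:

```python
from collections import defaultdict
from datetime import date

QUOTA_PER_DAY = 5
_counts = defaultdict(int)  # (ip, task, day) -> submissions consumed today

def check_quota(ip: str, task: str, today: date) -> bool:
    """Consume one submission slot; False once the daily quota is spent."""
    key = (ip, task, today.isoformat())
    if _counts[key] >= QUOTA_PER_DAY:
        return False  # -> 429 quota exceeded
    _counts[key] += 1
    return True

def bucket(score: float) -> float:
    """Round to 3 decimals so responses can't be used as a fine-grained oracle."""
    return round(score, 3)
```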
6. Reproducibility
Every server response includes run_id. Server stores in sqlite:
run_id, task, agent, primary_metric, secondary_json, submission_sha256,
n_rows, submitter_ip, submitted_at
To audit a leaderboard entry: maintainer queries sqlite by run_id, gets the
submission CSV from submissions/<task>/<agent>/<ts>.csv (the client
doesn't push these to git automatically; see §8 if you want that).
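A minimal sketch of the sqlite side, using the columns listed above (schema and helper names are illustrative; the run_id format is assumed to be a short random hex string matching the example response):

```python
import hashlib
import sqlite3
import uuid

def init_db(conn):
    conn.execute("""CREATE TABLE IF NOT EXISTS runs (
        run_id TEXT PRIMARY KEY, task TEXT, agent TEXT,
        primary_metric REAL, secondary_json TEXT,
        submission_sha256 TEXT, n_rows INTEGER,
        submitter_ip TEXT, submitted_at TEXT)""")

def record_run(conn, task, agent, primary, secondary_json,
               csv_bytes, n_rows, ip, ts):
    """Store one scored submission; returns the run_id echoed to the client."""
    run_id = uuid.uuid4().hex[:12]          # e.g. "8a3f29bce4a1"
    sha = hashlib.sha256(csv_bytes).hexdigest()
    conn.execute("INSERT INTO runs VALUES (?,?,?,?,?,?,?,?,?)",
                 (run_id, task, agent, primary, secondary_json,
                  sha, n_rows, ip, ts))
    return run_id

def audit(conn, run_id):
    """Maintainer-side lookup of a leaderboard entry by run_id."""
    return conn.execute("SELECT * FROM runs WHERE run_id = ?",
                        (run_id,)).fetchone()
```

The stored submission_sha256 is what lets an auditor confirm an archived CSV is byte-identical to what was scored.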
7. Versioning
Dataset versions
manifest.yaml pins (hf_repo, hf_revision, sha256) per file. To bump:
- Push new files to HF as a new revision.
- Update manifest.yaml's sha256 fields.
- Existing leaderboard entries reference the old data via their run_id's stored submission_sha256; new submissions use the new data.
For breaking changes (different metric, different test rows), add a v2
task entry: arxiv-citation-v2. Don't mutate v1's manifest.
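Checking a downloaded file against its manifest pin is one hash comparison; a sketch (the function name is an assumption):

```python
import hashlib

def verify_pin(local_path: str, expected_sha256: str) -> bool:
    """Check a downloaded dataset file against its manifest.yaml sha256 pin."""
    h = hashlib.sha256()
    with open(local_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            h.update(chunk)
    return h.hexdigest() == expected_sha256
```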
Metric versions
Same rule: don't mutate, add <task>-v2 if metric definition changes.
8. Optional: client-side submission archiving
If you want every submission CSV preserved publicly, add a step to
graphtestbed/submit.py that opens a PR with the CSV under
submissions/<task>/<agent>/<ts>.csv after the API returns. Off by default
to keep the client simple.
9. What the agent must NOT assume
- No persistent state between submissions. Each /submit is independent.
- No mid-run feedback. One scorer call per submit, full stop.
- No private leaderboard. Every submission is rate-counted by IP and scored; no shadow leaderboard for testing.
- The agent field is your name. The server takes you at your word; don't impersonate.
10. What the API explicitly does NOT defend against
- An agent that downloads upstream data (Kaggle, FiGraph github) and tries to recompute test labels. We re-split for time-forward eval, so upstream labels don't directly map, but a determined agent could try.
- An agent that checks out the server branch to read scoring code. The branch is public; reading the scoring logic doesn't leak GT.
- An agent that runs on a private machine and tries to OCR a maintainer's laptop screen. We are not your adversary.
If your threat model includes adversarial agents with arbitrary capability, this design is wrong; use container-isolated evaluation instead.
11. Rationale (why this vs the prior GitHub-Actions design)
We considered using GitHub Actions + a private GT repo. Rejected because:
- Two repos (public + private GT) is more moving parts than one repo + one small server.
- A FastAPI/Flask app on a $5/month VM (or free HF Space) gives us the same privacy boundary (GT lives on server fs, not in git).
- Synchronous scoring (instant response) is a better UX than async PR comments.
- The trust model is "agents are honest within an academic benchmark," not "agents are adversaries." This doesn't need physical isolation; it needs an API contract that prevents accidental leaks.
The result: ~200 lines of server code, ~150 lines of client code, deploys on free tiers, no maintenance beyond restarting the server if it crashes.