# GraphTestbed Protocol
The contract between agent harnesses, the public client (`gtb`), and the
scoring API (`server/api.py`).
## 1. Two-branch single-repo layout
```
GraphTestbed (single public repo)
├── main branch → what agents `git clone`
└── server branch → what the maintainer deploys
```
Branches are **not access control**; they're an organizational split. Both
are public; both can be cloned by anyone. The actual privacy boundary is:
- **Test labels never enter git** (any branch).
- The scoring server holds GT at `/var/graphtestbed/gt/<task>.csv`,
populated separately from git at deploy time.
## 2. Data layout (per task)
### What the agent sees (HuggingFace public dataset)
```
graphtestbed/<task>:
train_features.csv # labeled
val_features.csv # labeled
test_features.csv # NO target column
description.md
sample_submission.csv
<auxiliary>.csv # optional: edges*.csv, train_text.csv, ...
```
### What the server holds privately
```
/var/graphtestbed/gt/<task>.csv
<id_col>,Label # one row per test entity, with the true label
```
The GT ID column matches `test_features.csv[<id_col>]` exactly: same ID set,
one row per test entity.
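The exact-match requirement reduces to a set comparison between the two files' ID columns. A minimal sketch using only the stdlib (`ids_match` is illustrative; the column name is per-task and comes from the manifest):

```python
import csv
import io

def ids_match(test_csv: str, gt_csv: str, id_col: str) -> bool:
    """True iff the GT ID set equals the test-feature ID set exactly."""
    def id_set(text: str) -> set:
        return {row[id_col] for row in csv.DictReader(io.StringIO(text))}
    return id_set(test_csv) == id_set(gt_csv)
```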
## 3. Submission contract
### Format
Per-task schema in `datasets/manifest.yaml`:
```yaml
<task>:
submission_schema:
id_col: <name>
pred_col: <name>
n_rows: <int>
pred_dtype: float # binary classification
```
CSV must have:
- Exactly 2 columns: `<id_col>, <pred_col>`
- Exactly `n_rows` rows
- ID set 100% matching `test_features.csv[<id_col>]`
- `<pred_col>` values: float in [0, 1] for binary
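A client-side check of these rules might look like the following sketch (`validate_submission` and its error strings are illustrative, not the real `gtb` code; the ID-set match against `test_features.csv` needs that file as well and is omitted here):

```python
import csv
import io

def validate_submission(text: str, schema: dict) -> list:
    """Return a list of schema violations (empty list means valid).

    `schema` mirrors one task's submission_schema block from
    datasets/manifest.yaml, e.g.
    {"id_col": "node_id", "pred_col": "pred", "n_rows": 2, "pred_dtype": "float"}.
    """
    errors = []
    reader = csv.DictReader(io.StringIO(text))
    expected_cols = [schema["id_col"], schema["pred_col"]]
    if reader.fieldnames != expected_cols:
        errors.append(f"columns must be exactly {expected_cols}, got {reader.fieldnames}")
        return errors  # column errors make row checks meaningless
    rows = list(reader)
    if len(rows) != schema["n_rows"]:
        errors.append(f"expected {schema['n_rows']} rows, got {len(rows)}")
    for row in rows:
        try:
            p = float(row[schema["pred_col"]])
        except ValueError:
            errors.append(f"non-float prediction for id {row[schema['id_col']]}")
            continue
        if not 0.0 <= p <= 1.0:
            errors.append(f"prediction out of [0, 1] for id {row[schema['id_col']]}")
    return errors
```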
### Submit
```bash
gtb submit <task> --file <preds.csv> --agent <name>
```
This:
1. Validates schema **locally** (no API call if malformed).
2. POSTs `multipart/form-data` to `$GRAPHTESTBED_API/submit` with form
   fields `task=<task>`, `agent=<name>`, and `file=<csv>`.
3. Server re-validates schema, checks quota, scores against GT, returns
metrics JSON.
4. Client prints metrics + leaderboard rank + remaining quota.
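Steps 2 and 3 can be sketched with the stdlib alone (endpoint and field names follow the contract above; `build_multipart` is an illustrative helper, and the real client may use a library such as `requests` instead):

```python
import json
import os
import urllib.request

def build_multipart(fields: dict, file_text: str, boundary: str = "gtb") -> bytes:
    """Encode form fields plus one CSV file part as multipart/form-data."""
    parts = []
    for name, value in fields.items():
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'
        )
    parts.append(
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="file"; filename="submission.csv"\r\n'
        "Content-Type: text/csv\r\n\r\n"
        f"{file_text}\r\n--{boundary}--\r\n"
    )
    return "".join(parts).encode()

def submit(task: str, agent: str, csv_text: str) -> dict:
    """POST a submission and return the server's metrics JSON."""
    api = os.environ.get("GRAPHTESTBED_API", "http://localhost:8000")
    body = build_multipart({"task": task, "agent": agent}, csv_text)
    req = urllib.request.Request(
        f"{api}/submit",
        data=body,
        headers={"Content-Type": "multipart/form-data; boundary=gtb"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Step 1's local schema validation would run before `submit` is ever called, so malformed files never reach the network.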
## 4. API endpoints
### `POST /submit` (multipart)
Form fields:
- `task` (str, required)
- `agent` (str, required)
- `file` (CSV, required, ≤50 MB)
Response 200:
```json
{
"run_id": "8a3f29bce4a1",
"task": "arxiv-citation",
"agent": "autopipe-v0.4",
"primary": 0.689,
"secondary": {"auc_pr": 0.661, "f1": 0.591},
"n_rows": 19394,
"leaderboard_rank": 3,
"quota_remaining": 4,
"submitted_at": "2026-04-18T14:23:00"
}
```
Error responses:
- 400: missing form fields, malformed CSV
- 404: unknown task
- 422: schema check failed (response includes the reason)
- 429: quota exceeded
- 503: GT not deployed for this task
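The status dispatch can be summarized as a pure function (a sketch, not the real `server/api.py`; the predicate arguments stand in for the server's actual quota, schema, and GT checks). Note the ordering: the schema check runs before the quota check, so malformed submissions don't burn quota:

```python
def handle_submit(task, agent, csv_bytes, *, known_tasks, gt_present,
                  quota_left, schema_ok, max_bytes=50 * 2**20):
    """Map one POST /submit to (http_status, payload)."""
    if not task or not agent or csv_bytes is None:
        return 400, {"error": "missing form fields"}
    if len(csv_bytes) > max_bytes:
        return 400, {"error": "file too large"}
    if task not in known_tasks:
        return 404, {"error": f"unknown task {task!r}"}
    ok, reason = schema_ok(csv_bytes)  # schema first: failures don't burn quota
    if not ok:
        return 422, {"error": reason}
    if quota_left <= 0:
        return 429, {"error": "quota exceeded"}
    if task not in gt_present:
        return 503, {"error": "ground truth not deployed"}
    return 200, {"status": "scored"}  # real server returns the metrics JSON here
```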
### `GET /leaderboard/<task>`
Returns an array sorted by `primary` descending, keeping the **best entry
per agent** (not all submissions):
```json
[
{"agent": "human-baseline", "primary": 0.901, "n_submissions": 1, "first_seen": "..."},
{"agent": "autopipe-v0.4", "primary": 0.689, "n_submissions": 4, "first_seen": "..."},
...
]
```
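The best-per-agent reduction over stored runs can be sketched as follows (`leaderboard` is illustrative; `first_seen` is omitted for brevity):

```python
def leaderboard(runs: list) -> list:
    """Keep each agent's best run, count their submissions, sort by primary desc."""
    best = {}
    counts = {}
    for run in runs:
        agent = run["agent"]
        counts[agent] = counts.get(agent, 0) + 1
        if agent not in best or run["primary"] > best[agent]["primary"]:
            best[agent] = run
    rows = [
        {"agent": agent, "primary": run["primary"], "n_submissions": counts[agent]}
        for agent, run in best.items()
    ]
    return sorted(rows, key=lambda r: -r["primary"])
```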
### `GET /healthz`
```json
{
"status": "ok",
"tasks": ["ieee-fraud-detection", ...],
"gt_present": ["arxiv-citation", "figraph"],
"quota_per_day": 5,
"uptime_unix": 1745081234
}
```
## 5. Anti-leakage rails (lightweight)
| Rail | Mechanism |
|---|---|
| Test labels not in git | `.gitignore` blocks `ground_truth*`, `private/`. GT lives only on server fs. |
| Score bucketing | Server rounds to 3 decimals before returning |
| No per-row feedback | API returns aggregate metrics only |
| Quota | 5 submissions / day / IP / task by default |
| Schema check first | Malformed submissions don't burn quota and don't expose GT to scoring |
## 6. Reproducibility
Every server response includes `run_id`. The server stores each run in SQLite:
```
run_id, task, agent, primary_metric, secondary_json, submission_sha256,
n_rows, submitter_ip, submitted_at
```
To audit a leaderboard entry: the maintainer queries SQLite by `run_id` and
retrieves the submission CSV from `submissions/<task>/<agent>/<ts>.csv` (the
client doesn't push these to git automatically; see §8 if you want that).
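A minimal sketch of the store-and-audit round trip, assuming the column set above (table and function names are illustrative):

```python
import hashlib
import sqlite3

SCHEMA = """CREATE TABLE IF NOT EXISTS runs (
    run_id TEXT PRIMARY KEY, task TEXT, agent TEXT,
    primary_metric REAL, secondary_json TEXT, submission_sha256 TEXT,
    n_rows INTEGER, submitter_ip TEXT, submitted_at TEXT
)"""

def record_run(db: sqlite3.Connection, row: dict, csv_bytes: bytes) -> str:
    """Store one scored submission; returns the sha256 used for auditing."""
    db.execute(SCHEMA)
    sha = hashlib.sha256(csv_bytes).hexdigest()
    db.execute(
        "INSERT INTO runs VALUES (?,?,?,?,?,?,?,?,?)",
        (row["run_id"], row["task"], row["agent"], row["primary_metric"],
         row["secondary_json"], sha, row["n_rows"], row["submitter_ip"],
         row["submitted_at"]),
    )
    return sha

def audit(db: sqlite3.Connection, run_id: str):
    """Look up a leaderboard entry by run_id; None if unknown."""
    cur = db.execute(
        "SELECT task, agent, primary_metric, submission_sha256 "
        "FROM runs WHERE run_id = ?", (run_id,))
    hit = cur.fetchone()
    if hit is None:
        return None
    return dict(zip(("task", "agent", "primary_metric", "submission_sha256"), hit))
```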
## 7. Versioning
### Dataset versions
`manifest.yaml` pins `(hf_repo, hf_revision, sha256)` per file. To bump:
1. Push new files to HF as a new revision.
2. Update `manifest.yaml`'s sha256 fields.
3. Existing leaderboard entries reference the old data via their `run_id`'s
stored `submission_sha256`; new submissions use the new data.
For breaking changes (different metric, different test rows), add a v2
task entry: `arxiv-citation-v2`. Don't mutate v1's manifest.
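Checking a downloaded task file against its manifest pin is a one-line hash comparison (`verify_pin` is an illustrative name, not real client code):

```python
import hashlib

def verify_pin(file_bytes: bytes, pinned_sha256: str) -> bool:
    """Check a downloaded task file against its manifest.yaml sha256 pin."""
    return hashlib.sha256(file_bytes).hexdigest() == pinned_sha256
```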
### Metric versions
Same rule: don't mutate, add `<task>-v2` if metric definition changes.
## 8. Optional: client-side submission archiving
If you want every submission CSV preserved publicly, add a step to
`graphtestbed/submit.py` that opens a PR with the CSV under
`submissions/<task>/<agent>/<ts>.csv` after the API returns. Off by default
to keep the client simple.
## 9. What the agent must NOT assume
- **No persistent state** between submissions. Each `/submit` is independent.
- **No mid-run feedback.** One scorer call per submit, full stop.
- **No private leaderboard.** Every submission is rate-counted by IP and
  scored; there is no shadow leaderboard for testing.
- **`agent` field is your name.** The server takes you at your word; don't
impersonate.
## 10. What the API explicitly does NOT defend against
- An agent that downloads upstream data (Kaggle, the FiGraph GitHub repo) and
  tries to recompute test labels. We re-split for time-forward evaluation, so
  upstream labels don't map directly, but a determined agent could try.
- An agent that checks out `server` branch to read scoring code. The branch
is public; reading the scoring logic doesn't leak GT.
- An agent that runs on a private machine and tries to OCR a maintainer's
laptop screen. We are not your adversary.
If your threat model includes adversarial agents with arbitrary capability,
this design is wrong; use container-isolated evaluation instead.
## 11. Rationale (why this vs the prior GitHub-Actions design)
We considered using GitHub Actions + a private GT repo. Rejected because:
- Two repos (public + private GT) is more moving parts than one repo + one
small server.
- A FastAPI/Flask app on a $5/month VM (or free HF Space) gives us the same
privacy boundary (GT lives on server fs, not in git).
- Synchronous scoring (instant response) is a better UX than async PR
comments.
- The trust model is "agents are honest within an academic benchmark," not
"agents are adversaries." Doesn't need physical isolation; needs an API
contract that prevents accidental leaks.
The result: ~200 lines of server code, ~150 lines of client code, deploys
on free tiers, no maintenance beyond restarting the server if it crashes.