# GraphTestbed Protocol
The contract between agent harnesses, the public client (`gtb`), and the
scoring API (`server/api.py`).
## 1. Two-branch single-repo layout
```
GraphTestbed (single public repo)
├── main branch    ← what agents `git clone`
└── server branch  ← what the maintainer deploys
```
Branches are **not access control**; they're an organizational split. Both
are public; both can be cloned by anyone. The actual privacy boundary is:
- **Test labels never enter git** (any branch).
- The scoring server holds GT at `/var/graphtestbed/gt/<task>.csv`,
populated separately from git at deploy time.
## 2. Data layout (per task)
### What the agent sees (HuggingFace public dataset)
```
graphtestbed/<task>:
train_features.csv # labeled
val_features.csv # labeled
test_features.csv # NO target column
description.md
sample_submission.csv
<auxiliary>.csv # optional: edges*.csv, train_text.csv, ...
```
### What the server holds privately
```
/var/graphtestbed/gt/<task>.csv
<id_col>,Label # one row per test entity, with the true label
```
The GT ID column matches `test_features.csv[<id_col>]` exactly (100% set overlap).
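A deploy-time sanity check a maintainer might run, as a minimal sketch (assumes pandas; the task name and `node_id` column are illustrative, since the real `id_col` comes from `datasets/manifest.yaml`):

```python
import pandas as pd

# Illustrative task and id_col; real values come from datasets/manifest.yaml.
gt = pd.read_csv("/var/graphtestbed/gt/arxiv-citation.csv")
test = pd.read_csv("test_features.csv")

# The GT ID set must equal the public test ID set exactly.
assert set(gt["node_id"]) == set(test["node_id"]), "GT/test ID mismatch"
```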
## 3. Submission contract
### Format
Per-task schema in `datasets/manifest.yaml`:
```yaml
<task>:
submission_schema:
id_col: <name>
pred_col: <name>
n_rows: <int>
pred_dtype: float # binary classification
```
The CSV must have (see the validation sketch after this list):
- Exactly 2 columns: `<id_col>, <pred_col>`
- Exactly `n_rows` rows
- ID set 100% matching `test_features.csv[<id_col>]`
- `<pred_col>` values: float in [0, 1] for binary
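A minimal sketch of that check, assuming pandas and a parsed `submission_schema` dict (the function name is illustrative, not the actual `gtb` internals):

```python
import pandas as pd

def validate_submission(preds_path: str, test_path: str, schema: dict) -> None:
    """Check a submission CSV against a task's submission_schema."""
    df = pd.read_csv(preds_path)
    id_col, pred_col = schema["id_col"], schema["pred_col"]

    if list(df.columns) != [id_col, pred_col]:
        raise ValueError(f"expected columns [{id_col}, {pred_col}], got {list(df.columns)}")
    if len(df) != schema["n_rows"]:
        raise ValueError(f"expected {schema['n_rows']} rows, got {len(df)}")

    # ID set must match test_features.csv exactly (no missing, no extras).
    test_ids = set(pd.read_csv(test_path, usecols=[id_col])[id_col])
    if set(df[id_col]) != test_ids:
        raise ValueError("ID set does not match test_features.csv")

    if schema["pred_dtype"] == "float" and not df[pred_col].between(0.0, 1.0).all():
        raise ValueError(f"{pred_col} values must be floats in [0, 1]")
```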
### Submit
```bash
gtb submit <task> --file <preds.csv> --agent <name>
```
This:
1. Validates schema **locally** (no API call if malformed).
2. POSTs `multipart/form-data` to `$GRAPHTESTBED_API/submit` with form fields
   `task=<task>`, `agent=<name>`, and the CSV as `file` (see the sketch below).
3. Server re-validates schema, checks quota, scores against GT, returns
metrics JSON.
4. Client prints metrics + leaderboard rank + remaining quota.
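Step 2 above is a single multipart POST; a sketch with `requests`, using the endpoint and `$GRAPHTESTBED_API` env var as documented (error handling beyond `raise_for_status` elided):

```python
import os

import requests

def post_submission(task: str, agent: str, csv_path: str) -> dict:
    """POST a locally validated submission; return the server's metrics JSON."""
    url = f"{os.environ['GRAPHTESTBED_API']}/submit"
    with open(csv_path, "rb") as f:
        resp = requests.post(
            url,
            data={"task": task, "agent": agent},
            files={"file": (os.path.basename(csv_path), f, "text/csv")},
            timeout=60,
        )
    resp.raise_for_status()  # 4xx (schema/quota) and 503 surface here
    return resp.json()
```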
## 4. API endpoints
### `POST /submit` (multipart)
Form fields:
- `task` (str, required)
- `agent` (str, required)
- `file` (CSV, required, ≤50 MB)
Response 200:
```json
{
"run_id": "8a3f29bce4a1",
"task": "arxiv-citation",
"agent": "autopipe-v0.4",
"primary": 0.689,
"secondary": {"auc_pr": 0.661, "f1": 0.591},
"n_rows": 19394,
"leaderboard_rank": 3,
"quota_remaining": 4,
"submitted_at": "2026-04-18T14:23:00"
}
```
Response 4xx:
- 400 – missing form fields, malformed CSV
- 404 – unknown task
- 422 – schema check failed (returns reason)
- 429 – quota exceeded
- 503 – GT not deployed for this task
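A sketch of how `server/api.py` might order these checks in FastAPI; `load_schema`, `validate_schema`, `check_quota`, and `score` are hypothetical helpers standing in for the real logic:

```python
import io

import pandas as pd
from fastapi import FastAPI, File, Form, HTTPException, Request, UploadFile

app = FastAPI()
GT_DIR = "/var/graphtestbed/gt"

@app.post("/submit")
async def submit(
    request: Request,
    task: str = Form(...),
    agent: str = Form(...),
    file: UploadFile = File(...),
):
    schema = load_schema(task)  # hypothetical: looks up datasets/manifest.yaml
    if schema is None:
        raise HTTPException(404, "unknown task")

    raw = await file.read()
    if len(raw) > 50 * 2**20:
        raise HTTPException(400, "file too large (limit 50 MB)")
    try:
        preds = pd.read_csv(io.BytesIO(raw))
    except Exception:
        raise HTTPException(400, "malformed CSV")

    # Schema check runs before quota, so bad submissions neither burn
    # quota nor reach the scorer (anti-leakage rail, §5).
    error = validate_schema(preds, schema)  # hypothetical re-validation helper
    if error:
        raise HTTPException(422, error)
    if not check_quota(task, request.client.host):  # hypothetical per-IP/day counter
        raise HTTPException(429, "quota exceeded")

    try:
        gt = pd.read_csv(f"{GT_DIR}/{task}.csv")
    except FileNotFoundError:
        raise HTTPException(503, "GT not deployed for this task")

    return score(preds, gt, task, agent)  # hypothetical: builds the metrics JSON
```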
### `GET /leaderboard/<task>`
Returns an array sorted by `primary` descending, **best per agent** (not all
submissions):
```json
[
{"agent": "human-baseline", "primary": 0.901, "n_submissions": 1, "first_seen": "..."},
{"agent": "autopipe-v0.4", "primary": 0.689, "n_submissions": 4, "first_seen": "..."},
...
]
```
### `GET /healthz`
```json
{
"status": "ok",
"tasks": ["ieee-fraud-detection", ...],
"gt_present": ["arxiv-citation", "figraph"],
"quota_per_day": 5,
"uptime_unix": 1745081234
}
```
## 5. Anti-leakage rails (lightweight)
| Rail | Mechanism |
|---|---|
| Test labels not in git | `.gitignore` blocks `ground_truth*`, `private/`. GT lives only on server fs. |
| Score bucketing | Server rounds to 3 decimals before returning |
| No per-row feedback | API returns aggregate metrics only |
| Quota | 5 submissions / day / IP / task by default (sketched below) |
| Schema check first | Malformed submissions don't burn quota and don't expose GT to scoring |
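The quota rail, for example, is one count per (IP, task, day); a sketch against the run store from §6 (the `runs` table name and ISO-string timestamps are assumptions):

```python
import sqlite3

QUOTA_PER_DAY = 5

def quota_remaining(con: sqlite3.Connection, task: str, ip: str) -> int:
    """How many submissions this (IP, task) pair has left today."""
    (used,) = con.execute(
        """
        SELECT COUNT(*) FROM runs
        WHERE task = ? AND submitter_ip = ?
          AND DATE(submitted_at) = DATE('now')
        """,
        (task, ip),
    ).fetchone()
    return max(0, QUOTA_PER_DAY - used)
```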
## 6. Reproducibility
Every server response includes a `run_id`. For each run, the server stores in sqlite:
```
run_id, task, agent, primary_metric, secondary_json, submission_sha256,
n_rows, submitter_ip, submitted_at
```
To audit a leaderboard entry, the maintainer queries sqlite by `run_id` and
retrieves the submission CSV from `submissions/<task>/<agent>/<ts>.csv` (the
client doesn't push these to git automatically; see §8 if you want that).
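As a sketch, the store can be a single table; the name `runs` and the SQL types are assumptions, the columns are the ones listed above:

```python
import sqlite3

con = sqlite3.connect("/var/graphtestbed/runs.db")  # path is illustrative
con.execute(
    """
    CREATE TABLE IF NOT EXISTS runs (
        run_id            TEXT PRIMARY KEY,
        task              TEXT NOT NULL,
        agent             TEXT NOT NULL,
        primary_metric    REAL NOT NULL,
        secondary_json    TEXT,
        submission_sha256 TEXT NOT NULL,
        n_rows            INTEGER NOT NULL,
        submitter_ip      TEXT NOT NULL,
        submitted_at      TEXT NOT NULL
    )
    """
)
con.commit()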
## 7. Versioning
### Dataset versions
`manifest.yaml` pins `(hf_repo, hf_revision, sha256)` per file. To bump:
1. Push new files to HF as a new revision.
2. Update `manifest.yaml`'s sha256 fields.
3. Existing leaderboard entries reference the old data via their `run_id`'s
stored `submission_sha256`; new submissions use the new data.
For breaking changes (different metric, different test rows), add a v2
task entry: `arxiv-citation-v2`. Don't mutate v1's manifest.
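A sketch of verifying the pins on the consumer side (assumes PyYAML; the per-file manifest layout in the comment is an assumption, since §3 only shows the `submission_schema` block):

```python
import hashlib

import yaml

def verify_pins(manifest_path: str, task: str, local_files: dict[str, str]) -> None:
    """Compare local file hashes to the sha256 pins in manifest.yaml.

    Assumed manifest layout: <task> -> files -> <filename> -> sha256.
    `local_files` maps filename -> downloaded local path.
    """
    with open(manifest_path) as fh:
        manifest = yaml.safe_load(fh)
    pins = manifest[task]["files"]
    for name, path in local_files.items():
        with open(path, "rb") as fh:
            digest = hashlib.sha256(fh.read()).hexdigest()
        if digest != pins[name]["sha256"]:
            raise ValueError(f"{name}: sha256 mismatch with pinned revision")
```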
### Metric versions
Same rule: don't mutate, add `<task>-v2` if metric definition changes.
## 8. Optional: client-side submission archiving
If you want every submission CSV preserved publicly, add a step to
`graphtestbed/submit.py` that opens a PR with the CSV under
`submissions/<task>/<agent>/<ts>.csv` after the API returns. Off by default
to keep the client simple.
## 9. What the agent must NOT assume
- **No persistent state** between submissions. Each `/submit` is independent.
- **No mid-run feedback.** One scorer call per submit, full stop.
- **No private leaderboard.** Every submission is rate-counted by IP and
  scored; no shadow leaderboard for testing.
- **`agent` field is your name.** The server takes you at your word; don't
impersonate.
## 10. What the API explicitly does NOT defend against
- An agent that downloads upstream data (Kaggle, the FiGraph GitHub repo) and
  tries to recompute test labels. We re-split for time-forward eval, so
  upstream labels don't map directly, but a determined agent could try.
- An agent that checks out `server` branch to read scoring code. The branch
is public; reading the scoring logic doesn't leak GT.
- An agent that runs on a private machine and tries to OCR a maintainer's
laptop screen. We are not your adversary.
If your threat model includes adversarial agents with arbitrary capability,
this design is wrong; use container-isolated evaluation instead.
## 11. Rationale (why this vs the prior GitHub-Actions design)
We considered using GitHub Actions plus a private GT repo and rejected it because:
- Two repos (public + private GT) is more moving parts than one repo + one
small server.
- A FastAPI/Flask app on a $5/month VM (or free HF Space) gives us the same
privacy boundary (GT lives on server fs, not in git).
- Synchronous scoring (instant response) is a better UX than async PR
comments.
- The trust model is "agents are honest within an academic benchmark," not
  "agents are adversaries." That model doesn't need physical isolation; it
  needs an API contract that prevents accidental leaks.
The result: ~200 lines of server code, ~150 lines of client code, deploys
on free tiers, no maintenance beyond restarting the server if it crashes.