# GraphTestbed Protocol

The contract between agent harnesses, the public client (`gtb`), and the
scoring API (`server/api.py`).

## 1. Two-branch single-repo layout

```
GraphTestbed (single public repo)
├── main branch     → what agents `git clone`
└── server branch   → what the maintainer deploys
```

Branches are **not access control**; they're an organizational split. Both
are public; both can be cloned by anyone. The actual privacy boundary is:

- **Test labels never enter git** (any branch).
- The scoring server holds GT at `/var/graphtestbed/gt/<task>.csv`,
  populated separately from git at deploy time.

## 2. Data layout (per task)

### What the agent sees (HuggingFace public dataset)

```
graphtestbed/<task>:
  train_features.csv       # labeled
  val_features.csv         # labeled
  test_features.csv        # NO target column
  description.md
  sample_submission.csv
  <auxiliary>.csv          # optional: edges*.csv, train_text.csv, ...
```

### What the server holds privately

```
/var/graphtestbed/gt/<task>.csv
  <id_col>,Label           # one row per test entity, with the true label
```

The GT ID column matches `test_features.csv[<id_col>]` exactly (same ID set,
same row count).
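
A quick check of that invariant, as a sketch (pandas assumed; the task name,
local paths, and `id_col` value are illustrative, not part of the contract):

```python
import pandas as pd

# Public test features (no target) vs. server-side GT for the same task.
test = pd.read_csv("test_features.csv")
gt = pd.read_csv("/var/graphtestbed/gt/arxiv-citation.csv")

id_col = "node_id"  # hypothetical; read the real name from manifest.yaml
assert set(gt[id_col]) == set(test[id_col]), "GT must cover exactly the test IDs"
assert len(gt) == len(test), "one GT row per test entity"
```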

## 3. Submission contract

### Format

Per-task schema in `datasets/manifest.yaml`:

```yaml
<task>:
  submission_schema:
    id_col: <name>
    pred_col: <name>
    n_rows: <int>
    pred_dtype: float    # binary classification
```

CSV must have:
- Exactly 2 columns: `<id_col>, <pred_col>`
- Exactly `n_rows` rows
- ID set exactly matching `test_features.csv[<id_col>]`
- `<pred_col>` values: float in [0, 1] for binary
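
A minimal local check of these rules, as a sketch (pandas assumed; `schema` is
the parsed `submission_schema` dict from `manifest.yaml`, and the function
name is hypothetical):

```python
import pandas as pd

def validate_submission(csv_path: str, schema: dict, test_ids: set) -> None:
    """Raise ValueError if the CSV violates the submission contract."""
    df = pd.read_csv(csv_path)
    expected = [schema["id_col"], schema["pred_col"]]
    if list(df.columns) != expected:
        raise ValueError(f"columns must be exactly {expected}, got {list(df.columns)}")
    if len(df) != schema["n_rows"]:
        raise ValueError(f"expected {schema['n_rows']} rows, got {len(df)}")
    if set(df[schema["id_col"]]) != test_ids:
        raise ValueError("ID set must match test_features.csv exactly")
    if schema["pred_dtype"] == "float" and not df[schema["pred_col"]].between(0, 1).all():
        raise ValueError("predictions must lie in [0, 1] for binary tasks")
```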

### Submit

```bash
gtb submit <task> --file <preds.csv> --agent <name>
```

This command:
1. Validates the schema **locally** (no API call if the file is malformed).
2. POSTs `multipart/form-data` to `$GRAPHTESTBED_API/submit` with form fields
   `task`, `agent`, and the CSV as `file` (sketched below).
3. The server re-validates the schema, checks quota, scores against GT, and
   returns metrics JSON.
4. The client prints metrics + leaderboard rank + remaining quota.
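
Step 2, concretely, as a sketch with the `requests` library (the real
implementation lives in `graphtestbed/submit.py`; the task and agent names
shown are illustrative):

```python
import os
import requests

api = os.environ["GRAPHTESTBED_API"]
with open("preds.csv", "rb") as f:
    resp = requests.post(
        f"{api}/submit",
        data={"task": "arxiv-citation", "agent": "autopipe-v0.4"},  # form fields
        files={"file": ("preds.csv", f, "text/csv")},               # multipart file part
        timeout=60,
    )
resp.raise_for_status()
print(resp.json())  # metrics + leaderboard rank + remaining quota
```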

## 4. API endpoints

### `POST /submit` (multipart)

Form fields:
- `task` (str, required)
- `agent` (str, required)
- `file` (CSV, required, ≤50 MB)

Response 200:
```json
{
  "run_id": "8a3f29bce4a1",
  "task": "arxiv-citation",
  "agent": "autopipe-v0.4",
  "primary": 0.689,
  "secondary": {"auc_pr": 0.661, "f1": 0.591},
  "n_rows": 19394,
  "leaderboard_rank": 3,
  "quota_remaining": 4,
  "submitted_at": "2026-04-18T14:23:00"
}
```

Error responses:
- 400: missing form fields, malformed CSV
- 404: unknown task
- 422: schema check failed (returns reason)
- 429: quota exceeded
- 503: GT not deployed for this task
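
For orientation, a heavily condensed sketch of what a `server/api.py` handler
could look like (FastAPI, pandas, and scikit-learn assumed; quota, `run_id`,
and per-task metric selection are elided, and `roc_auc_score` is illustrative
rather than the contractual primary metric):

```python
import os
import pandas as pd
from fastapi import FastAPI, File, Form, HTTPException, UploadFile
from sklearn.metrics import roc_auc_score

app = FastAPI()
GT_DIR = "/var/graphtestbed/gt"

@app.post("/submit")
async def submit(task: str = Form(...), agent: str = Form(...),
                 file: UploadFile = File(...)):
    gt_path = os.path.join(GT_DIR, f"{task}.csv")
    if not os.path.exists(gt_path):
        raise HTTPException(503, "GT not deployed for this task")
    try:
        preds = pd.read_csv(file.file)
    except Exception:
        raise HTTPException(400, "malformed CSV")
    # ... 404 (unknown task), 422 (schema), 429 (quota) checks go here ...
    gt = pd.read_csv(gt_path)
    merged = gt.merge(preds, on=gt.columns[0])  # join on the ID column
    score = roc_auc_score(merged["Label"], merged[preds.columns[1]])
    return {"task": task, "agent": agent, "primary": round(score, 3)}  # bucketed (§5)
```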

### `GET /leaderboard/<task>`

Returns an array sorted by `primary` descending, keeping the **best
submission per agent** (not every submission):
```json
[
  {"agent": "human-baseline", "primary": 0.901, "n_submissions": 1, "first_seen": "..."},
  {"agent": "autopipe-v0.4",  "primary": 0.689, "n_submissions": 4, "first_seen": "..."},
  ...
]
```
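
The "best per agent" view is a single aggregate over the run table from §6 (a
sketch; the table name `runs` and the db path are assumptions):

```python
import sqlite3

conn = sqlite3.connect("graphtestbed.db")  # illustrative path
leaderboard = conn.execute(
    """SELECT agent, MAX(primary_metric) AS best_primary,
              COUNT(*) AS n_submissions, MIN(submitted_at) AS first_seen
       FROM runs WHERE task = ?
       GROUP BY agent ORDER BY best_primary DESC""",
    ("arxiv-citation",),
).fetchall()
```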

### `GET /healthz`

```json
{
  "status": "ok",
  "tasks": ["ieee-fraud-detection", ...],
  "gt_present": ["arxiv-citation", "figraph"],
  "quota_per_day": 5,
  "uptime_unix": 1745081234
}
```

## 5. Anti-leakage rails (lightweight)

| Rail | Mechanism |
|---|---|
| Test labels not in git | `.gitignore` blocks `ground_truth*`, `private/`. GT lives only on server fs. |
| Score bucketing | Server rounds to 3 decimals before returning |
| No per-row feedback | API returns aggregate metrics only |
| Quota | 5 submissions / day / IP / task by default |
| Schema check first | Malformed submissions don't burn quota and don't expose GT to scoring |

## 6. Reproducibility

Every scoring response includes a `run_id`. The server stores each run in sqlite:

```
run_id, task, agent, primary_metric, secondary_json, submission_sha256,
n_rows, submitter_ip, submitted_at
```

To audit a leaderboard entry, the maintainer queries sqlite by `run_id` and
pulls the submission CSV from `submissions/<task>/<agent>/<ts>.csv` (the
client doesn't push these to git automatically; see §8 if you want that).
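
That audit is a few lines with the stdlib (a sketch; `runs` as the table name
and the db path are assumptions):

```python
import hashlib
import sqlite3

def audit(run_id: str, csv_path: str, db_path: str = "graphtestbed.db") -> bool:
    """Check an archived submission CSV is byte-identical to what was scored."""
    row = sqlite3.connect(db_path).execute(
        "SELECT submission_sha256 FROM runs WHERE run_id = ?", (run_id,)
    ).fetchone()
    if row is None:
        raise KeyError(f"unknown run_id {run_id}")
    with open(csv_path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest() == row[0]
```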

## 7. Versioning

### Dataset versions

`manifest.yaml` pins `(hf_repo, hf_revision, sha256)` per file. To bump:
1. Push new files to HF as a new revision.
2. Update `manifest.yaml`'s sha256 fields.
3. Existing leaderboard entries reference the old data via their `run_id`'s
   stored `submission_sha256`; new submissions use the new data.
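
Verifying a pin is straightforward (a sketch; the function name is
hypothetical, and the pinned digest comes from `manifest.yaml`):

```python
import hashlib

def check_pin(local_path: str, pinned_sha256: str) -> None:
    """Compare a downloaded dataset file against its manifest.yaml sha256 pin."""
    h = hashlib.sha256()
    with open(local_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    if h.hexdigest() != pinned_sha256:
        raise RuntimeError(f"{local_path} does not match its pin; "
                           "re-download at the pinned hf_revision")
```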

For breaking changes (different metric, different test rows), add a v2
task entry: `arxiv-citation-v2`. Don't mutate v1's manifest.

### Metric versions

Same rule: don't mutate; add `<task>-v2` if the metric definition changes.

## 8. Optional: client-side submission archiving

If you want every submission CSV preserved publicly, add a step to
`graphtestbed/submit.py` that opens a PR with the CSV under
`submissions/<task>/<agent>/<ts>.csv` after the API returns. Off by default
to keep the client simple.

## 9. What the agent must NOT assume

- **No persistent state** between submissions. Each `/submit` is independent.
- **No mid-run feedback.** One scorer call per submit, full stop.
- **No private leaderboard.** Every submission is rate-counted by IP and
  scored; there is no shadow leaderboard for testing.
- **`agent` field is your name.** The server takes you at your word; don't
  impersonate.

## 10. What the API explicitly does NOT defend against

- An agent that downloads upstream data (Kaggle, the FiGraph GitHub repo) and
  tries to recompute test labels. We re-split for time-forward eval, so upstream
  labels don't directly map, but a determined agent could try.
- An agent that checks out the `server` branch to read scoring code. The branch
  is public; reading the scoring logic doesn't leak GT.
- An agent that runs on a private machine and tries to OCR a maintainer's
  laptop screen. We are not your adversary.

If your threat model includes adversarial agents with arbitrary capability,
this design is wrong; use container-isolated evaluation instead.

## 11. Rationale (why this vs the prior GitHub-Actions design)

We considered using GitHub Actions + a private GT repo. Rejected because:

- Two repos (public + private GT) is more moving parts than one repo + one
  small server.
- A FastAPI/Flask app on a $5/month VM (or free HF Space) gives us the same
  privacy boundary (GT lives on server fs, not in git).
- Synchronous scoring (instant response) is a better UX than async PR
  comments.
- The trust model is "agents are honest within an academic benchmark," not
  "agents are adversaries." Doesn't need physical isolation; needs an API
  contract that prevents accidental leaks.

The result: ~200 lines of server code and ~150 lines of client code; it
deploys on free tiers and needs no maintenance beyond restarting the server
if it crashes.