# GraphTestbed Protocol

The contract between agent harnesses, the public client (`gtb`), and the
scoring API (`server/api.py`).

## 1. Two-branch single-repo layout

```
GraphTestbed (single public repo)
├── main branch     → what agents `git clone`
└── server branch   → what the maintainer deploys
```

Branches are **not access control**; they're an organizational split. Both
are public; both can be cloned by anyone. The actual privacy boundary is:

- **Test labels never enter git** (any branch).
- The scoring server holds GT at `/var/graphtestbed/gt/<task>.csv`,
  populated separately from git at deploy time.

## 2. Data layout (per task)

### What the agent sees (HuggingFace public dataset)

```
graphtestbed/<task>:
  train_features.csv       # labeled
  val_features.csv         # labeled
  test_features.csv        # NO target column
  description.md
  sample_submission.csv
  <auxiliary>.csv          # optional: edges*.csv, train_text.csv, ...
```

### What the server holds privately

```
/var/graphtestbed/gt/<task>.csv
  <id_col>,Label           # one row per test entity, with the true label
```

The GT ID column matches `test_features.csv[<id_col>]` exactly (same ID set,
same row count).
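
A quick check of that invariant, as a sketch (pandas assumed; the task name,
local paths, and `id_col` value are illustrative, not part of the contract):

```python
import pandas as pd

# Public test features (no target) vs. server-side GT for the same task.
test = pd.read_csv("test_features.csv")
gt = pd.read_csv("/var/graphtestbed/gt/arxiv-citation.csv")

id_col = "node_id"  # hypothetical; read the real name from manifest.yaml
assert set(gt[id_col]) == set(test[id_col]), "GT must cover exactly the test IDs"
assert len(gt) == len(test), "one GT row per test entity"
```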

## 3. Submission contract

### Format

Per-task schema in `datasets/manifest.yaml`:

```yaml
<task>:
  submission_schema:
    id_col: <name>
    pred_col: <name>
    n_rows: <int>
    pred_dtype: float    # binary classification
```

CSV must have:
- Exactly 2 columns: `<id_col>, <pred_col>`
- Exactly `n_rows` rows
- ID set exactly matching `test_features.csv[<id_col>]`
- `<pred_col>` values: float in [0, 1] for binary
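
A minimal local check of these rules, as a sketch (pandas assumed; `schema` is
the parsed `submission_schema` dict from `manifest.yaml`, and the function
name is hypothetical):

```python
import pandas as pd

def validate_submission(csv_path: str, schema: dict, test_ids: set) -> None:
    """Raise ValueError if the CSV violates the submission contract."""
    df = pd.read_csv(csv_path)
    expected = [schema["id_col"], schema["pred_col"]]
    if list(df.columns) != expected:
        raise ValueError(f"columns must be exactly {expected}, got {list(df.columns)}")
    if len(df) != schema["n_rows"]:
        raise ValueError(f"expected {schema['n_rows']} rows, got {len(df)}")
    if set(df[schema["id_col"]]) != test_ids:
        raise ValueError("ID set must match test_features.csv exactly")
    if schema["pred_dtype"] == "float" and not df[schema["pred_col"]].between(0, 1).all():
        raise ValueError("predictions must lie in [0, 1] for binary tasks")
```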

### Submit

```bash
gtb submit <task> --file <preds.csv> --agent <name>
```

This command:
1. Validates the schema **locally** (no API call if the file is malformed).
2. POSTs `multipart/form-data` to `$GRAPHTESTBED_API/submit` with form fields
   `task`, `agent`, and the CSV as `file` (sketched below).
3. The server re-validates the schema, checks quota, scores against GT, and
   returns metrics JSON.
4. The client prints metrics + leaderboard rank + remaining quota.
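
Step 2, concretely, as a sketch with the `requests` library (the real
implementation lives in `graphtestbed/submit.py`; the task and agent names
shown are illustrative):

```python
import os
import requests

api = os.environ["GRAPHTESTBED_API"]
with open("preds.csv", "rb") as f:
    resp = requests.post(
        f"{api}/submit",
        data={"task": "arxiv-citation", "agent": "autopipe-v0.4"},  # form fields
        files={"file": ("preds.csv", f, "text/csv")},               # multipart file part
        timeout=60,
    )
resp.raise_for_status()
print(resp.json())  # metrics + leaderboard rank + remaining quota
```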

## 4. API endpoints

### `POST /submit` (multipart)

Form fields:
- `task` (str, required)
- `agent` (str, required)
- `file` (CSV, required, ≤50 MB)

Response 200:
```json
{
  "run_id": "8a3f29bce4a1",
  "task": "arxiv-citation",
  "agent": "autopipe-v0.4",
  "primary": 0.689,
  "secondary": {"auc_pr": 0.661, "f1": 0.591},
  "n_rows": 19394,
  "leaderboard_rank": 3,
  "quota_remaining": 4,
  "submitted_at": "2026-04-18T14:23:00"
}
```

Error responses:
- 400: missing form fields, malformed CSV
- 404: unknown task
- 422: schema check failed (returns reason)
- 429: quota exceeded
- 503: GT not deployed for this task
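
For orientation, a heavily condensed sketch of what a `server/api.py` handler
could look like (FastAPI, pandas, and scikit-learn assumed; quota, `run_id`,
and per-task metric selection are elided, and `roc_auc_score` is illustrative
rather than the contractual primary metric):

```python
import os
import pandas as pd
from fastapi import FastAPI, File, Form, HTTPException, UploadFile
from sklearn.metrics import roc_auc_score

app = FastAPI()
GT_DIR = "/var/graphtestbed/gt"

@app.post("/submit")
async def submit(task: str = Form(...), agent: str = Form(...),
                 file: UploadFile = File(...)):
    gt_path = os.path.join(GT_DIR, f"{task}.csv")
    if not os.path.exists(gt_path):
        raise HTTPException(503, "GT not deployed for this task")
    try:
        preds = pd.read_csv(file.file)
    except Exception:
        raise HTTPException(400, "malformed CSV")
    # ... 404 (unknown task), 422 (schema), 429 (quota) checks go here ...
    gt = pd.read_csv(gt_path)
    merged = gt.merge(preds, on=gt.columns[0])  # join on the ID column
    score = roc_auc_score(merged["Label"], merged[preds.columns[1]])
    return {"task": task, "agent": agent, "primary": round(score, 3)}  # bucketed (§5)
```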

### `GET /leaderboard/<task>`

Returns an array sorted by `primary` descending, keeping the **best
submission per agent** (not every submission):
```json
[
  {"agent": "human-baseline", "primary": 0.901, "n_submissions": 1, "first_seen": "..."},
  {"agent": "autopipe-v0.4",  "primary": 0.689, "n_submissions": 4, "first_seen": "..."},
  ...
]
```
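
The "best per agent" view is a single aggregate over the run table from §6 (a
sketch; the table name `runs` and the db path are assumptions):

```python
import sqlite3

conn = sqlite3.connect("graphtestbed.db")  # illustrative path
leaderboard = conn.execute(
    """SELECT agent, MAX(primary_metric) AS best_primary,
              COUNT(*) AS n_submissions, MIN(submitted_at) AS first_seen
       FROM runs WHERE task = ?
       GROUP BY agent ORDER BY best_primary DESC""",
    ("arxiv-citation",),
).fetchall()
```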

### `GET /healthz`

```json
{
  "status": "ok",
  "tasks": ["ieee-fraud-detection", ...],
  "gt_present": ["arxiv-citation", "figraph"],
  "quota_per_day": 5,
  "uptime_unix": 1745081234
}
```

## 5. Anti-leakage rails (lightweight)

| Rail | Mechanism |
|---|---|
| Test labels not in git | `.gitignore` blocks `ground_truth*`, `private/`. GT lives only on server fs. |
| Score bucketing | Server rounds to 3 decimals before returning |
| No per-row feedback | API returns aggregate metrics only |
| Quota | 5 submissions / day / IP / task by default |
| Schema check first | Malformed submissions don't burn quota and don't expose GT to scoring |

## 6. Reproducibility

Every scoring response includes a `run_id`. The server stores each run in sqlite:

```
run_id, task, agent, primary_metric, secondary_json, submission_sha256,
n_rows, submitter_ip, submitted_at
```

To audit a leaderboard entry, the maintainer queries sqlite by `run_id` and
pulls the submission CSV from `submissions/<task>/<agent>/<ts>.csv` (the
client doesn't push these to git automatically; see §8 if you want that).
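
That audit is a few lines with the stdlib (a sketch; `runs` as the table name
and the db path are assumptions):

```python
import hashlib
import sqlite3

def audit(run_id: str, csv_path: str, db_path: str = "graphtestbed.db") -> bool:
    """Check an archived submission CSV is byte-identical to what was scored."""
    row = sqlite3.connect(db_path).execute(
        "SELECT submission_sha256 FROM runs WHERE run_id = ?", (run_id,)
    ).fetchone()
    if row is None:
        raise KeyError(f"unknown run_id {run_id}")
    with open(csv_path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest() == row[0]
```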

## 7. Versioning

### Dataset versions

`manifest.yaml` pins `(hf_repo, hf_revision, sha256)` per file. To bump:
1. Push new files to HF as a new revision.
2. Update `manifest.yaml`'s sha256 fields.
3. Existing leaderboard entries reference the old data via their `run_id`'s
   stored `submission_sha256`; new submissions use the new data.
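
Verifying a pin is straightforward (a sketch; the function name is
hypothetical, and the pinned digest comes from `manifest.yaml`):

```python
import hashlib

def check_pin(local_path: str, pinned_sha256: str) -> None:
    """Compare a downloaded dataset file against its manifest.yaml sha256 pin."""
    h = hashlib.sha256()
    with open(local_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    if h.hexdigest() != pinned_sha256:
        raise RuntimeError(f"{local_path} does not match its pin; "
                           "re-download at the pinned hf_revision")
```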

For breaking changes (different metric, different test rows), add a v2
task entry: `arxiv-citation-v2`. Don't mutate v1's manifest.

### Metric versions

Same rule: don't mutate; add `<task>-v2` if the metric definition changes.

## 8. Optional: client-side submission archiving

If you want every submission CSV preserved publicly, add a step to
`graphtestbed/submit.py` that opens a PR with the CSV under
`submissions/<task>/<agent>/<ts>.csv` after the API returns. Off by default
to keep the client simple.

## 9. What the agent must NOT assume

- **No persistent state** between submissions. Each `/submit` is independent.
- **No mid-run feedback.** One scorer call per submit, full stop.
- **No private leaderboard.** Every submission is rate-counted by IP and
  scored; there is no shadow leaderboard for testing.
- **`agent` field is your name.** The server takes you at your word; don't
  impersonate.

## 10. What the API explicitly does NOT defend against

- An agent that downloads upstream data (Kaggle, the FiGraph GitHub repo) and
  tries to recompute test labels. We re-split for time-forward eval, so upstream
  labels don't directly map, but a determined agent could try.
- An agent that checks out the `server` branch to read scoring code. The branch
  is public; reading the scoring logic doesn't leak GT.
- An agent that runs on a private machine and tries to OCR a maintainer's
  laptop screen. We are not your adversary.

If your threat model includes adversarial agents with arbitrary capability,
this design is wrong; use container-isolated evaluation instead.

## 11. Rationale (why this vs the prior GitHub-Actions design)

We considered using GitHub Actions + a private GT repo. Rejected because:

- Two repos (public + private GT) is more moving parts than one repo + one
  small server.
- A FastAPI/Flask app on a $5/month VM (or free HF Space) gives us the same
  privacy boundary (GT lives on server fs, not in git).
- Synchronous scoring (instant response) is a better UX than async PR
  comments.
- The trust model is "agents are honest within an academic benchmark," not
  "agents are adversaries." Doesn't need physical isolation; needs an API
  contract that prevents accidental leaks.

The result: ~200 lines of server code and ~150 lines of client code; it
deploys on free tiers and needs no maintenance beyond restarting the server
if it crashes.