---
license: mit
---

# SWE-Bench Trajectory Eval Bundle (v1)

Companion artifact for the trajectory-probe downstream eval of the
code-graph-v7 encoders (W1, I6, ...).

## Contents

- `traj_full_bundle.tar.gz` (488 MB) — contains:
  - `specs.jsonl`: 2456 SWE-Bench Verified agent trajectories harvested
    from the `swe-bench-submissions` S3 bucket. Fields: `instance_id`,
    `traj_id`, `repo`, `base_commit`, `patches` (a single entry, the
    final model patch), `resolved`.
  - `repos/`: blobless partial clones (`--filter=blob:none`) of the 12
    target repos (django, sympy, sphinx, matplotlib, scikit-learn,
    astropy, xarray, pytest, pylint, requests, seaborn, flask). ~671 MB
    uncompressed. Blobs are fetched lazily on each `base_commit`
    checkout, so the clones need network access to their remotes.
  - `graphjepa/`: pipeline code (trajectory_pipeline, trajectory_realize,
    trajectory_probe, trajectory_harvest) plus scripts/trajectory_full.sh.
- `harvest.log` — stdout from the S3 harvester that produced specs.jsonl.
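
A minimal sketch for inspecting `specs.jsonl` after extraction. It assumes one JSON object per line with the fields listed above, that `repo` is a string, and that `resolved` is truthy for resolved trajectories. The helper names `load_specs` and `resolved_rate_by_repo` are hypothetical and not part of `graphjepa`.

```python
import json
from collections import Counter
from pathlib import Path


def load_specs(path):
    """Read one JSON object per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def resolved_rate_by_repo(specs):
    """Return {repo: (n_trajectories, n_resolved, resolved_rate)}."""
    total = Counter()
    solved = Counter()
    for spec in specs:
        total[spec["repo"]] += 1
        # Assumption: `resolved` is a boolean (or at least truthy/falsy).
        solved[spec["repo"]] += bool(spec["resolved"])
    return {repo: (n, solved[repo], solved[repo] / n) for repo, n in total.items()}
```

For example, `resolved_rate_by_repo(load_specs("traj_full/specs.jsonl"))` would give a per-repo breakdown, assuming the tarball was extracted into `traj_full/` as in the workflow below.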

## Downstream workflow

```bash
tar -xzf traj_full_bundle.tar.gz
rsync -a traj_full/graphjepa/ graphjepa/
mkdir -p outputs/traj_real
cp traj_full/specs.jsonl outputs/traj_real/
mv traj_full/repos outputs/traj_real/repos

# realize (4 sharded workers by repo)
SHARDS=4 bash graphjepa/scripts/trajectory_full.sh
# follow worker progress; press Ctrl-C once every shard has finished
tail -f outputs/traj_real/logs/realize_shard*.log

# merge manifests + probe with each encoder
cat outputs/traj_real/manifest_shard*.jsonl > outputs/traj_real/manifest.jsonl
for NAME in W1_softplus_s0 I6_joint_s0; do
  .venv/bin/python -m graphjepa.trajectory_probe \
    --manifest outputs/traj_real/manifest.jsonl \
    --ckpt outputs/$NAME/ckpt_final.pt \
    --pool mean --split-by repo \
    --output outputs/traj_real/probe_${NAME}.json
done
```
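
The `cat` merge above copies bytes as-is, so a shard that was cut off mid-write or lacks a trailing newline can yield a malformed line in `manifest.jsonl`. The following is a hedged sketch of a stricter merge that parses every record first. The function name `merge_manifests` is hypothetical, and because the manifest schema is not described here, the only check it performs is that each line is valid JSON.

```python
import json
from pathlib import Path


def merge_manifests(shard_paths, out_path):
    """Concatenate JSONL shards into out_path, validating every record.

    Returns the number of records written. Raises ValueError on a line
    that is not valid JSON (for example, a truncated final line).
    """
    count = 0
    with Path(out_path).open("w", encoding="utf-8") as out:
        # Sort for a deterministic order, like a shell glob would give.
        for shard in sorted(Path(p) for p in shard_paths):
            with shard.open(encoding="utf-8") as fh:
                for lineno, line in enumerate(fh, start=1):
                    line = line.strip()
                    if not line:
                        continue
                    try:
                        record = json.loads(line)
                    except json.JSONDecodeError as exc:
                        raise ValueError(f"{shard}:{lineno}: invalid JSON") from exc
                    # Re-serialize so every record ends with exactly one newline.
                    out.write(json.dumps(record) + "\n")
                    count += 1
    return count
```

A call such as `merge_manifests(Path("outputs/traj_real").glob("manifest_shard*.jsonl"), "outputs/traj_real/manifest.jsonl")` would stand in for the `cat` line. The glob does not match `manifest.jsonl` itself, so the output is never read as an input.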

## Provenance

Specs harvested from 5 SWE-Bench Verified submissions:

| Submission | N | Resolved | Rate |
|---|---|---|---|
| 20240620_sweagent_claude3.5sonnet | 485 | 168 | 34.6% |
| 20241022_tools_claude-3-5-sonnet-updated | 483 | 245 | 50.7% |
| 20241028_agentless-1.5_gpt4o | 495 | 194 | 39.2% |
| 20241029_OpenHands-CodeAct-2.1-sonnet | 493 | 265 | 53.8% |
| 20250405_amazon-q-developer-2025 | 500 | 330 | 66.0% |
| **total** | **2456** | **1202** | **48.9%** |

500 unique instance_ids, 499 unique base_commits (median 5 trajectories
per commit — different agents attempting the same task).
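
The counts above can be rechecked from the specs with a short sketch like the one below. It assumes each spec is a dict carrying the `instance_id` and `base_commit` fields, and the function name `commit_stats` is hypothetical.

```python
import statistics
from collections import Counter


def commit_stats(specs):
    """Summarize how trajectories are spread over instances and commits."""
    per_commit = Counter(spec["base_commit"] for spec in specs)
    return {
        "unique_instances": len({spec["instance_id"] for spec in specs}),
        "unique_commits": len(per_commit),
        "median_per_commit": statistics.median(per_commit.values()),
    }
```

Fewer unique commits than unique instances means that at least two instances share a `base_commit`, so a split keyed on commit and a split keyed on instance are not interchangeable.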