---
license: mit
---

# SWE-Bench Trajectory Eval Bundle (v1)

Companion artifact for the trajectory-probe downstream eval of the
code-graph-v7 encoders (W1, I6, ...).

## Contents

- `traj_full_bundle.tar.gz` (488 MB), containing:
  - `specs.jsonl`: 2456 SWE-Bench Verified agent trajectories harvested
    from the `swe-bench-submissions` S3 bucket. Fields: `instance_id`,
    `traj_id`, `repo`, `base_commit`, `patches` (a single entry: the final
    model patch), `resolved`.
  - `repos/`: blobless partial clones (`--filter=blob:none`) of the 12 target
    repos (django, sympy, sphinx, matplotlib, scikit-learn, astropy,
    xarray, pytest, pylint, requests, seaborn, flask). ~671 MB
    uncompressed. Blobs are fetched lazily on each `base_commit` checkout,
    so a checkout may need network access to the clone's remote.
  - `graphjepa/`: pipeline code (`trajectory_pipeline`, `trajectory_realize`,
    `trajectory_probe`, `trajectory_harvest`) plus
    `scripts/trajectory_full.sh`.
- `harvest.log`: stdout from the S3 harvester that produced `specs.jsonl`.
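
The field list above can be checked mechanically before running the pipeline. The following is a minimal sketch, not part of the bundled `graphjepa` code: `load_specs` and `summarize` are hypothetical helper names, and the sketch assumes `resolved` is a boolean (or 0/1) and that every record carries all six fields.

```python
import json
from collections import Counter
from typing import Iterable

# Field names as listed in the Contents section; treating all of them as
# required is an assumption of this sketch.
EXPECTED_FIELDS = {
    "instance_id", "traj_id", "repo", "base_commit", "patches", "resolved",
}


def load_specs(lines: Iterable[str]) -> list:
    """Parse JSONL lines into dicts, failing loudly on missing fields."""
    records = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = EXPECTED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"line {lineno}: missing fields {sorted(missing)}")
        records.append(record)
    return records


def summarize(records: list) -> dict:
    """Count trajectories, resolved trajectories, and trajectories per repo."""
    return {
        "n_trajectories": len(records),
        "n_resolved": sum(bool(r["resolved"]) for r in records),
        "per_repo": dict(Counter(r["repo"] for r in records)),
    }
```

To use it on the extracted bundle, open `traj_full/specs.jsonl` and pass the file object to `load_specs`, then pass the result to `summarize`; a file object works because it iterates line by line.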

## Downstream workflow

```bash
tar -xzf traj_full_bundle.tar.gz
rsync -a traj_full/graphjepa/ graphjepa/
mkdir -p outputs/traj_real
cp traj_full/specs.jsonl outputs/traj_real/
mv traj_full/repos outputs/traj_real/repos

# realize (4 workers, sharded by repo)
SHARDS=4 bash graphjepa/scripts/trajectory_full.sh
# follow worker progress; press Ctrl-C to stop following the logs
tail -f outputs/traj_real/logs/realize_shard*.log

# merge the per-shard manifests, then probe with each encoder
cat outputs/traj_real/manifest_shard*.jsonl > outputs/traj_real/manifest.jsonl
for NAME in W1_softplus_s0 I6_joint_s0; do
  .venv/bin/python -m graphjepa.trajectory_probe \
    --manifest outputs/traj_real/manifest.jsonl \
    --ckpt "outputs/${NAME}/ckpt_final.pt" \
    --pool mean --split-by repo \
    --output "outputs/traj_real/probe_${NAME}.json"
done
```
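
The `cat` merge step copies shard files byte for byte, so a partially written final line (for example from an interrupted worker) would pass through unnoticed. As an optional alternative, here is a minimal Python sketch that performs the same concatenation while checking that every line parses as JSON. `merge_manifests` is a hypothetical helper, not part of `graphjepa`, and it assumes only that the manifests are JSON Lines files.

```python
import json
from pathlib import Path


def merge_manifests(shard_paths, out_path):
    """Concatenate JSONL shards into out_path, validating each line as JSON.

    Shards are processed in sorted path order. Returns the number of
    records written.
    """
    written = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for shard in sorted(map(Path, shard_paths)):
            with open(shard, encoding="utf-8") as fh:
                for lineno, line in enumerate(fh, start=1):
                    line = line.strip()
                    if not line:
                        continue  # skip blank lines
                    try:
                        json.loads(line)
                    except json.JSONDecodeError as exc:
                        raise ValueError(f"{shard}:{lineno}: invalid JSON") from exc
                    out.write(line + "\n")
                    written += 1
    return written
```

A call such as `merge_manifests(Path("outputs/traj_real").glob("manifest_shard*.jsonl"), "outputs/traj_real/manifest.jsonl")` mirrors the shell step; the glob pattern does not match the merged `manifest.jsonl` itself, so reruns do not re-ingest the output.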

## Provenance

Specs harvested from 5 SWE-Bench Verified submissions:

| Submission | Trajectories (N) | Resolved | Resolve rate |
|---|---:|---:|---:|
| 20240620_sweagent_claude3.5sonnet | 485 | 168 | 34.6% |
| 20241022_tools_claude-3-5-sonnet-updated | 483 | 245 | 50.7% |
| 20241028_agentless-1.5_gpt4o | 495 | 194 | 39.2% |
| 20241029_OpenHands-CodeAct-2.1-sonnet | 493 | 265 | 53.8% |
| 20250405_amazon-q-developer-2025 | 500 | 330 | 66.0% |
| **total** | **2456** | **1202** | **48.9%** |
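
The rates and totals in the table follow directly from the per-submission counts. The short sketch below recomputes them from the numbers above, which may be handy as a sanity check if the bundle is ever re-harvested; the names `SUBMISSIONS` and `resolve_rate` are illustrative only.

```python
# (trajectories, resolved) pairs copied from the provenance table.
SUBMISSIONS = {
    "20240620_sweagent_claude3.5sonnet": (485, 168),
    "20241022_tools_claude-3-5-sonnet-updated": (483, 245),
    "20241028_agentless-1.5_gpt4o": (495, 194),
    "20241029_OpenHands-CodeAct-2.1-sonnet": (493, 265),
    "20250405_amazon-q-developer-2025": (500, 330),
}


def resolve_rate(n: int, resolved: int) -> float:
    """Resolved fraction as a percentage, rounded to one decimal place."""
    return round(100 * resolved / n, 1)


rates = {name: resolve_rate(n, r) for name, (n, r) in SUBMISSIONS.items()}
total_n = sum(n for n, _ in SUBMISSIONS.values())
total_resolved = sum(r for _, r in SUBMISSIONS.values())
# Pooled rate over all trajectories, not the mean of per-submission rates.
total_rate = resolve_rate(total_n, total_resolved)
```

Note that the total row is a pooled rate (all resolved trajectories over all trajectories), so larger submissions weigh slightly more than smaller ones.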

The bundle covers 500 unique `instance_id`s and 499 unique `base_commit`s,
with a median of 5 trajectories per commit (different agents attempting the
same task).
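
These counts can be reproduced from parsed `specs.jsonl` records. The sketch below is illustrative rather than part of the bundled pipeline: `commit_stats` is a hypothetical helper, and it assumes each record is a dict with `instance_id` and `base_commit` keys as listed in the field description. Because there are 500 instance ids but only 499 base commits, at least two instances must share a commit, which is worth keeping in mind when grouping trajectories by commit.

```python
from collections import Counter
from statistics import median


def trajectories_per_commit(records) -> Counter:
    """Map each base_commit to the number of trajectories starting from it."""
    return Counter(r["base_commit"] for r in records)


def commit_stats(records) -> dict:
    """Unique-id counts and the median number of trajectories per commit.

    Raises statistics.StatisticsError if `records` is empty.
    """
    records = list(records)  # allow generators; we iterate twice
    counts = trajectories_per_commit(records)
    return {
        "unique_instance_ids": len({r["instance_id"] for r in records}),
        "unique_base_commits": len(counts),
        "median_trajectories_per_commit": median(counts.values()),
    }
```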