	---
	license: mit
	---

	# SWE-Bench Trajectory Eval Bundle (v1)

	Companion artifact for the trajectory-probe downstream eval of the
	code-graph-v7 encoders (W1, I6, ...).

	## Contents

	- `traj_full_bundle.tar.gz` (488 MB), which contains:
	  - `specs.jsonl`: 2456 SWE-Bench Verified agent trajectories harvested
	    from the `swe-bench-submissions` S3 bucket. Fields: `instance_id`,
	    `traj_id`, `repo`, `base_commit`, `patches` (a single entry: the final
	    model patch), `resolved`.
	  - `repos/`: blobless partial clones (`--filter=blob:none`) of the 12
	    target repos (django, sympy, sphinx, matplotlib, scikit-learn, astropy,
	    xarray, pytest, pylint, requests, seaborn, flask). ~671 MB
	    uncompressed. Blobs are fetched lazily when a `base_commit` is checked out.
	  - `graphjepa/`: pipeline code (`trajectory_pipeline`, `trajectory_realize`,
	    `trajectory_probe`, `trajectory_harvest`) plus `scripts/trajectory_full.sh`.
	- `harvest.log`: stdout from the S3 harvester that produced `specs.jsonl`.
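
Before running the pipeline it can help to confirm that `specs.jsonl` has the shape described above. The following is a minimal sketch, not part of the bundled `graphjepa` code: the helper names `parse_specs` and `load_specs` are hypothetical, and the only thing it checks is that the six listed field names are present (their value types are not documented here and are not validated).

```python
import json
from pathlib import Path

# Field names come from the Contents list; value types are not checked
# because the README does not document them.
REQUIRED_FIELDS = (
    "instance_id", "traj_id", "repo", "base_commit", "patches", "resolved",
)


def parse_specs(lines):
    """Parse JSONL lines into dicts; raise if a required field is missing."""
    records = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            raise ValueError(f"line {lineno}: missing fields {missing}")
        records.append(record)
    return records


def load_specs(path):
    """Read a specs.jsonl file from disk (hypothetical helper)."""
    with Path(path).open(encoding="utf-8") as fh:
        return parse_specs(fh)
```

After extracting the tarball, `load_specs("traj_full/specs.jsonl")` should return 2456 records if the file matches the description.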

	## Downstream workflow

	```bash
	tar -xzf traj_full_bundle.tar.gz
	rsync -a traj_full/graphjepa/ graphjepa/
	mkdir -p outputs/traj_real
	cp traj_full/specs.jsonl outputs/traj_real/
	mv traj_full/repos outputs/traj_real/repos

	# Realize trajectories (4 workers, sharded by repo)
	SHARDS=4 bash graphjepa/scripts/trajectory_full.sh
	tail -f outputs/traj_real/logs/realize_shard*.log   # follow progress; Ctrl-C to stop

	# Merge the shard manifests, then probe with each encoder
	cat outputs/traj_real/manifest_shard*.jsonl > outputs/traj_real/manifest.jsonl
	for NAME in W1_softplus_s0 I6_joint_s0; do
	  .venv/bin/python -m graphjepa.trajectory_probe \
	    --manifest outputs/traj_real/manifest.jsonl \
	    --ckpt "outputs/${NAME}/ckpt_final.pt" \
	    --pool mean --split-by repo \
	    --output "outputs/traj_real/probe_${NAME}.json"
	done
	```
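
The `cat` merge step above copies shard manifests byte for byte, so a shard that was interrupted mid-write would pass a truncated last line into `manifest.jsonl` unnoticed. As a hedged alternative, the sketch below merges the shards while checking that every line parses as JSON. The function `merge_manifests` is hypothetical and not part of the bundle; it assumes only that the manifests are JSONL, as their file extension suggests.

```python
import json
from pathlib import Path


def merge_manifests(shard_paths, out_path):
    """Concatenate JSONL shard manifests into one file; return the row count."""
    count = 0
    with Path(out_path).open("w", encoding="utf-8") as out:
        # Sort so the merged order is deterministic across runs.
        for shard in sorted(shard_paths):
            with Path(shard).open(encoding="utf-8") as fh:
                for lineno, line in enumerate(fh, start=1):
                    line = line.strip()
                    if not line:
                        continue  # skip blank lines
                    try:
                        json.loads(line)
                    except json.JSONDecodeError as exc:
                        raise ValueError(f"{shard}:{lineno}: invalid JSON") from exc
                    out.write(line + "\n")
                    count += 1
    return count
```

A possible call is `merge_manifests(Path("outputs/traj_real").glob("manifest_shard*.jsonl"), "outputs/traj_real/manifest.jsonl")`; the glob pattern does not match the output file, so the merge never reads its own output.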

	## Provenance

	Specs harvested from 5 SWE-Bench Verified submissions:

	| Submission | N | Resolved | Rate |
	|---|---|---|---|
	| 20240620_sweagent_claude3.5sonnet | 485 | 168 | 34.6% |
	| 20241022_tools_claude-3-5-sonnet-updated | 483 | 245 | 50.7% |
	| 20241028_agentless-1.5_gpt4o | 495 | 194 | 39.2% |
	| 20241029_OpenHands-CodeAct-2.1-sonnet | 493 | 265 | 53.8% |
	| 20250405_amazon-q-developer-2025 | 500 | 330 | 66.0% |
	| Total | 2456 | 1202 | 48.9% |
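
The rates in the table are resolved counts divided by N, rounded to one decimal place. The short sketch below recomputes them from the counts listed above as a consistency check; the names `SUBMISSIONS` and `resolve_rate` are illustrative only.

```python
# (N, resolved) per submission, copied from the provenance table.
SUBMISSIONS = {
    "20240620_sweagent_claude3.5sonnet": (485, 168),
    "20241022_tools_claude-3-5-sonnet-updated": (483, 245),
    "20241028_agentless-1.5_gpt4o": (495, 194),
    "20241029_OpenHands-CodeAct-2.1-sonnet": (493, 265),
    "20250405_amazon-q-developer-2025": (500, 330),
}


def resolve_rate(n, resolved):
    """Percentage of trajectories marked resolved, to one decimal place."""
    return round(100 * resolved / n, 1)


RATES = {name: resolve_rate(n, r) for name, (n, r) in SUBMISSIONS.items()}
TOTAL_N = sum(n for n, _ in SUBMISSIONS.values())
TOTAL_RESOLVED = sum(r for _, r in SUBMISSIONS.values())
TOTAL_RATE = resolve_rate(TOTAL_N, TOTAL_RESOLVED)

for name, rate in RATES.items():
    print(f"{name}: {rate}%")
print(f"total: {TOTAL_N} trajectories, {TOTAL_RESOLVED} resolved, {TOTAL_RATE}%")
```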

	The bundle covers 500 unique instance_ids and 499 unique base_commits,
	with a median of 5 trajectories per commit (different agents attempting
	the same task).
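
These summary counts can be recomputed from the spec records. The sketch below assumes each record is a dict with at least the `instance_id` and `base_commit` fields named in the Contents section; `summarize` is a hypothetical helper, not part of the bundled pipeline.

```python
from collections import Counter
from statistics import median


def summarize(specs):
    """Return coverage counts for a non-empty list of spec records."""
    # Number of trajectories attached to each base commit.
    per_commit = Counter(spec["base_commit"] for spec in specs)
    return {
        "trajectories": len(specs),
        "unique_instance_ids": len({spec["instance_id"] for spec in specs}),
        "unique_base_commits": len(per_commit),
        "median_trajs_per_commit": median(per_commit.values()),
    }
```

Run on the full `specs.jsonl`, this should report 2456 trajectories, 500 instance_ids, 499 base_commits and a median of 5 if the file matches the figures above. An empty input raises `statistics.StatisticsError`, since a median is undefined there.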