CORE-bench v1.1 (Collection). Benchmark for AI agents on scientific reproducibility, with mainline (39) and OOD (19) splits derived from Code Ocean capsules. 2 items. Updated 6 days ago.