agent-evals's Collections

CORE-bench v1.1

Benchmark for AI agents on scientific reproducibility, with mainline (39) and out-of-distribution (OOD, 19) splits derived from Code Ocean capsules.