CORE-bench v1.1 - an agent-evals Collection

updated May 5

A benchmark for AI agents on scientific reproducibility, with mainline (39) and out-of-distribution (OOD, 19) splits derived from Code Ocean capsules.