SWE-bench is a benchmark for evaluating language models and AI systems on their ability to resolve real-world GitHub issues.