SWE-Star-32B

Introduction

SWE-Star is a family of language models built on Qwen2.5-Coder and trained on the SWE-Star dataset. The dataset contains approximately 250k agentic coding trajectories distilled from Devstral-2-Small on SWE-Smith tasks.

The complete data generation, training, and evaluation pipeline is openly available in our GitHub repository, enabling anyone to reproduce our results.

Additional details are available in our blog posts.

Evaluation

We evaluated our models on SWE-Bench Verified, the de facto benchmark for agentic coding in Python. Our 32B model significantly outperforms both the original SWE-Smith models (by more than 10 percentage points) and other prior work.

Note: The headline numbers were obtained using the OpenHands Iterative Eval Protocol with three attempts. The results reported below are Pass@1 with a single attempt and a 100-step limit.

To roll out the agent on the benchmark instances, we used our custom OpenHands-style scaffold with the think, str_replace_editor, execute_bash, and submit tools. The agent uses XML-based tool calling with a maximum of 100 steps and a sampling temperature of 0.
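The card does not document the exact XML schema the scaffold uses, so the tag layout below (`<tool name="...">` with nested `<param>` elements) is a hypothetical illustration of how XML-based tool calls might be extracted from a model's output:

```python
import xml.etree.ElementTree as ET

def parse_tool_call(message: str):
    """Extract a tool name and its arguments from an XML tool call.

    NOTE: the <tool>/<param> layout here is an assumed format for
    illustration only; the actual scaffold's schema is not specified
    in this model card.
    """
    root = ET.fromstring(message)
    params = {p.get("name"): (p.text or "") for p in root.findall("param")}
    return root.get("name"), params

# Example: a call to the execute_bash tool mentioned above.
name, args = parse_tool_call(
    '<tool name="execute_bash"><param name="command">ls -la</param></tool>'
)
```

A real scaffold would dispatch on the returned tool name (think, str_replace_editor, execute_bash, or submit) and feed the tool's output back to the model on the next step.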

It is worth noting that the agent was run on MN5 without internet access, which may slightly lower the final scores. The SWE-Bench evaluation harness was run on a machine with internet access.

Our models also achieve very high Pass@16 rates, making them strong candidates for further reinforcement learning; our 32B model reaches a Pass@16 score of 75.5%.
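For reference, Pass@k is typically computed with the unbiased estimator introduced in the Codex paper (Chen et al., 2021), which estimates the probability that at least one of k samples drawn from n total attempts (of which c are correct) solves the task. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: total samples generated per task
    c: number of correct samples
    k: samples "drawn" for the metric
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # samples must contain at least one correct solution.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 1 correct sample out of 2, Pass@1 is 0.5.
p = pass_at_k(2, 1, 1)
```

In practice the estimator is averaged over all benchmark tasks; with n = k = 16 it reduces to checking whether any of the 16 attempts succeeded.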
