---
library_name: transformers
license: apache-2.0
language:
- en
base_model: Qwen/Qwen3-14B
pipeline_tag: text-generation
tags:
- code-search
- code-localization
- reinforcement-learning
- agent
- software-engineering
- GSPO
- OpenHands
- SWE-Bench
datasets:
- OpenHands/SWE-smith-py-code-search
- OpenHands/SWE-Gym-code-search
- OpenHands/CodeScout_Training_Rollouts
---

# CodeScout-14B

[📄 Paper](https://arxiv.org/abs/2603.17829) • [💻 Code](https://github.com/OpenHands/codescout) • [🤗 Collection](https://huggingface.co/collections/OpenHands/codescout-69b9a6adcf21f348f4db937f)

**The strongest CodeScout model: open-source SOTA on SWE-Bench code localization.**

<p align="center">
  <img src="codescout_overview.png" alt="CodeScout Overview" width="100%">
</p>

CodeScout-14B is part of the **CodeScout** family of open-source RL-trained code search agents.
CodeScout models achieve state-of-the-art repository-level code localization using *nothing more than a standard Unix terminal*: no static analysis, no repository graphs, no language-specific tooling.

## Key Highlights

- **Open-source SOTA** on SWE-Bench Verified, Pro, and Lite for code localization
- Outperforms base and post-trained LLMs that are **2–18× larger** across all benchmarks
- Surpasses GPT-5 and approaches Claude Sonnet 4.5 with RepoNavigator, despite relying on only a bash terminal
- Achieves **8–33%** higher function-level F1 than Qwen3-32B (Thinking)

## Results

Performance on SWE-Bench code localization (instance-averaged F1 scores):


| Benchmark | CodeScout-1.7B | CodeScout-4B | CodeScout-14B |
|---|---|---|---|
| **SWE-Bench Verified** (File F1) | 55.46 | 68.52 | **68.57** |
| **SWE-Bench Verified** (Func F1) | 28.22 | 36.78 | **40.32** |
| **SWE-Bench Pro** (File F1) | 40.96 | 51.77 | **53.63** |
| **SWE-Bench Pro** (Func F1) | 18.24 | **29.03** | 28.74 |
| **SWE-Bench Lite** (File F1) | 56.57 | 67.03 | **71.84** |
| **SWE-Bench Lite** (Func F1) | 27.07 | 39.87 | **44.43** |


<p align="center">
  <img src="f1_vs_params_file.png" alt="File-level F1 vs Model Size" width="48%">
  <img src="f1_vs_params_function.png" alt="Function-level F1 vs Model Size" width="48%">
</p>

<p align="center"><em>Code localization performance on SWE-Bench Verified. CodeScout (⭐) matches or exceeds larger open-source LLMs and narrows the gap to closed-source frontier models.</em></p>

## Training

CodeScout-14B is trained directly from `Qwen3-14B` using GSPO (Group Sequence Policy Optimization) reinforcement learning.

- **Training data:** 9,600 instances from [SWE-Smith](https://huggingface.co/datasets/OpenHands/SWE-smith-py-code-search), filtered from 39K candidates across 128 repositories
- **RL steps:** 300
- **Batch size:** 32, with 4 rollouts per instance
- **Max context length:** 50K tokens (extended with YaRN)
- **Max turns per episode:** 4
- **Reward:** Multi-level F1 (file + module + function) with auxiliary turn-completion reward
- **Hardware:** 8×H100 GPUs
- **Learning rate:** 1e-6 (constant)
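
The multi-level F1 reward can be sketched as below. The `file.py::func` location format, the directory-based notion of "module", and the uniform averaging across levels are assumptions for illustration; the paper defines the actual reward, including the auxiliary turn-completion term not modeled here.

```python
def f1(pred, gold):
    """Set-level F1 between predicted and gold location sets."""
    if not pred and not gold:
        return 1.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r)


def multilevel_reward(pred_locs, gold_locs):
    """Mean of file-, module-, and function-level F1.

    Locations use an assumed 'path/to/file.py::func' format; the
    "module" level is approximated here by the file's directory.
    """
    as_file = lambda loc: loc.split("::")[0]
    as_module = lambda loc: loc.split("::")[0].rsplit("/", 1)[0]
    as_func = lambda loc: loc
    levels = (as_file, as_module, as_func)
    return sum(
        f1({lv(l) for l in pred_locs}, {lv(l) for l in gold_locs})
        for lv in levels
    ) / len(levels)


# One correct function plus one false positive in the same directory:
# file F1 = 2/3, module F1 = 1, function F1 = 2/3 -> mean 7/9
reward = multilevel_reward({"src/a.py::foo", "src/b.py::bar"}, {"src/a.py::foo"})
```

Averaging across granularities gives partial credit when the agent finds the right file but the wrong function, which keeps the reward signal dense early in training.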

## How It Works

CodeScout uses the **OpenHands-Bash** scaffold: an agent equipped with only a `Terminal` tool (supporting standard Unix commands such as `rg`, `find`, `grep`, and `ls`) and a `LocalizationFinish` tool for structured output submission. The agent iteratively navigates the repository to identify the files, classes, and functions relevant to a given issue.
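
As a rough illustration of the terminal-only approach, the sketch below builds a toy repository and issues the kind of search command the agent might run through its `Terminal` tool. The repository layout and identifier are invented; the actual OpenHands-Bash tool interface is defined by the scaffold, not shown here.

```python
import os
import subprocess
import tempfile

# Build a tiny toy repository: one file mentions the identifier
# from the issue, one does not.
repo = tempfile.mkdtemp()
os.makedirs(os.path.join(repo, "pkg"))
with open(os.path.join(repo, "pkg", "billing.py"), "w") as f:
    f.write("def compute_tax(amount):\n    return amount * 0.2\n")
with open(os.path.join(repo, "pkg", "ui.py"), "w") as f:
    f.write("def render():\n    pass\n")

# An agent localizing "compute_tax crashes on None" might run
# something like `grep -rl compute_tax .` via its Terminal tool.
result = subprocess.run(
    ["grep", "-rl", "compute_tax", repo],
    capture_output=True, text=True, check=True,
)
hits = [os.path.relpath(p, repo) for p in result.stdout.splitlines()]
```

In the real loop the agent would inspect the matched files with further commands (`ls`, `rg`, viewing function bodies) over several turns before submitting its final answer through `LocalizationFinish`.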

The model is trained with **GSPO** (Group Sequence Policy Optimization) using multi-level F1 rewards at the file, module, and function level.

## Intended Use

CodeScout-14B is designed for **repository-level code localization**: given a GitHub issue description and a code repository, it identifies the relevant files, classes, and functions that need to be modified. It is intended to be used as a localization subagent within larger coding agent pipelines.
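
A hypothetical sketch of assembling a localization query from an issue and a repository listing. The prompt wording and structure here are illustrative assumptions; the real system prompt and tool schema come from the OpenHands-Bash scaffold.

```python
def build_localization_prompt(issue_title, issue_body, repo_root_listing):
    """Assemble a minimal localization query (hypothetical format;
    the OpenHands-Bash scaffold defines the actual prompt)."""
    return (
        "You are a code localization agent with a bash terminal.\n"
        f"Issue: {issue_title}\n{issue_body}\n\n"
        "Top-level repository contents:\n"
        + "\n".join(repo_root_listing)
        + "\nFind the files and functions that must change."
    )


prompt = build_localization_prompt(
    "TypeError in tax computation",
    "compute_tax crashes when amount is None.",
    ["pkg/", "tests/", "setup.py"],
)
```

A pipeline embedding CodeScout as a subagent would feed such a query to the model and pass its structured output to a downstream editing agent.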

## Limitations

- Trained and evaluated exclusively on **Python** repositories
- Designed for code *localization*, not code *editing* or issue resolution
- Performance may vary on repositories significantly different from the training distribution
- Requires the OpenHands-Bash scaffold for optimal performance

## Citation

```bibtex
@misc{sutawika2026codescouteffectiverecipereinforcement,
      title={CodeScout: An Effective Recipe for Reinforcement Learning of Code Search Agents}, 
      author={Lintang Sutawika and Aditya Bharat Soni and Bharath Sriraam R R and Apurva Gandhi and Taha Yassine and Sanidhya Vijayvargiya and Yuchen Li and Xuhui Zhou and Yilin Zhang and Leander Melroy Maben and Graham Neubig},
      year={2026},
      eprint={2603.17829},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2603.17829}, 
}
```