---
license: cc-by-nc-4.0
base_model:
- Alibaba-NLP/gte-Qwen2-7B-instruct
---
`SweRankEmbed-Large` is a 7B bi-encoder for code retrieval. It significantly outperforms other embedding models on the issue localization task. 

The model was trained on large-scale issue-localization data collected from public Python GitHub repositories. Check out our [blog post](https://gangiswag.github.io/SweRank/) and [paper](https://arxiv.org/abs/2505.07849) for more details!

You can combine `SweRankEmbed` with our [`SweRankLLM-Small`]() or [`SweRankLLM-Large`]() rerankers for even higher quality ranking performance.

Link to code: [https://github.com/gangiswag/SweRank](https://github.com/gangiswag/SweRank)

## Performance

SweRank models achieve state-of-the-art localization performance on benchmarks such as SWE-Bench-Lite and LocBench, considerably outperforming agent-based approaches that rely on Claude 3.5.

| Model Name | SWE-Bench-Lite Func@10 | LocBench Func@15 |
| ------------------------------------------------------------------- | -------------------------------- | -------------------------------- |
| OpenHands (Claude 3.5)    | 70.07                            |  59.29 |
| LocAgent (Claude 3.5)     | 77.37                            |  60.71 |
| CodeRankEmbed (137M)      | 58.76                            |  50.89 |
| GTE-Qwen2-7B-Instruct (7B)| 70.44                            |  57.14 |
| SweRankEmbed-Small (137M) | 74.45                            |  63.39 |
| SweRankEmbed-Large (7B)   | 82.12                            |  67.32 |
| + GPT-4.1 reranker        | 87.96                            |  74.64 |
| + SweRankLLM-Small (7B) reranker        | 86.13                            |  74.46 |
| + SweRankLLM-Large (32B) reranker       | 88.69                            |  76.25 |


## Requirements

```shell
transformers>=4.39.2
flash_attn>=2.5.6
```
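These can be installed with pip; note that `flash_attn` builds a CUDA extension, so a compatible CUDA toolchain is assumed to be available:

```shell
# Install the minimum required versions (flash_attn needs a CUDA toolchain to build)
pip install "transformers>=4.39.2" "flash_attn>=2.5.6"
```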

## Usage with Sentence-Transformers

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Salesforce/SweRankEmbed-Large", trust_remote_code=True)
# In case you want to reduce the maximum length:
model.max_seq_length = 8192

queries = ['Calculate the n-th factorial']
documents = ['def fact(n):\n if n < 0:\n  raise ValueError\n return 1 if n == 0 else n * fact(n - 1)']

query_embeddings = model.encode(queries, prompt_name="query")
document_embeddings = model.encode(documents)

scores = query_embeddings @ document_embeddings.T

for query, query_scores in zip(queries, scores):
    doc_score_pairs = list(zip(documents, query_scores))
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
    # Output passages & scores
    print("Query:", query)
    for document, score in doc_score_pairs:
        print(score, document)
```

See `config_sentence_transformers.json` for all pre-built prompt names.

## Usage with Huggingface Transformers

**Important**: the query prompt must include the following task instruction prefix: "*Instruct: Given a github issue, identify the code that needs to be changed to fix the issue.\nQuery: *"

```python
import torch
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel

def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]

def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery: {query}'

# Each query must come with a one-sentence instruction that describes the task
task = 'Given a github issue, identify the code that needs to be changed to fix the issue.'

tokenizer = AutoTokenizer.from_pretrained('Salesforce/SweRankEmbed-Large',  trust_remote_code=True)
model = AutoModel.from_pretrained('Salesforce/SweRankEmbed-Large',  trust_remote_code=True)
model.eval()

max_length = 8192

queries = ['Calculate the n-th factorial']
queries_with_prefix  = [get_detailed_instruct(task, query) for query in queries]
query_inputs = tokenizer(queries_with_prefix, padding=True, truncation=True, return_tensors='pt', max_length=max_length)

documents = ['def fact(n):\n if n < 0:\n  raise ValueError\n return 1 if n == 0 else n * fact(n - 1)']
document_inputs = tokenizer(documents, padding=True, truncation=True, return_tensors='pt', max_length=max_length)

# Compute token embeddings
with torch.no_grad():
    query_embeddings = last_token_pool(model(**query_inputs).last_hidden_state, query_inputs["attention_mask"])
    document_embeddings = last_token_pool(model(**document_inputs).last_hidden_state, document_inputs["attention_mask"])


# normalize embeddings
query_embeddings = torch.nn.functional.normalize(query_embeddings, p=2, dim=1)
document_embeddings = torch.nn.functional.normalize(document_embeddings, p=2, dim=1)

scores = torch.mm(query_embeddings, document_embeddings.transpose(0, 1))
for query, query_scores in zip(queries, scores):
    doc_score_pairs = list(zip(documents, query_scores))
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
    #Output passages & scores
    print("Query:", query)
    for document, score in doc_score_pairs:
        print(score, document)
```
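To illustrate what `last_token_pool` does, here is a toy check with a fake hidden-state tensor (the values are arbitrary, not real model outputs): with right padding it selects each sequence's final non-pad position, and with left padding it simply takes the last column.

```python
import torch
from torch import Tensor

def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # Same helper as in the usage example above
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    sequence_lengths = attention_mask.sum(dim=1) - 1
    batch_size = last_hidden_states.shape[0]
    return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]

# Toy batch: 2 sequences, 3 positions, hidden size 2 (arbitrary values)
hidden = torch.tensor([[[1., 1.], [2., 2.], [3., 3.]],
                       [[4., 4.], [5., 5.], [6., 6.]]])

# Right padding: sequence 0 has length 2, sequence 1 has length 3
right_mask = torch.tensor([[1, 1, 0], [1, 1, 1]])
print(last_token_pool(hidden, right_mask))  # [[2., 2.], [6., 6.]]

# Left padding: the last column is valid for every sequence
left_mask = torch.tensor([[0, 1, 1], [1, 1, 1]])
print(last_token_pool(hidden, left_mask))   # [[3., 3.], [6., 6.]]
```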

## Citation

If you find this work useful in your research, please consider citing our paper:

```
@article{reddy2025swerank,
  title={SweRank: Software Issue Localization with Code Ranking},
  author={Reddy, Revanth Gangi and Suresh, Tarun and Doo, JaeHyeok and Liu, Ye and Nguyen, Xuan Phi and Zhou, Yingbo and Yavuz, Semih and Xiong, Caiming and Ji, Heng and Joty, Shafiq},
  journal={arXiv preprint arXiv:2505.07849},
  year={2025}
}
```