---
license: cc-by-nc-4.0
base_model:
- Alibaba-NLP/gte-Qwen2-7B-instruct
---

`SweRankEmbed-Large` is a 7B bi-encoder for code retrieval. It significantly outperforms other embedding models on the issue localization task.

The model has been trained on large-scale issue localization data collected from public Python GitHub repositories. Check out our [blog post](https://gangiswag.github.io/SweRank/) and [paper](https://arxiv.org/abs/2505.07849) for more details!

You can combine `SweRankEmbed` with our [`SweRankLLM-Small`]() or [`SweRankLLM-Large`]() rerankers for even higher-quality ranking performance.
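
The snippet below is a minimal sketch of that two-stage retrieve-then-rerank flow: the bi-encoder shortlists candidate functions, and an LLM reranker is applied on top. The `rerank_with_llm` helper is a hypothetical placeholder, not a real API; see the SweRank repository for the actual reranking entry point.

```python
from sentence_transformers import SentenceTransformer

# Stage 1: embed the issue and candidate functions with SweRankEmbed-Large
# and keep the top-k highest-scoring candidates.
model = SentenceTransformer("Salesforce/SweRankEmbed-Large", trust_remote_code=True)

issue = "Calculate the n-th factorial"
functions = [
    "def fact(n):\n    return 1 if n == 0 else n * fact(n - 1)",
    "def fib(n):\n    return n if n < 2 else fib(n - 1) + fib(n - 2)",
]

query_emb = model.encode([issue], prompt_name="query")
doc_emb = model.encode(functions)
scores = (query_emb @ doc_emb.T)[0]

top_k = sorted(range(len(functions)), key=lambda i: scores[i], reverse=True)[:10]
candidates = [functions[i] for i in top_k]

# Stage 2 (placeholder): rerank the shortlist with a SweRankLLM reranker.
# rerank_with_llm is NOT a real function here; the actual reranking code lives
# in https://github.com/gangiswag/SweRank
# reranked = rerank_with_llm(issue, candidates)
```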

Link to code: [https://github.com/gangiswag/SweRank](https://github.com/gangiswag/SweRank)

## Performance

SweRank models achieve state-of-the-art localization performance on benchmarks such as SWE-Bench-Lite and LocBench, considerably outperforming agent-based approaches that rely on Claude 3.5.

| Model Name | SWE-Bench-Lite Func@10 | LocBench Func@15 |
| --- | --- | --- |
| OpenHands (Claude 3.5) | 70.07 | 59.29 |
| LocAgent (Claude 3.5) | 77.37 | 60.71 |
| CodeRankEmbed (137M) | 58.76 | 50.89 |
| GTE-Qwen2-7B-Instruct (7B) | 70.44 | 57.14 |
| SweRankEmbed-Small (137M) | 74.45 | 63.39 |
| SweRankEmbed-Large (7B) | 82.12 | 67.32 |
| + GPT-4.1 reranker | 87.96 | 74.64 |
| + SweRankLLM-Small (7B) reranker | 86.13 | 74.46 |
| + SweRankLLM-Large (32B) reranker | 88.69 | 76.25 |

## Requirements

```shell
transformers>=4.39.2
flash_attn>=2.5.6
```
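
For example, assuming a standard pip-based environment (and that you also want `sentence-transformers` for the usage example below), an installation along these lines should work:

```shell
# flash_attn typically needs torch available first and may require
# --no-build-isolation to compile, depending on your environment.
pip install "transformers>=4.39.2" sentence-transformers
pip install "flash_attn>=2.5.6" --no-build-isolation
```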

## Usage with Sentence-Transformers

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Salesforce/SweRankEmbed-Large", trust_remote_code=True)
# In case you want to reduce the maximum sequence length:
model.max_seq_length = 8192

queries = ['Calculate the n-th factorial']
documents = ['def fact(n):\n if n < 0:\n  raise ValueError\n return 1 if n == 0 else n * fact(n - 1)']

query_embeddings = model.encode(queries, prompt_name="query")
document_embeddings = model.encode(documents)

scores = query_embeddings @ document_embeddings.T

for query, query_scores in zip(queries, scores):
    doc_score_pairs = list(zip(documents, query_scores))
    # Sort documents by decreasing similarity score
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
    # Output passages & scores
    print("Query:", query)
    for document, score in doc_score_pairs:
        print(score, document)
```

See `config_sentence_transformers.json` for all pre-built prompt names.
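
As a quick sketch (assuming a recent Sentence-Transformers version that exposes the configured prompts via the `prompts` attribute), you can also inspect them directly from the loaded model:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Salesforce/SweRankEmbed-Large", trust_remote_code=True)

# Print every pre-built prompt name together with its prompt text.
for name, prompt in model.prompts.items():
    print(name, "->", repr(prompt))
```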

## Usage with Huggingface Transformers

**Important**: the query prompt must include the following task instruction prefix: "*Instruct: Given a github issue, identify the code that needs to be changed to fix the issue.\nQuery: *"

```python
import torch
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel

def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # Pool the embedding of the last non-padding token for each sequence
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]

def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery: {query}'

# Each query must come with a one-sentence instruction that describes the task
task = 'Given a github issue, identify the code that needs to be changed to fix the issue.'

tokenizer = AutoTokenizer.from_pretrained('Salesforce/SweRankEmbed-Large', trust_remote_code=True)
model = AutoModel.from_pretrained('Salesforce/SweRankEmbed-Large', trust_remote_code=True)
model.eval()

max_length = 8192

queries = ['Calculate the n-th factorial']
queries_with_prefix = [get_detailed_instruct(task, query) for query in queries]
query_inputs = tokenizer(queries_with_prefix, padding=True, truncation=True, return_tensors='pt', max_length=max_length)

documents = ['def fact(n):\n if n < 0:\n  raise ValueError\n return 1 if n == 0 else n * fact(n - 1)']
document_inputs = tokenizer(documents, padding=True, truncation=True, return_tensors='pt', max_length=max_length)

# Compute token embeddings and pool the last non-padding token
with torch.no_grad():
    query_embeddings = last_token_pool(model(**query_inputs).last_hidden_state, query_inputs["attention_mask"])
    document_embeddings = last_token_pool(model(**document_inputs).last_hidden_state, document_inputs["attention_mask"])

# Normalize embeddings so the dot product equals cosine similarity
query_embeddings = F.normalize(query_embeddings, p=2, dim=1)
document_embeddings = F.normalize(document_embeddings, p=2, dim=1)

scores = torch.mm(query_embeddings, document_embeddings.transpose(0, 1))
for query, query_scores in zip(queries, scores):
    doc_score_pairs = list(zip(documents, query_scores))
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
    # Output passages & scores
    print("Query:", query)
    for document, score in doc_score_pairs:
        print(score, document)
```
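
On a GPU, you will likely want to load the 7B encoder in half precision and, given the `flash_attn` requirement above, with FlashAttention 2 enabled. A minimal sketch, assuming a CUDA device and that the checkpoint supports the `flash_attention_2` attention implementation:

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    'Salesforce/SweRankEmbed-Large',
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,               # half precision to fit the 7B model
    attn_implementation="flash_attention_2",  # requires flash_attn to be installed
).cuda().eval()
```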

## Citation

If you find this work useful in your research, please consider citing our paper:

```
@article{reddy2025swerank,
  title={SweRank: Software Issue Localization with Code Ranking},
  author={Reddy, Revanth Gangi and Suresh, Tarun and Doo, JaeHyeok and Liu, Ye and Nguyen, Xuan Phi and Zhou, Yingbo and Yavuz, Semih and Xiong, Caiming and Ji, Heng and Joty, Shafiq},
  journal={arXiv preprint arXiv:2505.07849},
  year={2025}
}
```