CoREB-Reranker is a code reranker fine-tuned from Qwen3-Reranker-4B via LoRA on a mixed reranker corpus. It is the only reranker we evaluate that achieves consistent gains across all three code search tasks (text-to-code, code-to-text, and code-to-code).
Reranking delta on CoREB v202603, using C2LLM-7B as the first-stage retriever:
| Reranker | Text-to-Code | Code-to-Text | Code-to-Code |
|---|---|---|---|
| Jina Reranker v2 | -8.3 | -22.4 | -8.8 |
| Jina Reranker v3 | -2.2 | -5.0 | -0.1 |
| Qwen3-Reranker-0.6B | -0.6 | -8.2 | -2.3 |
| Qwen3-Reranker-4B | -0.1 | -3.2 | +3.3 |
| CoREB-Reranker (ours) | +1.1 | +0.8 | +5.1 |
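For concreteness, here is a toy sketch of how such a reranking delta can be computed: score the first-stage candidates with the reranker, re-sort, and compare a ranking metric before and after. The data and the choice of MRR here are illustrative assumptions, not the benchmark's actual protocol:

```python
def mrr(ranked_ids, relevant_id):
    """Reciprocal rank of the relevant document (0.0 if absent)."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

# Toy first-stage retrieval: the relevant doc "d3" is ranked third.
first_stage = ["d1", "d2", "d3", "d4"]

# Hypothetical reranker scores for the same candidates.
rerank_scores = {"d1": 0.2, "d2": 0.5, "d3": 2.1, "d4": -0.3}
reranked = sorted(first_stage, key=lambda d: rerank_scores[d], reverse=True)

# Delta in points: positive means the reranker improved the ranking.
delta = (mrr(reranked, "d3") - mrr(first_stage, "d3")) * 100
print(round(delta, 1))  # 66.7
```

A negative delta, as several baselines show in the table above, means the reranker reordered candidates worse than the first-stage retriever left them.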
CoREB-Reranker follows the same usage pattern as Qwen3-Reranker. The instruction is task-specific, so use the one that matches your retrieval task:
```python
from enum import Enum

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class Task(Enum):
    TEXT_TO_CODE = "Given a natural language programming task, retrieve code that correctly solves or implements the task."
    CODE_TO_CODE = "Given a code snippet, retrieve code that is semantically equivalent or solves the same task."
    CODE_TO_TEXT = "Given a code snippet, retrieve the natural language description or problem statement that best matches the code."


model_id = "hq-bench/coreb-code-reranker"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
)
model.eval()

PREFIX = '<|im_start|>system\nJudge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".<|im_end|>\n<|im_start|>user\n'
SUFFIX = "<|im_end|>\n<|im_start|>assistant\n"

yes_id = tokenizer.convert_tokens_to_ids("yes")
no_id = tokenizer.convert_tokens_to_ids("no")


def score(query: str, document: str, task: Task) -> float:
    """Relevance score: logit of "yes" minus logit of "no" at the last position."""
    prompt = f"{PREFIX}<Instruct>: {task.value}\n<Query>: {query}\n<Document>: {document}{SUFFIX}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=4096)
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1, :]
    return (logits[yes_id] - logits[no_id]).item()


# Text-to-Code: natural language query -> code
print(score(
    query="binary search implementation",
    document="def binary_search(arr, target):\n    lo, hi = 0, len(arr) - 1\n    ...",
    task=Task.TEXT_TO_CODE,
))

# Code-to-Code: code -> semantically equivalent code
print(score(
    query="def binary_search(arr, target): ...",
    document="int binarySearch(int[] arr, int target) { ... }",
    task=Task.CODE_TO_CODE,
))

# Code-to-Text: code -> problem description
print(score(
    query="def binary_search(arr, target): ...",
    document="Find the index of a target value in a sorted array using binary search.",
    task=Task.CODE_TO_TEXT,
))
```
For batch reranking with the CoREB evaluation pipeline, see the CoREB repository.
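For simple use cases, a pointwise loop over candidates is enough to rerank a retrieved list. The sketch below wraps a `score`-style function (signature assumed to match the example above) and substitutes a toy scorer so it runs without loading the model:

```python
def rerank(query, documents, task, score_fn):
    """Return documents sorted by descending reranker score.

    score_fn is assumed to have the signature score(query, document, task);
    in practice, pass the model-backed score() from the usage example.
    """
    scored = [(score_fn(query, doc, task), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]

# Toy stand-in scorer, used here so the sketch runs without the model.
toy_scores = {"a": 0.1, "b": 1.5, "c": -0.4}
result = rerank("q", ["a", "b", "c"], None, lambda q, d, t: toy_scores[d])
print(result)  # ['b', 'a', 'c']
```

For real workloads, tokenizing query-document pairs in padded batches rather than one at a time will give much better GPU throughput.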
```bibtex
@article{xue2026coreb,
  title={Beyond Retrieval: A Multitask Benchmark and Reranker for Code Search},
  author={Xue, Siqiao and Liao, Zihan and Qin, Jin and Zhang, Ziyin and Mu, Yixiang and Zhou, Fan and Yu, Hang},
  journal={arXiv preprint arXiv:2605.04615},
  year={2026},
  url={https://arxiv.org/abs/2605.04615}
}
```