---
license: apache-2.0
language:
- en
- code
library_name: transformers
tags:
- code
- embeddings
- retrieval
- code-search
- semantic-search
- feature-extraction
- sentence-transformers
datasets:
- code-rag-bench/cornstack
- bigcode/stackoverflow
- code_search_net
pipeline_tag: feature-extraction
base_model: Qwen/Qwen2.5-Coder-0.5B
model-index:
- name: CodeCompass-Embed
  results:
  - task:
      type: retrieval
      name: Code Retrieval
    dataset:
      type: CoIR-Retrieval/codetrans-dl
      name: CodeTrans-DL
    metrics:
    - type: ndcg@10
      value: 0.3305
      name: NDCG@10
  - task:
      type: retrieval
      name: Code Retrieval
    dataset:
      type: CoIR-Retrieval/CodeSearchNet-python
      name: CodeSearchNet Python
    metrics:
    - type: ndcg@10
      value: 0.9228
      name: NDCG@10
    - type: mrr@10
      value: 0.9106
      name: MRR@10
---
# CodeCompass-Embed

CodeCompass-Embed is a code embedding model fine-tuned from Qwen2.5-Coder-0.5B for semantic code search and retrieval tasks.

## Model Highlights

- #1 on CodeTrans-DL (code translation between frameworks)
- #4 on CodeSearchNet-Python (natural language to code search)
- 494M parameters, 896-dim embeddings
- Bidirectional attention (converted from a causal LLM)
- Mean pooling with L2 normalization
- Trained at 512 tokens; extrapolates to longer sequences via RoPE
## Model Details
| Property | Value |
|---|---|
| Base Model | Qwen2.5-Coder-0.5B |
| Parameters | 494M |
| Embedding Dimension | 896 |
| Max Sequence Length | 512 (training) / 32K (inference) |
| Pooling | Mean |
| Normalization | L2 |
| Attention | Bidirectional (all 24 layers) |
## Benchmark Results (CoIR)

Evaluated on the CoIR benchmark (NDCG@10), sorted by CSN-Python. Abbreviations: CSN = CodeSearchNet, SO-QA = StackOverflow-QA, CF-ST = CodeFeedback-ST.
| Model | Params | CSN-Python | CodeTrans-DL | Text2SQL | SO-QA | CF-ST | Apps |
|---|---|---|---|---|---|---|---|
| SFR-Embedding-Code | 400M | 0.9505 | 0.2683 | 0.9949 | 0.9107 | 0.7258 | 0.2212 |
| Jina-Code-v2 | 161M | 0.9439 | 0.2739 | 0.5169 | 0.8874 | 0.6975 | 0.1538 |
| CodeRankEmbed | 137M | 0.9378 | 0.2604 | 0.7686 | 0.8990 | 0.7166 | 0.1993 |
| CodeCompass-Embed | 494M | 0.9228 | 0.3305 | 0.5673 | 0.6480 | 0.4080 | 0.1277 |
| Snowflake-Arctic-Embed-L | 568M | 0.9146 | 0.1958 | 0.5401 | 0.8718 | 0.6503 | 0.1435 |
| BGE-M3 | 568M | 0.8976 | 0.2194 | 0.5728 | 0.8501 | 0.6437 | 0.1445 |
| BGE-Base-en-v1.5 | 109M | 0.8944 | 0.2125 | 0.5265 | 0.8581 | 0.6423 | 0.1415 |
| CodeT5+-110M | 110M | 0.8702 | 0.1794 | 0.3275 | 0.8147 | 0.5804 | 0.1179 |
CodeCompass-Embed ranks #1 on CodeTrans-DL and #4 on CSN-Python.
## Usage

### With Transformers
```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Load model
model = AutoModel.from_pretrained("faisalmumtaz/codecompass-embed", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("faisalmumtaz/codecompass-embed")

# CRITICAL: enable bidirectional attention for embeddings
for layer in model.layers:
    layer.self_attn.is_causal = False
model.eval()

def encode(texts, is_query=False):
    # Add the instruction prefix to queries only
    if is_query:
        texts = [f"Instruct: Find the most relevant code snippet given the following query:\nQuery: {t}" for t in texts]
    inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    hidden = outputs.hidden_states[-1]
    # Mean pooling over non-padding tokens
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    embeddings = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
    # L2 normalize
    embeddings = F.normalize(embeddings, p=2, dim=-1)
    return embeddings

# Example: code search
query = "How to sort a list in Python"
code_snippets = [
    "def sort_list(lst):\n    return sorted(lst)",
    "def add_numbers(a, b):\n    return a + b",
    "def reverse_string(s):\n    return s[::-1]",
]

query_emb = encode([query], is_query=True)
code_embs = encode(code_snippets, is_query=False)

# Embeddings are L2-normalized, so dot product equals cosine similarity
similarities = (query_emb @ code_embs.T).squeeze()
print(f"Query: {query}")
for code, sim in zip(code_snippets, similarities):
    print(f"  [{sim:.4f}] {code[:50]}...")
```
### Instruction Templates
For optimal performance, use these instruction prefixes for queries:
| Task | Instruction Template |
|---|---|
| NL → Code | `Instruct: Find the most relevant code snippet given the following query:\nQuery: {query}` |
| Code → Code | `Instruct: Find an equivalent code snippet given the following code snippet:\nQuery: {query}` |
| Tech Q&A | `Instruct: Find the most relevant answer given the following question:\nQuery: {query}` |
| Text → SQL | `Instruct: Given a natural language question and schema, find the corresponding SQL query:\nQuery: {query}` |
Note: Document/corpus texts do NOT need instruction prefixes.
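The templates above can be wrapped in a small helper so the right prefix is applied per task. This is an illustrative sketch; `build_query` and the `TEMPLATES` keys are example names, not part of the model's API:

```python
# Illustrative helper for the instruction templates above.
# The function name and TEMPLATES keys are examples, not part of the model's API.
TEMPLATES = {
    "nl2code": "Instruct: Find the most relevant code snippet given the following query:\nQuery: {query}",
    "code2code": "Instruct: Find an equivalent code snippet given the following code snippet:\nQuery: {query}",
    "qa": "Instruct: Find the most relevant answer given the following question:\nQuery: {query}",
    "text2sql": "Instruct: Given a natural language question and schema, find the corresponding SQL query:\nQuery: {query}",
}

def build_query(text: str, task: str = "nl2code") -> str:
    """Wrap a raw query in the task-specific instruction prefix."""
    return TEMPLATES[task].format(query=text)

print(build_query("How to sort a list in Python"))
# Documents/corpus texts are encoded as-is, with no prefix.
```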
## Training Details
- Base Model: Qwen2.5-Coder-0.5B
- Training Data: 8.8M samples from CoRNStack, StackOverflow, CodeSearchNet
- Architecture Modification: Converted all 24 attention layers from causal to bidirectional
- Pooling: Mean pooling
- Loss: InfoNCE with temperature τ = 0.05
- Hard Negatives: 7 per sample (embedding-mined)
- Effective Batch Size: 1024 (via GradCache)
- Training Steps: 950
- Hardware: NVIDIA H100
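The training objective can be sketched as follows: InfoNCE over in-batch negatives plus the 7 mined hard negatives per sample, with temperature τ = 0.05. This is an illustrative PyTorch reimplementation, not the actual training code:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q, pos, hard_negs, tau=0.05):
    """InfoNCE over in-batch negatives plus mined hard negatives.

    q:         (B, D) L2-normalized query embeddings
    pos:       (B, D) L2-normalized positive document embeddings
    hard_negs: (B, K, D) L2-normalized hard negatives (K = 7 during training)
    """
    B = q.size(0)
    # Positives on the diagonal, in-batch negatives off-diagonal: (B, B)
    in_batch = q @ pos.T
    # Each query scored against its own K hard negatives: (B, K)
    hard = torch.einsum("bd,bkd->bk", q, hard_negs)
    logits = torch.cat([in_batch, hard], dim=1) / tau
    # The correct document for query i is column i
    labels = torch.arange(B, device=q.device)
    return F.cross_entropy(logits, labels)

# Toy check with random embeddings (896-dim, matching the model)
q = F.normalize(torch.randn(4, 896), dim=-1)
pos = F.normalize(torch.randn(4, 896), dim=-1)
negs = F.normalize(torch.randn(4, 7, 896), dim=-1)
loss = info_nce_loss(q, pos, negs)
```

Low temperatures like 0.05 sharpen the softmax, so the model is penalized heavily for ranking any negative close to the positive.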
## Limitations
- Weaker on Q&A style tasks (StackOverflow-QA, CodeFeedback)
- Trained primarily on Python/JavaScript/Java/Go/PHP/Ruby
- May not generalize well to low-resource programming languages
## Citation

```bibtex
@misc{codecompass2026,
  author    = {Faisal Mumtaz},
  title     = {CodeCompass-Embed: A Code Embedding Model for Semantic Code Search},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/faisalmumtaz/codecompass-embed}
}
```
## License

Apache 2.0