---
license: apache-2.0
language:
- en
- code
library_name: transformers
tags:
- code
- embeddings
- retrieval
- code-search
- semantic-search
- feature-extraction
- sentence-transformers
datasets:
- code-rag-bench/cornstack
- bigcode/stackoverflow
- code_search_net
pipeline_tag: feature-extraction
base_model: Qwen/Qwen2.5-Coder-0.5B
model-index:
- name: CodeCompass-Embed
  results:
  - task:
      type: retrieval
      name: Code Retrieval
    dataset:
      type: CoIR-Retrieval/codetrans-dl
      name: CodeTrans-DL
    metrics:
    - type: ndcg@10
      value: 0.3305
      name: NDCG@10
  - task:
      type: retrieval
      name: Code Retrieval
    dataset:
      type: CoIR-Retrieval/CodeSearchNet-python
      name: CodeSearchNet Python
    metrics:
    - type: ndcg@10
      value: 0.9228
      name: NDCG@10
---

# CodeCompass-Embed

**CodeCompass-Embed** is a code embedding model fine-tuned from [Qwen2.5-Coder-0.5B](https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B) for semantic code search and retrieval tasks.

## Model Highlights

- ๐Ÿ† #1 on CodeTrans-DL (code translation between frameworks)
- ๐Ÿฅ‡ #4 on CodeSearchNet-Python (natural language to code search)
- โšก 494M parameters, 896-dim embeddings
- ๐Ÿ”„ Bidirectional attention (converted from causal LLM)
- ๐ŸŽฏ Mean pooling with L2 normalization
- ๐Ÿ“ Trained at 512 tokens, extrapolates to longer sequences via RoPE

## Model Details

| Property | Value |
|----------|-------|
| Base Model | Qwen2.5-Coder-0.5B |
| Parameters | 494M |
| Embedding Dimension | 896 |
| Max Sequence Length | 512 (training) / 32K (inference) |
| Pooling | Mean |
| Normalization | L2 |
| Attention | Bidirectional (all 24 layers) |

## Benchmark Results (CoIR)

Evaluated on the [CoIR Benchmark](https://github.com/CoIR-team/coir) (NDCG@10). Sorted by CSN-Python.

| Model | Params | CSN-Python | CodeTrans-DL | Text2SQL | SO-QA | CF-ST | Apps |
|-------|--------|------------|--------------|----------|-------|-------|------|
| SFR-Embedding-Code | 400M | 0.9505 | 0.2683 | 0.9949 | 0.9107 | 0.7258 | 0.2212 |
| Jina-Code-v2 | 161M | 0.9439 | 0.2739 | 0.5169 | 0.8874 | 0.6975 | 0.1538 |
| CodeRankEmbed | 137M | 0.9378 | 0.2604 | 0.7686 | 0.8990 | 0.7166 | 0.1993 |
| **CodeCompass-Embed** | **494M** | **0.9228** | **0.3305** | **0.5673** | **0.6480** | **0.4080** | **0.1277** |
| Snowflake-Arctic-Embed-L | 568M | 0.9146 | 0.1958 | 0.5401 | 0.8718 | 0.6503 | 0.1435 |
| BGE-M3 | 568M | 0.8976 | 0.2194 | 0.5728 | 0.8501 | 0.6437 | 0.1445 |
| BGE-Base-en-v1.5 | 109M | 0.8944 | 0.2125 | 0.5265 | 0.8581 | 0.6423 | 0.1415 |
| CodeT5+-110M | 110M | 0.8702 | 0.1794 | 0.3275 | 0.8147 | 0.5804 | 0.1179 |

*CodeCompass-Embed ranks #1 on CodeTrans-DL and #4 on CSN-Python.*

## Usage

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("faisalmumtaz/codecompass-embed", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("faisalmumtaz/codecompass-embed")

# Enable bidirectional attention
for layer in model.layers:
    layer.self_attn.is_causal = False

model.eval()

def encode(texts, is_query=False):
    if is_query:
        texts = [f"Instruct: Find the most relevant code snippet given the following query:\nQuery: {t}" for t in texts]
    
    inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
        hidden = outputs.hidden_states[-1]
        mask = inputs["attention_mask"].unsqueeze(-1).float()
        embeddings = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
        embeddings = F.normalize(embeddings, p=2, dim=-1)
    
    return embeddings

query_emb = encode(["sort a list"], is_query=True)
code_embs = encode(["def sort(lst): return sorted(lst)"])
similarity = (query_emb @ code_embs.T).item()
```
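Once queries and documents are embedded, retrieval reduces to a cosine-similarity top-k search over the document embeddings. A minimal sketch (the helper name `top_k_codes` is illustrative; the random tensors below stand in for `encode()` outputs, which are already L2-normalized):

```python
import torch
import torch.nn.functional as F

def top_k_codes(query_emb: torch.Tensor, code_embs: torch.Tensor, k: int = 3):
    """Rank code embeddings by similarity to a single query embedding.

    Both inputs are assumed L2-normalized, so the dot product equals
    cosine similarity.
    """
    scores = code_embs @ query_emb            # (num_codes,)
    k = min(k, code_embs.shape[0])
    values, indices = torch.topk(scores, k)   # highest similarity first
    return indices.tolist(), values.tolist()

# Stand-in embeddings (normally produced by encode()); 896-dim like the model.
torch.manual_seed(0)
query = F.normalize(torch.randn(896), dim=-1)
codes = F.normalize(torch.randn(5, 896), dim=-1)
idx, vals = top_k_codes(query, codes, k=3)
```

For larger corpora the same dot-product search can be delegated to a vector index (e.g. FAISS) without changing the embedding step.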

## Instruction Templates

| Task | Template |
|------|----------|
| NL to Code | `Instruct: Find the most relevant code snippet given the following query:\nQuery: {q}` |
| Code to Code | `Instruct: Find an equivalent code snippet given the following code snippet:\nQuery: {q}` |
| Tech Q&A | `Instruct: Find the most relevant answer given the following question:\nQuery: {q}` |
| Text to SQL | `Instruct: Given a natural language question and schema, find the corresponding SQL query:\nQuery: {q}` |

Documents do not need instruction prefixes.
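The table above can be wrapped in a small formatting helper so query-side prefixing stays consistent across tasks. A sketch (the `TEMPLATES` dict and `format_query` name are illustrative, not part of the model's API):

```python
# Query-side instruction templates from the table above (documents stay unprefixed).
TEMPLATES = {
    "nl2code": "Instruct: Find the most relevant code snippet given the following query:\nQuery: {q}",
    "code2code": "Instruct: Find an equivalent code snippet given the following code snippet:\nQuery: {q}",
    "techqa": "Instruct: Find the most relevant answer given the following question:\nQuery: {q}",
    "text2sql": "Instruct: Given a natural language question and schema, find the corresponding SQL query:\nQuery: {q}",
}

def format_query(text: str, task: str = "nl2code") -> str:
    """Prepend the task-specific instruction to a query before encoding."""
    return TEMPLATES[task].format(q=text)
```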

## Training

- **Data**: 8.8M samples from CoRNStack, StackOverflow, CodeSearchNet
- **Loss**: InfoNCE (ฯ„=0.05) with 7 hard negatives per sample
- **Batch Size**: 1024 (via GradCache)
- **Steps**: 950
- **Hardware**: NVIDIA H100
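The InfoNCE objective above can be sketched as follows. This is a minimal illustration under simplifying assumptions (single GPU, no GradCache), not the actual training code: each query is scored against its positive, the other in-batch positives, and its 7 hard negatives, with temperature τ = 0.05.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q, pos, hard_negs, tau=0.05):
    """InfoNCE over one positive plus in-batch and hard negatives per query.

    q:         (B, D) query embeddings
    pos:       (B, D) positive document embeddings
    hard_negs: (B, N, D) hard-negative embeddings (N = 7 during training)
    All inputs are L2-normalized, so dot products are cosine similarities.
    """
    B = q.shape[0]
    sim_pos = q @ pos.T                                  # (B, B); diagonal = true pairs
    sim_neg = torch.einsum("bd,bnd->bn", q, hard_negs)   # (B, N)
    logits = torch.cat([sim_pos, sim_neg], dim=1) / tau  # (B, B + N)
    labels = torch.arange(B)                             # positive sits at column i
    return F.cross_entropy(logits, labels)

torch.manual_seed(0)
q = F.normalize(torch.randn(4, 896), dim=-1)
pos = F.normalize(torch.randn(4, 896), dim=-1)
negs = F.normalize(torch.randn(4, 7, 896), dim=-1)
loss = info_nce_loss(q, pos, negs)
```

GradCache makes the large effective batch (1024) feasible by recomputing embeddings in chunks, but the loss itself is unchanged.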

## Limitations

- Weaker on Q&A-style tasks (StackOverflow-QA, CodeFeedback)
- Trained on Python, JavaScript, Java, Go, PHP, and Ruby; retrieval quality for other languages may be lower

## Citation

```bibtex
@misc{codecompass2026,
  author = {Faisal Mumtaz},
  title = {CodeCompass-Embed: A Code Embedding Model for Semantic Code Search},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/faisalmumtaz/codecompass-embed}
}
```

## License

Apache 2.0