---
license: apache-2.0
language:
- en
- code
library_name: transformers
tags:
- code
- embeddings
- retrieval
- code-search
- semantic-search
- feature-extraction
- sentence-transformers
datasets:
- code-rag-bench/cornstack
- bigcode/stackoverflow
- code_search_net
pipeline_tag: feature-extraction
base_model: Qwen/Qwen2.5-Coder-0.5B
model-index:
- name: CodeCompass-Embed
results:
- task:
type: retrieval
name: Code Retrieval
dataset:
type: CoIR-Retrieval/codetrans-dl
name: CodeTrans-DL
metrics:
- type: ndcg@10
value: 0.3305
name: NDCG@10
- task:
type: retrieval
name: Code Retrieval
dataset:
type: CoIR-Retrieval/CodeSearchNet-python
name: CodeSearchNet Python
metrics:
- type: ndcg@10
value: 0.9228
name: NDCG@10
---
# CodeCompass-Embed
**CodeCompass-Embed** is a code embedding model fine-tuned from [Qwen2.5-Coder-0.5B](https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B) for semantic code search and retrieval tasks.
## Model Highlights
- #1 on CodeTrans-DL (code translation between frameworks)
- #4 on CodeSearchNet-Python (natural language to code search)
- 494M parameters, 896-dim embeddings
- Bidirectional attention (converted from a causal LLM)
- Mean pooling with L2 normalization
- Trained at 512 tokens; extrapolates to longer sequences via RoPE
## Model Details
| Property | Value |
|----------|-------|
| Base Model | Qwen2.5-Coder-0.5B |
| Parameters | 494M |
| Embedding Dimension | 896 |
| Max Sequence Length | 512 (training) / 32K (inference) |
| Pooling | Mean |
| Normalization | L2 |
| Attention | Bidirectional (all 24 layers) |
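The mean pooling and L2 normalization steps listed above can be sketched in plain Python (a minimal illustration of the pooling arithmetic, not the model code; the full PyTorch version appears in the Usage section):

```python
import math

def mean_pool_l2(token_vecs, mask):
    """Average the token vectors where mask == 1, then L2-normalize.

    token_vecs: list of per-token hidden vectors (lists of floats)
    mask: attention mask, 1 for real tokens and 0 for padding
    """
    dim = len(token_vecs[0])
    n = sum(mask)
    pooled = [
        sum(v[d] for v, m in zip(token_vecs, mask) if m) / n
        for d in range(dim)
    ]
    norm = math.sqrt(sum(x * x for x in pooled))
    return [x / norm for x in pooled]

# Toy example: the third vector is padding and is ignored by the mask.
v = mean_pool_l2([[2.0, 0.0], [0.0, 2.0], [9.0, 9.0]], [1, 1, 0])
```

After normalization the embedding has unit length, so a dot product between two embeddings equals their cosine similarity.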
## Benchmark Results (CoIR)
Evaluated on the [CoIR Benchmark](https://github.com/CoIR-team/coir) (NDCG@10), sorted by CSN-Python. Column abbreviations: CSN = CodeSearchNet, SO-QA = StackOverflow-QA, CF-ST = CodeFeedback-ST.
| Model | Params | CSN-Python | CodeTrans-DL | Text2SQL | SO-QA | CF-ST | Apps |
|-------|--------|------------|--------------|----------|-------|-------|------|
| SFR-Embedding-Code | 400M | 0.9505 | 0.2683 | 0.9949 | 0.9107 | 0.7258 | 0.2212 |
| Jina-Code-v2 | 161M | 0.9439 | 0.2739 | 0.5169 | 0.8874 | 0.6975 | 0.1538 |
| CodeRankEmbed | 137M | 0.9378 | 0.2604 | 0.7686 | 0.8990 | 0.7166 | 0.1993 |
| **CodeCompass-Embed** | **494M** | **0.9228** | **0.3305** | **0.5673** | **0.6480** | **0.4080** | **0.1277** |
| Snowflake-Arctic-Embed-L | 568M | 0.9146 | 0.1958 | 0.5401 | 0.8718 | 0.6503 | 0.1435 |
| BGE-M3 | 568M | 0.8976 | 0.2194 | 0.5728 | 0.8501 | 0.6437 | 0.1445 |
| BGE-Base-en-v1.5 | 109M | 0.8944 | 0.2125 | 0.5265 | 0.8581 | 0.6423 | 0.1415 |
| CodeT5+-110M | 110M | 0.8702 | 0.1794 | 0.3275 | 0.8147 | 0.5804 | 0.1179 |
*CodeCompass-Embed ranks #1 on CodeTrans-DL and #4 on CSN-Python.*
## Usage
```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("faisalmumtaz/codecompass-embed", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("faisalmumtaz/codecompass-embed")

# Enable bidirectional attention in all layers
for layer in model.layers:
    layer.self_attn.is_causal = False
model.eval()

def encode(texts, is_query=False):
    if is_query:
        texts = [
            f"Instruct: Find the most relevant code snippet given the following query:\nQuery: {t}"
            for t in texts
        ]
    inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    hidden = outputs.hidden_states[-1]
    # Mean pooling over non-padding tokens, then L2 normalization
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    embeddings = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
    embeddings = F.normalize(embeddings, p=2, dim=-1)
    return embeddings

query_emb = encode(["sort a list"], is_query=True)
code_embs = encode(["def sort(lst): return sorted(lst)"])
similarity = (query_emb @ code_embs.T).item()
```
## Instruction Templates
| Task | Template |
|------|----------|
| NL to Code | `Instruct: Find the most relevant code snippet given the following query:\nQuery: {q}` |
| Code to Code | `Instruct: Find an equivalent code snippet given the following code snippet:\nQuery: {q}` |
| Tech Q&A | `Instruct: Find the most relevant answer given the following question:\nQuery: {q}` |
| Text to SQL | `Instruct: Given a natural language question and schema, find the corresponding SQL query:\nQuery: {q}` |

Documents do not need instruction prefixes.
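For convenience, the templates can be applied with a small helper (`format_query` and the task keys are hypothetical names for illustration, not part of the model's API; the template strings match the table above):

```python
# Hypothetical task -> instruction mapping; strings follow the templates above.
TEMPLATES = {
    "nl2code": "Instruct: Find the most relevant code snippet given the following query:",
    "code2code": "Instruct: Find an equivalent code snippet given the following code snippet:",
    "techqa": "Instruct: Find the most relevant answer given the following question:",
    "text2sql": "Instruct: Given a natural language question and schema, find the corresponding SQL query:",
}

def format_query(task: str, q: str) -> str:
    """Prefix a query with its task instruction; documents are left unprefixed."""
    return f"{TEMPLATES[task]}\nQuery: {q}"

formatted = format_query("nl2code", "sort a list")
```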
## Training
- **Data**: 8.8M samples from CoRNStack, StackOverflow, CodeSearchNet
- **Loss**: InfoNCE (τ=0.05) with 7 hard negatives per sample
- **Batch Size**: 1024 (via GradCache)
- **Steps**: 950
- **Hardware**: NVIDIA H100
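The InfoNCE objective above can be sketched for a single query in plain Python (a minimal illustration; the actual training code, in-batch negatives, and GradCache integration are not shown):

```python
import math

def info_nce(q, pos, negs, tau=0.05):
    """InfoNCE loss for one query against one positive and its hard negatives.

    q, pos, and each entry of negs are L2-normalized vectors, so the dot
    product is cosine similarity; tau is the temperature.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    logits = [dot(q, pos) / tau] + [dot(q, n) / tau for n in negs]
    # Numerically stable -log softmax of the positive (index 0)
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[0] - log_denom)

# Toy example: the positive is close to the query, the negatives are not.
q = [1.0, 0.0]
pos = [0.9, math.sqrt(1 - 0.81)]
negs = [[0.0, 1.0], [-1.0, 0.0]]
loss = info_nce(q, pos, negs)
```

A low temperature (τ=0.05) sharpens the softmax, so the loss penalizes any hard negative whose similarity approaches the positive's.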
## Limitations
- Weaker on Q&A style tasks (StackOverflow-QA, CodeFeedback)
- Trained only on Python, JavaScript, Java, Go, PHP, and Ruby; other languages are out of distribution
## Citation
```bibtex
@misc{codecompass2026,
author = {Faisal Mumtaz},
title = {CodeCompass-Embed: A Code Embedding Model for Semantic Code Search},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/faisalmumtaz/codecompass-embed}
}
```
## License
Apache 2.0