|
|
--- |
|
|
license: apache-2.0 |
|
|
language: |
|
|
- en |
|
|
- code |
|
|
library_name: transformers |
|
|
tags: |
|
|
- code |
|
|
- embeddings |
|
|
- retrieval |
|
|
- code-search |
|
|
- semantic-search |
|
|
- feature-extraction |
|
|
- sentence-transformers |
|
|
datasets: |
|
|
- code-rag-bench/cornstack |
|
|
- bigcode/stackoverflow |
|
|
- code_search_net |
|
|
pipeline_tag: feature-extraction |
|
|
base_model: Qwen/Qwen2.5-Coder-0.5B |
|
|
model-index: |
|
|
- name: CodeCompass-Embed |
|
|
results: |
|
|
- task: |
|
|
type: retrieval |
|
|
name: Code Retrieval |
|
|
dataset: |
|
|
type: CoIR-Retrieval/codetrans-dl |
|
|
name: CodeTrans-DL |
|
|
metrics: |
|
|
- type: ndcg@10 |
|
|
value: 0.3305 |
|
|
name: NDCG@10 |
|
|
- task: |
|
|
type: retrieval |
|
|
name: Code Retrieval |
|
|
dataset: |
|
|
type: CoIR-Retrieval/CodeSearchNet-python |
|
|
name: CodeSearchNet Python |
|
|
metrics: |
|
|
- type: ndcg@10 |
|
|
value: 0.9228 |
|
|
name: NDCG@10 |
|
|
--- |
|
|
|
|
|
# CodeCompass-Embed |
|
|
|
|
|
**CodeCompass-Embed** is a code embedding model fine-tuned from [Qwen2.5-Coder-0.5B](https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B) for semantic code search and retrieval tasks. |
|
|
|
|
|
## Model Highlights |
|
|
|
|
|
- #1 on CodeTrans-DL (code translation between frameworks)
- #4 on CodeSearchNet-Python (natural language to code search)
- 494M parameters, 896-dim embeddings
- Bidirectional attention (converted from a causal LLM)
- Mean pooling with L2 normalization
- Trained at 512 tokens; extrapolates to longer sequences via RoPE
|
|
|
|
|
## Model Details |
|
|
|
|
|
| Property | Value | |
|
|
|----------|-------| |
|
|
| Base Model | Qwen2.5-Coder-0.5B | |
|
|
| Parameters | 494M | |
|
|
| Embedding Dimension | 896 | |
|
|
| Max Sequence Length | 512 (training) / 32K (inference) | |
|
|
| Pooling | Mean | |
|
|
| Normalization | L2 | |
|
|
| Attention | Bidirectional (all 24 layers) | |
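The pooling recipe above (mask-aware mean pooling followed by L2 normalization) can be sketched in isolation with random tensors; shapes are illustrative, but the hidden size of 896 matches the model:

```python
import torch
import torch.nn.functional as F

# Illustrative batch: 2 sequences, 5 token positions, hidden size 896
hidden = torch.randn(2, 5, 896)
attention_mask = torch.tensor([[1, 1, 1, 0, 0],
                               [1, 1, 1, 1, 1]])

# Zero out padding positions, then average over valid tokens only
mask = attention_mask.unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

# L2-normalize so cosine similarity reduces to a plain dot product
embeddings = F.normalize(embeddings, p=2, dim=-1)

print(embeddings.shape)  # torch.Size([2, 896])
```

Clamping the mask sum guards against division by zero for fully padded rows.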
|
|
|
|
|
## Benchmark Results (CoIR) |
|
|
|
|
|
Evaluated on the [CoIR Benchmark](https://github.com/CoIR-team/coir) (NDCG@10). Sorted by CSN-Python. |
|
|
|
|
|
| Model | Params | CSN-Python | CodeTrans-DL | Text2SQL | SO-QA | CF-ST | Apps | |
|
|
|-------|--------|------------|--------------|----------|-------|-------|------| |
|
|
| SFR-Embedding-Code | 400M | 0.9505 | 0.2683 | 0.9949 | 0.9107 | 0.7258 | 0.2212 | |
|
|
| Jina-Code-v2 | 161M | 0.9439 | 0.2739 | 0.5169 | 0.8874 | 0.6975 | 0.1538 | |
|
|
| CodeRankEmbed | 137M | 0.9378 | 0.2604 | 0.7686 | 0.8990 | 0.7166 | 0.1993 | |
|
|
| **CodeCompass-Embed** | **494M** | **0.9228** | **0.3305** | **0.5673** | **0.6480** | **0.4080** | **0.1277** | |
|
|
| Snowflake-Arctic-Embed-L | 568M | 0.9146 | 0.1958 | 0.5401 | 0.8718 | 0.6503 | 0.1435 | |
|
|
| BGE-M3 | 568M | 0.8976 | 0.2194 | 0.5728 | 0.8501 | 0.6437 | 0.1445 | |
|
|
| BGE-Base-en-v1.5 | 109M | 0.8944 | 0.2125 | 0.5265 | 0.8581 | 0.6423 | 0.1415 | |
|
|
| CodeT5+-110M | 110M | 0.8702 | 0.1794 | 0.3275 | 0.8147 | 0.5804 | 0.1179 | |
|
|
|
|
|
*CodeCompass-Embed ranks #1 on CodeTrans-DL and #4 on CSN-Python.* |
|
|
|
|
|
## Usage |
|
|
|
|
|
```python |
|
|
import torch |
|
|
import torch.nn.functional as F |
|
|
from transformers import AutoModel, AutoTokenizer |
|
|
|
|
|
model = AutoModel.from_pretrained("faisalmumtaz/codecompass-embed", trust_remote_code=True) |
|
|
tokenizer = AutoTokenizer.from_pretrained("faisalmumtaz/codecompass-embed") |
|
|
|
|
|
# Enable bidirectional attention |
|
|
for layer in model.layers: |
|
|
layer.self_attn.is_causal = False |
|
|
|
|
|
model.eval() |
|
|
|
|
|
def encode(texts, is_query=False):
    if is_query:
        texts = [f"Instruct: Find the most relevant code snippet given the following query:\nQuery: {t}" for t in texts]
|
|
|
|
|
inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt") |
|
|
|
|
|
with torch.no_grad(): |
|
|
outputs = model(**inputs, output_hidden_states=True) |
|
|
hidden = outputs.hidden_states[-1] |
|
|
mask = inputs["attention_mask"].unsqueeze(-1).float() |
|
|
embeddings = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9) |
|
|
embeddings = F.normalize(embeddings, p=2, dim=-1) |
|
|
|
|
|
return embeddings |
|
|
|
|
|
query_emb = encode(["sort a list"], is_query=True) |
|
|
code_embs = encode(["def sort(lst): return sorted(lst)"]) |
|
|
similarity = (query_emb @ code_embs.T).item() |
|
|
``` |
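Because the embeddings are L2-normalized, cosine similarity is just a dot product, so top-k retrieval over a corpus reduces to a single matrix multiply. A minimal sketch with stand-in unit vectors (in practice these come from `encode` above):

```python
import torch
import torch.nn.functional as F

# Stand-in embeddings; real ones come from the model (dim 896)
corpus = F.normalize(torch.randn(100, 896), p=2, dim=-1)
query = corpus[42:43].clone()  # make document 42 an exact match for the demo

# Dot product on unit vectors == cosine similarity
scores = query @ corpus.T                    # shape (1, 100)
top_scores, top_idx = scores.topk(k=5, dim=-1)

print(top_idx[0, 0].item())  # 42, the exact-match document
```

For large corpora the same dot-product scoring plugs directly into an approximate nearest-neighbor index.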
|
|
|
|
|
## Instruction Templates |
|
|
|
|
|
| Task | Template |
|------|----------|
| NL to Code | `Instruct: Find the most relevant code snippet given the following query:\nQuery: {q}` |
| Code to Code | `Instruct: Find an equivalent code snippet given the following code snippet:\nQuery: {q}` |
| Tech Q&A | `Instruct: Find the most relevant answer given the following question:\nQuery: {q}` |
| Text to SQL | `Instruct: Given a natural language question and schema, find the corresponding SQL query:\nQuery: {q}` |
|
|
|
|
|
Documents do not need instruction prefixes. |
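A small helper that applies these templates to queries. The task keys and the `apply_instruction` name are illustrative, not part of the model's API; the template strings mirror the table above:

```python
# Task -> instruction template, matching the table above
TEMPLATES = {
    "nl2code": "Instruct: Find the most relevant code snippet given the following query:\nQuery: {q}",
    "code2code": "Instruct: Find an equivalent code snippet given the following code snippet:\nQuery: {q}",
    "qa": "Instruct: Find the most relevant answer given the following question:\nQuery: {q}",
    "text2sql": "Instruct: Given a natural language question and schema, find the corresponding SQL query:\nQuery: {q}",
}

def apply_instruction(task: str, query: str) -> str:
    """Prefix a query with its task instruction; documents are encoded as-is."""
    return TEMPLATES[task].format(q=query)

print(apply_instruction("nl2code", "sort a list"))
```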
|
|
|
|
|
## Training |
|
|
|
|
|
- **Data**: 8.8M samples from CoRNStack, StackOverflow, CodeSearchNet |
|
|
- **Loss**: InfoNCE (τ = 0.05) with 7 hard negatives per sample
|
|
- **Batch Size**: 1024 (via GradCache) |
|
|
- **Steps**: 950 |
|
|
- **Hardware**: NVIDIA H100 |
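The InfoNCE objective scores each query against its paired document (the positive) and its hard negatives. A minimal sketch with random unit vectors, using the temperature and negative count listed above; GradCache and in-batch negatives are omitted for clarity:

```python
import torch
import torch.nn.functional as F

def info_nce(query, positive, negatives, temperature=0.05):
    """InfoNCE over one positive and K hard negatives per query.

    query:     (B, D) L2-normalized query embeddings
    positive:  (B, D) L2-normalized positive document embeddings
    negatives: (B, K, D) L2-normalized hard-negative embeddings
    """
    pos_logits = (query * positive).sum(-1, keepdim=True)      # (B, 1)
    neg_logits = torch.einsum("bd,bkd->bk", query, negatives)  # (B, K)
    logits = torch.cat([pos_logits, neg_logits], dim=1) / temperature
    labels = torch.zeros(logits.size(0), dtype=torch.long)     # positive sits at index 0
    return F.cross_entropy(logits, labels)

B, K, D = 8, 7, 896
q = F.normalize(torch.randn(B, D), dim=-1)
p = F.normalize(torch.randn(B, D), dim=-1)
n = F.normalize(torch.randn(B, K, D), dim=-1)
loss = info_nce(q, p, n)
```

Dividing by a small temperature sharpens the softmax, which pushes the model harder to separate positives from near-duplicate negatives.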
|
|
|
|
|
## Limitations |
|
|
|
|
|
- Weaker on Q&A style tasks (StackOverflow-QA, CodeFeedback) |
|
|
- Trained only on Python/JavaScript/Java/Go/PHP/Ruby; performance on other languages is untested
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex |
|
|
@misc{codecompass2026, |
|
|
author = {Faisal Mumtaz}, |
|
|
title = {CodeCompass-Embed: A Code Embedding Model for Semantic Code Search}, |
|
|
year = {2026}, |
|
|
publisher = {Hugging Face}, |
|
|
url = {https://huggingface.co/faisalmumtaz/codecompass-embed} |
|
|
} |
|
|
``` |
|
|
|
|
|
## License |
|
|
|
|
|
Apache 2.0 |
|
|
|