---
license: apache-2.0
language:
  - en
  - code
library_name: transformers
tags:
  - code
  - embeddings
  - retrieval
  - code-search
  - semantic-search
  - feature-extraction
  - sentence-transformers
datasets:
  - code-rag-bench/cornstack
  - bigcode/stackoverflow
  - code_search_net
pipeline_tag: feature-extraction
base_model: Qwen/Qwen2.5-Coder-0.5B
model-index:
  - name: CodeCompass-Embed
    results:
      - task:
          type: retrieval
          name: Code Retrieval
        dataset:
          type: CoIR-Retrieval/codetrans-dl
          name: CodeTrans-DL
        metrics:
          - type: ndcg@10
            value: 0.3305
            name: NDCG@10
      - task:
          type: retrieval
          name: Code Retrieval
        dataset:
          type: CoIR-Retrieval/CodeSearchNet-python
          name: CodeSearchNet Python
        metrics:
          - type: ndcg@10
            value: 0.9228
            name: NDCG@10
          - type: mrr@10
            value: 0.9106
            name: MRR@10
---

# CodeCompass-Embed

CodeCompass-Embed is a code embedding model fine-tuned from Qwen2.5-Coder-0.5B for semantic code search and retrieval tasks.

## Model Highlights

  • πŸ† #1 on CodeTrans-DL (code translation between frameworks)
  • πŸ₯‡ #4 on CodeSearchNet-Python (natural language to code search)
  • ⚑ 494M parameters, 896-dim embeddings
  • πŸ”„ Bidirectional attention (converted from causal LLM)
  • 🎯 Mean pooling with L2 normalization
  • πŸ“ Trained at 512 tokens, extrapolates to longer sequences via RoPE

## Model Details

| Property | Value |
|---|---|
| Base Model | Qwen2.5-Coder-0.5B |
| Parameters | 494M |
| Embedding Dimension | 896 |
| Max Sequence Length | 512 (training) / 32K (inference) |
| Pooling | Mean |
| Normalization | L2 |
| Attention | Bidirectional (all 24 layers) |

## Benchmark Results (CoIR)

Evaluated on the CoIR benchmark (NDCG@10), sorted by CSN-Python score. SO-QA = StackOverflow-QA, CF-ST = CodeFeedback-ST.

| Model | Params | CSN-Python | CodeTrans-DL | Text2SQL | SO-QA | CF-ST | Apps |
|---|---|---|---|---|---|---|---|
| SFR-Embedding-Code | 400M | 0.9505 | 0.2683 | 0.9949 | 0.9107 | 0.7258 | 0.2212 |
| Jina-Code-v2 | 161M | 0.9439 | 0.2739 | 0.5169 | 0.8874 | 0.6975 | 0.1538 |
| CodeRankEmbed | 137M | 0.9378 | 0.2604 | 0.7686 | 0.8990 | 0.7166 | 0.1993 |
| **CodeCompass-Embed** | 494M | 0.9228 | **0.3305** | 0.5673 | 0.6480 | 0.4080 | 0.1277 |
| Snowflake-Arctic-Embed-L | 568M | 0.9146 | 0.1958 | 0.5401 | 0.8718 | 0.6503 | 0.1435 |
| BGE-M3 | 568M | 0.8976 | 0.2194 | 0.5728 | 0.8501 | 0.6437 | 0.1445 |
| BGE-Base-en-v1.5 | 109M | 0.8944 | 0.2125 | 0.5265 | 0.8581 | 0.6423 | 0.1415 |
| CodeT5+-110M | 110M | 0.8702 | 0.1794 | 0.3275 | 0.8147 | 0.5804 | 0.1179 |

Among the models compared above, CodeCompass-Embed ranks #1 on CodeTrans-DL and #4 on CSN-Python.

## Usage

### With Transformers

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Load model
model = AutoModel.from_pretrained("faisalmumtaz/codecompass-embed", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("faisalmumtaz/codecompass-embed")

# CRITICAL: Enable bidirectional attention for embeddings
for layer in model.layers:
    layer.self_attn.is_causal = False

model.eval()

def encode(texts, is_query=False):
    # Add instruction prefix for queries
    if is_query:
        texts = [f"Instruct: Find the most relevant code snippet given the following query:\nQuery: {t}" for t in texts]
    
    inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
        hidden = outputs.hidden_states[-1]
        
        # Mean pooling
        mask = inputs["attention_mask"].unsqueeze(-1).float()
        embeddings = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
        
        # L2 normalize
        embeddings = F.normalize(embeddings, p=2, dim=-1)
    
    return embeddings

# Example: Code Search
query = "How to sort a list in Python"
code_snippets = [
    "def sort_list(lst):\n    return sorted(lst)",
    "def add_numbers(a, b):\n    return a + b",
    "def reverse_string(s):\n    return s[::-1]",
]

query_emb = encode([query], is_query=True)
code_embs = encode(code_snippets, is_query=False)

# Compute similarities
similarities = (query_emb @ code_embs.T).squeeze()
print(f"Query: {query}")
for code, sim in zip(code_snippets, similarities):
    print(f"  [{sim:.4f}] {code[:50]}...")
```

### Instruction Templates

For optimal performance, use these instruction prefixes for queries:

| Task | Instruction Template |
|---|---|
| NL → Code | `Instruct: Find the most relevant code snippet given the following query:\nQuery: {query}` |
| Code → Code | `Instruct: Find an equivalent code snippet given the following code snippet:\nQuery: {query}` |
| Tech Q&A | `Instruct: Find the most relevant answer given the following question:\nQuery: {query}` |
| Text → SQL | `Instruct: Given a natural language question and schema, find the corresponding SQL query:\nQuery: {query}` |

Note: Document/corpus texts do NOT need instruction prefixes.
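
As an illustration, a Code → Code query takes the matching prefix on the query side only, while candidate snippets are encoded without any instruction. The snippet below reuses the `encode` helper from the Usage section; the variable names and example code are illustrative.

```python
# Illustrative: apply the Code -> Code template to the query side only.
# Reuses the encode() helper from the Usage section; candidates get no prefix.
CODE2CODE = "Instruct: Find an equivalent code snippet given the following code snippet:\nQuery: "

query_code = "def mean(xs):\n    return sum(xs) / len(xs)"
candidates = [
    "def average(values):\n    return sum(values) / len(values)",
    "def median(xs):\n    return sorted(xs)[len(xs) // 2]",
]

query_emb = encode([CODE2CODE + query_code])  # is_query=False: prefix added manually
cand_embs = encode(candidates)

scores = (query_emb @ cand_embs.T).squeeze(0)
print(scores)  # the paraphrased average() implementation should score highest
```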

## Training Details

- Base Model: Qwen2.5-Coder-0.5B
- Training Data: 8.8M samples from CoRNStack, StackOverflow, CodeSearchNet
- Architecture Modification: Converted all 24 attention layers from causal to bidirectional
- Pooling: Mean pooling
- Loss: InfoNCE with temperature τ = 0.05 (see the sketch below)
- Hard Negatives: 7 per sample (embedding-mined)
- Effective Batch Size: 1024 (via GradCache)
- Training Steps: 950
- Hardware: NVIDIA H100
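
For reference, below is a minimal sketch of an InfoNCE objective with in-batch positives and hard negatives at τ = 0.05, assuming the query, positive, and negative embeddings are already L2-normalized. The function and tensor names are illustrative and not the actual training code.

```python
import torch
import torch.nn.functional as F

def info_nce(q, pos, negs, temperature=0.05):
    """Illustrative InfoNCE loss with in-batch and hard negatives.

    q:    (B, D)    L2-normalized query embeddings
    pos:  (B, D)    L2-normalized positive document embeddings
    negs: (B, K, D) L2-normalized hard-negative embeddings (e.g. K = 7)
    """
    B = q.size(0)
    # Each query scored against every in-batch positive: (B, B)
    in_batch = q @ pos.T
    # Each query scored against its own hard negatives: (B, K)
    hard = torch.einsum("bd,bkd->bk", q, negs)
    # Row i's correct class is its own positive at column i
    logits = torch.cat([in_batch, hard], dim=1) / temperature
    labels = torch.arange(B, device=q.device)
    return F.cross_entropy(logits, labels)
```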

## Limitations

- Weaker on Q&A-style tasks (StackOverflow-QA, CodeFeedback)
- Trained primarily on Python/JavaScript/Java/Go/PHP/Ruby
- May not generalize well to low-resource programming languages

## Citation

```bibtex
@misc{codecompass2026,
  author = {Faisal Mumtaz},
  title = {CodeCompass-Embed: A Code Embedding Model for Semantic Code Search},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/faisalmumtaz/codecompass-embed}
}
```

## License

Apache 2.0