---
license: apache-2.0
language:
  - en
  - code
library_name: transformers
tags:
  - code
  - embeddings
  - retrieval
  - code-search
  - semantic-search
  - feature-extraction
  - sentence-transformers
datasets:
  - code-rag-bench/cornstack
  - bigcode/stackoverflow
  - code_search_net
pipeline_tag: feature-extraction
base_model: Qwen/Qwen2.5-Coder-0.5B
model-index:
  - name: CodeCompass-Embed
    results:
      - task:
          type: retrieval
          name: Code Retrieval
        dataset:
          type: CoIR-Retrieval/codetrans-dl
          name: CodeTrans-DL
        metrics:
          - type: ndcg@10
            value: 0.3305
            name: NDCG@10
      - task:
          type: retrieval
          name: Code Retrieval
        dataset:
          type: CoIR-Retrieval/CodeSearchNet-python
          name: CodeSearchNet Python
        metrics:
          - type: ndcg@10
            value: 0.9228
            name: NDCG@10
          - type: mrr@10
            value: 0.9106
            name: MRR@10
---

# CodeCompass-Embed

CodeCompass-Embed is a code embedding model fine-tuned from Qwen2.5-Coder-0.5B for semantic code search and retrieval tasks.

## Model Highlights

  • πŸ† #1 on CodeTrans-DL (code translation between frameworks)
  • πŸ₯‡ #4 on CodeSearchNet-Python (natural language to code search)
  • ⚑ 494M parameters, 896-dim embeddings
  • πŸ”„ Bidirectional attention (converted from causal LLM)
  • 🎯 Mean pooling with L2 normalization
  • πŸ“ Trained at 512 tokens, extrapolates to longer sequences via RoPE

## Model Details

| Property | Value |
|---|---|
| Base Model | Qwen2.5-Coder-0.5B |
| Parameters | 494M |
| Embedding Dimension | 896 |
| Max Sequence Length | 512 (training) / 32K (inference) |
| Pooling | Mean |
| Normalization | L2 |
| Attention | Bidirectional (all 24 layers) |

## Benchmark Results (CoIR)

Evaluated on the CoIR benchmark (NDCG@10), sorted by CSN-Python score. SO-QA = StackOverflow-QA, CF-ST = CodeFeedback-ST.

| Model | Params | CSN-Python | CodeTrans-DL | Text2SQL | SO-QA | CF-ST | Apps |
|---|---|---|---|---|---|---|---|
| SFR-Embedding-Code | 400M | 0.9505 | 0.2683 | 0.9949 | 0.9107 | 0.7258 | 0.2212 |
| Jina-Code-v2 | 161M | 0.9439 | 0.2739 | 0.5169 | 0.8874 | 0.6975 | 0.1538 |
| CodeRankEmbed | 137M | 0.9378 | 0.2604 | 0.7686 | 0.8990 | 0.7166 | 0.1993 |
| **CodeCompass-Embed** | 494M | 0.9228 | **0.3305** | 0.5673 | 0.6480 | 0.4080 | 0.1277 |
| Snowflake-Arctic-Embed-L | 568M | 0.9146 | 0.1958 | 0.5401 | 0.8718 | 0.6503 | 0.1435 |
| BGE-M3 | 568M | 0.8976 | 0.2194 | 0.5728 | 0.8501 | 0.6437 | 0.1445 |
| BGE-Base-en-v1.5 | 109M | 0.8944 | 0.2125 | 0.5265 | 0.8581 | 0.6423 | 0.1415 |
| CodeT5+-110M | 110M | 0.8702 | 0.1794 | 0.3275 | 0.8147 | 0.5804 | 0.1179 |

Among the models compared above, CodeCompass-Embed ranks #1 on CodeTrans-DL and #4 on CSN-Python.

## Usage

### With Transformers

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Load model
model = AutoModel.from_pretrained("faisalmumtaz/codecompass-embed", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("faisalmumtaz/codecompass-embed")

# CRITICAL: Enable bidirectional attention for embeddings
for layer in model.layers:
    layer.self_attn.is_causal = False

model.eval()

def encode(texts, is_query=False):
    # Add instruction prefix for queries
    if is_query:
        texts = [f"Instruct: Find the most relevant code snippet given the following query:\nQuery: {t}" for t in texts]
    
    inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
        hidden = outputs.hidden_states[-1]
        
        # Mean pooling
        mask = inputs["attention_mask"].unsqueeze(-1).float()
        embeddings = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
        
        # L2 normalize
        embeddings = F.normalize(embeddings, p=2, dim=-1)
    
    return embeddings

# Example: Code Search
query = "How to sort a list in Python"
code_snippets = [
    "def sort_list(lst):\n    return sorted(lst)",
    "def add_numbers(a, b):\n    return a + b",
    "def reverse_string(s):\n    return s[::-1]",
]

query_emb = encode([query], is_query=True)
code_embs = encode(code_snippets, is_query=False)

# Compute similarities
similarities = (query_emb @ code_embs.T).squeeze()
print(f"Query: {query}")
for code, sim in zip(code_snippets, similarities):
    print(f"  [{sim:.4f}] {code[:50]}...")
```

### Instruction Templates

For optimal performance, use these instruction prefixes for queries:

| Task | Instruction Template |
|---|---|
| NL → Code | `Instruct: Find the most relevant code snippet given the following query:\nQuery: {query}` |
| Code → Code | `Instruct: Find an equivalent code snippet given the following code snippet:\nQuery: {query}` |
| Tech Q&A | `Instruct: Find the most relevant answer given the following question:\nQuery: {query}` |
| Text → SQL | `Instruct: Given a natural language question and schema, find the corresponding SQL query:\nQuery: {query}` |

Note: Document/corpus texts do NOT need instruction prefixes.
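
As an illustration, a Code → Code query takes the matching prefix on the query side only, while candidate snippets are encoded without any instruction. The snippet below reuses the `encode` helper from the Usage section; the variable names and example code are illustrative.

```python
# Illustrative: apply the Code -> Code template to the query side only.
# Reuses the encode() helper from the Usage section; candidates get no prefix.
CODE2CODE = "Instruct: Find an equivalent code snippet given the following code snippet:\nQuery: "

query_code = "def mean(xs):\n    return sum(xs) / len(xs)"
candidates = [
    "def average(values):\n    return sum(values) / len(values)",
    "def median(xs):\n    return sorted(xs)[len(xs) // 2]",
]

query_emb = encode([CODE2CODE + query_code])  # is_query=False: prefix added manually
cand_embs = encode(candidates)

scores = (query_emb @ cand_embs.T).squeeze(0)
print(scores)  # the paraphrased average() implementation should score highest
```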

## Training Details

- Base Model: Qwen2.5-Coder-0.5B
- Training Data: 8.8M samples from CoRNStack, StackOverflow, CodeSearchNet
- Architecture Modification: Converted all 24 attention layers from causal to bidirectional
- Pooling: Mean pooling
- Loss: InfoNCE with temperature τ = 0.05 (see the sketch below)
- Hard Negatives: 7 per sample (embedding-mined)
- Effective Batch Size: 1024 (via GradCache)
- Training Steps: 950
- Hardware: NVIDIA H100
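
For reference, below is a minimal sketch of an InfoNCE objective with in-batch positives and hard negatives at τ = 0.05, assuming the query, positive, and negative embeddings are already L2-normalized. The function and tensor names are illustrative and not the actual training code.

```python
import torch
import torch.nn.functional as F

def info_nce(q, pos, negs, temperature=0.05):
    """Illustrative InfoNCE loss with in-batch and hard negatives.

    q:    (B, D)    L2-normalized query embeddings
    pos:  (B, D)    L2-normalized positive document embeddings
    negs: (B, K, D) L2-normalized hard-negative embeddings (e.g. K = 7)
    """
    B = q.size(0)
    # Each query scored against every in-batch positive: (B, B)
    in_batch = q @ pos.T
    # Each query scored against its own hard negatives: (B, K)
    hard = torch.einsum("bd,bkd->bk", q, negs)
    # Row i's correct class is its own positive at column i
    logits = torch.cat([in_batch, hard], dim=1) / temperature
    labels = torch.arange(B, device=q.device)
    return F.cross_entropy(logits, labels)
```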

## Limitations

- Weaker on Q&A-style tasks (StackOverflow-QA, CodeFeedback)
- Trained primarily on Python/JavaScript/Java/Go/PHP/Ruby
- May not generalize well to low-resource programming languages

## Citation

```bibtex
@misc{codecompass2026,
  author = {Faisal Mumtaz},
  title = {CodeCompass-Embed: A Code Embedding Model for Semantic Code Search},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/faisalmumtaz/codecompass-embed}
}
```

## License

Apache 2.0