---
license: apache-2.0
language:
- en
- code
library_name: transformers
tags:
- code
- embeddings
- retrieval
- code-search
- semantic-search
- feature-extraction
- sentence-transformers
datasets:
- code-rag-bench/cornstack
- bigcode/stackoverflow
- code_search_net
pipeline_tag: feature-extraction
base_model: Qwen/Qwen2.5-Coder-0.5B
model-index:
- name: CodeCompass-Embed
  results:
  - task:
      type: retrieval
      name: Code Retrieval
    dataset:
      type: CoIR-Retrieval/codetrans-dl
      name: CodeTrans-DL
    metrics:
    - type: ndcg@10
      value: 0.3305
      name: NDCG@10
  - task:
      type: retrieval
      name: Code Retrieval
    dataset:
      type: CoIR-Retrieval/CodeSearchNet-python
      name: CodeSearchNet Python
    metrics:
    - type: ndcg@10
      value: 0.9228
      name: NDCG@10
---

# CodeCompass-Embed

**CodeCompass-Embed** is a code embedding model fine-tuned from [Qwen2.5-Coder-0.5B](https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B) for semantic code search and retrieval tasks.

## Model Highlights

- 🏆 #1 on CodeTrans-DL (code translation between frameworks)
- 🥇 #4 on CodeSearchNet-Python (natural language to code search)
- ⚡ 494M parameters, 896-dim embeddings
- 🔄 Bidirectional attention (converted from causal LLM)
- 🎯 Mean pooling with L2 normalization
- 📏 Trained at 512 tokens, extrapolates to longer sequences via RoPE

## Model Details

| Property | Value |
|----------|-------|
| Base Model | Qwen2.5-Coder-0.5B |
| Parameters | 494M |
| Embedding Dimension | 896 |
| Max Sequence Length | 512 (training) / 32K (inference) |
| Pooling | Mean |
| Normalization | L2 |
| Attention | Bidirectional (all 24 layers) |

## Benchmark Results (CoIR)

Evaluated on the [CoIR Benchmark](https://github.com/CoIR-team/coir) (NDCG@10). Sorted by CSN-Python.
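All scores below are NDCG@10. As a quick reference for how that metric is computed, here is a minimal pure-Python sketch (illustrative only, not the CoIR benchmark's own implementation; function names are ours):

```python
import math

def dcg(rels):
    """Discounted cumulative gain of a ranked list of graded relevances."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg_at_k(retrieved_rels, all_rels, k=10):
    """retrieved_rels: relevance of each retrieved item, in rank order.
    all_rels: relevance of every judged item (used to build the ideal ranking)."""
    ideal = sorted(all_rels, reverse=True)[:k]
    idcg = dcg(ideal)
    return dcg(retrieved_rels[:k]) / idcg if idcg > 0 else 0.0

# A single relevant snippet retrieved at rank 1 scores 1.0;
# the same snippet at rank 2 scores 1/log2(3) ≈ 0.63.
print(ndcg_at_k([1, 0, 0], [1]))  # → 1.0
```

The log-discount means retrieving the right snippet near the top of the list matters far more than retrieving it at all, which is why NDCG@10 is the standard metric for code search.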
| Model | Params | CSN-Python | CodeTrans-DL | Text2SQL | SO-QA | CF-ST | Apps |
|-------|--------|------------|--------------|----------|-------|-------|------|
| SFR-Embedding-Code | 400M | 0.9505 | 0.2683 | 0.9949 | 0.9107 | 0.7258 | 0.2212 |
| Jina-Code-v2 | 161M | 0.9439 | 0.2739 | 0.5169 | 0.8874 | 0.6975 | 0.1538 |
| CodeRankEmbed | 137M | 0.9378 | 0.2604 | 0.7686 | 0.8990 | 0.7166 | 0.1993 |
| **CodeCompass-Embed** | **494M** | **0.9228** | **0.3305** | **0.5673** | **0.6480** | **0.4080** | **0.1277** |
| Snowflake-Arctic-Embed-L | 568M | 0.9146 | 0.1958 | 0.5401 | 0.8718 | 0.6503 | 0.1435 |
| BGE-M3 | 568M | 0.8976 | 0.2194 | 0.5728 | 0.8501 | 0.6437 | 0.1445 |
| BGE-Base-en-v1.5 | 109M | 0.8944 | 0.2125 | 0.5265 | 0.8581 | 0.6423 | 0.1415 |
| CodeT5+-110M | 110M | 0.8702 | 0.1794 | 0.3275 | 0.8147 | 0.5804 | 0.1179 |

*CodeCompass-Embed ranks #1 on CodeTrans-DL and #4 on CSN-Python.*

## Usage

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("faisalmumtaz/codecompass-embed", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("faisalmumtaz/codecompass-embed")

# Enable bidirectional attention
for layer in model.layers:
    layer.self_attn.is_causal = False

model.eval()

def encode(texts, is_query=False):
    if is_query:
        texts = [f"Instruct: Find the most relevant code snippet given the following query: Query: {t}" for t in texts]
    inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    hidden = outputs.hidden_states[-1]
    # Mean pooling over non-padding tokens, then L2 normalization
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    embeddings = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
    embeddings = F.normalize(embeddings, p=2, dim=-1)
    return embeddings

query_emb = encode(["sort a list"], is_query=True)
code_embs = encode(["def sort(lst): return sorted(lst)"])
similarity = (query_emb @
code_embs.T).item()
```

## Instruction Templates

| Task | Template |
|------|----------|
| NL to Code | `Instruct: Find the most relevant code snippet given the following query: Query: {q}` |
| Code to Code | `Instruct: Find an equivalent code snippet given the following code snippet: Query: {q}` |
| Tech Q&A | `Instruct: Find the most relevant answer given the following question: Query: {q}` |
| Text to SQL | `Instruct: Given a natural language question and schema, find the corresponding SQL query: Query: {q}` |

Documents do not need instruction prefixes.

## Training

- **Data**: 8.8M samples from CoRNStack, StackOverflow, CodeSearchNet
- **Loss**: InfoNCE (τ=0.05) with 7 hard negatives per sample
- **Batch Size**: 1024 (via GradCache)
- **Steps**: 950
- **Hardware**: NVIDIA H100

## Limitations

- Weaker on Q&A-style tasks (StackOverflow-QA, CodeFeedback)
- Trained on Python/JavaScript/Java/Go/PHP/Ruby; retrieval quality on other languages is untested

## Citation

```bibtex
@misc{codecompass2026,
  author = {Faisal Mumtaz},
  title = {CodeCompass-Embed: A Code Embedding Model for Semantic Code Search},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/faisalmumtaz/codecompass-embed}
}
```

## License

Apache 2.0
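As a worked example of the per-task instruction templates, a small query-formatting helper can be sketched as follows. The template strings are copied verbatim from the Instruction Templates table; the dictionary keys and the `format_query` function are illustrative names of our own, not part of the model's API:

```python
# Task keys here are illustrative; template strings come from the model card.
TEMPLATES = {
    "nl2code": "Instruct: Find the most relevant code snippet given the following query: Query: {q}",
    "code2code": "Instruct: Find an equivalent code snippet given the following code snippet: Query: {q}",
    "qa": "Instruct: Find the most relevant answer given the following question: Query: {q}",
    "text2sql": "Instruct: Given a natural language question and schema, find the corresponding SQL query: Query: {q}",
}

def format_query(task: str, query: str) -> str:
    """Prepend the task-specific instruction to a query.
    Documents are encoded as-is, without any prefix."""
    return TEMPLATES[task].format(q=query)

print(format_query("nl2code", "sort a list"))
# → Instruct: Find the most relevant code snippet given the following query: Query: sort a list
```

Only queries receive an instruction prefix; applying the same helper to documents would hurt retrieval, since the model was trained with bare documents.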