faisalmumtaz committed on
Commit
1ccb9d7
·
verified ·
1 Parent(s): 72021f1

Upload CodeCompass-Embed v2 — #1 on CSN-Python (NDCG@10=0.979), 12-task CoIR eval

Files changed (4)
  1. README.md +111 -59
  2. config.json +7 -7
  3. model.safetensors +1 -1
  4. tokenizer_config.json +7 -0
README.md CHANGED
@@ -25,21 +25,24 @@ model-index:
        type: retrieval
        name: Code Retrieval
      dataset:
-       type: CoIR-Retrieval/codetrans-dl
-       name: CodeTrans-DL
      metrics:
      - type: ndcg@10
-       value: 0.3305
        name: NDCG@10
    - task:
        type: retrieval
-       name: Code Retrieval
      dataset:
-       type: CoIR-Retrieval/CodeSearchNet-python
-       name: CodeSearchNet Python
      metrics:
      - type: ndcg@10
-       value: 0.9228
        name: NDCG@10
  ---

@@ -49,12 +52,13 @@ model-index:

  ## Model Highlights

- - 🏆 #1 on CodeTrans-DL (code translation between frameworks)
- - 🥇 #4 on CodeSearchNet-Python (natural language to code search)
- - ⚡ 494M parameters, 896-dim embeddings
- - 🔄 Bidirectional attention (converted from causal LLM)
- - 🎯 Mean pooling with L2 normalization
  - 📏 Trained at 512 tokens, extrapolates to longer sequences via RoPE

  ## Model Details

@@ -70,96 +74,144 @@ model-index:

  ## Benchmark Results (CoIR)

- Evaluated on the [CoIR Benchmark](https://github.com/CoIR-team/coir) (NDCG@10). Sorted by CSN-Python.
-
- | Model | Params | CSN-Python | CodeTrans-DL | Text2SQL | SO-QA | CF-ST | Apps |
- |-------|--------|------------|--------------|----------|-------|-------|------|
- | SFR-Embedding-Code | 400M | 0.9505 | 0.2683 | 0.9949 | 0.9107 | 0.7258 | 0.2212 |
- | Jina-Code-v2 | 161M | 0.9439 | 0.2739 | 0.5169 | 0.8874 | 0.6975 | 0.1538 |
- | CodeRankEmbed | 137M | 0.9378 | 0.2604 | 0.7686 | 0.8990 | 0.7166 | 0.1993 |
- | **CodeCompass-Embed** | **494M** | **0.9228** | **0.3305** | **0.5673** | **0.6480** | **0.4080** | **0.1277** |
- | Snowflake-Arctic-Embed-L | 568M | 0.9146 | 0.1958 | 0.5401 | 0.8718 | 0.6503 | 0.1435 |
- | BGE-M3 | 568M | 0.8976 | 0.2194 | 0.5728 | 0.8501 | 0.6437 | 0.1445 |
- | BGE-Base-en-v1.5 | 109M | 0.8944 | 0.2125 | 0.5265 | 0.8581 | 0.6423 | 0.1415 |
- | CodeT5+-110M | 110M | 0.8702 | 0.1794 | 0.3275 | 0.8147 | 0.5804 | 0.1179 |
-
- *CodeCompass-Embed ranks #1 on CodeTrans-DL and #4 on CSN-Python.*

  ## Usage
  ```python
  import torch
  import torch.nn.functional as F
  from transformers import AutoModel, AutoTokenizer

  model = AutoModel.from_pretrained("faisalmumtaz/codecompass-embed", trust_remote_code=True)
  tokenizer = AutoTokenizer.from_pretrained("faisalmumtaz/codecompass-embed")

- # Enable bidirectional attention
- for layer in model.layers:
      layer.self_attn.is_causal = False

  model.eval()

  def encode(texts, is_query=False):
      if is_query:
-         texts = [f"Instruct: Find the most relevant code snippet given the following query:\nQuery: {t}" for t in texts]

      inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")

      with torch.no_grad():
          outputs = model(**inputs, output_hidden_states=True)
          hidden = outputs.hidden_states[-1]

      mask = inputs["attention_mask"].unsqueeze(-1).float()
      embeddings = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
      embeddings = F.normalize(embeddings, p=2, dim=-1)

      return embeddings

- query_emb = encode(["sort a list"], is_query=True)
- code_embs = encode(["def sort(lst): return sorted(lst)"])
- similarity = (query_emb @ code_embs.T).item()
  ```

  ## Instruction Templates

- | Task | Template |
- |------|----------|
- | NL to Code | `Instruct: Find the most relevant code snippet given the following query:\nQuery: {q}` |
- | Code to Code | `Instruct: Find an equivalent code snippet given the following code snippet:\nQuery: {q}` |
- | Tech Q&A | `Instruct: Find the most relevant answer given the following question:\nQuery: {q}` |
- | Text to SQL | `Instruct: Given a natural language question and schema, find the corresponding SQL query:\nQuery: {q}` |

- Documents do not need instruction prefixes.

- ## Training

- - **Data**: 8.8M samples from CoRNStack, StackOverflow, CodeSearchNet
- - **Loss**: InfoNCE (τ=0.05) with 7 hard negatives per sample
- - **Batch Size**: 1024 (via GradCache)
- - **Steps**: 950
- - **Hardware**: NVIDIA H100

  ## Limitations

- - Weaker on Q&A style tasks (StackOverflow-QA, CodeFeedback)
- - Trained on Python/JavaScript/Java/Go/PHP/Ruby


  ## Citation

  ```bibtex
- @misc{codecompass2026,
-   author = {Faisal Mumtaz},
-   title = {CodeCompass-Embed: A Code Embedding Model for Semantic Code Search},
-   year = {2026},
-   publisher = {Hugging Face},
-   url = {https://huggingface.co/faisalmumtaz/codecompass-embed}
- }
  ```

  ## License

        type: retrieval
        name: Code Retrieval
      dataset:
+       type: CoIR-Retrieval/CodeSearchNet-python
+       name: CodeSearchNet Python
      metrics:
      - type: ndcg@10
+       value: 0.979
        name: NDCG@10
+     - type: mrr@10
+       value: 0.976
+       name: MRR@10
    - task:
        type: retrieval
+       name: Code Translation
      dataset:
+       type: CoIR-Retrieval/codetrans-dl
+       name: CodeTrans-DL
      metrics:
      - type: ndcg@10
+       value: 0.286
        name: NDCG@10
  ---


  ## Model Highlights

+ - 🏆 **#1 on CodeSearchNet-Python**: NDCG@10 = 0.979, vs. 0.951 for the next-best model (SFR-Embedding-Code)
+ - 🥇 **#1 on CodeTrans-DL**: code translation between deep learning frameworks
+ - ⚡ **494M parameters**, 896-dim embeddings — runs on consumer GPUs
+ - 🔄 **Bidirectional attention** (converted from causal LLM)
+ - 🎯 **Mean pooling** with L2 normalization
  - 📏 Trained at 512 tokens, extrapolates to longer sequences via RoPE
+ - 🌐 **Multi-language**: Python, Java, JavaScript, Go, Ruby, PHP

  ## Model Details


  ## Benchmark Results (CoIR)

+ Evaluated on the [CoIR Benchmark](https://github.com/CoIR-team/coir) (ACL 2025). All scores are NDCG@10, sorted by CSN-Python.
+
+ | Model | Params | CSN-Py | CodeTrans | Text2SQL | SO-QA | CodeFeedback | Apps |
+ |-------|--------|--------|-----------|----------|-------|--------------|------|
+ | **CodeCompass-Embed (ours)** | **494M** | **0.979** 🏆 | **0.286** 🏆 | **0.736** | **0.834** | **0.814** | **0.349** |
+ | SFR-Embedding-Code | 400M | 0.951 | 0.268 | 0.995 | 0.911 | 0.726 | 0.221 |
+ | Jina-Code-v2 | 161M | 0.944 | 0.274 | 0.517 | 0.887 | 0.698 | 0.154 |
+ | CodeRankEmbed | 137M | 0.938 | 0.260 | 0.769 | 0.899 | 0.717 | 0.199 |
+ | Snowflake-Arctic-Embed-L | 568M | 0.915 | 0.196 | 0.540 | 0.872 | 0.650 | 0.144 |
+ | BGE-M3 | 568M | 0.898 | 0.219 | 0.573 | 0.850 | 0.644 | 0.145 |
+ | BGE-Base-en-v1.5 | 109M | 0.894 | 0.213 | 0.527 | 0.858 | 0.642 | 0.142 |
+ | CodeT5+-110M | 110M | 0.870 | 0.179 | 0.328 | 0.815 | 0.580 | 0.118 |
+
+ ### Multi-Language Code Search (CodeSearchNet)
+
+ | Language | NDCG@10 | MRR@10 |
+ |----------|---------|--------|
+ | **Python** | **0.979** | **0.976** |
+ | Go | 0.797 | 0.767 |
+ | Java | 0.639 | 0.600 |
+ | PHP | 0.627 | 0.585 |
+ | JavaScript | 0.621 | 0.578 |
+ | Ruby | 0.579 | 0.535 |
+
+ ### Full Results (All 12 Tasks)
+
+ | Task | NDCG@10 | MRR@10 |
+ |------|---------|--------|
+ | **codesearchnet-python** | **0.979** 🏆 | **0.976** |
+ | stackoverflow-qa | 0.834 | 0.810 |
+ | codefeedback-st | 0.814 | 0.775 |
+ | codesearchnet-go | 0.797 | 0.767 |
+ | synthetic-text2sql | 0.736 | 0.662 |
+ | codesearchnet-java | 0.639 | 0.600 |
+ | codesearchnet-php | 0.627 | 0.585 |
+ | codesearchnet-javascript | 0.621 | 0.578 |
+ | codesearchnet-ruby | 0.579 | 0.535 |
+ | apps | 0.349 | 0.307 |
+ | codetrans-dl | 0.286 🏆 | 0.164 |
+ | cosqa | 0.209 | 0.165 |
+ | **Average (12 tasks)** | **0.623** | **0.577** |
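For reference, the NDCG@10 metric reported in these tables can be computed as in the following sketch. This is a simplified version that derives the ideal DCG from the same scored list, i.e. it assumes every relevant document appears among the scored results; the function name is illustrative.

```python
import math

def ndcg_at_10(rels):
    """NDCG@10 for a list of relevance grades in rank order.

    Simplified sketch: ideal DCG is computed from the same list,
    so it assumes all relevant documents appear in `rels`.
    """
    def dcg(scores):
        return sum(r / math.log2(i + 2) for i, r in enumerate(scores[:10]))

    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

# A perfect ranking scores 1.0; the single relevant doc at rank 3 scores 0.5.
print(ndcg_at_10([1, 0, 0, 0, 0, 0, 0, 0, 0, 0]))  # 1.0
print(ndcg_at_10([0, 0, 1, 0, 0, 0, 0, 0, 0, 0]))  # 0.5
```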
 
  ## Usage

+ ### With Transformers
+
  ```python
  import torch
  import torch.nn.functional as F
  from transformers import AutoModel, AutoTokenizer

+ # Load model
  model = AutoModel.from_pretrained("faisalmumtaz/codecompass-embed", trust_remote_code=True)
  tokenizer = AutoTokenizer.from_pretrained("faisalmumtaz/codecompass-embed")

+ # CRITICAL: Enable bidirectional attention for embeddings
+ for layer in model.model.layers:
      layer.self_attn.is_causal = False

  model.eval()

  def encode(texts, is_query=False):
+     # Add instruction prefix for queries
      if is_query:
+         texts = [f"Instruct: Find the most relevant code snippet given the following query:\nQuery: {t}" for t in texts]

      inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")

      with torch.no_grad():
          outputs = model(**inputs, output_hidden_states=True)
          hidden = outputs.hidden_states[-1]

+     # Mean pooling
      mask = inputs["attention_mask"].unsqueeze(-1).float()
      embeddings = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

+     # L2 normalize
      embeddings = F.normalize(embeddings, p=2, dim=-1)

      return embeddings

+ # Example: Code Search
+ query = "How to sort a list in Python"
+ code_snippets = [
+     "def sort_list(lst):\n    return sorted(lst)",
+     "def add_numbers(a, b):\n    return a + b",
+     "def reverse_string(s):\n    return s[::-1]",
+ ]
+
+ query_emb = encode([query], is_query=True)
+ code_embs = encode(code_snippets, is_query=False)
+
+ # Compute similarities (cosine, since embeddings are L2-normalized)
+ similarities = (query_emb @ code_embs.T).squeeze()
+ print(f"Query: {query}")
+ for code, sim in zip(code_snippets, similarities):
+     print(f"  [{sim:.4f}]  {code[:50]}...")
  ```
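The tokenizer config in this commit sets `padding_side` to `left`. With attention-mask-weighted mean pooling as in the usage example above, the padding side does not change the pooled embedding, because pad positions are zeroed out before averaging. A toy check, with random tensors standing in for hidden states (this does not exercise the actual model):

```python
import torch

torch.manual_seed(0)
tokens = torch.randn(3, 8)   # hidden states of 3 real tokens
pad = torch.zeros(1, 8)      # stand-in for one pad position

right = torch.cat([tokens, pad]).unsqueeze(0)  # [t1, t2, t3, PAD]
left = torch.cat([pad, tokens]).unsqueeze(0)   # [PAD, t1, t2, t3]
mask_r = torch.tensor([[1., 1., 1., 0.]]).unsqueeze(-1)
mask_l = torch.tensor([[0., 1., 1., 1.]]).unsqueeze(-1)

def mean_pool(hidden, mask):
    # Same masked mean as in the usage example
    return (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

assert torch.allclose(mean_pool(right, mask_r), mean_pool(left, mask_l))
```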
 
  ## Instruction Templates

+ For optimal performance, use these instruction prefixes for queries:
+
+ | Task | Instruction Template |
+ |------|---------------------|
+ | NL → Code | `Instruct: Find the most relevant code snippet given the following query:\nQuery: {query}` |
+ | Code → Code | `Instruct: Find an equivalent code snippet given the following code snippet:\nQuery: {query}` |
+ | Tech Q&A | `Instruct: Find the most relevant answer given the following question:\nQuery: {query}` |
+ | Text → SQL | `Instruct: Given a natural language question and schema, find the corresponding SQL query:\nQuery: {query}` |
+
+ **Note**: Document/corpus texts do NOT need instruction prefixes.
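The four templates above can be wrapped in a small helper. A minimal sketch, where the `TEMPLATES` dict keys and the `format_query` name are illustrative, not part of the released model API:

```python
# Illustrative helper around the documented query prefixes; the dict keys
# and function name are made up for this sketch, not part of the model API.
TEMPLATES = {
    "nl2code": "Instruct: Find the most relevant code snippet given the following query:",
    "code2code": "Instruct: Find an equivalent code snippet given the following code snippet:",
    "qa": "Instruct: Find the most relevant answer given the following question:",
    "text2sql": "Instruct: Given a natural language question and schema, find the corresponding SQL query:",
}

def format_query(text: str, task: str = "nl2code") -> str:
    # Documents/corpus texts are embedded as-is, without a prefix.
    return f"{TEMPLATES[task]}\nQuery: {text}"

print(format_query("sort a list"))
```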
+ ## Training Details

+ - **Base Model**: Qwen2.5-Coder-0.5B (continued fine-tuning from a previous CodeCompass checkpoint)
+ - **Training Data**: 100K GPT-filtered gold-standard samples from CoRNStack, StackOverflow, and CodeSearchNet, plus hard negatives
+ - **Architecture**: Bidirectional attention across all 24 layers, mean pooling, L2 normalization
+ - **Loss**: InfoNCE with temperature τ = 0.05
+ - **Hard Negatives**: Up to 8 per sample (GPT-validated)
+ - **Effective Batch Size**: 1024 (via GradCache)
+ - **Hardware**: NVIDIA H100 (95GB)
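The InfoNCE objective with τ = 0.05 and per-sample hard negatives could look like the sketch below. This is a simplified single-device version without GradCache or in-batch negatives; the function name, shapes, and toy inputs are illustrative only.

```python
import torch
import torch.nn.functional as F

def info_nce(q, pos, negs, tau=0.05):
    """q: (B, D) queries, pos: (B, D) positives, negs: (B, K, D) hard
    negatives; all rows L2-normalized. The positive sits at logit index 0."""
    pos_sim = (q * pos).sum(-1, keepdim=True)            # (B, 1)
    neg_sim = torch.einsum("bd,bkd->bk", q, negs)        # (B, K)
    logits = torch.cat([pos_sim, neg_sim], dim=1) / tau  # (B, 1+K)
    labels = torch.zeros(q.size(0), dtype=torch.long)    # positive index
    return F.cross_entropy(logits, labels)

B, K, D = 4, 8, 896
q = F.normalize(torch.randn(B, D), dim=-1)
pos = F.normalize(torch.randn(B, D), dim=-1)
negs = F.normalize(torch.randn(B, K, D), dim=-1)
print(float(info_nce(q, pos, negs)))
```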
 

  ## Limitations

+ - Strongest on Python; other languages show lower but competitive performance
+ - Weaker on competitive programming tasks (APPS), where long solutions exceed the 512-token training context
+ - May not generalize to low-resource programming languages not seen in training


  ## Citation

  ```bibtex
+ @misc{codecompass2026,
+   author = {Faisal Mumtaz},
+   title = {CodeCompass-Embed: A Code Embedding Model for Semantic Code Search},
+   year = {2026},
+   publisher = {Hugging Face},
+   url = {https://huggingface.co/faisalmumtaz/codecompass-embed}
+ }
  ```

  ## License
config.json CHANGED
@@ -5,6 +5,7 @@
    "attention_dropout": 0.0,
    "bos_token_id": 151643,
    "dtype": "bfloat16",
+   "embedding_dim": 896,
    "eos_token_id": 151643,
    "hidden_act": "silu",
    "hidden_size": 896,
@@ -37,11 +38,15 @@
      "full_attention"
    ],
    "max_position_embeddings": 32768,
+   "max_seq_len": 512,
    "max_window_layers": 24,
+   "model_name": "faisalmumtaz/codecompass-embed",
    "model_type": "qwen2",
+   "normalize": true,
    "num_attention_heads": 14,
    "num_hidden_layers": 24,
    "num_key_value_heads": 2,
+   "pooling": "mean",
    "rms_norm_eps": 1e-06,
    "rope_scaling": null,
    "rope_theta": 1000000.0,
@@ -49,13 +54,8 @@
    "tie_word_embeddings": true,
    "transformers_version": "4.40.0",
    "use_cache": false,
+   "use_lora": false,
    "use_sliding_window": false,
    "vocab_size": 151936,
-   "torch_dtype": "bfloat16",
-   "model_name": "Qwen/Qwen2.5-Coder-0.5B",
-   "embedding_dim": 896,
-   "max_seq_len": 512,
-   "use_lora": false,
-   "pooling": "mean",
-   "normalize": true
+   "torch_dtype": "bfloat16"
  }
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:e0fe7769104ed825d2feba8e063b6f4ea499668985ed73a63c067c5ccc2cd1db
+ oid sha256:c4c4cae2b4ab31994a5aa68a011ac8e0f4125f54123d1b8674b721079e4dd2c1
  size 988096088
tokenizer_config.json CHANGED
@@ -199,9 +199,16 @@
    "eos_token": "<|endoftext|>",
    "errors": "replace",
    "extra_special_tokens": {},
+   "max_length": 1024,
    "model_max_length": 32768,
+   "pad_to_multiple_of": null,
    "pad_token": "<|endoftext|>",
+   "pad_token_type_id": 0,
+   "padding_side": "left",
    "split_special_tokens": false,
+   "stride": 0,
    "tokenizer_class": "Qwen2Tokenizer",
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first",
    "unk_token": null
  }