faisalmumtaz committed on
Commit 72021f1 · verified · 1 Parent(s): 6321f18

Simplify README: single benchmark table, factual highlights

Files changed (1):
  1. README.md +26 -52
README.md CHANGED
@@ -41,9 +41,6 @@ model-index:
   - type: ndcg@10
     value: 0.9228
     name: NDCG@10
-  - type: mrr@10
-    value: 0.9106
-    name: MRR@10
 ---
 
 # CodeCompass-Embed
@@ -90,91 +87,68 @@ Evaluated on the [CoIR Benchmark](https://github.com/CoIR-team/coir) (NDCG@10).
 
 ## Usage
 
-### With Transformers
-
 ```python
 import torch
 import torch.nn.functional as F
 from transformers import AutoModel, AutoTokenizer
 
-# Load model
 model = AutoModel.from_pretrained("faisalmumtaz/codecompass-embed", trust_remote_code=True)
 tokenizer = AutoTokenizer.from_pretrained("faisalmumtaz/codecompass-embed")
 
-# CRITICAL: Enable bidirectional attention for embeddings
+# Enable bidirectional attention
 for layer in model.layers:
     layer.self_attn.is_causal = False
 
 model.eval()
 
 def encode(texts, is_query=False):
-    # Add instruction prefix for queries
     if is_query:
         texts = [f"Instruct: Find the most relevant code snippet given the following query:\nQuery: {t}" for t in texts]
 
     inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
 
     with torch.no_grad():
         outputs = model(**inputs, output_hidden_states=True)
         hidden = outputs.hidden_states[-1]
-
-        # Mean pooling
         mask = inputs["attention_mask"].unsqueeze(-1).float()
         embeddings = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
-
-        # L2 normalize
         embeddings = F.normalize(embeddings, p=2, dim=-1)
 
     return embeddings
 
-# Example: Code Search
-query = "How to sort a list in Python"
-code_snippets = [
-    "def sort_list(lst):\n    return sorted(lst)",
-    "def add_numbers(a, b):\n    return a + b",
-    "def reverse_string(s):\n    return s[::-1]",
-]
-
-query_emb = encode([query], is_query=True)
-code_embs = encode(code_snippets, is_query=False)
-
-# Compute similarities
-similarities = (query_emb @ code_embs.T).squeeze()
-print(f"Query: {query}")
-for i, (code, sim) in enumerate(zip(code_snippets, similarities)):
-    print(f"  [{sim:.4f}] {code[:50]}...")
+query_emb = encode(["sort a list"], is_query=True)
+code_embs = encode(["def sort(lst): return sorted(lst)"])
+similarity = (query_emb @ code_embs.T).item()
 ```
 
 ## Instruction Templates
 
-For optimal performance, use these instruction prefixes for queries:
-
-| Task | Instruction Template |
-|------|---------------------|
-| NL Code | `Instruct: Find the most relevant code snippet given the following query:\nQuery: {query}` |
-| Code → Code | `Instruct: Find an equivalent code snippet given the following code snippet:\nQuery: {query}` |
-| Tech Q&A | `Instruct: Find the most relevant answer given the following question:\nQuery: {query}` |
-| Text → SQL | `Instruct: Given a natural language question and schema, find the corresponding SQL query:\nQuery: {query}` |
-
-**Note**: Document/corpus texts do NOT need instruction prefixes.
-
-## Training Details
-
-- **Base Model**: Qwen2.5-Coder-0.5B
-- **Training Data**: 8.8M samples from CoRNStack, StackOverflow, CodeSearchNet
-- **Architecture Modification**: Converted all 24 attention layers from causal to bidirectional
-- **Pooling**: Mean pooling
-- **Loss**: InfoNCE with temperature τ=0.05
-- **Hard Negatives**: 7 per sample (embedding-mined)
-- **Effective Batch Size**: 1024 (via GradCache)
-- **Training Steps**: 950
+| Task | Template |
+|------|----------|
+| NL to Code | `Instruct: Find the most relevant code snippet given the following query:\nQuery: {q}` |
+| Code to Code | `Instruct: Find an equivalent code snippet given the following code snippet:\nQuery: {q}` |
+| Tech Q&A | `Instruct: Find the most relevant answer given the following question:\nQuery: {q}` |
+| Text to SQL | `Instruct: Given a natural language question and schema, find the corresponding SQL query:\nQuery: {q}` |
+
+Documents do not need instruction prefixes.
+
+## Training
+
+- **Data**: 8.8M samples from CoRNStack, StackOverflow, CodeSearchNet
+- **Loss**: InfoNCE (τ=0.05) with 7 hard negatives per sample
+- **Batch Size**: 1024 (via GradCache)
+- **Steps**: 950
 - **Hardware**: NVIDIA H100
 
 ## Limitations
 
 - Weaker on Q&A style tasks (StackOverflow-QA, CodeFeedback)
-- Trained primarily on Python/JavaScript/Java/Go/PHP/Ruby
-- May not generalize well to low-resource programming languages
+- Trained on Python/JavaScript/Java/Go/PHP/Ruby
 
 ## Citation
 
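The `encode` function kept in both versions of the README relies on masked mean pooling followed by L2 normalization. As a sanity check of that math, here is a dependency-free sketch with toy numbers (illustrative only, not the model's actual hidden states; the real code operates on torch tensors):

```python
import math

def masked_mean_pool(hidden, mask):
    """Average token vectors where mask == 1, ignoring padded positions."""
    dim = len(hidden[0])
    total = [0.0] * dim
    count = 0
    for vec, m in zip(hidden, mask):
        if m:
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    return [t / max(count, 1) for t in total]

def l2_normalize(vec):
    """Scale a vector to unit length, mirroring F.normalize(p=2, dim=-1)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# Toy sequence: 3 tokens of dim 2, where the last token is padding (mask 0)
hidden = [[1.0, 2.0], [3.0, 4.0], [99.0, 99.0]]
mask = [1, 1, 0]

pooled = masked_mean_pool(hidden, mask)  # [2.0, 3.0] — the padded token is ignored
embedding = l2_normalize(pooled)         # unit-length vector
```

Because the embeddings are unit-normalized, the plain dot product `query_emb @ code_embs.T` in the README is exactly cosine similarity.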