Commit: Fix datasets metadata: CoRNStack, StackOverflow, CodeSearchNet

File changed: README.md
@@ -13,8 +13,9 @@ tags:
 - feature-extraction
 - sentence-transformers
 datasets:
-- 
-- bigcode/
+- code-rag-bench/cornstack
+- bigcode/stackoverflow
+- code_search_net
 pipeline_tag: feature-extraction
 base_model: Qwen/Qwen2.5-Coder-0.5B
 model-index:
@@ -110,7 +111,8 @@ model.eval()
 def encode(texts, is_query=False):
     # Add instruction prefix for queries
     if is_query:
-        texts = [f"Instruct: Find the most relevant code snippet given the following query
+        texts = [f"Instruct: Find the most relevant code snippet given the following query:
+Query: {t}" for t in texts]
 
     inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
 
@@ -130,9 +132,12 @@ def encode(texts, is_query=False):
 # Example: Code Search
 query = "How to sort a list in Python"
 code_snippets = [
-    "def sort_list(lst)
-
-    "def
+    "def sort_list(lst):
+    return sorted(lst)",
+    "def add_numbers(a, b):
+    return a + b",
+    "def reverse_string(s):
+    return s[::-1]",
 ]
 
 query_emb = encode([query], is_query=True)
@@ -140,18 +145,9 @@ code_embs = encode(code_snippets, is_query=False)
 
 # Compute similarities
 similarities = (query_emb @ code_embs.T).squeeze()
-print(f"Query: {
+print(f"Query: {query}")
 for i, (code, sim) in enumerate(zip(code_snippets, similarities)):
-    print(f"  [{
-```
-
-### With Sentence Transformers (Coming Soon)
-
-```python
-from sentence_transformers import SentenceTransformer
-
-model = SentenceTransformer("faisalmumtaz/codecompass-embed")
-embeddings = model.encode(["def hello(): print('world')"])
+    print(f"  [{sim:.4f}] {code[:50]}...")
 ```
 
 ## Instruction Templates
@@ -160,10 +156,14 @@ For optimal performance, use these instruction prefixes for queries:
 
 | Task | Instruction Template |
 |------|---------------------|
-| NL → Code | `Instruct: Find the most relevant code snippet given the following query
-
-
-
+| NL → Code | `Instruct: Find the most relevant code snippet given the following query:
+Query: {query}` |
+| Code → Code | `Instruct: Find an equivalent code snippet given the following code snippet:
+Query: {query}` |
+| Tech Q&A | `Instruct: Find the most relevant answer given the following question:
+Query: {query}` |
+| Text → SQL | `Instruct: Given a natural language question and schema, find the corresponding SQL query:
+Query: {query}` |
 
 **Note**: Document/corpus texts do NOT need instruction prefixes.
 
@@ -188,13 +188,13 @@ For optimal performance, use these instruction prefixes for queries:
 ## Citation
 
 ```bibtex
-@misc{
-author = {
-title = {
-year = {
-publisher = {
-url = {
-}
+@misc{codecompass2026,
+author = {Faisal Mumtaz},
+title = {CodeCompass-Embed: A Code Embedding Model for Semantic Code Search},
+year = {2026},
+publisher = {Hugging Face},
+url = {https://huggingface.co/faisalmumtaz/codecompass-embed}
+}
 ```
 
 ## License
README.md after the change (updated excerpts, new-file line numbers):

Front matter (lines 13–21):

- feature-extraction
- sentence-transformers
datasets:
- code-rag-bench/cornstack
- bigcode/stackoverflow
- code_search_net
pipeline_tag: feature-extraction
base_model: Qwen/Qwen2.5-Coder-0.5B
model-index:

Usage (lines 111–118):

def encode(texts, is_query=False):
    # Add instruction prefix for queries
    if is_query:
        texts = [f"Instruct: Find the most relevant code snippet given the following query:
Query: {t}" for t in texts]

    inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")

Example: Code Search (lines 132–151):

# Example: Code Search
query = "How to sort a list in Python"
code_snippets = [
    "def sort_list(lst):
    return sorted(lst)",
    "def add_numbers(a, b):
    return a + b",
    "def reverse_string(s):
    return s[::-1]",
]

query_emb = encode([query], is_query=True)
code_embs = encode(code_snippets, is_query=False)

# Compute similarities
similarities = (query_emb @ code_embs.T).squeeze()
print(f"Query: {query}")
for i, (code, sim) in enumerate(zip(code_snippets, similarities)):
    print(f"  [{sim:.4f}] {code[:50]}...")
```

Instruction Templates (lines 153–169):

## Instruction Templates

| Task | Instruction Template |
|------|---------------------|
| NL → Code | `Instruct: Find the most relevant code snippet given the following query:
Query: {query}` |
| Code → Code | `Instruct: Find an equivalent code snippet given the following code snippet:
Query: {query}` |
| Tech Q&A | `Instruct: Find the most relevant answer given the following question:
Query: {query}` |
| Text → SQL | `Instruct: Given a natural language question and schema, find the corresponding SQL query:
Query: {query}` |

**Note**: Document/corpus texts do NOT need instruction prefixes.

Citation (lines 188–200):

## Citation

```bibtex
@misc{codecompass2026,
author = {Faisal Mumtaz},
title = {CodeCompass-Embed: A Code Embedding Model for Semantic Code Search},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/faisalmumtaz/codecompass-embed}
}
```

## License