faisalmumtaz committed on
Commit
1ccb9d7
·
verified ·
1 Parent(s): 72021f1

Upload CodeCompass-Embed v2 — #1 on CSN-Python (NDCG@10=0.979), 12-task CoIR eval

Files changed (4)
  1. README.md +111 -59
  2. config.json +7 -7
  3. model.safetensors +1 -1
  4. tokenizer_config.json +7 -0
README.md CHANGED
@@ -25,21 +25,24 @@ model-index:
        type: retrieval
        name: Code Retrieval
      dataset:
-       type: CoIR-Retrieval/codetrans-dl
-       name: CodeTrans-DL
      metrics:
      - type: ndcg@10
-       value: 0.3305
        name: NDCG@10
    - task:
        type: retrieval
-       name: Code Retrieval
      dataset:
-       type: CoIR-Retrieval/CodeSearchNet-python
-       name: CodeSearchNet Python
      metrics:
      - type: ndcg@10
-       value: 0.9228
        name: NDCG@10
  ---

@@ -49,12 +52,13 @@ model-index:

  ## Model Highlights

- - 🏆 #1 on CodeTrans-DL (code translation between frameworks)
- - 🥇 #4 on CodeSearchNet-Python (natural language to code search)
- - ⚡ 494M parameters, 896-dim embeddings
- - 🔄 Bidirectional attention (converted from causal LLM)
- - 🎯 Mean pooling with L2 normalization
  - 📏 Trained at 512 tokens, extrapolates to longer sequences via RoPE

  ## Model Details

@@ -70,96 +74,144 @@ model-index:

  ## Benchmark Results (CoIR)

- Evaluated on the [CoIR Benchmark](https://github.com/CoIR-team/coir) (NDCG@10). Sorted by CSN-Python.
-
- | Model | Params | CSN-Python | CodeTrans-DL | Text2SQL | SO-QA | CF-ST | Apps |
- |-------|--------|------------|--------------|----------|-------|-------|------|
- | SFR-Embedding-Code | 400M | 0.9505 | 0.2683 | 0.9949 | 0.9107 | 0.7258 | 0.2212 |
- | Jina-Code-v2 | 161M | 0.9439 | 0.2739 | 0.5169 | 0.8874 | 0.6975 | 0.1538 |
- | CodeRankEmbed | 137M | 0.9378 | 0.2604 | 0.7686 | 0.8990 | 0.7166 | 0.1993 |
- | **CodeCompass-Embed** | **494M** | **0.9228** | **0.3305** | **0.5673** | **0.6480** | **0.4080** | **0.1277** |
- | Snowflake-Arctic-Embed-L | 568M | 0.9146 | 0.1958 | 0.5401 | 0.8718 | 0.6503 | 0.1435 |
- | BGE-M3 | 568M | 0.8976 | 0.2194 | 0.5728 | 0.8501 | 0.6437 | 0.1445 |
- | BGE-Base-en-v1.5 | 109M | 0.8944 | 0.2125 | 0.5265 | 0.8581 | 0.6423 | 0.1415 |
- | CodeT5+-110M | 110M | 0.8702 | 0.1794 | 0.3275 | 0.8147 | 0.5804 | 0.1179 |
-
- *CodeCompass-Embed ranks #1 on CodeTrans-DL and #4 on CSN-Python.*

  ## Usage
  ```python
  import torch
  import torch.nn.functional as F
  from transformers import AutoModel, AutoTokenizer

  model = AutoModel.from_pretrained("faisalmumtaz/codecompass-embed", trust_remote_code=True)
  tokenizer = AutoTokenizer.from_pretrained("faisalmumtaz/codecompass-embed")

- # Enable bidirectional attention
- for layer in model.layers:
      layer.self_attn.is_causal = False

  model.eval()

  def encode(texts, is_query=False):
      if is_query:
-         texts = [f"Instruct: Find the most relevant code snippet given the following query:\nQuery: {t}" for t in texts]

      inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")

      with torch.no_grad():
          outputs = model(**inputs, output_hidden_states=True)
          hidden = outputs.hidden_states[-1]

      mask = inputs["attention_mask"].unsqueeze(-1).float()
      embeddings = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
      embeddings = F.normalize(embeddings, p=2, dim=-1)

      return embeddings

- query_emb = encode(["sort a list"], is_query=True)
- code_embs = encode(["def sort(lst): return sorted(lst)"])
- similarity = (query_emb @ code_embs.T).item()
  ```

  ## Instruction Templates

- | Task | Template |
- |------|----------|
- | NL to Code | `Instruct: Find the most relevant code snippet given the following query:\nQuery: {q}` |
- | Code to Code | `Instruct: Find an equivalent code snippet given the following code snippet:\nQuery: {q}` |
- | Tech Q&A | `Instruct: Find the most relevant answer given the following question:\nQuery: {q}` |
- | Text to SQL | `Instruct: Given a natural language question and schema, find the corresponding SQL query:\nQuery: {q}` |

- Documents do not need instruction prefixes.

- ## Training

- - **Data**: 8.8M samples from CoRNStack, StackOverflow, CodeSearchNet
- - **Loss**: InfoNCE (τ=0.05) with 7 hard negatives per sample
- - **Batch Size**: 1024 (via GradCache)
- - **Steps**: 950
- - **Hardware**: NVIDIA H100

  ## Limitations

- - Weaker on Q&A style tasks (StackOverflow-QA, CodeFeedback)
- - Trained on Python/JavaScript/Java/Go/PHP/Ruby


  ## Citation

  ```bibtex
- @misc{codecompass2026,
-   author = {Faisal Mumtaz},
-   title = {CodeCompass-Embed: A Code Embedding Model for Semantic Code Search},
-   year = {2026},
-   publisher = {Hugging Face},
-   url = {https://huggingface.co/faisalmumtaz/codecompass-embed}
- }
  ```

  ## License

        type: retrieval
        name: Code Retrieval
      dataset:
+       type: CoIR-Retrieval/CodeSearchNet-python
+       name: CodeSearchNet Python
      metrics:
      - type: ndcg@10
+       value: 0.979
        name: NDCG@10
+     - type: mrr@10
+       value: 0.976
+       name: MRR@10
    - task:
        type: retrieval
+       name: Code Translation
      dataset:
+       type: CoIR-Retrieval/codetrans-dl
+       name: CodeTrans-DL
      metrics:
      - type: ndcg@10
+       value: 0.286
        name: NDCG@10
  ---


  ## Model Highlights

+ - 🏆 **#1 on CodeSearchNet-Python**: NDCG@10 = 0.979, vs. 0.951 for the next-best model (SFR-Embedding-Code)
+ - 🥇 **#1 on CodeTrans-DL**: code translation between deep learning frameworks
+ - ⚡ **494M parameters**, 896-dim embeddings — runs on consumer GPUs
+ - 🔄 **Bidirectional attention** (converted from causal LLM)
+ - 🎯 **Mean pooling** with L2 normalization
  - 📏 Trained at 512 tokens, extrapolates to longer sequences via RoPE
+ - 🌐 **Multi-language**: Python, Java, JavaScript, Go, Ruby, PHP

  ## Model Details


  ## Benchmark Results (CoIR)

+ Evaluated on the [CoIR Benchmark](https://github.com/CoIR-team/coir) (ACL 2025). All scores are NDCG@10, sorted by CSN-Python.
+
+ | Model | Params | CSN-Py | CodeTrans | Text2SQL | SO-QA | CodeFeedback | Apps |
+ |-------|--------|--------|-----------|----------|-------|--------------|------|
+ | **CodeCompass-Embed (ours)** | **494M** | **0.979** 🏆 | **0.286** 🏆 | **0.736** | **0.834** | **0.814** | **0.349** |
+ | SFR-Embedding-Code | 400M | 0.951 | 0.268 | 0.995 | 0.911 | 0.726 | 0.221 |
+ | Jina-Code-v2 | 161M | 0.944 | 0.274 | 0.517 | 0.887 | 0.698 | 0.154 |
+ | CodeRankEmbed | 137M | 0.938 | 0.260 | 0.769 | 0.899 | 0.717 | 0.199 |
+ | Snowflake-Arctic-Embed-L | 568M | 0.915 | 0.196 | 0.540 | 0.872 | 0.650 | 0.144 |
+ | BGE-M3 | 568M | 0.898 | 0.219 | 0.573 | 0.850 | 0.644 | 0.145 |
+ | BGE-Base-en-v1.5 | 109M | 0.894 | 0.213 | 0.527 | 0.858 | 0.642 | 0.142 |
+ | CodeT5+-110M | 110M | 0.870 | 0.179 | 0.328 | 0.815 | 0.580 | 0.118 |
+
+ ### Multi-Language Code Search (CodeSearchNet)
+
+ | Language | NDCG@10 | MRR@10 |
+ |----------|---------|--------|
+ | **Python** | **0.979** | **0.976** |
+ | Go | 0.797 | 0.767 |
+ | Java | 0.639 | 0.600 |
+ | PHP | 0.627 | 0.585 |
+ | JavaScript | 0.621 | 0.578 |
+ | Ruby | 0.579 | 0.535 |
+
+ ### Full Results (All 12 Tasks)
+
+ | Task | NDCG@10 | MRR@10 |
+ |------|---------|--------|
+ | **codesearchnet-python** | **0.979** 🏆 | **0.976** |
+ | stackoverflow-qa | 0.834 | 0.810 |
+ | codefeedback-st | 0.814 | 0.775 |
+ | codesearchnet-go | 0.797 | 0.767 |
+ | synthetic-text2sql | 0.736 | 0.662 |
+ | codesearchnet-java | 0.639 | 0.600 |
+ | codesearchnet-php | 0.627 | 0.585 |
+ | codesearchnet-javascript | 0.621 | 0.578 |
+ | codesearchnet-ruby | 0.579 | 0.535 |
+ | apps | 0.349 | 0.307 |
+ | codetrans-dl | 0.286 🏆 | 0.164 |
+ | cosqa | 0.209 | 0.165 |
+ | **Average (12 tasks)** | **0.623** | **0.577** |
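For reference, the NDCG@10 metric reported in these tables can be computed as in the following sketch. This is a simplified version that derives the ideal DCG from the same scored list, i.e. it assumes every relevant document appears among the scored results; the function name is illustrative.

```python
import math

def ndcg_at_10(rels):
    """NDCG@10 for a list of relevance grades in rank order.

    Simplified sketch: ideal DCG is computed from the same list,
    so it assumes all relevant documents appear in `rels`.
    """
    def dcg(scores):
        return sum(r / math.log2(i + 2) for i, r in enumerate(scores[:10]))

    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

# A perfect ranking scores 1.0; the single relevant doc at rank 3 scores 0.5.
print(ndcg_at_10([1, 0, 0, 0, 0, 0, 0, 0, 0, 0]))  # 1.0
print(ndcg_at_10([0, 0, 1, 0, 0, 0, 0, 0, 0, 0]))  # 0.5
```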
 
  ## Usage

+ ### With Transformers
+
  ```python
  import torch
  import torch.nn.functional as F
  from transformers import AutoModel, AutoTokenizer

+ # Load model
  model = AutoModel.from_pretrained("faisalmumtaz/codecompass-embed", trust_remote_code=True)
  tokenizer = AutoTokenizer.from_pretrained("faisalmumtaz/codecompass-embed")

+ # CRITICAL: Enable bidirectional attention for embeddings
+ for layer in model.model.layers:
      layer.self_attn.is_causal = False

  model.eval()

  def encode(texts, is_query=False):
+     # Add instruction prefix for queries
      if is_query:
+         texts = [f"Instruct: Find the most relevant code snippet given the following query:\nQuery: {t}" for t in texts]

      inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")

      with torch.no_grad():
          outputs = model(**inputs, output_hidden_states=True)
          hidden = outputs.hidden_states[-1]

+     # Mean pooling
      mask = inputs["attention_mask"].unsqueeze(-1).float()
      embeddings = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

+     # L2 normalize
      embeddings = F.normalize(embeddings, p=2, dim=-1)

      return embeddings

+ # Example: Code Search
+ query = "How to sort a list in Python"
+ code_snippets = [
+     "def sort_list(lst):\n    return sorted(lst)",
+     "def add_numbers(a, b):\n    return a + b",
+     "def reverse_string(s):\n    return s[::-1]",
+ ]
+
+ query_emb = encode([query], is_query=True)
+ code_embs = encode(code_snippets, is_query=False)
+
+ # Compute similarities (cosine, since embeddings are L2-normalized)
+ similarities = (query_emb @ code_embs.T).squeeze()
+ print(f"Query: {query}")
+ for code, sim in zip(code_snippets, similarities):
+     print(f"  [{sim:.4f}]  {code[:50]}...")
  ```
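The tokenizer config in this commit sets `padding_side` to `left`. With attention-mask-weighted mean pooling as in the usage example above, the padding side does not change the pooled embedding, because pad positions are zeroed out before averaging. A toy check, with random tensors standing in for hidden states (this does not exercise the actual model):

```python
import torch

torch.manual_seed(0)
tokens = torch.randn(3, 8)   # hidden states of 3 real tokens
pad = torch.zeros(1, 8)      # stand-in for one pad position

right = torch.cat([tokens, pad]).unsqueeze(0)  # [t1, t2, t3, PAD]
left = torch.cat([pad, tokens]).unsqueeze(0)   # [PAD, t1, t2, t3]
mask_r = torch.tensor([[1., 1., 1., 0.]]).unsqueeze(-1)
mask_l = torch.tensor([[0., 1., 1., 1.]]).unsqueeze(-1)

def mean_pool(hidden, mask):
    # Same masked mean as in the usage example
    return (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

assert torch.allclose(mean_pool(right, mask_r), mean_pool(left, mask_l))
```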
 
  ## Instruction Templates

+ For optimal performance, use these instruction prefixes for queries:
+
+ | Task | Instruction Template |
+ |------|---------------------|
+ | NL → Code | `Instruct: Find the most relevant code snippet given the following query:\nQuery: {query}` |
+ | Code → Code | `Instruct: Find an equivalent code snippet given the following code snippet:\nQuery: {query}` |
+ | Tech Q&A | `Instruct: Find the most relevant answer given the following question:\nQuery: {query}` |
+ | Text → SQL | `Instruct: Given a natural language question and schema, find the corresponding SQL query:\nQuery: {query}` |
+
+ **Note**: Document/corpus texts do NOT need instruction prefixes.
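The four templates above can be wrapped in a small helper. A minimal sketch, where the `TEMPLATES` dict keys and the `format_query` name are illustrative, not part of the released model API:

```python
# Illustrative helper around the documented query prefixes; the dict keys
# and function name are made up for this sketch, not part of the model API.
TEMPLATES = {
    "nl2code": "Instruct: Find the most relevant code snippet given the following query:",
    "code2code": "Instruct: Find an equivalent code snippet given the following code snippet:",
    "qa": "Instruct: Find the most relevant answer given the following question:",
    "text2sql": "Instruct: Given a natural language question and schema, find the corresponding SQL query:",
}

def format_query(text: str, task: str = "nl2code") -> str:
    # Documents/corpus texts are embedded as-is, without a prefix.
    return f"{TEMPLATES[task]}\nQuery: {text}"

print(format_query("sort a list"))
```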
+ ## Training Details

+ - **Base Model**: Qwen2.5-Coder-0.5B (continued fine-tuning from a previous CodeCompass checkpoint)
+ - **Training Data**: 100K GPT-filtered gold-standard samples from CoRNStack, StackOverflow, and CodeSearchNet, plus hard negatives
+ - **Architecture**: Bidirectional attention across all 24 layers, mean pooling, L2 normalization
+ - **Loss**: InfoNCE with temperature τ = 0.05
+ - **Hard Negatives**: Up to 8 per sample (GPT-validated)
+ - **Effective Batch Size**: 1024 (via GradCache)
+ - **Hardware**: NVIDIA H100 (95GB)
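The InfoNCE objective with τ = 0.05 and per-sample hard negatives could look like the sketch below. This is a simplified single-device version without GradCache or in-batch negatives; the function name, shapes, and toy inputs are illustrative only.

```python
import torch
import torch.nn.functional as F

def info_nce(q, pos, negs, tau=0.05):
    """q: (B, D) queries, pos: (B, D) positives, negs: (B, K, D) hard
    negatives; all rows L2-normalized. The positive sits at logit index 0."""
    pos_sim = (q * pos).sum(-1, keepdim=True)            # (B, 1)
    neg_sim = torch.einsum("bd,bkd->bk", q, negs)        # (B, K)
    logits = torch.cat([pos_sim, neg_sim], dim=1) / tau  # (B, 1+K)
    labels = torch.zeros(q.size(0), dtype=torch.long)    # positive index
    return F.cross_entropy(logits, labels)

B, K, D = 4, 8, 896
q = F.normalize(torch.randn(B, D), dim=-1)
pos = F.normalize(torch.randn(B, D), dim=-1)
negs = F.normalize(torch.randn(B, K, D), dim=-1)
print(float(info_nce(q, pos, negs)))
```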
 

  ## Limitations

+ - Strongest on Python; other languages show lower but competitive performance
+ - Weaker on competitive programming tasks (APPS), where long solutions exceed the 512-token training context
+ - May not generalize to low-resource programming languages not seen in training


  ## Citation

  ```bibtex
+ @misc{codecompass2026,
+   author = {Faisal Mumtaz},
+   title = {CodeCompass-Embed: A Code Embedding Model for Semantic Code Search},
+   year = {2026},
+   publisher = {Hugging Face},
+   url = {https://huggingface.co/faisalmumtaz/codecompass-embed}
+ }
  ```

  ## License
config.json CHANGED
@@ -5,6 +5,7 @@
    "attention_dropout": 0.0,
    "bos_token_id": 151643,
    "dtype": "bfloat16",
+   "embedding_dim": 896,
    "eos_token_id": 151643,
    "hidden_act": "silu",
    "hidden_size": 896,
@@ -37,11 +38,15 @@
      "full_attention"
    ],
    "max_position_embeddings": 32768,
+   "max_seq_len": 512,
    "max_window_layers": 24,
+   "model_name": "faisalmumtaz/codecompass-embed",
    "model_type": "qwen2",
+   "normalize": true,
    "num_attention_heads": 14,
    "num_hidden_layers": 24,
    "num_key_value_heads": 2,
+   "pooling": "mean",
    "rms_norm_eps": 1e-06,
    "rope_scaling": null,
    "rope_theta": 1000000.0,
@@ -49,13 +54,8 @@
    "tie_word_embeddings": true,
    "transformers_version": "4.40.0",
    "use_cache": false,
+   "use_lora": false,
    "use_sliding_window": false,
    "vocab_size": 151936,
-   "torch_dtype": "bfloat16",
-   "model_name": "Qwen/Qwen2.5-Coder-0.5B",
-   "embedding_dim": 896,
-   "max_seq_len": 512,
-   "use_lora": false,
-   "pooling": "mean",
-   "normalize": true
+   "torch_dtype": "bfloat16"
  }
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:e0fe7769104ed825d2feba8e063b6f4ea499668985ed73a63c067c5ccc2cd1db
+ oid sha256:c4c4cae2b4ab31994a5aa68a011ac8e0f4125f54123d1b8674b721079e4dd2c1
  size 988096088
tokenizer_config.json CHANGED
@@ -199,9 +199,16 @@
    "eos_token": "<|endoftext|>",
    "errors": "replace",
    "extra_special_tokens": {},
+   "max_length": 1024,
    "model_max_length": 32768,
+   "pad_to_multiple_of": null,
    "pad_token": "<|endoftext|>",
+   "pad_token_type_id": 0,
+   "padding_side": "left",
    "split_special_tokens": false,
+   "stride": 0,
    "tokenizer_class": "Qwen2Tokenizer",
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first",
    "unk_token": null
  }