Shuu12121 committed · Commit 11b8466 · verified · 1 Parent(s): 7cfae01

Update README.md

  1. README.md +197 -76
README.md CHANGED
@@ -17,14 +17,39 @@ datasets:
  - Shuu12121/codeedit_hard_negative_datasets_kd
  ---
 
- # NightOwl CodeEmbedding
 
- `NightOwl-CodeEmbedding` is a 768-dimensional dense embedding model specialized
- for code retrieval, code-edit retrieval, and technical question answering. It
- is fine-tuned from [`Shuu12121/NightOwl`](https://huggingface.co/Shuu12121/NightOwl)
- and uses CLS pooling with cosine similarity.
 
- The model does not require query or document prefixes.
 
  ## Usage
 
@@ -39,92 +64,188 @@ documents = [
  "def average(values): return sum(values) / len(values)",
  ]
 
- query_embeddings = model.encode(queries, normalize_embeddings=True)
- document_embeddings = model.encode(documents, normalize_embeddings=True)
 
- scores = query_embeddings @ document_embeddings.T
  print(scores)
  ```
 
  ## Model Details
 
- | Property | Value |
- |---|---|
- | Base model | `Shuu12121/NightOwl` |
- | Architecture | ModernBERT |
- | Parameters | 150,779,136 |
- | Embedding dimension | 768 |
- | Pooling | CLS |
- | Maximum sequence length | 1,024 tokens |
- | Similarity | Cosine |
- | Query/document prefixes | None |
- | Weight dtype | FP32 |
- | Weight memory | 575 MiB |
- | License | Apache-2.0 |
 
  ## MTEB Results
 
- The model was evaluated using:
-
- - Model revision: `c7c8a57b9539297e192d5cf39b9aecf1fb376edd`
- - MTEB version: `2.15.1`
- - Metric: NDCG@10
- - Hardware: NVIDIA GeForce RTX 5090
- - Batch size: 64
-
- Multi-subset task scores are macro averages. `CodeEditSearchRetrieval` uses its
- official `train` evaluation split; the other tasks use `test`.
-
- | Task | Split | NDCG@10 |
- |---|---:|---:|
- | AppsRetrieval | test | 0.39177 |
- | COIRCodeSearchNetRetrieval | test | 0.84264 |
- | CodeEditSearchRetrieval | train | 0.74808 |
- | CodeFeedbackMT | test | 0.76690 |
- | CodeFeedbackST | test | 0.85207 |
- | CodeSearchNetCCRetrieval | test | 0.91805 |
- | CodeSearchNetRetrieval | test | 0.89239 |
- | CodeTransOceanContest | test | 0.75953 |
- | CodeTransOceanDL | test | 0.36057 |
- | CosQA | test | 0.42810 |
- | StackOverflowQA | test | 0.86608 |
- | SyntheticText2SQL | test | 0.68266 |
- | **Macro average (all 12 tasks)** | | **0.70907** |
- | **CoIR macro average (10 tasks)** | | **0.68684** |
 
  ## Training
 
  The model was trained with `CachedMultipleNegativesRankingLoss` using
- bidirectional query-to-document and document-to-query objectives. The generated
- training metadata reports 2,534,400 training samples with one positive and
- fifteen negatives per anchor.
-
- The training data covers the following MTEB task families:
-
- - `AppsRetrieval`
- - `COIRCodeSearchNetRetrieval`
- - `CodeFeedbackMT`
- - `CodeFeedbackST`
- - `CodeSearchNetCCRetrieval`
- - `CodeSearchNetRetrieval`
- - `CodeTransOceanContest`
- - `CodeTransOceanDL`
- - `CosQA`
- - `StackOverflowQA`
- - `SyntheticText2SQL`
-
- `CodeEditSearchRetrieval` was evaluated separately and is not listed as a
- training dataset in the published model metadata.
 
  ## Limitations
 
- - The model is specialized for code-related retrieval and may underperform
- general-purpose text embedding models on unrelated domains.
- - Inputs longer than 1,024 tokens are truncated.
- - Benchmark scores include in-domain tasks related to the training data and
- should not be interpreted as strictly zero-shot results.
 
  ## Citation
 
- If you use this model, cite Sentence Transformers and the base model where
- appropriate.
  - Shuu12121/codeedit_hard_negative_datasets_kd
  ---
 
+ # NightOwl-CodeEmbedding 🦉
 
+ `NightOwl-CodeEmbedding` is a compact 768-dimensional dense embedding model
+ specialized for code retrieval, code-edit retrieval, and technical question
+ answering.
 
+ The model is fine-tuned from
+ [`Shuu12121/NightOwl`](https://huggingface.co/Shuu12121/NightOwl), a
+ ModernBERT-based code model. It uses CLS pooling with cosine similarity and
+ does **not** require `query:` / `passage:` style prefixes.
+
+ ## Highlights
+
+ * Compact (150.8M parameters) yet competitive on CoIR-style code retrieval benchmarks
+ * Covers **eight programming languages**, including Rust and TypeScript in
+ addition to the six CodeSearchNet languages
+ * Handles a wide range of code retrieval scenarios: NL-to-code search,
+ code-to-code retrieval, **code-edit retrieval**, and technical QA
+ * Trained with hard negatives mined by `Qwen/Qwen3-Embedding-0.6B`
+ (15 hard negatives per anchor)
+ * Decontaminated against CodeSearchNet test splits and the
+ CodeEditSearchRetrieval benchmark (see [Data Decontamination](#data-decontamination))
+ * Drop-in compatible with `sentence-transformers`, Apache-2.0 license
+
+ ## Supported Languages
+
+ The training data covers the six CodeSearchNet languages plus two additional
+ languages:
+
+ * Go, Java, JavaScript, PHP, Python, Ruby (CodeSearchNet languages)
+ * **Rust, TypeScript** (additional)
+
+ Performance on languages outside this set is not guaranteed and may vary.
 
  ## Usage
 
 
  "def average(values): return sum(values) / len(values)",
  ]
 
+ query_embeddings = model.encode(queries)
+ document_embeddings = model.encode(documents)
 
+ # Cosine similarity (embeddings are normalized internally by similarity())
+ scores = model.similarity(query_embeddings, document_embeddings)
  print(scores)
  ```
 
  ## Model Details
 
+ | Property | Value |
+ | ----------------------- | -------------------- |
+ | Base model | `Shuu12121/NightOwl` |
+ | Architecture | ModernBERT |
+ | Parameters | 150,779,136 |
+ | Embedding dimension | 768 |
+ | Pooling | CLS pooling |
+ | Maximum sequence length | 1,024 tokens |
+ | Similarity | Cosine similarity |
+ | Query/document prefixes | Not required |
+ | Weight dtype | FP32 |
+ | Weight memory | 575 MiB |
+ | License | Apache-2.0 |
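
The weight-memory figure in the table can be cross-checked from the parameter count and dtype. The following is a minimal sketch, assuming 4 bytes per FP32 parameter and counting only raw weights (no buffers, optimizer state, or file metadata):

```python
# Values taken from the Model Details table.
NUM_PARAMETERS = 150_779_136
BYTES_PER_FP32 = 4  # assumption: every weight is stored as a 4-byte float


def weight_memory_mib(num_parameters: int, bytes_per_parameter: int) -> float:
    """Return raw weight storage in MiB (1 MiB = 2**20 bytes)."""
    return num_parameters * bytes_per_parameter / 2**20


fp32_mib = weight_memory_mib(NUM_PARAMETERS, BYTES_PER_FP32)
print(f"{fp32_mib:.1f} MiB")  # 575.2 MiB, consistent with the 575 MiB in the table
```

Under the same assumptions, a 2-byte dtype would roughly halve this figure; the README itself only reports FP32 weights.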
 
  ## MTEB Results
 
+ The model was evaluated with MTEB on code-related retrieval and technical QA
+ tasks.
+
+ Evaluation setup:
+
+ * Model revision: `c7c8a57b9539297e192d5cf39b9aecf1fb376edd`
+ * MTEB version: `2.15.1`
+ * Metric: `NDCG@10`
+ * Hardware: NVIDIA GeForce RTX 5090
+ * Batch size: 64
+
+ Multi-subset task scores are reported as macro averages.
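
For readers unfamiliar with the metric, the sketch below shows the textbook definition of NDCG@k with linear gain for a single query. It is an illustration only: MTEB computes the metric through its own evaluation stack, and the function names here are hypothetical.

```python
import math


def dcg_at_k(relevances: list[float], k: int) -> float:
    """Discounted cumulative gain over the top-k ranked relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: list[float], all_relevances: list[float], k: int = 10) -> float:
    """NDCG@k: DCG of the produced ranking divided by the DCG of an ideal ranking.

    ranked_relevances: relevance labels in the order the model ranked the documents.
    all_relevances: relevance labels of every relevant document for the query.
    """
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    if ideal == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal


# One relevant document, ranked third out of five retrieved documents.
score_third = ndcg_at_k([0, 0, 1, 0, 0], [1], k=10)
# The same document ranked first gives a perfect score.
score_first = ndcg_at_k([1, 0, 0, 0, 0], [1], k=10)
print(score_third, score_first)  # 0.5 1.0
```

A task-level score is then the mean of these per-query values, and a multi-subset task score is the macro average over its subsets.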
+
+ | Task | Split | NDCG@10 |
+ | -------------------------------- | ------: | ----------: |
+ | AppsRetrieval | test | 0.39177 |
+ | COIRCodeSearchNetRetrieval | test | 0.84264 |
+ | CodeEditSearchRetrieval | train¹ | 0.74808 |
+ | CodeFeedbackMT | test | 0.76690 |
+ | CodeFeedbackST | test | 0.85207 |
+ | CodeSearchNetCCRetrieval | test | 0.91805 |
+ | CodeSearchNetRetrieval | test | 0.89239 |
+ | CodeTransOceanContest | test | 0.75953 |
+ | CodeTransOceanDL | test | 0.36057 |
+ | CosQA | test | 0.42810 |
+ | StackOverflowQA | test | 0.86608 |
+ | SyntheticText2SQL | test | 0.68266 |
+ | **Macro average, all 12 tasks** | | **0.70907** |
+ | **CoIR macro average, 10 tasks** | | **0.68684** |
+
+ ¹ `CodeEditSearchRetrieval` does not provide a standard `test` split in MTEB,
+ so the official `train` split is used for evaluation. These examples were
+ **not** used for fine-tuning. See
+ [Data Decontamination](#data-decontamination) for details.
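
The two summary rows can be reproduced from the per-task scores. In the sketch below, the 12-task figure is a plain mean of the table. The 10-task subset is an assumption: the README does not list which tasks it contains, and excluding `CodeEditSearchRetrieval` and `CodeSearchNetRetrieval` is the choice that matches the reported value.

```python
# Per-task NDCG@10 values copied from the results table.
scores = {
    "AppsRetrieval": 0.39177,
    "COIRCodeSearchNetRetrieval": 0.84264,
    "CodeEditSearchRetrieval": 0.74808,
    "CodeFeedbackMT": 0.76690,
    "CodeFeedbackST": 0.85207,
    "CodeSearchNetCCRetrieval": 0.91805,
    "CodeSearchNetRetrieval": 0.89239,
    "CodeTransOceanContest": 0.75953,
    "CodeTransOceanDL": 0.36057,
    "CosQA": 0.42810,
    "StackOverflowQA": 0.86608,
    "SyntheticText2SQL": 0.68266,
}

# Assumption: these two tasks are the ones left out of the 10-task CoIR average.
NON_COIR_TASKS = {"CodeEditSearchRetrieval", "CodeSearchNetRetrieval"}

macro_all = sum(scores.values()) / len(scores)
coir_scores = [value for task, value in scores.items() if task not in NON_COIR_TASKS]
macro_coir = sum(coir_scores) / len(coir_scores)

print(round(macro_all, 5), round(macro_coir, 5))  # 0.70907 0.68684
```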
+
+ <!-- TODO: Add a comparison table with the base NightOwl model and/or
+ similar-sized code embedding models (e.g. jina-embeddings-v2-base-code,
+ CodeXEmbed) to give readers a reference point. -->
+
+ Because the benchmark suite consists of in-domain code retrieval tasks related
+ to the model's training distribution, these results should not be interpreted
+ as strictly zero-shot performance.
 
  ## Training
 
  The model was trained with `CachedMultipleNegativesRankingLoss` using
+ bidirectional query-to-document and document-to-query objectives.
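
To make the bidirectional objective concrete, here is a dependency-free sketch of a symmetric in-batch ranking loss: a cross-entropy over each query's similarities to all documents, plus the same over each document's similarities to all queries. It is not the Sentence Transformers implementation. The `scale` value of 20 is an assumed temperature, and both the explicit hard negatives and the gradient caching that gives the loss its name are omitted.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def cross_entropy(logits: list[float], target_index: int) -> float:
    """Numerically stable softmax cross-entropy for one row of logits."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target_index]


def bidirectional_ranking_loss(queries, documents, scale: float = 20.0) -> float:
    """Average of query-to-document and document-to-query in-batch losses.

    queries[i] and documents[i] form the positive pair; every other item in the
    batch acts as a negative. `scale` is an assumed temperature, not a value
    taken from the README.
    """
    n = len(queries)
    sim = [[scale * cosine(q, d) for d in documents] for q in queries]
    query_to_doc = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    doc_to_query = sum(
        cross_entropy([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (query_to_doc + doc_to_query) / 2


# Toy 2-D embeddings: aligned pairs should score a much lower loss than swapped pairs.
aligned_loss = bidirectional_ranking_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = bidirectional_ranking_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```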
+
+ | Property | Value |
+ | -------------------------- | ----------------------------------------- |
+ | Training samples | 2,534,400 |
+ | Positives per anchor | 1 |
+ | Negatives per anchor | 15 |
+ | Loss | `CachedMultipleNegativesRankingLoss` |
+ | Objective | Bidirectional retrieval training |
+ | Hard-negative mining model | `Qwen/Qwen3-Embedding-0.6B` |
+ | Epochs | 1 |
+ | Learning rate | 6e-5 |
+ | Batch size | 1024 |
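
The table implies a training length that is easy to work out. The sketch below assumes the listed batch size is the effective global batch size and that each sample contributes one anchor, one positive, and fifteen negatives. Neither detail is spelled out in the README, so treat the result as an estimate.

```python
import math

# Values taken from the training table.
TRAINING_SAMPLES = 2_534_400
BATCH_SIZE = 1024  # assumption: effective (global) batch size
EPOCHS = 1
NEGATIVES_PER_ANCHOR = 15

steps_per_epoch = math.ceil(TRAINING_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS

# Each sample is assumed to hold an anchor, one positive, and the negatives.
texts_per_sample = 1 + 1 + NEGATIVES_PER_ANCHOR
texts_per_step = BATCH_SIZE * texts_per_sample

print(total_steps, texts_per_step)  # 2475 17408
```

The sample count divides evenly by the batch size, so under these assumptions no partial final batch is needed.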
+
+ ### Training Data
+
+ The training data is a mixture of:
+
+ 1. **Public code-retrieval datasets** covering the following CoIR task
+ families: AppsRetrieval, COIRCodeSearchNetRetrieval, CodeFeedbackMT,
+ CodeFeedbackST, CodeSearchNetCCRetrieval, CodeSearchNetRetrieval,
+ CodeTransOceanContest, CodeTransOceanDL, CosQA, StackOverflowQA, and
+ SyntheticText2SQL.
+ 2. **Custom code-comment pair data** consisting of code snippets paired with
+ natural-language description comments across the eight supported languages
+ (the six CodeSearchNet languages plus Rust and TypeScript).
+ 3. **Code-edit data** derived from `commitpackft`, pairing edit intents with
+ code changes.
+
+ All datasets were constructed as hard-negative retrieval datasets: for each
+ anchor, one positive and fifteen hard negatives were used. Hard negatives were
+ mined with
+ [`Qwen/Qwen3-Embedding-0.6B`](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B),
+ which retrieves semantically similar but non-matching candidates, producing
+ negatives that are more difficult than random negatives. The mining model is
+ used only during dataset construction and is not required at inference time.
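
The mining step described above can be sketched as a nearest-neighbour search that skips the known positive. The example uses toy 2-D vectors in place of real mining-model embeddings, and `mine_hard_negatives` is a hypothetical helper, not part of the published pipeline. Real pipelines often also filter candidates that score suspiciously close to the positive; the README does not say whether that was done here.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mine_hard_negatives(
    anchor_vector: list[float],
    positive_id: str,
    candidates: dict[str, list[float]],
    num_negatives: int = 15,
) -> list[str]:
    """Return the ids of the most anchor-similar candidates, excluding the positive."""
    scored = [
        (cosine(anchor_vector, vector), candidate_id)
        for candidate_id, vector in candidates.items()
        if candidate_id != positive_id
    ]
    scored.sort(reverse=True)  # most similar (hardest) first
    return [candidate_id for _, candidate_id in scored[:num_negatives]]


# Toy corpus: vectors stand in for embeddings produced by the mining model.
corpus = {
    "positive": [0.9, 0.1],
    "near_miss": [0.8, 0.6],
    "related": [0.6, 0.8],
    "unrelated": [-1.0, 0.0],
}
hard_negatives = mine_hard_negatives([1.0, 0.0], "positive", corpus, num_negatives=2)
print(hard_negatives)  # ['near_miss', 'related']
```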
+
+ This setup is intended to improve discrimination between code snippets,
+ programming questions, edit examples, and technically similar retrieval
+ candidates.
+
+ ### Data Decontamination
+
+ To reduce benchmark contamination, the following overlaps were removed from
+ the training data **before** training:
+
+ * Overlaps between the custom code-comment pair data and the
+ **CodeSearchNet test split**
+ * Overlaps between the `commitpackft`-derived code-edit data and the
+ **CodeEditSearchRetrieval** benchmark evaluation data
+
+ For `CodeEditSearchRetrieval`, note that MTEB labels the evaluation split
+ `train`. This refers only to the official split name available for the task;
+ the evaluated examples were not included in this model's fine-tuning data.
+ The reported score should therefore be interpreted as **in-domain
+ generalization on held-out benchmark examples**, not as training-set
+ performance. Given the in-domain training distribution, it is also not
+ strictly zero-shot performance.
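
The README does not describe how overlaps were detected. One simple way to do this kind of filtering is exact matching on whitespace-normalized text, sketched below. The field names `query` and `document` and the helper names are hypothetical, and a real pipeline might use fuzzier matching such as n-gram overlap.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Hash of the text with runs of whitespace collapsed to single spaces."""
    normalized = re.sub(r"\s+", " ", text).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def decontaminate(train_rows, benchmark_texts, fields=("query", "document")):
    """Drop training rows in which any listed field matches a benchmark text."""
    blocked = {fingerprint(text) for text in benchmark_texts}
    return [
        row
        for row in train_rows
        if not any(fingerprint(row[field]) in blocked for field in fields)
    ]


# Hypothetical rows: the first document matches a benchmark item up to whitespace.
train_rows = [
    {"query": "sum a list", "document": "def total(xs):\n    return sum(xs)"},
    {"query": "reverse a string", "document": "def rev(s): return s[::-1]"},
]
benchmark_texts = ["def total(xs):  return sum(xs)"]

clean_rows = decontaminate(train_rows, benchmark_texts)
print(len(clean_rows))  # 1
```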
+
+ ## Intended Use
+
+ This model is intended for code-related retrieval tasks such as:
+
+ * Natural language to code search
+ * Code-to-code retrieval and similar function search
+ * Code-edit retrieval (matching edit intents to code changes)
+ * Retrieval over programming Q&A and technical questions
+ * Local semantic code search systems
+ * RAG systems over codebases and developer documentation
+
+ Example use cases include indexing functions, snippets, programming solutions,
+ StackOverflow-style answers, code review examples, and edit-related code
+ examples.
 
  ## Limitations
 
+ * The model is specialized for code-related retrieval and may underperform
+ general-purpose text embedding models on unrelated natural language tasks.
+ * Inputs longer than 1,024 tokens are truncated.
+ * Performance may vary by programming language, query style, and the
+ granularity of indexed code chunks; languages outside the eight supported
+ languages are untested.
+ * The model uses dense single-vector embeddings. For very fine-grained
+ matching, rerankers or late-interaction models may provide better precision.
+
+ ## Recommended Indexing Settings
+
+ Encode both queries and documents with normalized embeddings:
+
+ ```python
+ embeddings = model.encode(texts, normalize_embeddings=True)
+ ```
+
+ With normalized embeddings, dot product is equivalent to cosine similarity.
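
The equivalence is quick to verify on small vectors. The sketch below uses plain Python lists as stand-ins for embeddings and does not call the model:

```python
import math


def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
cosine_raw = cosine_similarity(a, b)
dot_normalized = dot(l2_normalize(a), l2_normalize(b))

# Both values agree (0.96 up to floating-point rounding).
print(math.isclose(cosine_raw, dot_normalized))  # True
```

This is why a vector index configured for inner-product search can serve cosine-similarity retrieval once the stored embeddings are normalized.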
+
+ For codebase search, indexing function-level or class-level chunks is usually
+ recommended. Very long files may exceed the 1,024-token context limit and
+ should be split into smaller semantic chunks.
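
For Python sources, function-level and class-level chunks can be extracted with the standard-library `ast` module. The following is a minimal sketch with a hypothetical helper name. It handles only top-level definitions, drops decorators, and does not check chunk length against the 1,024-token limit, which would require the model's tokenizer. Other languages need their own parsers.

```python
import ast


def top_level_chunks(source: str) -> list[tuple[str, str]]:
    """Return (name, source_text) for each top-level function or class."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            segment = ast.get_source_segment(source, node)
            if segment is not None:
                chunks.append((node.name, segment))
    return chunks


EXAMPLE_SOURCE = '''
import math

def area(radius):
    return math.pi * radius * radius

class Circle:
    def __init__(self, radius):
        self.radius = radius
'''

chunks = top_level_chunks(EXAMPLE_SOURCE)
chunk_names = [name for name, _ in chunks]
print(chunk_names)  # ['area', 'Circle']
```

Each chunk's text can then be passed to `model.encode` as one document, with the chunk name and file path kept alongside it as metadata.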
 
  ## Citation
 
+ If you use this model, please cite it together with the base model and
+ Sentence Transformers.
+
+ ```bibtex
+ @misc{nightowl_codeembedding,
+ title = {NightOwl-CodeEmbedding},
+ author = {Shuu12121},
+ year = {2026},
+ publisher = {Hugging Face},
+ url = {https://huggingface.co/Shuu12121/NightOwl-CodeEmbedding}
+ }
+ ```