Shuu12121 committed
Commit a0cbae3 · verified · 1 Parent(s): 11b8466

Update README.md

  1. README.md +179 -8
README.md CHANGED
@@ -31,6 +31,11 @@ does **not** require `query:` / `passage:` style prefixes.
31
  ## Highlights
32
 
33
  * Compact (150.8M parameters) yet competitive on CoIR-style code retrieval benchmarks
34
  * Covers **eight programming languages**, including Rust and TypeScript in
35
  addition to the six CodeSearchNet languages
36
  * Handles a wide range of code retrieval scenarios: NL-to-code search,
@@ -125,14 +130,171 @@ so the official `train` split is used for evaluation. These examples were
125
  **not** used for fine-tuning. See
126
  [Data Decontamination](#data-decontamination) for details.
127
 
128
- <!-- TODO: Add a comparison table with the base NightOwl model and/or
129
- similar-sized code embedding models (e.g. jina-embeddings-v2-base-code,
130
- CodeXEmbed) to give readers a reference point. -->
131
 
132
  Because the benchmark suite consists of in-domain code retrieval tasks related
133
  to the model's training distribution, these results should not be interpreted
134
  as strictly zero-shot performance.
135
 
136
  ## Training
137
 
138
  The model was trained with `CachedMultipleNegativesRankingLoss` using
@@ -146,9 +308,9 @@ bidirectional query-to-document and document-to-query objectives.
146
  | Loss | `CachedMultipleNegativesRankingLoss` |
147
  | Objective | Bidirectional retrieval training |
148
  | Hard-negative mining model | `Qwen/Qwen3-Embedding-0.6B` |
149
- | Epochs | 1 |
150
- | Learning rate | 6e-5 |
151
- | Batch size | 1024 |
152
 
153
  ### Training Data
154
 
@@ -214,12 +376,21 @@ examples.
214
 
215
  * The model is specialized for code-related retrieval and may underperform
216
  general-purpose text embedding models on unrelated natural language tasks.
217
- * Inputs longer than 1,024 tokens are truncated.
218
  * Performance may vary by programming language, query style, and the
219
  granularity of indexed code chunks; languages outside the eight supported
220
  languages are untested.
221
  * The model uses dense single-vector embeddings. For very fine-grained
222
- matching, rerankers or late-interaction models may provide better precision.
223
 
224
  ## Recommended Indexing Settings
225
 
 
31
  ## Highlights
32
 
33
  * Compact (150.8M parameters) yet competitive on CoIR-style code retrieval benchmarks
34
+ * On the [MTEB(Code, v1)](https://huggingface.co/spaces/mteb/leaderboard)
35
+ leaderboard it ranks **18th out of 241 models overall** and is the
36
+ **top-scoring single-vector model under 300M parameters** among scored entries
37
+ on the official board, ahead of many models an order of magnitude larger (see
38
+ [Leaderboard Standing](#leaderboard-standing))
39
  * Covers **eight programming languages**, including Rust and TypeScript in
40
  addition to the six CodeSearchNet languages
41
  * Handles a wide range of code retrieval scenarios: NL-to-code search,
 
130
  **not** used for fine-tuning. See
131
  [Data Decontamination](#data-decontamination) for details.
132
 
133
+ ### Leaderboard Standing
134
+
135
+ On the public [MTEB(Code, v1)](https://huggingface.co/spaces/mteb/leaderboard)
136
+ leaderboard, the model's task-mean score (70.91, i.e. the 12-task macro average
137
+ above ×100) places it as follows:
138
+
139
+ * **#18 of 241 models overall**, ahead of many models that are an order of
140
+ magnitude larger
141
+ * **#6 of 155 among sub-1B-parameter dense single-vector models**, and the
142
+ **smallest model in that top six**. The five models ranked above it are all
143
+ ≈0.33–0.6B parameters (`F2LLM-v2-0.6B/330M`, `pplx-embed-v1-0.6b`,
144
+ `C2LLM-0.5B`, `Qwen3-Embedding-0.6B`), i.e. 2–4× larger.
145
+ * **#1 among ranked dense single-vector models under 300M parameters** (the
146
+ leaderboard's small-model view)
147
+ * **#2 once late-interaction / multi-vector models are included**, behind only
148
+ `lightonai/LateOn-Code` (a multi-vector late-interaction model; see the
149
+ [head-to-head below](#head-to-head-vs-lateon-code))
150
+
151
+ > **Reading the numbers fairly.** MTEB(Code, v1) reports a *zero-shot %* for
152
+ > each model: the fraction of leaderboard tasks the model was *not* trained on.
153
+ > `NightOwl-CodeEmbedding` is **8%** zero-shot: it was trained on most of these
154
+ > task families, so its score reflects strong **in-domain** retrieval rather
155
+ > than zero-shot transfer. Models marked **100%** (e.g. `embeddinggemma-300m`,
156
+ > the `granite-embedding` r2 family, `Qwen3-Embedding`) are evaluated fully
157
+ > out-of-domain, so a raw score comparison across rows with different
158
+ > zero-shot % is not apples-to-apples. The fairest direct comparisons are to
159
+ > other code-specialized models at similar zero-shot levels (e.g.
160
+ > `LateOn-Code` at 8%, the `F2LLM` / `C2LLM` families at 8–58%).
161
+
162
+ ### Comparison with similar-sized models
163
+
164
+ The table below compares `NightOwl-CodeEmbedding` with other compact code /
165
+ general embedding models on MTEB(Code, v1), with a size ladder of larger models
166
+ for reference. Score is the leaderboard task mean (higher is better); the
167
+ *Zero-shot* column is the share of tasks the model did not train on.
168
+
169
+ | Model | Params | Type | Emb. dim | Max tokens | Zero-shot | MTEB(Code, v1) ↑ |
170
+ | ---------------------------------------------------- | ------: | -------------- | -------------- | ---------: | --------: | ---------------: |
171
+ | **`NightOwl-CodeEmbedding`** (this model) | 150.8M | single-vector | 768 | 1,024 | 8% | **70.91** |
172
+ | `codefuse-ai/F2LLM-v2-160M` | 159M | single-vector | 640 | 40,960 | 58% | 70.38 |
173
+ | `google/embeddinggemma-300m` | 308M | single-vector | 768 | 2,048 | 100% | 68.76 |
174
+ | `codefuse-ai/F2LLM-v2-80M` | 80M | single-vector | 320 | 40,960 | 58% | 67.97 |
175
+ | `ibm-granite/granite-embedding-311m-multilingual-r2` | 312M | single-vector | 768 | 8,192 | 100% | 63.84 |
176
+ | _Late-interaction (multi-vector) reference_ | | | | | | |
177
+ | `lightonai/LateOn-Code` | 149M | multi-vector | 128 (per-tok) | 2,048 | 8% | 74.12 |
178
+ | _Larger single-vector reference (size ladder)_ | | | | | | |
179
+ | `codefuse-ai/F2LLM-v2-0.6B` (#1 sub-1B) | 596M | single-vector | 1,024 | 40,960 | 58% | 77.41 |
180
+ | `Qwen/Qwen3-Embedding-0.6B` | 596M | single-vector | 1,024 | 32,768 | 100% | 75.42 |
181
+ | `codefuse-ai/F2LLM-v2-14B` (#1 overall) | 13.99B | single-vector | 5,120 | 40,960 | 58% | 80.75 |
182
+
183
+ Takeaways:
184
+
185
+ * Among compact **single-vector dense** models, `NightOwl-CodeEmbedding` is the
186
+ strongest entry in the leaderboard's small-model view while also being one of
187
+ the smallest, edging out `F2LLM-v2-160M` and clearly ahead of
188
+ `embeddinggemma-300m`.
189
+ * The sub-1B leaders (`F2LLM-v2-0.6B`, `Qwen3-Embedding-0.6B`) score ~4.5–6.5
190
+ points higher but are ~4× the parameter count and use larger embedding
191
+ dimensions, which directly increases index size and inference cost.
192
+ * The 14B model at the top of the overall board is ~10 points higher but ~93×
193
+ larger, sitting in a different deployment cost regime entirely.
194
+
195
+ ### Head-to-head vs LateOn-Code
196
+
197
+ `lightonai/LateOn-Code` is the only sub-300M model that outranks
198
+ `NightOwl-CodeEmbedding` once multi-vector models are included, so it is worth a
199
+ closer look. It is a **ColBERT-style late-interaction** model (built with PyLate
200
+ on ModernBERT-base): it stores **one 128-dimensional vector per token** and
201
+ scores with the MaxSim operator, rather than a single 768-d vector per text.
202
+ That buys accuracy at the cost of a larger index and a different retrieval path
203
+ (PyLate + a PLAID index), whereas `NightOwl` is a drop-in single-vector
204
+ `sentence-transformers` model.
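
To make the two scoring paths concrete, the sketch below contrasts single-vector cosine scoring with a ColBERT-style MaxSim score in NumPy. The arrays are random placeholders, and the token counts and function names are illustrative assumptions; nothing here is taken from either model's code. It also computes the raw float32 storage each layout would need per document.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def cosine_score(query_vec: np.ndarray, doc_vec: np.ndarray) -> float:
    """Single-vector scoring: one dot product between normalized vectors."""
    return float(l2_normalize(query_vec) @ l2_normalize(doc_vec))


def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """Late-interaction scoring: for every query token take its best-matching
    document token, then sum those maxima."""
    q = l2_normalize(query_tokens)      # (n_query_tokens, dim)
    d = l2_normalize(doc_tokens)        # (n_doc_tokens, dim)
    sim = q @ d.T                       # (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())


rng = np.random.default_rng(0)
n_query_tokens, n_doc_tokens = 8, 200   # hypothetical lengths

# One 768-d vector per text (single-vector layout).
single_q, single_d = rng.normal(size=768), rng.normal(size=768)
# One 128-d vector per token (multi-vector layout).
multi_q = rng.normal(size=(n_query_tokens, 128))
multi_d = rng.normal(size=(n_doc_tokens, 128))

print("cosine:", cosine_score(single_q, single_d))
print("maxsim:", maxsim_score(multi_q, multi_d))

# Uncompressed float32 storage per indexed document.
single_bytes = 768 * 4
multi_bytes = n_doc_tokens * 128 * 4
print(single_bytes, multi_bytes)
```

With these placeholder sizes the multi-vector layout stores roughly 33 times more raw floats per document than the single-vector one, before any compression an index may apply.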
205
+
206
+ Per-task NDCG@10 (×100) on MTEB(Code, v1); both models are code-specialized and
207
+ in-domain (8% zero-shot), so this is a like-for-like comparison. **Bold** marks
208
+ the higher score on each task.
209
+
210
+ | Task | NightOwl-CodeEmbedding | LateOn-Code (multi-vec) |
211
+ | -------------------------- | ---------------------: | ----------------------: |
212
+ | AppsRetrieval | 39.18 | **54.76** |
213
+ | COIRCodeSearchNetRetrieval | 84.26 | **86.57** |
214
+ | CodeEditSearchRetrieval | **74.81** | 64.99 |
215
+ | CodeFeedbackMT | 76.69 | **82.22** |
216
+ | CodeFeedbackST | 85.21 | **90.40** |
217
+ | CodeSearchNetCCRetrieval | **91.81** | 89.32 |
218
+ | CodeSearchNetRetrieval | 89.24 | **90.40** |
219
+ | CodeTransOceanContest | 75.95 | **87.44** |
220
+ | CodeTransOceanDL | 36.06 | **41.00** |
221
+ | CosQA | 42.81 | **45.23** |
222
+ | StackOverflowQA | 86.61 | **93.43** |
223
+ | SyntheticText2SQL | **68.27** | 63.67 |
224
+ | **Average** | 70.91 | **74.12** |
225
+
226
+ `LateOn-Code` wins on average, driven mostly by AppsRetrieval and the
227
+ feedback/translation/QA tasks. However, `NightOwl-CodeEmbedding` wins on three
228
+ tasks that map directly to its design focus:
229
+
230
+ * **CodeEditSearchRetrieval** (+9.8): matching edit intents to code changes;
231
+ `NightOwl`'s dedicated code-edit training shows here.
232
+ * **CodeSearchNetCCRetrieval** (+2.5): code-to-code / similar-function retrieval.
233
+ * **SyntheticText2SQL** (+4.6): NL-to-SQL retrieval.
234
+
235
+ So for single-vector code-edit and code-to-code retrieval specifically,
236
+ `NightOwl` is competitive with or ahead of a higher-average multi-vector model,
237
+ while keeping a standard dense-vector index. (LateOn-Code scores sourced from
238
+ the model's
239
+ [MTEB(Code, v1) table](https://huggingface.co/lightonai/LateOn-Code).)
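
The averages and per-task gaps quoted in this section can be re-derived from the table. The short script below is only a checking aid: the scores are copied from the head-to-head table above, and the helper name `macro_average` is introduced here for illustration.

```python
# Per-task NDCG@10 (x100), copied from the head-to-head table.
nightowl = {
    "AppsRetrieval": 39.18, "COIRCodeSearchNetRetrieval": 84.26,
    "CodeEditSearchRetrieval": 74.81, "CodeFeedbackMT": 76.69,
    "CodeFeedbackST": 85.21, "CodeSearchNetCCRetrieval": 91.81,
    "CodeSearchNetRetrieval": 89.24, "CodeTransOceanContest": 75.95,
    "CodeTransOceanDL": 36.06, "CosQA": 42.81,
    "StackOverflowQA": 86.61, "SyntheticText2SQL": 68.27,
}
lateon = {
    "AppsRetrieval": 54.76, "COIRCodeSearchNetRetrieval": 86.57,
    "CodeEditSearchRetrieval": 64.99, "CodeFeedbackMT": 82.22,
    "CodeFeedbackST": 90.40, "CodeSearchNetCCRetrieval": 89.32,
    "CodeSearchNetRetrieval": 90.40, "CodeTransOceanContest": 87.44,
    "CodeTransOceanDL": 41.00, "CosQA": 45.23,
    "StackOverflowQA": 93.43, "SyntheticText2SQL": 63.67,
}


def macro_average(scores: dict) -> float:
    """Unweighted mean over tasks (every task counts equally)."""
    return sum(scores.values()) / len(scores)


# Positive delta means NightOwl-CodeEmbedding scores higher on that task.
deltas = {task: round(nightowl[task] - lateon[task], 2) for task in nightowl}
nightowl_wins = [task for task, delta in deltas.items() if delta > 0]

print(round(macro_average(nightowl), 2), round(macro_average(lateon), 2))
print(nightowl_wins)
```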
240
 
241
  Because the benchmark suite consists of in-domain code retrieval tasks related
242
  to the model's training distribution, these results should not be interpreted
243
  as strictly zero-shot performance.
244
 
245
+ ## Base Model: the NightOwl Backbone
246
+
247
+ `NightOwl-CodeEmbedding` is fine-tuned from
248
+ [`Shuu12121/NightOwl`](https://huggingface.co/Shuu12121/NightOwl), a
249
+ ModernBERT-style code encoder that was **pre-trained from scratch**, including
250
+ its own tokenizer, rather than adapted from a general-purpose checkpoint. The
251
+ whole stack, from tokenization to the pre-training objective, is controlled for
252
+ code.
253
+
254
+ **Code-aware tokenizer.** NightOwl uses a custom 50,368-token BPE tokenizer in
255
+ which whitespace is tokenized **independently** of adjacent words, so
256
+ indentation is represented by its own tokens instead of being merged into
257
+ "leading-whitespace + word" pieces. In code the same identifier recurs at many
258
+ indentation depths; folding whitespace into those pieces would spend large parts
259
+ of the vocabulary on near-duplicate "indent + token" variants. Keeping
260
+ whitespace separate avoids that waste and lets the fixed vocabulary budget cover
261
+ more genuinely distinct subwords, while still representing indentation faithfully,
262
+ which matters for whitespace-significant languages such as Python.
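
The vocabulary argument above can be illustrated at the pre-tokenization level. The sketch below is not NightOwl's tokenizer: both regular expressions are hypothetical stand-ins, one that keeps whitespace runs as separate pieces and one that glues leading whitespace onto the following word. It counts how many distinct pieces the same keyword produces at three indentation depths.

```python
import re

# Hypothetical scheme A: whitespace runs are their own pieces.
SEPARATE_WS = re.compile(r"[ \t]+|\n|\w+|[^\w\s]")
# Hypothetical scheme B: leading whitespace is merged into the next word.
MERGED_WS = re.compile(r"[ \t]*\w+|[ \t]*[^\w\s]|\n|[ \t]+")


def pre_tokenize(text: str, pattern: re.Pattern) -> list:
    """Split text into pieces; both patterns cover every character."""
    return pattern.findall(text)


def variants(pieces: list, word: str) -> set:
    """Distinct pieces that are `word` plus optional surrounding whitespace."""
    return {piece for piece in pieces if piece.strip() == word}


snippet = "return x\n    return x\n        return x"

separate = pre_tokenize(snippet, SEPARATE_WS)
merged = pre_tokenize(snippet, MERGED_WS)

print(variants(separate, "return"))  # one piece regardless of depth
print(variants(merged, "return"))    # one piece per indentation depth
```

In the merged scheme every new indentation depth creates another candidate vocabulary entry for the same keyword, which is the duplication the paragraph above describes.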
263
+
264
+ **Two-phase pre-training with line-level masking.** NightOwl is trained with
265
+ masked-language modeling (`mlm_probability = 0.3`) in two phases:
266
+
267
+ * *Phase 1 (mixed pre-training):* standard random-token MLM over code, natural
268
+ language, and technical documentation (produces `NightOwl-Pre`).
269
+ * *Phase 2 (code-only continuation):* **line-level MLM**, where entire
270
+ source-code lines are masked instead of random tokens. This aligns the
271
+ pre-training objective with code search and retrieval, where the unit of
272
+ meaning is closer to a line or statement than an isolated token. The
273
+ recommended `NightOwl` checkpoint is this Phase-2 result.
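
A minimal sketch of what line-level masking could look like follows. The card does not say how lines are selected, so the policy here (shuffle the lines, then mask whole lines until about `mlm_probability` of the tokens are covered) is an assumption, as are the function name and the whitespace-split "tokens" used in place of real tokenizer output.

```python
import random

MASK = "[MASK]"


def line_level_mask(lines, mlm_probability=0.3, seed=0):
    """Mask entire lines instead of individual tokens.

    lines: list of token lists, one list per source line.
    Returns (masked_lines, indices_of_masked_lines).
    """
    rng = random.Random(seed)
    total_tokens = sum(len(line) for line in lines)
    budget = mlm_probability * total_tokens

    order = list(range(len(lines)))
    rng.shuffle(order)

    chosen, masked_tokens = set(), 0
    for idx in order:
        if masked_tokens >= budget:
            break
        if not lines[idx]:
            continue  # nothing to mask on an empty line
        chosen.add(idx)
        # Whole lines are taken, so the budget may be slightly overshot.
        masked_tokens += len(lines[idx])

    masked = [
        [MASK] * len(line) if i in chosen else list(line)
        for i, line in enumerate(lines)
    ]
    return masked, sorted(chosen)


code = "def add(a, b):\n    total = a + b\n    return total\nprint(add(1, 2))"
token_lines = [line.split() for line in code.splitlines()]
masked_lines, masked_idx = line_level_mask(token_lines)
print(masked_idx)
print(masked_lines)
```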
274
+
275
+ Backbone architecture (base):
276
+
277
+ | Property | Value |
278
+ | ------------------------------ | ----------------------------------------------------- |
279
+ | Architecture | ModernBERT (alternating local/global attention, RoPE) |
280
+ | Parameters | ≈150M |
281
+ | `hidden_size` / layers / heads | 768 / 19 / 12 |
282
+ | Vocabulary | 50,368 (custom code BPE) |
283
+ | Max sequence length | 1,024 (Phase 1) → 2,048 (Phase 2) |
284
+
285
+ Pre-training data mixes `bigcode/starcoder2data-extras` (Kaggle notebooks,
286
+ StackOverflow threads, GitHub issues, technical documentation, …) with
287
+ whole-file source from `Shuu12121/github-file-programs-dataset` across the eight
288
+ supported languages (Python, JavaScript, TypeScript, Java, Go, Rust, Ruby, PHP).
289
+ Long examples are split into chunks so all tokens are used rather than truncated.
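
Splitting instead of truncating can be sketched as below. This is an illustration under stated assumptions rather than the pre-training pipeline itself: the function name, the default window of 1,024 tokens, and the optional `overlap` parameter are all hypothetical choices.

```python
def chunk_tokens(token_ids, max_len=1024, overlap=0):
    """Split a token sequence into windows of at most `max_len` tokens so
    that every token lands in at least one chunk (nothing is truncated)."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")

    step = max_len - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # the last window already reaches the end
    return chunks


ids = list(range(2500))  # stand-in for a tokenized source file
plain = chunk_tokens(ids)
overlapping = chunk_tokens(ids, max_len=1024, overlap=24)
print([len(c) for c in plain])
print([len(c) for c in overlapping])
```

With no overlap the chunks concatenate back to the original sequence; a small overlap trades some duplicated tokens for context that spans chunk boundaries.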
290
+
291
+ As a raw backbone, before any embedding fine-tuning, NightOwl reaches **0.8436
292
+ average MRR** on MTEB `CodeSearchNetRetrieval` under a fixed SentenceTransformer
293
+ fine-tuning protocol, ahead of CodeBERT-base (0.7944), GraphCodeBERT-base
294
+ (0.8325), UniXCoder-base (0.8003), and ModernBERT-base (0.8182) evaluated the
295
+ same way. `NightOwl-CodeEmbedding` builds the retrieval model described in this
296
+ card on top of that backbone.
297
+
298
  ## Training
299
 
300
  The model was trained with `CachedMultipleNegativesRankingLoss` using
 
308
  | Loss | `CachedMultipleNegativesRankingLoss` |
309
  | Objective | Bidirectional retrieval training |
310
  | Hard-negative mining model | `Qwen/Qwen3-Embedding-0.6B` |
311
+ | Epochs | 1 |
312
+ | Learning rate | 6e-5 |
313
+ | Batch size | 1024 |
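
For readers unfamiliar with the objective, the NumPy sketch below shows the idea behind a bidirectional in-batch-negatives ranking loss: each query should rank its own document first among all documents in the batch, and each document should rank its own query first. It is not the `sentence-transformers` implementation (which additionally caches gradients so that large batches fit in memory); the function names and the `scale` value are assumptions made for illustration.

```python
import numpy as np


def log_softmax(x: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def bidirectional_in_batch_loss(queries, docs, scale=20.0) -> float:
    """Row i of `queries` is the positive for row i of `docs`; every other
    row in the batch acts as a negative, in both directions."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    sim = scale * (q @ d.T)                # (batch, batch) scaled cosine
    idx = np.arange(len(q))
    query_to_doc = -log_softmax(sim, axis=1)[idx, idx].mean()
    doc_to_query = -log_softmax(sim, axis=0)[idx, idx].mean()
    return float((query_to_doc + doc_to_query) / 2)


rng = np.random.default_rng(0)
queries = rng.normal(size=(16, 32))
aligned_docs = queries + 0.01 * rng.normal(size=(16, 32))  # matching pairs
shuffled_docs = np.roll(aligned_docs, 1, axis=0)           # mismatched pairs

aligned_loss = bidirectional_in_batch_loss(queries, aligned_docs)
shuffled_loss = bidirectional_in_batch_loss(queries, shuffled_docs)
print(aligned_loss, shuffled_loss)
```

The loss is low when each pair lines up on the diagonal of the similarity matrix and high when the pairing is shuffled, which is the behaviour the training objective rewards.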
314
 
315
  ### Training Data
316
 
 
376
 
377
  * The model is specialized for code-related retrieval and may underperform
378
  general-purpose text embedding models on unrelated natural language tasks.
379
+ * Inputs longer than 1,024 tokens are truncated. This is a shorter context
380
+ window than several models it competes with (e.g. the 8K+ token `F2LLM` and
381
+ `granite` models), so very long files must be chunked.
382
+ * MTEB(Code, v1) is largely in-domain for this model (8% zero-shot). On code
383
+ domains, query styles, or languages far from the training distribution,
384
+ expect lower performance than the leaderboard numbers suggest.
385
  * Performance may vary by programming language, query style, and the
386
  granularity of indexed code chunks; languages outside the eight supported
387
  languages are untested.
388
  * The model uses dense single-vector embeddings. For very fine-grained
389
+ matching, rerankers or late-interaction models (such as `LateOn-Code`) may
390
+ provide a higher average at the cost of a larger index and a non-standard
391
+ retrieval path. Even so, as the [head-to-head](#head-to-head-vs-lateon-code)
392
+ shows, single-vector `NightOwl` still leads on code-edit and code-to-code
393
+ retrieval.
394
 
395
  ## Recommended Indexing Settings
396