---
tags:
- sentence-transformers
- feature-extraction
- code-search
- code-embedding
- retrieval
- modernbert
- dense
base_model: Shuu12121/NightOwl
pipeline_tag: feature-extraction
library_name: sentence-transformers
license: apache-2.0
datasets:
- Shuu12121/coir_hard_negative_datasets_v3_kd
- Shuu12121/owl_code_search_hard_negative_datasets_V2_kd
- Shuu12121/codeedit_hard_negative_datasets_kd
---

# NightOwl-CodeEmbedding 🦉

`NightOwl-CodeEmbedding` is a compact 768-dimensional dense embedding model
specialized for code retrieval, code-edit retrieval, and technical question
answering.

The model is fine-tuned from
[`Shuu12121/NightOwl`](https://huggingface.co/Shuu12121/NightOwl), a
ModernBERT-based code model. It uses CLS pooling with cosine similarity and
does **not** require `query:` / `passage:` style prefixes.

## Highlights

* Compact (150.8M parameters) yet competitive on CoIR-style code retrieval benchmarks
* On the [MTEB(Code, v1)](https://huggingface.co/spaces/mteb/leaderboard)
  leaderboard it ranks **18th out of 241 models overall** and is the
  **top-scoring single-vector model under 300M parameters** among scored entries
  on the official board, ahead of many models an order of magnitude larger (see
  [Leaderboard Standing](#leaderboard-standing))
* Covers **eight programming languages**, including Rust and TypeScript in
  addition to the six CodeSearchNet languages
* Handles a wide range of code retrieval scenarios: NL-to-code search,
  code-to-code retrieval, **code-edit retrieval**, and technical QA
* Trained with hard negatives mined by `Qwen/Qwen3-Embedding-0.6B`
  (15 hard negatives per anchor)
* Decontaminated against CodeSearchNet test splits and the
  CodeEditSearchRetrieval benchmark (see [Data Decontamination](#data-decontamination))
* Drop-in compatible with `sentence-transformers`, Apache-2.0 license

## Supported Languages

The training data covers the six CodeSearchNet languages plus two additional
languages:

* Go, Java, JavaScript, PHP, Python, Ruby (CodeSearchNet languages)
* **Rust, TypeScript** (additional)

Performance on languages outside this set is not guaranteed and may vary.

## Usage

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Shuu12121/NightOwl-CodeEmbedding")

queries = ["Python function that sorts a list in descending order"]
documents = [
    "def sort_desc(values): return sorted(values, reverse=True)",
    "def average(values): return sum(values) / len(values)",
]

query_embeddings = model.encode(queries)
document_embeddings = model.encode(documents)

# Cosine similarity (embeddings are normalized internally by similarity())
scores = model.similarity(query_embeddings, document_embeddings)
print(scores)
```

## Model Details

| Property                | Value                |
| ----------------------- | -------------------- |
| Base model              | `Shuu12121/NightOwl` |
| Architecture            | ModernBERT           |
| Parameters              | 150,779,136          |
| Embedding dimension     | 768                  |
| Pooling                 | CLS pooling          |
| Maximum sequence length | 1,024 tokens         |
| Similarity              | Cosine similarity    |
| Query/document prefixes | Not required         |
| Weight dtype            | FP32                 |
| Weight memory           | 575 MiB              |
| License                 | Apache-2.0           |
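
The weight-memory figure follows from the parameter count and dtype in the table. A quick arithmetic check, assuming every parameter is stored as a 4-byte FP32 value and ignoring file metadata or runtime buffers:

```python
# Rough weight-memory estimate from the Model Details table.
# Assumption: every parameter is a 4-byte FP32 value; nothing else is counted.
NUM_PARAMETERS = 150_779_136
BYTES_PER_FP32 = 4

weight_bytes = NUM_PARAMETERS * BYTES_PER_FP32
weight_mib = weight_bytes / (1024 ** 2)

print(f"{weight_bytes:,} bytes is about {weight_mib:.0f} MiB")
```

Loading the weights in a 16-bit dtype would roughly halve this figure, though the card only reports FP32.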

## MTEB Results

The model was evaluated with MTEB on code-related retrieval and technical QA
tasks.

Evaluation setup:

* Model revision: `c7c8a57b9539297e192d5cf39b9aecf1fb376edd`
* MTEB version: `2.15.1`
* Metric: `NDCG@10`
* Hardware: NVIDIA GeForce RTX 5090
* Batch size: 64

Multi-subset task scores are reported as macro averages.

| Task                             |   Split |     NDCG@10 |
| -------------------------------- | ------: | ----------: |
| AppsRetrieval                    |    test |     0.39177 |
| COIRCodeSearchNetRetrieval       |    test |     0.84264 |
| CodeEditSearchRetrieval          |  train¹ |     0.74808 |
| CodeFeedbackMT                   |    test |     0.76690 |
| CodeFeedbackST                   |    test |     0.85207 |
| CodeSearchNetCCRetrieval         |    test |     0.91805 |
| CodeSearchNetRetrieval           |    test |     0.89239 |
| CodeTransOceanContest            |    test |     0.75953 |
| CodeTransOceanDL                 |    test |     0.36057 |
| CosQA                            |    test |     0.42810 |
| StackOverflowQA                  |    test |     0.86608 |
| SyntheticText2SQL                |    test |     0.68266 |
| **Macro average, all 12 tasks**  |         | **0.70907** |
| **CoIR macro average, 10 tasks** |         | **0.68684** |

¹ `CodeEditSearchRetrieval` does not provide a standard `test` split in MTEB,
so the official `train` split is used for evaluation. These examples were
**not** used for fine-tuning. See
[Data Decontamination](#data-decontamination) for details.
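
The two averages in the table can be reproduced from the per-task rows. The sketch below assumes the 10-task CoIR average leaves out `CodeEditSearchRetrieval` and `CodeSearchNetRetrieval`; the card does not name the excluded tasks, but this choice reproduces the reported value.

```python
# NDCG@10 values copied from the results table above.
scores = {
    "AppsRetrieval": 0.39177,
    "COIRCodeSearchNetRetrieval": 0.84264,
    "CodeEditSearchRetrieval": 0.74808,
    "CodeFeedbackMT": 0.76690,
    "CodeFeedbackST": 0.85207,
    "CodeSearchNetCCRetrieval": 0.91805,
    "CodeSearchNetRetrieval": 0.89239,
    "CodeTransOceanContest": 0.75953,
    "CodeTransOceanDL": 0.36057,
    "CosQA": 0.42810,
    "StackOverflowQA": 0.86608,
    "SyntheticText2SQL": 0.68266,
}

# Assumption: these two tasks are the ones outside the 10-task CoIR subset.
NON_COIR_TASKS = {"CodeEditSearchRetrieval", "CodeSearchNetRetrieval"}


def macro_average(values):
    """Unweighted mean: every task counts equally, regardless of its size."""
    values = list(values)
    return sum(values) / len(values)


all_tasks_avg = macro_average(scores.values())
coir_avg = macro_average(v for k, v in scores.items() if k not in NON_COIR_TASKS)

print(f"all 12 tasks:  {all_tasks_avg:.5f}")
print(f"CoIR 10 tasks: {coir_avg:.5f}")
```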

### Leaderboard Standing

On the public [MTEB(Code, v1)](https://huggingface.co/spaces/mteb/leaderboard)
leaderboard, the model's task-mean score (70.91, i.e. the 12-task macro average
above ×100) places it as follows:

* **#18 of 241 models overall**, ahead of many models that are an order of
  magnitude larger
* **#6 of 155 among sub-1B-parameter dense single-vector models**, and the
  **smallest model in that top six**. The five models ranked above it are all
  ≈0.33–0.6B parameters (`F2LLM-v2-0.6B/330M`, `pplx-embed-v1-0.6b`,
  `C2LLM-0.5B`, `Qwen3-Embedding-0.6B`), i.e. 2–4× larger.
* **#1 among ranked dense single-vector models under 300M parameters** (the
  leaderboard's small-model view)
* **#2 once late-interaction / multi-vector models are included**, behind only
  `lightonai/LateOn-Code` (a multi-vector late-interaction model; see the
  [head-to-head below](#head-to-head-vs-lateon-code))

> **Reading the numbers fairly.** MTEB(Code, v1) reports a *zero-shot %* for
> each model: the fraction of leaderboard tasks the model was *not* trained on.
> `NightOwl-CodeEmbedding` is **8%** zero-shot: it was trained on most of these
> task families, so its score reflects strong **in-domain** retrieval rather
> than zero-shot transfer. Models marked **100%** (e.g. `embeddinggemma-300m`,
> the `granite-embedding` r2 family, `Qwen3-Embedding`) are evaluated fully
> out-of-domain, so a raw score comparison across rows with different
> zero-shot % is not apples-to-apples. The fairest direct comparisons are to
> other code-specialized models at similar zero-shot levels (e.g.
> `LateOn-Code` at 8%, the `F2LLM` / `C2LLM` families at 8–58%).
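
The 8% figure is consistent with the task lists in this card. The sketch below assumes the leaderboard computes zero-shot % as the rounded share of benchmark tasks that are absent from the declared training tasks; the exact leaderboard logic is not described here.

```python
# The 12 tasks evaluated in the MTEB Results table.
benchmark_tasks = [
    "AppsRetrieval", "COIRCodeSearchNetRetrieval", "CodeEditSearchRetrieval",
    "CodeFeedbackMT", "CodeFeedbackST", "CodeSearchNetCCRetrieval",
    "CodeSearchNetRetrieval", "CodeTransOceanContest", "CodeTransOceanDL",
    "CosQA", "StackOverflowQA", "SyntheticText2SQL",
]

# Task families named in the Training Data section of this card.
trained_on = {
    "AppsRetrieval", "COIRCodeSearchNetRetrieval", "CodeFeedbackMT",
    "CodeFeedbackST", "CodeSearchNetCCRetrieval", "CodeSearchNetRetrieval",
    "CodeTransOceanContest", "CodeTransOceanDL", "CosQA", "StackOverflowQA",
    "SyntheticText2SQL",
}

# Assumption: zero-shot % = unseen tasks / all tasks, rounded to a whole percent.
unseen = [task for task in benchmark_tasks if task not in trained_on]
zero_shot_pct = round(100 * len(unseen) / len(benchmark_tasks))

print(unseen, f"{zero_shot_pct}%")
```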

### Comparison with similar-sized models

The table below compares `NightOwl-CodeEmbedding` with other compact code /
general embedding models on MTEB(Code, v1), with a size ladder of larger models
for reference. Score is the leaderboard task mean (higher is better); the
*Zero-shot* column is the share of tasks the model did not train on.

| Model                                                | Params  | Type           | Emb. dim       | Max tokens | Zero-shot | MTEB(Code, v1) ↑ |
| ---------------------------------------------------- | ------: | -------------- | -------------- | ---------: | --------: | ---------------: |
| **`NightOwl-CodeEmbedding`** (this model)            |  150.8M | single-vector  | 768            |      1,024 |        8% |        **70.91** |
| `codefuse-ai/F2LLM-v2-160M`                          |    159M | single-vector  | 640            |     40,960 |       58% |            70.38 |
| `google/embeddinggemma-300m`                         |    308M | single-vector  | 768            |      2,048 |      100% |            68.76 |
| `codefuse-ai/F2LLM-v2-80M`                           |     80M | single-vector  | 320            |     40,960 |       58% |            67.97 |
| `ibm-granite/granite-embedding-311m-multilingual-r2` |    312M | single-vector  | 768            |      8,192 |      100% |            63.84 |
| _Late-interaction (multi-vector) reference_          |         |                |                |            |           |                  |
| `lightonai/LateOn-Code`                              |    149M | multi-vector   | 128 (per-tok)  |      2,048 |        8% |            74.12 |
| _Larger single-vector reference (size ladder)_       |         |                |                |            |           |                  |
| `codefuse-ai/F2LLM-v2-0.6B` (#1 sub-1B)              |    596M | single-vector  | 1,024          |     40,960 |       58% |            77.41 |
| `Qwen/Qwen3-Embedding-0.6B`                          |    596M | single-vector  | 1,024          |     32,768 |      100% |            75.42 |
| `codefuse-ai/F2LLM-v2-14B` (#1 overall)              |  13.99B | single-vector  | 5,120          |     40,960 |       58% |            80.75 |

Takeaways:

* Among compact **single-vector dense** models, `NightOwl-CodeEmbedding` is the
  strongest entry in the leaderboard's small-model view while also being one of
  the smallest, edging out `F2LLM-v2-160M` and clearly ahead of
  `embeddinggemma-300m`.
* The sub-1B leaders (`F2LLM-v2-0.6B`, `Qwen3-Embedding-0.6B`) score ~4.5–6.5
  points higher but are ~4× the parameter count and use larger embedding
  dimensions, which directly increases index size and inference cost.
* The 14B model at the top of the overall board is ~10 points higher but ~93×
  larger, sitting in a different deployment cost regime entirely.
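
To make the index-size point concrete, here is a back-of-the-envelope estimate. It assumes a flat index that stores one uncompressed FP32 vector per document; real vector stores add overhead or apply quantization, so treat the numbers as relative rather than exact.

```python
# Assumption: flat index, one FP32 vector per document, no compression.
BYTES_PER_FLOAT32 = 4


def flat_index_gib(num_documents: int, embedding_dim: int) -> float:
    """Raw vector storage in GiB for a single-vector index."""
    return num_documents * embedding_dim * BYTES_PER_FLOAT32 / (1024 ** 3)


# Embedding dimensions taken from the comparison table above.
for name, dim in [
    ("NightOwl-CodeEmbedding", 768),
    ("F2LLM-v2-0.6B", 1024),
    ("F2LLM-v2-14B", 5120),
]:
    print(f"{name}: {flat_index_gib(1_000_000, dim):.2f} GiB per million documents")
```

Storage grows linearly with the embedding dimension, so a 1,024-d model needs a third more space than a 768-d one for the same corpus.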

### Head-to-head vs LateOn-Code

`lightonai/LateOn-Code` is the only model under 300M parameters that outranks
`NightOwl-CodeEmbedding` once multi-vector models are included, so it is worth a
closer look. It is a **ColBERT-style late-interaction** model (built with PyLate
on ModernBERT-base): it stores **one 128-dimensional vector per token** and
scores with the MaxSim operator, rather than a single 768-d vector per text.
That buys accuracy at the cost of a larger index and a different retrieval path
(PyLate + a PLAID index), whereas `NightOwl` is a drop-in single-vector
`sentence-transformers` model.
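
The difference between the two scoring schemes can be sketched in a few lines of plain Python. This is a toy illustration of cosine scoring versus MaxSim on hand-written vectors, not the PyLate or `sentence-transformers` implementation; all function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def single_vector_score(query_vec, doc_vec):
    """Cosine similarity between one query vector and one document vector."""
    return dot(normalize(query_vec), normalize(doc_vec))


def maxsim_score(query_token_vecs, doc_token_vecs):
    """Late interaction: each query token takes its best document token; sum the maxima."""
    doc = [normalize(d) for d in doc_token_vecs]
    return sum(max(dot(normalize(q), d) for d in doc) for q in query_token_vecs)


# Toy 2-d vectors standing in for real embeddings.
print(single_vector_score([1.0, 0.0], [1.0, 1.0]))
print(maxsim_score([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
```

MaxSim needs every document token vector at query time, which is why the index is larger than one vector per text.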

Per-task NDCG@10 (×100) on MTEB(Code, v1); both models are code-specialized and
in-domain (8% zero-shot), so this is a like-for-like comparison. **Bold** marks
the higher score on each task.

| Task                       | NightOwl-CodeEmbedding | LateOn-Code (multi-vec) |
| -------------------------- | ---------------------: | ----------------------: |
| AppsRetrieval              |                  39.18 |               **54.76** |
| COIRCodeSearchNetRetrieval |                  84.26 |               **86.57** |
| CodeEditSearchRetrieval    |              **74.81** |                   64.99 |
| CodeFeedbackMT             |                  76.69 |               **82.22** |
| CodeFeedbackST             |                  85.21 |               **90.40** |
| CodeSearchNetCCRetrieval   |              **91.81** |                   89.32 |
| CodeSearchNetRetrieval     |                  89.24 |               **90.40** |
| CodeTransOceanContest      |                  75.95 |               **87.44** |
| CodeTransOceanDL           |                  36.06 |               **41.00** |
| CosQA                      |                  42.81 |               **45.23** |
| StackOverflowQA            |                  86.61 |               **93.43** |
| SyntheticText2SQL          |              **68.27** |                   63.67 |
| **Average**                |                  70.91 |               **74.12** |

`LateOn-Code` wins on average, driven mostly by AppsRetrieval and the
feedback/translation/QA tasks. However, `NightOwl-CodeEmbedding` wins on three
tasks that map directly to its design focus:

* **CodeEditSearchRetrieval** (+9.8): matching edit intents to code changes,
  where `NightOwl`'s dedicated code-edit training shows.
* **CodeSearchNetCCRetrieval** (+2.5): code-to-code / similar-function retrieval.
* **SyntheticText2SQL** (+4.6): NL-to-SQL retrieval.

So for single-vector code-edit and code-to-code retrieval specifically,
`NightOwl` is competitive with or ahead of a higher-average multi-vector model,
while keeping a standard dense-vector index. (LateOn-Code scores sourced from
the model's
[MTEB(Code, v1) table](https://huggingface.co/lightonai/LateOn-Code).)

Because the benchmark suite consists of in-domain code retrieval tasks related
to the model's training distribution, these results should not be interpreted
as strictly zero-shot performance.

## Base Model: the NightOwl Backbone

`NightOwl-CodeEmbedding` is fine-tuned from
[`Shuu12121/NightOwl`](https://huggingface.co/Shuu12121/NightOwl), a
ModernBERT-style code encoder that was **pre-trained from scratch**, including
its own tokenizer, rather than adapted from a general-purpose checkpoint. The
whole stack, from tokenization to the pre-training objective, is controlled for
code.

**Code-aware tokenizer.** NightOwl uses a custom 50,368-token BPE tokenizer in
which whitespace is tokenized **independently** of adjacent words, so
indentation is represented by its own tokens instead of being merged into
"leading-whitespace + word" pieces. In code the same identifier recurs at many
indentation depths; folding whitespace into those pieces would spend large parts
of the vocabulary on near-duplicate "indent + token" variants. Keeping
whitespace separate avoids that waste and lets the fixed vocabulary budget cover
more genuinely distinct subwords, while still representing indentation
faithfully, which matters for whitespace-significant languages such as Python.
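
The idea of keeping whitespace separate can be illustrated with a toy pre-tokenizer. This is only a sketch of the splitting principle using a hypothetical regular expression; it is not the NightOwl tokenizer, and it does not perform any BPE merges.

```python
import re

# Hypothetical splitting rule: whitespace runs, word runs, and single
# punctuation marks each become their own piece.
PIECE_PATTERN = re.compile(r"\s+|\w+|[^\w\s]")


def split_pieces(code: str) -> list[str]:
    """Split code so that indentation never fuses with the following word."""
    return PIECE_PATTERN.findall(code)


print(split_pieces("    return x"))
print(split_pieces("        return x"))
```

At both indentation depths `return` is the same piece; only the whitespace piece differs, so no vocabulary entries are spent on indent-plus-word variants.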

**Two-phase pre-training with line-level masking.** NightOwl is trained with
masked-language modeling (`mlm_probability = 0.3`) in two phases:

* *Phase 1 (mixed pre-training):* standard random-token MLM over code, natural
  language, and technical documentation (produces `NightOwl-Pre`).
* *Phase 2 (code-only continuation):* **line-level MLM**, where entire
  source-code lines are masked instead of random tokens. This aligns the
  pre-training objective with code search and retrieval, where the unit of
  meaning is closer to a line or statement than an isolated token. The
  recommended `NightOwl` checkpoint is this Phase-2 result.
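
A minimal sketch of line-level masking, to contrast it with random-token masking. The card does not specify how lines are selected or how `mlm_probability` maps to lines, so the per-line selection probability, the whitespace-based token split, and the `[MASK]` string here are all assumptions for illustration.

```python
import random

MASK = "[MASK]"  # placeholder mask string, assumed for this sketch


def mask_lines(source: str, line_prob: float, rng: random.Random):
    """Mask whole lines: every token on a selected line is replaced by MASK.

    Returns the masked lines and the indices of the lines that were masked.
    """
    masked, chosen = [], []
    for index, line in enumerate(source.splitlines()):
        tokens = line.split()  # crude whitespace split, not a real tokenizer
        if tokens and rng.random() < line_prob:
            indent = line[: len(line) - len(line.lstrip())]
            masked.append(indent + " ".join([MASK] * len(tokens)))
            chosen.append(index)
        else:
            masked.append(line)
    return masked, chosen


code = "def add(a, b):\n    total = a + b\n    return total"
masked_lines, masked_indices = mask_lines(code, line_prob=0.3, rng=random.Random(0))
print("\n".join(masked_lines))
```

The model then has to reconstruct a full statement from its surrounding lines, rather than fill in isolated tokens.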

Backbone architecture (base):

| Property                       | Value                                                 |
| ------------------------------ | ----------------------------------------------------- |
| Architecture                   | ModernBERT (alternating local/global attention, RoPE) |
| Parameters                     | ≈150M                                                 |
| `hidden_size` / layers / heads | 768 / 19 / 12                                         |
| Vocabulary                     | 50,368 (custom code BPE)                              |
| Max sequence length            | 1,024 (Phase 1) → 2,048 (Phase 2)                     |

Pre-training data mixes `bigcode/starcoder2data-extras` (Kaggle notebooks,
StackOverflow threads, GitHub issues, technical documentation, …) with
whole-file source from `Shuu12121/github-file-programs-dataset` across the eight
supported languages (Python, JavaScript, TypeScript, Java, Go, Rust, Ruby, PHP).
Long examples are split into chunks so all tokens are used rather than truncated.
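
Splitting rather than truncating can be sketched as follows. This assumes non-overlapping fixed-size windows over an already-tokenized example and ignores special tokens; the actual pre-training pipeline may differ.

```python
def chunk_token_ids(token_ids: list[int], max_length: int) -> list[list[int]]:
    """Split a token sequence into consecutive windows of at most max_length tokens."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    return [token_ids[i:i + max_length] for i in range(0, len(token_ids), max_length)]


# A 2,500-token example yields three chunks instead of losing 1,476 tokens.
chunks = chunk_token_ids(list(range(2500)), max_length=1024)
print([len(chunk) for chunk in chunks])  # [1024, 1024, 452]
```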

As a raw backbone, before any embedding fine-tuning, NightOwl reaches **0.8436
average MRR** on MTEB `CodeSearchNetRetrieval` under a fixed SentenceTransformer
fine-tuning protocol, ahead of CodeBERT-base (0.7944), GraphCodeBERT-base
(0.8325), UniXCoder-base (0.8003), and ModernBERT-base (0.8182) evaluated the
same way. `NightOwl-CodeEmbedding` builds the retrieval model described in this
card on top of that backbone.

## Training

The model was trained with `CachedMultipleNegativesRankingLoss` using
bidirectional query-to-document and document-to-query objectives.

| Property                   | Value                                     |
| -------------------------- | ----------------------------------------- |
| Training samples           | 2,534,400                                 |
| Positives per anchor       | 1                                         |
| Negatives per anchor       | 15                                        |
| Loss                       | `CachedMultipleNegativesRankingLoss`      |
| Objective                  | Bidirectional retrieval training          |
| Hard-negative mining model | `Qwen/Qwen3-Embedding-0.6B`               |
| Epochs                     | 1                                         |
| Learning rate              | 6e-5                                      |
| Batch size                 | 1024                                      |

### Training Data

The training data is a mixture of:

1. **Public code-retrieval datasets** covering the following MTEB(Code, v1)
   task families (the ten CoIR tasks plus CodeSearchNetRetrieval):
   AppsRetrieval, COIRCodeSearchNetRetrieval, CodeFeedbackMT, CodeFeedbackST,
   CodeSearchNetCCRetrieval, CodeSearchNetRetrieval, CodeTransOceanContest,
   CodeTransOceanDL, CosQA, StackOverflowQA, and SyntheticText2SQL.
2. **Custom code-comment pair data** consisting of code snippets paired with
   natural-language description comments across the eight supported languages
   (the six CodeSearchNet languages plus Rust and TypeScript).
3. **Code-edit data** derived from `commitpackft`, pairing edit intents with
   code changes.

All datasets were constructed as hard-negative retrieval datasets: for each
anchor, one positive and fifteen hard negatives were used. Hard negatives were
mined with
[`Qwen/Qwen3-Embedding-0.6B`](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B),
which retrieves semantically similar but non-matching candidates, producing
negatives that are more difficult than random negatives. The mining model is
used only during dataset construction and is not required at inference time.
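
Hard-negative mining of this kind can be sketched as a nearest-neighbor search that skips the known positive. The function below is a hypothetical illustration over pre-computed vectors; in the setup described above the vectors would come from `Qwen/Qwen3-Embedding-0.6B`, and any additional filtering the author applied is not documented here.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mine_hard_negatives(anchor_vec, candidate_vecs, positive_index, k=15):
    """Return indices of the k candidates most similar to the anchor, excluding the positive."""
    ranked = sorted(
        (i for i in range(len(candidate_vecs)) if i != positive_index),
        key=lambda i: cosine(anchor_vec, candidate_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-d vectors standing in for real embeddings; index 0 is the positive.
anchor = [1.0, 0.0]
candidates = [[1.0, 0.1], [1.0, 0.2], [0.0, 1.0], [1.0, 1.0]]
print(mine_hard_negatives(anchor, candidates, positive_index=0, k=2))  # [1, 3]
```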

This setup is intended to improve discrimination between code snippets,
programming questions, edit examples, and technically similar retrieval
candidates.

### Data Decontamination

To reduce benchmark contamination, the following overlaps were removed from
the training data **before** training:

* Overlaps between the custom code-comment pair data and the
  **CodeSearchNet test split**
* Overlaps between the `commitpackft`-derived code-edit data and the
  **CodeEditSearchRetrieval** benchmark evaluation data

For `CodeEditSearchRetrieval`, note that MTEB labels the evaluation split
`train`. This refers only to the official split name available for the task;
the evaluated examples were not included in this model's fine-tuning data.
The reported score should therefore be interpreted as **in-domain
generalization on held-out benchmark examples**, not as training-set
performance. Given the in-domain training distribution, it is also not
strictly zero-shot performance.
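
The card does not say how overlaps were detected. One simple approach, shown here purely as an assumed example, is exact matching on whitespace-normalized text; near-duplicate detection would need something stronger.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash the text with whitespace collapsed, so formatting alone cannot hide a match."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def remove_overlaps(train_texts, benchmark_texts):
    """Drop every training text whose fingerprint also appears in the benchmark."""
    blocked = {fingerprint(text) for text in benchmark_texts}
    return [text for text in train_texts if fingerprint(text) not in blocked]


train = ["def f(x):\n    return x", "def g(): pass"]
benchmark = ["def f(x): return x"]
print(remove_overlaps(train, benchmark))  # ['def g(): pass']
```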

## Intended Use

This model is intended for code-related retrieval tasks such as:

* Natural language to code search
* Code-to-code retrieval and similar function search
* Code-edit retrieval (matching edit intents to code changes)
* Retrieval over programming Q&A and technical questions
* Local semantic code search systems
* RAG systems over codebases and developer documentation

Example use cases include indexing functions, snippets, programming solutions,
StackOverflow-style answers, code review examples, and edit-related code
examples.

## Limitations

* The model is specialized for code-related retrieval and may underperform
  general-purpose text embedding models on unrelated natural language tasks.
* Inputs longer than 1,024 tokens are truncated. This is a shorter context
  window than several models it competes with (e.g. the 8K+ token `F2LLM` and
  `granite` models), so very long files must be chunked.
* MTEB(Code, v1) is largely in-domain for this model (8% zero-shot). On code
  domains, query styles, or languages far from the training distribution,
  expect lower performance than the leaderboard numbers suggest.
* Performance may vary by programming language, query style, and the
  granularity of indexed code chunks; languages outside the eight supported
  languages are untested.
* The model uses dense single-vector embeddings. For very fine-grained
  matching, rerankers or late-interaction models (such as `LateOn-Code`) may
  provide a higher average at the cost of a larger index and a non-standard
  retrieval path. As the [head-to-head](#head-to-head-vs-lateon-code) shows,
  however, single-vector `NightOwl` still leads on code-edit and code-to-code
  retrieval.

## Recommended Indexing Settings

Encode both queries and documents with normalized embeddings:

```python
embeddings = model.encode(texts, normalize_embeddings=True)
```

With normalized embeddings, dot product is equivalent to cosine similarity.
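
The equivalence is easy to check on toy vectors. The helpers below are plain-Python stand-ins written for this example, not library functions:

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


a, b = [3.0, 4.0], [4.0, 3.0]
dot_of_normalized = dot(l2_normalize(a), l2_normalize(b))

# Both values agree up to floating-point rounding.
print(dot_of_normalized, cosine(a, b))
```

This is why an inner-product index can serve cosine-similarity search once the stored vectors are normalized.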

For codebase search, indexing function-level or class-level chunks is usually
recommended. Very long files may exceed the 1,024-token context limit and
should be split into smaller semantic chunks.
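
For Python sources, function-level and class-level chunks can be extracted with the standard-library `ast` module. This is a minimal sketch that handles top-level definitions only; the function name is hypothetical, and other languages would need their own parser.

```python
import ast


def function_level_chunks(source: str) -> list[tuple[str, str]]:
    """Return (name, source_text) for each top-level function or class in a Python file."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment recovers the exact text spanned by the node.
            chunks.append((node.name, ast.get_source_segment(source, node)))
    return chunks


sample = (
    "import os\n\n"
    "def a():\n    return 1\n\n"
    "class B:\n    def m(self):\n        return 2\n"
)
for name, text in function_level_chunks(sample):
    print(name, len(text))
```

Each chunk can then be encoded on its own; chunks that still exceed 1,024 tokens would need a further split.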

## Citation

If you use this model, please cite it together with the base model and
Sentence Transformers.

```bibtex
@misc{nightowl_codeembedding,
  title = {NightOwl-CodeEmbedding},
  author = {Shuu12121},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/Shuu12121/NightOwl-CodeEmbedding}
}
```