Feature Extraction
sentence-transformers
Safetensors
modernbert
code-search
code-embedding
retrieval
dense
text-embeddings-inference
Instructions to use Shuu12121/NightOwl-CodeEmbedding with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- sentence-transformers
How to use Shuu12121/NightOwl-CodeEmbedding with sentence-transformers:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Shuu12121/NightOwl-CodeEmbedding")

sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]
embeddings = model.encode(sentences)

similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```
- Notebooks
- Google Colab
- Kaggle
Update README.md
README.md
CHANGED
|
@@ -31,6 +31,11 @@ does **not** require `query:` / `passage:` style prefixes.
|
|
| 31 |
## Highlights
|
| 32 |
|
| 33 |
* Compact (150.8M parameters) yet competitive on CoIR-style code retrieval benchmarks
|
| 34 |
* Covers **eight programming languages**, including Rust and TypeScript in
|
| 35 |
addition to the six CodeSearchNet languages
|
| 36 |
* Handles a wide range of code retrieval scenarios: NL-to-code search,
|
|
@@ -125,14 +130,171 @@ so the official `train` split is used for evaluation. These examples were
|
|
| 125 |
**not** used for fine-tuning. See
|
| 126 |
[Data Decontamination](#data-decontamination) for details.
|
| 127 |
|
| 128 |
-
|
| 129 |
-
|
| 130 |
-
|
| 131 |
|
| 132 |
Because the benchmark suite consists of in-domain code retrieval tasks related
|
| 133 |
to the model's training distribution, these results should not be interpreted
|
| 134 |
as strictly zero-shot performance.
|
| 135 |
|
| 136 |
## Training
|
| 137 |
|
| 138 |
The model was trained with `CachedMultipleNegativesRankingLoss` using
|
|
@@ -146,9 +308,9 @@ bidirectional query-to-document and document-to-query objectives.
|
|
| 146 |
| Loss | `CachedMultipleNegativesRankingLoss` |
|
| 147 |
| Objective | Bidirectional retrieval training |
|
| 148 |
| Hard-negative mining model | `Qwen/Qwen3-Embedding-0.6B` |
|
| 149 |
-
| Epochs | 1
|
| 150 |
-
| Learning rate | 6e-5
|
| 151 |
-
| Batch size | 1024
|
| 152 |
|
| 153 |
### Training Data
|
| 154 |
|
|
@@ -214,12 +376,21 @@ examples.
|
|
| 214 |
|
| 215 |
* The model is specialized for code-related retrieval and may underperform
|
| 216 |
general-purpose text embedding models on unrelated natural language tasks.
|
| 217 |
-
* Inputs longer than 1,024 tokens are truncated.
|
| 218 |
* Performance may vary by programming language, query style, and the
|
| 219 |
granularity of indexed code chunks; languages outside the eight supported
|
| 220 |
languages are untested.
|
| 221 |
* The model uses dense single-vector embeddings. For very fine-grained
|
| 222 |
-
matching, rerankers or late-interaction models
|
| 223 |
|
| 224 |
## Recommended Indexing Settings
|
| 225 |
|
|
|
|
| 31 |
## Highlights
|
| 32 |
|
| 33 |
* Compact (150.8M parameters) yet competitive on CoIR-style code retrieval benchmarks
|
| 34 |
+
* On the [MTEB(Code, v1)](https://huggingface.co/spaces/mteb/leaderboard)
|
| 35 |
+
leaderboard it ranks **18th out of 241 models overall** and is the
|
| 36 |
+
**top-scoring single-vector model under 300M parameters** among scored entries
|
| 37 |
+
on the official board, ahead of many models an order of magnitude larger (see
|
| 38 |
+
[Leaderboard Standing](#leaderboard-standing))
|
| 39 |
* Covers **eight programming languages**, including Rust and TypeScript in
|
| 40 |
addition to the six CodeSearchNet languages
|
| 41 |
* Handles a wide range of code retrieval scenarios: NL-to-code search,
|
|
|
|
| 130 |
**not** used for fine-tuning. See
|
| 131 |
[Data Decontamination](#data-decontamination) for details.
|
| 132 |
|
| 133 |
+
### Leaderboard Standing
|
| 134 |
+
|
| 135 |
+
On the public [MTEB(Code, v1)](https://huggingface.co/spaces/mteb/leaderboard)
|
| 136 |
+
leaderboard, the model's task-mean score (70.91, i.e. the 12-task macro average
|
| 137 |
+
above ×100) places it as follows:
|
| 138 |
+
|
| 139 |
+
* **#18 of 241 models overall**, ahead of many models that are an order of
|
| 140 |
+
magnitude larger
|
| 141 |
+
* **#6 of 155 among sub-1B-parameter dense single-vector models**, and the
|
| 142 |
+
**smallest model in that top six**. The five models ranked above it are all
|
| 143 |
+
≈0.33–0.6B parameters (`F2LLM-v2-0.6B/330M`, `pplx-embed-v1-0.6b`,
|
| 144 |
+
`C2LLM-0.5B`, `Qwen3-Embedding-0.6B`), i.e. 2–4× larger.
|
| 145 |
+
* **#1 among ranked dense single-vector models under 300M parameters** (the
|
| 146 |
+
leaderboard's small-model view)
|
| 147 |
+
* **#2 once late-interaction / multi-vector models are included**, behind only
|
| 148 |
+
`lightonai/LateOn-Code` (a multi-vector late-interaction model; see the
|
| 149 |
+
[head-to-head below](#head-to-head-vs-lateon-code))
|
| 150 |
+
|
| 151 |
+
> **Reading the numbers fairly.** MTEB(Code, v1) reports a *zero-shot %* for
|
| 152 |
+
> each model: the fraction of leaderboard tasks the model was *not* trained on.
|
| 153 |
+
> `NightOwl-CodeEmbedding` is **8%** zero-shot: it was trained on most of these
|
| 154 |
+
> task families, so its score reflects strong **in-domain** retrieval rather
|
| 155 |
+
> than zero-shot transfer. Models marked **100%** (e.g. `embeddinggemma-300m`,
|
| 156 |
+
> the `granite-embedding` r2 family, `Qwen3-Embedding`) are evaluated fully
|
| 157 |
+
> out-of-domain, so a raw score comparison across rows with different
|
| 158 |
+
> zero-shot % is not apples-to-apples. The fairest direct comparisons are to
|
| 159 |
+
> other code-specialized models at similar zero-shot levels (e.g.
|
| 160 |
+
> `LateOn-Code` at 8%, the `F2LLM` / `C2LLM` families at 8–58%).
|
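As a rough illustration of the zero-shot column: if it is read as the share of the 12 MTEB(Code, v1) tasks with no training overlap (an assumption based on the description above; the leaderboard's exact rule is not restated here), then the reported percentages correspond to whole task counts. The helper name below is hypothetical.

```python
# Sketch: map a count of unseen tasks to a rounded zero-shot percentage.
# Assumption: zero-shot % = tasks not trained on / total benchmark tasks.
TOTAL_TASKS = 12  # the card describes a 12-task macro average


def zero_shot_percent(unseen_tasks: int, total_tasks: int = TOTAL_TASKS) -> int:
    """Return the share of unseen tasks as a rounded percentage."""
    return round(100 * unseen_tasks / total_tasks)


for unseen in (1, 7, 12):
    print(unseen, "unseen tasks ->", zero_shot_percent(unseen), "%")
```

Under this reading, 8% would mean one unseen task out of twelve, 58% seven, and 100% all twelve.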
| 161 |
+
|
| 162 |
+
### Comparison with similar-sized models
|
| 163 |
+
|
| 164 |
+
The table below compares `NightOwl-CodeEmbedding` with other compact code /
|
| 165 |
+
general embedding models on MTEB(Code, v1), with a size ladder of larger models
|
| 166 |
+
for reference. Score is the leaderboard task mean (higher is better); the
|
| 167 |
+
*Zero-shot* column is the share of tasks the model did not train on.
|
| 168 |
+
|
| 169 |
+
| Model | Params | Type | Emb. dim | Max tokens | Zero-shot | MTEB(Code, v1) ↑ |
|
| 170 |
+
| ---------------------------------------------------- | ------: | -------------- | -------------- | ---------: | --------: | ---------------: |
|
| 171 |
+
| **`NightOwl-CodeEmbedding`** (this model) | 150.8M | single-vector | 768 | 1,024 | 8% | **70.91** |
|
| 172 |
+
| `codefuse-ai/F2LLM-v2-160M` | 159M | single-vector | 640 | 40,960 | 58% | 70.38 |
|
| 173 |
+
| `google/embeddinggemma-300m` | 308M | single-vector | 768 | 2,048 | 100% | 68.76 |
|
| 174 |
+
| `codefuse-ai/F2LLM-v2-80M` | 80M | single-vector | 320 | 40,960 | 58% | 67.97 |
|
| 175 |
+
| `ibm-granite/granite-embedding-311m-multilingual-r2` | 312M | single-vector | 768 | 8,192 | 100% | 63.84 |
|
| 176 |
+
| _Late-interaction (multi-vector) reference_ | | | | | | |
|
| 177 |
+
| `lightonai/LateOn-Code` | 149M | multi-vector | 128 (per-tok) | 2,048 | 8% | 74.12 |
|
| 178 |
+
| _Larger single-vector reference (size ladder)_ | | | | | | |
|
| 179 |
+
| `codefuse-ai/F2LLM-v2-0.6B` (#1 sub-1B) | 596M | single-vector | 1,024 | 40,960 | 58% | 77.41 |
|
| 180 |
+
| `Qwen/Qwen3-Embedding-0.6B` | 596M | single-vector | 1,024 | 32,768 | 100% | 75.42 |
|
| 181 |
+
| `codefuse-ai/F2LLM-v2-14B` (#1 overall) | 13.99B | single-vector | 5,120 | 40,960 | 58% | 80.75 |
|
| 182 |
+
|
| 183 |
+
Takeaways:
|
| 184 |
+
|
| 185 |
+
* Among compact **single-vector dense** models, `NightOwl-CodeEmbedding` is the
|
| 186 |
+
strongest entry in the leaderboard's small-model view while also being one of
|
| 187 |
+
the smallest, edging out `F2LLM-v2-160M` and clearly ahead of
|
| 188 |
+
`embeddinggemma-300m`.
|
| 189 |
+
* The sub-1B leaders (`F2LLM-v2-0.6B`, `Qwen3-Embedding-0.6B`) score ~4.5–6.5
|
| 190 |
+
points higher but are ~4× the parameter count and use larger embedding
|
| 191 |
+
dimensions, which directly increases index size and inference cost.
|
| 192 |
+
* The 14B model at the top of the overall board is ~10 points higher but ~93×
|
| 193 |
+
larger, sitting in a different deployment cost regime entirely.
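To make the index-size point concrete, here is a back-of-the-envelope sketch. It assumes uncompressed float32 vectors (4 bytes per dimension) and a hypothetical corpus of one million indexed code chunks; both are illustrative assumptions, not figures from this card. The embedding dimensions are taken from the comparison table.

```python
# Rough raw-storage estimate for a single-vector index.
# Assumptions: float32 storage, no compression or quantization,
# and a hypothetical corpus of 1,000,000 chunks.
BYTES_PER_FLOAT32 = 4


def index_size_gib(num_vectors: int, dim: int) -> float:
    """Raw storage in GiB for num_vectors dense vectors of width dim."""
    return num_vectors * dim * BYTES_PER_FLOAT32 / 2**30


dims = {
    "NightOwl-CodeEmbedding": 768,
    "F2LLM-v2-0.6B / Qwen3-Embedding-0.6B": 1024,
    "F2LLM-v2-14B": 5120,
}
for name, dim in dims.items():
    print(f"{name}: {index_size_gib(1_000_000, dim):.2f} GiB")
```

Storage grows linearly with the embedding dimension, so the ratio between models holds at any corpus size; quantized or compressed indexes would shrink all rows proportionally.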
|
| 194 |
+
|
| 195 |
+
### Head-to-head vs LateOn-Code
|
| 196 |
+
|
| 197 |
+
`lightonai/LateOn-Code` is the only sub-300M model that outranks
|
| 198 |
+
`NightOwl-CodeEmbedding` once multi-vector models are included, so it is worth a
|
| 199 |
+
closer look. It is a **ColBERT-style late-interaction** model (built with PyLate
|
| 200 |
+
on ModernBERT-base): it stores **one 128-dimensional vector per token** and
|
| 201 |
+
scores with the MaxSim operator, rather than a single 768-d vector per text.
|
| 202 |
+
That buys accuracy at the cost of a larger index and a different retrieval path
|
| 203 |
+
(PyLate + a PLAID index), whereas `NightOwl` is a drop-in single-vector
|
| 204 |
+
`sentence-transformers` model.
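The difference between the two scoring schemes can be sketched in plain Python. This is a toy illustration with made-up 2-dimensional vectors, not the PyLate or `sentence-transformers` implementation; the function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine_score(query_vec, doc_vec):
    """Single-vector scoring: one cosine similarity per (query, document)."""
    return dot(normalize(query_vec), normalize(doc_vec))


def maxsim_score(query_tokens, doc_tokens):
    """Late interaction: for each query token, take its best-matching
    document token, then sum those maxima over the query tokens."""
    q = [normalize(t) for t in query_tokens]
    d = [normalize(t) for t in doc_tokens]
    return sum(max(dot(qt, dt) for dt in d) for qt in q)


# Toy vectors only; the real models use 768-d (single) or 128-d per token.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(maxsim_score(query_tokens, doc_tokens))
print(cosine_score([1.0, 1.0], [2.0, 2.0]))
```

The storage trade-off follows directly: the single-vector path keeps one vector per document, while MaxSim needs every document token vector at query time.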
|
| 205 |
+
|
| 206 |
+
Per-task NDCG@10 (×100) on MTEB(Code, v1); both models are code-specialized and
|
| 207 |
+
in-domain (8% zero-shot), so this is a like-for-like comparison. **Bold** marks
|
| 208 |
+
the higher score on each task.
|
| 209 |
+
|
| 210 |
+
| Task | NightOwl-CodeEmbedding | LateOn-Code (multi-vec) |
|
| 211 |
+
| -------------------------- | ---------------------: | ----------------------: |
|
| 212 |
+
| AppsRetrieval | 39.18 | **54.76** |
|
| 213 |
+
| COIRCodeSearchNetRetrieval | 84.26 | **86.57** |
|
| 214 |
+
| CodeEditSearchRetrieval | **74.81** | 64.99 |
|
| 215 |
+
| CodeFeedbackMT | 76.69 | **82.22** |
|
| 216 |
+
| CodeFeedbackST | 85.21 | **90.40** |
|
| 217 |
+
| CodeSearchNetCCRetrieval | **91.81** | 89.32 |
|
| 218 |
+
| CodeSearchNetRetrieval | 89.24 | **90.40** |
|
| 219 |
+
| CodeTransOceanContest | 75.95 | **87.44** |
|
| 220 |
+
| CodeTransOceanDL | 36.06 | **41.00** |
|
| 221 |
+
| CosQA | 42.81 | **45.23** |
|
| 222 |
+
| StackOverflowQA | 86.61 | **93.43** |
|
| 223 |
+
| SyntheticText2SQL | **68.27** | 63.67 |
|
| 224 |
+
| **Average** | 70.91 | **74.12** |
|
| 225 |
+
|
| 226 |
+
`LateOn-Code` wins on average, driven mostly by AppsRetrieval and the
|
| 227 |
+
feedback/translation/QA tasks. However, `NightOwl-CodeEmbedding` wins on three
|
| 228 |
+
tasks that map directly to its design focus:
|
| 229 |
+
|
| 230 |
+
* **CodeEditSearchRetrieval** (+9.8): matching edit intents to code changes;
|
| 231 |
+
`NightOwl`'s dedicated code-edit training shows here.
|
| 232 |
+
* **CodeSearchNetCCRetrieval** (+2.5): code-to-code / similar-function retrieval.
|
| 233 |
+
* **SyntheticText2SQL** (+4.6): NL-to-SQL retrieval.
|
| 234 |
+
|
| 235 |
+
So for single-vector code-edit and code-to-code retrieval specifically,
|
| 236 |
+
`NightOwl` is competitive with or ahead of a higher-average multi-vector model,
|
| 237 |
+
while keeping a standard dense-vector index. (LateOn-Code scores sourced from
|
| 238 |
+
the model's
|
| 239 |
+
[MTEB(Code, v1) table](https://huggingface.co/lightonai/LateOn-Code).)
|
| 240 |
|
| 241 |
Because the benchmark suite consists of in-domain code retrieval tasks related
|
| 242 |
to the model's training distribution, these results should not be interpreted
|
| 243 |
as strictly zero-shot performance.
|
| 244 |
|
| 245 |
+
## Base Model: the NightOwl Backbone
|
| 246 |
+
|
| 247 |
+
`NightOwl-CodeEmbedding` is fine-tuned from
|
| 248 |
+
[`Shuu12121/NightOwl`](https://huggingface.co/Shuu12121/NightOwl), a
|
| 249 |
+
ModernBERT-style code encoder that was **pre-trained from scratch**, including
|
| 250 |
+
its own tokenizer, rather than adapted from a general-purpose checkpoint. The
|
| 251 |
+
whole stack, from tokenization to the pre-training objective, is controlled for
|
| 252 |
+
code.
|
| 253 |
+
|
| 254 |
+
**Code-aware tokenizer.** NightOwl uses a custom 50,368-token BPE tokenizer in
|
| 255 |
+
which whitespace is tokenized **independently** of adjacent words, so
|
| 256 |
+
indentation is represented by its own tokens instead of being merged into
|
| 257 |
+
"leading-whitespace + word" pieces. In code the same identifier recurs at many
|
| 258 |
+
indentation depths; folding whitespace into those pieces would spend large parts
|
| 259 |
+
of the vocabulary on near-duplicate "indent + token" variants. Keeping
|
| 260 |
+
whitespace separate avoids that waste and lets the fixed vocabulary budget cover
|
| 261 |
+
more genuinely distinct subwords, while still representing indentation faithfully
|
| 262 |
+
(which matters for whitespace-significant languages such as Python).
|
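The vocabulary argument can be illustrated with a small pre-tokenization sketch. The two regular expressions below are hypothetical stand-ins, not the actual NightOwl tokenizer rules: one splits whitespace runs into their own pieces, the other glues leading whitespace onto the following word.

```python
import re

# Hypothetical pre-tokenization rules (NOT the actual NightOwl tokenizer).
# Separate: whitespace runs, words, and symbols are independent pieces.
SEPARATE_WS = re.compile(r"\s+|\w+|[^\w\s]")
# Merged: an optional whitespace run is glued to the following piece.
MERGED_WS = re.compile(r"\s*\w+|\s*[^\w\s]")


def pieces(pattern, line):
    """Return the pre-token pieces of one source line."""
    return pattern.findall(line)


# The same four keywords at four indentation depths.
words = ["return", "yield", "raise", "print"]
depths = [0, 4, 8, 12]
lines = [" " * depth + word for depth in depths for word in words]

separate = {p for line in lines for p in pieces(SEPARATE_WS, line)}
merged = {p for line in lines for p in pieces(MERGED_WS, line)}
print(len(separate), "distinct pieces with separate whitespace")
print(len(merged), "distinct pieces with merged whitespace")
```

With separate whitespace the distinct pieces grow roughly as words plus depths; with merged whitespace they grow as words times depths. A learned BPE vocabulary would not enumerate every combination, but the sketch shows where the pressure on the vocabulary budget comes from.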
| 263 |
+
|
| 264 |
+
**Two-phase pre-training with line-level masking.** NightOwl is trained with
|
| 265 |
+
masked-language modeling (`mlm_probability = 0.3`) in two phases:
|
| 266 |
+
|
| 267 |
+
* *Phase 1 (mixed pre-training):* standard random-token MLM over code, natural
|
| 268 |
+
language, and technical documentation (produces `NightOwl-Pre`).
|
| 269 |
+
* *Phase 2 (code-only continuation):* **line-level MLM**, where entire
|
| 270 |
+
source-code lines are masked instead of random tokens. This aligns the
|
| 271 |
+
pre-training objective with code search and retrieval, where the unit of
|
| 272 |
+
meaning is closer to a line or statement than an isolated token. The
|
| 273 |
+
recommended `NightOwl` checkpoint is this Phase-2 result.
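A minimal sketch of what line-level masking could look like is shown below. The card does not spell out the sampling procedure, so everything here is an assumption: tokens are whitespace-split words standing in for subword tokens, lines are picked uniformly at random, and selection stops once about `mlm_probability` of the tokens are covered. `mask_lines` is a hypothetical helper, not part of the NightOwl training code.

```python
import random

MASK = "[MASK]"


def mask_lines(code: str, mlm_probability: float = 0.3, seed: int = 0):
    """Mask whole lines until about mlm_probability of the tokens are masked.

    Assumptions: whitespace-split words stand in for subword tokens, and
    non-empty lines are sampled uniformly without replacement.
    """
    rng = random.Random(seed)
    lines = [line.split() for line in code.splitlines()]
    total = sum(len(tokens) for tokens in lines)
    budget = mlm_probability * total

    candidates = [i for i, tokens in enumerate(lines) if tokens]
    rng.shuffle(candidates)

    picked, covered = set(), 0
    for i in candidates:
        if covered >= budget:
            break
        picked.add(i)
        covered += len(lines[i])

    # Every token of a picked line is replaced; other lines are untouched.
    masked = [
        [MASK] * len(tokens) if i in picked else tokens
        for i, tokens in enumerate(lines)
    ]
    return masked, picked


snippet = "def add(a, b):\n    total = a + b\n    return total\n\nprint(add(1, 2))"
masked_lines, picked = mask_lines(snippet)
for tokens in masked_lines:
    print(" ".join(tokens))
```

Because whole lines are masked, the masked share can overshoot the target by up to one line; random-token MLM, by contrast, would scatter the same budget across many lines.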
|
| 274 |
+
|
| 275 |
+
Backbone architecture (base):
|
| 276 |
+
|
| 277 |
+
| Property | Value |
|
| 278 |
+
| ------------------------------ | ----------------------------------------------------- |
|
| 279 |
+
| Architecture | ModernBERT (alternating local/global attention, RoPE) |
|
| 280 |
+
| Parameters | ≈150M |
|
| 281 |
+
| `hidden_size` / layers / heads | 768 / 19 / 12 |
|
| 282 |
+
| Vocabulary | 50,368 (custom code BPE) |
|
| 283 |
+
| Max sequence length | 1,024 (Phase 1) → 2,048 (Phase 2) |
|
| 284 |
+
|
| 285 |
+
Pre-training data mixes `bigcode/starcoder2data-extras` (Kaggle notebooks,
|
| 286 |
+
StackOverflow threads, GitHub issues, technical documentation, …) with
|
| 287 |
+
whole-file source from `Shuu12121/github-file-programs-dataset` across the eight
|
| 288 |
+
supported languages (Python, JavaScript, TypeScript, Java, Go, Rust, Ruby, PHP).
|
| 289 |
+
Long examples are split into chunks so all tokens are used rather than truncated.
|
| 290 |
+
|
| 291 |
+
As a raw backbone (before any embedding fine-tuning), NightOwl reaches **0.8436
|
| 292 |
+
average MRR** on MTEB `CodeSearchNetRetrieval` under a fixed SentenceTransformer
|
| 293 |
+
fine-tuning protocol, ahead of CodeBERT-base (0.7944), GraphCodeBERT-base
|
| 294 |
+
(0.8325), UniXCoder-base (0.8003), and ModernBERT-base (0.8182) evaluated the
|
| 295 |
+
same way. `NightOwl-CodeEmbedding` builds the retrieval model described in this
|
| 296 |
+
card on top of that backbone.
|
| 297 |
+
|
| 298 |
## Training
|
| 299 |
|
| 300 |
The model was trained with `CachedMultipleNegativesRankingLoss` using
|
|
|
|
| 308 |
| Loss | `CachedMultipleNegativesRankingLoss` |
|
| 309 |
| Objective | Bidirectional retrieval training |
|
| 310 |
| Hard-negative mining model | `Qwen/Qwen3-Embedding-0.6B` |
|
| 311 |
+
| Epochs | 1 |
|
| 312 |
+
| Learning rate | 6e-5 |
|
| 313 |
+
| Batch size | 1024 |
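The bidirectional in-batch-negatives objective can be sketched in plain Python. This is a conceptual illustration, not the `sentence-transformers` implementation: the function names are hypothetical, the `scale` value of 20.0 and the equal weighting of the two directions are assumptions, and the gradient caching that makes large batches feasible is omitted.

```python
import math


def in_batch_cross_entropy(sim, scale):
    """Mean cross-entropy where row i's correct column is i;
    every other column in the row acts as an in-batch negative."""
    loss = 0.0
    for i, row in enumerate(sim):
        logits = [scale * s for s in row]
        top = max(logits)  # subtract the max for numerical stability
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        loss += log_z - logits[i]
    return loss / len(sim)


def bidirectional_loss(sim, scale=20.0):
    """sim[i][j] = cosine(query_i, document_j); (i, i) are the positives."""
    transposed = [list(col) for col in zip(*sim)]
    query_to_doc = in_batch_cross_entropy(sim, scale)
    doc_to_query = in_batch_cross_entropy(transposed, scale)
    return 0.5 * (query_to_doc + doc_to_query)


aligned = [[0.9, 0.1], [0.2, 0.8]]     # positives score highest
misaligned = [[0.1, 0.9], [0.8, 0.2]]  # negatives score highest
print(bidirectional_loss(aligned))
print(bidirectional_loss(misaligned))
```

With a batch size of 1024, each query is contrasted against 1,023 other documents in the batch (plus any mined hard negatives), which is why larger batches tend to give a harder training signal.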
|
| 314 |
|
| 315 |
### Training Data
|
| 316 |
|
|
|
|
| 376 |
|
| 377 |
* The model is specialized for code-related retrieval and may underperform
|
| 378 |
general-purpose text embedding models on unrelated natural language tasks.
|
| 379 |
+
* Inputs longer than 1,024 tokens are truncated. This is a shorter context
|
| 380 |
+
window than several models it competes with (e.g. the 8K+ token `F2LLM` and
|
| 381 |
+
`granite` models), so very long files must be chunked.
|
| 382 |
+
* MTEB(Code, v1) is largely in-domain for this model (8% zero-shot). On code
|
| 383 |
+
domains, query styles, or languages far from the training distribution,
|
| 384 |
+
expect lower performance than the leaderboard numbers suggest.
|
| 385 |
* Performance may vary by programming language, query style, and the
|
| 386 |
granularity of indexed code chunks; languages outside the eight supported
|
| 387 |
languages are untested.
|
| 388 |
* The model uses dense single-vector embeddings. For very fine-grained
|
| 389 |
+
matching, rerankers or late-interaction models (such as `LateOn-Code`) may
|
| 390 |
+
provide a higher average at the cost of a larger index and a non-standard
|
| 391 |
+
retrieval path. That said, as the [head-to-head](#head-to-head-vs-lateon-code)
|
| 392 |
+
shows, single-vector `NightOwl` still leads on code-edit and code-to-code
|
| 393 |
+
retrieval.
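Since inputs beyond 1,024 tokens are truncated, long files need to be split before encoding. The sketch below shows one simple way to do that with overlapping windows. The `chunk_tokens` helper and the 128-token overlap are hypothetical placeholders, and token counts should come from the model's own tokenizer in practice.

```python
def chunk_tokens(tokens, max_tokens=1024, overlap=128):
    """Split a token list into windows of at most max_tokens,
    where consecutive windows share `overlap` tokens."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reaches the end
    return chunks


# Stand-in token ids; a real pipeline would tokenize the source file first.
tokens = list(range(2500))
chunks = chunk_tokens(tokens)
print([len(chunk) for chunk in chunks])
```

Each chunk can then be encoded and indexed as its own document, with the overlap reducing the chance that a relevant function is cut in half at a window boundary.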
|
| 394 |
|
| 395 |
## Recommended Indexing Settings
|
| 396 |
|