Shuu12121/NightOwl-CodeEmbedding

@@ -31,6 +31,11 @@ does **not** require `query:` / `passage:` style prefixes.
 ## Highlights
 * Compact (150.8M parameters) yet competitive on CoIR-style code retrieval benchmarks
 * Covers **eight programming languages**, including Rust and TypeScript in
   addition to the six CodeSearchNet languages
 * Handles a wide range of code retrieval scenarios: NL-to-code search,
@@ -125,14 +130,171 @@ so the official `train` split is used for evaluation. These examples were
 **not** used for fine-tuning. See
 [Data Decontamination](#data-decontamination) for details.
-<!-- TODO: Add a comparison table with the base NightOwl model and/or
-     similar-sized code embedding models (e.g. jina-embeddings-v2-base-code,
-     CodeXEmbed) to give readers a reference point. -->
 Because the benchmark suite consists of in-domain code retrieval tasks related
 to the model's training distribution, these results should not be interpreted
 as strictly zero-shot performance.
 ## Training
 The model was trained with `CachedMultipleNegativesRankingLoss` using
@@ -146,9 +308,9 @@ bidirectional query-to-document and document-to-query objectives.
 | Loss                       | `CachedMultipleNegativesRankingLoss`      |
 | Objective                  | Bidirectional retrieval training          |
 | Hard-negative mining model | `Qwen/Qwen3-Embedding-0.6B`               |
-| Epochs                     | 1                                    |
-| Learning rate              | 6e-5                                    |
-| Batch size                 | 1024                                    |
 ### Training Data
@@ -214,12 +376,21 @@ examples.
 * The model is specialized for code-related retrieval and may underperform
   general-purpose text embedding models on unrelated natural language tasks.
-* Inputs longer than 1,024 tokens are truncated.
 * Performance may vary by programming language, query style, and the
   granularity of indexed code chunks; languages outside the eight supported
   languages are untested.
 * The model uses dense single-vector embeddings. For very fine-grained
-  matching, rerankers or late-interaction models may provide better precision.
 ## Recommended Indexing Settings

 ## Highlights
 * Compact (150.8M parameters) yet competitive on CoIR-style code retrieval benchmarks
+* On the [MTEB(Code, v1)](https://huggingface.co/spaces/mteb/leaderboard)
+  leaderboard it ranks **18th out of 241 models overall** and is the
+  **top-scoring single-vector model under 300M parameters** among scored entries
+  on the official board, ahead of many models an order of magnitude larger (see
+  [Leaderboard Standing](#leaderboard-standing))
 * Covers **eight programming languages**, including Rust and TypeScript in
   addition to the six CodeSearchNet languages
 * Handles a wide range of code retrieval scenarios: NL-to-code search,
 **not** used for fine-tuning. See
 [Data Decontamination](#data-decontamination) for details.
+### Leaderboard Standing
+On the public [MTEB(Code, v1)](https://huggingface.co/spaces/mteb/leaderboard)
+leaderboard, the model's task-mean score (70.91, i.e. the 12-task macro average
+above ×100) places it as follows:
+* **#18 of 241 models overall**, ahead of many models that are an order of
+  magnitude larger
+* **#6 of 155 among sub-1B-parameter dense single-vector models**, and the
+  **smallest model in that top six**. The five models ranked above it are all
+  ≈0.33–0.6B parameters (`F2LLM-v2-0.6B`, `F2LLM-v2-330M`, `pplx-embed-v1-0.6b`,
+  `C2LLM-0.5B`, `Qwen3-Embedding-0.6B`), i.e. roughly 2–4× larger.
+* **#1 among ranked dense single-vector models under 300M parameters** (the
+  leaderboard's small-model view)
+* **#2 in that under-300M view once late-interaction / multi-vector models are
+  included**, behind only `lightonai/LateOn-Code` (a multi-vector
+  late-interaction model; see the
+  [head-to-head below](#head-to-head-vs-lateon-code))
+> **Reading the numbers fairly.** MTEB(Code, v1) reports a *zero-shot %* for
+> each model — the fraction of leaderboard tasks the model was *not* trained on.
+> `NightOwl-CodeEmbedding` is **8%** zero-shot: it was trained on most of these
+> task families, so its score reflects strong **in-domain** retrieval rather
+> than zero-shot transfer. Models marked **100%** (e.g. `embeddinggemma-300m`,
+> the `granite-embedding` r2 family, `Qwen3-Embedding`) were not trained on any
+> of the leaderboard tasks, so a raw score comparison across rows with different
+> zero-shot % is not apples-to-apples. The fairest direct comparisons are to
+> other code-specialized models at similar zero-shot levels (e.g.
+> `LateOn-Code` at 8%, the `F2LLM` / `C2LLM` families at 8–58%).
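The zero-shot percentages quoted above are consistent with whole-task fractions of the 12-task suite. A minimal sketch, assuming the leaderboard rounds `untrained_tasks / 12` to the nearest percent (that rounding rule is an assumption, not something this card states):

```python
# Assumption: zero-shot % = share of the 12 MTEB(Code, v1) tasks a model
# was NOT trained on, rounded to the nearest whole percent.
N_TASKS = 12


def zero_shot_pct(untrained_tasks: int, n_tasks: int = N_TASKS) -> int:
    """Return the rounded percentage of tasks the model did not train on."""
    if not 0 <= untrained_tasks <= n_tasks:
        raise ValueError("untrained_tasks must be between 0 and n_tasks")
    return round(100 * untrained_tasks / n_tasks)


# Map every reachable percentage back to the task count that produces it.
pct_to_count = {zero_shot_pct(k): k for k in range(N_TASKS + 1)}

for reported in (8, 58, 100):
    unseen = pct_to_count[reported]
    print(f"{reported}% zero-shot -> {unseen} of {N_TASKS} tasks unseen")
```

Under that assumption, 8% corresponds to a single unseen task out of twelve, which matches the card's description of the scores as largely in-domain.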
+### Comparison with similar-sized models
+The table below compares `NightOwl-CodeEmbedding` with other compact code /
+general embedding models on MTEB(Code, v1), with a size ladder of larger models
+for reference. Score is the leaderboard task mean (higher is better); the
+*Zero-shot* column is the share of tasks the model did not train on.
+| Model                                                | Params  | Type           | Emb. dim       | Max tokens | Zero-shot | MTEB(Code, v1) ↑ |
+| ---------------------------------------------------- | ------: | -------------- | -------------- | ---------: | --------: | ---------------: |
+| **`NightOwl-CodeEmbedding`** (this model)            |  150.8M | single-vector  | 768            |      1,024 |        8% |        **70.91** |
+| `codefuse-ai/F2LLM-v2-160M`                          |    159M | single-vector  | 640            |     40,960 |       58% |            70.38 |
+| `google/embeddinggemma-300m`                         |    308M | single-vector  | 768            |      2,048 |      100% |            68.76 |
+| `codefuse-ai/F2LLM-v2-80M`                           |     80M | single-vector  | 320            |     40,960 |       58% |            67.97 |
+| `ibm-granite/granite-embedding-311m-multilingual-r2` |    312M | single-vector  | 768            |      8,192 |      100% |            63.84 |
+| _Late-interaction (multi-vector) reference_          |         |                |                |            |           |                  |
+| `lightonai/LateOn-Code`                              |    149M | multi-vector   | 128 (per-tok)  |      2,048 |        8% |            74.12 |
+| _Larger single-vector reference (size ladder)_       |         |                |                |            |           |                  |
+| `codefuse-ai/F2LLM-v2-0.6B` (#1 sub-1B)              |    596M | single-vector  | 1,024          |     40,960 |       58% |            77.41 |
+| `Qwen/Qwen3-Embedding-0.6B`                          |    596M | single-vector  | 1,024          |     32,768 |      100% |            75.42 |
+| `codefuse-ai/F2LLM-v2-14B` (#1 overall)              |  13.99B | single-vector  | 5,120          |     40,960 |       58% |            80.75 |
+Takeaways:
+* Among compact **single-vector dense** models, `NightOwl-CodeEmbedding` is the
+  strongest entry in the leaderboard's small-model view while also being one of
+  the smallest, edging out `F2LLM-v2-160M` and clearly ahead of
+  `embeddinggemma-300m`.
+* The sub-1B leaders (`F2LLM-v2-0.6B`, `Qwen3-Embedding-0.6B`) score ~4.5–6.5
+  points higher but have ~4× the parameter count and use larger embedding
+  dimensions, which directly increases index size and inference cost.
+* The 14B model at the top of the overall board is ~10 points higher but ~93×
+  larger, sitting in a different deployment cost regime entirely.
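The score gaps, parameter ratios, and index-size effect in the takeaways can be recomputed from the table. This is a rough sketch: the one-million-vector corpus and uncompressed float32 storage are illustrative assumptions, and real indexes add overhead or use quantization.

```python
# Params (millions), embedding dim, and score copied from the comparison table.
MODELS = {
    "NightOwl-CodeEmbedding": {"params_m": 150.8, "dim": 768, "score": 70.91},
    "F2LLM-v2-0.6B": {"params_m": 596.0, "dim": 1024, "score": 77.41},
    "Qwen3-Embedding-0.6B": {"params_m": 596.0, "dim": 1024, "score": 75.42},
    "F2LLM-v2-14B": {"params_m": 13990.0, "dim": 5120, "score": 80.75},
}
BASELINE = "NightOwl-CodeEmbedding"


def index_gib(n_vectors, dim, bytes_per_value=4):
    """Raw size in GiB of a flat single-vector index (assumes float32, no overhead)."""
    return n_vectors * dim * bytes_per_value / 2**30


def compare(name, n_vectors=1_000_000):
    """Compare one model against the baseline for a hypothetical corpus size."""
    base, other = MODELS[BASELINE], MODELS[name]
    return {
        "score_gap": round(other["score"] - base["score"], 2),
        "param_ratio": round(other["params_m"] / base["params_m"], 1),
        "index_gib": round(index_gib(n_vectors, other["dim"]), 2),
    }


for name in MODELS:
    print(name, compare(name))
```

Index size grows linearly with embedding dimension, so moving from 768 to 1,024 dimensions costs about a third more storage for the same corpus.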
+### Head-to-head vs LateOn-Code
+`lightonai/LateOn-Code` is the only model under 300M parameters that outranks
+`NightOwl-CodeEmbedding` once multi-vector models are included, so it is worth a
+closer look. It is a **ColBERT-style late-interaction** model (built with PyLate
+on ModernBERT-base): it stores **one 128-dimensional vector per token** and
+scores with the MaxSim operator, rather than a single 768-d vector per text.
+That buys accuracy at the cost of a larger index and a different retrieval path
+(PyLate + a PLAID index), whereas `NightOwl` is a drop-in single-vector
+`sentence-transformers` model.
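The difference between the two scoring schemes can be shown on toy data. The sketch below uses made-up 2-d vectors, and it uses mean pooling to form the single vector; that pooling choice is an assumption for illustration, not a statement about either model's internals.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def mean_pool(tokens):
    """Collapse per-token vectors into one vector (illustrative pooling choice)."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]


def maxsim(query_tokens, doc_tokens):
    """Late interaction: for each query token take the best-matching
    document token, then sum those maxima over the query tokens."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy per-token vectors; real models use 768-d (single) or 128-d (per token).
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a_tokens = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]  # contains both query tokens
doc_b_tokens = [[0.7, 0.7]]                          # only a blended token

for name, doc in (("doc_a", doc_a_tokens), ("doc_b", doc_b_tokens)):
    single = cosine(mean_pool(query_tokens), mean_pool(doc))
    late = maxsim(query_tokens, doc)
    print(f"{name}: single-vector cosine={single:.3f}, MaxSim={late:.3f}, "
          f"stored vectors={len(doc)} vs 1")
```

In this toy case the pooled vectors of both documents point in the same direction as the query, so cosine similarity cannot separate them, while MaxSim prefers the document with exact token matches. The price is one stored vector per token instead of one per text.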
+Per-task NDCG@10 (×100) on MTEB(Code, v1); both models are code-specialized and
+in-domain (8% zero-shot), so this is a like-for-like comparison. **Bold** marks
+the higher score on each task.
+| Task                       | NightOwl-CodeEmbedding | LateOn-Code (multi-vec) |
+| -------------------------- | ---------------------: | ----------------------: |
+| AppsRetrieval              |                  39.18 |               **54.76** |
+| COIRCodeSearchNetRetrieval |                  84.26 |               **86.57** |
+| CodeEditSearchRetrieval    |              **74.81** |                   64.99 |
+| CodeFeedbackMT             |                  76.69 |               **82.22** |
+| CodeFeedbackST             |                  85.21 |               **90.40** |
+| CodeSearchNetCCRetrieval   |              **91.81** |                   89.32 |
+| CodeSearchNetRetrieval     |                  89.24 |               **90.40** |
+| CodeTransOceanContest      |                  75.95 |               **87.44** |
+| CodeTransOceanDL           |                  36.06 |               **41.00** |
+| CosQA                      |                  42.81 |               **45.23** |
+| StackOverflowQA            |                  86.61 |               **93.43** |
+| SyntheticText2SQL          |              **68.27** |                   63.67 |
+| **Average**                |                  70.91 |               **74.12** |
+`LateOn-Code` wins on average, driven mostly by AppsRetrieval and the
+feedback/translation/QA tasks. However, `NightOwl-CodeEmbedding` wins on three
+tasks that map directly to its design focus:
+* **CodeEditSearchRetrieval** (+9.8): matching edit intents to code changes —
+  `NightOwl`'s dedicated code-edit training shows here.
+* **CodeSearchNetCCRetrieval** (+2.5): code-to-code / similar-function retrieval.
+* **SyntheticText2SQL** (+4.6): NL-to-SQL retrieval.
+So for single-vector code-edit and code-to-code retrieval specifically,
+`NightOwl` is competitive with or ahead of a higher-average multi-vector model,
+while keeping a standard dense-vector index. (LateOn-Code scores sourced from
+the model's
+[MTEB(Code, v1) table](https://huggingface.co/lightonai/LateOn-Code).)
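The averages and per-task margins quoted in this comparison can be checked directly from the table values. A small sketch that recomputes the 12-task macro average and lists the tasks where `NightOwl-CodeEmbedding` scores higher:

```python
# Per-task NDCG@10 (x100) copied from the head-to-head table:
# task -> (NightOwl-CodeEmbedding, LateOn-Code)
SCORES = {
    "AppsRetrieval": (39.18, 54.76),
    "COIRCodeSearchNetRetrieval": (84.26, 86.57),
    "CodeEditSearchRetrieval": (74.81, 64.99),
    "CodeFeedbackMT": (76.69, 82.22),
    "CodeFeedbackST": (85.21, 90.40),
    "CodeSearchNetCCRetrieval": (91.81, 89.32),
    "CodeSearchNetRetrieval": (89.24, 90.40),
    "CodeTransOceanContest": (75.95, 87.44),
    "CodeTransOceanDL": (36.06, 41.00),
    "CosQA": (42.81, 45.23),
    "StackOverflowQA": (86.61, 93.43),
    "SyntheticText2SQL": (68.27, 63.67),
}


def macro_average(column):
    """Unweighted mean over tasks; column 0 = NightOwl, column 1 = LateOn-Code."""
    values = [pair[column] for pair in SCORES.values()]
    return sum(values) / len(values)


# Margin (NightOwl minus LateOn-Code) on the tasks NightOwl wins.
nightowl_wins = {
    task: round(a - b, 2) for task, (a, b) in SCORES.items() if a > b
}

print(f"NightOwl macro average:    {macro_average(0):.2f}")
print(f"LateOn-Code macro average: {macro_average(1):.2f}")
for task, margin in nightowl_wins.items():
    print(f"{task}: +{margin:.2f}")
```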
 Because the benchmark suite consists of in-domain code retrieval tasks related
 to the model's training distribution, these results should not be interpreted
 as strictly zero-shot performance.
+## Base Model: the NightOwl Backbone
+`NightOwl-CodeEmbedding` is fine-tuned from
+[`Shuu12121/NightOwl`](https://huggingface.co/Shuu12121/NightOwl), a
+ModernBERT-style code encoder that was **pre-trained from scratch** — including
+its own tokenizer — rather than adapted from a general-purpose checkpoint. The
+whole stack, from tokenization to the pre-training objective, is controlled for
+code.
+**Code-aware tokenizer.** NightOwl uses a custom 50,368-token BPE tokenizer in
+which whitespace is tokenized **independently** of adjacent words, so
+indentation is represented by its own tokens instead of being merged into
+"leading-whitespace + word" pieces. In code the same identifier recurs at many
+indentation depths; folding whitespace into those pieces would spend large parts
+of the vocabulary on near-duplicate "indent + token" variants. Keeping
+whitespace separate avoids that waste and lets the fixed vocabulary budget cover
+more genuinely distinct subwords, while still representing indentation faithfully
+— which matters for whitespace-significant languages such as Python.
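The vocabulary argument can be illustrated with two toy pre-tokenization rules. Both regular expressions below are hypothetical and written only for this example; they are not NightOwl's actual tokenizer rules, which this card does not spell out.

```python
import re

# Hypothetical rule 1: whitespace runs are emitted as their own pieces.
INDEPENDENT = re.compile(r"[ \t]+|\n|\w+|[^\w\s]")
# Hypothetical rule 2 (contrast): leading spaces are glued onto the next piece.
ATTACHED = re.compile(r"[ \t]*\w+|[ \t]*[^\w\s]|\n")

snippet = "if x:\n    return x\nelse:\n        return x\n"


def pieces(pattern, code):
    """Split code into pre-tokenization pieces using the given pattern."""
    return pattern.findall(code)


def variants(pattern, code, word):
    """Distinct pieces that contain `word` under a given rule."""
    return {p for p in pieces(pattern, code) if word in p}


print("independent:", sorted(variants(INDEPENDENT, snippet, "return")))
print("attached:   ", sorted(variants(ATTACHED, snippet, "return")))
```

With whitespace kept separate, `return` is a single piece regardless of nesting depth. With attached whitespace, each indentation depth yields a different piece, which is the near-duplicate growth the paragraph above describes.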
+**Two-phase pre-training with line-level masking.** NightOwl is trained with
+masked-language modeling (`mlm_probability = 0.3`) in two phases:
+* *Phase 1 — mixed pre-training:* standard random-token MLM over code, natural
+  language, and technical documentation (produces `NightOwl-Pre`).
+* *Phase 2 — code-only continuation:* **line-level MLM**, where entire
+  source-code lines are masked instead of random tokens. This aligns the
+  pre-training objective with code search and retrieval, where the unit of
+  meaning is closer to a line or statement than an isolated token. The
+  recommended `NightOwl` checkpoint is this Phase-2 result.
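To make the Phase-2 objective concrete, here is a minimal sketch of line-level masking. The selection strategy (shuffle the lines, then mask whole lines until about `mlm_probability` of the tokens are covered) and the `[MASK]` string are assumptions for illustration; the card does not describe the exact sampling procedure.

```python
import random

MASK = "[MASK]"  # placeholder mask token, assumed for this sketch


def line_level_mask(lines, mlm_probability=0.3, seed=0):
    """Mask entire lines instead of individual tokens.

    `lines` is a list of token lists, one per source line. Lines are picked
    in random order until roughly `mlm_probability` of all tokens are masked.
    """
    rng = random.Random(seed)
    budget = mlm_probability * sum(len(line) for line in lines)
    order = list(range(len(lines)))
    rng.shuffle(order)

    masked_lines, masked_tokens = set(), 0
    for idx in order:
        if masked_tokens >= budget:
            break
        masked_lines.add(idx)
        masked_tokens += len(lines[idx])

    return [
        [MASK] * len(line) if i in masked_lines else list(line)
        for i, line in enumerate(lines)
    ]


example = [
    ["def", "add", "(", "a", ",", "b", ")", ":"],
    ["total", "=", "a", "+", "b"],
    ["return", "total"],
]
for original, masked in zip(example, line_level_mask(example)):
    print(" ".join(masked), "   <-", " ".join(original))
```

Every output line is either fully masked or untouched, so the model has to reconstruct a whole statement from its surrounding context rather than fill in isolated tokens.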
+Backbone architecture (base):
+| Property                       | Value                                                 |
+| ------------------------------ | ----------------------------------------------------- |
+| Architecture                   | ModernBERT (alternating local/global attention, RoPE) |
+| Parameters                     | ≈150M                                                 |
+| `hidden_size` / layers / heads | 768 / 19 / 12                                         |
+| Vocabulary                     | 50,368 (custom code BPE)                              |
+| Max sequence length            | 1,024 (Phase 1) → 2,048 (Phase 2)                     |
+Pre-training data mixes `bigcode/starcoder2data-extras` (Kaggle notebooks,
+StackOverflow threads, GitHub issues, technical documentation, …) with
+whole-file source from `Shuu12121/github-file-programs-dataset` across the eight
+supported languages (Python, JavaScript, TypeScript, Java, Go, Rust, Ruby, PHP).
+Long examples are split into chunks so all tokens are used rather than truncated.
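Chunking instead of truncating can be as simple as slicing the token sequence into fixed-size windows. A minimal sketch, assuming non-overlapping windows and a 1,024-token limit (the card does not say whether the real pipeline uses overlap or special handling at chunk boundaries):

```python
def chunk_token_ids(token_ids, max_len=1024):
    """Split a token sequence into consecutive windows of at most max_len.

    No tokens are dropped; only the final chunk may be shorter than max_len.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]


# A stand-in for a tokenized long file (integers instead of real token ids).
long_example = list(range(2500))
chunks = chunk_token_ids(long_example)
print([len(c) for c in chunks])
```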
+As a backbone, before the retrieval fine-tuning described in this card,
+NightOwl reaches **0.8436 average MRR** on MTEB `CodeSearchNetRetrieval` when
+fine-tuned under a fixed SentenceTransformer protocol, ahead of CodeBERT-base
+(0.7944), GraphCodeBERT-base (0.8325), UniXCoder-base (0.8003), and
+ModernBERT-base (0.8182) fine-tuned and evaluated the same way.
+`NightOwl-CodeEmbedding` builds the retrieval model described in this card on
+top of that backbone.
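For readers unfamiliar with the metric, mean reciprocal rank can be sketched as follows. The example assumes exactly one relevant document per query, which is a simplification made for illustration; the document ids are made up.

```python
def mean_reciprocal_rank(ranked_ids_per_query, relevant_id_per_query):
    """Average of 1/rank of the relevant document; 0 if it is not retrieved."""
    total = 0.0
    for ranked, relevant in zip(ranked_ids_per_query, relevant_id_per_query):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id == relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_ids_per_query)


# Three toy queries: hit at rank 1, hit at rank 2, and a miss.
rankings = [["a", "b"], ["b", "a"], ["c", "d"]]
relevant = ["a", "a", "x"]
print(mean_reciprocal_rank(rankings, relevant))
```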
 ## Training
 The model was trained with `CachedMultipleNegativesRankingLoss` using
 | Loss                       | `CachedMultipleNegativesRankingLoss`      |
 | Objective                  | Bidirectional retrieval training          |
 | Hard-negative mining model | `Qwen/Qwen3-Embedding-0.6B`               |
+| Epochs                     | 1                                         |
+| Learning rate              | 6e-5                                      |
+| Batch size                 | 1024                                      |
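The bidirectional in-batch objective can be sketched in plain Python on a similarity matrix. This is a conceptual illustration only: it omits gradient caching and mined hard negatives, the equal weighting of the two directions is an assumption, and `scale=20.0` is an illustrative value rather than a setting taken from this card.

```python
import math


def _diagonal_cross_entropy(sim, scale):
    """Mean softmax cross-entropy where row i's correct column is i."""
    loss = 0.0
    for i, row in enumerate(sim):
        logits = [scale * s for s in row]
        top = max(logits)  # subtract the max for numerical stability
        log_sum_exp = top + math.log(sum(math.exp(x - top) for x in logits))
        loss += log_sum_exp - logits[i]
    return loss / len(sim)


def bidirectional_in_batch_loss(sim, scale=20.0):
    """Average of query-to-document and document-to-query ranking losses.

    `sim[i][j]` is the similarity between query i and document j; the
    matching pair sits on the diagonal, all other entries act as negatives.
    """
    transposed = [list(col) for col in zip(*sim)]
    q2d = _diagonal_cross_entropy(sim, scale)
    d2q = _diagonal_cross_entropy(transposed, scale)
    return 0.5 * (q2d + d2q)


aligned = [[0.9, 0.1], [0.2, 0.8]]    # matching pairs score highest
confused = [[0.1, 0.9], [0.8, 0.2]]   # negatives outscore the true pairs
print(bidirectional_in_batch_loss(aligned))
print(bidirectional_in_batch_loss(confused))
```

The loss is small when each query ranks its own document first and each document ranks its own query first, and large when in-batch negatives win. Larger batches supply more negatives per step.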
 ### Training Data
 * The model is specialized for code-related retrieval and may underperform
   general-purpose text embedding models on unrelated natural language tasks.
+* Inputs longer than 1,024 tokens are truncated. This is a shorter context
+  window than several models it competes with (e.g. the 8K+ token `F2LLM` and
+  `granite` models), so very long files must be chunked.
+* MTEB(Code, v1) is largely in-domain for this model (8% zero-shot). On code
+  domains, query styles, or languages far from the training distribution,
+  expect lower performance than the leaderboard numbers suggest.
 * Performance may vary by programming language, query style, and the
   granularity of indexed code chunks; languages outside the eight supported
   languages are untested.
 * The model uses dense single-vector embeddings. For very fine-grained
+  matching, rerankers or late-interaction models (such as `LateOn-Code`) may
+  provide a higher average at the cost of a larger index and a non-standard
+  retrieval path — though, as the [head-to-head](#head-to-head-vs-lateon-code)
+  shows, single-vector `NightOwl` still leads on code-edit and code-to-code
+  retrieval.
 ## Recommended Indexing Settings