| tags: | |
| - sentence-transformers | |
| - feature-extraction | |
| - code-search | |
| - code-embedding | |
| - retrieval | |
| - modernbert | |
| - dense | |
| base_model: Shuu12121/NightOwl | |
| pipeline_tag: feature-extraction | |
| library_name: sentence-transformers | |
| license: apache-2.0 | |
| datasets: | |
| - Shuu12121/coir_hard_negative_datasets_v3_kd | |
| - Shuu12121/owl_code_search_hard_negative_datasets_V2_kd | |
| - Shuu12121/codeedit_hard_negative_datasets_kd | |
| # NightOwl-CodeEmbedding 🦉 | |
| `NightOwl-CodeEmbedding` is a compact 768-dimensional dense embedding model | |
| specialized for code retrieval, code-edit retrieval, and technical question | |
| answering. | |
| The model is fine-tuned from | |
| [`Shuu12121/NightOwl`](https://huggingface.co/Shuu12121/NightOwl), a | |
| ModernBERT-based code model. It uses CLS pooling with cosine similarity and | |
| does **not** require `query:` / `passage:` style prefixes. | |
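To make "CLS pooling with cosine similarity" concrete, here is a minimal sketch in plain Python. The token vectors are made-up 4-dimensional toy values (the real model outputs 768 dimensions), and `cls_pool` / `cosine_similarity` are illustrative helper names, not part of any library API.

```python
import math

def cls_pool(token_embeddings):
    """Use the first token's vector (the [CLS] position) as the text embedding."""
    return token_embeddings[0]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy per-token encoder outputs: 3 tokens x 4 dims (hypothetical values).
query_tokens = [[1.0, 0.0, 1.0, 0.0], [0.2, 0.9, 0.1, 0.3], [0.5, 0.5, 0.5, 0.5]]
doc_tokens = [[2.0, 0.0, 2.0, 0.0], [0.7, 0.1, 0.4, 0.8], [0.1, 0.2, 0.3, 0.4]]

# Only the first (CLS) vectors take part in the score; no prefix is prepended.
score = cosine_similarity(cls_pool(query_tokens), cls_pool(doc_tokens))
print(round(score, 4))
```

In practice `sentence-transformers` performs the pooling internally, so this only illustrates what the pooled vector and the score represent.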
| ## Highlights | |
| * Compact (150.8M parameters) yet competitive on CoIR-style code retrieval benchmarks | |
| * On the [MTEB(Code, v1)](https://huggingface.co/spaces/mteb/leaderboard) | |
| leaderboard it ranks **18th out of 241 models overall** and is the | |
| **top-scoring single-vector model under 300M parameters** among scored entries | |
| on the official board, ahead of many models an order of magnitude larger (see | |
| [Leaderboard Standing](#leaderboard-standing)) | |
| * Covers **eight programming languages**, including Rust and TypeScript in | |
| addition to the six CodeSearchNet languages | |
| * Handles a wide range of code retrieval scenarios: NL-to-code search, | |
| code-to-code retrieval, **code-edit retrieval**, and technical QA | |
| * Trained with hard negatives mined by `Qwen/Qwen3-Embedding-0.6B` | |
| (15 hard negatives per anchor) | |
| * Decontaminated against CodeSearchNet test splits and the | |
| CodeEditSearchRetrieval benchmark (see [Data Decontamination](#data-decontamination)) | |
| * Drop-in compatible with `sentence-transformers`, Apache-2.0 license | |
| ## Supported Languages | |
| The training data covers the six CodeSearchNet languages plus two additional | |
| languages: | |
| * Go, Java, JavaScript, PHP, Python, Ruby (CodeSearchNet languages) | |
| * **Rust, TypeScript** (additional) | |
| Performance on languages outside this set is not guaranteed and may vary. | |
| ## Usage | |
| ```python | |
| from sentence_transformers import SentenceTransformer | |
| model = SentenceTransformer("Shuu12121/NightOwl-CodeEmbedding") | |
| queries = ["Python function that sorts a list in descending order"] | |
| documents = [ | |
| "def sort_desc(values): return sorted(values, reverse=True)", | |
| "def average(values): return sum(values) / len(values)", | |
| ] | |
| query_embeddings = model.encode(queries) | |
| document_embeddings = model.encode(documents) | |
| # Cosine similarity (embeddings are normalized internally by similarity()) | |
| scores = model.similarity(query_embeddings, document_embeddings) | |
| print(scores) | |
| ``` | |
| ## Model Details | |
| | Property | Value | | |
| | ----------------------- | -------------------- | | |
| | Base model | `Shuu12121/NightOwl` | | |
| | Architecture | ModernBERT | | |
| | Parameters | 150,779,136 | | |
| | Embedding dimension | 768 | | |
| | Pooling | CLS pooling | | |
| | Maximum sequence length | 1,024 tokens | | |
| | Similarity | Cosine similarity | | |
| | Query/document prefixes | Not required | | |
| | Weight dtype | FP32 | | |
| | Weight memory | 575 MiB | | |
| | License | Apache-2.0 | | |
| ## MTEB Results | |
| The model was evaluated with MTEB on code-related retrieval and technical QA | |
| tasks. | |
| Evaluation setup: | |
| * Model revision: `c7c8a57b9539297e192d5cf39b9aecf1fb376edd` | |
| * MTEB version: `2.15.1` | |
| * Metric: `NDCG@10` | |
| * Hardware: NVIDIA GeForce RTX 5090 | |
| * Batch size: 64 | |
| Multi-subset task scores are reported as macro averages. | |
| | Task | Split | NDCG@10 | | |
| | -------------------------------- | ------: | ----------: | | |
| | AppsRetrieval | test | 0.39177 | | |
| | COIRCodeSearchNetRetrieval | test | 0.84264 | | |
| | CodeEditSearchRetrieval | train¹ | 0.74808 | | |
| | CodeFeedbackMT | test | 0.76690 | | |
| | CodeFeedbackST | test | 0.85207 | | |
| | CodeSearchNetCCRetrieval | test | 0.91805 | | |
| | CodeSearchNetRetrieval | test | 0.89239 | | |
| | CodeTransOceanContest | test | 0.75953 | | |
| | CodeTransOceanDL | test | 0.36057 | | |
| | CosQA | test | 0.42810 | | |
| | StackOverflowQA | test | 0.86608 | | |
| | SyntheticText2SQL | test | 0.68266 | | |
| | **Macro average, all 12 tasks** | | **0.70907** | | |
| | **CoIR macro average, 10 tasks** | | **0.68684** | | |
| ¹ `CodeEditSearchRetrieval` does not provide a standard `test` split in MTEB, | |
| so the official `train` split is used for evaluation. These examples were | |
| **not** used for fine-tuning. See | |
| [Data Decontamination](#data-decontamination) for details. | |
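The two averages in the table can be re-derived from the per-task scores. The sketch below assumes that the 10-task CoIR average excludes `CodeEditSearchRetrieval` and `CodeSearchNetRetrieval`; the card does not state this explicitly, but that choice reproduces the reported figure.

```python
from statistics import mean

# Per-task NDCG@10 copied from the table above.
ndcg_at_10 = {
    "AppsRetrieval": 0.39177,
    "COIRCodeSearchNetRetrieval": 0.84264,
    "CodeEditSearchRetrieval": 0.74808,
    "CodeFeedbackMT": 0.76690,
    "CodeFeedbackST": 0.85207,
    "CodeSearchNetCCRetrieval": 0.91805,
    "CodeSearchNetRetrieval": 0.89239,
    "CodeTransOceanContest": 0.75953,
    "CodeTransOceanDL": 0.36057,
    "CosQA": 0.42810,
    "StackOverflowQA": 0.86608,
    "SyntheticText2SQL": 0.68266,
}

# Assumption: these two tasks are the ones outside the 10-task CoIR set.
non_coir = {"CodeEditSearchRetrieval", "CodeSearchNetRetrieval"}

all_avg = mean(ndcg_at_10.values())
coir_avg = mean(v for task, v in ndcg_at_10.items() if task not in non_coir)

print(f"all 12 tasks: {all_avg:.5f}")   # 0.70907
print(f"CoIR 10 tasks: {coir_avg:.5f}")  # 0.68684
```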
| ### Leaderboard Standing | |
| On the public [MTEB(Code, v1)](https://huggingface.co/spaces/mteb/leaderboard) | |
| leaderboard, the model's task-mean score (70.91, i.e. the 12-task macro average | |
| above ×100) places it as follows: | |
| * **#18 of 241 models overall**, ahead of many models that are an order of | |
| magnitude larger | |
| * **#6 of 155 among sub-1B-parameter dense single-vector models** — and the | |
| **smallest model in that top six**. The five models ranked above it are all | |
| ≈0.33–0.6B parameters (`F2LLM-v2-0.6B/330M`, `pplx-embed-v1-0.6b`, | |
| `C2LLM-0.5B`, `Qwen3-Embedding-0.6B`), i.e. 2–4× larger. | |
| * **#1 among ranked dense single-vector models under 300M parameters** (the | |
| leaderboard's small-model view) | |
| * **#2 in the same under-300M view once late-interaction / multi-vector models | |
| are included**, behind only `lightonai/LateOn-Code` (a multi-vector | |
| late-interaction model; see the | |
| [head-to-head below](#head-to-head-vs-lateon-code)) | |
| > **Reading the numbers fairly.** MTEB(Code, v1) reports a *zero-shot %* for | |
| > each model — the fraction of leaderboard tasks the model was *not* trained on. | |
| > `NightOwl-CodeEmbedding` is **8%** zero-shot: it was trained on most of these | |
| > task families, so its score reflects strong **in-domain** retrieval rather | |
| > than zero-shot transfer. Models marked **100%** (e.g. `embeddinggemma-300m`, | |
| > the `granite-embedding` r2 family, `Qwen3-Embedding`) are evaluated fully | |
| > out-of-domain, so a raw score comparison across rows with different | |
| > zero-shot % is not apples-to-apples. The fairest direct comparisons are to | |
| > other code-specialized models at similar zero-shot levels (e.g. | |
| > `LateOn-Code` at 8%, the `F2LLM` / `C2LLM` families at 8–58%). | |
| ### Comparison with similar-sized models | |
| The table below compares `NightOwl-CodeEmbedding` with other compact code / | |
| general embedding models on MTEB(Code, v1), with a size ladder of larger models | |
| for reference. Score is the leaderboard task mean (higher is better); the | |
| *Zero-shot* column is the share of tasks the model did not train on. | |
| | Model | Params | Type | Emb. dim | Max tokens | Zero-shot | MTEB(Code, v1) ↑ | | |
| | ---------------------------------------------------- | ------: | -------------- | -------------- | ---------: | --------: | ---------------: | | |
| | **`NightOwl-CodeEmbedding`** (this model) | 150.8M | single-vector | 768 | 1,024 | 8% | **70.91** | | |
| | `codefuse-ai/F2LLM-v2-160M` | 159M | single-vector | 640 | 40,960 | 58% | 70.38 | | |
| | `google/embeddinggemma-300m` | 308M | single-vector | 768 | 2,048 | 100% | 68.76 | | |
| | `codefuse-ai/F2LLM-v2-80M` | 80M | single-vector | 320 | 40,960 | 58% | 67.97 | | |
| | `ibm-granite/granite-embedding-311m-multilingual-r2` | 312M | single-vector | 768 | 8,192 | 100% | 63.84 | | |
| | _Late-interaction (multi-vector) reference_ | | | | | | | | |
| | `lightonai/LateOn-Code` | 149M | multi-vector | 128 (per-tok) | 2,048 | 8% | 74.12 | | |
| | _Larger single-vector reference (size ladder)_ | | | | | | | | |
| | `codefuse-ai/F2LLM-v2-0.6B` (#1 sub-1B) | 596M | single-vector | 1,024 | 40,960 | 58% | 77.41 | | |
| | `Qwen/Qwen3-Embedding-0.6B` | 596M | single-vector | 1,024 | 32,768 | 100% | 75.42 | | |
| | `codefuse-ai/F2LLM-v2-14B` (#1 overall) | 13.99B | single-vector | 5,120 | 40,960 | 58% | 80.75 | | |
| Takeaways: | |
| * Among compact **single-vector dense** models, `NightOwl-CodeEmbedding` is the | |
| strongest entry in the leaderboard's small-model view while also being one of | |
| the smallest, edging out `F2LLM-v2-160M` and clearly ahead of | |
| `embeddinggemma-300m`. | |
| * The sub-1B leaders (`F2LLM-v2-0.6B`, `Qwen3-Embedding-0.6B`) score ~4.5–6.5 | |
| points higher but have ~4× the parameter count and use larger embedding | |
| dimensions, which directly increases index size and inference cost. | |
| * The 14B model at the top of the overall board is ~10 points higher but ~93× | |
| larger, sitting in a different deployment cost regime entirely. | |
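The index-size point can be made concrete with a back-of-the-envelope calculation. The sketch below assumes a flat (uncompressed) float32 index and a hypothetical corpus of one million chunks; the embedding dimensions are taken from the comparison table, and `flat_index_bytes` is an illustrative helper.

```python
BYTES_PER_FLOAT32 = 4

def flat_index_bytes(num_vectors, dim, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw storage for one dense vector per document, with no compression."""
    return num_vectors * dim * bytes_per_value

num_chunks = 1_000_000  # hypothetical corpus size, not a figure from this card

models = [
    ("NightOwl-CodeEmbedding", 768),
    ("F2LLM-v2-0.6B", 1024),
    ("F2LLM-v2-14B", 5120),
]

for name, dim in models:
    gib = flat_index_bytes(num_chunks, dim) / 2**30
    print(f"{name}: dim={dim}, about {gib:.2f} GiB")
```

Quantized or compressed indexes would shrink all of these figures, but the ratio between models stays proportional to the embedding dimension.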
| ### Head-to-head vs LateOn-Code | |
| `lightonai/LateOn-Code` is the only model under 300M parameters that outranks | |
| `NightOwl-CodeEmbedding` once multi-vector models are included, so it is worth a | |
| closer look. It is a **ColBERT-style late-interaction** model (built with PyLate | |
| on ModernBERT-base): it stores **one 128-dimensional vector per token** and | |
| scores with the MaxSim operator, rather than a single 768-d vector per text. | |
| That buys accuracy at the cost of a larger index and a different retrieval path | |
| (PyLate + a PLAID index), whereas `NightOwl` is a drop-in single-vector | |
| `sentence-transformers` model. | |
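The difference between the two scoring paths can be sketched in a few lines of plain Python. This is an illustration of the MaxSim idea with toy 2-dimensional vectors, not PyLate's implementation; all helper names are made up for the example.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def single_vector_score(query_vec, doc_vec):
    """Dense retrieval: one vector per text, one dot product per pair."""
    return dot(query_vec, doc_vec)

def maxsim_score(query_token_vecs, doc_token_vecs):
    """Late interaction: for each query token take its best-matching
    document token, then sum those maxima over the query tokens."""
    return sum(
        max(dot(q, d) for d in doc_token_vecs)
        for q in query_token_vecs
    )

# Toy, already-normalized token vectors (hypothetical values).
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_tokens = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]

print(maxsim_score(query_tokens, doc_tokens))  # 2.0

# Storage per document: one vector for the dense model versus one vector
# per token for the late-interaction model (3 vectors in this toy case).
print(len(doc_tokens))  # 3
```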
| Per-task NDCG@10 (×100) on MTEB(Code, v1); both models are code-specialized and | |
| in-domain (8% zero-shot), so this is a like-for-like comparison. **Bold** marks | |
| the higher score on each task. | |
| | Task | NightOwl-CodeEmbedding | LateOn-Code (multi-vec) | | |
| | -------------------------- | ---------------------: | ----------------------: | | |
| | AppsRetrieval | 39.18 | **54.76** | | |
| | COIRCodeSearchNetRetrieval | 84.26 | **86.57** | | |
| | CodeEditSearchRetrieval | **74.81** | 64.99 | | |
| | CodeFeedbackMT | 76.69 | **82.22** | | |
| | CodeFeedbackST | 85.21 | **90.40** | | |
| | CodeSearchNetCCRetrieval | **91.81** | 89.32 | | |
| | CodeSearchNetRetrieval | 89.24 | **90.40** | | |
| | CodeTransOceanContest | 75.95 | **87.44** | | |
| | CodeTransOceanDL | 36.06 | **41.00** | | |
| | CosQA | 42.81 | **45.23** | | |
| | StackOverflowQA | 86.61 | **93.43** | | |
| | SyntheticText2SQL | **68.27** | 63.67 | | |
| | **Average** | 70.91 | **74.12** | | |
| `LateOn-Code` wins on average, driven mostly by AppsRetrieval and the | |
| feedback/translation/QA tasks. However, `NightOwl-CodeEmbedding` wins on three | |
| tasks that map directly to its design focus: | |
| * **CodeEditSearchRetrieval** (+9.8): matching edit intents to code changes — | |
| `NightOwl`'s dedicated code-edit training shows here. | |
| * **CodeSearchNetCCRetrieval** (+2.5): code-to-code / similar-function retrieval. | |
| * **SyntheticText2SQL** (+4.6): NL-to-SQL retrieval. | |
| So for single-vector code-edit and code-to-code retrieval specifically, | |
| `NightOwl` is competitive with or ahead of a higher-average multi-vector model, | |
| while keeping a standard dense-vector index. (LateOn-Code scores sourced from | |
| the model's | |
| [MTEB(Code, v1) table](https://huggingface.co/lightonai/LateOn-Code).) | |
| Because the benchmark suite consists of in-domain code retrieval tasks related | |
| to the model's training distribution, these results should not be interpreted | |
| as strictly zero-shot performance. | |
| ## Base Model: the NightOwl Backbone | |
| `NightOwl-CodeEmbedding` is fine-tuned from | |
| [`Shuu12121/NightOwl`](https://huggingface.co/Shuu12121/NightOwl), a | |
| ModernBERT-style code encoder that was **pre-trained from scratch** — including | |
| its own tokenizer — rather than adapted from a general-purpose checkpoint. The | |
| whole stack, from tokenization to the pre-training objective, is controlled for | |
| code. | |
| **Code-aware tokenizer.** NightOwl uses a custom 50,368-token BPE tokenizer in | |
| which whitespace is tokenized **independently** of adjacent words, so | |
| indentation is represented by its own tokens instead of being merged into | |
| "leading-whitespace + word" pieces. In code the same identifier recurs at many | |
| indentation depths; folding whitespace into those pieces would spend large parts | |
| of the vocabulary on near-duplicate "indent + token" variants. Keeping | |
| whitespace separate avoids that waste and lets the fixed vocabulary budget cover | |
| more genuinely distinct subwords, while still representing indentation faithfully | |
| — which matters for whitespace-significant languages such as Python. | |
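The vocabulary argument can be illustrated with a toy pre-tokenizer. The regex splits below are a simplification for illustration only and are not NightOwl's actual BPE tokenizer; the statements and indentation depths are made up.

```python
import re

def split_whitespace_separately(line):
    """Whitespace runs become their own pieces, independent of words."""
    return re.findall(r"\s+|\S+", line)

def split_whitespace_merged(line):
    """Leading whitespace is glued onto the word that follows it."""
    return re.findall(r"\s*\S+", line)

statements = ["return", "break", "continue", "pass"]
depths = [0, 4, 8]  # indentation in spaces
lines = [" " * depth + stmt for stmt in statements for depth in depths]

separate_vocab = {p for line in lines for p in split_whitespace_separately(line)}
merged_vocab = {p for line in lines for p in split_whitespace_merged(line)}

# Separate: 4 words + 2 indent pieces. Merged: one piece per (indent, word) pair.
print(len(separate_vocab), len(merged_vocab))  # 6 12
```

With separate whitespace the piece count grows additively (words plus indents), while the merged scheme grows multiplicatively (words times indents), which is the waste the paragraph above describes.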
| **Two-phase pre-training with line-level masking.** NightOwl is trained with | |
| masked-language modeling (`mlm_probability = 0.3`) in two phases: | |
| * *Phase 1 — mixed pre-training:* standard random-token MLM over code, natural | |
| language, and technical documentation (produces `NightOwl-Pre`). | |
| * *Phase 2 — code-only continuation:* **line-level MLM**, where entire | |
| source-code lines are masked instead of random tokens. This aligns the | |
| pre-training objective with code search and retrieval, where the unit of | |
| meaning is closer to a line or statement than an isolated token. The | |
| recommended `NightOwl` checkpoint is this Phase-2 result. | |
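A rough sketch of what line-level masking could look like is shown below. The sampling strategy (shuffle lines, mask whole lines until about 30% of tokens are hidden) and the whitespace tokenization are assumptions made for illustration; the card does not specify how lines are selected, and `mask_whole_lines` is a hypothetical helper.

```python
import random

MASK = "[MASK]"

def mask_whole_lines(source, mlm_probability=0.3, seed=0):
    """Mask entire lines until roughly `mlm_probability` of tokens are hidden."""
    rng = random.Random(seed)
    # Whitespace splitting stands in for the real subword tokenizer.
    lines = [line.split() for line in source.splitlines()]
    budget = mlm_probability * sum(len(tokens) for tokens in lines)

    candidates = [i for i, tokens in enumerate(lines) if tokens]
    rng.shuffle(candidates)

    chosen, masked_count = set(), 0
    for i in candidates:
        if masked_count >= budget:
            break
        chosen.add(i)
        masked_count += len(lines[i])

    masked = [[MASK] * len(tokens) if i in chosen else tokens
              for i, tokens in enumerate(lines)]
    return masked, chosen

example = "def area(w, h):\n    result = w * h\n    return result\nprint(area(2, 3))"
masked, chosen = mask_whole_lines(example)
for tokens in masked:
    print(" ".join(tokens))
```

Every line in the output is either fully masked or left untouched, in contrast to random-token MLM where masks are scattered across lines.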
| Backbone architecture (base): | |
| | Property | Value | | |
| | ------------------------------ | ----------------------------------------------------- | | |
| | Architecture | ModernBERT (alternating local/global attention, RoPE) | | |
| | Parameters | ≈150M | | |
| | `hidden_size` / layers / heads | 768 / 19 / 12 | | |
| | Vocabulary | 50,368 (custom code BPE) | | |
| | Max sequence length | 1,024 (Phase 1) → 2,048 (Phase 2) | | |
| Pre-training data mixes `bigcode/starcoder2data-extras` (Kaggle notebooks, | |
| StackOverflow threads, GitHub issues, technical documentation, …) with | |
| whole-file source from `Shuu12121/github-file-programs-dataset` across the eight | |
| supported languages (Python, JavaScript, TypeScript, Java, Go, Rust, Ruby, PHP). | |
| Long examples are split into chunks so all tokens are used rather than truncated. | |
| Evaluated as a backbone, independently of the retrieval training described in | |
| this card, NightOwl reaches **0.8436 average MRR** on MTEB | |
| `CodeSearchNetRetrieval` when fine-tuned under a fixed SentenceTransformer | |
| protocol, ahead of CodeBERT-base (0.7944), GraphCodeBERT-base (0.8325), | |
| UniXCoder-base (0.8003), and ModernBERT-base (0.8182) fine-tuned and evaluated | |
| the same way. `NightOwl-CodeEmbedding` builds the retrieval model described in | |
| this card on top of that backbone. | |
| ## Training | |
| The model was trained with `CachedMultipleNegativesRankingLoss` using | |
| bidirectional query-to-document and document-to-query objectives. | |
| | Property | Value | | |
| | -------------------------- | ----------------------------------------- | | |
| | Training samples | 2,534,400 | | |
| | Positives per anchor | 1 | | |
| | Negatives per anchor | 15 | | |
| | Loss | `CachedMultipleNegativesRankingLoss` | | |
| | Objective | Bidirectional retrieval training | | |
| | Hard-negative mining model | `Qwen/Qwen3-Embedding-0.6B` | | |
| | Epochs | 1 | | |
| | Learning rate | 6e-5 | | |
| | Batch size | 1024 | | |
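The idea behind a bidirectional ranking objective with hard negatives can be sketched as follows. This is a plain-Python illustration, not the `sentence-transformers` implementation of `CachedMultipleNegativesRankingLoss` (which adds gradient caching for large batches); the `scale=20.0` temperature and all helper names are assumptions for the example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cross_entropy(scores, target_index):
    """Negative log-softmax of the target score (numerically stable)."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[target_index]

def bidirectional_ranking_loss(anchors, positives, hard_negatives, scale=20.0):
    anchors = [normalize(a) for a in anchors]
    positives = [normalize(p) for p in positives]
    negatives = [normalize(n) for group in hard_negatives for n in group]

    losses = []
    # Query-to-document: each anchor must rank its own positive above all
    # other positives in the batch and above every hard negative.
    candidates = positives + negatives
    for i, a in enumerate(anchors):
        losses.append(cross_entropy([scale * dot(a, c) for c in candidates], i))
    # Document-to-query: each positive must rank its own anchor first.
    for i, p in enumerate(positives):
        losses.append(cross_entropy([scale * dot(p, a) for a in anchors], i))
    return sum(losses) / len(losses)

anchors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
positives = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
hard_negatives = [[[0.0, 0.0, 1.0]], [[0.0, 0.0, 1.0]]]  # 1 per anchor here, 15 in training

aligned = bidirectional_ranking_loss(anchors, positives, hard_negatives)
swapped = bidirectional_ranking_loss(anchors, positives[::-1], hard_negatives)
print(aligned < swapped)  # True
```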
| ### Training Data | |
| The training data is a mixture of: | |
| 1. **Public code-retrieval datasets** covering the following MTEB(Code, v1) | |
| task families (the ten CoIR tasks plus CodeSearchNetRetrieval): | |
| AppsRetrieval, COIRCodeSearchNetRetrieval, CodeFeedbackMT, CodeFeedbackST, | |
| CodeSearchNetCCRetrieval, CodeSearchNetRetrieval, CodeTransOceanContest, | |
| CodeTransOceanDL, CosQA, StackOverflowQA, and SyntheticText2SQL. | |
| 2. **Custom code-comment pair data** consisting of code snippets paired with | |
| natural-language description comments across the eight supported languages | |
| (the six CodeSearchNet languages plus Rust and TypeScript). | |
| 3. **Code-edit data** derived from `commitpackft`, pairing edit intents with | |
| code changes. | |
| All datasets were constructed as hard-negative retrieval datasets: for each | |
| anchor, one positive and fifteen hard negatives were used. Hard negatives were | |
| mined with | |
| [`Qwen/Qwen3-Embedding-0.6B`](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B), | |
| which retrieves semantically similar but non-matching candidates, producing | |
| negatives that are more difficult than random negatives. The mining model is | |
| used only during dataset construction and is not required at inference time. | |
| This setup is intended to improve discrimination between code snippets, | |
| programming questions, edit examples, and technically similar retrieval | |
| candidates. | |
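Hard-negative mining of this kind boils down to a nearest-neighbour search that skips the known positive. The sketch below assumes pre-computed, normalized embeddings from the mining model and uses toy 2-dimensional vectors; `mine_hard_negatives` is a hypothetical helper, and any additional filtering the author may have applied (for example, dropping likely false negatives) is not described in this card and is not modelled here.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mine_hard_negatives(anchor_vec, candidate_vecs, positive_index, num_negatives=15):
    """Return indices of the most similar candidates that are not the positive."""
    scored = [
        (dot(anchor_vec, vec), idx)
        for idx, vec in enumerate(candidate_vecs)
        if idx != positive_index
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [idx for _, idx in scored[:num_negatives]]

# Toy embeddings (hypothetical values); index 0 is the labelled positive.
anchor = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.9, 0.436], [0.0, 1.0], [0.8, 0.6]]

print(mine_hard_negatives(anchor, candidates, positive_index=0, num_negatives=2))  # [1, 3]
```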
| ### Data Decontamination | |
| To reduce benchmark contamination, the following overlaps were removed from | |
| the training data **before** training: | |
| * Overlaps between the custom code-comment pair data and the | |
| **CodeSearchNet test split** | |
| * Overlaps between the `commitpackft`-derived code-edit data and the | |
| **CodeEditSearchRetrieval** benchmark evaluation data | |
| For `CodeEditSearchRetrieval`, note that MTEB labels the evaluation split | |
| `train`. This refers only to the official split name available for the task; | |
| the evaluated examples were not included in this model's fine-tuning data. | |
| The reported score should therefore be interpreted as **in-domain | |
| generalization on held-out benchmark examples**, not as training-set | |
| performance — though, given the in-domain training distribution, also not as | |
| strictly zero-shot performance. | |
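One simple way to implement overlap removal is exact matching on a whitespace-normalized fingerprint, sketched below. The card does not describe the matching method actually used, so both the normalization and the helper names (`fingerprint`, `remove_overlaps`) are assumptions for illustration.

```python
import hashlib
import re

def fingerprint(code):
    """Hash of the text with all whitespace runs collapsed to single spaces."""
    normalized = re.sub(r"\s+", " ", code).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def remove_overlaps(training_examples, benchmark_examples):
    """Drop training examples whose fingerprint appears in the benchmark."""
    blocked = {fingerprint(example) for example in benchmark_examples}
    return [ex for ex in training_examples if fingerprint(ex) not in blocked]

training = [
    "def add(a, b):\n    return a + b",
    "def sub(a, b):\n    return a - b",
]
benchmark = ["def add(a, b):  return a + b"]  # same code, different whitespace

clean = remove_overlaps(training, benchmark)
print(len(clean))  # 1
```

Exact-match fingerprints miss near-duplicates such as renamed variables; fuzzier matching would catch more at the cost of removing more training data.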
| ## Intended Use | |
| This model is intended for code-related retrieval tasks such as: | |
| * Natural language to code search | |
| * Code-to-code retrieval and similar function search | |
| * Code-edit retrieval (matching edit intents to code changes) | |
| * Retrieval over programming Q&A and technical questions | |
| * Local semantic code search systems | |
| * RAG systems over codebases and developer documentation | |
| Example use cases include indexing functions, snippets, programming solutions, | |
| StackOverflow-style answers, code review examples, and edit-related code | |
| examples. | |
| ## Limitations | |
| * The model is specialized for code-related retrieval and may underperform | |
| general-purpose text embedding models on unrelated natural language tasks. | |
| * Inputs longer than 1,024 tokens are truncated. This is a shorter context | |
| window than several models it competes with (e.g. the 8K+ token `F2LLM` and | |
| `granite` models), so very long files must be chunked. | |
| * MTEB(Code, v1) is largely in-domain for this model (8% zero-shot). On code | |
| domains, query styles, or languages far from the training distribution, | |
| expect lower performance than the leaderboard numbers suggest. | |
| * Performance may vary by programming language, query style, and the | |
| granularity of indexed code chunks; languages outside the eight supported | |
| languages are untested. | |
| * The model uses dense single-vector embeddings. For very fine-grained | |
| matching, rerankers or late-interaction models (such as `LateOn-Code`) may | |
| score higher on average, at the cost of extra inference work or a larger | |
| index and a non-standard retrieval path. As the | |
| [head-to-head](#head-to-head-vs-lateon-code) shows, single-vector `NightOwl` | |
| still leads on code-edit and code-to-code retrieval. | |
| ## Recommended Indexing Settings | |
| Encode both queries and documents with normalized embeddings: | |
| ```python | |
| embeddings = model.encode(texts, normalize_embeddings=True) | |
| ``` | |
| With normalized embeddings, dot product is equivalent to cosine similarity. | |
| For codebase search, indexing function-level or class-level chunks is usually | |
| recommended. Very long files may exceed the 1,024-token context limit and | |
| should be split into smaller semantic chunks. | |
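For Python sources, function-level and class-level chunks can be extracted with the standard-library `ast` module, as in the sketch below. It handles only top-level definitions in Python files; other languages would need their own parser, and `function_level_chunks` is an illustrative helper name.

```python
import ast

def function_level_chunks(source):
    """Return the source text of each top-level function or class."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append(ast.get_source_segment(source, node))
    return chunks

example_source = '''
import json

def load(path):
    with open(path) as f:
        return json.load(f)

def save(path, data):
    with open(path, "w") as f:
        json.dump(data, f)

class Cache:
    def __init__(self):
        self.items = {}
'''

chunks = function_level_chunks(example_source)
for chunk in chunks:
    print(chunk.splitlines()[0])
```

The resulting list can be passed to `model.encode(chunks, normalize_embeddings=True)`; chunks that still exceed 1,024 tokens would need further splitting.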
| ## Citation | |
| If you use this model, please cite it together with the base model and | |
| Sentence Transformers. | |
| ```bibtex | |
| @misc{nightowl_codeembedding, | |
| title = {NightOwl-CodeEmbedding}, | |
| author = {Shuu12121}, | |
| year = {2026}, | |
| publisher = {Hugging Face}, | |
| url = {https://huggingface.co/Shuu12121/NightOwl-CodeEmbedding} | |
| } | |
| ``` |