Shuu12121/NightOwl-CodeEmbedding

@@ -17,14 +17,39 @@ datasets:
 - Shuu12121/codeedit_hard_negative_datasets_kd
 ---
-# NightOwl CodeEmbedding
-`NightOwl-CodeEmbedding` is a 768-dimensional dense embedding model specialized
-for code retrieval, code-edit retrieval, and technical question answering. It
-is fine-tuned from [`Shuu12121/NightOwl`](https://huggingface.co/Shuu12121/NightOwl)
-and uses CLS pooling with cosine similarity.
-The model does not require query or document prefixes.
 ## Usage
@@ -39,92 +64,188 @@ documents = [
     "def average(values): return sum(values) / len(values)",
 ]
-query_embeddings = model.encode(queries, normalize_embeddings=True)
-document_embeddings = model.encode(documents, normalize_embeddings=True)
-scores = query_embeddings @ document_embeddings.T
 print(scores)
 ```
 ## Model Details
-| Property | Value |
-|---|---|
-| Base model | `Shuu12121/NightOwl` |
-| Architecture | ModernBERT |
-| Parameters | 150,779,136 |
-| Embedding dimension | 768 |
-| Pooling | CLS |
-| Maximum sequence length | 1,024 tokens |
-| Similarity | Cosine |
-| Query/document prefixes | None |
-| Weight dtype | FP32 |
-| Weight memory | 575 MiB |
-| License | Apache-2.0 |
 ## MTEB Results
-The model was evaluated using:
-- Model revision: `c7c8a57b9539297e192d5cf39b9aecf1fb376edd`
-- MTEB version: `2.15.1`
-- Metric: NDCG@10
-- Hardware: NVIDIA GeForce RTX 5090
-- Batch size: 64
-Multi-subset task scores are macro averages. `CodeEditSearchRetrieval` uses its
-official `train` evaluation split; the other tasks use `test`.
-| Task | Split | NDCG@10 |
-|---|---:|---:|
-| AppsRetrieval | test | 0.39177 |
-| COIRCodeSearchNetRetrieval | test | 0.84264 |
-| CodeEditSearchRetrieval | train | 0.74808 |
-| CodeFeedbackMT | test | 0.76690 |
-| CodeFeedbackST | test | 0.85207 |
-| CodeSearchNetCCRetrieval | test | 0.91805 |
-| CodeSearchNetRetrieval | test | 0.89239 |
-| CodeTransOceanContest | test | 0.75953 |
-| CodeTransOceanDL | test | 0.36057 |
-| CosQA | test | 0.42810 |
-| StackOverflowQA | test | 0.86608 |
-| SyntheticText2SQL | test | 0.68266 |
-| **Macro average (all 12 tasks)** | | **0.70907** |
-| **CoIR macro average (10 tasks)** | | **0.68684** |
 ## Training
 The model was trained with `CachedMultipleNegativesRankingLoss` using
-bidirectional query-to-document and document-to-query objectives. The generated
-training metadata reports 2,534,400 training samples with one positive and
-fifteen negatives per anchor.
-The training data covers the following MTEB task families:
-- `AppsRetrieval`
-- `COIRCodeSearchNetRetrieval`
-- `CodeFeedbackMT`
-- `CodeFeedbackST`
-- `CodeSearchNetCCRetrieval`
-- `CodeSearchNetRetrieval`
-- `CodeTransOceanContest`
-- `CodeTransOceanDL`
-- `CosQA`
-- `StackOverflowQA`
-- `SyntheticText2SQL`
-`CodeEditSearchRetrieval` was evaluated separately and is not listed as a
-training dataset in the published model metadata.
 ## Limitations
-- The model is specialized for code-related retrieval and may underperform
-  general-purpose text embedding models on unrelated domains.
-- Inputs longer than 1,024 tokens are truncated.
-- Benchmark scores include in-domain tasks related to the training data and
-  should not be interpreted as strictly zero-shot results.
 ## Citation
-If you use this model, cite Sentence Transformers and the base model where
-appropriate.

 - Shuu12121/codeedit_hard_negative_datasets_kd
 ---
+# NightOwl-CodeEmbedding 🦉
+`NightOwl-CodeEmbedding` is a compact 768-dimensional dense embedding model
+specialized for code retrieval, code-edit retrieval, and technical question
+answering.
+The model is fine-tuned from
+[`Shuu12121/NightOwl`](https://huggingface.co/Shuu12121/NightOwl), a
+ModernBERT-based code model. It uses CLS pooling with cosine similarity and
+does **not** require `query:` / `passage:` style prefixes.
+## Highlights
+* Compact (150.8M parameters), reaching a CoIR macro-average NDCG@10 of 0.68684 over 10 tasks
+* Covers **eight programming languages**, including Rust and TypeScript in
+  addition to the six CodeSearchNet languages
+* Handles a wide range of code retrieval scenarios: NL-to-code search,
+  code-to-code retrieval, **code-edit retrieval**, and technical QA
+* Trained with hard negatives mined by `Qwen/Qwen3-Embedding-0.6B`
+  (15 hard negatives per anchor)
+* Decontaminated against CodeSearchNet test splits and the
+  CodeEditSearchRetrieval benchmark (see [Data Decontamination](#data-decontamination))
+* Drop-in compatible with `sentence-transformers`, Apache-2.0 license
+## Supported Languages
+The training data covers the six CodeSearchNet languages plus two additional
+languages:
+* Go, Java, JavaScript, PHP, Python, Ruby (CodeSearchNet languages)
+* **Rust, TypeScript** (additional)
+Performance on languages outside this set is not guaranteed and may vary.
 ## Usage
     "def average(values): return sum(values) / len(values)",
 ]
+query_embeddings = model.encode(queries)
+document_embeddings = model.encode(documents)
+# Cosine similarity (embeddings are normalized internally by similarity())
+scores = model.similarity(query_embeddings, document_embeddings)
 print(scores)
 ```
 ## Model Details
+| Property                | Value                |
+| ----------------------- | -------------------- |
+| Base model              | `Shuu12121/NightOwl` |
+| Architecture            | ModernBERT           |
+| Parameters              | 150,779,136          |
+| Embedding dimension     | 768                  |
+| Pooling                 | CLS pooling          |
+| Maximum sequence length | 1,024 tokens         |
+| Similarity              | Cosine similarity    |
+| Query/document prefixes | Not required         |
+| Weight dtype            | FP32                 |
+| Weight memory           | 575 MiB              |
+| License                 | Apache-2.0           |
 ## MTEB Results
+The model was evaluated with MTEB on code-related retrieval and technical QA
+tasks.
+Evaluation setup:
+* Model revision: `c7c8a57b9539297e192d5cf39b9aecf1fb376edd`
+* MTEB version: `2.15.1`
+* Metric: `NDCG@10`
+* Hardware: NVIDIA GeForce RTX 5090
+* Batch size: 64
+Multi-subset task scores are reported as macro averages.
+| Task                             |   Split |     NDCG@10 |
+| -------------------------------- | ------: | ----------: |
+| AppsRetrieval                    |    test |     0.39177 |
+| COIRCodeSearchNetRetrieval       |    test |     0.84264 |
+| CodeEditSearchRetrieval          |  train¹ |     0.74808 |
+| CodeFeedbackMT                   |    test |     0.76690 |
+| CodeFeedbackST                   |    test |     0.85207 |
+| CodeSearchNetCCRetrieval         |    test |     0.91805 |
+| CodeSearchNetRetrieval           |    test |     0.89239 |
+| CodeTransOceanContest            |    test |     0.75953 |
+| CodeTransOceanDL                 |    test |     0.36057 |
+| CosQA                            |    test |     0.42810 |
+| StackOverflowQA                  |    test |     0.86608 |
+| SyntheticText2SQL                |    test |     0.68266 |
+| **Macro average, all 12 tasks**  |         | **0.70907** |
+| **CoIR macro average, 10 tasks** |         | **0.68684** |
+¹ `CodeEditSearchRetrieval` does not provide a standard `test` split in MTEB,
+so the official `train` split is used for evaluation. These examples were
+**not** used for fine-tuning. See
+[Data Decontamination](#data-decontamination) for details.
+<!-- TODO: Add a comparison table with the base NightOwl model and/or
+     similar-sized code embedding models (e.g. jina-embeddings-v2-base-code,
+     CodeXEmbed) to give readers a reference point. -->
+Because the benchmark suite consists of in-domain code retrieval tasks related
+to the model's training distribution, these results should not be interpreted
+as strictly zero-shot performance.
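The two macro averages in the table can be reproduced from the per-task scores. The sketch below copies the NDCG@10 values from the table and assumes that the 10-task CoIR subset is the full list minus `CodeEditSearchRetrieval` and `CodeSearchNetRetrieval`; the names `NON_COIR` and `macro_average` are illustrative and not part of MTEB.

```python
# NDCG@10 per task, copied from the results table.
scores = {
    "AppsRetrieval": 0.39177,
    "COIRCodeSearchNetRetrieval": 0.84264,
    "CodeEditSearchRetrieval": 0.74808,
    "CodeFeedbackMT": 0.76690,
    "CodeFeedbackST": 0.85207,
    "CodeSearchNetCCRetrieval": 0.91805,
    "CodeSearchNetRetrieval": 0.89239,
    "CodeTransOceanContest": 0.75953,
    "CodeTransOceanDL": 0.36057,
    "CosQA": 0.42810,
    "StackOverflowQA": 0.86608,
    "SyntheticText2SQL": 0.68266,
}

# Assumption: these two tasks are the ones outside the 10-task CoIR subset.
NON_COIR = {"CodeEditSearchRetrieval", "CodeSearchNetRetrieval"}


def macro_average(values):
    """Unweighted mean: every task counts equally, regardless of its size."""
    values = list(values)
    return sum(values) / len(values)


all_avg = macro_average(scores.values())
coir_avg = macro_average(v for k, v in scores.items() if k not in NON_COIR)

print(f"all 12 tasks:  {all_avg:.5f}")   # 0.70907
print(f"CoIR 10 tasks: {coir_avg:.5f}")  # 0.68684
```

Because the average is unweighted, low-scoring tasks such as `CodeTransOceanDL` pull the headline number down as much as any large task does.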
 ## Training
 The model was trained with `CachedMultipleNegativesRankingLoss` using
+bidirectional query-to-document and document-to-query objectives.
+| Property                   | Value                                     |
+| -------------------------- | ----------------------------------------- |
+| Training samples           | 2,534,400                                 |
+| Positives per anchor       | 1                                         |
+| Negatives per anchor       | 15                                        |
+| Loss                       | `CachedMultipleNegativesRankingLoss`      |
+| Objective                  | Bidirectional retrieval training          |
+| Hard-negative mining model | `Qwen/Qwen3-Embedding-0.6B`               |
+| Epochs                     | 1                                         |
+| Learning rate              | 6e-5                                      |
+| Batch size                 | 1,024                                     |
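To make the bidirectional objective concrete, here is a minimal pure-Python sketch of a symmetric in-batch contrastive loss over a query-by-document similarity matrix. It is not the `sentence-transformers` implementation: gradient caching and the fifteen hard negatives are left out, and the `scale` value of 20.0 and the equal weighting of the two directions are assumptions made for illustration.

```python
import math


def cross_entropy_diagonal(sim, scale=20.0):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [scale * s for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]
    return total / len(sim)


def bidirectional_loss(sim, scale=20.0):
    """Average of the query-to-document and document-to-query losses.

    `sim` is a square matrix: sim[i][j] is the cosine similarity between
    query i and document j, and document i is the positive for query i.
    """
    transposed = [list(col) for col in zip(*sim)]
    q2d = cross_entropy_diagonal(sim, scale)         # rows: each query ranks documents
    d2q = cross_entropy_diagonal(transposed, scale)  # columns: each document ranks queries
    return 0.5 * (q2d + d2q)


# Toy batch of three pairs: matching pairs score highest in `aligned`,
# while the first two pairs are swapped in `confused`.
aligned = [[0.9, 0.1, 0.2], [0.1, 0.8, 0.3], [0.2, 0.1, 0.7]]
confused = [[0.1, 0.9, 0.2], [0.8, 0.1, 0.3], [0.2, 0.1, 0.7]]

print(bidirectional_loss(aligned) < bidirectional_loss(confused))  # True
```

The loss is lower when each query is most similar to its own document and each document is most similar to its own query, which is the behavior both directions reward.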
+### Training Data
+The training data is a mixture of:
+1. **Public code-retrieval datasets** covering the following MTEB task
+   families: AppsRetrieval, COIRCodeSearchNetRetrieval, CodeFeedbackMT,
+   CodeFeedbackST, CodeSearchNetCCRetrieval, CodeSearchNetRetrieval,
+   CodeTransOceanContest, CodeTransOceanDL, CosQA, StackOverflowQA, and
+   SyntheticText2SQL.
+2. **Custom code-comment pair data** consisting of code snippets paired with
+   natural-language description comments across the eight supported languages
+   (the six CodeSearchNet languages plus Rust and TypeScript).
+3. **Code-edit data** derived from `commitpackft`, pairing edit intents with
+   code changes.
+All datasets were constructed as hard-negative retrieval datasets: for each
+anchor, one positive and fifteen hard negatives were used. Hard negatives were
+mined with
+[`Qwen/Qwen3-Embedding-0.6B`](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B),
+which retrieves semantically similar but non-matching candidates, producing
+negatives that are more difficult than random negatives. The mining model is
+used only during dataset construction and is not required at inference time.
+This setup is intended to improve discrimination between code snippets,
+programming questions, edit examples, and technically similar retrieval
+candidates.
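The mining step described above can be sketched as a nearest-neighbor search that excludes the known positive. The example below uses hand-written toy vectors in place of real `Qwen/Qwen3-Embedding-0.6B` embeddings; the function name `mine_hard_negatives`, the document ids, and the vectors are hypothetical, and the actual mining pipeline may filter candidates differently.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mine_hard_negatives(anchor_vec, positive_id, corpus, k=15):
    """Return the ids of the k candidates most similar to the anchor,
    skipping the known positive."""
    candidates = [doc_id for doc_id in corpus if doc_id != positive_id]
    candidates.sort(key=lambda doc_id: cosine(anchor_vec, corpus[doc_id]), reverse=True)
    return candidates[:k]


# Toy embeddings (hypothetical values standing in for the mining model's output).
corpus = {
    "sum_list": [0.9, 0.1, 0.0],
    "mean_list": [0.8, 0.3, 0.1],
    "max_list": [0.7, 0.2, 0.4],
    "parse_json": [0.0, 0.1, 0.9],
}
anchor = [1.0, 0.1, 0.0]  # stands in for a query like "add up all numbers"

negatives = mine_hard_negatives(anchor, "sum_list", corpus, k=2)
print(negatives)  # ['mean_list', 'max_list']
```

The unrelated `parse_json` snippet is ranked last, so with a small `k` it never becomes a negative; the selected negatives are the near misses that a random sampler would rarely pick.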
+### Data Decontamination
+To reduce benchmark contamination, the following overlaps were removed from
+the training data **before** training:
+* Overlaps between the custom code-comment pair data and the
+  **CodeSearchNet test split**
+* Overlaps between the `commitpackft`-derived code-edit data and the
+  **CodeEditSearchRetrieval** benchmark evaluation data
+For `CodeEditSearchRetrieval`, note that MTEB labels the evaluation split
+`train`. This refers only to the official split name available for the task;
+the evaluated examples were not included in this model's fine-tuning data.
+The reported score should therefore be interpreted as **in-domain
+generalization on held-out benchmark examples**, not as training-set
+performance. Given the in-domain training distribution, it should also not
+be read as strictly zero-shot performance.
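The card does not describe how overlaps were detected. One simple approach, assumed here purely for illustration, is exact matching after whitespace normalization: fingerprint every benchmark example, then drop any training example with the same fingerprint. The helper names `fingerprint` and `decontaminate` are hypothetical.

```python
import hashlib


def fingerprint(text):
    """Hash of the text with all whitespace runs collapsed to single spaces."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def decontaminate(train_examples, benchmark_examples):
    """Keep only training examples whose fingerprint is absent from the benchmark."""
    blocked = {fingerprint(text) for text in benchmark_examples}
    return [text for text in train_examples if fingerprint(text) not in blocked]


benchmark = ["def add(a, b):\n    return a + b"]
train = [
    "def add(a, b):\n        return a + b",  # same code, different indentation
    "def sub(a, b):\n    return a - b",
]

clean = decontaminate(train, benchmark)
print(len(clean))  # 1
```

Exact matching of this kind misses near-duplicates such as renamed variables; a stricter filter would need fuzzy or token-level matching.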
+## Intended Use
+This model is intended for code-related retrieval tasks such as:
+* Natural language to code search
+* Code-to-code retrieval and similar function search
+* Code-edit retrieval (matching edit intents to code changes)
+* Retrieval over programming Q&A and technical questions
+* Local semantic code search systems
+* RAG systems over codebases and developer documentation
+Example use cases include indexing functions, snippets, programming solutions,
+StackOverflow-style answers, code review examples, and edit-related code
+examples.
 ## Limitations
+* The model is specialized for code-related retrieval and may underperform
+  general-purpose text embedding models on unrelated natural language tasks.
+* Inputs longer than 1,024 tokens are truncated.
+* Performance may vary by programming language, query style, and the
+  granularity of indexed code chunks; languages outside the eight supported
+  languages are untested.
+* The model uses dense single-vector embeddings. For very fine-grained
+  matching, rerankers or late-interaction models may provide better precision.
+## Recommended Indexing Settings
+Encode both queries and documents with normalized embeddings:
+```python
+embeddings = model.encode(texts, normalize_embeddings=True)
+```
+With normalized embeddings, dot product is equivalent to cosine similarity.
+For codebase search, indexing function-level or class-level chunks is usually
+recommended. Very long files may exceed the 1,024-token context limit and
+should be split into smaller semantic chunks.
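For Python sources, function-level and class-level chunks can be extracted with the standard `ast` module. This is a minimal sketch under the assumption that top-level definitions are a suitable chunk boundary; `function_level_chunks` is a hypothetical helper, and checking each chunk against the 1,024-token limit would additionally require the model's tokenizer, which is not shown.

```python
import ast


def function_level_chunks(source):
    """Yield (name, code) for each top-level function and class in `source`."""
    tree = ast.parse(source)
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment returns the exact text of the definition
            # (decorator lines above the definition are not included).
            yield node.name, ast.get_source_segment(source, node)


source = '''import math

def area(r):
    return math.pi * r * r

class Circle:
    def __init__(self, r):
        self.r = r
'''

chunks = list(function_level_chunks(source))
for name, code in chunks:
    print(name, len(code.splitlines()))
```

Each resulting chunk can then be passed to `model.encode` as one document, so a query retrieves a single function or class instead of a whole file.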
 ## Citation
+If you use this model, please cite it together with the base model and
+Sentence Transformers.
+```bibtex
+@misc{nightowl_codeembedding,
+  title = {NightOwl-CodeEmbedding},
+  author = {Shuu12121},
+  year = {2026},
+  publisher = {Hugging Face},
+  url = {https://huggingface.co/Shuu12121/NightOwl-CodeEmbedding}
+}
+```