Update README.md
README.md
CHANGED
|
@@ -17,14 +17,39 @@ datasets:
|
|
| 17 |
- Shuu12121/codeedit_hard_negative_datasets_kd
|
| 18 |
---
|
| 19 |
|
| 20 |
-
# NightOwl
|
| 21 |
|
| 22 |
-
`NightOwl-CodeEmbedding` is a 768-dimensional dense embedding model
|
| 23 |
-
for code retrieval, code-edit retrieval, and technical question
|
| 24 |
-
|
| 25 |
-
and uses CLS pooling with cosine similarity.
|
| 26 |
|
| 27 |
-
The model
|
| 28 |
|
| 29 |
## Usage
|
| 30 |
|
|
@@ -39,92 +64,188 @@ documents = [
|
|
| 39 |
"def average(values): return sum(values) / len(values)",
|
| 40 |
]
|
| 41 |
|
| 42 |
-
query_embeddings = model.encode(queries
|
| 43 |
-
document_embeddings = model.encode(documents
|
| 44 |
|
| 45 |
-
|
|
|
|
| 46 |
print(scores)
|
| 47 |
```
|
| 48 |
|
| 49 |
## Model Details
|
| 50 |
|
| 51 |
-
| Property
|
| 52 |
-
|---|---|
|
| 53 |
-
| Base model
|
| 54 |
-
| Architecture
|
| 55 |
-
| Parameters
|
| 56 |
-
| Embedding dimension
|
| 57 |
-
| Pooling
|
| 58 |
-
| Maximum sequence length | 1,024 tokens
|
| 59 |
-
| Similarity
|
| 60 |
-
| Query/document prefixes |
|
| 61 |
-
| Weight dtype
|
| 62 |
-
| Weight memory
|
| 63 |
-
| License
|
| 64 |
|
| 65 |
## MTEB Results
|
| 66 |
|
| 67 |
-
The model was evaluated
|
| 68 |
-
|
| 69 |
-
|
| 70 |
-
|
| 71 |
-
|
| 72 |
-
|
| 73 |
-
|
| 74 |
-
|
| 75 |
-
|
| 76 |
-
|
| 77 |
-
|
| 78 |
-
|
| 79 |
-
|
| 80 |
-
|
|
| 81 |
-
|
|
| 82 |
-
|
|
| 83 |
-
|
|
| 84 |
-
|
|
| 85 |
-
|
|
| 86 |
-
|
|
| 87 |
-
|
|
| 88 |
-
|
|
| 89 |
-
|
|
| 90 |
-
|
|
| 91 |
-
|
|
| 92 |
-
|
|
| 93 |
-
|
|
| 94 |
|
| 95 |
## Training
|
| 96 |
|
| 97 |
The model was trained with `CachedMultipleNegativesRankingLoss` using
|
| 98 |
-
bidirectional query-to-document and document-to-query objectives.
|
| 99 |
-
|
| 100 |
-
|
| 101 |
-
|
| 102 |
-
|
| 103 |
-
|
| 104 |
-
|
| 105 |
-
|
| 106 |
-
|
| 107 |
-
- `
|
| 108 |
-
|
| 109 |
-
|
| 110 |
-
|
| 111 |
-
|
| 112 |
-
|
| 113 |
-
|
| 114 |
-
|
| 115 |
-
|
| 116 |
-
|
| 117 |
-
|
| 118 |
|
| 119 |
## Limitations
|
| 120 |
|
| 121 |
-
|
| 122 |
-
general-purpose text embedding models on unrelated
|
| 123 |
-
|
| 124 |
-
|
| 125 |
-
|
| 126 |
|
| 127 |
## Citation
|
| 128 |
|
| 129 |
-
If you use this model, cite
|
| 130 |
-
|
- Shuu12121/codeedit_hard_negative_datasets_kd
---

# NightOwl-CodeEmbedding 🦉

`NightOwl-CodeEmbedding` is a compact 768-dimensional dense embedding model
specialized for code retrieval, code-edit retrieval, and technical question
answering.

The model is fine-tuned from
[`Shuu12121/NightOwl`](https://huggingface.co/Shuu12121/NightOwl), a
ModernBERT-based code model. It uses CLS pooling with cosine similarity and
does **not** require `query:` / `passage:` style prefixes.

## Highlights

* Compact (150.8M parameters) yet competitive on CoIR-style code retrieval benchmarks
* Covers **eight programming languages**, including Rust and TypeScript in
  addition to the six CodeSearchNet languages
* Handles a wide range of code retrieval scenarios: NL-to-code search,
  code-to-code retrieval, **code-edit retrieval**, and technical QA
* Trained with hard negatives mined by `Qwen/Qwen3-Embedding-0.6B`
  (15 hard negatives per anchor)
* Decontaminated against CodeSearchNet test splits and the
  CodeEditSearchRetrieval benchmark (see [Data Decontamination](#data-decontamination))
* Drop-in compatible with `sentence-transformers` and released under the
  Apache-2.0 license

## Supported Languages

The training data covers the six CodeSearchNet languages plus two additional
languages:

* Go, Java, JavaScript, PHP, Python, Ruby (CodeSearchNet languages)
* **Rust, TypeScript** (additional)

Performance on languages outside this set is not guaranteed and may vary.

## Usage

```python
# Earlier lines of this example are outside the diff hunk.
documents = [
    # ...
    "def average(values): return sum(values) / len(values)",
]

query_embeddings = model.encode(queries)
document_embeddings = model.encode(documents)

# Cosine similarity (embeddings are normalized internally by similarity())
scores = model.similarity(query_embeddings, document_embeddings)
print(scores)
```

## Model Details

| Property                | Value                |
| ----------------------- | -------------------- |
| Base model              | `Shuu12121/NightOwl` |
| Architecture            | ModernBERT           |
| Parameters              | 150,779,136          |
| Embedding dimension     | 768                  |
| Pooling                 | CLS pooling          |
| Maximum sequence length | 1,024 tokens         |
| Similarity              | Cosine similarity    |
| Query/document prefixes | Not required         |
| Weight dtype            | FP32                 |
| Weight memory           | 575 MiB              |
| License                 | Apache-2.0           |
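
The weight memory figure follows directly from the parameter count and the FP32 dtype in the table. The short check below is only a sanity calculation; the helper name `weight_memory_mib` is hypothetical, and the result covers raw weight storage only, not activations or framework overhead.

```python
# Parameter count and dtype are taken from the Model Details table.
NUM_PARAMETERS = 150_779_136
BYTES_PER_FP32 = 4


def weight_memory_mib(num_parameters: int, bytes_per_parameter: int) -> float:
    """Return raw weight storage in MiB (1 MiB = 2**20 bytes)."""
    return num_parameters * bytes_per_parameter / 2**20


fp32_mib = weight_memory_mib(NUM_PARAMETERS, BYTES_PER_FP32)
print(f"FP32 weights: {fp32_mib:.1f} MiB")  # FP32 weights: 575.2 MiB
```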

## MTEB Results

The model was evaluated with MTEB on code-related retrieval and technical QA
tasks.

Evaluation setup:

* Model revision: `c7c8a57b9539297e192d5cf39b9aecf1fb376edd`
* MTEB version: `2.15.1`
* Metric: `NDCG@10`
* Hardware: NVIDIA GeForce RTX 5090
* Batch size: 64

Multi-subset task scores are reported as macro averages.

| Task                             | Split   | NDCG@10     |
| -------------------------------- | ------: | ----------: |
| AppsRetrieval                    | test    | 0.39177     |
| COIRCodeSearchNetRetrieval       | test    | 0.84264     |
| CodeEditSearchRetrieval          | train¹  | 0.74808     |
| CodeFeedbackMT                   | test    | 0.76690     |
| CodeFeedbackST                   | test    | 0.85207     |
| CodeSearchNetCCRetrieval         | test    | 0.91805     |
| CodeSearchNetRetrieval           | test    | 0.89239     |
| CodeTransOceanContest            | test    | 0.75953     |
| CodeTransOceanDL                 | test    | 0.36057     |
| CosQA                            | test    | 0.42810     |
| StackOverflowQA                  | test    | 0.86608     |
| SyntheticText2SQL                | test    | 0.68266     |
| **Macro average, all 12 tasks**  |         | **0.70907** |
| **CoIR macro average, 10 tasks** |         | **0.68684** |

¹ `CodeEditSearchRetrieval` does not provide a standard `test` split in MTEB,
so the official `train` split is used for evaluation. These examples were
**not** used for fine-tuning. See
[Data Decontamination](#data-decontamination) for details.
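
The two macro averages can be reproduced from the per-task scores. The sketch below copies the scores from the table; which ten tasks make up the CoIR subset is not stated there, so the set `NON_COIR_TASKS` is an assumption (it excludes `CodeEditSearchRetrieval` and `CodeSearchNetRetrieval`, and under that assumption the reported value is recovered).

```python
from statistics import mean

# NDCG@10 scores copied from the results table above.
scores = {
    "AppsRetrieval": 0.39177,
    "COIRCodeSearchNetRetrieval": 0.84264,
    "CodeEditSearchRetrieval": 0.74808,
    "CodeFeedbackMT": 0.76690,
    "CodeFeedbackST": 0.85207,
    "CodeSearchNetCCRetrieval": 0.91805,
    "CodeSearchNetRetrieval": 0.89239,
    "CodeTransOceanContest": 0.75953,
    "CodeTransOceanDL": 0.36057,
    "CosQA": 0.42810,
    "StackOverflowQA": 0.86608,
    "SyntheticText2SQL": 0.68266,
}

# Assumption: these two tasks are the ones left out of the 10-task CoIR average.
NON_COIR_TASKS = {"CodeEditSearchRetrieval", "CodeSearchNetRetrieval"}

all_avg = mean(scores.values())
coir_avg = mean(v for task, v in scores.items() if task not in NON_COIR_TASKS)

print(f"all 12 tasks: {all_avg:.5f}")   # all 12 tasks: 0.70907
print(f"CoIR 10 tasks: {coir_avg:.5f}")  # CoIR 10 tasks: 0.68684
```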

<!-- TODO: Add a comparison table with the base NightOwl model and/or
similar-sized code embedding models (e.g. jina-embeddings-v2-base-code,
CodeXEmbed) to give readers a reference point. -->

Because the benchmark suite consists of in-domain code retrieval tasks related
to the model's training distribution, these results should not be interpreted
as strictly zero-shot performance.

## Training

The model was trained with `CachedMultipleNegativesRankingLoss` using
bidirectional query-to-document and document-to-query objectives.

| Property                   | Value                                     |
| -------------------------- | ----------------------------------------- |
| Training samples           | 2,534,400                                 |
| Positives per anchor       | 1                                         |
| Negatives per anchor       | 15                                        |
| Loss                       | `CachedMultipleNegativesRankingLoss`      |
| Objective                  | Bidirectional retrieval training          |
| Hard-negative mining model | `Qwen/Qwen3-Embedding-0.6B`               |
| Epochs                     | 1                                         |
| Learning rate              | 6e-5                                      |
| Batch size                 | 1024                                      |
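
To make the bidirectional objective concrete, here is a minimal NumPy sketch of a symmetric in-batch ranking loss. It is an illustration under stated assumptions, not the training code for this model: the function names and the `scale` value of 20 are illustrative choices, the fifteen hard negatives per anchor are omitted, and the gradient caching that `CachedMultipleNegativesRankingLoss` relies on for large batches is not modeled.

```python
import numpy as np


def log_softmax(x: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def bidirectional_ranking_loss(queries, documents, scale=20.0):
    """Symmetric in-batch ranking loss; row i of each array is a matching pair.

    `scale` is an illustrative temperature, not a value taken from the model card.
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = documents / np.linalg.norm(documents, axis=1, keepdims=True)
    logits = scale * (q @ d.T)  # scaled cosine similarities, shape (batch, batch)
    idx = np.arange(len(q))
    # Query-to-document: each query must rank its own document first in its row.
    q2d = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Document-to-query: each document must rank its own query first in its column.
    d2q = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (q2d + d2q)


rng = np.random.default_rng(0)
docs = rng.normal(size=(4, 8))
queries = docs + 0.1 * rng.normal(size=(4, 8))  # each query is a noisy copy of its document
print(bidirectional_ranking_loss(queries, docs))
```

If the batch size in the table counts samples per optimizer step, one epoch over 2,534,400 samples corresponds to 2,534,400 / 1024 = 2,475 steps.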

### Training Data

The training data is a mixture of:

1. **Public code-retrieval datasets** covering the following CoIR task
   families: AppsRetrieval, COIRCodeSearchNetRetrieval, CodeFeedbackMT,
   CodeFeedbackST, CodeSearchNetCCRetrieval, CodeSearchNetRetrieval,
   CodeTransOceanContest, CodeTransOceanDL, CosQA, StackOverflowQA, and
   SyntheticText2SQL.
2. **Custom code-comment pair data** consisting of code snippets paired with
   natural-language description comments across the eight supported languages
   (the six CodeSearchNet languages plus Rust and TypeScript).
3. **Code-edit data** derived from `commitpackft`, pairing edit intents with
   code changes.

All datasets were constructed as hard-negative retrieval datasets: for each
anchor, one positive and fifteen hard negatives were used. Hard negatives were
mined with
[`Qwen/Qwen3-Embedding-0.6B`](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B),
which retrieves semantically similar but non-matching candidates, producing
negatives that are more difficult than random negatives. The mining model is
used only during dataset construction and is not required at inference time.
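
The mining step described above can be sketched as a top-k similarity search that skips the labeled positive. This is a generic illustration, not the procedure used to build the datasets: the function `mine_hard_negatives` is hypothetical, random vectors stand in for embeddings from the mining model, and any score thresholds or false-negative filtering used in practice are not covered.

```python
import numpy as np


def mine_hard_negatives(anchor_emb, candidate_emb, positive_idx, k=15):
    """Return, per anchor, the indices of the k most similar non-positive candidates."""
    if k > candidate_emb.shape[0] - 1:
        raise ValueError("k must be smaller than the number of candidates")
    a = anchor_emb / np.linalg.norm(anchor_emb, axis=1, keepdims=True)
    c = candidate_emb / np.linalg.norm(candidate_emb, axis=1, keepdims=True)
    sims = a @ c.T  # cosine similarities, shape (num_anchors, num_candidates)
    # Mask each anchor's labeled positive so it can never be chosen as a negative.
    sims[np.arange(len(a)), positive_idx] = -np.inf
    order = np.argsort(-sims, axis=1)  # most similar candidates first
    return order[:, :k]


# Random vectors stand in for embeddings produced by the mining model.
rng = np.random.default_rng(0)
anchors = rng.normal(size=(3, 16))
candidates = rng.normal(size=(40, 16))
positives = np.array([0, 1, 2])  # candidate index of each anchor's positive

negatives = mine_hard_negatives(anchors, candidates, positives, k=15)
print(negatives.shape)  # (3, 15)
```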

This setup is intended to improve discrimination between code snippets,
programming questions, edit examples, and technically similar retrieval
candidates.

### Data Decontamination

To reduce benchmark contamination, the following overlaps were removed from
the training data **before** training:

* Overlaps between the custom code-comment pair data and the
  **CodeSearchNet test split**
* Overlaps between the `commitpackft`-derived code-edit data and the
  **CodeEditSearchRetrieval** benchmark evaluation data

For `CodeEditSearchRetrieval`, note that MTEB labels the evaluation split
`train`. This refers only to the official split name available for the task;
the evaluated examples were not included in this model's fine-tuning data.
The reported score should therefore be interpreted as **in-domain
generalization on held-out benchmark examples**, not as training-set
performance. Given the in-domain training distribution, it should also not be
read as strictly zero-shot performance.
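
The model card does not say how overlaps were detected. As one possible approach, the sketch below removes training examples whose whitespace-normalized text exactly matches a benchmark example. The helpers `fingerprint` and `remove_overlaps` are hypothetical, and near-duplicate detection (for example, n-gram overlap) is out of scope here.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash the text after collapsing all runs of whitespace to single spaces."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def remove_overlaps(train_examples, benchmark_examples):
    """Drop training examples whose fingerprint also appears in the benchmark."""
    blocked = {fingerprint(text) for text in benchmark_examples}
    return [text for text in train_examples if fingerprint(text) not in blocked]


train = ["def f(x):\n    return x", "def g(): pass"]
benchmark = ["def f(x): return x"]
print(remove_overlaps(train, benchmark))  # ['def g(): pass']
```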

## Intended Use

This model is intended for code-related retrieval tasks such as:

* Natural language to code search
* Code-to-code retrieval and similar function search
* Code-edit retrieval (matching edit intents to code changes)
* Retrieval over programming Q&A and technical questions
* Local semantic code search systems
* RAG systems over codebases and developer documentation

Example use cases include indexing functions, snippets, programming solutions,
StackOverflow-style answers, code review examples, and edit-related code
examples.

## Limitations

* The model is specialized for code-related retrieval and may underperform
  general-purpose text embedding models on unrelated natural language tasks.
* Inputs longer than 1,024 tokens are truncated.
* Performance may vary by programming language, query style, and the
  granularity of indexed code chunks; languages outside the eight supported
  languages are untested.
* The model uses dense single-vector embeddings. For very fine-grained
  matching, rerankers or late-interaction models may provide better precision.

## Recommended Indexing Settings

Encode both queries and documents with normalized embeddings:

```python
embeddings = model.encode(texts, normalize_embeddings=True)
```

With normalized embeddings, dot product is equivalent to cosine similarity.
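
This equivalence can be checked in a few lines of plain Python. The helper functions and the two toy vectors are illustrative only and are unrelated to real model output.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

# Both expressions evaluate to 11/15 (up to floating-point rounding).
print(dot(l2_normalize(a), l2_normalize(b)), cosine(a, b))
```

This is why an index built on inner product can serve cosine-similarity search once all stored vectors are normalized.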

For codebase search, indexing function-level or class-level chunks is usually
recommended. Very long files may exceed the 1,024-token context limit and
should be split into smaller semantic chunks.
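
As a rough sketch of function-level and class-level chunking for Python sources, the standard-library `ast` module can recover the text of each top-level definition. The function `function_level_chunks` is a hypothetical helper; other languages would need their own parsers, and this sketch does not check chunks against the 1,024-token limit.

```python
import ast


def function_level_chunks(source: str) -> list:
    """Return the source text of each top-level function and class in a Python module."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            segment = ast.get_source_segment(source, node)
            if segment is not None:
                chunks.append(segment)
    return chunks


sample = (
    "import os\n"
    "\n"
    "def average(values):\n"
    "    return sum(values) / len(values)\n"
    "\n"
    "class Stats:\n"
    "    def total(self, values):\n"
    "        return sum(values)\n"
)

for chunk in function_level_chunks(sample):
    print(chunk)
    print("---")
```

Each returned chunk could then be passed to `model.encode` as one document; a chunk that is still too long would be truncated at encoding time.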

## Citation

If you use this model, please cite it together with the base model and
Sentence Transformers.

```bibtex
@misc{nightowl_codeembedding,
  title = {NightOwl-CodeEmbedding},
  author = {Shuu12121},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/Shuu12121/NightOwl-CodeEmbedding}
}
```