NightOwl
NightOwl is a ModernBERT-style code encoder pre-trained from scratch on a diverse mix of source code, natural language, and technical documentation.
NightOwl-large reaches 0.8508 average MRR on MTEB
CodeSearchNetRetrieval, exceeding CodeBERT-base (0.7944), GraphCodeBERT-base
(0.8325), UniXCoder-base (0.8003), and ModernBERT-base (0.8182) under an
identical fine-tuning protocol.
Checkpoints
The NightOwl family is pre-trained in two phases. Both the intermediate (Phase 1) and final (Phase 2) checkpoints are released.
| Repo | Size | Phase | Description |
|---|---|---|---|
| Shuu12121/NightOwl-Pre | base | Phase 1 | Mixed-data MLM pre-training (code + NL + docs) |
| Shuu12121/NightOwl | base | Phase 2 | Code-only line-level MLM continuation (recommended) |
| Shuu12121/NightOwl-large-Pre | large | Phase 1 | Mixed-data MLM pre-training (code + NL + docs) |
| Shuu12121/NightOwl-large | large | Phase 2 | Code-only line-level MLM continuation (recommended) |
For downstream code search, start from the Phase 2 checkpoints
(NightOwl / NightOwl-large).
Model size
Both variants are ModernBERT encoders (alternating local/global attention, RoPE positional embeddings) with a custom 50,368-token BPE tokenizer.
| | NightOwl | NightOwl-large |
|---|---|---|
| Architecture | ModernBERT | ModernBERT |
| Parameters (approx.) | ≈150M | ≈300M |
| hidden_size | 768 | 1024 |
| num_hidden_layers | 19 | 28 |
| num_attention_heads | 12 | 16 |
| intermediate_size | 1536 | 1536 |
| vocab_size | 50,368 | 50,368 |
| Max sequence length | 1024 (Phase 1) / 2048 (Phase 2) | 1024 (Phase 1) / 2048 (Phase 2) |
Parameter counts are approximate, derived from the architecture configuration (token embeddings + transformer layers + MLM head).
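The approximate counts can be reproduced from the configuration values in the table. The sketch below is a rough estimate, not an exact count: it assumes the usual ModernBERT layer layout (a fused QKV projection plus an output projection, a GeGLU MLP whose input projection is twice intermediate_size wide, no bias terms) and an MLM head tied to the token embeddings, and it ignores the small normalization and head-projection terms. The function name `estimate_params` is illustrative only.

```python
def estimate_params(vocab_size, hidden_size, num_layers, intermediate_size):
    """Rough ModernBERT-style parameter count (large weight matrices only)."""
    # Token embeddings (assumed tied with the MLM decoder, so counted once).
    embeddings = vocab_size * hidden_size
    # Attention: fused QKV projection (h -> 3h) plus output projection (h -> h).
    attention = 3 * hidden_size * hidden_size + hidden_size * hidden_size
    # GeGLU MLP: input projection (h -> 2 * intermediate), output (intermediate -> h).
    mlp = hidden_size * 2 * intermediate_size + intermediate_size * hidden_size
    return embeddings + num_layers * (attention + mlp)


base = estimate_params(50_368, 768, 19, 1536)
large = estimate_params(50_368, 1024, 28, 1536)
print(f"NightOwl:       {base / 1e6:.0f}M")
print(f"NightOwl-large: {large / 1e6:.0f}M")
```

Under these assumptions the estimates land close to the ≈150M and ≈300M figures quoted above.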
Training data
NightOwl is trained on two source families. Phase 1 uses every source below; Phase 2 continues on the code-related subsets only.
1. bigcode/starcoder2data-extras (12 subsets)
Diverse code, natural-language, and technical-knowledge subsets. max_samples
caps the rows sampled per subset; max_chars truncates very long documents to
control memory.
| Subset | max_samples | Priority | Notes | Phase 2 |
|---|---|---|---|---|
| kaggle | 2,000,000 | high | Notebook-style code | ✅ |
| stackoverflow | 2,000,000 | high | Q&A code threads | ✅ |
| issues | 1,000,000 | medium | GitHub issue text | ✅ |
| owm | 1,000,000 | medium | Open web math | ❌ |
| lhq | 3,000,000 | high | High-quality text | ❌ |
| wikipedia | 1,000,000 | medium | Encyclopedic NL | ❌ |
| arxiv | 600,000 | low | Long LaTeX docs (max_chars=10,000) | ❌ |
| documentation | 2,000,000 | high | Technical docs | ✅ |
| ir_cpp | 100,000 | low | C++ IR (max_chars=5,000) | ❌ |
| ir_low_resource | 100,000 | low | Low-resource IR (max_chars=5,000) | ❌ |
| ir_python | 100,000 | low | Python IR (max_chars=5,000) | ❌ |
| ir_rust | 100,000 | low | Rust IR (max_chars=5,000) | ❌ |
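As an illustration of how the two caps interact, here is a minimal sketch. The helper `cap_subset` is hypothetical and not part of the training toolkit. It assumes the cap takes the first max_samples rows of a stream; the card does not say whether the real pipeline samples randomly instead.

```python
from itertools import islice


def cap_subset(rows, max_samples, max_chars=None):
    """Yield at most max_samples texts, each truncated to max_chars characters."""
    for text in islice(rows, max_samples):
        # max_chars=None means "no truncation", as for most subsets in the table.
        yield text if max_chars is None else text[:max_chars]


# Toy stand-in for a subset such as arxiv (max_samples=600,000, max_chars=10,000).
docs = ["a" * 30, "b" * 5, "c" * 12, "d" * 8]
capped = list(cap_subset(iter(docs), max_samples=3, max_chars=10))
print([len(t) for t in capped])
```

The fourth document is dropped by the row cap, and the first and third are cut to ten characters.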
2. Shuu12121/github-file-programs-dataset (8 languages)
Whole-file source code, one Hugging Face repo per language (text field:
content). Used in both phases.
python, javascript, typescript, java, go, rust, ruby, php
Sampling caps: Phase 1 — up to 1,000,000 files per language; Phase 2 — up to 2,000,000 files per language.
Training procedure
NightOwl is pre-trained in two phases, both using masked-language modeling
with mlm_probability = 0.3.
- Phase 1: Mixed pre-training. Standard random-token MLM (mlm collator) over all data sources (code + NL + docs). Produces NightOwl-Pre / NightOwl-large-Pre.
- Phase 2: Code-only continuation. Line-level MLM (line_no_space collator) over the code-related subsets only. Entire source-code lines are masked rather than random tokens, aligning the objective with code-search downstream tasks. Produces NightOwl / NightOwl-large.
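To make the difference between the two objectives concrete, the sketch below masks whole lines until roughly mlm_probability of the tokens are covered. It is an illustration of the idea only, not the actual line_no_space collator: the whitespace "tokenizer", the `[MASK]` string, the line-selection order and the name `mask_lines` are all assumptions.

```python
import random

MASK = "[MASK]"


def mask_lines(code, mlm_probability=0.3, seed=0):
    """Mask entire lines until about mlm_probability of all tokens are masked."""
    rng = random.Random(seed)
    # Whitespace splitting stands in for a real subword tokenizer.
    lines = [line.split() for line in code.splitlines()]
    total = sum(len(tokens) for tokens in lines)
    budget = max(1, round(total * mlm_probability))

    # Visit non-empty lines in random order and mask whole lines until the budget is met.
    order = [i for i, tokens in enumerate(lines) if tokens]
    rng.shuffle(order)
    masked, chosen = 0, set()
    for i in order:
        if masked >= budget:
            break
        chosen.add(i)
        masked += len(lines[i])

    return [[MASK] * len(tokens) if i in chosen else tokens
            for i, tokens in enumerate(lines)]


snippet = "def add(a, b):\n    total = a + b\n    return total"
for line in mask_lines(snippet):
    print(" ".join(line))
```

Every output line is either fully masked or left intact, whereas random-token MLM would scatter masks across all lines.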
Long examples are split into chunks (split_long_examples: true) so all
tokens are used rather than truncated.
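A minimal sketch of this kind of chunking, assuming plain non-overlapping windows over the token ids. How the real split_long_examples option handles special tokens or overlap is not stated here, and `split_into_chunks` is a hypothetical name.

```python
def split_into_chunks(token_ids, max_length):
    """Split one long token sequence into consecutive chunks of at most max_length tokens."""
    return [token_ids[i:i + max_length] for i in range(0, len(token_ids), max_length)]


ids = list(range(2500))  # stand-in for one tokenized source file
chunks = split_into_chunks(ids, max_length=1024)
print([len(c) for c in chunks])
```

With truncation, only the first 1024 of these 2500 tokens would contribute to training. With chunking, all of them do.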
| Hyperparameter | NightOwl (base) | NightOwl-large |
|---|---|---|
| mlm_probability | 0.3 | 0.3 |
| Optimizer schedule | cosine, warmup ratio 0.05 | cosine, warmup ratio 0.05 |
| Learning rate | 5e-5 | 5e-5 |
| Weight decay | 0.01 | 0.01 |
| Precision | fp16 | fp16 |
| Epochs | 1 | 1 |
| per_device_train_batch_size | 8 | 4 |
| gradient_accumulation_steps | 32 | 64 |
| Effective batch size | 256 | 256 |
| Phase 1 max_length | 1024 | 1024 |
| Phase 2 max_length | 2048 | 2048 |
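The effective batch size follows from the per-device batch size and the accumulation steps. The check below assumes a single training device, which is the only device count consistent with the 256 figure in the table. The helper name is illustrative.

```python
def effective_batch_size(per_device, grad_accum, num_devices=1):
    """Sequences contributing to each optimizer step."""
    return per_device * grad_accum * num_devices


# (per_device_train_batch_size, gradient_accumulation_steps) from the table.
configs = {"NightOwl (base)": (8, 32), "NightOwl-large": (4, 64)}
for name, (per_device, grad_accum) in configs.items():
    print(name, effective_batch_size(per_device, grad_accum))
```

The large model halves the per-device batch and doubles accumulation, so both variants take optimizer steps over the same number of sequences.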
Evaluation
Evaluated on MTEB CodeSearchNetRetrieval after SentenceTransformer
fine-tuning on CodeSearchNet pairs (10,000 samples per language, Multiple
Negatives Ranking loss). Each model is swept over six learning rates; the
best per-model result is reported. Only the pre-trained backbone differs
between rows — the fine-tuning and evaluation recipe is held fixed.
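For readers unfamiliar with Multiple Negatives Ranking loss, the sketch below shows the core computation on a precomputed similarity matrix: each query's paired code snippet is the positive and the other snippets in the batch act as negatives. It is a pure-Python illustration rather than the fine-tuning code. The scale of 20 is an assumption (it matches the usual sentence-transformers default; the card does not state the value used).

```python
import math


def mnr_loss(similarities, scale=20.0):
    """In-batch cross-entropy: row i's positive is column i, other columns are negatives."""
    losses = []
    for i, row in enumerate(similarities):
        logits = [scale * s for s in row]
        # Numerically stable log-sum-exp over the row.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Cosine similarities between 3 docstrings (rows) and 3 functions (columns).
aligned = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.0, 0.1, 0.7]]
shuffled = [[0.1, 0.9, 0.0], [0.8, 0.2, 0.1], [0.0, 0.1, 0.7]]
print(mnr_loss(aligned) < mnr_loss(shuffled))
```

The loss is lower when each query is most similar to its own pair, which is the behaviour the fine-tuning step rewards.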
CodeSearchNetRetrieval — MRR by language (best across learning rates)
| Model | Go | Java | JS | PHP | Python | Ruby | Avg | best-lr |
|---|---|---|---|---|---|---|---|---|
| CodeBERT-base | 0.9242 | 0.7176 | 0.7007 | 0.8089 | 0.8499 | 0.7651 | 0.7944 | 3e-5 |
| GraphCodeBERT-base | 0.9373 | 0.7991 | 0.7402 | 0.8339 | 0.8785 | 0.8059 | 0.8325 | 3e-5 |
| UniXCoder-base | 0.8674 | 0.8276 | 0.6949 | 0.8115 | 0.8643 | 0.7360 | 0.8003 | 5e-5 |
| ModernBERT-base | 0.9278 | 0.7663 | 0.7465 | 0.8153 | 0.8731 | 0.7802 | 0.8182 | 3e-5 |
| NightOwl | 0.9412 | 0.8232 | 0.7554 | 0.8309 | 0.8984 | 0.8124 | 0.8436 | 1e-5 |
| NightOwl-large | 0.9393 | 0.8314 | 0.7753 | 0.8398 | 0.9023 | 0.8169 | 0.8508 | 1e-5 |
NightOwl-large takes six of seven score columns, including the average; NightOwl (base) leads only on Go.
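The scores above are mean reciprocal rank (MRR) values. As a reminder of what the metric measures, here is a minimal sketch assuming one relevant document per query. MTEB's exact cutoff and aggregation are not reproduced, and the ids are made up for illustration.

```python
def mean_reciprocal_rank(ranked_ids, relevant_ids):
    """ranked_ids[q] is the retrieved list for query q; relevant_ids[q] is its gold id."""
    total = 0.0
    for ranking, gold in zip(ranked_ids, relevant_ids):
        if gold in ranking:
            # Reciprocal of the 1-based rank of the gold document; 0 if it is missing.
            total += 1.0 / (ranking.index(gold) + 1)
    return total / len(relevant_ids)


# Hypothetical retrieval results: gold documents sit at ranks 1, 2 and 3.
rankings = [["f3", "f1", "f2"], ["f2", "f5", "f9"], ["f7", "f8", "f4"]]
gold = ["f3", "f5", "f4"]
mrr = mean_reciprocal_rank(rankings, gold)
print(round(mrr, 4))
```

An MRR of 0.85 therefore means the correct function is typically at or very near the top of the ranked list.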
Usage
NightOwl is an encoder backbone. Load it directly for masked-LM / feature-extraction, or wrap it as a SentenceTransformer for code search.
Masked language modeling
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("Shuu12121/NightOwl")
model = AutoModelForMaskedLM.from_pretrained("Shuu12121/NightOwl")
Feature extraction
from transformers import AutoTokenizer, AutoModel
import torch
tokenizer = AutoTokenizer.from_pretrained("Shuu12121/NightOwl")
model = AutoModel.from_pretrained("Shuu12121/NightOwl")
code = "def add(a, b):\n    return a + b"
inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=1024)
with torch.no_grad():
    embeddings = model(**inputs).last_hidden_state  # [1, seq_len, hidden_size]
Limitations
- NightOwl is an encoder for understanding/retrieval, not a generative model — it does not produce code.
- Code-search strength is best realized after SentenceTransformer fine-tuning; the raw backbone is not a ready-to-use retriever.
- Training data is dominated by 8 programming languages; performance on other languages may be lower.
- Pre-training data is sampled from public sources and may contain bugs, insecure patterns, or biased content.
Citation
@misc{owl_code_pretraining,
author = {Shun0212},
title = {Owl Code Pretraining: A minimal toolkit for building code-specialized pretrained encoders},
year = {2026},
publisher = {GitHub},
howpublished = {\url{https://github.com/Shun0212/codeowl-training-core}}
}