NightOwl

NightOwl is a ModernBERT-style code encoder pre-trained from scratch on a diverse mix of source code, natural language, and technical documentation.

NightOwl-large reaches 0.8508 average MRR on MTEB CodeSearchNetRetrieval, exceeding CodeBERT-base (0.7944), GraphCodeBERT-base (0.8325), UniXCoder-base (0.8003), and ModernBERT-base (0.8182) under an identical fine-tuning protocol.

Checkpoints

The NightOwl family is pre-trained in two phases. Both the intermediate (Phase 1) and final (Phase 2) checkpoints are released.

Repo	Size	Phase	Description
`Shuu12121/NightOwl-Pre`	base	Phase 1	Mixed-data MLM pre-training (code + NL + docs)
`Shuu12121/NightOwl`	base	Phase 2	Code-only line-level MLM continuation — recommended
`Shuu12121/NightOwl-large-Pre`	large	Phase 1	Mixed-data MLM pre-training (code + NL + docs)
`Shuu12121/NightOwl-large`	large	Phase 2	Code-only line-level MLM continuation — recommended

For downstream code search, start from the Phase 2 checkpoints (NightOwl / NightOwl-large).

Model size

Both variants are ModernBERT encoders (alternating local/global attention, RoPE positional embeddings) with a custom 50,368-token BPE tokenizer.

	NightOwl	NightOwl-large
Architecture	ModernBERT	ModernBERT
Parameters (approx.)	≈150M	≈300M
`hidden_size`	768	1024
`num_hidden_layers`	19	28
`num_attention_heads`	12	16
`intermediate_size`	1536	1536
`vocab_size`	50,368	50,368
Max sequence length	1024 (Phase 1) / 2048 (Phase 2)	1024 (Phase 1) / 2048 (Phase 2)

Parameter counts are approximate, derived from the architecture configuration (token embeddings + transformer layers + MLM head).
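
The approximate counts can be reproduced from the table with a little arithmetic. The sketch below assumes the usual ModernBERT layer layout: a fused QKV projection plus an output projection for attention, a gated MLP whose input projection is `2 * intermediate_size` wide, bias-free linear layers, and an MLM decoder tied to the token embeddings. Normalization weights are ignored. `estimate_params` is a hypothetical helper written for illustration, not part of any library.

```python
def estimate_params(hidden_size: int, num_layers: int,
                    intermediate_size: int, vocab_size: int) -> int:
    """Rough weight count for a ModernBERT-style encoder with an MLM head.

    Assumptions (not confirmed by the model card): bias-free linear layers,
    gated MLP, tied input/output embeddings, norm weights ignored.
    """
    embeddings = vocab_size * hidden_size
    # Attention: fused Wqkv (hidden -> 3 * hidden) plus output projection Wo.
    attention = 3 * hidden_size * hidden_size + hidden_size * hidden_size
    # Gated MLP: Wi (hidden -> 2 * intermediate) plus Wo (intermediate -> hidden).
    mlp = hidden_size * 2 * intermediate_size + intermediate_size * hidden_size
    # MLM head: one dense layer; the decoder is assumed to reuse the embeddings.
    mlm_head = hidden_size * hidden_size
    return embeddings + num_layers * (attention + mlp) + mlm_head


base = estimate_params(768, 19, 1536, 50_368)
large = estimate_params(1024, 28, 1536, 50_368)
print(f"NightOwl:       {base / 1e6:.0f}M")
print(f"NightOwl-large: {large / 1e6:.0f}M")
```

Under these assumptions both estimates land within a few million parameters of the ≈150M and ≈300M figures in the table.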

Training data

NightOwl is trained on two source families. Phase 1 uses every source below; Phase 2 continues on the code-related subsets only.

1. `bigcode/starcoder2data-extras` (12 subsets)

Diverse code, natural-language, and technical-knowledge subsets. `max_samples` caps the rows sampled per subset; `max_chars` truncates very long documents to control memory.

Subset	`max_samples`	Priority	Notes	Phase 2
`kaggle`	2,000,000	high	Notebook-style code	✅
`stackoverflow`	2,000,000	high	Q&A code threads	✅
`issues`	1,000,000	medium	GitHub issue text	✅
`owm`	1,000,000	medium	Open web math	—
`lhq`	3,000,000	high	High-quality text	—
`wikipedia`	1,000,000	medium	Encyclopedic NL	—
`arxiv`	600,000	low	Long LaTeX docs (`max_chars=10,000`)	—
`documentation`	2,000,000	high	Technical docs	✅
`ir_cpp`	100,000	low	C++ IR (`max_chars=5,000`)	—
`ir_low_resource`	100,000	low	Low-resource IR (`max_chars=5,000`)	—
`ir_python`	100,000	low	Python IR (`max_chars=5,000`)	—
`ir_rust`	100,000	low	Rust IR (`max_chars=5,000`)	—
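
The two caps can be pictured as a small filter over a stream of documents. This is a minimal sketch, assuming the first `max_samples` rows are taken in order (the card does not say whether sampling is random or sequential); `cap_and_truncate` is a hypothetical helper and not the toolkit's actual function.

```python
from itertools import islice
from typing import Iterable, Iterator, Optional


def cap_and_truncate(texts: Iterable[str], max_samples: int,
                     max_chars: Optional[int] = None) -> Iterator[str]:
    """Yield at most `max_samples` documents, each cut to `max_chars` characters."""
    for text in islice(texts, max_samples):
        yield text if max_chars is None else text[:max_chars]


# Toy stand-in for a streamed subset such as `arxiv` (max_chars=10,000).
docs = ["x" * 25_000, "short doc", "y" * 12_000, "never reached"]
sampled = list(cap_and_truncate(docs, max_samples=3, max_chars=10_000))
print([len(d) for d in sampled])  # [10000, 9, 10000]
```

Because `islice` stops after `max_samples` items, the same function also works on a lazily streamed dataset without reading it to the end.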

2. `Shuu12121/github-file-programs-dataset` (8 languages)

Whole-file source code, one Hugging Face repo per language (text field: `content`). Used in both phases.

python, javascript, typescript, java, go, rust, ruby, php

Sampling caps: Phase 1 — up to 1,000,000 files per language; Phase 2 — up to 2,000,000 files per language.

Training procedure

NightOwl is pre-trained in two phases, both using masked-language modeling with `mlm_probability = 0.3`.

Phase 1 — Mixed pre-training. Standard random-token MLM (mlm collator) over all data sources (code + NL + docs). Produces NightOwl-Pre / NightOwl-large-Pre.
Phase 2 — Code-only continuation. Line-level MLM (line_no_space collator) over the code-related subsets only. Entire source-code lines are masked rather than random tokens, aligning the objective with code-search downstream tasks. Produces NightOwl / NightOwl-large.
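
To make the difference between the two objectives concrete, here is a simplified, word-level sketch of line-level masking. It is an illustration only: the real `line_no_space` collator works on tokenizer output and its exact rules are not described here, so the choices below (whitespace-separated words as tokens, blank lines skipped, indentation left visible, a `[MASK]` placeholder string) are all assumptions.

```python
import random

MASK = "[MASK]"  # placeholder; the real tokenizer's mask token may differ


def mask_lines(code: str, mlm_probability: float = 0.3, seed: int = 0) -> str:
    """Mask whole lines until roughly `mlm_probability` of the words are hidden.

    Words are whitespace-separated; blank lines are never picked and leading
    indentation is left visible.
    """
    rng = random.Random(seed)
    lines = code.split("\n")
    candidates = [i for i, line in enumerate(lines) if line.strip()]
    total = sum(len(lines[i].split()) for i in candidates)
    budget = max(1, round(total * mlm_probability))

    rng.shuffle(candidates)
    masked = 0
    for i in candidates:
        if masked >= budget:
            break
        words = lines[i].split()
        indent = lines[i][: len(lines[i]) - len(lines[i].lstrip())]
        lines[i] = indent + " ".join([MASK] * len(words))
        masked += len(words)
    return "\n".join(lines)


snippet = "def add(a, b):\n    total = a + b\n    return total"
print(mask_lines(snippet))
```

Every line in the output is either untouched or fully masked, which is the property that separates this objective from random-token MLM.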

Long examples are split into chunks (`split_long_examples: true`) so all tokens are used rather than truncated.
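
Chunking of this kind can be sketched as below. The sketch assumes plain consecutive, non-overlapping chunks and ignores special tokens; whether the toolkit adds overlap or per-chunk special tokens is not stated. `split_into_chunks` is a hypothetical name.

```python
from typing import List


def split_into_chunks(token_ids: List[int], max_length: int) -> List[List[int]]:
    """Cut one tokenized example into consecutive chunks of at most `max_length` ids."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    return [token_ids[i:i + max_length] for i in range(0, len(token_ids), max_length)]


ids = list(range(2500))            # stand-in for a 2,500-token file
chunks = split_into_chunks(ids, max_length=1024)
print([len(c) for c in chunks])    # [1024, 1024, 452]
```

With truncation, the last 1,476 tokens of this example would be discarded; with chunking they become two further training examples.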

Hyperparameter	NightOwl (base)	NightOwl-large
`mlm_probability`	0.3	0.3
Learning-rate schedule	cosine, warmup ratio 0.05	cosine, warmup ratio 0.05
Learning rate	5e-5	5e-5
Weight decay	0.01	0.01
Precision	fp16	fp16
Epochs	1	1
`per_device_train_batch_size`	8	4
`gradient_accumulation_steps`	32	64
Effective batch size	256	256
Phase 1 `max_length`	1024	1024
Phase 2 `max_length`	2048	2048
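
The schedule and batch-size rows can be checked with a few lines of arithmetic. This is a sketch under two assumptions that the table does not state: training runs on a single device (so 8 × 32 and 4 × 64 both give 256), and the cosine schedule warms up linearly and decays to zero, as the common Hugging Face cosine schedule does. Both function names are hypothetical.

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float = 5e-5,
              warmup_ratio: float = 0.05) -> float:
    """Linear warmup to `peak_lr`, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


def effective_batch_size(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    """Sequences contributing to one optimizer step."""
    return per_device * grad_accum * num_devices


print(effective_batch_size(8, 32), effective_batch_size(4, 64))  # 256 256
for step in (0, 50, 525, 1000):
    print(step, f"{cosine_lr(step, total_steps=1000):.2e}")
```

The large model halves the per-device batch and doubles the accumulation steps, which keeps the effective batch size unchanged while lowering peak memory per step.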

Evaluation

All models are evaluated on MTEB CodeSearchNetRetrieval after SentenceTransformer fine-tuning on CodeSearchNet pairs (10,000 samples per language, Multiple Negatives Ranking loss). Each model is swept over six learning rates, and the best result per model is reported. Only the pre-trained backbone differs between rows; the fine-tuning and evaluation recipe is held fixed.
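
Multiple Negatives Ranking loss treats every other code snippet in the batch as a negative for a given query. The pure-Python sketch below shows the idea on toy 2-D vectors; it is not the sentence-transformers implementation, and the `scale` of 20 is only a commonly used default assumed here, not a value taken from this card.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mnr_loss(queries: List[Sequence[float]], codes: List[Sequence[float]],
             scale: float = 20.0) -> float:
    """In-batch-negatives cross-entropy: codes[i] is the positive for queries[i]."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [scale * cosine(q, c) for c in codes]
        peak = max(logits)
        # Numerically stable log-sum-exp over all candidates in the batch.
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_norm - logits[i]
    return total / len(queries)


aligned = mnr_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])   # positives match
swapped = mnr_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])   # positives mismatched
print(f"{aligned:.4f} {swapped:.4f}")
```

The loss is near zero when each query is closest to its own code snippet and grows when another snippet in the batch scores higher.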

CodeSearchNetRetrieval — MRR by language (best across learning rates)

Model	Go	Java	JS	PHP	Python	Ruby	Avg	best-lr
CodeBERT-base	0.9242	0.7176	0.7007	0.8089	0.8499	0.7651	0.7944	`3e-5`
GraphCodeBERT-base	0.9373	0.7991	0.7402	0.8339	0.8785	0.8059	0.8325	`3e-5`
UniXCoder-base	0.8674	0.8276	0.6949	0.8115	0.8643	0.7360	0.8003	`5e-5`
ModernBERT-base	0.9278	0.7663	0.7465	0.8153	0.8731	0.7802	0.8182	`3e-5`
NightOwl	0.9412	0.8232	0.7554	0.8309	0.8984	0.8124	0.8436	`1e-5`
NightOwl-large	0.9393	0.8314	0.7753	0.8398	0.9023	0.8169	0.8508	`1e-5`

NightOwl-large leads six of the seven score columns, including the average; NightOwl (base) leads only on Go.
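
For readers unfamiliar with the metric, MRR averages the reciprocal rank of the correct code snippet over all queries. The sketch below is a plain definition on toy data; the benchmark's own implementation may apply a rank cutoff or other details not covered here, and `mean_reciprocal_rank` is a hypothetical helper.

```python
from typing import List, Sequence


def mean_reciprocal_rank(rankings: List[Sequence[str]], relevant: List[str]) -> float:
    """Average of 1/rank of the relevant item per query (0 if it is not retrieved)."""
    total = 0.0
    for ranked, target in zip(rankings, relevant):
        if target in ranked:
            total += 1.0 / (list(ranked).index(target) + 1)
    return total / len(relevant)


# Three queries whose correct snippet "f1" is ranked 1st, 2nd, and 3rd.
rankings = [["f1", "f2", "f3"], ["f2", "f1", "f3"], ["f3", "f2", "f1"]]
relevant = ["f1", "f1", "f1"]
print(round(mean_reciprocal_rank(rankings, relevant), 4))  # 0.6111
```

An MRR of 0.85 therefore means the correct snippet is, on average, very close to the top of the ranked list.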

Usage

NightOwl is an encoder backbone. Load it directly for masked-LM / feature-extraction, or wrap it as a SentenceTransformer for code search.

Masked language modeling

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Shuu12121/NightOwl")
model = AutoModelForMaskedLM.from_pretrained("Shuu12121/NightOwl")

Feature extraction

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("Shuu12121/NightOwl")
model = AutoModel.from_pretrained("Shuu12121/NightOwl")

code = "def add(a, b):\n    return a + b"
inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=1024)
with torch.no_grad():
    embeddings = model(**inputs).last_hidden_state  # [1, seq_len, hidden_size]
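
The output above is one vector per token. To get a single vector per snippet, a common choice is masked mean pooling; the card does not say which pooling was used for the fine-tuned retrievers, so treat this as one reasonable option. The sketch uses dummy tensors in place of real model output so that it runs without downloading a checkpoint.

```python
import torch


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over non-padding positions -> [batch, hidden_size]."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)  # avoid division by zero
    return summed / counts


# Dummy tensors standing in for `model(**inputs).last_hidden_state` and
# `inputs["attention_mask"]`; the second sequence has one padding position.
hidden = torch.tensor([[[1.0, 3.0], [3.0, 5.0]],
                       [[2.0, 2.0], [9.0, 9.0]]])
mask = torch.tensor([[1, 1], [1, 0]])
print(mean_pool(hidden, mask))  # rows: [2., 4.] and [2., 2.]
```

With the real model, the call would be `mean_pool(embeddings, inputs["attention_mask"])`.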

Limitations

NightOwl is an encoder for understanding/retrieval, not a generative model — it does not produce code.
Code-search strength is best realized after SentenceTransformer fine-tuning; the raw backbone is not a ready-to-use retriever.
Training data is dominated by 8 programming languages; performance on other languages may be lower.
Pre-training data is sampled from public sources and may contain bugs, insecure patterns, or biased content.

Citation

@misc{owl_code_pretraining,
  author       = {Shun0212},
  title        = {Owl Code Pretraining: A minimal toolkit for building code-specialized pretrained encoders},
  year         = {2026},
  publisher    = {GitHub},
  howpublished = {\url{https://github.com/Shun0212/codeowl-training-core}}
}
