NightOwl

NightOwl is a ModernBERT-style code encoder pre-trained from scratch on a diverse mix of source code, natural language, and technical documentation.

NightOwl-large reaches 0.8508 average MRR on MTEB CodeSearchNetRetrieval, exceeding CodeBERT-base (0.7944), GraphCodeBERT-base (0.8325), UniXCoder-base (0.8003), and ModernBERT-base (0.8182) under an identical fine-tuning protocol.

Checkpoints

The NightOwl family is pre-trained in two phases. Both the intermediate (Phase 1) and final (Phase 2) checkpoints are released.

| Repo | Size | Phase | Description |
|---|---|---|---|
| Shuu12121/NightOwl-Pre | base | Phase 1 | Mixed-data MLM pre-training (code + NL + docs) |
| Shuu12121/NightOwl | base | Phase 2 | Code-only line-level MLM continuation (recommended) |
| Shuu12121/NightOwl-large-Pre | large | Phase 1 | Mixed-data MLM pre-training (code + NL + docs) |
| Shuu12121/NightOwl-large | large | Phase 2 | Code-only line-level MLM continuation (recommended) |

For downstream code search, start from the Phase 2 checkpoints (NightOwl / NightOwl-large).

Model size

Both variants are ModernBERT encoders (alternating local/global attention, RoPE positional embeddings) with a custom 50,368-token BPE tokenizer.

| | NightOwl | NightOwl-large |
|---|---|---|
| Architecture | ModernBERT | ModernBERT |
| Parameters (approx.) | ≈150M | ≈300M |
| hidden_size | 768 | 1024 |
| num_hidden_layers | 19 | 28 |
| num_attention_heads | 12 | 16 |
| intermediate_size | 1536 | 1536 |
| vocab_size | 50,368 | 50,368 |
| Max sequence length | 1024 (Phase 1) / 2048 (Phase 2) | 1024 (Phase 1) / 2048 (Phase 2) |

Parameter counts are approximate, derived from the architecture configuration (token embeddings + transformer layers + MLM head).
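As a rough cross-check of these figures, the sketch below re-derives the counts from the configuration values in the table. It assumes a ModernBERT-style layer with a fused QKV projection, a gated (GLU) feed-forward block, no bias terms, and an MLM head whose decoder shares the embedding matrix; normalization weights are ignored. `estimate_params` is a hypothetical helper, not part of any library.

```python
def estimate_params(hidden_size, num_layers, intermediate_size, vocab_size):
    """Approximate parameter count (hypothetical helper; ignores norms and biases)."""
    embeddings = vocab_size * hidden_size
    # Attention: fused QKV projection (3 * h^2) plus output projection (h^2).
    attention = 4 * hidden_size * hidden_size
    # Gated feed-forward (assumed): h -> 2 * intermediate, then intermediate -> h.
    feed_forward = 3 * hidden_size * intermediate_size
    # MLM head: one dense h x h layer; the decoder is assumed tied to the embeddings.
    mlm_head = hidden_size * hidden_size
    return embeddings + num_layers * (attention + feed_forward) + mlm_head


base = estimate_params(768, 19, 1536, 50_368)
large = estimate_params(1024, 28, 1536, 50_368)
print(f"NightOwl: {base / 1e6:.0f}M, NightOwl-large: {large / 1e6:.0f}M")
```

Under these assumptions the estimate comes out near 151M and 302M, in line with the approximate values in the table.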

Training data

NightOwl is trained on two source families. Phase 1 uses every source below; Phase 2 continues on the code-related subsets only.

1. bigcode/starcoder2data-extras (12 subsets)

Diverse code, natural-language, and technical-knowledge subsets. max_samples caps the rows sampled per subset; max_chars truncates very long documents to control memory.

| Subset | max_samples | Priority | Notes | Phase 2 |
|---|---|---|---|---|
| kaggle | 2,000,000 | high | Notebook-style code | ✅ |
| stackoverflow | 2,000,000 | high | Q&A code threads | ✅ |
| issues | 1,000,000 | medium | GitHub issue text | ✅ |
| owm | 1,000,000 | medium | Open web math | - |
| lhq | 3,000,000 | high | High-quality text | - |
| wikipedia | 1,000,000 | medium | Encyclopedic NL | - |
| arxiv | 600,000 | low | Long LaTeX docs (max_chars=10,000) | - |
| documentation | 2,000,000 | high | Technical docs | ✅ |
| ir_cpp | 100,000 | low | C++ IR (max_chars=5,000) | - |
| ir_low_resource | 100,000 | low | Low-resource IR (max_chars=5,000) | - |
| ir_python | 100,000 | low | Python IR (max_chars=5,000) | - |
| ir_rust | 100,000 | low | Rust IR (max_chars=5,000) | - |
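The two caps can be pictured with a small sketch. It assumes the simplest possible policy (take the first `max_samples` rows, then cut each document at `max_chars` characters); the actual toolkit may sample differently. `sample_subset` is a hypothetical name used only for illustration.

```python
from itertools import islice


def sample_subset(documents, max_samples, max_chars=None):
    """Yield at most max_samples documents, each cut to max_chars characters."""
    for text in islice(documents, max_samples):
        # max_chars=None means the subset has no per-document character cap.
        yield text if max_chars is None else text[:max_chars]


# Toy stand-in for an arxiv-like subset with very long documents.
arxiv_like = ["x" * 25_000, "short paper", "y" * 12_000]
capped = list(sample_subset(arxiv_like, max_samples=2, max_chars=10_000))
print([len(doc) for doc in capped])
```

Because the helper consumes a plain iterator, the same pattern would also work on a streamed dataset without loading the whole subset into memory.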

2. Shuu12121/github-file-programs-dataset (8 languages)

Whole-file source code, one Hugging Face repo per language (text field: content). Used in both phases.

python, javascript, typescript, java, go, rust, ruby, php

Sampling caps: Phase 1 — up to 1,000,000 files per language; Phase 2 — up to 2,000,000 files per language.

Training procedure

NightOwl is pre-trained in two phases, both using masked-language modeling with mlm_probability = 0.3.

  • Phase 1 — Mixed pre-training. Standard random-token MLM (mlm collator) over all data sources (code + NL + docs). Produces NightOwl-Pre / NightOwl-large-Pre.
  • Phase 2 — Code-only continuation. Line-level MLM (line_no_space collator) over the code-related subsets only. Entire source-code lines are masked rather than random tokens, aligning the objective with code-search downstream tasks. Produces NightOwl / NightOwl-large.
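The difference between the two objectives is easiest to see on a toy example. The sketch below is not the `line_no_space` collator itself, whose exact behaviour is not described here; it only illustrates the idea of masking whole lines until roughly `mlm_probability` of the tokens are covered. `line_level_mask` and its `line_ids` input (the line index of each token) are hypothetical.

```python
import random


def line_level_mask(line_ids, mlm_probability=0.3, seed=0):
    """Return one boolean per token; every token of a chosen line is masked."""
    rng = random.Random(seed)
    lines = sorted(set(line_ids))
    rng.shuffle(lines)
    budget = mlm_probability * len(line_ids)  # target number of masked tokens
    chosen, covered = set(), 0
    for line in lines:
        if covered >= budget:
            break
        chosen.add(line)
        covered += line_ids.count(line)
    return [line_id in chosen for line_id in line_ids]


# Ten tokens spread over four source lines.
line_ids = [0, 0, 0, 1, 1, 2, 2, 2, 2, 3]
mask = line_level_mask(line_ids)
print(mask)
```

Random-token MLM would instead flip each position independently, so a masked token can usually be recovered from its neighbours on the same line; masking the whole line removes that shortcut.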

Long examples are split into chunks (split_long_examples: true) so all tokens are used rather than truncated.
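A minimal sketch of such chunking, assuming consecutive non-overlapping windows over the token ids. Whether the toolkit overlaps chunks or reserves positions for special tokens is not stated here, and `split_into_chunks` is a hypothetical helper.

```python
def split_into_chunks(token_ids, max_length):
    """Split one tokenized example into consecutive chunks of at most max_length tokens."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    return [token_ids[i:i + max_length] for i in range(0, len(token_ids), max_length)]


# A 5,000-token file at the Phase 2 length of 2048.
chunks = split_into_chunks(list(range(5000)), max_length=2048)
print([len(chunk) for chunk in chunks])
```

With plain truncation the last 2,952 tokens of this example would never be seen during training.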

| Hyperparameter | NightOwl (base) | NightOwl-large |
|---|---|---|
| mlm_probability | 0.3 | 0.3 |
| LR schedule | cosine, warmup ratio 0.05 | cosine, warmup ratio 0.05 |
| Learning rate | 5e-5 | 5e-5 |
| Weight decay | 0.01 | 0.01 |
| Precision | fp16 | fp16 |
| Epochs | 1 | 1 |
| per_device_train_batch_size | 8 | 4 |
| gradient_accumulation_steps | 32 | 64 |
| Effective batch size | 256 | 256 |
| Phase 1 max_length | 1024 | 1024 |
| Phase 2 max_length | 2048 | 2048 |
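The effective batch size follows from the per-device batch size and the number of accumulation steps. The sketch below assumes a single training device, which is what the listed numbers imply but is not stated explicitly; `effective_batch_size` is a hypothetical helper.

```python
def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Number of sequences contributing to each optimizer step."""
    return per_device_batch * grad_accum_steps * num_devices


base_ebs = effective_batch_size(8, 32)    # NightOwl (base)
large_ebs = effective_batch_size(4, 64)   # NightOwl-large
print(base_ebs, large_ebs)
```

Halving the per-device batch for the large model while doubling the accumulation steps keeps the optimizer-step batch identical across the two sizes.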

Evaluation

Evaluated on MTEB CodeSearchNetRetrieval after SentenceTransformer fine-tuning on CodeSearchNet pairs (10,000 samples per language, Multiple Negatives Ranking loss). Each model is swept over six learning rates; the best per-model result is reported. Only the pre-trained backbone differs between rows — the fine-tuning and evaluation recipe is held fixed.
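For readers unfamiliar with the Multiple Negatives Ranking loss, the following pure-Python sketch shows the idea: each query is scored against every code snippet in the batch, and the other snippets act as negatives. The cosine similarity and the scale of 20 are assumptions for illustration, not settings confirmed by this card, and `mnr_loss` is a hypothetical name.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mnr_loss(queries, codes, scale=20.0):
    """Mean cross-entropy where codes[i] is the positive for queries[i]."""
    losses = []
    for i, query in enumerate(queries):
        logits = [scale * cosine(query, code) for code in codes]
        # Numerically stable log-sum-exp over the whole batch.
        top = max(logits)
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


aligned = mnr_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = mnr_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

The loss is near zero when every query is closest to its own snippet and grows when a different snippet in the batch scores higher.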

CodeSearchNetRetrieval — MRR by language (best across learning rates)

| Model | Go | Java | JS | PHP | Python | Ruby | Avg | best-lr |
|---|---|---|---|---|---|---|---|---|
| CodeBERT-base | 0.9242 | 0.7176 | 0.7007 | 0.8089 | 0.8499 | 0.7651 | 0.7944 | 3e-5 |
| GraphCodeBERT-base | 0.9373 | 0.7991 | 0.7402 | 0.8339 | 0.8785 | 0.8059 | 0.8325 | 3e-5 |
| UniXCoder-base | 0.8674 | 0.8276 | 0.6949 | 0.8115 | 0.8643 | 0.7360 | 0.8003 | 5e-5 |
| ModernBERT-base | 0.9278 | 0.7663 | 0.7465 | 0.8153 | 0.8731 | 0.7802 | 0.8182 | 3e-5 |
| NightOwl | 0.9412 | 0.8232 | 0.7554 | 0.8309 | 0.8984 | 0.8124 | 0.8436 | 1e-5 |
| NightOwl-large | 0.9393 | 0.8314 | 0.7753 | 0.8398 | 0.9023 | 0.8169 | 0.8508 | 1e-5 |

NightOwl-large takes six of the seven score columns, including the average; the base NightOwl leads only on Go.
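The metric in the table, mean reciprocal rank (MRR), averages 1/rank of the correct code snippet over all queries. A minimal sketch, assuming one relevant snippet per query; `mean_reciprocal_rank` is a hypothetical helper and not the MTEB implementation.

```python
def mean_reciprocal_rank(rankings, gold_ids):
    """rankings[i] is the ranked list of ids for query i; gold_ids[i] is its answer."""
    if not rankings:
        return 0.0
    total = 0.0
    for ranking, gold in zip(rankings, gold_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id == gold:
                total += 1.0 / rank
                break  # a query with no hit contributes 0
    return total / len(rankings)


# The correct snippet "a" is ranked 1st, 2nd, and 3rd for three queries.
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
score = mean_reciprocal_rank(rankings, ["a", "a", "a"])
print(round(score, 4))
```

An MRR of 0.85 therefore means the correct snippet is, on average, very close to the top of the ranked list.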

Usage

NightOwl is an encoder backbone. Load it directly for masked-LM / feature-extraction, or wrap it as a SentenceTransformer for code search.

Masked language modeling

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Shuu12121/NightOwl")
model = AutoModelForMaskedLM.from_pretrained("Shuu12121/NightOwl")
```

Feature extraction

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("Shuu12121/NightOwl")
model = AutoModel.from_pretrained("Shuu12121/NightOwl")

code = "def add(a, b):\n    return a + b"
inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=1024)
with torch.no_grad():
    embeddings = model(**inputs).last_hidden_state  # [1, seq_len, hidden_size]
```
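For code search, the backbone can be wrapped as a SentenceTransformer. The sketch below uses the `sentence-transformers` package and assumes mean pooling over token embeddings; the pooling strategy used for the reported results is not stated here. `build_code_retriever` and `embed` are hypothetical helper names.

```python
def build_code_retriever(model_name="Shuu12121/NightOwl", max_seq_length=2048):
    """Wrap the encoder with a pooling layer (mean pooling is an assumption)."""
    # Imported lazily so the helpers can be defined without the dependency installed.
    from sentence_transformers import SentenceTransformer, models

    encoder = models.Transformer(model_name, max_seq_length=max_seq_length)
    pooling = models.Pooling(
        encoder.get_word_embedding_dimension(), pooling_mode="mean"
    )
    return SentenceTransformer(modules=[encoder, pooling])


def embed(retriever, texts):
    """Return one L2-normalized vector per input text."""
    return retriever.encode(texts, normalize_embeddings=True)
```

Calling `embed(build_code_retriever(), ["def add(a, b):\n    return a + b"])` downloads the checkpoint and returns one vector per input. These vectors come from the raw backbone, so they should be treated as a starting point for fine-tuning on query/code pairs rather than as a finished retriever.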

Limitations

  • NightOwl is an encoder for understanding/retrieval, not a generative model — it does not produce code.
  • Code-search strength is best realized after SentenceTransformer fine-tuning; the raw backbone is not a ready-to-use retriever.
  • Training data is dominated by 8 programming languages; performance on other languages may be lower.
  • Pre-training data is sampled from public sources and may contain bugs, insecure patterns, or biased content.

Citation

@misc{owl_code_pretraining,
  author       = {Shun0212},
  title        = {Owl Code Pretraining: A minimal toolkit for building code-specialized pretrained encoders},
  year         = {2026},
  publisher    = {GitHub},
  howpublished = {\url{https://github.com/Shun0212/codeowl-training-core}}
}