---
tags:
  - ColBERT
  - PyLate
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - code-search
  - knowledge-distillation
  - modernbert
  - apple-silicon
  - mps
pipeline_tag: sentence-similarity
library_name: PyLate
license: apache-2.0
language:
  - en
datasets:
  - sentence-transformers/codesearchnet
base_model: lightonai/ColBERT-Zero
---

# ColBERT-Zero-6L-CodeSearch

A 6-layer ColBERT model distilled from ColBERT-Zero (22 layers) for code search, achieving 85% of the teacher's retrieval quality at 13x faster query speed.

## Model Details

| Parameter | Value |
|---|---|
| Architecture | ModernBERT (6 layers, 768 hidden, 12 heads) |
| Base Model | lightonai/ColBERT-Zero |
| Output Dimensionality | 128 per-token embeddings |
| Similarity Function | MaxSim (late interaction) |
| Parameters | ~38M (vs ~100M teacher) |
| Query Length | 32 tokens |
| Document Length | 180 tokens |
| License | Apache 2.0 |
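
The MaxSim (late interaction) similarity used here sums, for each query token, its maximum similarity to any document token. A minimal pure-Python sketch, assuming L2-normalized embeddings so that the dot product equals cosine similarity (toy 2-d vectors, not real model output):

```python
# MaxSim: for each query token, take the best-matching document token,
# then sum those per-token maxima into a single relevance score.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maxsim(query_tokens, doc_tokens):
    """Sum over query tokens of the max similarity to any document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

# Toy 2-d "embeddings": two query tokens, three document tokens
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
print(maxsim(query, doc))  # 1.0 + 1.0 = 2.0
```

In the real model each token embedding is a 128-dimensional vector, but the scoring logic is the same.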

## Benchmark Results

Evaluated on 3 code search corpora (150 questions total) via litembeddings:

| Corpus | Teacher MRR | Student MRR | % of Teacher | Student Query Speed |
|---|---|---|---|---|
| jq (C) | 0.539 | 0.355 | 65.9% | ~7ms |
| Rails (Ruby) | 0.679 | 0.581 | 85.6% | ~3ms |
| FastAPI (Python) | 0.782 | 0.766 | 98.0% | ~4ms |
| **Aggregate** | **0.667** | **0.568** | **85.1%** | **~5ms** |

The student model is approximately 13x faster at query time than the teacher while retaining 85% of retrieval quality. Performance is particularly strong on Python code search (98% of teacher).

## How the Student Was Built

### Architecture: Layer Pruning from Teacher

The student was created by selecting 6 layers from ColBERT-Zero's 22-layer ModernBERT backbone using a skewed-late strategy that preserves more upper layers (which encode retrieval-relevant semantics):

```text
Teacher layers: [0, 1, 2, ..., 21]      (22 total)
Student layers: [0, 8, 14, 17, 19, 21]  (6 selected)
```

The student inherits:

- All embedding weights from the teacher
- The 768-to-128 ColBERT projection layer
- Selected transformer layers with full weight copying
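
The pruning step reduces to indexing into the teacher's layer stack. A toy sketch of the selection (strings stand in for transformer blocks; the real code copies the full block weights):

```python
# Skewed-late layer selection: keep layer 0 plus a cluster of upper layers,
# since the upper layers carry most of the retrieval-relevant semantics.
STUDENT_LAYER_INDICES = [0, 8, 14, 17, 19, 21]  # chosen from the 22-layer teacher

def prune_layers(teacher_layers, keep):
    """Return the subset of teacher layers at the given indices, in order."""
    return [teacher_layers[i] for i in keep]

teacher = [f"layer_{i}" for i in range(22)]  # stand-ins for real blocks
student = prune_layers(teacher, STUDENT_LAYER_INDICES)
print(student)
# ['layer_0', 'layer_8', 'layer_14', 'layer_17', 'layer_19', 'layer_21']
```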

### Training: Knowledge Distillation

- **Dataset**: CodeSearchNet (10,000 comment-code pairs)
- **Teacher scoring**: ColBERT-Zero generates MaxSim relevance scores for each query against 1 positive + 3 random negative documents
- **Loss**: PyLate `Distillation` loss (KL divergence between teacher and student score distributions)
- **Optimizer**: AdamW, lr=5e-5, weight_decay=0.01, warmup_ratio=0.1
- **Training**: 1000 steps, batch_size=8, gradient_accumulation=4 (effective batch size 32)
- **Hardware**: Apple Silicon (M4 Max) via the PyTorch MPS backend, ~17 minutes total
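
The distillation objective can be illustrated in a few lines: softmax both score vectors over the 1 positive + 3 negatives, then take the KL divergence of the student distribution from the teacher's. The scores below are invented for illustration; PyLate's `Distillation` loss applies this idea over full training batches:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def kl_div(p, q):
    """KL(p || q): teacher distribution p, student distribution q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher_scores = [8.0, 3.0, 2.5, 1.0]  # MaxSim: 1 positive + 3 negatives
student_scores = [7.0, 3.5, 2.0, 1.5]  # student's scores for the same docs
loss = kl_div(softmax(teacher_scores), softmax(student_scores))
```

Minimizing this loss pushes the student's relative ranking of positives over negatives toward the teacher's, without requiring labeled relevance judgments.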

### Hyperparameter Search

The optimal configuration was found through 30 autonomous experiments sweeping learning rate, layer selection strategy, batch size, gradient accumulation, weight decay, warmup ratio, number of negatives, training steps, and embedding dimensions. Key findings:

- **Teacher initialization is critical**: starting from ColBERT-Zero's weights reaches MRR 0.46 vs 0.08 from raw ModernBERT, a 5.6x improvement
- **Skewed-late layer selection** outperforms evenly-spaced, last-6, and other strategies
- **Effective batch size 32** (bs=8, grad_accum=4) is optimal
- **Weight decay 0.01** provides a regularization benefit
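
Gradient accumulation trades memory for effective batch size: gradients from several micro-batches are averaged before a single optimizer step. A hypothetical one-parameter SGD loop makes the mechanic concrete (the names and data here are illustrative, not from the actual training code):

```python
def grad(w, x, y):
    """Gradient of 0.5 * (w*x - y)**2 with respect to w."""
    return (w * x - y) * x

data = [(1.0, 2.0), (2.0, 4.0), (1.5, 3.0), (0.5, 1.0)]  # targets follow y = 2x
w, lr, accum = 0.0, 0.1, 4
acc = 0.0
for step, (x, y) in enumerate(data, 1):
    acc += grad(w, x, y) / accum  # average gradients across micro-batches
    if step % accum == 0:
        w -= lr * acc             # one optimizer step per `accum` micro-batches
        acc = 0.0
print(w)  # 0.375 after one accumulated step, moving toward the target w = 2
```

With bs=8 and accum=4, each optimizer step sees gradients from 32 examples, matching the behavior of batch_size=32 at a quarter of the peak memory.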

## Usage

### Installation

```bash
pip install pylate
```

### Encoding & Retrieval

```python
from pylate import models
from pylate.scores import colbert_scores

# Load model
model = models.ColBERT(model_name_or_path="ctrltokyo/ColBERT-Zero-6L-CodeSearch")

# Encode documents
doc_embeddings = model.encode(
    ["def hello():\n    print('Hello, World!')", "class UserAuth:\n    ..."],
    batch_size=32,
    is_query=False,
    show_progress_bar=True,
)

# Encode queries
query_embeddings = model.encode(
    ["function that prints a greeting"],
    batch_size=32,
    is_query=True,
    show_progress_bar=True,
)

# Score with MaxSim
scores = colbert_scores(query_embeddings, doc_embeddings)
print(scores)  # Higher = more relevant
```

### Reranking

```python
from pylate import rank, models

model = models.ColBERT(model_name_or_path="ctrltokyo/ColBERT-Zero-6L-CodeSearch")

queries = ["how to authenticate users"]
documents = [["def login(user, pwd): ...", "def sort_list(arr): ...", "class AuthMiddleware: ..."]]
documents_ids = [["doc1", "doc2", "doc3"]]

queries_embeddings = model.encode(queries, is_query=True)
documents_embeddings = model.encode(documents, is_query=False)

reranked = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
```

### GGUF / litembeddings

This model can be converted to GGUF format for use with litembeddings (SQLite-based embedding engine with SIMD-accelerated MaxSim):

```bash
# Convert to GGUF
python convert_hf_to_gguf.py ctrltokyo/ColBERT-Zero-6L-CodeSearch --outfile model-f16.gguf --outtype f16

# Extract the 768->128 projection weights
python -c "
from safetensors import safe_open
import numpy as np
f = safe_open('1_Dense/model.safetensors', framework='np')
f.get_tensor('linear.weight').astype(np.float32).tofile('model.projection')
"
```

Then in SQL:

```sql
SELECT lembed_model('codesearch', 'model-f16.gguf', '{"colbert_projection": "model.projection"}');
SELECT lembed_maxsim(
    lembed_tokens('search_query: how to sort a list'),
    lembed_tokens('search_document: def quicksort(arr): ...')
);
```

## Limitations

- **Weakest on C code search** (65.9% of teacher on the jq corpus), likely because CodeSearchNet training data is Python-heavy
- **Trained on only 10k pairs**: larger training sets or hard-negative mining could improve quality further
- **English only**: inherits ColBERT-Zero's language coverage
- **No asymmetric prompts**: unlike the teacher, this model uses `[Q]`/`[D]` prefixes rather than `search_query:`/`search_document:` prompts
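
The prefix asymmetry can be illustrated with plain strings (illustrative only: PyLate inserts the `[Q]`/`[D]` markers as special tokens during tokenization, not as literal text):

```python
# Queries and documents receive different marker prefixes so the model
# can treat the two sides of the retrieval task asymmetrically.
def mark(text, is_query):
    prefix = "[Q] " if is_query else "[D] "
    return prefix + text

print(mark("how to authenticate users", True))   # [Q] how to authenticate users
print(mark("def login(user, pwd): ...", False))  # [D] def login(user, pwd): ...
```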

## Citation

```bibtex
@misc{colbert-zero-6l-codesearch,
  title={ColBERT-Zero-6L-CodeSearch: A Distilled ColBERT Model for Code Search},
  author={Alexander Nicholson},
  year={2026},
  note={Distilled from ColBERT-Zero (Chaffin et al., 2026) using PyLate on Apple Silicon}
}
```

## Acknowledgments

- **ColBERT-Zero** by LightOn AI: the teacher model
- **PyLate**: the ColBERT training framework
- **litembeddings**: the SQLite embedding engine used for benchmarking
- Training and experimentation were performed entirely on Apple Silicon (M4 Max) using the PyTorch MPS backend