---
tags:
  - ColBERT
  - PyLate
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - code-search
  - knowledge-distillation
  - modernbert
  - apple-silicon
  - mps
pipeline_tag: sentence-similarity
library_name: PyLate
license: apache-2.0
language:
  - en
datasets:
  - sentence-transformers/codesearchnet
base_model: lightonai/ColBERT-Zero
---

# ColBERT-Zero-6L-CodeSearch

A 6-layer ColBERT model distilled from ColBERT-Zero (22 layers) for code search, achieving 85% of the teacher's retrieval quality at 13x faster query speed.

## Model Details

| Parameter | Value |
|---|---|
| Architecture | ModernBERT (6 layers, 768 hidden, 12 heads) |
| Base Model | lightonai/ColBERT-Zero |
| Output Dimensionality | 128 per-token embeddings |
| Similarity Function | MaxSim (late interaction) |
| Parameters | ~38M (vs ~100M teacher) |
| Query Length | 32 tokens |
| Document Length | 180 tokens |
| License | Apache 2.0 |
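
The MaxSim (late interaction) similarity used here sums, for each query token, its maximum similarity to any document token. A minimal pure-Python sketch, assuming L2-normalized embeddings so that the dot product equals cosine similarity (toy 2-d vectors, not real model output):

```python
# MaxSim: for each query token, take the best-matching document token,
# then sum those per-token maxima into a single relevance score.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maxsim(query_tokens, doc_tokens):
    """Sum over query tokens of the max similarity to any document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

# Toy 2-d "embeddings": two query tokens, three document tokens
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
print(maxsim(query, doc))  # 1.0 + 1.0 = 2.0
```

In the real model each token embedding is a 128-dimensional vector, but the scoring logic is the same.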

## Benchmark Results

Evaluated on 3 code search corpora (150 questions total) via litembeddings:

| Corpus | Teacher MRR | Student MRR | % of Teacher | Student Query Speed |
|---|---|---|---|---|
| jq (C) | 0.539 | 0.355 | 65.9% | ~7ms |
| Rails (Ruby) | 0.679 | 0.581 | 85.6% | ~3ms |
| FastAPI (Python) | 0.782 | 0.766 | 98.0% | ~4ms |
| **Aggregate** | **0.667** | **0.568** | **85.1%** | **~5ms** |

The student model is approximately 13x faster at query time than the teacher while retaining 85% of retrieval quality. Performance is particularly strong on Python code search (98% of teacher).

## How the Student Was Built

### Architecture: Layer Pruning from Teacher

The student was created by selecting 6 layers from ColBERT-Zero's 22-layer ModernBERT backbone using a skewed-late strategy that preserves more upper layers (which encode retrieval-relevant semantics):

```text
Teacher layers: [0, 1, 2, ..., 21]      (22 total)
Student layers: [0, 8, 14, 17, 19, 21]  (6 selected)
```

The student inherits:

- All embedding weights from the teacher
- The 768-to-128 ColBERT projection layer
- Selected transformer layers with full weight copying
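
The pruning step reduces to indexing into the teacher's layer stack. A toy sketch of the selection (strings stand in for transformer blocks; the real code copies the full block weights):

```python
# Skewed-late layer selection: keep layer 0 plus a cluster of upper layers,
# since the upper layers carry most of the retrieval-relevant semantics.
STUDENT_LAYER_INDICES = [0, 8, 14, 17, 19, 21]  # chosen from the 22-layer teacher

def prune_layers(teacher_layers, keep):
    """Return the subset of teacher layers at the given indices, in order."""
    return [teacher_layers[i] for i in keep]

teacher = [f"layer_{i}" for i in range(22)]  # stand-ins for real blocks
student = prune_layers(teacher, STUDENT_LAYER_INDICES)
print(student)
# ['layer_0', 'layer_8', 'layer_14', 'layer_17', 'layer_19', 'layer_21']
```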

### Training: Knowledge Distillation

- **Dataset**: CodeSearchNet (10,000 comment-code pairs)
- **Teacher scoring**: ColBERT-Zero generates MaxSim relevance scores for each query against 1 positive + 3 random negative documents
- **Loss**: PyLate `Distillation` loss (KL divergence between teacher and student score distributions)
- **Optimizer**: AdamW, lr=5e-5, weight_decay=0.01, warmup_ratio=0.1
- **Training**: 1000 steps, batch_size=8, gradient_accumulation=4 (effective batch size 32)
- **Hardware**: Apple Silicon (M4 Max) via the PyTorch MPS backend, ~17 minutes total
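
The distillation objective can be illustrated in a few lines: softmax both score vectors over the 1 positive + 3 negatives, then take the KL divergence of the student distribution from the teacher's. The scores below are invented for illustration; PyLate's `Distillation` loss applies this idea over full training batches:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def kl_div(p, q):
    """KL(p || q): teacher distribution p, student distribution q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher_scores = [8.0, 3.0, 2.5, 1.0]  # MaxSim: 1 positive + 3 negatives
student_scores = [7.0, 3.5, 2.0, 1.5]  # student's scores for the same docs
loss = kl_div(softmax(teacher_scores), softmax(student_scores))
```

Minimizing this loss pushes the student's relative ranking of positives over negatives toward the teacher's, without requiring labeled relevance judgments.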

### Hyperparameter Search

The optimal configuration was found through 30 autonomous experiments sweeping learning rate, layer selection strategy, batch size, gradient accumulation, weight decay, warmup ratio, number of negatives, training steps, and embedding dimensions. Key findings:

- **Teacher initialization is critical**: starting from ColBERT-Zero's weights reaches MRR 0.46 vs 0.08 from raw ModernBERT, a 5.6x improvement
- **Skewed-late layer selection** outperforms evenly-spaced, last-6, and other strategies
- **Effective batch size 32** (bs=8, grad_accum=4) is optimal
- **Weight decay 0.01** provides a regularization benefit
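
Gradient accumulation trades memory for effective batch size: gradients from several micro-batches are averaged before a single optimizer step. A hypothetical one-parameter SGD loop makes the mechanic concrete (the names and data here are illustrative, not from the actual training code):

```python
def grad(w, x, y):
    """Gradient of 0.5 * (w*x - y)**2 with respect to w."""
    return (w * x - y) * x

data = [(1.0, 2.0), (2.0, 4.0), (1.5, 3.0), (0.5, 1.0)]  # targets follow y = 2x
w, lr, accum = 0.0, 0.1, 4
acc = 0.0
for step, (x, y) in enumerate(data, 1):
    acc += grad(w, x, y) / accum  # average gradients across micro-batches
    if step % accum == 0:
        w -= lr * acc             # one optimizer step per `accum` micro-batches
        acc = 0.0
print(w)  # 0.375 after one accumulated step, moving toward the target w = 2
```

With bs=8 and accum=4, each optimizer step sees gradients from 32 examples, matching the behavior of batch_size=32 at a quarter of the peak memory.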

## Usage

### Installation

```bash
pip install pylate
```

### Encoding & Retrieval

```python
from pylate import models
from pylate.scores import colbert_scores

# Load model
model = models.ColBERT(model_name_or_path="ctrltokyo/ColBERT-Zero-6L-CodeSearch")

# Encode documents
doc_embeddings = model.encode(
    ["def hello():\n    print('Hello, World!')", "class UserAuth:\n    ..."],
    batch_size=32,
    is_query=False,
    show_progress_bar=True,
)

# Encode queries
query_embeddings = model.encode(
    ["function that prints a greeting"],
    batch_size=32,
    is_query=True,
    show_progress_bar=True,
)

# Score with MaxSim
scores = colbert_scores(query_embeddings, doc_embeddings)
print(scores)  # Higher = more relevant
```

### Reranking

```python
from pylate import rank, models

model = models.ColBERT(model_name_or_path="ctrltokyo/ColBERT-Zero-6L-CodeSearch")

queries = ["how to authenticate users"]
documents = [["def login(user, pwd): ...", "def sort_list(arr): ...", "class AuthMiddleware: ..."]]
documents_ids = [["doc1", "doc2", "doc3"]]

queries_embeddings = model.encode(queries, is_query=True)
documents_embeddings = model.encode(documents, is_query=False)

reranked = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
```

### GGUF / litembeddings

This model can be converted to GGUF format for use with litembeddings (SQLite-based embedding engine with SIMD-accelerated MaxSim):

```bash
# Convert to GGUF
python convert_hf_to_gguf.py ctrltokyo/ColBERT-Zero-6L-CodeSearch --outfile model-f16.gguf --outtype f16

# Extract the 768->128 projection weights
python -c "
from safetensors import safe_open
import numpy as np
f = safe_open('1_Dense/model.safetensors', framework='np')
f.get_tensor('linear.weight').astype(np.float32).tofile('model.projection')
"
```

Then in SQL:

```sql
SELECT lembed_model('codesearch', 'model-f16.gguf', '{"colbert_projection": "model.projection"}');
SELECT lembed_maxsim(
    lembed_tokens('search_query: how to sort a list'),
    lembed_tokens('search_document: def quicksort(arr): ...')
);
```

## Limitations

- **Weakest on C code search** (65.9% of teacher on the jq corpus), likely because CodeSearchNet training data is Python-heavy
- **Trained on only 10k pairs**: larger training sets or hard-negative mining could improve quality further
- **English only**: inherits ColBERT-Zero's language coverage
- **No asymmetric prompts**: unlike the teacher, this model uses `[Q]`/`[D]` prefixes rather than `search_query:`/`search_document:` prompts
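
The prefix asymmetry can be illustrated with plain strings (illustrative only: PyLate inserts the `[Q]`/`[D]` markers as special tokens during tokenization, not as literal text):

```python
# Queries and documents receive different marker prefixes so the model
# can treat the two sides of the retrieval task asymmetrically.
def mark(text, is_query):
    prefix = "[Q] " if is_query else "[D] "
    return prefix + text

print(mark("how to authenticate users", True))   # [Q] how to authenticate users
print(mark("def login(user, pwd): ...", False))  # [D] def login(user, pwd): ...
```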

## Citation

```bibtex
@misc{colbert-zero-6l-codesearch,
  title={ColBERT-Zero-6L-CodeSearch: A Distilled ColBERT Model for Code Search},
  author={Alexander Nicholson},
  year={2026},
  note={Distilled from ColBERT-Zero (Chaffin et al., 2026) using PyLate on Apple Silicon}
}
```

## Acknowledgments

- **ColBERT-Zero** by LightOn AI: the teacher model
- **PyLate**: the ColBERT training framework
- **litembeddings**: the SQLite embedding engine used for benchmarking
- Training and experimentation were performed entirely on Apple Silicon (M4 Max) using the PyTorch MPS backend