---
tags:
- ColBERT
- PyLate
- sentence-transformers
- sentence-similarity
- feature-extraction
- code-search
- knowledge-distillation
- modernbert
- apple-silicon
- mps
pipeline_tag: sentence-similarity
library_name: PyLate
license: apache-2.0
language:
- en
datasets:
- sentence-transformers/codesearchnet
base_model: lightonai/ColBERT-Zero
---

# ColBERT-Zero-6L-CodeSearch

A **6-layer ColBERT model** distilled from [ColBERT-Zero](https://huggingface.co/lightonai/ColBERT-Zero) (22 layers) for code search, achieving **85% of the teacher's retrieval quality at 13x faster query speed**.

## Model Details

| Parameter | Value |
|-----------|-------|
| **Architecture** | ModernBERT (6 layers, 768 hidden, 12 heads) |
| **Base Model** | [lightonai/ColBERT-Zero](https://huggingface.co/lightonai/ColBERT-Zero) |
| **Output Dimensionality** | 128 per-token embeddings |
| **Similarity Function** | MaxSim (late interaction) |
| **Parameters** | ~38M (vs. ~100M teacher) |
| **Query Length** | 32 tokens |
| **Document Length** | 180 tokens |
| **License** | Apache 2.0 |

## Benchmark Results

Evaluated on 3 code search corpora (150 questions total) via [litembeddings](https://github.com/alexandernicholson/litembeddings):

| Corpus | Teacher MRR | Student MRR | % of Teacher | Student Query Speed |
|--------|------------|-------------|--------------|---------------------|
| jq (C) | 0.539 | 0.355 | 65.9% | ~7ms |
| Rails (Ruby) | 0.679 | 0.581 | 85.6% | ~3ms |
| FastAPI (Python) | 0.782 | 0.766 | **98.0%** | ~4ms |
| **Aggregate** | **0.667** | **0.568** | **85.1%** | **~5ms** |

The student model is approximately **13x faster** at query time than the teacher while retaining 85% of its retrieval quality. Performance is particularly strong on Python code search (98% of teacher).
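The MaxSim (late interaction) similarity used throughout this card can be sketched in a few lines of NumPy: for each query token embedding, take the maximum similarity against all document token embeddings, then sum over query tokens. This is an illustrative sketch only; `pylate.scores.colbert_scores` (shown in the Usage section) is the batched reference implementation.

```python
import numpy as np

def maxsim(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Late-interaction MaxSim: for each query token, take the best
    dot-product similarity with any document token, then sum.

    query_emb: (num_query_tokens, dim), L2-normalized per token
    doc_emb:   (num_doc_tokens, dim),   L2-normalized per token
    """
    sim = query_emb @ doc_emb.T          # (q_tokens, d_tokens) token-level similarities
    return float(sim.max(axis=1).sum())  # best doc token per query token, summed

# Toy example with dim=2 token embeddings
q = np.array([[1.0, 0.0], [0.0, 1.0]])
d = np.array([[1.0, 0.0], [0.7071, 0.7071]])
print(maxsim(q, d))  # 1.0 + 0.7071 ≈ 1.7071
```

Because each query token is matched independently against the whole document, a 32-token query against a 180-token document costs one 32×180 similarity matrix per candidate, which is what makes the SIMD-accelerated MaxSim in litembeddings fast.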
## How the Student Was Built

### Architecture: Layer Pruning from the Teacher

The student was created by selecting 6 layers from ColBERT-Zero's 22-layer ModernBERT backbone using a **skewed-late** strategy that preserves more upper layers (which encode retrieval-relevant semantics):

```
Teacher layers: [0, 1, 2, ..., 21]     (22 total)
Student layers: [0, 8, 14, 17, 19, 21] (6 selected)
```

The student inherits:

- All embedding weights from the teacher
- The 768-to-128 ColBERT projection layer
- The selected transformer layers, with full weight copying

### Training: Knowledge Distillation

- **Dataset**: [CodeSearchNet](https://huggingface.co/datasets/sentence-transformers/codesearchnet) (10,000 comment-code pairs)
- **Teacher scoring**: ColBERT-Zero generates MaxSim relevance scores for each query against 1 positive + 3 random negative documents
- **Loss**: PyLate `Distillation` loss (KL divergence between the teacher and student score distributions)
- **Optimizer**: AdamW, lr=5e-5, weight_decay=0.01, warmup_ratio=0.1
- **Training**: 1,000 steps, batch_size=8, gradient_accumulation=4 (effective batch size 32)
- **Hardware**: Apple Silicon (M4 Max) via the PyTorch MPS backend, ~17 minutes total

### Hyperparameter Search

The optimal configuration was found through **30 autonomous experiments** sweeping learning rate, layer selection strategy, batch size, gradient accumulation, weight decay, warmup ratio, number of negatives, training steps, and embedding dimensions.
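The distillation objective described above can be sketched as follows: softmax the teacher's and student's MaxSim scores over each query's candidate documents (1 positive + 3 negatives), then minimize the KL divergence between the two distributions. This is a minimal NumPy sketch of the idea, not PyLate's actual `Distillation` implementation:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())  # shift for numerical stability
    return e / e.sum()

def distillation_kl(teacher_scores, student_scores) -> float:
    """KL(teacher || student) over one query's candidate documents
    (1 positive + 3 negatives), as in score-distribution distillation."""
    p = softmax(np.asarray(teacher_scores, dtype=np.float64))
    q = softmax(np.asarray(student_scores, dtype=np.float64))
    return float(np.sum(p * np.log(p / q)))

# Teacher strongly prefers the positive (first) document; training pushes
# the student to reproduce that preference. Scores here are made up.
teacher = [8.0, 2.0, 1.5, 1.0]
aligned = [7.5, 2.1, 1.4, 1.1]    # student close to teacher -> small loss
shuffled = [1.0, 8.0, 2.0, 1.5]   # student disagrees -> large loss
print(distillation_kl(teacher, aligned), distillation_kl(teacher, shuffled))
```

Because the target is the teacher's full score *distribution* rather than a binary label, the student also learns how much worse each negative is, which is more informative per example than contrastive training alone.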
Key findings:

- **Teacher initialization is critical**: starting from ColBERT-Zero's weights (MRR 0.46) vs. raw ModernBERT (MRR 0.08) is a 5.6x improvement
- **Skewed-late layer selection** outperforms evenly-spaced, last-6, and other strategies
- **Effective batch size 32** (bs=8, grad_accum=4) is optimal
- **Weight decay 0.01** provides a regularization benefit

## Usage

### Installation

```bash
pip install pylate
```

### Encoding & Retrieval

```python
from pylate import models
from pylate.scores import colbert_scores

# Load the model
model = models.ColBERT(model_name_or_path="ctrltokyo/ColBERT-Zero-6L-CodeSearch")

# Encode documents
doc_embeddings = model.encode(
    [
        "def hello():\n    print('Hello, World!')",
        "class UserAuth:\n    ...",
    ],
    batch_size=32,
    is_query=False,
    show_progress_bar=True,
)

# Encode queries
query_embeddings = model.encode(
    ["function that prints a greeting"],
    batch_size=32,
    is_query=True,
    show_progress_bar=True,
)

# Score with MaxSim
scores = colbert_scores(query_embeddings, doc_embeddings)
print(scores)  # Higher = more relevant
```

### Reranking

```python
from pylate import models, rank

model = models.ColBERT(model_name_or_path="ctrltokyo/ColBERT-Zero-6L-CodeSearch")

queries = ["how to authenticate users"]
documents = [
    [
        "def login(user, pwd): ...",
        "def sort_list(arr): ...",
        "class AuthMiddleware: ...",
    ]
]
documents_ids = [["doc1", "doc2", "doc3"]]

queries_embeddings = model.encode(queries, is_query=True)
documents_embeddings = model.encode(documents, is_query=False)

reranked = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
```

## GGUF / litembeddings

This model can be converted to GGUF format for use with [litembeddings](https://github.com/alexandernicholson/litembeddings) (a SQLite-based embedding engine with SIMD-accelerated MaxSim):

```bash
# Convert to GGUF
python convert_hf_to_gguf.py ctrltokyo/ColBERT-Zero-6L-CodeSearch --outfile model-f16.gguf --outtype f16
```
```bash
# Extract the projection matrix
python -c "
from safetensors import safe_open
import numpy as np

f = safe_open('1_Dense/model.safetensors', framework='numpy')
f.get_tensor('linear.weight').astype(np.float32).tofile('model.projection')
"
```

Then in SQL:

```sql
SELECT lembed_model('codesearch', 'model-f16.gguf', '{"colbert_projection": "model.projection"}');
SELECT lembed_maxsim(
  lembed_tokens('search_query: how to sort a list'),
  lembed_tokens('search_document: def quicksort(arr): ...')
);
```

## Limitations

- **Weakest on C code search** (65.9% of teacher on the jq corpus), likely because the CodeSearchNet training data is Python-heavy
- **Trained on only 10k pairs**: larger training sets or hard-negative mining could improve quality further
- **English only**: inherits ColBERT-Zero's language capabilities
- **No asymmetric prompts**: unlike the teacher, this model does not use `search_query:`/`search_document:` prompts (it uses `[Q]`/`[D]` prefixes instead)

## Citation

```bibtex
@misc{colbert-zero-6l-codesearch,
  title={ColBERT-Zero-6L-CodeSearch: A Distilled ColBERT Model for Code Search},
  author={Alexander Nicholson},
  year={2026},
  note={Distilled from ColBERT-Zero (Chaffin et al., 2026) using PyLate on Apple Silicon}
}
```

## Acknowledgments

- [ColBERT-Zero](https://huggingface.co/lightonai/ColBERT-Zero) by LightOn AI, the teacher model
- [PyLate](https://github.com/lightonai/pylate), the ColBERT training framework
- [litembeddings](https://github.com/alexandernicholson/litembeddings), the SQLite embedding engine used for benchmarking
- Training and experimentation performed entirely on Apple Silicon (M4 Max) using the PyTorch MPS backend