---
tags:
- ColBERT
- PyLate
- sentence-transformers
- sentence-similarity
- feature-extraction
- code-search
- knowledge-distillation
- modernbert
- apple-silicon
- mps
pipeline_tag: sentence-similarity
library_name: PyLate
license: apache-2.0
language:
- en
datasets:
- sentence-transformers/codesearchnet
base_model: lightonai/ColBERT-Zero
---

# ColBERT-Zero-6L-CodeSearch

A **6-layer ColBERT model** distilled from [ColBERT-Zero](https://huggingface.co/lightonai/ColBERT-Zero) (22 layers) for code search, achieving **85% of the teacher's retrieval quality at 13x faster query speed**.

## Model Details

| Parameter | Value |
|-----------|-------|
| **Architecture** | ModernBERT (6 layers, 768 hidden, 12 heads) |
| **Base Model** | [lightonai/ColBERT-Zero](https://huggingface.co/lightonai/ColBERT-Zero) |
| **Output Dimensionality** | 128 (per-token embeddings) |
| **Similarity Function** | MaxSim (late interaction) |
| **Parameters** | ~38M (vs. ~100M teacher) |
| **Query Length** | 32 tokens |
| **Document Length** | 180 tokens |
| **License** | Apache 2.0 |
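
MaxSim scores a query against a document by taking, for each query token embedding, the maximum similarity over all document token embeddings, then summing over query tokens. A minimal NumPy sketch with toy one-hot vectors (not real model output):

```python
import numpy as np

def maxsim(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Late-interaction MaxSim score.

    query_emb: (num_query_tokens, dim) L2-normalized token embeddings
    doc_emb:   (num_doc_tokens, dim)   L2-normalized token embeddings
    """
    sim = query_emb @ doc_emb.T          # (q_tokens, d_tokens) similarity matrix
    return float(sim.max(axis=1).sum())  # max over doc tokens, sum over query tokens

# Toy example: 2 query tokens, 3 doc tokens, 4-dim one-hot "embeddings"
q = np.eye(4)[:2]
d = np.eye(4)[[0, 1, 2]]
print(maxsim(q, d))  # each query token finds an exact match -> 2.0
```

Because each query token matches independently, a document only needs to cover the query's tokens somewhere, not in one contiguous span — this is what makes late interaction robust for code search.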

## Benchmark Results

Evaluated on 3 code search corpora (150 questions total) via [litembeddings](https://github.com/alexandernicholson/litembeddings):

| Corpus | Teacher MRR | Student MRR | % of Teacher | Student Query Speed |
|--------|-------------|-------------|--------------|---------------------|
| jq (C) | 0.539 | 0.355 | 65.9% | ~7ms |
| Rails (Ruby) | 0.679 | 0.581 | 85.6% | ~3ms |
| FastAPI (Python) | 0.782 | 0.766 | **98.0%** | ~4ms |
| **Aggregate** | **0.667** | **0.568** | **85.1%** | **~5ms** |

The student model is approximately **13x faster** at query time than the teacher while retaining 85% of retrieval quality. Performance is particularly strong on Python code search (98% of teacher).
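
MRR (mean reciprocal rank) averages, over all queries, the reciprocal of the rank at which the relevant document first appears. A minimal sketch of the metric (the actual evaluation harness is litembeddings, not this code):

```python
def mean_reciprocal_rank(ranked_results, relevant):
    """MRR over a set of queries.

    ranked_results: one ranked list of doc ids per query
    relevant:       the relevant doc id for each query
    A query whose relevant doc is missing contributes 0.
    """
    total = 0.0
    for ranking, rel in zip(ranked_results, relevant):
        if rel in ranking:
            total += 1.0 / (ranking.index(rel) + 1)  # reciprocal of 1-based rank
    return total / len(ranked_results)

# Relevant doc at ranks 1, 2, and 4 -> (1 + 0.5 + 0.25) / 3
print(mean_reciprocal_rank(
    [["a", "b"], ["b", "a"], ["c", "d", "b", "a"]],
    ["a", "a", "a"],
))
```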

## How the Student Was Built

### Architecture: Layer Pruning from Teacher

The student was created by selecting 6 layers from ColBERT-Zero's 22-layer ModernBERT backbone using a **skewed-late** strategy that preserves more upper layers (which encode retrieval-relevant semantics):

```
Teacher layers: [0, 1, 2, ..., 21]      (22 total)
Student layers: [0, 8, 14, 17, 19, 21]  (6 selected)
```

The student inherits:
- All embedding weights from the teacher
- The 768-to-128 ColBERT projection layer
- The selected transformer layers, with full weight copying
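
One way to implement this is a state-dict remap: copy the chosen teacher layers into contiguous student slots and carry everything else (embeddings, projection) over unchanged. A toy sketch with an illustrative key scheme — real ModernBERT checkpoints use their own parameter names:

```python
KEEP = [0, 8, 14, 17, 19, 21]  # skewed-late selection from this card

def remap_state_dict(teacher_sd: dict, keep=KEEP, prefix="layers.") -> dict:
    """Copy selected teacher layers into a contiguous student state dict.

    Keys like 'layers.8.attn.weight' become 'layers.1.attn.weight';
    non-layer keys (embeddings, projection) are copied as-is.
    """
    index = {old: new for new, old in enumerate(keep)}  # old layer -> new slot
    student_sd = {}
    for key, tensor in teacher_sd.items():
        if key.startswith(prefix):
            layer, rest = key[len(prefix):].split(".", 1)
            if int(layer) in index:  # drop layers not selected
                student_sd[f"{prefix}{index[int(layer)]}.{rest}"] = tensor
        else:
            student_sd[key] = tensor
    return student_sd

# Toy state dict: 22 layers with one weight each, plus an embedding
teacher_sd = {f"layers.{i}.w": i for i in range(22)}
teacher_sd["embed.w"] = "emb"
student_sd = remap_state_dict(teacher_sd)
print(sorted(student_sd))  # layers renumbered 0..5, embed.w kept
```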

### Training: Knowledge Distillation

- **Dataset**: [CodeSearchNet](https://huggingface.co/datasets/sentence-transformers/codesearchnet) (10,000 comment-code pairs)
- **Teacher scoring**: ColBERT-Zero generates MaxSim relevance scores for each query against 1 positive + 3 random negative documents
- **Loss**: PyLate Distillation loss (KL divergence between teacher and student score distributions)
- **Optimizer**: AdamW, lr=5e-5, weight_decay=0.01, warmup_ratio=0.1
- **Training**: 1,000 steps, batch_size=8, gradient_accumulation=4 (effective batch size 32)
- **Hardware**: Apple Silicon (M4 Max) via the PyTorch MPS backend, ~17 minutes total
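
The objective can be sketched as a KL divergence between softmax-normalized teacher and student scores over each query's candidate set (1 positive + 3 negatives). This mirrors the idea behind PyLate's Distillation loss, not its exact implementation:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))  # shift for numerical stability
    return z / z.sum(axis=-1, keepdims=True)

def distill_kl(teacher_scores, student_scores):
    """Mean KL(teacher || student) over each query's candidate documents.

    Scores have shape (batch, num_candidates) — here 1 positive
    + 3 negatives per query, as in training.
    """
    p = softmax(np.asarray(teacher_scores, dtype=float))
    q = softmax(np.asarray(student_scores, dtype=float))
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean())

# Identical score distributions -> zero loss; a swapped ranking -> positive loss
t = [[5.0, 1.0, 0.5, 0.2]]
print(distill_kl(t, t))                       # 0.0
print(distill_kl(t, [[1.0, 5.0, 0.5, 0.2]]))  # > 0
```

Minimizing this pushes the student to reproduce the teacher's *relative* preferences among candidates, which is a softer target than hard positive/negative labels.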

### Hyperparameter Search

The optimal configuration was found through **30 autonomous experiments** sweeping learning rate, layer-selection strategy, batch size, gradient accumulation, weight decay, warmup ratio, number of negatives, training steps, and embedding dimensions. Key findings:

- **Teacher initialization is critical**: starting from ColBERT-Zero's weights (MRR 0.46) rather than raw ModernBERT (MRR 0.08) gives a 5.6x improvement
- **Skewed-late layer selection** outperforms evenly-spaced, last-6, and other strategies
- **Effective batch size 32** (bs=8, grad_accum=4) is optimal
- **Weight decay 0.01** provides a regularization benefit

## Usage

### Installation

```bash
pip install pylate
```

### Encoding & Retrieval

```python
from pylate import models
from pylate.scores import colbert_scores

# Load the model
model = models.ColBERT(model_name_or_path="ctrltokyo/ColBERT-Zero-6L-CodeSearch")

# Encode documents
doc_embeddings = model.encode(
    ["def hello():\n    print('Hello, World!')", "class UserAuth:\n    ..."],
    batch_size=32,
    is_query=False,
    show_progress_bar=True,
)

# Encode queries
query_embeddings = model.encode(
    ["function that prints a greeting"],
    batch_size=32,
    is_query=True,
    show_progress_bar=True,
)

# Score with MaxSim
scores = colbert_scores(query_embeddings, doc_embeddings)
print(scores)  # Higher = more relevant
```

### Reranking

```python
from pylate import models, rank

model = models.ColBERT(model_name_or_path="ctrltokyo/ColBERT-Zero-6L-CodeSearch")

queries = ["how to authenticate users"]
documents = [["def login(user, pwd): ...", "def sort_list(arr): ...", "class AuthMiddleware: ..."]]
documents_ids = [["doc1", "doc2", "doc3"]]

queries_embeddings = model.encode(queries, is_query=True)
documents_embeddings = model.encode(documents, is_query=False)

reranked = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
```

## GGUF / litembeddings

This model can be converted to GGUF format for use with [litembeddings](https://github.com/alexandernicholson/litembeddings), a SQLite-based embedding engine with SIMD-accelerated MaxSim:

```bash
# Convert to GGUF
python convert_hf_to_gguf.py ctrltokyo/ColBERT-Zero-6L-CodeSearch --outfile model-f16.gguf --outtype f16

# Extract the 768-to-128 projection weights
python -c "
from safetensors import safe_open
import numpy as np
f = safe_open('1_Dense/model.safetensors', framework='numpy')
f.get_tensor('linear.weight').astype(np.float32).tofile('model.projection')
"
```

Then in SQL:

```sql
SELECT lembed_model('codesearch', 'model-f16.gguf', '{"colbert_projection": "model.projection"}');
SELECT lembed_maxsim(
  lembed_tokens('search_query: how to sort a list'),
  lembed_tokens('search_document: def quicksort(arr): ...')
);
```

## Limitations

- **Weakest on C code search** (65.9% of teacher on the jq corpus), likely because CodeSearchNet training data is Python-heavy
- **Trained on only 10k pairs**: larger training sets or hard-negative mining could improve quality further
- **English only**: inherits ColBERT-Zero's language capabilities
- **No asymmetric prompts**: unlike the teacher, this model does not use `search_query:`/`search_document:` prompts (it uses `[Q]`/`[D]` prefixes instead)

## Citation

```bibtex
@misc{colbert-zero-6l-codesearch,
  title={ColBERT-Zero-6L-CodeSearch: A Distilled ColBERT Model for Code Search},
  author={Alexander Nicholson},
  year={2026},
  note={Distilled from ColBERT-Zero (Chaffin et al., 2026) using PyLate on Apple Silicon}
}
```

## Acknowledgments

- [ColBERT-Zero](https://huggingface.co/lightonai/ColBERT-Zero) by LightOn AI — the teacher model
- [PyLate](https://github.com/lightonai/pylate) — ColBERT training framework
- [litembeddings](https://github.com/alexandernicholson/litembeddings) — SQLite embedding engine used for benchmarking
- Training and experimentation performed entirely on Apple Silicon (M4 Max) using the PyTorch MPS backend