---
tags:
- ColBERT
- PyLate
- sentence-transformers
- sentence-similarity
- feature-extraction
- code-search
- knowledge-distillation
- modernbert
- apple-silicon
- mps
pipeline_tag: sentence-similarity
library_name: PyLate
license: apache-2.0
language:
- en
datasets:
- sentence-transformers/codesearchnet
base_model: lightonai/ColBERT-Zero
---
# ColBERT-Zero-6L-CodeSearch
A **6-layer ColBERT model** distilled from [ColBERT-Zero](https://huggingface.co/lightonai/ColBERT-Zero) (22 layers) for code search, achieving **85% of the teacher's retrieval quality at 13x faster query speed**.
## Model Details
| Parameter | Value |
|-----------|-------|
| **Architecture** | ModernBERT (6 layers, 768 hidden, 12 heads) |
| **Base Model** | [lightonai/ColBERT-Zero](https://huggingface.co/lightonai/ColBERT-Zero) |
| **Output Dimensionality** | 128 per-token embeddings |
| **Similarity Function** | MaxSim (late interaction) |
| **Parameters** | ~38M (vs ~100M teacher) |
| **Query Length** | 32 tokens |
| **Document Length** | 180 tokens |
| **License** | Apache 2.0 |
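MaxSim (late interaction) scores a query against a document by taking, for each query token embedding, the maximum similarity over all document token embeddings, then summing over query tokens. A minimal NumPy sketch with toy vectors (not the model's actual 128-dim embeddings):

```python
import numpy as np

def maxsim(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Late-interaction MaxSim: for each query token embedding, take the
    maximum dot product over all document token embeddings, then sum.
    Assumes rows are L2-normalized, so dot product = cosine similarity."""
    sim = query_emb @ doc_emb.T          # (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())  # best-matching doc token per query token

# Toy 2-token query vs 2-token document (dim 2, already normalized)
q = np.array([[1.0, 0.0], [0.0, 1.0]])
d = np.array([[1.0, 0.0], [0.6, 0.8]])
print(maxsim(q, d))  # 1.0 + 0.8 = 1.8
```

Because each query token independently picks its best document token, MaxSim captures fine-grained token-level matches that a single pooled embedding would blur together.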
## Benchmark Results
Evaluated on 3 code search corpora (150 questions total) via [litembeddings](https://github.com/alexandernicholson/litembeddings):
| Corpus | Teacher MRR | Student MRR | % of Teacher | Student Query Speed |
|--------|------------|-------------|--------------|---------------------|
| jq (C) | 0.539 | 0.355 | 65.9% | ~7ms |
| Rails (Ruby) | 0.679 | 0.581 | 85.6% | ~3ms |
| FastAPI (Python) | 0.782 | 0.766 | **98.0%** | ~4ms |
| **Aggregate** | **0.667** | **0.568** | **85.1%** | **~5ms** |
The student model is approximately **13x faster** at query time than the teacher while retaining 85% of retrieval quality. Performance is particularly strong on Python code search (98% of teacher).
## How the Student Was Built
### Architecture: Layer Pruning from Teacher
The student was created by selecting 6 layers from ColBERT-Zero's 22-layer ModernBERT backbone using a **skewed-late** strategy that preserves more upper layers (which encode retrieval-relevant semantics):
```
Teacher layers: [0, 1, 2, ..., 21] (22 total)
Student layers: [0, 8, 14, 17, 19, 21] (6 selected)
```
The student inherits:
- All embedding weights from the teacher
- The 768-to-128 ColBERT projection layer
- Selected transformer layers with full weight copying
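The layer selection can be sketched as plain state-dict key remapping: keep the six chosen teacher layers, renumber them 0–5, and pass all non-layer weights through untouched. The `model.layers.{i}.` key prefix below is illustrative, not the exact ModernBERT checkpoint layout:

```python
SELECTED = [0, 8, 14, 17, 19, 21]  # skewed-late: keep more upper layers

def remap_layers(teacher_state: dict, selected=SELECTED, prefix="model.layers."):
    """Copy the selected teacher layers into a 6-layer student state dict,
    renumbering them 0..5. Non-layer keys (embeddings, ColBERT projection)
    pass through unchanged. Key naming is illustrative only."""
    index_map = {old: new for new, old in enumerate(selected)}
    student_state = {}
    for key, weight in teacher_state.items():
        if key.startswith(prefix):
            old_idx, rest = key[len(prefix):].split(".", 1)
            if int(old_idx) not in index_map:
                continue  # drop pruned layers
            key = f"{prefix}{index_map[int(old_idx)]}.{rest}"
        student_state[key] = weight
    return student_state

# Toy state dict: 22 layers with one tensor each, plus an embedding key
teacher = {f"model.layers.{i}.attn.weight": i for i in range(22)}
teacher["model.embeddings.weight"] = "emb"
student = remap_layers(teacher)
print(len(student))  # 6 layers + 1 embedding key = 7
```

Since student layer `k` is initialized from teacher layer `SELECTED[k]`, the deepest student layer (index 5) inherits the teacher's final layer (index 21).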
### Training: Knowledge Distillation
- **Dataset**: [CodeSearchNet](https://huggingface.co/datasets/sentence-transformers/codesearchnet) (10,000 comment-code pairs)
- **Teacher scoring**: ColBERT-Zero generates MaxSim relevance scores for each query against 1 positive + 3 random negative documents
- **Loss**: PyLate Distillation loss (KL divergence between teacher and student score distributions)
- **Optimizer**: AdamW, lr=5e-5, weight_decay=0.01, warmup_ratio=0.1
- **Training**: 1000 steps, batch_size=8, gradient_accumulation=4 (effective batch size 32)
- **Hardware**: Apple Silicon (M4 Max) via PyTorch MPS backend, ~17 minutes total
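The distillation objective above can be sketched as KL divergence between softmaxed teacher and student score distributions over each query's candidate set (1 positive + 3 negatives). This is a minimal NumPy illustration of the loss shape, not PyLate's actual implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distill_kl(teacher_scores, student_scores):
    """KL(teacher || student) over each query's candidate documents
    (here: 1 positive + 3 negatives), averaged over the batch."""
    p = softmax(np.asarray(teacher_scores, dtype=float))
    q = softmax(np.asarray(student_scores, dtype=float))
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

# One query: MaxSim scores for [positive, neg, neg, neg]
teacher = [[8.0, 2.0, 1.5, 1.0]]
perfect = distill_kl(teacher, teacher)  # identical distributions -> 0.0
flat = distill_kl(teacher, [[3.0, 2.9, 2.8, 2.7]])
print(perfect < flat)  # True: a flatter student distribution is penalized
```

Matching the teacher's score *distribution* (rather than hard positive/negative labels) transfers graded relevance information: the student learns not just which document is relevant, but by how much.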
### Hyperparameter Search
The optimal configuration was found through **30 autonomous experiments** sweeping learning rate, layer selection strategy, batch size, gradient accumulation, weight decay, warmup ratio, number of negatives, training steps, and embedding dimensions. Key findings:
- **Teacher initialization is critical**: Starting from ColBERT-Zero's weights (MRR 0.46) vs raw ModernBERT (MRR 0.08) — a 5.6x improvement
- **Skewed-late layer selection** outperforms evenly-spaced, last-6, and other strategies
- **Effective batch size 32** (bs=8, grad_accum=4) is optimal
- **Weight decay 0.01** provides regularization benefit
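A sweep over these dimensions can be sketched as a simple configuration grid. The values below are hypothetical placeholders; the actual 30 experiments were run autonomously, not as an exhaustive grid:

```python
from itertools import product

# Hypothetical sweep grid (illustrative values, not the actual experiment set)
grid = {
    "lr": [1e-5, 5e-5, 1e-4],
    "layer_strategy": ["evenly_spaced", "last_6", "skewed_late"],
    "weight_decay": [0.0, 0.01],
}

# Expand the grid into one config dict per experiment
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))  # 3 * 3 * 2 = 18 candidate configurations
```

Each config would then be trained and scored (e.g. by MRR on a held-out set) to pick the winner, which in this model's case was skewed-late selection with lr=5e-5 and weight decay 0.01.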
## Usage
### Installation
```bash
pip install pylate
```
### Encoding & Retrieval
```python
from pylate import indexes, models, retrieve
# Load model
model = models.ColBERT(model_name_or_path="ctrltokyo/ColBERT-Zero-6L-CodeSearch")
# Encode documents
doc_embeddings = model.encode(
["def hello():\n print('Hello, World!')", "class UserAuth:\n ..."],
batch_size=32,
is_query=False,
show_progress_bar=True,
)
# Encode queries
query_embeddings = model.encode(
["function that prints a greeting"],
batch_size=32,
is_query=True,
show_progress_bar=True,
)
# Score with MaxSim
from pylate.scores import colbert_scores
scores = colbert_scores(query_embeddings, doc_embeddings)
print(scores) # Higher = more relevant
```
### Reranking
```python
from pylate import rank, models
model = models.ColBERT(model_name_or_path="ctrltokyo/ColBERT-Zero-6L-CodeSearch")
queries = ["how to authenticate users"]
documents = [["def login(user, pwd): ...", "def sort_list(arr): ...", "class AuthMiddleware: ..."]]
documents_ids = [["doc1", "doc2", "doc3"]]
queries_embeddings = model.encode(queries, is_query=True)
documents_embeddings = model.encode(documents, is_query=False)
reranked = rank.rerank(
documents_ids=documents_ids,
queries_embeddings=queries_embeddings,
documents_embeddings=documents_embeddings,
)
```
## GGUF / litembeddings
This model can be converted to GGUF format for use with [litembeddings](https://github.com/alexandernicholson/litembeddings) (SQLite-based embedding engine with SIMD-accelerated MaxSim):
```bash
# Convert to GGUF
python convert_hf_to_gguf.py ctrltokyo/ColBERT-Zero-6L-CodeSearch --outfile model-f16.gguf --outtype f16
# Extract projection
python -c "
from safetensors import safe_open
import numpy as np
f = safe_open('1_Dense/model.safetensors', framework='numpy')
f.get_tensor('linear.weight').astype(np.float32).tofile('model.projection')
"
```
Then in SQL:
```sql
SELECT lembed_model('codesearch', 'model-f16.gguf', '{"colbert_projection": "model.projection"}');
SELECT lembed_maxsim(
lembed_tokens('search_query: how to sort a list'),
lembed_tokens('search_document: def quicksort(arr): ...')
);
```
## Limitations
- **Weakest on C code search** (65.9% of teacher on jq corpus) — likely because CodeSearchNet training data is Python-heavy
- **Trained on 10k pairs only** — larger training sets or hard negative mining could improve quality further
- **English only** — inherits ColBERT-Zero's language capabilities
- **No asymmetric prompts** — unlike the teacher, this model does not use `search_query:`/`search_document:` prompts (uses `[Q]`/`[D]` prefixes instead)
## Citation
```bibtex
@misc{colbert-zero-6l-codesearch,
title={ColBERT-Zero-6L-CodeSearch: A Distilled ColBERT Model for Code Search},
author={Alexander Nicholson},
year={2026},
note={Distilled from ColBERT-Zero (Chaffin et al., 2026) using PyLate on Apple Silicon}
}
```
## Acknowledgments
- [ColBERT-Zero](https://huggingface.co/lightonai/ColBERT-Zero) by LightOn AI — the teacher model
- [PyLate](https://github.com/lightonai/pylate) — ColBERT training framework
- [litembeddings](https://github.com/alexandernicholson/litembeddings) — SQLite embedding engine used for benchmarking
- Training and experimentation performed entirely on Apple Silicon (M4 Max) using PyTorch MPS backend