---
license: apache-2.0
language:
  - en
  - code
library_name: PyLate
tags:
  - ColBERT
  - PyLate
  - sentence-transformers
  - code-search
  - code-retrieval
  - late-interaction
  - reasoning
base_model: lightonai/GTE-ModernColBERT-v1
datasets:
  - nomic-ai/cornstack-python-v1
  - nomic-ai/cornstack-java-v1
  - nomic-ai/cornstack-javascript-v1
  - nomic-ai/cornstack-php-v1
  - nomic-ai/cornstack-go-v1
  - nomic-ai/cornstack-ruby-v1
pipeline_tag: sentence-similarity
---

# Reason-Code-ModernColBERT

The first ColBERT (late-interaction) model specifically designed for code search and retrieval.

It combines ColBERT's token-granular matching with reasoning-enhanced training queries, extending the ReasonIR methodology to the code domain.

## Why Late-Interaction for Code?

All existing SOTA code search models (CodeXEmbed, Nomic Embed Code, Voyage Code) use bi-encoder / single-vector architectures. ColBERT's late-interaction approach computes token-level similarity (MaxSim), which is particularly well-suited for code because:

- Code has rich token-level structure (identifiers, operators, keywords, types)
- A query like "sort array in reverse order" needs to match specific code tokens (`.sort()`, `reverse=True`)
- MaxSim naturally captures partial matches between natural-language query tokens and code tokens
- On reasoning tasks, Reason-ModernColBERT (150M) outperformed 7B dense models
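To make the scoring concrete, here is a minimal pure-Python sketch of MaxSim (the function name and the toy 2-dimensional vectors are illustrative; the model itself uses 128-dimensional token embeddings):

```python
def maxsim(query_emb, doc_emb):
    # Late-interaction MaxSim: for each query token vector, take the
    # maximum dot product over all document token vectors, then sum
    # those maxima over the query tokens.
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    return sum(max(dot(q, d) for d in doc_emb) for q in query_emb)

# Toy example: each query token has an exact match among the doc tokens.
q = [[1.0, 0.0], [0.0, 1.0]]
d = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
print(maxsim(q, d))  # 2.0
```

Because each query token is matched independently, a partially relevant document (e.g. one containing `reverse=True` but not `.sort()`) still receives partial credit rather than a single all-or-nothing similarity.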

## Model Details

| Property | Value |
|---|---|
| Base model | lightonai/GTE-ModernColBERT-v1 |
| Architecture | ColBERT (late-interaction, multi-vector) |
| Parameters | 150M |
| Embedding dim | 128 per token |
| Document length | 512 tokens |
| Query length | 128 tokens |
| Similarity | MaxSim |
| Languages | Python, Java, JavaScript, PHP, Go, Ruby |
| License | Apache 2.0 |
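One practical consequence of the multi-vector design: a document is stored as up to 512 token vectors of 128 dimensions each, rather than one pooled vector. Assuming 2 bytes per value (fp16/bf16, an assumption not stated above), a quick back-of-the-envelope check:

```python
# Per-document storage for multi-vector vs. single-vector representations,
# using the dimensions from the table above and assuming 2-byte values.
doc_tokens, dim, bytes_per_val = 512, 128, 2

multi_vector_bytes = doc_tokens * dim * bytes_per_val   # 512 x 128 x 2
single_vector_bytes = dim * bytes_per_val               # 128 x 2

print(multi_vector_bytes)                               # 131072 (~128 KiB)
print(multi_vector_bytes // single_vector_bytes)        # 512x larger
```

This is the usual trade-off of late interaction: higher index cost in exchange for token-level matching.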

## Training

### Two-Stage Training Pipeline

#### Stage 1: CoRNStack Base (1 epoch)

- 100,000 high-quality code search pairs from CoRNStack (Apache 2.0)
- 6 languages: Python (25K), Java (20K), JavaScript (15K), PHP (15K), Go (15K), Ruby (10K)
- Loss: 2.42 → 0.63

#### Stage 2: Reasoning-Enhanced Fine-Tuning (3 epochs)

- 9,959 reasoning-intensive code search queries generated from CoRNStack code samples
- Queries require understanding algorithms, edge cases, design patterns, and complexity
- Each query includes a chain-of-thought reasoning prefix (ReasonIR methodology)
- Loss: 2.36 → 0.54

### Training Configuration

```python
# Both stages
model = ColBERT(document_length=512, query_length=128)
loss = CachedContrastive(temperature=1.0, mini_batch_size=32)
batch_size = 256
optim = "adamw_torch"
bf16 = True

# Stage 1: lr=1e-5, 1 epoch, warmup=5%
# Stage 2: lr=5e-6, 3 epochs, warmup=5%
```
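The summary above maps onto standard `sentence-transformers` training arguments roughly as follows (a configuration sketch only: the variable names and `output_dir` values are illustrative, and the PyLate trainer wiring around these arguments is not shown here):

```python
from sentence_transformers import SentenceTransformerTrainingArguments

# Stage 1: CoRNStack base (values taken from the summary above)
stage1_args = SentenceTransformerTrainingArguments(
    output_dir="stage1",              # illustrative path
    num_train_epochs=1,
    per_device_train_batch_size=256,
    learning_rate=1e-5,
    warmup_ratio=0.05,                # warmup=5%
    bf16=True,
    optim="adamw_torch",
)

# Stage 2: reasoning-enhanced fine-tuning
stage2_args = SentenceTransformerTrainingArguments(
    output_dir="stage2",              # illustrative path
    num_train_epochs=3,
    per_device_train_batch_size=256,
    learning_rate=5e-6,
    warmup_ratio=0.05,
    bf16=True,
    optim="adamw_torch",
)
```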

### Hardware

Trained on a single NVIDIA DGX Spark (GB10 Blackwell, 128GB unified memory).

- Stage 1: ~130 min (391 steps)
- Stage 2: ~37 min (117 steps)

## Benchmark Results

### CodeSearchNet MRR (500 queries per language, 500 candidates)

| Language | GTE-ModernColBERT (base) | Reason-Code-ModernColBERT (ours) | Δ |
|---|---|---|---|
| Python | 0.991 | 0.989 | -0.002 |
| Java | 0.829 | 0.866 | +0.037 |
| JavaScript | 0.802 | 0.839 | +0.037 |
| PHP | 0.841 | 0.862 | +0.021 |
| Go | 0.879 | 0.887 | +0.008 |
| Ruby | 0.773 | 0.831 | +0.058 |
| **Average** | 0.853 | 0.879 | **+0.026** |

The model improves on the base in 5 of 6 languages, with the largest gains in Ruby (+5.8pp), Java (+3.7pp), and JavaScript (+3.7pp), the languages that benefited most from the reasoning-enhanced training data. Python is already near ceiling at 0.99, leaving little headroom.
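For reference, MRR as reported above can be computed with a short pure-Python helper (function and variable names are illustrative):

```python
def mean_reciprocal_rank(rankings, relevant):
    """MRR: average over queries of 1/rank of the first relevant document.

    rankings: dict mapping query id -> candidate doc ids, best first.
    relevant: dict mapping query id -> the id of its gold document.
    """
    total = 0.0
    for qid, candidates in rankings.items():
        for rank, doc_id in enumerate(candidates, start=1):
            if doc_id == relevant[qid]:
                total += 1.0 / rank
                break
    return total / len(rankings)

# Two toy queries: gold doc at rank 1 and rank 2 -> (1 + 0.5) / 2 = 0.75
rankings = {"q1": ["d1", "d2"], "q2": ["d3", "d1"]}
relevant = {"q1": "d1", "q2": "d1"}
print(mean_reciprocal_rank(rankings, relevant))  # 0.75
```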

## Usage

```python
from pylate import models

model = models.ColBERT(model_name_or_path="ctrltokyo/Reason-Code-ModernColBERT")

queries = ["function that sorts an array in descending order using a comparison-based algorithm"]
code_docs = ["def sort_desc(arr):\n    return sorted(arr, reverse=True)"]

query_embeddings = model.encode(queries, is_query=True)
doc_embeddings = model.encode(code_docs, is_query=False)
```

## Citation

This model extends the methodology from:

@article{shao2025reasonir,
  title={ReasonIR: Training Retrievers for Reasoning Tasks},
  author={Shao, Rulin and others},
  journal={arXiv preprint arXiv:2504.20595},
  year={2025}
}

@misc{Reason-ModernColBERT,
  title={Reason-ModernColBERT},
  author={LightOn AI},
  year={2025},
  url={https://huggingface.co/lightonai/Reason-ModernColBERT}
}

@inproceedings{cornstack2025,
  title={CoRNStack: High-Quality Contrastive Data for Better Code Retrieval and Reranking},
  author={Suresh, Tarun and Gangi Reddy, Revanth and others},
  booktitle={ICLR},
  year={2025}
}

Built with PyLate and Sentence Transformers.