bge-small-code-search-v1

A BGE-small-en-v1.5 model fine-tuned on CodeSearchNet (Python) for semantic code search.

🔍 What It Does

Maps natural language queries and code snippets into the same 384-dimensional vector space. Search your codebase by describing what a function does.

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("Matthieufromparis/bge-small-code-search-v1")

query = "parse JSON config file and return a dictionary"
code_snippets = [...]  # placeholder: replace with a list of source-code strings from your codebase
query_emb = model.encode(query)
code_embs = model.encode(code_snippets)
similarities = model.similarity(query_emb, code_embs)
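
To turn the scores above into a ranked result list, something like the following sketch may help. It assumes the scores have already been flattened to a plain list of floats (for example via `similarities[0].tolist()`, assuming `model.similarity` returns a 1 x N tensor for a single query). The helper name `top_k_snippets` and the example scores are hypothetical, not output from this model.

```python
from typing import List, Tuple


def top_k_snippets(
    scores: List[float], snippets: List[str], k: int = 3
) -> List[Tuple[float, str]]:
    """Return the k highest-scoring (score, snippet) pairs, best first."""
    if len(scores) != len(snippets):
        raise ValueError("scores and snippets must have the same length")
    # Sort by score only, so ties never fall back to comparing snippet text.
    ranked = sorted(zip(scores, snippets), key=lambda pair: pair[0], reverse=True)
    return ranked[:k]


# Hypothetical scores standing in for similarities[0].tolist()
example_scores = [0.31, 0.82, 0.47]
example_snippets = [
    "def add(a, b): return a + b",
    "def load_config(path): return json.load(open(path))",
    "def read_lines(path): return open(path).read().splitlines()",
]

for score, snippet in top_k_snippets(example_scores, example_snippets, k=2):
    print(f"{score:.2f}  {snippet}")
```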

📊 Performance

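| Metric     | Base (BGE-small) | Fine-Tuned | Improvement |
|------------|------------------|------------|-------------|
| NDCG@10    | 0.9761           | 0.9849     | +0.9%       |
| Accuracy@1 | 0.948            | 0.960      | +1.3%       |
| MRR@10     | 0.975            | 0.978      | +0.3%       |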

Evaluated on 500 held-out Python code-comment pairs from CodeSearchNet.
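
For readers unfamiliar with these metrics, here is a minimal sketch of how they can be computed. It assumes each query has exactly one relevant snippet (its paired code), which fits a code-comment pair setup but is an assumption about how the evaluation was run. The input is the 1-based rank at which the correct snippet was retrieved for each query, or `None` if it was not retrieved at all; the function names are illustrative.

```python
import math
from typing import List, Optional

Ranks = List[Optional[int]]  # 1-based rank of the correct snippet, None if missing


def accuracy_at_1(ranks: Ranks) -> float:
    """Fraction of queries whose correct snippet is ranked first."""
    return sum(1 for r in ranks if r == 1) / len(ranks)


def mrr_at_k(ranks: Ranks, k: int = 10) -> float:
    """Mean reciprocal rank, counting only hits within the top k."""
    return sum(1.0 / r for r in ranks if r is not None and r <= k) / len(ranks)


def ndcg_at_k(ranks: Ranks, k: int = 10) -> float:
    """NDCG@k for the single-relevant-document case.

    With one relevant item the ideal DCG is 1, so each query contributes
    1 / log2(rank + 1) if the item is within the top k, else 0.
    """
    return sum(
        1.0 / math.log2(r + 1) for r in ranks if r is not None and r <= k
    ) / len(ranks)


# Hypothetical ranks for four queries (not results from this model)
example_ranks: Ranks = [1, 1, 2, 4]
print(accuracy_at_1(example_ranks), mrr_at_k(example_ranks), ndcg_at_k(example_ranks))
```

Because all three metrics only reward the position of one correct snippet, they tend to move together, with Accuracy@1 being the strictest of the three.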

🏗️ Training

  • Base Model: BAAI/bge-small-en-v1.5 (33M params, 384 dims)
  • Dataset: CodeSearchNet — Python subset, 6,000 pairs
  • Loss: MultipleNegativesRankingLoss
  • Epochs: 3 | Batch Size: 4 | LR: 2e-5
  • Hardware: Apple M4 (MPS), ~33 min
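
The idea behind MultipleNegativesRankingLoss can be sketched in plain Python: for each query in a batch, its paired code snippet is the positive and the other snippets in the batch act as negatives, so a batch size of 4 gives each query 3 in-batch negatives. The sketch below is an illustration of that idea, not the library's implementation; the function names are hypothetical, and the scale of 20 is assumed to mirror the sentence-transformers default.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_ranking_loss(
    queries: List[Sequence[float]],
    codes: List[Sequence[float]],
    scale: float = 20.0,  # assumed default; treat as a tunable hyperparameter
) -> float:
    """Mean cross-entropy where codes[i] is the positive for queries[i]."""
    losses = []
    for i, query in enumerate(queries):
        # Scaled similarity of this query to every snippet in the batch.
        logits = [scale * cosine(query, code) for code in codes]
        # Numerically stable log-sum-exp over the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        # Negative log-probability assigned to the true pair.
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)
```

The loss is near zero when every query is most similar to its own snippet and grows when a different snippet in the batch scores higher, which is what pushes matching query-code pairs together in the embedding space.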

🚀 Quick Start

Install the library:

pip install sentence-transformers

Then embed a query and a code snippet and compare them:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Matthieufromparis/bge-small-code-search-v1")

# Embed a natural language query and a candidate code snippet
query_embedding = model.encode("function that sorts a list of dictionaries by a key")
code_embedding = model.encode("def sort_dicts_by_key(dicts, key): return sorted(dicts, key=lambda x: x.get(key, ''))")

# Higher similarity means the snippet is a closer match for the query
similarity = model.similarity(query_embedding, code_embedding)
print(f"Similarity: {similarity.item():.4f}")

📦 Use Cases

  • Semantic Code Search — Find functions by describing what they do
  • Code Documentation Lookup — Match docs to relevant code
  • Code Deduplication — Find similar implementations across repos
  • RAG for Coding Assistants — Retrieve relevant code for LLM context
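
As one illustration of the deduplication use case, the sketch below flags pairs of snippets whose embeddings are nearly identical. It works on plain lists of floats, so embeddings from `model.encode(...)` would need converting first (for example with `.tolist()`). The function name and the 0.9 threshold are hypothetical; since the model was tuned for natural language to code matching, code-to-code thresholds would need checking on your own data.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def near_duplicate_pairs(
    embeddings: List[Sequence[float]],
    threshold: float = 0.9,  # arbitrary starting point, not a recommended value
) -> List[Tuple[int, int, float]]:
    """Return (i, j, score) for every pair with cosine similarity >= threshold."""
    pairs = []
    # Compare each unordered pair once; this is O(n^2), fine for small sets.
    for i, j in combinations(range(len(embeddings)), 2):
        score = cosine(embeddings[i], embeddings[j])
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs


# Toy 2-dimensional vectors standing in for real 384-dimensional embeddings
toy_embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
for i, j, score in near_duplicate_pairs(toy_embeddings):
    print(f"snippets {i} and {j} look similar (cosine {score:.3f})")
```

For larger codebases the pairwise loop becomes expensive, and an approximate nearest-neighbor index would be a more practical choice.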

🎯 Intended Use

Designed for asymmetric search — queries are natural language, documents are code.

⚠️ Limitations

  • Trained on Python only — may not generalize to other languages
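  • 384 dimensions: trades some retrieval quality for speed compared with larger embedding models
  • Training data comes from CodeSearchNet (2019 vintage), so newer libraries and coding idioms may be underrepresented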

Author: Matthieu.AI (Matthieufromparis) — License: Apache 2.0
