BGE-large Code Search (LoRA fine-tuned)

A fine-tuned code search embedding model based on BAAI/bge-large-en-v1.5 (335M parameters, 1024 dimensions). Trained with call-graph false-negative filtering on 200K balanced pairs across 9 programming languages. Built for cqs β€” code intelligence and RAG for AI agents.

Production Eval (v3.v2 fixture, 2026-05-02)

The headline results below are from cqs's production fixture β€” 218 queries (109 test + 109 dev) curated from real agent telemetry and LLM-generated retrieval cases on the cqs codebase itself. This is the eval that drives default-model decisions.

| Split | Metric | BGE-large (base) | BGE-large + LoRA (this) | Δ vs base (pp) |
|-------|--------|------------------|-------------------------|----------------|
| test  | R@1    | 43.1%            | 45.0%                   | +1.9           |
| test  | R@5    | 69.7%            | 73.4%                   | +3.7           |
| test  | R@20   | 83.5%            | 83.5%                   | 0.0            |
| dev   | R@1    | 45.9%            | 46.8%                   | +0.9           |
| dev   | R@5    | 77.1%            | 70.6%                   | −6.5           |
| dev   | R@20   | 86.2%            | 82.6%                   | −3.6           |

Wins test R@5 by 3.7pp, loses dev R@5 by 6.5pp. This is the canonical fine-tune trade-off: training on a code-pair distribution helps in-distribution retrieval (the test split, where queries pattern-match cqs's own code) but hurts out-of-distribution generalization (the dev split, which deliberately includes harder, more natural-language-shaped queries). For agent-facing search where queries are mostly code-shaped, this is a net win on R@5; for queries that drift into open-ended reasoning, the base model's broader pre-training is the safer hedge.

Decision (cqs default): cqs stays on the base BGE-large for the dev R@5 hedge; this model ships as an opt-in preset via CQS_EMBEDDING_MODEL=bge-large-ft or cqs slot create bge-ft --model bge-large-ft. Pick this preset when (a) latency and model size are already fixed (same architecture as the base, so no extra cost) and (b) your query distribution skews toward concrete code search rather than open-ended exploration.
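
As a concrete opt-in example (commands verbatim from the note above; the re-index step reuses the command from the Usage section, since stored vectors must match the embedding model):

# Per-process opt-in, then re-index:
export CQS_EMBEDDING_MODEL=bge-large-ft
cqs index --force

# Or as a named slot:
cqs slot create bge-ft --model bge-large-ft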

Historical results (296q synthetic fixture)

These are from an earlier synthetic eval (296q across 7 languages, enriched chunks). They show the model's strength on cleanly-curated code-search pairs:

| Eval | Metric | This Model | BGE-large Baseline | v9-200k (110M) |
|------|--------|------------|--------------------|----------------|
| Fixture (296q, 7 languages, enriched)    | R@1     | 91.6% | 90.9% | 90.5% |
| Fixture                                  | MRR     | 0.952 | 0.949 | 0.948 |
| Raw code embedding (55q, no enrichment)  | R@1     | 66.2% | 61.8% | 70.9% |
| Real codebase (100q lookup)              | R@1     | 50.0% | 50.0% | 26.0% |
| Real codebase                            | R@5     | 73.0% | 72.0% | 51.0% |
| CoIR 9-task (19 subtasks)                | Overall | 57.5  | 55.7  | 52.7  |
| CoIR CodeSearchNet (6 languages)         | NDCG@10 | 0.779 | 0.721 | 0.615 |

Note: the synthetic-fixture numbers above were the original justification for "new best on every metric except raw R@1." That holds on cleanly-curated pairs. The v3.v2 production fixture (above) is harder and more diverse, and that's where the in-vs-out-of-distribution trade-off shows up. Both stories are real; the production fixture is the one that drives the default-model decision.

Training Details

  • Base Model: BAAI/bge-large-en-v1.5 (335M params, 1024 dimensions)
  • Data: 200K balanced pairs (22,222 per language Γ— 9 languages) from cqs-indexed Stack repos
  • Key Technique: Call-graph false-negative filtering — excludes structurally related functions from contrastive negatives (see the sketch after this list)
  • Loss: CachedGISTEmbedLoss (guide: intfloat/e5-base-v2, margin 0.05) + MatryoshkaLoss (1024/512/256/128 dims)
  • LoRA: rank 16, alpha 32, dropout 0.1 (targets: query, key, value, dense)
  • Epochs: 1 (5938 steps, batch size 32)
  • Hardware: NVIDIA RTX A6000 (48GB), ~12.75 hours
  • Final loss: train 0.161, eval 0.068
  • Dataset: jamie8johnson/cqs-code-search-200k
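
A minimal sketch of how this recipe wires together in sentence-transformers (v3+) with PEFT — a reconstruction from the bullets above, not the actual training script. The margin keywords on CachedGISTEmbedLoss and the filter_negatives helper are assumptions; the real mining pipeline lives in cqs.

from peft import LoraConfig, TaskType
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import CachedGISTEmbedLoss, MatryoshkaLoss

model = SentenceTransformer("BAAI/bge-large-en-v1.5")
guide = SentenceTransformer("intfloat/e5-base-v2")  # guide model from the bullets

# LoRA adapter per the config above: rank 16, alpha 32, dropout 0.1.
model.add_adapter(LoraConfig(
    task_type=TaskType.FEATURE_EXTRACTION,
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["query", "key", "value", "dense"],
))

# Guide-filtered in-batch contrastive loss, wrapped in MatryoshkaLoss so the
# 512/256/128-dim truncations are trained alongside the full 1024 dims.
# (margin_strategy/margin kwargs: an assumption about how "margin 0.05" maps
# onto the sentence-transformers loss API.)
inner = CachedGISTEmbedLoss(model, guide, margin_strategy="absolute", margin=0.05)
loss = MatryoshkaLoss(model, inner, matryoshka_dims=[1024, 512, 256, 128])

# Call-graph false-negative filtering happens earlier, at mining time:
# negatives structurally related to the anchor are dropped before batches are
# built. Hypothetical helper for illustration only:
def filter_negatives(anchor_fn, candidates, call_graph):
    related = call_graph.get(anchor_fn, set())
    return [c for c in candidates if c not in related]

From here, training is the standard SentenceTransformerTrainer loop over (query, positive) pairs, with the remaining in-batch items serving as negatives.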

Enrichment Ablation

Fine-tuning slightly increases enrichment dependency vs baseline:

| Layer Skipped | R@1   | Δ vs full (this model) | Δ vs full (baseline) |
|---------------|-------|------------------------|----------------------|
| None (full)   | 91.6% | —                      | —                    |
| doc           | 84.1% | −7.5pp                 | −6.8pp               |
| filecontext   | 86.8% | −4.8pp                 | −4.1pp               |
| signatures    | 89.9% | −1.7pp                 | −1.4pp               |
| callgraph     | 90.9% | −0.7pp                 | −0.4pp               |

Supported Languages

Go, Java, JavaScript, PHP, Python, Ruby, Rust, TypeScript, C++

Usage

With cqs

# Download and use as custom model
export CQS_ONNX_DIR=/path/to/this/model/onnx
export CQS_EMBEDDING_DIM=1024
cqs index --force

With sentence-transformers

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jamie8johnson/bge-large-v1.5-code-search")
query_emb = model.encode("Represent this sentence for searching relevant passages: find functions that validate email addresses")
code_emb = model.encode("def validate_email(addr): ...")

Note: BGE-large uses an instruction prefix for queries: "Represent this sentence for searching relevant passages: ". Passages have no prefix.
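Two hedged follow-ups on the snippet above: scoring with sentence-transformers' built-in similarity (cosine for this model family, whose embeddings are normalized), and Matryoshka truncation, which leans on the 1024/512/256/128 dims listed under Training Details:

import numpy as np

# Rank a passage against the query.
score = model.similarity(query_emb, code_emb)

# Truncate + renormalize to 256 dims; usable because of MatryoshkaLoss training.
q256 = query_emb[:256]
q256 = q256 / np.linalg.norm(q256)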

Files

  • merged_model/ β€” full merged weights (sentence-transformers compatible)
  • lora_adapter/ β€” LoRA adapter only (for PEFT)
  • onnx/model.onnx — ONNX format for cqs/ORT inference (1.27GB, opset 11; see the inference sketch below)
  • onnx/tokenizer.json β€” tokenizer for ONNX inference
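
For inference outside cqs, a minimal ONNX Runtime sketch using the two files above. The input names, the token_type_ids feed, and CLS pooling are assumptions based on standard BERT-style exports and BGE's pooling convention — verify against session.get_inputs() on the actual graph:

import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("onnx/tokenizer.json")
session = ort.InferenceSession("onnx/model.onnx")

text = "Represent this sentence for searching relevant passages: parse a config file"
enc = tokenizer.encode(text)
feeds = {
    "input_ids": np.array([enc.ids], dtype=np.int64),
    "attention_mask": np.array([enc.attention_mask], dtype=np.int64),
    "token_type_ids": np.array([enc.type_ids], dtype=np.int64),
}
last_hidden = session.run(None, feeds)[0]  # (1, seq_len, 1024)
emb = last_hidden[0, 0]                    # CLS pooling (BGE convention)
emb = emb / np.linalg.norm(emb)            # normalize for cosine similarity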

License

Apache 2.0 (same as base model)
