BGE-large Code Search (LoRA fine-tuned)

A fine-tuned code search embedding model based on BAAI/bge-large-en-v1.5 (335M parameters, 1024 dimensions). Trained with call-graph false-negative filtering on 200K balanced pairs across 9 programming languages. Built for cqs β€” code intelligence and RAG for AI agents.

Production Eval (v3.v2 fixture, 2026-05-02)

The headline results below are from cqs's production fixture β€” 218 queries (109 test + 109 dev) curated from real agent telemetry and LLM-generated retrieval cases on the cqs codebase itself. This is the eval that drives default-model decisions.

| Split | Metric | BGE-large (base) | BGE-large + LoRA (this) | Δ vs base (pp) |
|-------|--------|------------------|-------------------------|----------------|
| test  | R@1    | 43.1%            | 45.0%                   | +1.9           |
| test  | R@5    | 69.7%            | 73.4%                   | +3.7           |
| test  | R@20   | 83.5%            | 83.5%                   | 0.0            |
| dev   | R@1    | 45.9%            | 46.8%                   | +0.9           |
| dev   | R@5    | 77.1%            | 70.6%                   | −6.5           |
| dev   | R@20   | 86.2%            | 82.6%                   | −3.6           |

Wins test R@5 by 3.7pp, loses dev R@5 by 6.5pp. This is the canonical fine-tune trade-off: training on a code-pair distribution helps in-distribution retrieval (the test split, where queries pattern-match cqs's own code) but hurts out-of-distribution generalization (the dev split, which deliberately includes harder, more natural-language-shaped queries). For agent-facing search where queries are mostly code-shaped, this is a net win on R@5; for queries that drift into open-ended reasoning, the base model's broader pre-training is the safer hedge.

Decision (cqs default): cqs stays on the base BGE-large for the dev R@5 hedge; this model ships as an opt-in preset via CQS_EMBEDDING_MODEL=bge-large-ft or cqs slot create bge-ft --model bge-large-ft. Pick this preset when (a) latency and model size are already fixed (same architecture as the base, so no extra cost) and (b) your query distribution skews toward concrete code search rather than open-ended exploration.
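
As a concrete opt-in example (commands verbatim from the note above; the re-index step reuses the command from the Usage section, since stored vectors must match the embedding model):

# Per-process opt-in, then re-index:
export CQS_EMBEDDING_MODEL=bge-large-ft
cqs index --force

# Or as a named slot:
cqs slot create bge-ft --model bge-large-ft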

Historical results (296q synthetic fixture)

These are from an earlier synthetic eval (296q across 7 languages, enriched chunks). They show the model's strength on cleanly-curated code-search pairs:

| Eval | Metric | This Model | BGE-large Baseline | v9-200k (110M) |
|------|--------|------------|--------------------|----------------|
| Fixture (296q, 7 languages, enriched)    | R@1     | 91.6% | 90.9% | 90.5% |
| Fixture                                  | MRR     | 0.952 | 0.949 | 0.948 |
| Raw code embedding (55q, no enrichment)  | R@1     | 66.2% | 61.8% | 70.9% |
| Real codebase (100q lookup)              | R@1     | 50.0% | 50.0% | 26.0% |
| Real codebase                            | R@5     | 73.0% | 72.0% | 51.0% |
| CoIR 9-task (19 subtasks)                | Overall | 57.5  | 55.7  | 52.7  |
| CoIR CodeSearchNet (6 languages)         | NDCG@10 | 0.779 | 0.721 | 0.615 |

Note: the synthetic-fixture numbers above were the original justification for "new best on every metric except raw R@1." That holds on cleanly-curated pairs. The v3.v2 production fixture (above) is harder and more diverse, and that's where the in-vs-out-of-distribution trade-off shows up. Both stories are real; the production fixture is the one that drives the default-model decision.

Training Details

  • Base Model: BAAI/bge-large-en-v1.5 (335M params, 1024 dimensions)
  • Data: 200K balanced pairs (22,222 per language Γ— 9 languages) from cqs-indexed Stack repos
  • Key Technique: Call-graph false-negative filtering — excludes structurally related functions from contrastive negatives (see the sketch after this list)
  • Loss: CachedGISTEmbedLoss (guide: intfloat/e5-base-v2, margin 0.05) + MatryoshkaLoss (1024/512/256/128 dims)
  • LoRA: rank 16, alpha 32, dropout 0.1 (targets: query, key, value, dense)
  • Epochs: 1 (5938 steps, batch size 32)
  • Hardware: NVIDIA RTX A6000 (48GB), ~12.75 hours
  • Final loss: train 0.161, eval 0.068
  • Dataset: jamie8johnson/cqs-code-search-200k
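
A minimal sketch of how this recipe wires together in sentence-transformers (v3+) with PEFT — a reconstruction from the bullets above, not the actual training script. The margin keywords on CachedGISTEmbedLoss and the filter_negatives helper are assumptions; the real mining pipeline lives in cqs.

from peft import LoraConfig, TaskType
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import CachedGISTEmbedLoss, MatryoshkaLoss

model = SentenceTransformer("BAAI/bge-large-en-v1.5")
guide = SentenceTransformer("intfloat/e5-base-v2")  # guide model from the bullets

# LoRA adapter per the config above: rank 16, alpha 32, dropout 0.1.
model.add_adapter(LoraConfig(
    task_type=TaskType.FEATURE_EXTRACTION,
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["query", "key", "value", "dense"],
))

# Guide-filtered in-batch contrastive loss, wrapped in MatryoshkaLoss so the
# 512/256/128-dim truncations are trained alongside the full 1024 dims.
# (margin_strategy/margin kwargs: an assumption about how "margin 0.05" maps
# onto the sentence-transformers loss API.)
inner = CachedGISTEmbedLoss(model, guide, margin_strategy="absolute", margin=0.05)
loss = MatryoshkaLoss(model, inner, matryoshka_dims=[1024, 512, 256, 128])

# Call-graph false-negative filtering happens earlier, at mining time:
# negatives structurally related to the anchor are dropped before batches are
# built. Hypothetical helper for illustration only:
def filter_negatives(anchor_fn, candidates, call_graph):
    related = call_graph.get(anchor_fn, set())
    return [c for c in candidates if c not in related]

From here, training is the standard SentenceTransformerTrainer loop over (query, positive) pairs, with the remaining in-batch items serving as negatives.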

Enrichment Ablation

Fine-tuning slightly increases enrichment dependency vs baseline:

| Layer Skipped | R@1   | Δ vs full (this model) | Δ vs full (baseline) |
|---------------|-------|------------------------|----------------------|
| None (full)   | 91.6% | —                      | —                    |
| doc           | 84.1% | −7.5pp                 | −6.8pp               |
| filecontext   | 86.8% | −4.8pp                 | −4.1pp               |
| signatures    | 89.9% | −1.7pp                 | −1.4pp               |
| callgraph     | 90.9% | −0.7pp                 | −0.4pp               |

Supported Languages

Go, Java, JavaScript, PHP, Python, Ruby, Rust, TypeScript, C++

Usage

With cqs

# Download and use as custom model
export CQS_ONNX_DIR=/path/to/this/model/onnx
export CQS_EMBEDDING_DIM=1024
cqs index --force

With sentence-transformers

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jamie8johnson/bge-large-v1.5-code-search")
query_emb = model.encode("Represent this sentence for searching relevant passages: find functions that validate email addresses")
code_emb = model.encode("def validate_email(addr): ...")

Note: BGE-large uses an instruction prefix for queries: "Represent this sentence for searching relevant passages: ". Passages have no prefix.
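Two hedged follow-ups on the snippet above: scoring with sentence-transformers' built-in similarity (cosine for this model family, whose embeddings are normalized), and Matryoshka truncation, which leans on the 1024/512/256/128 dims listed under Training Details:

import numpy as np

# Rank a passage against the query.
score = model.similarity(query_emb, code_emb)

# Truncate + renormalize to 256 dims; usable because of MatryoshkaLoss training.
q256 = query_emb[:256]
q256 = q256 / np.linalg.norm(q256)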

Files

  • merged_model/ β€” full merged weights (sentence-transformers compatible)
  • lora_adapter/ β€” LoRA adapter only (for PEFT)
  • onnx/model.onnx — ONNX format for cqs/ORT inference (1.27GB, opset 11; see the inference sketch below)
  • onnx/tokenizer.json β€” tokenizer for ONNX inference
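
For inference outside cqs, a minimal ONNX Runtime sketch using the two files above. The input names, the token_type_ids feed, and CLS pooling are assumptions based on standard BERT-style exports and BGE's pooling convention — verify against session.get_inputs() on the actual graph:

import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("onnx/tokenizer.json")
session = ort.InferenceSession("onnx/model.onnx")

text = "Represent this sentence for searching relevant passages: parse a config file"
enc = tokenizer.encode(text)
feeds = {
    "input_ids": np.array([enc.ids], dtype=np.int64),
    "attention_mask": np.array([enc.attention_mask], dtype=np.int64),
    "token_type_ids": np.array([enc.type_ids], dtype=np.int64),
}
last_hidden = session.run(None, feeds)[0]  # (1, seq_len, 1024)
emb = last_hidden[0, 0]                    # CLS pooling (BGE convention)
emb = emb / np.linalg.norm(emb)            # normalize for cosine similarity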

License

Apache 2.0 (same as base model)
