BGE-large Code Search (LoRA fine-tuned)
A fine-tuned code search embedding model based on BAAI/bge-large-en-v1.5 (335M parameters, 1024 dimensions). Trained with call-graph false-negative filtering on 200K balanced pairs across 9 programming languages. Built for cqs (code intelligence and RAG for AI agents).
Production Eval (v3.v2 fixture, 2026-05-02)
The headline results below are from cqs's production fixture: 218 queries (109 test + 109 dev) curated from real agent telemetry and LLM-generated retrieval cases on the cqs codebase itself. This is the eval that drives default-model decisions.
| split | metric | BGE-large (base) | BGE-large + LoRA (this) | Δ vs base |
|---|---|---|---|---|
| test | R@1 | 43.1% | 45.0% | +1.9 |
| test | R@5 | 69.7% | 73.4% | +3.7 |
| test | R@20 | 83.5% | 83.5% | 0.0 |
| dev | R@1 | 45.9% | 46.8% | +0.9 |
| dev | R@5 | 77.1% | 70.6% | -6.5 |
| dev | R@20 | 86.2% | 82.6% | -3.6 |
Wins test R@5 by 3.7pp, loses dev R@5 by 6.5pp. This is the canonical fine-tune trade-off: training on a code-pair distribution helps in-distribution retrieval (the test split, where queries pattern-match cqs's own code) but hurts out-of-distribution generalization (the dev split, which deliberately includes harder, more natural-language-shaped queries). For agent-facing search where queries are mostly code-shaped, this is a net win on R@5; for queries that drift into open-ended reasoning, BGE-base's broader pre-training is the safer hedge.
Decision (cqs default): the default stays on BGE-base as the dev R@5 hedge; this model ships as opt-in via CQS_EMBEDDING_MODEL=bge-large-ft or cqs slot create bge-ft --model bge-large-ft. Pick this preset when (a) latency and model size are a wash (same architecture as the base, so no extra cost) and (b) your query distribution is skewed toward concrete code search rather than open-ended exploration.
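Opting in is a one-line change; these are the same commands quoted above (assuming a cqs install that ships this preset):

```bash
# point cqs at the fine-tuned preset for this environment
export CQS_EMBEDDING_MODEL=bge-large-ft
# or pin it to a dedicated slot
cqs slot create bge-ft --model bge-large-ft
```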
Historical results (296q synthetic fixture)
These are from an earlier synthetic eval (296q across 7 languages, enriched chunks). They show the model's strength on cleanly curated code-search pairs:
| Eval | Metric | This Model | BGE-large Baseline | v9-200k (110M) |
|---|---|---|---|---|
| Fixture (296q, 7 languages, enriched) | R@1 | 91.6% | 90.9% | 90.5% |
| Fixture | MRR | 0.952 | 0.949 | 0.948 |
| Raw code embedding (55q, no enrichment) | R@1 | 66.2% | 61.8% | 70.9% |
| Real codebase (100q lookup) | R@1 | 50.0% | 50.0% | 26.0% |
| Real codebase | R@5 | 73.0% | 72.0% | 51.0% |
| CoIR 9-task (19 subtasks) | Overall | 57.5 | 55.7 | 52.7 |
| CoIR CodeSearchNet (6 languages) | NDCG@10 | 0.779 | 0.721 | 0.615 |
Note: the synthetic-fixture numbers above were the original justification for "new best on every metric except raw R@1." That holds on cleanly-curated pairs. The v3.v2 production fixture (above) is harder and more diverse, and that's where the in-vs-out-of-distribution trade-off shows up. Both stories are real; the production fixture is the one that drives the default-model decision.
Training Details
- Base Model: BAAI/bge-large-en-v1.5 (335M params, 1024 dimensions)
- Data: 200K balanced pairs (22,222 per language × 9 languages) from cqs-indexed Stack repos
- Key Technique: Call-graph false-negative filtering, which excludes structurally related functions from the contrastive negatives (see the sketches after this list)
- Loss: CachedGISTEmbedLoss (guide: intfloat/e5-base-v2, margin 0.05) + MatryoshkaLoss (1024/512/256/128 dims)
- LoRA: rank 16, alpha 32, dropout 0.1 (targets: query, key, value, dense)
- Epochs: 1 (5938 steps, batch size 32)
- Hardware: NVIDIA RTX A6000 (48GB), ~12.75 hours
- Final loss: train 0.161, eval 0.068
- Dataset: jamie8johnson/cqs-code-search-200k
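The two most load-bearing pieces above are the negative filtering and the loss stack. First, a minimal illustrative sketch of call-graph false-negative filtering; the function and identifiers are hypothetical, not cqs internals, but the idea matches the bullet above: a candidate never becomes a contrastive negative if the call graph links it to the positive.

```python
# Illustrative sketch (not the cqs pipeline): drop negative candidates that are
# callers or callees of the positive function, so structurally related code is
# never pushed apart by the contrastive loss.
def filter_negatives(positive_id, candidate_ids, call_graph):
    """call_graph maps a function id to the ids it calls or is called by."""
    related = call_graph.get(positive_id, set())
    return [c for c in candidate_ids if c != positive_id and c not in related]

call_graph = {"pkg.save_user": {"pkg.validate_user", "pkg.db_write"}}
negatives = filter_negatives(
    positive_id="pkg.save_user",
    candidate_ids=["pkg.validate_user", "math.clamp", "http.get"],
    call_graph=call_graph,
)
print(negatives)  # ['math.clamp', 'http.get'] -- the call-graph-related function is filtered out
```

Second, a sketch of the loss stack and adapter config named above, using the standard sentence-transformers and peft APIs. The card's margin=0.05 setting is omitted here because the corresponding argument name varies across sentence-transformers versions; treat this as a starting point, not the exact training script.

```python
from peft import LoraConfig
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import CachedGISTEmbedLoss, MatryoshkaLoss

model = SentenceTransformer("BAAI/bge-large-en-v1.5")
guide = SentenceTransformer("intfloat/e5-base-v2")  # guide model that vetoes suspect in-batch negatives

# GIST-style cached contrastive loss, wrapped in Matryoshka truncation losses
inner_loss = CachedGISTEmbedLoss(model=model, guide=guide, mini_batch_size=32)
loss = MatryoshkaLoss(model=model, loss=inner_loss,
                      matryoshka_dims=[1024, 512, 256, 128])

# LoRA adapter config matching the bullets above; recent sentence-transformers
# versions can typically attach it via model.add_adapter(lora_config)
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.1,
                         target_modules=["query", "key", "value", "dense"])
```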
Enrichment Ablation
Fine-tuning slightly increases enrichment dependency vs baseline:
| Layer skipped | R@1 | Δ vs full | Baseline Δ vs full |
|---|---|---|---|
| None (full) | 91.6% | – | – |
| doc | 84.1% | -7.5pp | -6.8pp |
| filecontext | 86.8% | -4.8pp | -4.1pp |
| signatures | 89.9% | -1.7pp | -1.4pp |
| callgraph | 90.9% | -0.7pp | -0.4pp |
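To make the table concrete, here is a hypothetical sketch of an ablation pass. The layer names mirror the table; the chunk format and the example chunk are illustrative stand-ins, not cqs's actual enrichment serialization.

```python
# Hypothetical ablation sketch: rebuild each chunk's text with one enrichment
# layer removed, then re-embed the corpus and re-score R@1 per variant.
LAYERS = ["doc", "filecontext", "signatures", "callgraph"]

def chunk_text(chunk: dict, skip: str | None = None) -> str:
    parts = [chunk["code"]]
    parts += [chunk[layer] for layer in LAYERS if layer != skip and layer in chunk]
    return "\n".join(parts)

chunk = {
    "code": "def validate_email(addr): ...",
    "doc": "Validate an email address string.",
    "filecontext": "users/validation.py",
    "signatures": "validate_email(addr: str) -> bool",
    "callgraph": "called by: register_user",
}

for skip in [None] + LAYERS:
    text = chunk_text(chunk, skip=skip)
    # embed `text` for every corpus chunk here, rerun retrieval, record R@1
    print(f"skip={skip!r}: {len(text)} chars")
```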
Supported Languages
Go, Java, JavaScript, PHP, Python, Ruby, Rust, TypeScript, C++
Usage
With cqs
```bash
# Download and use as custom model
export CQS_ONNX_DIR=/path/to/this/model/onnx
export CQS_EMBEDDING_DIM=1024
cqs index --force
```
With sentence-transformers
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jamie8johnson/bge-large-v1.5-code-search")
query_emb = model.encode("Represent this sentence for searching relevant passages: find functions that validate email addresses")
code_emb = model.encode("def validate_email(addr): ...")
```
Note: BGE-large uses an instruction prefix for queries: "Represent this sentence for searching relevant passages: ". Passages have no prefix.
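A small follow-up showing how the embeddings above can be scored and ranked; util.cos_sim is the standard sentence-transformers helper, and the candidate snippets are just examples.

```python
from sentence_transformers import util

candidates = [
    "def validate_email(addr): ...",
    "def parse_url(url): ...",
]
cand_embs = model.encode(candidates)            # passages: no instruction prefix
scores = util.cos_sim(query_emb, cand_embs)[0]  # cosine similarity per candidate
best = int(scores.argmax())
print(candidates[best], float(scores[best]))
```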
Files
- merged_model/ – full merged weights (sentence-transformers compatible)
- lora_adapter/ – LoRA adapter only (for PEFT)
- onnx/model.onnx – ONNX format for cqs/ORT inference (1.27GB, opset 11)
- onnx/tokenizer.json – tokenizer for ONNX inference
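For direct ONNX inference outside cqs, a minimal sketch with onnxruntime and tokenizers is below. The input/output names are assumed to follow a standard BERT-style export (input_ids / attention_mask / token_type_ids -> last_hidden_state); check session.get_inputs() and get_outputs() against this file, and remember the query prefix from the usage note above.

```python
import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

session = ort.InferenceSession("onnx/model.onnx")
tokenizer = Tokenizer.from_file("onnx/tokenizer.json")
tokenizer.enable_truncation(max_length=512)

def embed(text: str) -> np.ndarray:
    enc = tokenizer.encode(text)
    arrays = {
        "input_ids": np.array([enc.ids], dtype=np.int64),
        "attention_mask": np.array([enc.attention_mask], dtype=np.int64),
        "token_type_ids": np.array([enc.type_ids], dtype=np.int64),
    }
    # feed only the inputs this particular export actually declares
    feeds = {i.name: arrays[i.name] for i in session.get_inputs() if i.name in arrays}
    last_hidden = session.run(None, feeds)[0]
    cls = last_hidden[:, 0]  # BGE models use CLS pooling
    return cls / np.linalg.norm(cls, axis=1, keepdims=True)

query = embed("Represent this sentence for searching relevant passages: parse a JSON config file")
passage = embed("def load_config(path): ...")
print(float(query @ passage.T))  # cosine similarity (vectors are L2-normalized)
```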
License
Apache 2.0 (same as base model)