jamie8johnson/bge-large-v1.5-code-search

@@ -20,7 +20,26 @@ datasets:
 A fine-tuned code search embedding model based on **BAAI/bge-large-en-v1.5** (335M parameters, 1024 dimensions). Trained with call-graph false-negative filtering on 200K balanced pairs across 9 programming languages. Built for [cqs](https://github.com/jamie8johnson/cqs) — code intelligence and RAG for AI agents.
-## Key Results
 | Eval | Metric | This Model | BGE-large Baseline | v9-200k (110M) |
 |------|--------|-----------|-------------------|----------------|
@@ -32,7 +51,7 @@ A fine-tuned code search embedding model based on **BAAI/bge-large-en-v1.5** (33
 | CoIR 9-task (19 subtasks) | Overall | **57.5** | 55.7 | 52.7 |
 | CoIR CodeSearchNet (6 languages) | NDCG@10 | **0.779** | 0.721 | 0.615 |
-**New best on every metric except raw R@1** (where v9-200k's 70.9% still leads). Fine-tuning is additive with model capacity — BGE-large did not basin like E5-base variants.
 ## Training Details

 A fine-tuned code search embedding model based on **BAAI/bge-large-en-v1.5** (335M parameters, 1024 dimensions). Trained with call-graph false-negative filtering on 200K balanced pairs across 9 programming languages. Built for [cqs](https://github.com/jamie8johnson/cqs) — code intelligence and RAG for AI agents.
+## Production Eval (v3.v2 fixture, 2026-05-02)
+The headline results below are from cqs's production fixture — 218 queries (109 test + 109 dev) curated from real agent telemetry and LLM-generated retrieval cases on the cqs codebase itself. This is the eval that drives default-model decisions.
+| split | metric | BGE-large (base) | **BGE-large + LoRA (this)** | Δ vs base (pp) |
+|-------|--------|-----------------:|----------------------------:|----------:|
+| test  | R@1    | 43.1%            | **45.0%**                   | +1.9 |
+| test  | R@5    | 69.7%            | **73.4%**                   | **+3.7** |
+| test  | R@20   | 83.5%            | 83.5%                       |  0.0 |
+| dev   | R@1    | 45.9%            | **46.8%**                   | +0.9 |
+| dev   | R@5    | **77.1%**        | 70.6%                       | **−6.5** |
+| dev   | R@20   | **86.2%**        | 82.6%                       | −3.6 |
+**Wins test R@5 by 3.7pp, loses dev R@5 by 6.5pp.** This is the canonical fine-tune trade-off: training on a code-pair distribution helps in-distribution retrieval (the test split, where queries pattern-match cqs's own code) but hurts out-of-distribution generalization (the dev split, which deliberately includes harder, more natural-language-shaped queries). For agent-facing search where queries are mostly code-shaped, this is a net win on R@5; for queries that drift into open-ended reasoning, the broader pre-training of the base BGE-large model is the safer hedge.
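To put the percentage-point gaps in terms of individual queries, the sketch below recomputes the Δ column from the table above and converts each recall figure to the nearest whole hit count. It assumes each split has exactly 109 queries, as stated in the fixture description; the hit counts are inferred from the rounded percentages rather than taken from the eval logs, and the names `RESULTS`, `hits`, and `summarize` are illustrative only.

```python
QUERIES_PER_SPLIT = 109  # from the fixture description: 109 test + 109 dev

# (split, k) -> (base %, fine-tuned %), copied from the table above
RESULTS = {
    ("test", 1): (43.1, 45.0),
    ("test", 5): (69.7, 73.4),
    ("test", 20): (83.5, 83.5),
    ("dev", 1): (45.9, 46.8),
    ("dev", 5): (77.1, 70.6),
    ("dev", 20): (86.2, 82.6),
}


def hits(pct, n=QUERIES_PER_SPLIT):
    """Nearest whole number of queries that gives `pct` percent of `n`."""
    return round(pct * n / 100)


def summarize(results=RESULTS, n=QUERIES_PER_SPLIT):
    """Map (split, k) -> (delta in percentage points, delta in queries)."""
    rows = {}
    for key, (base, tuned) in results.items():
        delta_pp = round(tuned - base, 1)
        delta_queries = hits(tuned, n) - hits(base, n)
        rows[key] = (delta_pp, delta_queries)
    return rows


if __name__ == "__main__":
    for (split, k), (pp, q) in summarize().items():
        print(f"{split} R@{k}: {pp:+.1f} pp ({q:+d} queries)")
```

Under that assumption, one query is worth roughly 0.9pp on a 109-query split, so the test R@5 gain corresponds to about 4 queries and the dev R@5 loss to about 7. Gaps of this size are small in absolute terms, which is worth keeping in mind when reading the table.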
+**Decision (cqs default):** the default stays at base BGE-large for the dev R@5 hedge; this model ships as opt-in via `CQS_EMBEDDING_MODEL=bge-large-ft` or `cqs slot create bge-ft --model bge-large-ft`. Pick this preset when (a) your latency and model-size budget is fixed (it has the same architecture as the base, so there is no extra cost) and (b) your query distribution is skewed toward concrete code search rather than open-ended exploration.
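For scripts that launch cqs as a subprocess, the opt-in can be expressed as an environment override. This is a minimal sketch: only the variable name `CQS_EMBEDDING_MODEL` and the value `bge-large-ft` come from the text above. The helper `cqs_env` is hypothetical, and the sketch assumes that leaving the variable unset falls back to the default model.

```python
import os


def cqs_env(prefer_finetuned, base_env=None):
    """Return a copy of the environment with the embedding-model override applied.

    Assumption: when CQS_EMBEDDING_MODEL is absent, cqs uses its default model.
    """
    env = dict(os.environ if base_env is None else base_env)
    if prefer_finetuned:
        # Opt in to the fine-tuned preset named in the model card.
        env["CQS_EMBEDDING_MODEL"] = "bge-large-ft"
    else:
        # Drop any inherited override so the default applies.
        env.pop("CQS_EMBEDDING_MODEL", None)
    return env
```

The returned mapping can be passed as the `env` argument of `subprocess.run` when invoking cqs, which keeps the override scoped to one child process instead of the whole shell session.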
+## Historical results (296q synthetic fixture)
+These are from an earlier synthetic eval (296 queries across 7 languages, enriched chunks). They show the model's strength on cleanly curated code-search pairs:
 | Eval | Metric | This Model | BGE-large Baseline | v9-200k (110M) |
 |------|--------|-----------|-------------------|----------------|
 | CoIR 9-task (19 subtasks) | Overall | **57.5** | 55.7 | 52.7 |
 | CoIR CodeSearchNet (6 languages) | NDCG@10 | **0.779** | 0.721 | 0.615 |
+**Note:** the synthetic-fixture numbers above were the original justification for the claim "new best on every metric except raw R@1." That claim holds on cleanly curated pairs. The v3.v2 production fixture (above) is harder and more diverse, and that is where the in- vs out-of-distribution trade-off shows up. Both results are real; the production fixture is the one that drives the default-model decision.
 ## Training Details