jamie8johnson committed on
Commit
bf1daa3
·
verified ·
1 Parent(s): bc832ef

Add v3.v2 production-eval results; document test+/dev- trade-off

Files changed (1)
  1. README.md +21 -2
README.md CHANGED
@@ -20,7 +20,26 @@ datasets:
20
 
21
  A fine-tuned code search embedding model based on **BAAI/bge-large-en-v1.5** (335M parameters, 1024 dimensions). Trained with call-graph false-negative filtering on 200K balanced pairs across 9 programming languages. Built for [cqs](https://github.com/jamie8johnson/cqs) — code intelligence and RAG for AI agents.
22
 
23
- ## Key Results
24
 
25
  | Eval | Metric | This Model | BGE-large Baseline | v9-200k (110M) |
26
  |------|--------|-----------|-------------------|----------------|
@@ -32,7 +51,7 @@ A fine-tuned code search embedding model based on **BAAI/bge-large-en-v1.5** (33
32
  | CoIR 9-task (19 subtasks) | Overall | **57.5** | 55.7 | 52.7 |
33
  | CoIR CodeSearchNet (6 languages) | NDCG@10 | **0.779** | 0.721 | 0.615 |
34
 
35
- **New best on every metric except raw R@1** (where v9-200k's 70.9% still leads). Fine-tuning is additive with model capacity: BGE-large did not basin like the E5-base variants.
36
 
37
  ## Training Details
38
 
 
20
 
21
  A fine-tuned code search embedding model based on **BAAI/bge-large-en-v1.5** (335M parameters, 1024 dimensions). Trained with call-graph false-negative filtering on 200K balanced pairs across 9 programming languages. Built for [cqs](https://github.com/jamie8johnson/cqs) — code intelligence and RAG for AI agents.
22
 
23
+ ## Production Eval (v3.v2 fixture, 2026-05-02)
24
+
25
+ The headline results below are from cqs's production fixture — 218 queries (109 test + 109 dev) curated from real agent telemetry and LLM-generated retrieval cases on the cqs codebase itself. This is the eval that drives default-model decisions.
26
+
27
+ | split | metric | BGE-large (base) | **BGE-large + LoRA (this)** | Δ vs base (pp) |
28
+ |-------|--------|-----------------:|----------------------------:|----------:|
29
+ | test | R@1 | 43.1% | **45.0%** | +1.9 |
30
+ | test | R@5 | 69.7% | **73.4%** | **+3.7** |
31
+ | test | R@20 | 83.5% | 83.5% | 0.0 |
32
+ | dev | R@1 | 45.9% | **46.8%** | +0.9 |
33
+ | dev | R@5 | **77.1%** | 70.6% | **−6.5** |
34
+ | dev | R@20 | **86.2%** | 82.6% | −3.6 |
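
The README does not spell out how R@k is computed. A minimal sketch, assuming the usual definition (the fraction of queries whose expected result appears among the top k retrieved items) and a single expected result per query; the function name and the toy data are hypothetical and not taken from the cqs fixture:

```python
def recall_at_k(ranked_ids, expected_ids, k):
    """Fraction of queries whose expected id appears in the top-k ranked ids.

    Assumes exactly one expected id per query (an assumption, not confirmed
    by the README).
    """
    if not expected_ids:
        raise ValueError("need at least one query")
    hits = sum(
        1
        for ranked, expected in zip(ranked_ids, expected_ids)
        if expected in ranked[:k]
    )
    return hits / len(expected_ids)


# Toy data: four queries, each with a ranked list of retrieved chunk ids.
ranked = [["a", "b", "c"], ["x", "y", "z"], ["m", "n", "o"], ["p", "q", "r"]]
expected = ["a", "y", "o", "s"]

for k in (1, 2, 3):
    print(f"R@{k} = {recall_at_k(ranked, expected, k):.2f}")
```

Under this definition R@k can only stay flat or rise as k grows, which matches the ordering of the R@1, R@5, and R@20 rows in the table.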
35
+
36
+ **Wins test R@5 by 3.7pp, loses dev R@5 by 6.5pp.** This is the canonical fine-tune trade-off: training on a code-pair distribution helps in-distribution retrieval (test split, where queries pattern-match cqs's own code) but hurts out-of-distribution generalization (dev split, which deliberately includes harder, more natural-language-shaped queries). R@1 improves slightly on both splits, while R@20 is flat on test and drops 3.6pp on dev. For agent-facing search where queries are mostly code-shaped, this is a net win on R@5; for queries that drift into open-ended reasoning, BGE-base's broader pre-training is the safer hedge.
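
To put the percentage-point deltas in terms of queries, here is a rough sketch that recomputes the Δ column from the table and converts each delta into an approximate query count. It assumes 109 queries per split (as stated above) and that R@k is hits divided by queries; the query counts are inferred from the percentages, not reported in the README, and the helper names are hypothetical.

```python
# Recall values (percent) copied from the production-eval table:
# (split, metric) -> (BGE-large base, BGE-large + LoRA)
RESULTS = {
    ("test", "R@1"): (43.1, 45.0),
    ("test", "R@5"): (69.7, 73.4),
    ("test", "R@20"): (83.5, 83.5),
    ("dev", "R@1"): (45.9, 46.8),
    ("dev", "R@5"): (77.1, 70.6),
    ("dev", "R@20"): (86.2, 82.6),
}
QUERIES_PER_SPLIT = 109  # 109 test + 109 dev, per the README


def delta_pp(base, tuned):
    """Difference in percentage points, rounded to one decimal."""
    return round(tuned - base, 1)


def implied_query_shift(base, tuned, n=QUERIES_PER_SPLIT):
    """Approximate number of queries behind a delta (inferred, not reported)."""
    return round((tuned - base) / 100 * n)


for (split, metric), (base, tuned) in RESULTS.items():
    print(
        f"{split:4} {metric:4} {delta_pp(base, tuned):+5.1f} pp  "
        f"~{implied_query_shift(base, tuned):+d} queries"
    )
```

On splits this small, each query is worth roughly 0.9pp, so the headline R@5 deltas correspond to only a handful of queries in each direction.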
37
+
38
+ **Decision (cqs default):** the default stays at BGE-base for the dev R@5 hedge; this model ships as opt-in via `CQS_EMBEDDING_MODEL=bge-large-ft` or `cqs slot create bge-ft --model bge-large-ft`. Pick this preset when (a) latency and model size are fixed (same architecture as the base, no extra cost) AND (b) your query distribution is skewed toward concrete code search rather than open-ended exploration.
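
The two opt-in routes can be prepared from a script. This is only a sketch: the environment variable, preset name, slot name, and command line are copied from the paragraph above, while the helper functions are hypothetical and nothing here is executed against cqs.

```python
import os

MODEL_PRESET = "bge-large-ft"  # preset name quoted in the README
SLOT_NAME = "bge-ft"           # slot name quoted in the README


def env_with_model(preset=MODEL_PRESET):
    """Return a copy of the current environment with the opt-in model set."""
    env = dict(os.environ)
    env["CQS_EMBEDDING_MODEL"] = preset
    return env


def slot_create_command(slot=SLOT_NAME, preset=MODEL_PRESET):
    """Build the argv for the slot-creation command shown in the README."""
    return ["cqs", "slot", "create", slot, "--model", preset]


if __name__ == "__main__":
    # Print rather than run, so the sketch has no side effects.
    print("CQS_EMBEDDING_MODEL =", env_with_model()["CQS_EMBEDDING_MODEL"])
    print("command:", " ".join(slot_create_command()))
```

The environment dictionary and the argv list could then be passed to `subprocess.run(..., env=...)` once cqs is confirmed to be installed.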
39
+
40
+ ## Historical results (296q synthetic fixture)
41
+
42
+ These are from an earlier synthetic eval (296 queries across 7 languages, enriched chunks). They show the model's strength on cleanly curated code-search pairs:
43
 
44
  | Eval | Metric | This Model | BGE-large Baseline | v9-200k (110M) |
45
  |------|--------|-----------|-------------------|----------------|
 
51
  | CoIR 9-task (19 subtasks) | Overall | **57.5** | 55.7 | 52.7 |
52
  | CoIR CodeSearchNet (6 languages) | NDCG@10 | **0.779** | 0.721 | 0.615 |
53
 
54
+ **Note:** the synthetic-fixture numbers above were the original justification for "new best on every metric except raw R@1." That holds on cleanly curated pairs. The v3.v2 production fixture (above) is harder and more diverse, and that's where the in-vs-out-of-distribution trade-off shows up. Both results are real; the production fixture is the one that drives the default-model decision.
55
 
56
  ## Training Details
57