Sentence Similarity
sentence-transformers
ONNX
Safetensors
English
code
code-search
embeddings
LoRA
BGE
cqs
Instructions to use jamie8johnson/bge-large-v1.5-code-search with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- sentence-transformers
How to use jamie8johnson/bge-large-v1.5-code-search with sentence-transformers:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jamie8johnson/bge-large-v1.5-code-search")

sentences = [
    "That is a happy person",
    "That is a happy dog",
    "That is a very happy person",
    "Today is a sunny day",
]
embeddings = model.encode(sentences)

similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [4, 4]
```
- Notebooks
- Google Colab
- Kaggle
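For code search, the similarity step in the snippet above amounts to ranking candidate code chunks against a query embedding. The following is a minimal sketch of that ranking in plain Python. The `rank_by_cosine` helper and the 3-dimensional toy vectors are illustrative assumptions, not part of this model card or of sentence-transformers.

```python
import math


def rank_by_cosine(query_vec, doc_vecs):
    """Return (indices, scores) of documents, most similar to the query first."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    scores = [cosine(query_vec, d) for d in doc_vecs]
    # Sort indices by descending score; sorted() is stable, so ties keep input order.
    order = sorted(range(len(doc_vecs)), key=lambda i: -scores[i])
    return order, [scores[i] for i in order]


# Toy vectors standing in for the model's 1024-dimensional embeddings (assumption).
query = [1.0, 0.0, 0.0]
docs = [[0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
order, scores = rank_by_cosine(query, docs)
print(order)
```

With the real model, the rows returned by `model.encode(...)` could be passed in place of the toy vectors.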
Add v3.v2 production-eval results; document test+/dev- trade-off
README.md
CHANGED
@@ -20,7 +20,26 @@ datasets:

A fine-tuned code search embedding model based on **BAAI/bge-large-en-v1.5** (335M parameters, 1024 dimensions). Trained with call-graph false-negative filtering on 200K balanced pairs across 9 programming languages. Built for [cqs](https://github.com/jamie8johnson/cqs): code intelligence and RAG for AI agents.

-##

| Eval | Metric | This Model | BGE-large Baseline | v9-200k (110M) |
|------|--------|-----------|-------------------|----------------|
@@ -32,7 +51,7 @@ A fine-tuned code search embedding model based on **BAAI/bge-large-en-v1.5** (33
| CoIR 9-task (19 subtasks) | Overall | **57.5** | 55.7 | 52.7 |
| CoIR CodeSearchNet (6 languages) | NDCG@10 | **0.779** | 0.721 | 0.615 |

-**

## Training Details


A fine-tuned code search embedding model based on **BAAI/bge-large-en-v1.5** (335M parameters, 1024 dimensions). Trained with call-graph false-negative filtering on 200K balanced pairs across 9 programming languages. Built for [cqs](https://github.com/jamie8johnson/cqs): code intelligence and RAG for AI agents.

+## Production Eval (v3.v2 fixture, 2026-05-02)
+
+The headline results below come from cqs's production fixture: 218 queries (109 test + 109 dev) curated from real agent telemetry and LLM-generated retrieval cases on the cqs codebase itself. This is the eval that drives default-model decisions.
+
+| split | metric | BGE-large (base) | **BGE-large + LoRA (this)** | Δ vs base (pp) |
+|-------|--------|-----------------:|----------------------------:|---------------:|
+| test | R@1 | 43.1% | **45.0%** | +1.9 |
+| test | R@5 | 69.7% | **73.4%** | **+3.7** |
+| test | R@20 | 83.5% | 83.5% | 0.0 |
+| dev | R@1 | 45.9% | **46.8%** | +0.9 |
+| dev | R@5 | **77.1%** | 70.6% | **−6.5** |
+| dev | R@20 | **86.2%** | 82.6% | −3.6 |
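
R@k in the table is read here as recall at k: the share of queries whose expected result appears among the top k retrieved items. The sketch below shows one way to compute it. It assumes a single expected ("gold") item per query, which the source does not state, and the function name and toy data are hypothetical.

```python
def recall_at_k(ranked_ids, gold_ids, k):
    """Fraction of queries whose gold item appears in the top-k results.

    ranked_ids: one ranked list of result ids per query, best first.
    gold_ids:   the single expected id for each query (an assumption of this sketch).
    """
    if len(ranked_ids) != len(gold_ids):
        raise ValueError("need exactly one gold id per query")
    if not gold_ids:
        raise ValueError("no queries to score")
    hits = sum(1 for ranked, gold in zip(ranked_ids, gold_ids) if gold in ranked[:k])
    return hits / len(gold_ids)


# Toy data: four queries, three results each.
ranked = [["a", "b", "c"], ["x", "y", "z"], ["m", "n", "o"], ["p", "q", "r"]]
gold = ["a", "z", "n", "missing"]

r_at_1 = recall_at_k(ranked, gold, 1)
r_at_2 = recall_at_k(ranked, gold, 2)
r_at_3 = recall_at_k(ranked, gold, 3)
print(r_at_1, r_at_2, r_at_3)
```

Recall at k can only stay flat or rise as k grows, which matches the R@1 to R@20 progression in each column of the table.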
+
+**Wins test R@5 by 3.7pp, loses dev R@5 by 6.5pp.** This is the canonical fine-tune trade-off: training on a code-pair distribution helps in-distribution retrieval (the test split, where queries pattern-match cqs's own code) but hurts out-of-distribution generalization (the dev split, which deliberately includes harder, more natural-language-shaped queries). For agent-facing search where queries are mostly code-shaped, this is a net win on R@5; for queries that drift into open-ended reasoning, the broader pre-training of the un-fine-tuned BGE-large base is the safer hedge.
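
Because each split holds 109 queries, a single query is worth roughly 0.92pp, so the R@5 deltas can be restated as query counts. The arithmetic sketch below does that. It assumes every query counts equally and that the reported percentages are rounded to one decimal. The hit counts are inferred from those percentages and are not figures reported by the source.

```python
SPLIT_SIZE = 109  # queries per split, from the fixture description (109 test + 109 dev)


def implied_hits(percent, n=SPLIT_SIZE):
    """Nearest whole number of successful queries for a reported percentage."""
    return round(percent / 100 * n)


# R@5 values copied from the production-eval table.
r_at_5 = {
    "test": {"base": 69.7, "lora": 73.4},
    "dev": {"base": 77.1, "lora": 70.6},
}

delta_queries = {
    split: implied_hits(vals["lora"]) - implied_hits(vals["base"])
    for split, vals in r_at_5.items()
}
print(delta_queries)  # {'test': 4, 'dev': -7}
```

Under those assumptions, the test gain corresponds to about 4 queries and the dev loss to about 7.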
+
+**Decision (cqs default):** the default stays at the BGE-large base model as a hedge on dev R@5; this model ships as opt-in via `CQS_EMBEDDING_MODEL=bge-large-ft` or `cqs slot create bge-ft --model bge-large-ft`. Pick this preset when (a) latency and model size are fixed (it has the same architecture as the base, so there is no extra cost) AND (b) your query distribution is skewed toward concrete code search rather than open-ended exploration.
+
+## Historical results (296q synthetic fixture)
+
+These results come from an earlier synthetic eval (296 queries across 7 languages, enriched chunks). They show the model's strength on cleanly curated code-search pairs:

| Eval | Metric | This Model | BGE-large Baseline | v9-200k (110M) |
|------|--------|-----------|-------------------|----------------|
@@ -32,7 +51,7 @@ A fine-tuned code search embedding model based on **BAAI/bge-large-en-v1.5** (33
| CoIR 9-task (19 subtasks) | Overall | **57.5** | 55.7 | 52.7 |
| CoIR CodeSearchNet (6 languages) | NDCG@10 | **0.779** | 0.721 | 0.615 |
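
The CodeSearchNet row reports NDCG@10, which rewards placing relevant items near the top of the ranking. The sketch below follows the common textbook definition with linear gain and a log2 rank discount. Whether CoIR's evaluation uses exactly these settings is not stated in the source, and the function names are hypothetical.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: each relevance is divided by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg_at_k(ranked_relevances, k=10, all_relevances=None):
    """NDCG@k for one query.

    ranked_relevances: graded relevance of each returned item, in rank order.
    all_relevances:    relevance of every judged item for the query; defaults to
                       the ranked list itself (an assumption of this sketch).
    """
    pool = ranked_relevances if all_relevances is None else all_relevances
    ideal = dcg(sorted(pool, reverse=True)[:k])
    if ideal == 0:
        return 0.0  # no relevant items at all: define the score as zero
    return dcg(ranked_relevances[:k]) / ideal


perfect = ndcg_at_k([1, 0, 0])  # relevant item ranked first
second = ndcg_at_k([0, 1, 0])   # relevant item ranked second
print(perfect, second)
```

A benchmark-level figure would normally be the mean of this per-query score over all queries.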

+**Note:** the synthetic-fixture numbers above were the original justification for the claim "new best on every metric except raw R@1." That claim holds on cleanly curated pairs. The v3.v2 production fixture (above) is harder and more diverse, and that is where the in- vs. out-of-distribution trade-off shows up. Both results are real; the production fixture is the one that drives the default-model decision.

## Training Details
