---
license: mit
language:
- code
- multilingual
tags:
- code
- code-search
- code-retrieval
- embeddings
- feature-extraction
- sentence-similarity
- knowledge-distillation
pipeline_tag: feature-extraction
base_model:
- nomic-ai/CodeRankEmbed
datasets:
- Fsoft-AIC/the-vault-function
- unicamp-dl/mmarco
- sentence-transformers/all-nli
- sentence-transformers/gooaq
- jinaai/negation-dataset
---
# code-daemon-embed-v1
A small, fast **code embedding model** purpose-built to vectorize a **code graph** (functions, methods,
doc-chunks) for on-device semantic code search. It ships with the
[UltraCode](https://github.com/faxenoff/ultracode) MCP server, running as a TensorRT / TVM / OpenVINO / ONNX engine.
It is **deliberately specialized for short code units, not long documents**: long-text handling was
intentionally dropped (max sequence **128 tokens**) to maximize embedding throughput. Code-graph nodes
are short (entity names, signatures, doc-chunks); spending capacity and latency on a long-context path
would only slow the hot path it never uses.
- **768-dim** embeddings, **Matryoshka (MRL)** truncatable to **512 / 256** with graceful decay.
- **~54.5M params**: XLM-RoBERTa architecture, **4 layers / 768 hidden**, **code-only 32k SentencePiece vocab**.
- **Mean pooling** baked into the graph: the output is already pooled (`[batch, 768]`); just **L2-normalize**.
- Trained at sequence length **128**; length buckets s/m/l = seq **40 / 64 / 128**.
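
The Matryoshka truncation mentioned in the list above can be sketched in plain Python. This is a minimal illustration, not the daemon's code: `truncate_and_normalize` is a hypothetical helper name, and the toy vector stands in for a real 768-dim model output. It keeps the leading components and re-normalizes, since a truncated vector is no longer unit length.

```python
import math

def truncate_and_normalize(vec, dim=256):
    """Keep the first `dim` components, then L2-normalize the result."""
    if dim not in (768, 512, 256):
        raise ValueError("the card lists 768, 512 and 256 as supported sizes")
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # avoid dividing by zero for an all-zero vector
    return [x / norm for x in head]

# Toy 768-dim vector standing in for a real model output (assumption).
full = [1.0] * 768
small = truncate_and_normalize(full, dim=256)
print(len(small))  # 256
```

Normalizing after truncation keeps dot products between truncated vectors equal to their cosine similarity.
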
## How it was made
Knowledge-distilled (embedding regression) from the teacher **[`nomic-ai/CodeRankEmbed`](https://huggingface.co/nomic-ai/CodeRankEmbed)**
(MIT, 137M, a strong code retriever). The student is a fresh, shallow-wide XLM-R encoder trained
from scratch on the teacher's passage embeddings over a ~32M-sample code + text corpus, with a
custom 32k code-oriented SentencePiece vocabulary (syntax + identifier lexicon rather than prose).
Why shallow-wide (4 layers / 768 hidden) plus a code vocab: on an internal code-search golden set this
configuration **beat** both a deeper 6-layer variant and the earlier 64k-prose-vocab cut. Depth hurt;
a code-tuned vocab and a wide body helped.
## Built for speed
This model trades long-context capability for raw throughput on short code units:
- **Short context by design**: max **128 tokens**, no long-document path. Code-graph nodes are short
(entity names, signatures, doc-chunks), so the model and its engines are tuned only for that, avoiding
the cost of a wide dynamic shape range.
- **Rectangular TensorRT profiles**: each length bucket is built with a *fixed* shape (min == opt == max),
not a dynamic range, so the autotuner locks one optimal kernel set per bucket:
**s** = batch 64 × seq 40 · **m** = batch 128 × seq 64 · **l** = batch 256 × seq 128.
- **INT8 (W8A16)** weights; **mean-pool + projection + L2-norm fused into the graph** (one pass → `[B, 768]`).
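
A fixed-shape profile means every engine call receives exactly `batch × seq` token ids. The sketch below shows one way a caller could pick a bucket and pad to that shape. The helper names (`pick_bucket`, `pad_to_profile`) are hypothetical, and how the daemon itself fills partial batches is not documented here, so the all-padding filler rows are an assumption.

```python
# Fixed (batch, seq) shapes per bucket, as listed above.
PROFILES = {"s": (64, 40), "m": (128, 64), "l": (256, 128)}
PAD_ID = 0  # pad id documented for the bundled tokenizer

def pick_bucket(n_tokens):
    """Return the smallest bucket whose sequence length fits n_tokens."""
    for name in ("s", "m", "l"):
        if n_tokens <= PROFILES[name][1]:
            return name
    return "l"  # longer inputs get truncated to 128 tokens

def pad_to_profile(rows, bucket):
    """Pad or truncate token-id rows to the bucket's exact (batch, seq) shape."""
    batch, seq = PROFILES[bucket]
    if len(rows) > batch:
        raise ValueError("too many rows for one engine call")
    ids = [list(r[:seq]) + [PAD_ID] * (seq - len(r[:seq])) for r in rows]
    ids += [[PAD_ID] * seq for _ in range(batch - len(ids))]  # filler rows (assumption)
    mask = [[int(t != PAD_ID) for t in row] for row in ids]
    return ids, mask

ids, mask = pad_to_profile([[2, 17, 3]], pick_bucket(3))
print(len(ids), len(ids[0]))  # 64 40
```

Outputs for filler rows carry no meaning and would be discarded by the caller.
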
## Intended use
- **Semantic code search / code retrieval**, and general (multilingual) text retrieval as a fallback.
- Embed **queries and documents the same way** (no instruction prefix: the student was distilled on
passage embeddings, unlike the teacher, whose prefix is query-only). Mean-pool → **L2-normalize**.
- For smaller indexes, truncate to **256** or **512** dims (MRL) before normalizing.
The daemon runs the bundled engines directly (this repo is its CDN), but the FP32 `model.onnx` is
**also bundled** for standalone use. The recipe below runs it with `onnxruntime`: tokenize with the
bundled `sentencepiece.bpe.model`, run the session, and L2-normalize the output, which is already
pooled to `[B,768]`:
```python
import numpy as np
import onnxruntime as ort
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="sentencepiece.bpe.model")  # pad=0 unk=1 bos=2 eos=3
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

def embed(texts, max_len=128, mrl_dim=768):
    ids = [[2, *sp.encode(t)[: max_len - 2], 3] for t in texts]  # bos … eos
    L = max(len(x) for x in ids)
    inp = np.array([x + [0] * (L - len(x)) for x in ids], dtype=np.int64)  # pad=0
    mask = (inp != 0).astype(np.int64)
    out = sess.run(None, {"input_ids": inp, "attention_mask": mask})[0]  # already mean-pooled [B,768]
    out = out[:, :mrl_dim]  # MRL truncation (768/512/256)
    return out / np.linalg.norm(out, axis=1, keepdims=True)
```
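
Because the recipe returns unit-length rows, cosine similarity reduces to a dot product. The following sketch ranks documents against a query on that basis. It uses plain Python lists and toy 2-dim vectors in place of real embeddings, and `rank` is a hypothetical helper, not part of this repo.

```python
def rank(query_vec, doc_vecs, top_k=5):
    """Return (index, score) pairs sorted by descending dot product."""
    scores = [sum(q * d for q, d in zip(query_vec, doc)) for doc in doc_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:top_k]]

# Toy unit vectors standing in for rows returned by embed() (assumption).
docs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
print(rank([0.0, 1.0], docs, top_k=2))  # [(2, 1.0), (1, 0.8)]
```
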
## What's in this repo: ready-to-run compiled engines
This repo holds **pre-compiled, ready-to-run engines**, named per
**runtime × GPU arch × OS × length-bucket**. Grab the compiled model that matches your runtime and
hardware and use it directly, with no compilation on your machine.
- **TensorRT** `*.engine` (NVIDIA, INT8 W8A16), per arch × OS × bucket:
`code-daemon-embed-v1-{s,m,l}_{win_x64,linux_x64}_trt_sm_{86,89,120}.engine`
(sm_86: RTX 30xx / A-series · sm_89: RTX 40xx / L4 · sm_120: RTX 50xx).
- **TVM** `*_tvm_vulkan.{dll,so}`: Vulkan fallback for non-TRT / older NVIDIA & other GPUs, per bucket.
- **OpenVINO** `*.xml` + `*.bin`: Intel **CPU / iGPU / NPU**, per bucket.
- **Metal** `*_tvm_metal.*`: Apple Silicon (macOS), per bucket.
- **Tokenizer**: `sentencepiece.bpe.model` (the model's SentencePiece; specials baked at
pad=0 / unk=1 / bos=2 / eos=3, byte-fallback) + `tokenizer_config.json`. The daemon loads the SP directly.
- **ONNX source**: `model.onnx` (+ `model.onnx.data`) FP32 and `model_int8qdt.onnx` (INT8 W8A16), for
standalone `onnxruntime` / optimum use, and the source the engines are compiled from.
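
The TensorRT naming pattern above is regular enough to build programmatically. The helper below is a hypothetical sketch (`trt_engine_name` is not part of this repo) that assembles a file name from the three documented fields and rejects values the pattern does not list.

```python
def trt_engine_name(bucket, os_tag, sm):
    """Build a TensorRT engine file name following the pattern listed above."""
    if bucket not in ("s", "m", "l"):
        raise ValueError(f"unknown bucket: {bucket!r}")
    if os_tag not in ("win_x64", "linux_x64"):
        raise ValueError(f"unknown OS tag: {os_tag!r}")
    if sm not in (86, 89, 120):
        raise ValueError(f"no engine listed for sm_{sm}")
    return f"code-daemon-embed-v1-{bucket}_{os_tag}_trt_sm_{sm}.engine"

print(trt_engine_name("m", "linux_x64", 89))
# code-daemon-embed-v1-m_linux_x64_trt_sm_89.engine
```
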
## Evaluation: in-scope CoIR (sub-CoIR)
CoIR is a broad code-retrieval benchmark, but **4 of its 10 tasks are out of scope** for a code-graph
search engine (code→code translation, multi-turn dialogue, long problem-statements; the daemon never
performs these). The honest, relevant view is the **in-scope subset**: the retrieval patterns this
model is actually built for (NDCG@10, full corpora):
| CoIR task (in-scope) | NDCG@10 | Pattern |
|---|--:|---|
| codesearchnet (6-lang avg) | **74.64** | docstring / NL → code (the core path) |
| stackoverflow-qa | 53.18 | short question → code |
| synthetic-text2sql | 50.15 | NL → SQL |
| codefeedback-st | 47.71 | NL instruction → code |
| codesearchnet-ccr (6-lang avg) | 44.30 | code → related code (clone/dup) |
| cosqa | 32.14 | NL question → code (noisy / hard) |
| **In-scope average (sub-CoIR)** | **51.56** | |

codesearchnet per language (NL→code): python **91.96**, go 82.27, java 76.02, php 68.98, ruby 65.94, js 62.66.
> The full 10-task official CoIR average (36.67) is dragged down by the 4 out-of-scope tasks and is not
> representative of the real query mix. For scale, the 1.5B-class `bge-code-v1` scores 81.77 on full
> CoIR; this is a **54.5M** model (27× smaller) tuned for one job.
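
For readers unfamiliar with the metric, NDCG@10 can be sketched as below. This is a generic illustration with linear gain, not the CoIR evaluation harness; the function name and arguments are assumptions. With binary relevance labels, linear and exponential gain give the same value.

```python
import math

def ndcg_at_k(ranked_relevance, all_relevance, k=10):
    """NDCG@k with linear gain.

    ranked_relevance: relevance grades in the order the system returned them.
    all_relevance: grades of every relevant item for the query (ideal ranking).
    """
    def dcg(grades):
        return sum(g / math.log2(i + 2) for i, g in enumerate(grades[:k]))

    ideal = dcg(sorted(all_relevance, reverse=True))
    return dcg(ranked_relevance) / ideal if ideal > 0 else 0.0

# One relevant document, returned at rank 2.
print(round(ndcg_at_k([0, 1, 0], [1]), 4))  # 0.6309
```

The table values appear to be this quantity averaged over queries and scaled by 100.
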
On the daemon's own `search-gold` golden set (its real query distribution): **hit@5 0.692**, +80% over
the retired v1.1 cut (0.385). Binary (1-bit) vectors retain ~91% of float NDCG before rescore.
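
The binary-then-rescore pattern can be sketched as follows. How the daemon binarizes is not stated here, so sign thresholding at zero is an assumption, as are the helper names. The idea: shortlist candidates by Hamming distance on 1-bit codes, then reorder that shortlist with the float dot product.

```python
def binarize(vec):
    """Pack the sign of each component into one Python int (1 bit per dim)."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two packed codes."""
    return bin(a ^ b).count("1")

def search(query, docs, shortlist=2, top_k=1):
    """Shortlist by Hamming distance, then rescore with the float dot product."""
    q_bits = binarize(query)
    codes = [binarize(d) for d in docs]
    cand = sorted(range(len(docs)), key=lambda i: hamming(q_bits, codes[i]))[:shortlist]
    rescored = sorted(cand, key=lambda i: -sum(q * d for q, d in zip(query, docs[i])))
    return rescored[:top_k]

# Toy 3-dim vectors standing in for real embeddings (assumption).
docs = [[0.9, 0.1, -0.4], [0.2, 0.9, 0.3], [-0.7, -0.6, 0.2]]
print(search([0.1, 0.8, 0.5], docs))  # [1]
```

In a real index the codes would be precomputed and stored, so only the shortlist needs float vectors.
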
## Performance (embeddings / sec)
| Backend | Hardware | Throughput |
|---|---|--:|
| TensorRT INT8 | NVIDIA RTX 5060 (sm_120) | **~20,000 emb/s** |
| OpenVINO INT4 | Intel iGPU (Xe2, Lunar Lake) | ~580 emb/s |
| OpenVINO INT4 | Intel NPU (NPU4) | ~574 emb/s |
| OpenVINO INT8 | Intel CPU (Core Ultra) | ~375 emb/s |
| OpenVINO, **all 3 in parallel** | iGPU + NPU + CPU concurrently | ~1,290 emb/s |

The combined figure is **genuine concurrent multi-device execution**: three independent workers, one
bound to each of the iGPU, NPU and CPU, embed different batches **at the same time**, and their
throughputs combine (the measured ~1,290 emb/s is about 84% of the ~1,529 emb/s sum of the three
standalone figures). This is **not** OpenVINO's `AUTO` mode (which selects a *single* device per
inference and never runs the three simultaneously); the daemon length-sorts inputs and fans the buckets
across all three devices. The TensorRT figure is inference throughput on the bucketed batch path; the
OpenVINO figures were measured on a Core Ultra (Lunar Lake) laptop.
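
The fan-out described above can be sketched with one thread per device pulling batches from a shared queue. This is an illustration only: `fake_embed` is a stand-in for a per-device compiled model call, the device labels and batch size are assumptions, and the daemon's actual scheduling policy is not documented here.

```python
import queue
import threading

def fake_embed(device, batch):
    """Stand-in for a per-device compiled model call (hypothetical)."""
    return [(device, len(tokens)) for tokens in batch]

def embed_concurrently(token_lists, devices=("GPU", "NPU", "CPU"), batch_size=2):
    # Length-sort so each batch holds similarly sized inputs; keep original positions.
    order = sorted(range(len(token_lists)), key=lambda i: len(token_lists[i]))
    work = queue.Queue()
    for start in range(0, len(order), batch_size):
        work.put(order[start:start + batch_size])
    results = [None] * len(token_lists)

    def worker(device):
        while True:
            try:
                idxs = work.get_nowait()
            except queue.Empty:
                return
            outs = fake_embed(device, [token_lists[i] for i in idxs])
            for i, out in zip(idxs, outs):
                results[i] = out  # each index is written by exactly one worker

    threads = [threading.Thread(target=worker, args=(d,)) for d in devices]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

inputs = [[1] * 50, [1] * 5, [1] * 120, [1] * 30, [1] * 70]
print([n for _, n in embed_concurrently(inputs)])  # [50, 5, 120, 30, 70]
```

Results come back in the original input order even though batches finish in arbitrary order on different devices.
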
## License & training data
Released under the **MIT license**.
The teacher (`nomic-ai/CodeRankEmbed`) is MIT, and the XLM-R architecture is MIT. As is standard
practice for distilled embedding models, the **weights are released under MIT**. For transparency,
the training corpus the teacher embedded includes:
| Dataset | License note |
|---|---|
| `Fsoft-AIC/the-vault-function` (code) | dataset MIT; underlying code has mixed upstream provenance |
| `unicamp-dl/mmarco` (EN/RU retrieval) | **MS MARCO-derived: non-commercial research terms** |
| `sentence-transformers/all-nli` | SNLI (CC BY-SA 4.0) + MultiNLI |
| `sentence-transformers/gooaq` | Apache-2.0 |
| `jinaai/negation-dataset` | see source repo |

⚠️ If your use requires strict training-data-license compliance, note that **mMARCO derives from
MS MARCO (non-commercial)**. Whether a distilled model inherits dataset-use terms is legally
unsettled; this is **not legal advice**. A data-clean variant can be retrained without the mMARCO
splits if needed.
## Attribution
Distilled from **[nomic-ai/CodeRankEmbed](https://huggingface.co/nomic-ai/CodeRankEmbed)** (MIT). Backbone: XLM-RoBERTa (MIT).