---
license: mit
language:
- code
- multilingual
tags:
- code
- code-search
- code-retrieval
- embeddings
- feature-extraction
- sentence-similarity
- knowledge-distillation
pipeline_tag: feature-extraction
base_model:
- nomic-ai/CodeRankEmbed
datasets:
- Fsoft-AIC/the-vault-function
- unicamp-dl/mmarco
- sentence-transformers/all-nli
- sentence-transformers/gooaq
- jinaai/negation-dataset
---

# code-daemon-embed-v1

A small, fast **code embedding model** purpose-built to vectorize a **code graph** (functions, methods,
doc-chunks) for on-device semantic code search. It ships with the
[UltraCode](https://github.com/faxenoff/ultracode) MCP server, running as a TensorRT / TVM / OpenVINO / ONNX engine.

It is **deliberately specialized for short code units, not long documents**: long-text handling was
intentionally dropped (max sequence **128 tokens**) to maximize embedding throughput. Code-graph nodes
are short (entity names, signatures, doc-chunks), so spending capacity and latency on a long-context
path the model never uses would only slow the hot path.

- **768-dim** embeddings, **Matryoshka (MRL)** truncatable to **512 / 256** with graceful decay.
- **~54.5M params**: XLM-RoBERTa architecture, **4 layers / 768 hidden**, **code-only 32k SentencePiece vocab**.
- **Mean pooling** baked into the graph: output is already pooled (`[batch, 768]`); just **L2-normalize**.
- Trained at sequence length **128**; length buckets s/m/l = seq **40 / 64 / 128**.

## How it was made

Knowledge-distilled (embedding regression) from the teacher **[`nomic-ai/CodeRankEmbed`](https://huggingface.co/nomic-ai/CodeRankEmbed)**
(MIT, 137M, a strong code retriever). The student is a fresh, shallow-wide XLM-R encoder trained
from scratch on the teacher's passage embeddings over a ~32M-sample code + text corpus, with a
custom 32k code-oriented SentencePiece vocabulary (syntax + identifier lexicon rather than prose).
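
The card does not state the exact regression objective. As a minimal sketch of what "embedding regression" against a teacher can look like, the snippet below assumes a cosine-distance loss averaged over the three MRL prefix widths; the function name `cosine_regression_loss` and the choice of loss are illustrative assumptions, not the training code used for this model.

```python
import numpy as np

def cosine_regression_loss(student, teacher, dims=(768, 512, 256)):
    """Mean (1 - cosine) between student and teacher rows, averaged over MRL prefixes.

    Assumption: the real objective may differ (e.g. MSE); this only illustrates the idea.
    """
    total = 0.0
    for d in dims:
        s = student[:, :d] / np.linalg.norm(student[:, :d], axis=1, keepdims=True)
        t = teacher[:, :d] / np.linalg.norm(teacher[:, :d], axis=1, keepdims=True)
        total += np.mean(1.0 - np.sum(s * t, axis=1))  # 0 when directions match
    return total / len(dims)

# Random stand-ins for a batch of teacher and student embeddings.
rng = np.random.default_rng(0)
teacher_emb = rng.standard_normal((8, 768))
student_emb = teacher_emb + 0.05 * rng.standard_normal((8, 768))
loss = cosine_regression_loss(student_emb, teacher_emb)
```

Averaging over prefix widths is one way to push the leading dimensions to carry most of the signal, which is what makes truncation to 512 or 256 dims usable.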

Why shallow-wide (4l/768h) + code vocab: on an internal code-search golden set this **beat** both a
deeper 6-layer variant and the earlier 64k-prose-vocab cut. Depth hurt; a code-tuned vocab and a
wide body helped.

## Built for speed

This model trades long-context capability for raw throughput on short code units:

- **Short context by design**: max **128 tokens**, no long-document path. Code-graph nodes are short
  (entity names, signatures, doc-chunks), so the model and its engines are tuned only for that, avoiding
  the cost of a wide dynamic shape range.
- **Rectangular TensorRT profiles**: each length bucket is built with a *fixed* shape (min == opt == max),
  not a dynamic range, so the autotuner locks one optimal kernel set per bucket:
  **s** = batch 64 × seq 40 · **m** = batch 128 × seq 64 · **l** = batch 256 × seq 128.
- **INT8 (W8A16)** weights; **mean-pool + projection + L2-norm fused into the graph** (one pass → `[B, 768]`).
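
The fixed-shape bucket scheme above can be sketched as follows. This is an illustration only, not the daemon's code: the `BUCKETS` table copies the shapes listed above, while `pick_bucket` and `pad_to_bucket` are hypothetical helper names, and padding with id 0 assumes the tokenizer's pad=0 convention.

```python
import numpy as np

# (batch, seq) per bucket, copied from the profile list above.
BUCKETS = {"s": (64, 40), "m": (128, 64), "l": (256, 128)}

def pick_bucket(n_tokens):
    """Return the smallest bucket whose fixed seq length fits n_tokens."""
    for name, (_, seq) in BUCKETS.items():  # dict order is s, m, l
        if n_tokens <= seq:
            return name
    return "l"  # longer inputs have to be truncated to 128 tokens

def pad_to_bucket(token_ids, bucket, pad_id=0):
    """Pad (or truncate) id sequences to the bucket's fixed [batch, seq] shape."""
    batch, seq = BUCKETS[bucket]
    if len(token_ids) > batch:
        raise ValueError("too many sequences for one batch")
    out = np.full((batch, seq), pad_id, dtype=np.int64)
    for row, ids in enumerate(token_ids):
        ids = ids[:seq]
        out[row, : len(ids)] = ids
    return out

batch = pad_to_bucket([[2, 17, 99, 3], [2, 5, 3]], pick_bucket(4))
```

Because every engine sees exactly one shape, unused rows are simply padding; a caller would keep only the first `len(token_ids)` output rows.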

## Intended use

- **Semantic code search / code retrieval**, and general (multilingual) text retrieval as a fallback.
- Embed **queries and documents the same way**, with no instruction prefix: the student was distilled on
  passage embeddings, unlike the teacher, which uses a query-only prefix. Mean-pool → **L2-normalize**.
- For smaller indexes, truncate to **256** or **512** dims (MRL) before normalizing.

The daemon runs the bundled engines directly (this repo is its CDN), but the FP32 `model.onnx` is
**also bundled** for standalone use. The recipe below runs it with `onnxruntime`: tokenize with the
bundled `sentencepiece.bpe.model`, run the session, and L2-normalize the output, which is already
pooled to `[B,768]`:

```python
import onnxruntime as ort, sentencepiece as spm, numpy as np

sp   = spm.SentencePieceProcessor(model_file="sentencepiece.bpe.model")  # pad=0 unk=1 bos=2 eos=3
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

def embed(texts, max_len=128, mrl_dim=768):
    ids  = [[2, *sp.encode(t)[: max_len - 2], 3] for t in texts]          # bos … eos
    L    = max(len(x) for x in ids)
    inp  = np.array([x + [0] * (L - len(x)) for x in ids], dtype=np.int64) # pad=0
    mask = (inp != 0).astype(np.int64)
    out  = sess.run(None, {"input_ids": inp, "attention_mask": mask})[0]   # already mean-pooled [B,768]
    out  = out[:, :mrl_dim]                                                # MRL truncation (768/512/256)
    return out / np.linalg.norm(out, axis=1, keepdims=True)
```
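
Once vectors are L2-normalized, cosine similarity is a plain dot product. The sketch below ranks documents against a query and shows MRL truncation followed by re-normalization. `truncate_and_normalize` and `top_k` are illustrative names, and random vectors stand in for real `embed(...)` output so the snippet runs without the model files.

```python
import numpy as np

def truncate_and_normalize(vecs, dim=256):
    """Keep the first `dim` MRL dimensions, then L2-normalize each row."""
    cut = vecs[:, :dim]
    norms = np.linalg.norm(cut, axis=1, keepdims=True)
    return cut / np.maximum(norms, 1e-12)  # guard against all-zero rows

def top_k(query_vec, doc_vecs, k=5):
    """Return indices of the k documents with the highest cosine similarity."""
    scores = doc_vecs @ query_vec  # dot product equals cosine for unit vectors
    return np.argsort(-scores)[:k]

rng = np.random.default_rng(0)
docs = rng.standard_normal((100, 768)).astype(np.float32)  # stand-in for embed(documents)
noise = rng.standard_normal(768).astype(np.float32)
query = docs[42] + 0.1 * noise                             # near-duplicate of document 42

docs_256 = truncate_and_normalize(docs, 256)
query_256 = truncate_and_normalize(query[None, :], 256)[0]
best = top_k(query_256, docs_256, k=3)
```

Truncated vectors are no longer unit length, which is why the normalization happens after the slice rather than before it.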

## What's in this repo: ready-to-run compiled engines

This repo holds **pre-compiled, ready-to-run engines**, named per
**runtime × GPU arch × OS × length-bucket**. Grab the compiled model that matches your runtime and
hardware and use it directly, with no compilation on your machine.

- **TensorRT** `*.engine`: NVIDIA, INT8 W8A16, one file per arch × OS × bucket, named
  `code-daemon-embed-v1-{s,m,l}_{win_x64,linux_x64}_trt_sm_{86,89,120}.engine`
  (sm_86 ≈ RTX 30xx / A-series · sm_89 ≈ RTX 40xx / L4 · sm_120 ≈ RTX 50xx).
- **TVM** `*_tvm_vulkan.{dll,so}`: Vulkan fallback for non-TRT / older NVIDIA & other GPUs, per bucket.
- **OpenVINO** `*.xml` + `*.bin`: Intel **CPU / iGPU / NPU**, per bucket.
- **Metal** `*_tvm_metal.*`: Apple Silicon (macOS), per bucket.
- **Tokenizer**: `sentencepiece.bpe.model` (the model's SentencePiece; specials baked at
  pad=0 / unk=1 / bos=2 / eos=3, byte-fallback) + `tokenizer_config.json`. The daemon loads the SP directly.
- **ONNX source**: `model.onnx` (+ `model.onnx.data`) FP32 and `model_int8qdt.onnx` (INT8 W8A16), for
  standalone `onnxruntime` / optimum use, and the source the engines are compiled from.
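
To pick the right TensorRT file programmatically, the naming pattern above can be turned into a small helper. `trt_engine_name` is a hypothetical function written for this example; it only formats the pattern shown in the list and does not check that the file exists in the repo.

```python
def trt_engine_name(bucket, os_tag, sm):
    """Build a TensorRT engine filename from the documented naming pattern."""
    if bucket not in ("s", "m", "l"):
        raise ValueError(f"unknown bucket: {bucket!r}")
    if os_tag not in ("win_x64", "linux_x64"):
        raise ValueError(f"unknown OS tag: {os_tag!r}")
    if sm not in (86, 89, 120):
        raise ValueError(f"no prebuilt engine listed for sm_{sm}")
    return f"code-daemon-embed-v1-{bucket}_{os_tag}_trt_sm_{sm}.engine"

# Example: medium bucket, Linux, RTX 40xx-class GPU.
name = trt_engine_name("m", "linux_x64", 89)
```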

## Evaluation: in-scope CoIR (sub-CoIR)

CoIR is a broad code-retrieval benchmark, but **4 of its 10 tasks are out of scope** for a code-graph
search engine (code↔code translation, multi-turn dialogue, long problem-statements; the daemon never
performs these). The honest, relevant view is the **in-scope subset**, the retrieval patterns this
model is actually built for (NDCG@10, full corpora):

| CoIR task (in-scope) | NDCG@10 | Pattern |
|---|--:|---|
| codesearchnet (6-lang avg) | **74.64** | docstring / NL → code (the core path) |
| stackoverflow-qa | 53.18 | short question → code |
| synthetic-text2sql | 50.15 | NL → SQL |
| codefeedback-st | 47.71 | NL instruction → code |
| codesearchnet-ccr (6-lang avg) | 44.30 | code → related code (clone/dup) |
| cosqa | 32.14 | NL question → code (noisy / hard) |
| **In-scope average (sub-CoIR)** | **51.56** | |

codesearchnet per language (NL→code): python **91.96**, go 82.27, java 76.02, php 68.98, ruby 65.94, js 62.66.

> The full 10-task official CoIR average (36.67) is dragged down by the 4 out-of-scope tasks and is not
> representative of the real query mix. For scale, the 1.5B-class `bge-code-v1` scores 81.77 on full
> CoIR; this is a **54.5M** model (27× smaller) tuned for one job.

On the daemon's own `search-gold` golden set (its real query distribution): **hit@5 0.692**, +80% over
the retired v1.1 cut (0.385). Binary (1-bit) vectors retain ~91% of float NDCG before rescore.
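
The card does not describe how the 1-bit vectors are built or rescored. A common approach, assumed here, is sign binarization with a Hamming-distance first pass followed by a float rescore of the shortlist. All function names below are illustrative, and random vectors stand in for real embeddings.

```python
import numpy as np

def binarize(vecs):
    """Sign-binarize float vectors and pack 8 dims per byte (768 dims -> 96 bytes)."""
    return np.packbits(vecs > 0, axis=1)

def hamming_candidates(query_bits, doc_bits, n):
    """Return the n documents with the smallest Hamming distance to the query."""
    xor = np.bitwise_xor(doc_bits, query_bits)     # differing bits, still packed
    dist = np.unpackbits(xor, axis=1).sum(axis=1)  # popcount per document
    return np.argsort(dist)[:n]

def search(query, docs, doc_bits, k=5, oversample=4):
    """Binary first pass, then rescore the shortlist with float dot products."""
    shortlist = hamming_candidates(binarize(query[None, :]), doc_bits, k * oversample)
    scores = docs[shortlist] @ query
    return shortlist[np.argsort(-scores)[:k]]

rng = np.random.default_rng(0)
docs = rng.standard_normal((200, 768)).astype(np.float32)
query = docs[7] + 0.1 * rng.standard_normal(768).astype(np.float32)  # close to document 7
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query /= np.linalg.norm(query)

doc_bits = binarize(docs)
hits = search(query, docs, doc_bits, k=5)
```

The packed index needs 96 bytes per vector instead of 3,072 for float32, and the `oversample` factor controls how many candidates the float rescore gets to reorder.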

## Performance (embeddings / sec)

| Backend | Hardware | Throughput |
|---|---|--:|
| TensorRT INT8 | NVIDIA RTX 5060 (sm_120) | **~20,000 emb/s** |
| OpenVINO INT4 | Intel iGPU (Xe2, Lunar Lake) | ~580 emb/s |
| OpenVINO INT4 | Intel NPU (NPU4) | ~574 emb/s |
| OpenVINO INT8 | Intel CPU (Core Ultra) | ~375 emb/s |
| OpenVINO, **all 3 in parallel** | iGPU + NPU + CPU concurrently | ~1,290 emb/s |

The combined figure is **genuine concurrent multi-device execution**: three independent workers, one
bound to each of the iGPU, NPU and CPU, embed different batches **at the same time**, so the combined
rate approaches the sum of the individual figures (~1,290 emb/s measured versus ~1,529 emb/s summed).
This is **not** OpenVINO's `AUTO` mode (which selects a *single* device per inference and never runs
the three simultaneously); the daemon length-sorts inputs and fans the buckets across all three
devices. The TensorRT figure is inference throughput on the bucketed batch path; the OpenVINO figures
were measured on a Core Ultra (Lunar Lake) laptop.
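
The fan-out pattern described above can be sketched with one thread per device pulling from a shared queue. This is a simplified illustration under stated assumptions, not the daemon's scheduler: `length_sorted_batches`, `run_workers` and `fake_embed` are hypothetical names, the device strings are plain labels, and `fake_embed` stands in for a real per-device inference call.

```python
import queue
import threading

def length_sorted_batches(texts, batch_size):
    """Sort inputs by length and return batches of original indices."""
    order = sorted(range(len(texts)), key=lambda i: len(texts[i]))
    return [order[i : i + batch_size] for i in range(0, len(order), batch_size)]

def run_workers(batches, devices, embed_fn):
    """Run one worker thread per device; each pulls batches until the queue is empty."""
    work = queue.Queue()
    for idx, batch in enumerate(batches):
        work.put((idx, batch))
    results = [None] * len(batches)

    def worker(device):
        while True:
            try:
                idx, batch = work.get_nowait()
            except queue.Empty:
                return
            results[idx] = embed_fn(device, batch)  # each slot is written exactly once

    threads = [threading.Thread(target=worker, args=(d,)) for d in devices]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

texts = ["aaaa", "b", "ccc", "dd", "eeeee"]

def fake_embed(device, idxs):
    # Stand-in for inference: returns one number per text instead of a vector.
    return [len(texts[i]) for i in idxs]

batches = length_sorted_batches(texts, batch_size=2)
results = run_workers(batches, ["GPU", "NPU", "CPU"], fake_embed)
```

A shared queue lets faster devices take more batches automatically, which is one simple way to get the combined rate close to the sum of the per-device rates.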

## License & training data

Released under the **MIT license**.

The teacher (`nomic-ai/CodeRankEmbed`) is MIT, and the XLM-R architecture is MIT. As is standard
practice for distilled embedding models, the **weights are released under MIT**. For transparency,
the training corpus the teacher embedded includes:

| Dataset | License note |
|---|---|
| `Fsoft-AIC/the-vault-function` (code) | dataset MIT; underlying code has mixed upstream provenance |
| `unicamp-dl/mmarco` (EN/RU retrieval) | **MS MARCO-derived → non-commercial research terms** |
| `sentence-transformers/all-nli` | SNLI (CC BY-SA 4.0) + MultiNLI |
| `sentence-transformers/gooaq` | Apache-2.0 |
| `jinaai/negation-dataset` | see source repo |

⚠️ If your use requires strict training-data-license compliance, note that **mMARCO derives from
MS MARCO (non-commercial)**. Whether a distilled model inherits dataset-use terms is legally
unsettled; this is **not legal advice**. A data-clean variant can be retrained without the mMARCO
splits if needed.

## Attribution

Distilled from **[nomic-ai/CodeRankEmbed](https://huggingface.co/nomic-ai/CodeRankEmbed)** (MIT). Backbone: XLM-RoBERTa (MIT).