---
license: mit
language:
- code
- multilingual
tags:
- code
- code-search
- code-retrieval
- embeddings
- feature-extraction
- sentence-similarity
- knowledge-distillation
pipeline_tag: feature-extraction
base_model:
- nomic-ai/CodeRankEmbed
datasets:
- Fsoft-AIC/the-vault-function
- unicamp-dl/mmarco
- sentence-transformers/all-nli
- sentence-transformers/gooaq
- jinaai/negation-dataset
---

# code-daemon-embed-v1

A small, fast **code embedding model** purpose-built to vectorize a **code graph** (functions, methods,
doc-chunks) for on-device semantic code search. It ships with the
[UltraCode](https://github.com/faxenoff/ultracode) MCP server, running as a TensorRT / TVM / OpenVINO / ONNX engine.

It is **deliberately specialized for short code units, not long documents**: long-text handling was
intentionally dropped (max sequence **128 tokens**) to maximize embedding throughput. Code-graph nodes
are short (entity names, signatures, doc-chunks), so a long-context path would cost capacity and
latency without ever being exercised.

- **768-dim** embeddings, **Matryoshka (MRL)** truncatable to **512 / 256** with graceful decay.
- **~54.5M params** — XLM-RoBERTa architecture, **4 layers / 768 hidden**, **code-only 32k SentencePiece vocab**.
- **Mean pooling** baked into the graph — output is already pooled (`[batch, 768]`); just **L2-normalize**.
- Trained at sequence length **128**; length buckets s/m/l = seq **40 / 64 / 128**.

## How it was made

Knowledge-distilled (embedding regression) from the teacher **[`nomic-ai/CodeRankEmbed`](https://huggingface.co/nomic-ai/CodeRankEmbed)**
(MIT, 137M, a strong code retriever). The student is a fresh, shallow-wide XLM-R encoder trained
from scratch on the teacher's passage embeddings over a ~32M-sample code + text corpus, with a
custom 32k code-oriented SentencePiece vocabulary (syntax + identifier lexicon rather than prose).
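
The embedding-regression idea can be illustrated with a toy loss. The card does not state the exact objective, so the sketch below is an assumption: it combines a mean-squared-error term with a cosine term, and the function names and the `alpha` weight are hypothetical.

```python
import math

def mse(student, teacher):
    """Mean squared error between two equal-length vectors."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def distill_loss(student, teacher, alpha=0.5):
    """Assumed objective: alpha * MSE + (1 - alpha) * (1 - cosine)."""
    return alpha * mse(student, teacher) + (1 - alpha) * (1.0 - cosine(student, teacher))

teacher_vec = [0.6, 0.8, 0.0]                           # stands in for a teacher passage embedding
perfect = distill_loss([0.6, 0.8, 0.0], teacher_vec)    # student matches the teacher
worse = distill_loss([0.0, 0.8, 0.6], teacher_vec)      # student points elsewhere
```

The loss is (numerically) zero when the student reproduces the teacher vector and grows as the two diverge, which is all a regression-style distillation needs from its objective.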

Why shallow-wide (4 layers / 768 hidden) plus a code vocabulary: on an internal code-search golden set
this configuration **beat** both a deeper 6-layer variant and the earlier 64k prose-vocab cut. Depth
hurt; a code-tuned vocabulary and a wide body helped.

## Built for speed

This model trades long-context capability for raw throughput on short code units:

- **Short context by design** — max **128 tokens**, no long-document path. Code-graph nodes are short
  (entity names, signatures, doc-chunks), so the model and its engines are tuned only for that, avoiding
  the cost of a wide dynamic shape range.
- **Rectangular TensorRT profiles** — each length bucket is built with a *fixed* shape (min == opt == max),
  not a dynamic range, so the autotuner locks one optimal kernel set per bucket:
  **s** = batch 64 × seq 40 · **m** = batch 128 × seq 64 · **l** = batch 256 × seq 128.
- **INT8 (W8A16)** weights; **mean-pool + projection + L2-norm fused into the graph** (one pass → `[B, 768]`).
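
To make the fixed-shape buckets concrete, here is a minimal sketch of routing an input to a bucket and padding a partial batch to that bucket's rectangle. The bucket table comes from the list above; the function names, the truncation policy, and the all-pad filler rows are assumptions, not the daemon's actual routing code.

```python
# Bucket table copied from the list above: name -> (batch, seq).
BUCKETS = {"s": (64, 40), "m": (128, 64), "l": (256, 128)}
PAD_ID = 0  # pad id of the bundled tokenizer

def pick_bucket(n_tokens):
    """Smallest bucket whose sequence length fits; longer inputs fall into 'l'."""
    for name, (_, seq) in BUCKETS.items():
        if n_tokens <= seq:
            return name
    return "l"

def to_rectangle(batch_ids, bucket):
    """Pad (or naively truncate) token-id lists to the bucket's fixed [batch, seq] shape."""
    batch, seq = BUCKETS[bucket]
    if len(batch_ids) > batch:
        raise ValueError("too many rows for this bucket")
    rows = []
    for ids in batch_ids:
        clipped = ids[:seq]  # naive cut; a real pipeline would keep the eos token
        rows.append(clipped + [PAD_ID] * (seq - len(clipped)))
    # Filler rows complete the rectangle; their outputs must be discarded.
    rows += [[PAD_ID] * seq for _ in range(batch - len(rows))]
    return rows

rect = to_rectangle([[2, 5, 3]], pick_bucket(3))
```

Because every call to a given bucket has the same shape, the engine never has to handle a shape it was not tuned for; the price is wasted compute on padding when a batch is only partly filled.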

## Intended use

- **Semantic code search / code retrieval**, and general (multilingual) text retrieval as a fallback.
- Embed **queries and documents the same way**, with no instruction prefix: the student was distilled on
  passage embeddings, unlike the teacher, whose prefix is query-only. The graph output is already
  mean-pooled, so only **L2-normalize** it.
- For smaller indexes, truncate to **256** or **512** dims (MRL) before normalizing.
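
As a rough guide to what MRL truncation saves, raw vector storage scales linearly with the dimension. The sketch below assumes float32 components and a hypothetical index of one million vectors, and ignores any index-structure overhead.

```python
def index_bytes(n_vectors, dim, bits_per_component=32):
    """Raw vector storage in bytes, ignoring index overhead."""
    return n_vectors * dim * bits_per_component // 8

N = 1_000_000  # hypothetical index size
sizes = {dim: index_bytes(N, dim) for dim in (768, 512, 256)}
```

For one million vectors this works out to about 3.07 GB at 768 dims, 2.05 GB at 512 and 1.02 GB at 256.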

The daemon runs the bundled engines directly (this repo is its CDN), but the FP32 `model.onnx` is
**also bundled** for standalone use. The recipe below runs it with `onnxruntime`: tokenize with the
bundled `sentencepiece.bpe.model`, run the session, and L2-normalize the output, which is already
mean-pooled to `[B, 768]`:

```python
import numpy as np
import onnxruntime as ort
import sentencepiece as spm

sp   = spm.SentencePieceProcessor(model_file="sentencepiece.bpe.model")  # pad=0 unk=1 bos=2 eos=3
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

def embed(texts, max_len=128, mrl_dim=768):
    """Return L2-normalized embeddings of shape [len(texts), mrl_dim]."""
    ids  = [[2, *sp.encode(t)[: max_len - 2], 3] for t in texts]          # bos … eos
    L    = max(len(x) for x in ids)
    inp  = np.array([x + [0] * (L - len(x)) for x in ids], dtype=np.int64) # pad=0
    mask = (inp != 0).astype(np.int64)
    out  = sess.run(None, {"input_ids": inp, "attention_mask": mask})[0]   # already mean-pooled [B,768]
    out  = out[:, :mrl_dim]                                                # MRL truncation (768/512/256)
    norm = np.linalg.norm(out, axis=1, keepdims=True)
    return out / np.maximum(norm, 1e-12)                                   # guard against a zero vector
```
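
Once vectors are L2-normalized, the dot product equals cosine similarity, so retrieval reduces to a sorted dot-product scan. The sketch below uses toy two-dimensional unit vectors in place of real `embed(...)` output; `top_k` is a hypothetical helper, and a large index would normally use a vector store rather than a linear scan.

```python
def top_k(query_vec, doc_vecs, k=3):
    """Rank documents by dot product; equals cosine similarity for unit vectors."""
    scores = [sum(q * d for q, d in zip(query_vec, vec)) for vec in doc_vecs]
    order = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:k]]

# Toy unit vectors standing in for embed(corpus) and embed([query])[0].
docs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
query = [0.8, 0.6]
hits = top_k(query, docs, k=2)  # list of (document index, score), best first
```

Here document 1 ranks first (score 0.96) and document 0 second (score 0.8).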

## What's in this repo — ready-to-run compiled engines

This repo holds **pre-compiled, ready-to-run engines**, named per
**runtime × GPU arch × OS × length-bucket** — grab the compiled model that matches your runtime and
hardware and use it directly, with no compilation on your machine.

- **TensorRT** `*.engine` — NVIDIA, INT8 W8A16, per arch × OS × bucket:
  `code-daemon-embed-v1-{s,m,l}_{win_x64,linux_x64}_trt_sm_{86,89,120}.engine`
  (sm_86 ≈ RTX 30xx / A-series · sm_89 ≈ RTX 40xx / L4 · sm_120 ≈ RTX 50xx).
- **TVM** `*_tvm_vulkan.{dll,so}` — Vulkan fallback for non-TRT / older NVIDIA & other GPUs, per bucket.
- **OpenVINO** `*.xml` + `*.bin` — Intel **CPU / iGPU / NPU**, per bucket.
- **Metal** `*_tvm_metal.*` — Apple Silicon (macOS), per bucket.
- **Tokenizer** — `sentencepiece.bpe.model` (the model's SentencePiece; specials baked at
  pad=0 / unk=1 / bos=2 / eos=3, byte-fallback) + `tokenizer_config.json`. The daemon loads the SP directly.
- **ONNX source** — `model.onnx` (+ `model.onnx.data`) FP32 and `model_int8qdt.onnx` (INT8 W8A16) — for
  standalone `onnxruntime` / optimum use, and the source the engines are compiled from.
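
The TensorRT naming pattern above can be expanded mechanically. The helper below is a hypothetical convenience, not part of the repo: it only assembles a file name from the bucket, OS tag and `sm_` value listed above, and rejects combinations that are not listed.

```python
import platform

BUCKETS = ("s", "m", "l")
TRT_ARCHS = (86, 89, 120)                                   # sm_ values listed above
OS_TAGS = {"Windows": "win_x64", "Linux": "linux_x64"}      # platform.system() -> tag

def trt_engine_name(bucket, sm, os_tag):
    """Build the TensorRT engine file name for one bucket / OS / GPU arch."""
    if bucket not in BUCKETS:
        raise ValueError(f"unknown bucket: {bucket!r}")
    if sm not in TRT_ARCHS:
        raise ValueError(f"no engine listed for sm_{sm}")
    if os_tag not in OS_TAGS.values():
        raise ValueError(f"unknown OS tag: {os_tag!r}")
    return f"code-daemon-embed-v1-{bucket}_{os_tag}_trt_sm_{sm}.engine"

def local_os_tag():
    """OS tag for the current machine, or None if no TensorRT build is listed for it."""
    return OS_TAGS.get(platform.system())

name = trt_engine_name("m", 89, "linux_x64")
```

A caller would fall back to the TVM, OpenVINO or ONNX files when `local_os_tag()` returns `None` or the GPU architecture is not in the list.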

## Evaluation — in-scope CoIR (sub-CoIR)

CoIR is a broad code-retrieval benchmark, but **4 of its 10 tasks are out of scope** for a code-graph
search engine (code↔code translation, multi-turn dialogue, long problem-statements — the daemon never
performs these). The honest, relevant view is the **in-scope subset** — the retrieval patterns this
model is actually built for (NDCG@10, full corpora):

| CoIR task (in-scope) | NDCG@10 | Pattern |
|---|--:|---|
| codesearchnet (6-lang avg) | **74.64** | docstring / NL → code (the core path) |
| stackoverflow-qa | 53.18 | short question → code |
| synthetic-text2sql | 50.15 | NL → SQL |
| codefeedback-st | 47.71 | NL instruction → code |
| codesearchnet-ccr (6-lang avg) | 44.30 | code → related code (clone/dup) |
| cosqa | 32.14 | NL question → code (noisy / hard) |
| **In-scope average (sub-CoIR)** | **51.56** | |

codesearchnet per language (NL→code): python **91.96**, go 82.27, java 76.02, php 68.98, ruby 65.94, js 62.66.
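
For readers unfamiliar with the metric, the sketch below shows a standard NDCG@k computation for one query and recomputes the 6-language codesearchnet average from the per-language scores above. The `ndcg_at_k` helper is a generic textbook form, not the CoIR evaluation harness, and it assumes the input list covers every relevant document for the query.

```python
import math

def ndcg_at_k(ranked_relevance, k=10):
    """NDCG@k for one query, given relevance labels in ranked order (best rank first)."""
    def dcg(labels):
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(labels[:k]))
    ideal = dcg(sorted(ranked_relevance, reverse=True))
    return dcg(ranked_relevance) / ideal if ideal > 0 else 0.0

# Per-language codesearchnet scores from the line above.
per_language = {"python": 91.96, "go": 82.27, "java": 76.02,
                "php": 68.98, "ruby": 65.94, "js": 62.66}
csn_avg = round(sum(per_language.values()) / len(per_language), 2)

first_hit = ndcg_at_k([1, 0, 0])   # relevant item at rank 1
second_hit = ndcg_at_k([0, 1, 0])  # relevant item at rank 2
```

The unweighted mean of the six language scores reproduces the 74.64 reported in the table, and a single relevant result slipping from rank 1 to rank 2 drops a query's NDCG from 1.0 to about 0.63.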

> The full 10-task official CoIR average (36.67) is dragged down by the 4 out-of-scope tasks and is not
> representative of the real query mix. For scale, the 1.5B-class `bge-code-v1` scores 81.77 on full
> CoIR — this is a **54.5M** model (27× smaller) tuned for one job.

On the daemon's own `search-gold` golden set (its real query distribution): **hit@5 0.692** — +80% over
the retired v1.1 cut (0.385). Binary (1-bit) vectors retain ~91% of float NDCG before rescore.
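
The binary-then-rescore pattern mentioned above can be sketched as follows. The card does not say how the daemon binarizes vectors, so sign thresholding, the function names and the shortlist size are all assumptions for illustration.

```python
def binarize(vec):
    """Pack the sign of each component into an int bitmask (1 bit per dimension)."""
    bits = 0
    for i, x in enumerate(vec):
        if x > 0:
            bits |= 1 << i
    return bits

def hamming(a, b):
    """Number of differing bits between two bitmasks."""
    return bin(a ^ b).count("1")

def search(query, docs, shortlist=2, k=1):
    """Shortlist by Hamming distance on 1-bit codes, then rescore with float dot products."""
    q_bits = binarize(query)
    codes = [binarize(d) for d in docs]
    candidates = sorted(range(len(docs)), key=lambda i: hamming(q_bits, codes[i]))[:shortlist]

    def dot(i):
        return sum(q * d for q, d in zip(query, docs[i]))

    return sorted(candidates, key=dot, reverse=True)[:k]

query = [0.5, -0.5, 0.5, 0.5]
docs = [[0.5, -0.5, 0.5, 0.5], [-0.5, 0.5, -0.5, -0.5], [0.1, -0.9, 0.3, 0.3]]
best = search(query, docs)
```

Documents 0 and 2 share the query's sign pattern and tie at Hamming distance 0; the float rescore then separates them, which is the role of the rescore step.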

## Performance (embeddings / sec)

| Backend | Hardware | Throughput |
|---|---|--:|
| TensorRT INT8 | NVIDIA RTX 5060 (sm_120) | **~20,000 emb/s** |
| OpenVINO INT4 | Intel iGPU (Xe2, Lunar Lake) | ~580 emb/s |
| OpenVINO INT4 | Intel NPU (NPU4) | ~574 emb/s |
| OpenVINO INT8 | Intel CPU (Core Ultra) | ~375 emb/s |
| OpenVINO — **all 3 in parallel** | iGPU + NPU + CPU concurrently | ~1,290 emb/s |

The combined figure is **genuine concurrent multi-device execution**: three independent workers, one
bound to each of the iGPU, NPU and CPU, embed different batches **at the same time**. The measured
~1,290 emb/s is about 84% of the 1,529 emb/s sum of the three standalone figures, so the devices
combine well but not perfectly. This is **not** OpenVINO's `AUTO` mode (which by default selects a
*single* device per inference); the daemon length-sorts inputs and fans the buckets across all three
devices. The TensorRT figure is inference throughput on the bucketed batch path; the OpenVINO figures
were measured on a Core Ultra (Lunar Lake) laptop.
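
The one-worker-per-device pattern can be sketched with plain threads pulling from a shared queue. This is an illustration, not the daemon's scheduler: `fake_embed` is a placeholder for a real per-device inference call, the device labels are just strings, and the shared queue stands in for the daemon's length-sorted bucket fan-out. It also assumes the real inference call releases the GIL, as native runtimes typically do.

```python
import queue
import threading

def fake_embed(device, batch):
    """Placeholder for a real per-device inference call; returns one dummy value per text."""
    return [len(text) for text in batch]

def run_workers(batches, devices):
    """Embed batches concurrently with one worker thread per device.

    Returns {batch_index: (device, outputs)}.
    """
    work = queue.Queue()
    for item in enumerate(batches):
        work.put(item)
    results = {}
    lock = threading.Lock()

    def worker(device):
        while True:
            try:
                idx, batch = work.get_nowait()
            except queue.Empty:
                return  # queue drained, worker exits
            outputs = fake_embed(device, batch)
            with lock:
                results[idx] = (device, outputs)

    threads = [threading.Thread(target=worker, args=(d,)) for d in devices]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

batches = [["def a(): pass"], ["class B:", "x = 1"], ["return 0"], ["import os"]]
results = run_workers(batches, devices=("GPU", "NPU", "CPU"))
```

Each worker takes the next batch as soon as it is free, so a faster device naturally processes more batches than a slower one.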

## License & training data

Released under the **MIT license**.

The teacher (`nomic-ai/CodeRankEmbed`) is MIT, and the XLM-R architecture is MIT. As is standard
practice for distilled embedding models, the **weights are released under MIT**. For transparency,
the training corpus the teacher embedded includes:

| Dataset | License note |
|---|---|
| `Fsoft-AIC/the-vault-function` (code) | dataset MIT; underlying code has mixed upstream provenance |
| `unicamp-dl/mmarco` (EN/RU retrieval) | **MS MARCO-derived → non-commercial research terms** |
| `sentence-transformers/all-nli` | SNLI (CC BY-SA 4.0) + MultiNLI |
| `sentence-transformers/gooaq` | Apache-2.0 |
| `jinaai/negation-dataset` | see source repo |

⚠️ If your use requires strict training-data-license compliance, note that **mMARCO derives from
MS MARCO (non-commercial)**. Whether a distilled model inherits dataset-use terms is legally
unsettled; this is **not legal advice**. A data-clean variant can be retrained without the mMARCO
splits if needed.

## Attribution

Distilled from **[nomic-ai/CodeRankEmbed](https://huggingface.co/nomic-ai/CodeRankEmbed)** (MIT). Backbone: XLM-RoBERTa (MIT).