---
license: mit
language:
- code
- multilingual
tags:
- code
- code-search
- code-retrieval
- embeddings
- feature-extraction
- sentence-similarity
- knowledge-distillation
pipeline_tag: feature-extraction
base_model:
- nomic-ai/CodeRankEmbed
datasets:
- Fsoft-AIC/the-vault-function
- unicamp-dl/mmarco
- sentence-transformers/all-nli
- sentence-transformers/gooaq
- jinaai/negation-dataset
---

# code-daemon-embed-v1

A small, fast **code embedding model** purpose-built to vectorize a **code graph** (functions, methods, doc-chunks) for on-device semantic code search. It ships with the [UltraCode](https://github.com/faxenoff/ultracode) MCP server, running as a TensorRT / TVM / OpenVINO / ONNX engine.

It is **deliberately specialized for short code units, not long documents**: long-text handling was intentionally dropped (max sequence **128 tokens**) to maximize embedding throughput. Code-graph nodes are short (entity names, signatures, doc-chunks), so spending capacity and latency on a long-context path the daemon never exercises would only slow the hot path.

- **768-dim** embeddings, **Matryoshka (MRL)** truncatable to **512 / 256** with graceful quality decay.
- **~54.5M params**: XLM-RoBERTa architecture, **4 layers / 768 hidden**, **code-only 32k SentencePiece vocab**.
- **Mean pooling** baked into the graph: the output is already pooled (`[batch, 768]`); just **L2-normalize**.
- Trained at sequence length **128**; length buckets s/m/l = seq **40 / 64 / 128**.

## How it was made

Knowledge-distilled (embedding regression) from the teacher **[`nomic-ai/CodeRankEmbed`](https://huggingface.co/nomic-ai/CodeRankEmbed)** (MIT, 137M, a strong code retriever). The student is a fresh, shallow-wide XLM-R encoder trained from scratch on the teacher's passage embeddings over a ~32M-sample code + text corpus, with a custom 32k code-oriented SentencePiece vocabulary (syntax + identifier lexicon rather than prose).
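
The exact training objective is not spelled out in this card. As a rough illustration of what "embedding regression" against a teacher can look like, here is a minimal NumPy sketch. Everything in it is an assumption made for illustration: the function names `l2_normalize` and `distill_loss`, the cosine-distance regression term, and the similarity-matching term used to keep truncated (MRL) prefixes useful.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length; eps guards against all-zero rows."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def distill_loss(student, teacher, mrl_dims=(512, 256)):
    """Hypothetical distillation objective (not the card's confirmed loss).

    student, teacher: float arrays of shape [batch, dim] for the same inputs.
    """
    s, t = l2_normalize(student), l2_normalize(teacher)
    # Regression term: mean cosine distance between student and teacher rows.
    regression = np.mean(1.0 - np.sum(s * t, axis=1))
    # MRL term (assumed): pairwise similarities of each truncated student
    # prefix should match the teacher's full-dimension similarities.
    target_sim = t @ t.T
    mrl = 0.0
    for d in mrl_dims:
        sd = l2_normalize(student[:, :d])
        mrl += np.mean((sd @ sd.T - target_sim) ** 2)
    return float(regression + mrl / max(len(mrl_dims), 1))
```

With `mrl_dims=()` this reduces to plain cosine-distance regression, which is zero when the student reproduces the teacher exactly.
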

Why shallow-wide (4 layers / 768 hidden) plus a code vocab: on an internal code-search golden set, this configuration **beat** both a deeper 6-layer variant and the earlier 64k-prose-vocab cut. Depth hurt; a code-tuned vocab and a wide body helped.

## Built for speed

This model trades long-context capability for raw throughput on short code units:

- **Short context by design**: max **128 tokens**, no long-document path. Code-graph nodes are short (entity names, signatures, doc-chunks), so the model and its engines are tuned only for that, avoiding the cost of a wide dynamic shape range.
- **Rectangular TensorRT profiles**: each length bucket is built with a *fixed* shape (min == opt == max), not a dynamic range, so the autotuner locks one optimal kernel set per bucket: **s** = batch 64 × seq 40 · **m** = batch 128 × seq 64 · **l** = batch 256 × seq 128.
- **INT8 (W8A16)** weights; **mean-pool + projection + L2-norm fused into the graph** (one pass → `[B, 768]`).

## Intended use

- **Semantic code search / code retrieval**, and general (multilingual) text retrieval as a fallback.
- Embed **queries and documents the same way**, with no instruction prefix: the student was distilled on passage embeddings, unlike the teacher, whose prefix applies to queries only. Mean-pool → **L2-normalize**.
- For smaller indexes, truncate to **256** or **512** dims (MRL) before normalizing.

The daemon runs the bundled engines directly (this repo is its CDN), but the FP32 `model.onnx` is **also bundled** for standalone use.
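
The fixed bucket shapes listed under *Built for speed* imply a routing step before inference: each input goes to the smallest bucket that fits it and is padded to that bucket's sequence length. The daemon's actual routing code is not shown in this card, so the following is only a sketch under stated assumptions. The names `pick_bucket` and `to_bucket_batches` are hypothetical, and it assumes inputs are raw SentencePiece ids that still need `bos`/`eos` added.

```python
import numpy as np

# Sequence length per bucket and special-token ids, both taken from this card.
BUCKET_SEQ = {"s": 40, "m": 64, "l": 128}
PAD, BOS, EOS = 0, 2, 3

def pick_bucket(n_tokens):
    """Return the smallest bucket whose sequence length fits n_tokens."""
    for name, seq in BUCKET_SEQ.items():  # dict order is s, m, l (ascending)
        if n_tokens <= seq:
            return name
    return "l"  # anything longer is truncated into the largest bucket

def to_bucket_batches(token_lists):
    """Group raw token-id lists by bucket and pad each group to a fixed width."""
    groups = {name: [] for name in BUCKET_SEQ}
    for toks in token_lists:
        name = pick_bucket(len(toks) + 2)   # +2 accounts for bos and eos
        seq = BUCKET_SEQ[name]
        ids = [BOS, *toks[: seq - 2], EOS]  # truncate so bos/eos always fit
        groups[name].append(ids + [PAD] * (seq - len(ids)))
    # Only return buckets that received at least one input.
    return {n: np.array(rows, dtype=np.int64) for n, rows in groups.items() if rows}
```

A fixed-shape engine also expects a fixed batch size (64 / 128 / 256 per the profiles above); chunking each group to that size is left out of this sketch.
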
The recipe below runs `model.onnx` with `onnxruntime`: tokenize with the bundled `sentencepiece.bpe.model`, run the session, and L2-normalize the output, which is already pooled to `[B, 768]`:

```python
import numpy as np
import onnxruntime as ort
import sentencepiece as spm

# Special-token ids baked into the tokenizer: pad=0 unk=1 bos=2 eos=3
sp = spm.SentencePieceProcessor(model_file="sentencepiece.bpe.model")
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

def embed(texts, max_len=128, mrl_dim=768):
    # Wrap each sequence as bos … eos, truncating so the result fits in max_len
    ids = [[2, *sp.encode(t)[: max_len - 2], 3] for t in texts]
    L = max(len(x) for x in ids)
    inp = np.array([x + [0] * (L - len(x)) for x in ids], dtype=np.int64)  # right-pad with pad=0
    mask = (inp != 0).astype(np.int64)
    out = sess.run(None, {"input_ids": inp, "attention_mask": mask})[0]  # already mean-pooled [B, 768]
    out = out[:, :mrl_dim]  # MRL truncation (768/512/256)
    return out / np.linalg.norm(out, axis=1, keepdims=True)  # L2-normalize
```

## What's in this repo: ready-to-run compiled engines

This repo holds **pre-compiled, ready-to-run engines**, named per **runtime × GPU arch × OS × length-bucket**. Grab the compiled model that matches your runtime and hardware and use it directly, with no compilation on your machine.

- **TensorRT** `*.engine`: NVIDIA, INT8 W8A16, per arch × OS × bucket: `code-daemon-embed-v1-{s,m,l}_{win_x64,linux_x64}_trt_sm_{86,89,120}.engine` (sm_86 ≈ RTX 30xx / A-series · sm_89 ≈ RTX 40xx / L4 · sm_120 ≈ RTX 50xx).
- **TVM** `*_tvm_vulkan.{dll,so}`: Vulkan fallback for non-TRT / older NVIDIA & other GPUs, per bucket.
- **OpenVINO** `*.xml` + `*.bin`: Intel **CPU / iGPU / NPU**, per bucket.
- **Metal** `*_tvm_metal.*`: Apple Silicon (macOS), per bucket.
- **Tokenizer**: `sentencepiece.bpe.model` (the model's SentencePiece; specials baked at pad=0 / unk=1 / bos=2 / eos=3, byte-fallback) + `tokenizer_config.json`. The daemon loads the SentencePiece model directly.
- **ONNX source**: `model.onnx` (+ `model.onnx.data`) in FP32 and `model_int8qdt.onnx` (INT8 W8A16), for standalone `onnxruntime` / optimum use; these are also the source the engines are compiled from.

## Evaluation: in-scope CoIR (sub-CoIR)

CoIR is a broad code-retrieval benchmark, but **4 of its 10 tasks are out of scope** for a code-graph search engine (code↔code translation, multi-turn dialogue, long problem statements; the daemon never performs these). The relevant view is the **in-scope subset**, the retrieval patterns this model is actually built for (NDCG@10, full corpora):

| CoIR task (in-scope) | NDCG@10 | Pattern |
|---|--:|---|
| codesearchnet (6-lang avg) | **74.64** | docstring / NL → code (the core path) |
| stackoverflow-qa | 53.18 | short question → code |
| synthetic-text2sql | 50.15 | NL → SQL |
| codefeedback-st | 47.71 | NL instruction → code |
| codesearchnet-ccr (6-lang avg) | 44.30 | code → related code (clone/dup) |
| cosqa | 32.14 | NL question → code (noisy / hard) |
| **In-scope average (sub-CoIR)** | **51.56** | |

codesearchnet per language (NL→code): python **91.96**, go 82.27, java 76.02, php 68.98, ruby 65.94, js 62.66.

> The full 10-task official CoIR average (36.67) is dragged down by the 4 out-of-scope tasks and is not
> representative of the real query mix. For scale, the 1.5B-class `bge-code-v1` scores 81.77 on full
> CoIR; this is a **54.5M** model (27× smaller) tuned for one job.

On the daemon's own `search-gold` golden set (its real query distribution): **hit@5 0.692**, +80% over the retired v1.1 cut (0.385). Binary (1-bit) vectors retain ~91% of float NDCG before rescore.
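
The 1-bit figure above suggests a two-stage search: rank candidates cheaply on binary codes, then rescore a short list with the float vectors. How the daemon binarizes and rescores is not described in this card, so the sketch below is an assumption throughout. It uses sign thresholding at zero, Hamming distance for the first stage, and a dot-product rescore; the names `binarize`, `hamming_topk`, `search`, and the `oversample` parameter are hypothetical.

```python
import numpy as np

def binarize(vecs):
    """Pack the sign of each dimension into bits: [N, 768] floats -> [N, 96] uint8."""
    return np.packbits(vecs > 0, axis=1)

def hamming_topk(query_bits, doc_bits, k):
    """Indices of the k documents with the smallest Hamming distance to the query."""
    xor = np.bitwise_xor(doc_bits, query_bits)       # differing bits, byte-packed
    dist = np.unpackbits(xor, axis=1).sum(axis=1)    # count differing bits per doc
    return np.argsort(dist, kind="stable")[:k]

def search(query, docs, doc_bits, k=10, oversample=4):
    """Binary first stage, then float rescore of k * oversample candidates.

    query: [dim] L2-normalized float vector; docs: [N, dim] L2-normalized floats.
    """
    cand = hamming_topk(binarize(query[None, :]), doc_bits, k * oversample)
    scores = docs[cand] @ query                      # cosine similarity on unit vectors
    return cand[np.argsort(-scores)[:k]]
```

Storing only `doc_bits` cuts the index to 96 bytes per 768-dim vector; the float vectors are needed only for the rescore stage.
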
## Performance (embeddings / sec)

| Backend | Hardware | Throughput |
|---|---|--:|
| TensorRT INT8 | NVIDIA RTX 5060 (sm_120) | **~20,000 emb/s** |
| OpenVINO INT4 | Intel iGPU (Xe2, Lunar Lake) | ~580 emb/s |
| OpenVINO INT4 | Intel NPU (NPU4) | ~574 emb/s |
| OpenVINO INT8 | Intel CPU (Core Ultra) | ~375 emb/s |
| OpenVINO, **all 3 in parallel** | iGPU + NPU + CPU concurrently | ~1,290 emb/s |

The combined figure is **genuine concurrent multi-device execution**: three independent workers, one bound to each of the iGPU, NPU and CPU, embed different batches **at the same time**. The combined ~1,290 emb/s is about 84% of the sum of the three individual figures (~1,529 emb/s). This is **not** OpenVINO's `AUTO` mode (which by default selects a *single* device per inference rather than running the three simultaneously); the daemon length-sorts inputs and fans the buckets across all three devices.

The TensorRT figure is inference throughput on the bucketed batch path; the OpenVINO figures were measured on a Core Ultra (Lunar Lake) laptop.

## License & training data

Released under the **MIT license**. The teacher (`nomic-ai/CodeRankEmbed`) is MIT, and the XLM-R architecture is MIT. As is standard practice for distilled embedding models, the **weights are released under MIT**.

For transparency, the training corpus the teacher embedded includes:

| Dataset | License note |
|---|---|
| `Fsoft-AIC/the-vault-function` (code) | dataset MIT; underlying code has mixed upstream provenance |
| `unicamp-dl/mmarco` (EN/RU retrieval) | **MS MARCO-derived → non-commercial research terms** |
| `sentence-transformers/all-nli` | SNLI (CC BY-SA 4.0) + MultiNLI |
| `sentence-transformers/gooaq` | Apache-2.0 |
| `jinaai/negation-dataset` | see source repo |

⚠️ If your use requires strict training-data-license compliance, note that **mMARCO derives from MS MARCO (non-commercial)**. Whether a distilled model inherits dataset-use terms is legally unsettled; this is **not legal advice**. A data-clean variant can be retrained without the mMARCO splits if needed.
## Attribution

Distilled from **[nomic-ai/CodeRankEmbed](https://huggingface.co/nomic-ai/CodeRankEmbed)** (MIT). Backbone: XLM-RoBERTa (MIT).