| --- |
| license: mit |
| language: |
| - code |
| - multilingual |
| tags: |
| - code |
| - code-search |
| - code-retrieval |
| - embeddings |
| - feature-extraction |
| - sentence-similarity |
| - knowledge-distillation |
| pipeline_tag: feature-extraction |
| base_model: |
| - nomic-ai/CodeRankEmbed |
| datasets: |
| - Fsoft-AIC/the-vault-function |
| - unicamp-dl/mmarco |
| - sentence-transformers/all-nli |
| - sentence-transformers/gooaq |
| - jinaai/negation-dataset |
| --- |
| |
| # code-daemon-embed-v1 |
|
|
| A small, fast **code embedding model** purpose-built to vectorize a **code graph** (functions, methods, |
| doc-chunks) for on-device semantic code search. It ships with the |
| [UltraCode](https://github.com/faxenoff/ultracode) MCP server, running as a TensorRT / TVM / OpenVINO / ONNX engine. |
|
|
| It is **deliberately specialized for short code units, not long documents**: long-text handling was |
| intentionally dropped (max sequence **128 tokens**) to maximize embedding throughput. Code-graph nodes |
| are short (entity names, signatures, doc-chunks), so spending capacity and latency on a long-context |
| path that the hot path never uses would only slow it down. |
|
|
| - **768-dim** embeddings, **Matryoshka (MRL)** truncatable to **512 / 256** with graceful decay. |
| - **~54.5M params**: XLM-RoBERTa architecture, **4 layers / 768 hidden**, **code-only 32k SentencePiece vocab**. |
| - **Mean pooling** baked into the graph: the output is already pooled (`[batch, 768]`); just **L2-normalize**. |
| - Trained at sequence length **128**; length buckets s/m/l = seq **40 / 64 / 128**. |
|
|
| ## How it was made |
|
|
| Knowledge-distilled (embedding regression) from the teacher **[`nomic-ai/CodeRankEmbed`](https://huggingface.co/nomic-ai/CodeRankEmbed)** |
| (MIT, 137M, a strong code retriever). The student is a fresh, shallow-wide XLM-R encoder trained |
| from scratch on the teacher's passage embeddings over a ~32M-sample code + text corpus, with a |
| custom 32k code-oriented SentencePiece vocabulary (syntax + identifier lexicon rather than prose). |
|
|
| Why shallow-wide (4 layers / 768 hidden) plus a code vocab: on an internal code-search golden set this |
| configuration **beat** both a deeper 6-layer variant and the earlier 64k-prose-vocab cut. Depth hurt, |
| while a code-tuned vocab and a wide body helped. |
|
|
| ## Built for speed |
|
|
| This model trades long-context capability for raw throughput on short code units: |
|
|
| - **Short context by design**: max **128 tokens**, no long-document path. Code-graph nodes are short |
| (entity names, signatures, doc-chunks), so the model and its engines are tuned only for that, avoiding |
| the cost of a wide dynamic shape range. |
| - **Rectangular TensorRT profiles**: each length bucket is built with a *fixed* shape (min == opt == max), |
| not a dynamic range, so the autotuner locks one optimal kernel set per bucket: |
| **s** = batch 64 × seq 40 · **m** = batch 128 × seq 64 · **l** = batch 256 × seq 128. |
| - **INT8 (W8A16)** weights; **mean-pool + projection + L2-norm fused into the graph** (one pass → `[B, 768]`). |
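The bucket scheme above can be sketched as a small router. This is an illustration only, not the daemon's code: `BUCKETS`, `pick_bucket` and `group_by_bucket` are hypothetical names, and the sketch assumes each input goes to the smallest bucket whose sequence length fits its token count, with anything longer than 128 tokens truncated into **l**.

```python
from collections import defaultdict

# (name, sequence length, batch size) copied from the profile list above.
BUCKETS = [("s", 40, 64), ("m", 64, 128), ("l", 128, 256)]

def pick_bucket(n_tokens):
    """Return the smallest bucket whose sequence length fits n_tokens."""
    for name, seq_len, _batch in BUCKETS:
        if n_tokens <= seq_len:
            return name
    return BUCKETS[-1][0]  # assumption: over-long inputs are truncated into "l"

def group_by_bucket(token_counts):
    """Map bucket name -> list of input indices, shortest inputs first."""
    groups = defaultdict(list)
    order = sorted(range(len(token_counts)), key=lambda i: token_counts[i])
    for idx in order:
        groups[pick_bucket(token_counts[idx])].append(idx)
    return dict(groups)
```

Grouping by bucket before batching keeps padding low: a 12-token signature is padded to 40 rather than to 128.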
|
|
| ## Intended use |
|
|
| - **Semantic code search / code retrieval**, and general (multilingual) text retrieval as a fallback. |
| - Embed **queries and documents the same way** (no instruction prefix: the student was distilled on |
| passage embeddings, unlike the teacher, whose prefix is query-only). Mean-pool → **L2-normalize**. |
| - For smaller indexes, truncate to **256** or **512** dims (MRL) before normalizing. |
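A minimal sketch of the truncate-then-normalize step and the resulting cosine ranking. The helper names `truncate_and_normalize` and `top_k` are hypothetical, and random vectors stand in for real model output so the snippet runs without the model files.

```python
import numpy as np

def truncate_and_normalize(emb, dim=256):
    """Keep the first `dim` MRL components, then L2-normalize each row."""
    if dim not in (768, 512, 256):
        raise ValueError("dim must be 768, 512 or 256")
    cut = np.asarray(emb, dtype=np.float32)[:, :dim]
    norms = np.linalg.norm(cut, axis=1, keepdims=True)
    return cut / np.maximum(norms, 1e-12)  # guard against all-zero rows

def top_k(query_vec, doc_vecs, k=5):
    """Rank documents by cosine similarity (a dot product on unit vectors)."""
    scores = doc_vecs @ query_vec
    order = np.argsort(-scores)[:k]
    return order, scores[order]

# Stand-in data: random vectors instead of real pooled model output.
rng = np.random.default_rng(0)
docs = truncate_and_normalize(rng.normal(size=(10, 768)), dim=256)
query = docs[3]  # a query identical to document 3
idx, scores = top_k(query, docs, k=3)
```

Normalizing after truncation matters: the first 256 components of a unit-length 768-dim vector are generally not unit length themselves, so dot products on them would no longer be cosine similarities.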
|
|
| The daemon runs the bundled engines directly (this repo is its CDN), but the FP32 `model.onnx` is |
| **also bundled** for standalone use. The recipe below runs it with `onnxruntime`: tokenize with the |
| bundled `sentencepiece.bpe.model` and run the session. The output is already pooled to `[B,768]`, so |
| only L2-normalization remains: |
|
|
| ```python |
| import numpy as np |
| import onnxruntime as ort |
| import sentencepiece as spm |
| |
| sp = spm.SentencePieceProcessor(model_file="sentencepiece.bpe.model")  # pad=0 unk=1 bos=2 eos=3 |
| sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"]) |
| |
| def embed(texts, max_len=128, mrl_dim=768): |
|     # Wrap each sequence as bos ... eos, truncating so the total stays within max_len. |
|     ids = [[2, *sp.encode(t)[: max_len - 2], 3] for t in texts] |
|     L = max(len(x) for x in ids) |
|     inp = np.array([x + [0] * (L - len(x)) for x in ids], dtype=np.int64)  # right-pad with pad=0 |
|     mask = (inp != 0).astype(np.int64) |
|     out = sess.run(None, {"input_ids": inp, "attention_mask": mask})[0]  # already mean-pooled [B,768] |
|     out = out[:, :mrl_dim]  # MRL truncation (768/512/256) |
|     return out / np.linalg.norm(out, axis=1, keepdims=True)  # L2-normalize each row |
| ``` |
|
|
| ## What's in this repo: ready-to-run compiled engines |
|
|
| This repo holds **pre-compiled, ready-to-run engines**, named per |
| **runtime × GPU arch × OS × length-bucket**. Grab the compiled model that matches your runtime and |
| hardware and use it directly, with no compilation on your machine. |
|
|
| - **TensorRT** `*.engine`: NVIDIA, INT8 W8A16, per arch × OS × bucket: |
| `code-daemon-embed-v1-{s,m,l}_{win_x64,linux_x64}_trt_sm_{86,89,120}.engine` |
| (sm_86: RTX 30xx / A-series · sm_89: RTX 40xx / L4 · sm_120: RTX 50xx). |
| - **TVM** `*_tvm_vulkan.{dll,so}`: Vulkan fallback for non-TRT / older NVIDIA & other GPUs, per bucket. |
| - **OpenVINO** `*.xml` + `*.bin`: Intel **CPU / iGPU / NPU**, per bucket. |
| - **Metal** `*_tvm_metal.*`: Apple Silicon (macOS), per bucket. |
| - **Tokenizer**: `sentencepiece.bpe.model` (the model's SentencePiece; specials baked at |
| pad=0 / unk=1 / bos=2 / eos=3, byte-fallback) + `tokenizer_config.json`. The daemon loads the SP directly. |
| - **ONNX source**: `model.onnx` (+ `model.onnx.data`) FP32 and `model_int8qdt.onnx` (INT8 W8A16), for |
| standalone `onnxruntime` / optimum use, and the source the engines are compiled from. |
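The TensorRT naming pattern above can be expanded programmatically. The helper below is a hypothetical sketch (`trt_engine_name` is not part of the daemon); it only assembles a file name from the three documented components and rejects values outside the sets listed in the pattern.

```python
VALID_BUCKETS = ("s", "m", "l")
VALID_PLATFORMS = ("win_x64", "linux_x64")
VALID_SM = (86, 89, 120)

def trt_engine_name(bucket, platform_tag, sm):
    """Build the TensorRT engine file name for one bucket / OS / GPU arch."""
    if bucket not in VALID_BUCKETS:
        raise ValueError(f"unknown bucket: {bucket!r}")
    if platform_tag not in VALID_PLATFORMS:
        raise ValueError(f"unknown platform tag: {platform_tag!r}")
    if sm not in VALID_SM:
        raise ValueError(f"no prebuilt engine listed for sm_{sm}")
    return f"code-daemon-embed-v1-{bucket}_{platform_tag}_trt_sm_{sm}.engine"

# Example: the medium bucket on Linux for an RTX 40xx-class GPU.
name = trt_engine_name("m", "linux_x64", 89)
```

An architecture outside the listed set raises an error here; in practice that case would fall through to one of the other runtimes in the list.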
|
|
| ## Evaluation: in-scope CoIR (sub-CoIR) |
|
|
| CoIR is a broad code-retrieval benchmark, but **4 of its 10 tasks are out of scope** for a code-graph |
| search engine (code→code translation, multi-turn dialogue, long problem statements; the daemon never |
| performs these). The honest, relevant view is the **in-scope subset**: the retrieval patterns this |
| model is actually built for. Scores are NDCG@10 on the full corpora: |
|
|
| | CoIR task (in-scope) | NDCG@10 | Pattern | |
| |---|--:|---| |
| | codesearchnet (6-lang avg) | **74.64** | docstring / NL → code (the core path) | |
| | stackoverflow-qa | 53.18 | short question → code | |
| | synthetic-text2sql | 50.15 | NL → SQL | |
| | codefeedback-st | 47.71 | NL instruction → code | |
| | codesearchnet-ccr (6-lang avg) | 44.30 | code → related code (clone/dup) | |
| | cosqa | 32.14 | NL question → code (noisy / hard) | |
| | **In-scope average (sub-CoIR)** | **51.56** | | |
|
|
| codesearchnet per language (NL→code): python **91.96**, go 82.27, java 76.02, php 68.98, ruby 65.94, js 62.66. |
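For readers unfamiliar with the metric, here is a minimal NDCG@k sketch for a single query. It is not the benchmark's evaluation code: `ndcg_at_k` is a hypothetical helper using linear gain, and it assumes the figures in the table are per-query values averaged over the task and multiplied by 100.

```python
import math

def ndcg_at_k(ranked_ids, relevance, k=10):
    """NDCG@k for one query.

    ranked_ids: document ids in the order the retriever returned them.
    relevance:  dict of doc id -> graded relevance (missing ids count as 0).
    """
    # Discounted gain of the returned ranking; rank 0 is discounted by log2(2) = 1.
    dcg = sum(relevance.get(doc, 0) / math.log2(rank + 2)
              for rank, doc in enumerate(ranked_ids[:k]))
    # Best achievable DCG: all judged documents sorted by relevance.
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

With a single relevant document, the score is 1.0 when it is ranked first and 1 / log2(3) ≈ 0.63 when it is ranked second, which is why small ranking shifts near the top move the metric so much.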
|
|
| > The full 10-task official CoIR average (36.67) is dragged down by the 4 out-of-scope tasks and is not |
| > representative of the real query mix. For scale, the 1.5B-class `bge-code-v1` scores 81.77 on full |
| > CoIR; this is a **54.5M** model (27× smaller) tuned for one job. |
|
|
| On the daemon's own `search-gold` golden set (its real query distribution): **hit@5 0.692**, +80% over |
| the retired v1.1 cut (0.385). Binary (1-bit) vectors retain ~91% of float NDCG before rescore. |
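The binary-vector path can be sketched as a two-stage search: shortlist by Hamming distance on 1-bit codes, then rescore the shortlist with the float vectors. How the daemon actually quantizes is not described here, so this sketch assumes simple sign binarization; `binarize`, `hamming` and `search` are hypothetical names.

```python
import numpy as np

def binarize(vecs):
    """1 bit per dimension: set when the component is positive, packed 8 per byte."""
    return np.packbits(np.asarray(vecs) > 0, axis=1)

def hamming(query_bits, doc_bits):
    """Number of differing bits between one packed query and each packed doc."""
    xor = np.bitwise_xor(doc_bits, query_bits)  # query row broadcasts over docs
    return np.unpackbits(xor, axis=1).sum(axis=1)

def search(query, docs, shortlist=4, k=2):
    """Shortlist by Hamming distance, then rescore the shortlist with float dot products."""
    dist = hamming(binarize(query[None, :]), binarize(docs))
    cand = np.argsort(dist, kind="stable")[:shortlist]
    scores = docs[cand] @ query
    return cand[np.argsort(-scores)[:k]]
```

A 768-dim float32 vector takes 3,072 bytes while its 1-bit code takes 96, so the shortlist stage touches 32× less memory per vector before the float rescore restores ranking precision.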
|
|
| ## Performance (embeddings / sec) |
|
|
| | Backend | Hardware | Throughput | |
| |---|---|--:| |
| | TensorRT INT8 | NVIDIA RTX 5060 (sm_120) | **~20,000 emb/s** | |
| | OpenVINO INT4 | Intel iGPU (Xe2, Lunar Lake) | ~580 emb/s | |
| | OpenVINO INT4 | Intel NPU (NPU4) | ~574 emb/s | |
| | OpenVINO INT8 | Intel CPU (Core Ultra) | ~375 emb/s | |
| | OpenVINO, **all 3 in parallel** | iGPU + NPU + CPU concurrently | ~1,290 emb/s | |
| |
| The combined figure is **genuine concurrent multi-device execution**: three independent workers, one |
| bound to each of the iGPU, NPU and CPU, embed different batches **at the same time**, and the |
| throughputs largely add up (the combined ~1,290 emb/s is about 84% of the ~1,529 emb/s sum of the three |
| standalone figures). This is **not** OpenVINO's `AUTO` mode (which selects a *single* device per |
| inference and never runs the three simultaneously); the daemon length-sorts inputs and fans the buckets |
| across all three devices. The TensorRT figure is inference throughput on the bucketed batch path; the |
| OpenVINO figures were measured on a Core Ultra (Lunar Lake) laptop. |
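The fan-out described above can be sketched with one thread per device pulling from a shared queue. This is an assumption-laden illustration, not the daemon's scheduler: `make_batches` and `run_workers` are hypothetical, the shared-queue policy is a guess, and the per-device embedders are plain callables standing in for models compiled for the iGPU, NPU and CPU.

```python
import queue
import threading

def make_batches(texts, batch_size):
    """Length-sort inputs (character length as a cheap proxy for tokens), then batch."""
    ordered = sorted(texts, key=len)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

def run_workers(batches, embedders):
    """Embed `batches` with one thread per device-bound embedder.

    embedders: dict of device name -> callable(batch) -> list of vectors.
    Returns results in batch order plus the device that served each batch.
    """
    work = queue.Queue()
    for item in enumerate(batches):
        work.put(item)
    results = [None] * len(batches)
    served_by = [None] * len(batches)

    def worker(device, embed):
        while True:
            try:
                idx, batch = work.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker is done
            results[idx] = embed(batch)  # each index is written by exactly one thread
            served_by[idx] = device

    threads = [threading.Thread(target=worker, args=pair) for pair in embedders.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, served_by
```

Because each worker takes the next batch as soon as it is free, faster devices naturally process more batches, which is one simple way to get per-device throughputs to add rather than wait on the slowest device.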
| |
| ## License & training data |
| |
| Released under the **MIT license**. |
| |
| The teacher (`nomic-ai/CodeRankEmbed`) is MIT, and the XLM-R architecture is MIT. As is standard |
| practice for distilled embedding models, the **weights are released under MIT**. For transparency, |
| the training corpus the teacher embedded includes: |
| |
| | Dataset | License note | |
| |---|---| |
| | `Fsoft-AIC/the-vault-function` (code) | dataset MIT; underlying code has mixed upstream provenance | |
| | `unicamp-dl/mmarco` (EN/RU retrieval) | **MS MARCO-derived; non-commercial research terms** | |
| | `sentence-transformers/all-nli` | SNLI (CC BY-SA 4.0) + MultiNLI | |
| | `sentence-transformers/gooaq` | Apache-2.0 | |
| | `jinaai/negation-dataset` | see source repo | |
| |
| ⚠️ If your use requires strict training-data-license compliance, note that **mMARCO derives from |
| MS MARCO (non-commercial)**. Whether a distilled model inherits dataset-use terms is legally |
| unsettled; this is **not legal advice**. A data-clean variant can be retrained without the mMARCO |
| splits if needed. |
| |
| ## Attribution |
| |
| Distilled from **[nomic-ai/CodeRankEmbed](https://huggingface.co/nomic-ai/CodeRankEmbed)** (MIT). Backbone: XLM-RoBERTa (MIT). |
| |