---
license: mit
language:
- code
- multilingual
tags:
- code
- code-search
- code-retrieval
- embeddings
- feature-extraction
- sentence-similarity
- knowledge-distillation
pipeline_tag: feature-extraction
base_model:
- nomic-ai/CodeRankEmbed
datasets:
- Fsoft-AIC/the-vault-function
- unicamp-dl/mmarco
- sentence-transformers/all-nli
- sentence-transformers/gooaq
- jinaai/negation-dataset
---
# code-daemon-embed-v1
A small, fast **code embedding model** purpose-built to vectorize a **code graph** (functions, methods,
doc-chunks) for on-device semantic code search. It ships with the
[UltraCode](https://github.com/faxenoff/ultracode) MCP server, running as a TensorRT / TVM / OpenVINO / ONNX engine.
It is **deliberately specialized for short code units, not long documents**: long-text handling was
intentionally dropped (max sequence **128 tokens**) to maximize embedding throughput. Code-graph nodes
are short (entity names, signatures, doc-chunks); spending capacity and latency on a long-context path
would only slow the hot path it never uses.
- **768-dim** embeddings, **Matryoshka (MRL)** truncatable to **512 / 256** with graceful decay.
- **~54.5M params**: XLM-RoBERTa architecture, **4 layers / 768 hidden**, **code-only 32k SentencePiece vocab**.
- **Mean pooling** baked into the graph: output is already pooled (`[batch, 768]`); just **L2-normalize**.
- Trained at sequence length **128**; length buckets s/m/l = seq **40 / 64 / 128**.
## How it was made
Knowledge-distilled (embedding regression) from the teacher **[`nomic-ai/CodeRankEmbed`](https://huggingface.co/nomic-ai/CodeRankEmbed)**
(MIT, 137M, a strong code retriever). The student is a fresh, shallow-wide XLM-R encoder trained
from scratch on the teacher's passage embeddings over a ~32M-sample code + text corpus, with a
custom 32k code-oriented SentencePiece vocabulary (syntax + identifier lexicon rather than prose).
Why shallow-wide (4 layers / 768 hidden) plus a code vocab: on an internal code-search golden set this
configuration **beat** both a deeper 6-layer variant and the earlier 64k-prose-vocab cut. Depth hurt;
a code-tuned vocab and a wide body helped.
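
As a rough illustration of what embedding-regression distillation with Matryoshka prefixes can look like, the sketch below scores a batch of student embeddings against the teacher's. The actual training loss is not stated in this card, so the cosine objective, the name `mrl_regression_loss`, and the equal weighting of the 768 / 512 / 256 prefixes are all assumptions.

```python
import numpy as np

def _l2norm(x, eps=1e-12):
    # Row-wise L2 normalization, guarded against zero vectors.
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def mrl_regression_loss(student, teacher, dims=(768, 512, 256)):
    """Mean (1 - cosine) between student and teacher over each prefix length.

    ASSUMPTION: an illustrative objective, not the documented training loss.
    student, teacher: float arrays of shape [batch, 768].
    """
    losses = []
    for d in dims:
        s = _l2norm(student[:, :d])
        t = _l2norm(teacher[:, :d])
        losses.append(np.mean(1.0 - np.sum(s * t, axis=1)))
    return float(np.mean(losses))
```

A student that reproduces the teacher exactly scores 0 under this objective, and one that points in the opposite direction scores 2.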
## Built for speed
This model trades long-context capability for raw throughput on short code units:
- **Short context by design**: max **128 tokens**, no long-document path. Code-graph nodes are short
(entity names, signatures, doc-chunks), so the model and its engines are tuned only for that, avoiding
the cost of a wide dynamic shape range.
- **Rectangular TensorRT profiles**: each length bucket is built with a *fixed* shape (min == opt == max),
not a dynamic range, so the autotuner locks one optimal kernel set per bucket:
**s** = batch 64 × seq 40 · **m** = batch 128 × seq 64 · **l** = batch 256 × seq 128.
- **INT8 (W8A16)** weights; **mean-pool + projection + L2-norm fused into the graph** (one pass → `[B, 768]`).
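
The bucket layout above can be sketched in a few lines of Python. The shapes come from the list; the helper names `pick_bucket` and `rectangular_profile`, and the "smallest bucket that fits" routing rule, are assumptions made for illustration rather than the daemon's actual code.

```python
# (batch, seq) per length bucket, as listed above.
BUCKETS = {"s": (64, 40), "m": (128, 64), "l": (256, 128)}

def pick_bucket(n_tokens):
    """Return the smallest bucket whose sequence length fits n_tokens.

    ASSUMPTION: inputs longer than 128 tokens are routed to "l" and are
    expected to be truncated by the caller.
    """
    for name, (_, seq) in BUCKETS.items():
        if n_tokens <= seq:
            return name
    return "l"

def rectangular_profile(name):
    """Return the (min, opt, max) shape triple for one bucket, all equal."""
    shape = BUCKETS[name]
    return shape, shape, shape
```

With the TensorRT Python API, a triple like this would typically be passed to `IOptimizationProfile.set_shape(input_name, min, opt, max)` for each input tensor. Because all three shapes are identical, the builder has only one shape to tune for.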
## Intended use
- **Semantic code search / code retrieval**, and general (multilingual) text retrieval as a fallback.
- Embed **queries and documents the same way** (no instruction prefix: the student was distilled on
passage embeddings, unlike the teacher whose prefix is query-only). Mean-pool → **L2-normalize**.
- For smaller indexes, truncate to **256** or **512** dims (MRL) before normalizing.
The daemon runs the bundled engines directly (this repo is its CDN), but the FP32 `model.onnx` is
**also bundled** for standalone use. The recipe below runs it with `onnxruntime`: tokenize with the
bundled `sentencepiece.bpe.model`, run, and the pooled `[B,768]` output is already produced, so just
L2-normalize:
```python
import numpy as np
import onnxruntime as ort
import sentencepiece as spm

# Special token ids baked into the tokenizer: pad=0 unk=1 bos=2 eos=3
sp = spm.SentencePieceProcessor(model_file="sentencepiece.bpe.model")
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

def embed(texts, max_len=128, mrl_dim=768):
    # Wrap each token sequence as bos … eos, truncated to max_len tokens in total.
    ids = [[2, *sp.encode(t)[: max_len - 2], 3] for t in texts]
    L = max(len(x) for x in ids)
    # Right-pad every sequence to the longest one in the batch (pad=0).
    inp = np.array([x + [0] * (L - len(x)) for x in ids], dtype=np.int64)
    mask = (inp != 0).astype(np.int64)
    # The graph output is already mean-pooled: [B, 768].
    out = sess.run(None, {"input_ids": inp, "attention_mask": mask})[0]
    out = out[:, :mrl_dim]  # MRL truncation (768/512/256)
    return out / np.linalg.norm(out, axis=1, keepdims=True)
```
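
Once vectors are L2-normalized, cosine similarity reduces to a dot product, so a small search step can be sketched as below. The `top_k` helper is hypothetical (it is not part of this repo) and assumes both the query and the documents were embedded and normalized the same way.

```python
import numpy as np

def top_k(query_vec, doc_vecs, k=5):
    """Rank L2-normalized document vectors by cosine similarity to the query.

    query_vec: shape [D]; doc_vecs: shape [N, D]; both assumed unit-length.
    Returns a list of (document index, score), best match first.
    """
    scores = doc_vecs @ query_vec  # dot product == cosine for unit vectors
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]
```

With an `embed` function like the one in the recipe, usage would look like `docs = embed(snippets)` followed by `top_k(embed([query])[0], docs)`.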
## What's in this repo: ready-to-run compiled engines
This repo holds **pre-compiled, ready-to-run engines**, named per
**runtime × GPU arch × OS × length-bucket**. Grab the compiled model that matches your runtime and
hardware and use it directly, with no compilation on your machine.
- **TensorRT** `*.engine`: NVIDIA, INT8 W8A16, per arch × OS × bucket:
`code-daemon-embed-v1-{s,m,l}_{win_x64,linux_x64}_trt_sm_{86,89,120}.engine`
(sm_86 ≈ RTX 30xx / A-series · sm_89 ≈ RTX 40xx / L4 · sm_120 ≈ RTX 50xx).
- **TVM** `*_tvm_vulkan.{dll,so}`: Vulkan fallback for non-TRT / older NVIDIA & other GPUs, per bucket.
- **OpenVINO** `*.xml` + `*.bin`: Intel **CPU / iGPU / NPU**, per bucket.
- **Metal** `*_tvm_metal.*`: Apple Silicon (macOS), per bucket.
- **Tokenizer**: `sentencepiece.bpe.model` (the model's SentencePiece; specials baked at
pad=0 / unk=1 / bos=2 / eos=3, byte-fallback) + `tokenizer_config.json`. The daemon loads the SP directly.
- **ONNX source**: `model.onnx` (+ `model.onnx.data`) FP32 and `model_int8qdt.onnx` (INT8 W8A16), for
standalone `onnxruntime` / optimum use, and the source the engines are compiled from.
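
To make the TensorRT naming pattern concrete, the sketch below builds a filename from its three variable parts. The function `trt_engine_name` is a hypothetical helper; only the pattern and the allowed values are taken from the list above.

```python
def trt_engine_name(bucket, os_tag, sm):
    """Build a TensorRT engine filename following the pattern listed above."""
    if bucket not in ("s", "m", "l"):
        raise ValueError(f"unknown length bucket: {bucket!r}")
    if os_tag not in ("win_x64", "linux_x64"):
        raise ValueError(f"unknown OS tag: {os_tag!r}")
    if sm not in (86, 89, 120):
        raise ValueError(f"no prebuilt engine listed for sm_{sm}")
    return f"code-daemon-embed-v1-{bucket}_{os_tag}_trt_sm_{sm}.engine"

# Example: the medium bucket for an sm_89 GPU on Linux.
name = trt_engine_name("m", "linux_x64", 89)
```

Rejecting unlisted architectures up front gives a caller a clear point at which to fall back to one of the other runtimes.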
## Evaluation: in-scope CoIR (sub-CoIR)
CoIR is a broad code-retrieval benchmark, but **4 of its 10 tasks are out of scope** for a code-graph
search engine (code↔code translation, multi-turn dialogue, long problem statements; the daemon never
performs these). The honest, relevant view is the **in-scope subset**: the retrieval patterns this
model is actually built for (NDCG@10, full corpora):
| CoIR task (in-scope) | NDCG@10 | Pattern |
|---|--:|---|
| codesearchnet (6-lang avg) | **74.64** | docstring / NL → code (the core path) |
| stackoverflow-qa | 53.18 | short question → code |
| synthetic-text2sql | 50.15 | NL → SQL |
| codefeedback-st | 47.71 | NL instruction → code |
| codesearchnet-ccr (6-lang avg) | 44.30 | code → related code (clone/dup) |
| cosqa | 32.14 | NL question → code (noisy / hard) |
| **In-scope average (sub-CoIR)** | **51.56** | |
codesearchnet per language (NL→code): python **91.96**, go 82.27, java 76.02, php 68.98, ruby 65.94, js 62.66.
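
For readers unfamiliar with the metric, here is a minimal sketch of NDCG@10 for a single query, assuming binary relevance labels (the benchmark harness may handle graded relevance and ties differently). The scores in the table are this quantity averaged over queries and multiplied by 100.

```python
import math

def ndcg_at_k(ranked_rels, n_relevant, k=10):
    """NDCG@k for one query with binary relevance.

    ranked_rels: 0/1 relevance of the retrieved items, in rank order.
    n_relevant:  number of relevant items in the corpus for this query.
    """
    # Discounted gain of the actual ranking (rank 1 has discount log2(2) = 1).
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(ranked_rels[:k]))
    # Best achievable DCG: all relevant items placed at the top.
    idcg = sum(1.0 / math.log2(i + 2) for i in range(min(n_relevant, k)))
    return dcg / idcg if idcg > 0 else 0.0
```

A single relevant item returned at rank 1 scores 1.0; the same item at rank 2 scores 1 / log2(3), about 0.63.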
> The full 10-task official CoIR average (36.67) is dragged down by the 4 out-of-scope tasks and is not
> representative of the real query mix. For scale, the 1.5B-class `bge-code-v1` scores 81.77 on full
> CoIR; this is a **54.5M** model (27× smaller) tuned for one job.

On the daemon's own `search-gold` golden set (its real query distribution): **hit@5 0.692**, +80% over
the retired v1.1 cut (0.385). Binary (1-bit) vectors retain ~91% of float NDCG before rescore.
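
The binary-vector path mentioned above can be sketched as follows. The card does not say how the bits are derived, so thresholding each dimension at zero is an assumption, and `binarize`, `hamming_rank` and `rescore` are hypothetical helper names.

```python
import numpy as np

def binarize(vecs):
    """Pack the sign of each dimension into bits: 768 floats -> 96 bytes.

    ASSUMPTION: a dimension maps to 1 when it is positive, else 0.
    """
    return np.packbits(vecs > 0, axis=1)

def hamming_rank(query_bits, doc_bits, k=10):
    """Indices of the k documents with the smallest Hamming distance."""
    xor = np.bitwise_xor(doc_bits, query_bits)      # [N, 96] uint8
    dist = np.unpackbits(xor, axis=1).sum(axis=1)   # differing bits per doc
    return np.argsort(dist, kind="stable")[:k]

def rescore(query_vec, doc_vecs, candidates):
    """Re-rank the binary candidates with float cosine (unit vectors assumed)."""
    scores = doc_vecs[candidates] @ query_vec
    return candidates[np.argsort(-scores)]
```

The usual pattern is to pull a generous candidate set with `hamming_rank` and then let `rescore` restore float-precision ordering among those candidates.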
## Performance (embeddings / sec)
| Backend | Hardware | Throughput |
|---|---|--:|
| TensorRT INT8 | NVIDIA RTX 5060 (sm_120) | **~20,000 emb/s** |
| OpenVINO INT4 | Intel iGPU (Xe2, Lunar Lake) | ~580 emb/s |
| OpenVINO INT4 | Intel NPU (NPU4) | ~574 emb/s |
| OpenVINO INT8 | Intel CPU (Core Ultra) | ~375 emb/s |
| OpenVINO, **all 3 in parallel** | iGPU + NPU + CPU concurrently | ~1,290 emb/s |
The combined figure is **genuine concurrent multi-device execution**: three independent workers, one
bound to each of the iGPU, NPU and CPU, embed different batches **at the same time**, so the combined
throughput (~1,290 emb/s) approaches the sum of the three individual figures (~1,529 emb/s). This is
**not** OpenVINO's `AUTO` mode (which selects a *single* device per inference and never runs the three
simultaneously); the daemon length-sorts inputs and fans the buckets across all three devices. The
TensorRT figure is inference throughput on the bucketed batch path; the OpenVINO figures were measured
on a Core Ultra (Lunar Lake) laptop.
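
The worker layout described above can be sketched with plain threads. This is an assumption-laden illustration, not the daemon's scheduler: the shared work queue, the `run_workers` name, and the stand-in device functions are all hypothetical. In a real setup each function would presumably wrap a model compiled for one target, for example via OpenVINO's `Core().compile_model(model, "GPU")`.

```python
import queue
import threading

def run_workers(batches, embed_fns):
    """Fan batches out to one thread per device.

    Each worker pulls the next pending batch as soon as it is free, so
    faster devices naturally process more batches.
    """
    todo = queue.Queue()
    for item in enumerate(batches):
        todo.put(item)
    results = [None] * len(batches)

    def worker(embed_fn):
        while True:
            try:
                idx, batch = todo.get_nowait()
            except queue.Empty:
                return
            results[idx] = embed_fn(batch)

    threads = [threading.Thread(target=worker, args=(fn,)) for fn in embed_fns]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

def make_fake_device(name):
    # Stand-in for a real per-device embed call; returns (device, text length).
    return lambda batch: [(name, len(text)) for text in batch]

# Length-sort the inputs (character length stands in for token length here),
# then cut them into batches before fanning them out.
texts = sorted(["def f(): pass", "x", "class Foo(Base):", "import os"], key=len)
batches = [texts[i:i + 2] for i in range(0, len(texts), 2)]
out = run_workers(batches, [make_fake_device(d) for d in ("GPU", "NPU", "CPU")])
```

Results are written back by batch index, so the output order matches the input order regardless of which device finished first.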
## License & training data
Released under the **MIT license**.
The teacher (`nomic-ai/CodeRankEmbed`) is MIT, and the XLM-R architecture is MIT. As is standard
practice for distilled embedding models, the **weights are released under MIT**. For transparency,
the training corpus the teacher embedded includes:
| Dataset | License note |
|---|---|
| `Fsoft-AIC/the-vault-function` (code) | dataset MIT; underlying code has mixed upstream provenance |
| `unicamp-dl/mmarco` (EN/RU retrieval) | **MS MARCO-derived β†’ non-commercial research terms** |
| `sentence-transformers/all-nli` | SNLI (CC BY-SA 4.0) + MultiNLI |
| `sentence-transformers/gooaq` | Apache-2.0 |
| `jinaai/negation-dataset` | see source repo |
⚠️ If your use requires strict training-data-license compliance, note that **mMARCO derives from
MS MARCO (non-commercial)**. Whether a distilled model inherits dataset-use terms is legally
unsettled; this is **not legal advice**. A data-clean variant can be retrained without the mMARCO
splits if needed.
## Attribution
Distilled from **[nomic-ai/CodeRankEmbed](https://huggingface.co/nomic-ai/CodeRankEmbed)** (MIT). Backbone: XLM-RoBERTa (MIT).