	---
	license: mit
	language:
	- code
	- multilingual
	tags:
	- code
	- code-search
	- code-retrieval
	- embeddings
	- feature-extraction
	- sentence-similarity
	- knowledge-distillation
	pipeline_tag: feature-extraction
	base_model:
	- nomic-ai/CodeRankEmbed
	datasets:
	- Fsoft-AIC/the-vault-function
	- unicamp-dl/mmarco
	- sentence-transformers/all-nli
	- sentence-transformers/gooaq
	- jinaai/negation-dataset
	---

	# code-daemon-embed-v1

	A small, fast code embedding model purpose-built to vectorize a code graph (functions, methods,
	doc-chunks) for on-device semantic code search. It ships with the
	[UltraCode](https://github.com/faxenoff/ultracode) MCP server, running as a TensorRT / TVM / OpenVINO / ONNX engine.

	It is deliberately specialized for short code units, not long documents: long-text handling was
	intentionally dropped (max sequence length 128 tokens) to maximize embedding throughput. Code-graph
	nodes are short (entity names, signatures, doc-chunks), so a long-context path would cost capacity
	and latency without ever being exercised.

	- 768-dim embeddings, Matryoshka (MRL) truncatable to 512 / 256 with graceful decay.
	- ~54.5M params — XLM-RoBERTa architecture, 4 layers / 768 hidden, code-only 32k SentencePiece vocab.
	- Mean pooling baked into the graph — output is already pooled (`[batch, 768]`); just L2-normalize.
	- Trained at sequence length 128; length buckets s/m/l = seq 40 / 64 / 128.

	## How it was made

	Knowledge-distilled (embedding regression) from the teacher [`nomic-ai/CodeRankEmbed`](https://huggingface.co/nomic-ai/CodeRankEmbed)
	(MIT, 137M, a strong code retriever). The student is a fresh, shallow-wide XLM-R encoder trained
	from scratch on the teacher's passage embeddings over a ~32M-sample code + text corpus, with a
	custom 32k code-oriented SentencePiece vocabulary (syntax + identifier lexicon rather than prose).
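
The embedding-regression objective described above can be sketched as follows. The exact loss is not stated in this card, so the squared-error term on L2-normalized vectors, and the choice to apply it at each Matryoshka prefix (768 / 512 / 256), are assumptions made for illustration; `distill_loss` and `l2_normalize` are hypothetical names.

```python
import numpy as np

# Prefix sizes from this card; applying the loss per prefix is an assumption.
MRL_DIMS = (768, 512, 256)


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length (eps guards against all-zero rows)."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def distill_loss(student, teacher, dims=MRL_DIMS):
    """Hypothetical embedding-regression loss: squared distance between
    normalized student and teacher prefixes, averaged over batch and dims."""
    losses = []
    for d in dims:
        s = l2_normalize(student[:, :d])
        t = l2_normalize(teacher[:, :d])
        losses.append(np.mean(np.sum((s - t) ** 2, axis=1)))
    return float(np.mean(losses))
```

For unit vectors the squared distance equals `2 - 2 * cosine`, so this loss is 0 when the student reproduces the teacher's direction and 4 when it points the opposite way.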

	Why shallow-wide (4l/768h) + code vocab: on an internal code-search golden set this beat both a
	deeper 6-layer variant and the earlier 64k-prose-vocab cut — depth hurt, a code-tuned vocab and a
	wide body helped.

	## Built for speed

	This model trades long-context capability for raw throughput on short code units:

	- Short context by design — max 128 tokens, no long-document path. Code-graph nodes are short
	(entity names, signatures, doc-chunks), so the model and its engines are tuned only for that, avoiding
	the cost of a wide dynamic shape range.
	- Rectangular TensorRT profiles — each length bucket is built with a fixed shape (min == opt == max),
	not a dynamic range, so the autotuner locks one optimal kernel set per bucket:
	s = batch 64 × seq 40 · m = batch 128 × seq 64 · l = batch 256 × seq 128.
	- INT8 (W8A16) weights; mean-pool + projection + L2-norm fused into the graph (one pass → `[B, 768]`).
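
As an illustration of the fixed-shape buckets listed above, the sketch below routes a tokenized input to the smallest bucket that fits and pads it to that bucket's exact sequence length. The bucket sizes come from this card; `pick_bucket` and `pad_to_bucket` are hypothetical helpers, not the daemon's actual code, and the plain truncation of over-long inputs is a simplification.

```python
import numpy as np

# (name, seq_len, batch_size) per bucket, as listed in this card
BUCKETS = (("s", 40, 64), ("m", 64, 128), ("l", 128, 256))


def pick_bucket(n_tokens):
    """Return the smallest bucket whose fixed seq length fits n_tokens;
    longer inputs fall into the largest bucket and must be truncated."""
    for name, seq_len, batch in BUCKETS:
        if n_tokens <= seq_len:
            return name, seq_len, batch
    return BUCKETS[-1]


def pad_to_bucket(ids, pad_id=0):
    """Truncate or right-pad one token-id list to its bucket's fixed length."""
    name, seq_len, _ = pick_bucket(len(ids))
    row = np.full(seq_len, pad_id, dtype=np.int64)
    row[: min(len(ids), seq_len)] = ids[:seq_len]
    return name, row
```

Because every row in a bucket has the same length, a batch for a rectangular profile is just a stack of such rows, padded with extra all-pad rows up to the bucket's batch size if needed.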

	## Intended use

	- Semantic code search / code retrieval, and general (multilingual) text retrieval as a fallback.
	- Embed queries and documents the same way, with no instruction prefix: the student was distilled on
	passage embeddings, unlike the teacher, which prefixes queries only. The graph already mean-pools, so only L2-normalization remains.
	- For smaller indexes, truncate to 256 or 512 dims (MRL) before normalizing.

	The daemon runs the bundled engines directly (this repo is its CDN), but the FP32 `model.onnx` is
	also bundled for standalone use. The recipe below runs it with `onnxruntime`: tokenize with the
	bundled `sentencepiece.bpe.model`, run, and the pooled `[B,768]` is already produced — just
	L2-normalize:

	```python
	import onnxruntime as ort, sentencepiece as spm, numpy as np

	sp = spm.SentencePieceProcessor(model_file="sentencepiece.bpe.model") # pad=0 unk=1 bos=2 eos=3
	sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

	def embed(texts, max_len=128, mrl_dim=768):
	    # Wrap each text as bos … eos, truncating the pieces so the total fits in max_len
	    ids = [[2, *sp.encode(t)[: max_len - 2], 3] for t in texts]
	    L = max(len(x) for x in ids)  # pad to the longest sequence in the batch
	    inp = np.array([x + [0] * (L - len(x)) for x in ids], dtype=np.int64)  # pad=0
	    mask = (inp != 0).astype(np.int64)  # 1 for real tokens, 0 for padding
	    out = sess.run(None, {"input_ids": inp, "attention_mask": mask})[0]  # already mean-pooled [B,768]
	    out = out[:, :mrl_dim]  # MRL truncation (768/512/256)
	    return out / np.linalg.norm(out, axis=1, keepdims=True)  # L2-normalize after truncation
	```
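
Once queries and documents are embedded and L2-normalized, retrieval reduces to a dot product. The sketch below shows that step only; `top_k` is a hypothetical helper rather than part of this repo, and it assumes its inputs are already unit-length vectors such as those returned by the recipe above.

```python
import numpy as np


def top_k(query_vec, doc_vecs, k=5):
    """Rank L2-normalized document vectors against one normalized query.
    For unit vectors the dot product equals the cosine similarity."""
    scores = doc_vecs @ query_vec          # [N]
    order = np.argsort(-scores)[:k]        # indices of the k best documents
    return order, scores[order]
```

For large indexes a brute-force matrix product like this would normally be replaced by an approximate nearest-neighbor index; the scoring rule stays the same.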

	## What's in this repo — ready-to-run compiled engines

	This repo holds pre-compiled, ready-to-run engines, named per
	runtime × GPU arch × OS × length-bucket — grab the compiled model that matches your runtime and
	hardware and use it directly, with no compilation on your machine.

	- TensorRT `*.engine` — NVIDIA, INT8 W8A16, per arch × OS × bucket:
	`code-daemon-embed-v1-{s,m,l}_{win_x64,linux_x64}_trt_sm_{86,89,120}.engine`
	(sm_86 ≈ RTX 30xx / A-series · sm_89 ≈ RTX 40xx / L4 · sm_120 ≈ RTX 50xx).
	- TVM `*_tvm_vulkan.{dll,so}` — Vulkan fallback for non-TRT / older NVIDIA & other GPUs, per bucket.
	- OpenVINO `.xml` + `.bin` — Intel CPU / iGPU / NPU, per bucket.
	- Metal `_tvm_metal.` — Apple Silicon (macOS), per bucket.
	- Tokenizer — `sentencepiece.bpe.model` (the model's SentencePiece; specials baked at
	pad=0 / unk=1 / bos=2 / eos=3, byte-fallback) + `tokenizer_config.json`. The daemon loads the SP directly.
	- ONNX source — `model.onnx` (+ `model.onnx.data`) FP32 and `model_int8qdt.onnx` (INT8 W8A16) — for
	standalone `onnxruntime` / optimum use, and the source the engines are compiled from.
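
To make the TensorRT naming pattern above concrete, the sketch below composes an engine file name from a bucket, a platform, and an SM architecture. The allowed values are the ones listed in this card; `trt_engine_name` is a hypothetical helper, and how the daemon itself resolves file names is not documented here.

```python
BUCKETS = ("s", "m", "l")
PLATFORMS = ("win_x64", "linux_x64")
SM_ARCHS = (86, 89, 120)


def trt_engine_name(bucket, platform, sm):
    """Compose a TensorRT engine file name following the pattern in this card."""
    if bucket not in BUCKETS or platform not in PLATFORMS or sm not in SM_ARCHS:
        raise ValueError(f"no prebuilt engine for {bucket}/{platform}/sm_{sm}")
    return f"code-daemon-embed-v1-{bucket}_{platform}_trt_sm_{sm}.engine"
```

For example, the medium bucket on a Linux machine with an RTX 40xx card (sm_89) resolves to `code-daemon-embed-v1-m_linux_x64_trt_sm_89.engine`.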

	## Evaluation — in-scope CoIR (sub-CoIR)

	CoIR is a broad code-retrieval benchmark, but 4 of its 10 tasks are out of scope for a code-graph
	search engine (code↔code translation, multi-turn dialogue, long problem-statements — the daemon never
	performs these). The honest, relevant view is the in-scope subset — the retrieval patterns this
	model is actually built for (NDCG@10, full corpora):

	| CoIR task (in-scope) | NDCG@10 | Pattern |
	|---|--:|---|
	| codesearchnet (6-lang avg) | 74.64 | docstring / NL → code (the core path) |
	| stackoverflow-qa | 53.18 | short question → code |
	| synthetic-text2sql | 50.15 | NL → SQL |
	| codefeedback-st | 47.71 | NL instruction → code |
	| codesearchnet-ccr (6-lang avg) | 44.30 | code → related code (clone/dup) |
	| cosqa | 32.14 | NL question → code (noisy / hard) |
	| In-scope average (sub-CoIR) | 51.56 | |

	codesearchnet per language (NL→code): python 91.96, go 82.27, java 76.02, php 68.98, ruby 65.94, js 62.66.
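
For readers unfamiliar with the metric, here is a minimal sketch of NDCG@10 for a single query. It assumes linear gain and a log2 rank discount, which is the common convention; the exact evaluation harness used for the numbers above is not specified in this card, and `dcg` / `ndcg_at_k` are hypothetical names. The table reports the value multiplied by 100.

```python
import math


def dcg(rels, k):
    """Discounted cumulative gain over the first k relevance labels."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))


def ndcg_at_k(ranked_rels, all_rels, k=10):
    """ranked_rels: relevance of each retrieved doc, in rank order.
    all_rels: relevance labels of every judged doc for the query,
    used to build the ideal ranking."""
    ideal = dcg(sorted(all_rels, reverse=True), k)
    return dcg(ranked_rels, k) / ideal if ideal > 0 else 0.0
```

With a single relevant document, the score is 1.0 when it is ranked first and `1 / log2(3)` (about 0.63) when it is ranked second.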

	> The full 10-task official CoIR average (36.67) is dragged down by the 4 out-of-scope tasks and is not
	> representative of the real query mix. For scale, the 1.5B-class `bge-code-v1` scores 81.77 on full
	> CoIR — this is a 54.5M model (27× smaller) tuned for one job.

	On the daemon's own `search-gold` golden set (its real query distribution), hit@5 is 0.692, an 80% gain over
	the retired v1.1 cut (0.385). Binary (1-bit) vectors retain ~91% of float NDCG before rescore.
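
The binary-then-rescore pattern mentioned above can be sketched as follows. How the daemon produces its 1-bit vectors is not stated in this card, so thresholding each dimension at zero is an assumption, as are the helper names `binarize`, `hamming_top`, and `search` and the `oversample` parameter.

```python
import numpy as np


def binarize(vecs):
    """1 bit per dimension (sign), packed 8 dims per byte: 768 dims -> 96 bytes."""
    return np.packbits(vecs > 0, axis=1)


def hamming_top(query_bits, doc_bits, n):
    """Indices of the n docs with the smallest Hamming distance to the query."""
    xor = np.bitwise_xor(doc_bits, query_bits)       # [N, 96] uint8
    dist = np.unpackbits(xor, axis=1).sum(axis=1)    # differing bits per doc
    return np.argsort(dist, kind="stable")[:n]


def search(query, docs, doc_bits, k=5, oversample=4):
    """Binary first pass, then rescore the shortlist with float dot products."""
    shortlist = hamming_top(binarize(query[None, :])[0], doc_bits, k * oversample)
    scores = docs[shortlist] @ query
    return shortlist[np.argsort(-scores)[:k]]
```

The first pass touches only 96 bytes per document; the float vectors are read just for the `k * oversample` shortlisted candidates.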

	## Performance (embeddings / sec)

	| Backend | Hardware | Throughput |
	|---|---|--:|
	| TensorRT INT8 | NVIDIA RTX 5060 (sm_120) | ~20,000 emb/s |
	| OpenVINO INT4 | Intel iGPU (Xe2, Lunar Lake) | ~580 emb/s |
	| OpenVINO INT4 | Intel NPU (NPU4) | ~574 emb/s |
	| OpenVINO INT8 | Intel CPU (Core Ultra) | ~375 emb/s |
	| OpenVINO, all 3 in parallel | iGPU + NPU + CPU concurrently | ~1,290 emb/s |

	The combined figure is genuine concurrent multi-device execution: three independent workers, one
	bound to each of the iGPU, NPU and CPU, embed different batches at the same time. The combined
	~1,290 emb/s is about 84% of the ~1,529 emb/s sum of the three standalone figures, so the
	throughputs add up only approximately. This is not OpenVINO's `AUTO` mode (which by default selects a
	single device for inference rather than running the three simultaneously); the daemon length-sorts
	inputs and fans the buckets across all three devices. The TensorRT figure is inference throughput on
	the bucketed batch path; the OpenVINO figures were measured on a Core Ultra (Lunar Lake) laptop.
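
The worker-per-device arrangement described above can be sketched with plain threads and a shared queue. This is an illustration under assumptions, not the daemon's implementation: `run_workers` is a hypothetical helper, and each entry in `devices` stands in for a callable wrapping a model compiled for one device. Real overlap also assumes the inference call releases the GIL.

```python
import queue
import threading


def run_workers(batches, devices):
    """Fan batches out to one worker thread per device.
    `devices` maps a device name to a callable that embeds one batch."""
    todo = queue.Queue()
    for i, batch in enumerate(batches):
        todo.put((i, batch))
    results = [None] * len(batches)

    def worker(embed_fn):
        while True:
            try:
                i, batch = todo.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker exits
            results[i] = embed_fn(batch)  # each index is written by one worker only

    threads = [threading.Thread(target=worker, args=(fn,)) for fn in devices.values()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because workers pull from a shared queue, a faster device naturally takes more batches, and results come back in the original batch order.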

	## License & training data

	Released under the MIT license.

	The teacher (`nomic-ai/CodeRankEmbed`) is MIT, and the XLM-R architecture is MIT. As is standard
	practice for distilled embedding models, the weights are released under MIT. For transparency,
	the training corpus the teacher embedded includes:

	| Dataset | License note |
	|---|---|
	| `Fsoft-AIC/the-vault-function` (code) | dataset MIT; underlying code has mixed upstream provenance |
	| `unicamp-dl/mmarco` (EN/RU retrieval) | MS MARCO-derived → non-commercial research terms |
	| `sentence-transformers/all-nli` | SNLI (CC BY-SA 4.0) + MultiNLI |
	| `sentence-transformers/gooaq` | Apache-2.0 |
	| `jinaai/negation-dataset` | see source repo |

	⚠️ If your use requires strict training-data-license compliance, note that **mMARCO derives from
	MS MARCO (non-commercial)**. Whether a distilled model inherits dataset-use terms is legally
	unsettled; this is not legal advice. A data-clean variant can be retrained without the mMARCO
	splits if needed.

	## Attribution

	Distilled from [nomic-ai/CodeRankEmbed](https://huggingface.co/nomic-ai/CodeRankEmbed) (MIT). Backbone: XLM-RoBERTa (MIT).