	---
	license: apache-2.0
	tags:
	- sentence-similarity
	- feature-extraction
	- static-embeddings
	- lf4-quantization
	- retrieval
	- code-search
	model_name: Vortex-Embed v2
	datasets:
	- VTXAI/Vortex-Embed
	metrics:
	- recall@1
	- recall@5
	- recall@10
	- mrr
	---

	# Vortex-Embed v2

	Retrieval-optimized 4-bit static embeddings for code search.

	Built on [VTXAI/Vortex-Embed-4.7M](https://huggingface.co/VTXAI/Vortex-Embed-4.7M)
	(29528 vocab × 256 dim, 4-bit LF4 packed = 4.7 MB on disk) with a
	set of training-free retrieval upgrades that lift R@1 from 0.314 → 0.745
	on the Webscout codebase benchmark (51 hand-verified code queries,
	5,168 chunks across 349 files).

	## What changed vs the v1 model

	All four upgrades are inference-time only — the underlying 4-bit weights are
	bit-identical to the v1 artifact. They are:

	1. SIF IDF weighting. Each token's contribution is scaled by
	`a / (a + p(t))`, where `p(t)` is the token's relative corpus frequency.
	Common tokens ("import", "def", "class") are down-weighted, while rare
	tokens keep close to full weight and so dominate the average.
	2. Top-8 principal component removal. The eight dominant common-topic
	directions of the corpus are fitted once via SVD and projected out of
	every chunk/query vector (after Arora et al. 2017, who remove only the
	first component).
	3. File-path header injection. Before encoding each chunk, its file
	path tokens (e.g. `model_fetcher`, `search`, `engines`) are repeated
	15 times and prepended to the chunk text. The file name effectively
	becomes a "tag" the chunk retrieves on.
	4. Search-time file-extension score bias. Within the top-50 dense
	candidates, `.py` chunks get `+0.05` and `.md` chunks get `-0.02`. This
	fixes the common failure where README.md and docs/*.md outrank the
	actual code (higher topic overlap but lower action relevance).
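
Upgrades 1 and 2 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the code in `lf4_v2.py`: the function names (`sif_weights`, `sif_embed`, `fit_components`, `remove_components`) are hypothetical, the embedding table is random toy data, and tokens missing from the frequency table are assumed to get weight 1.0.

```python
import numpy as np

def sif_weights(token_counts, a=1e-4):
    """Map token id -> a / (a + p(t)), with p(t) the relative corpus frequency."""
    total = sum(token_counts.values())
    return {t: a / (a + c / total) for t, c in token_counts.items()}

def sif_embed(token_ids, table, weights):
    """Weighted average of static token vectors for one chunk or query."""
    # Assumption: unseen tokens are treated as rare and get weight 1.0.
    w = np.array([weights.get(t, 1.0) for t in token_ids])
    return (w[:, None] * table[token_ids]).sum(axis=0) / max(len(token_ids), 1)

def fit_components(embs, k=8):
    """Top-k right singular vectors of the (uncentered) embedding matrix."""
    _, _, vt = np.linalg.svd(embs, full_matrices=False)
    return vt[:k]  # shape (k, dim), rows are orthonormal

def remove_components(embs, comps):
    """Project the fitted directions out of every row."""
    return embs - (embs @ comps.T) @ comps

# Toy data: 6-token vocabulary, 4-dim vectors (random stand-ins for real weights).
rng = np.random.default_rng(0)
table = rng.normal(size=(6, 4))
counts = {0: 900, 1: 80, 2: 10, 3: 5, 4: 4, 5: 1}  # token 0 is very common
weights = sif_weights(counts, a=1e-3)

chunks = [[0, 1, 2], [0, 0, 3], [0, 4, 5], [1, 2, 3], [0, 1, 5]]
embs = np.stack([sif_embed(c, table, weights) for c in chunks])

comps = fit_components(embs, k=1)  # k=1 because the toy matrix is tiny
cleaned = remove_components(embs, comps)
```

Note that every SIF weight is at most 1, so rare tokens are amplified only relative to common ones. After `remove_components`, each row has zero projection onto the fitted directions.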

	## Benchmark

	Corpus: 5,168 chunks × 256-dim across 349 files in the Webscout codebase.
	Queries: 51 hand-verified natural-language → file-path pairs.

	| Model | R@1 | R@5 | R@10 | MRR | enc@1 | enc@64 | search@64 |
	|---|---|---|---|---|---|---|---|
	| Vortex-Embed v1 (baseline) | 0.314 | 0.667 | 0.863 | 0.478 | 6.2 ms | 227 ms | 4.2 ms |
	| Vortex-Embed v2 (this) | 0.745 | 0.843 | 0.882 | 0.779 | 6.4 ms | 107 ms | 9.1 ms |

	That is +137% R@1 and +63% MRR over v1. Encoding a batch of 64 chunks is
	2.1× faster (227 ms → 107 ms), using the same `torch.scatter_add_` (ATen)
	and sorted `reduceat` kernels as v1, while search@64 is slower
	(4.2 ms → 9.1 ms).
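
For readers who want to reproduce the metric columns, the sketch below shows one common way to compute Recall@k and MRR. It assumes file-level matching (a query counts as a hit when the gold file appears in the ranked list), which this card does not spell out. The function names and the three example rankings are made up for illustration.

```python
def recall_at_k(ranked_files, gold_file, k):
    """1.0 if the gold file appears among the first k results, else 0.0."""
    return float(gold_file in ranked_files[:k])

def reciprocal_rank(ranked_files, gold_file):
    """1 / rank of the first hit (1-based), or 0.0 if the gold file is absent."""
    for rank, f in enumerate(ranked_files, start=1):
        if f == gold_file:
            return 1.0 / rank
    return 0.0

def evaluate(runs, ks=(1, 5, 10)):
    """runs: list of (ranked_files, gold_file) pairs, one per query."""
    n = len(runs)
    report = {f"R@{k}": sum(recall_at_k(r, g, k) for r, g in runs) / n for k in ks}
    report["MRR"] = sum(reciprocal_rank(r, g) for r, g in runs) / n
    return report

# Hypothetical rankings for three queries (file names are made up).
runs = [
    (["a.py", "b.py", "c.py"], "a.py"),  # hit at rank 1
    (["b.py", "a.py", "c.py"], "a.py"),  # hit at rank 2
    (["b.py", "c.py", "d.py"], "a.py"),  # miss
]
report = evaluate(runs)

# Relative gains recomputed from the table values.
r1_gain = (0.745 - 0.314) / 0.314    # about 1.37, i.e. +137%
mrr_gain = (0.779 - 0.478) / 0.478   # about 0.63, i.e. +63%
```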

	## Usage

	```python
	from huggingface_hub import snapshot_download
	from lf4_v2 import VortexEmbedV2

	# Download model + tokenizer + config
	path = snapshot_download("VTXAI/Vortex-Embed-v2")

	# Load
	model = VortexEmbedV2.from_pretrained(path)
	print(f"vocab={model.vocab_size}, dim={model.dim}, size={model.model_size_mb:.1f} MB")

	# Single-query encode
	vec = model.encode("find python json parser", normalize=True)
	# vec.shape == (256,)

	# Batch encode
	docs = [
	    "def parse_json(s): return json.loads(s)",
	    "class WeatherAPI: pass",
	    "import requests",
	]
	doc_embs = model.encode(docs, normalize=True)  # (3, 256)

	# Search
	scores, indices = model.search(vec, doc_embs, top_k=3)
	# scores.shape == (1, 3), indices.shape == (1, 3)
	```
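
The shapes returned by `search` are consistent with a plain dense top-k over cosine scores. The sketch below shows that idea in NumPy. It is an assumption about the general technique, not the actual `VortexEmbedV2.search` implementation, and `top_k_cosine` is a hypothetical name.

```python
import numpy as np

def top_k_cosine(queries, docs, top_k=3):
    """Return (scores, indices), each of shape (n_queries, k), best match first."""
    q = np.atleast_2d(queries)               # a single (dim,) query becomes (1, dim)
    scores = q @ docs.T                      # equals cosine when rows are unit-norm
    k = min(top_k, docs.shape[0])
    idx = np.argsort(-scores, axis=1)[:, :k]
    return np.take_along_axis(scores, idx, axis=1), idx

# Toy example: three unit-norm "documents" and one unit-norm query.
docs = np.eye(3)
query = np.array([0.6, 0.8, 0.0])
scores, indices = top_k_cosine(query, docs, top_k=2)
```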

	### Codebase retrieval (the real use case)

	```python
	from pathlib import Path
	from lf4_v2 import VortexEmbedV2

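	# 1. Chunk a codebase (line-based, 40 lines/chunk, 5 line overlap)
	CHUNK_LINES, OVERLAP = 40, 5
	chunks, texts = [], []
	for path in Path("./src").rglob("*.py"):
	    lines = path.read_text(encoding="utf-8", errors="ignore").splitlines()
	    # Step by 35 so consecutive 40-line windows share 5 lines
	    for start in range(0, len(lines), CHUNK_LINES - OVERLAP):
	        chunk = "\n".join(lines[start:start + CHUNK_LINES])
	        chunks.append((str(path), start, chunk))
	        texts.append(chunk)
	        if start + CHUNK_LINES >= len(lines):
	            break  # this window already reaches the end of the file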

	# 2. Load + bind paths (this enables file-path header injection)
	model = VortexEmbedV2.from_pretrained("VTXAI/Vortex-Embed-v2")
	model.set_file_paths([c[0] for c in chunks]) # critical for v2 quality

	# 3. Fit IDF on the corpus (one-time, ~200 ms)
	token_lists = [model.tokenizer.encode(t).ids for t in texts]
	model.fit_idf(token_lists)

	# 4. Encode corpus
	import_emb = model.encode_batch(texts, normalize=True) # (n, 256)

	# 5. Fit top-K PC on the corpus (one-time, ~300 ms)
	model.fit_pc(import_emb, k=8)

	# 6. Re-encode with PC removal applied
	import_emb = model.encode_batch(texts, normalize=True)

	# 7. Query
	query = "where do we parse JSON requests"
	q_emb = model.encode(query, normalize=True)
	scores, indices = model.search(q_emb, import_emb, top_k=10)
	for rank, (s, i) in enumerate(zip(scores[0], indices[0]), 1):
	    file, start, _ = chunks[i]
	    print(f"#{rank} ({s:.3f}) {file}:{start + 1}")  # start is a 0-based line index
	```
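
To make step 2 more concrete, here is a rough sketch of what file-path header injection could look like as plain text preprocessing. The helper names (`path_header`, `with_header`) and the tokenization rule (split on non-alphanumeric characters, drop the extension) are assumptions for illustration. How `set_file_paths` builds the header internally is not documented here.

```python
import re
from pathlib import PurePosixPath

def path_header(file_path, repeat=15):
    """Split a path into word-like tokens and repeat the sequence `repeat` times."""
    p = PurePosixPath(file_path)
    parts = list(p.parts[:-1]) + [p.stem]  # directories plus file name without extension
    tokens = [t for part in parts for t in re.findall(r"[A-Za-z0-9_]+", part)]
    return " ".join(tokens * repeat)

def with_header(file_path, chunk_text, repeat=15):
    """Prepend the repeated path tokens to the chunk text before encoding."""
    header = path_header(file_path, repeat)
    return f"{header}\n{chunk_text}" if header else chunk_text

# repeat=2 keeps the example short; the card's default is 15.
example = with_header("search/engines/model_fetcher.py", "def fetch(): ...", repeat=2)
```

Because a static embedding is an average over tokens, repeating the path tokens raises their share of that average, which is why the repeat count acts as a weight.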

	## Configuration knobs

	All retrieval hyperparameters live in `config.json` and can be overridden
	at load time:

	```python
	model = VortexEmbedV2.from_pretrained(
	    "VTXAI/Vortex-Embed-v2",
	    sif_a=1e-3,        # SIF smoothing (lower = sharper); default is 1e-4
	    pc_k=0,            # disable PC removal
	    header_repeat=10,  # reduce path-header weight (default 15)
	    py_bonus=0.0,      # disable the .py bonus (md_penalty still applies)
	)
	```

	| Knob | Default | Effect |
	|---|---|---|
	| `sif_a` | 1e-4 | SIF smoothing. Lower = sharper IDF weighting |
	| `pc_k` | 8 | Number of principal components to remove |
	| `sif_pc` | 1.0 | PC removal strength (0 = disabled) |
	| `header_repeat` | 15 | How many times to repeat path-header tokens |
	| `py_bonus` | 0.05 | Score boost for `.py` chunks in top-50 |
	| `md_penalty` | -0.02 | Score penalty for `.md` chunks in top-50 |
	| `bias_top_k` | 50 | Candidate pool size for the bias |
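
The three bias knobs interact as a small re-ranking step. The sketch below is one plausible reading of the table, with a hypothetical function name (`apply_extension_bias`) and made-up scores: take the best `bias_top_k` dense candidates, add the per-extension offset, and sort again.

```python
def apply_extension_bias(scores, paths, py_bonus=0.05, md_penalty=-0.02, bias_top_k=50):
    """Re-rank the best `bias_top_k` dense candidates with a per-extension offset."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    pool = order[:bias_top_k]
    adjusted = []
    for i in pool:
        s = scores[i]
        if paths[i].endswith(".py"):
            s += py_bonus
        elif paths[i].endswith(".md"):
            s += md_penalty  # md_penalty is negative by default
        adjusted.append((s, i))
    adjusted.sort(key=lambda t: t[0], reverse=True)
    return adjusted  # list of (biased_score, original_index), best first

# Made-up dense scores where a README narrowly outranks the code file.
paths = ["README.md", "docs/api.md", "search/engines/model_fetcher.py"]
dense = [0.62, 0.55, 0.58]
ranked = apply_extension_bias(dense, paths)
```

In this toy case the `.py` chunk moves from second to first place, which is the failure mode the bias is meant to correct.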

	## Files

	- `model.safetensors` — 4-bit LF4 packed weights (3.7 MB)
	- `embedding_scales` (FP16), `embedding_zeros` (FP16) — per-block quantization params
	- `config.json` — model + retrieval config
	- `tokenizer.json` — HuggingFace fast tokenizer (29 KB)
	- `lf4_v2.py` — self-contained model class (drop-in to any project)
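
As a rough guide to how packed 4-bit weights with per-block scales and zeros are typically read back, the sketch below implements a generic affine dequantization, `w = (code - zero) * scale`. It is not the LF4 format: the nibble order, the block size of 32, and the function names are all assumptions chosen for illustration.

```python
import numpy as np

def unpack_nibbles(packed):
    """Split each uint8 into two 4-bit codes (low nibble first, an assumption)."""
    low = packed & 0x0F
    high = packed >> 4
    return np.stack([low, high], axis=-1).reshape(packed.shape[0], -1)

def dequantize(packed, scales, zeros, block=32):
    """w = (code - zero) * scale, with one FP16 scale and zero per block of codes."""
    codes = unpack_nibbles(packed).astype(np.float32)  # (rows, dim)
    rows, dim = codes.shape
    codes = codes.reshape(rows, dim // block, block)
    s = scales.astype(np.float32)[..., None]
    z = zeros.astype(np.float32)[..., None]
    return ((codes - z) * s).reshape(rows, dim)

# Packed payload for the shape quoted in this card: two 4-bit codes per byte.
vocab, dim = 29528, 256
packed_bytes = vocab * dim // 2  # 3,779,584 bytes, excluding scales and zeros

# Round trip on toy data: 2 rows of 32 codes, one block per row.
codes = np.arange(64, dtype=np.uint8).reshape(2, 32) % 16
packed = (codes[:, 0::2] | (codes[:, 1::2] << 4)).astype(np.uint8)
scales = np.full((2, 1), 0.5, dtype=np.float16)
zeros = np.full((2, 1), 8.0, dtype=np.float16)
weights = dequantize(packed, scales, zeros, block=32)
```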

	## Citation

	The SIF/PC technique is from:
	> Arora, Liang, Ma (2017). A Simple but Tough-to-Beat Baseline for Sentence Embeddings. ICLR.

	The LF4 quantization is from:
	> Original Vortex-Embed-4.7M model card on [VTXAI/Vortex-Embed-4.7M](https://huggingface.co/VTXAI/Vortex-Embed-4.7M).

	If you use v2 in research, please cite the original Vortex-Embed paper and
	this AutoResearch loop (see [Vortex-AutoResearch](https://github.com/VortexAI)).