---
license: apache-2.0
tags:
- sentence-similarity
- feature-extraction
- static-embeddings
- lf4-quantization
- retrieval
- code-search
model_name: Vortex-Embed v2
datasets:
- VTXAI/Vortex-Embed
metrics:
- recall@1
- recall@5
- recall@10
- mrr
---
# Vortex-Embed v2
**Retrieval-optimized 4-bit static embeddings for code search.**
Built on [VTXAI/Vortex-Embed-4.7M](https://huggingface.co/VTXAI/Vortex-Embed-4.7M)
(29528 vocab × 256 dim, 4-bit LF4 packed = **4.7 MB** on disk) with a
set of training-free retrieval upgrades that lift R@1 from 0.314 to **0.745**
on the Webscout codebase benchmark (51 hand-verified code queries,
5,168 chunks across 349 files).
## What changed vs the v1 model
All four upgrades are inference-time only — the underlying 4-bit weights are
bit-identical to the v1 artifact. They are:
1. **SIF IDF weighting.** Each token's contribution is scaled by
`a / (a + p(t))`, where `p(t)` is the token's relative frequency in the
corpus. Common tokens ("import", "def", "class") are down-weighted; rare
tokens keep close to full weight.
2. **Top-8 principal component removal.** The eight dominant common-topic
directions of the corpus are fitted once via SVD and projected out of
every chunk/query vector (Arora et al. 2017).
3. **File-path header injection.** Before encoding each chunk, its file
path tokens (e.g. `model_fetcher`, `search`, `engines`) are prepended
15 times. The file name effectively becomes a "tag" the chunk retrieves on.
4. **Search-time file-extension score bias.** Within the top-50 dense
candidates, `.py` chunks get `+0.05` and `.md` chunks get `-0.02`. This
fixes the common failure where `README.md` and `docs/*.md` outrank the
actual code (higher topic overlap but lower action relevance).
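
Upgrades 1 and 2 follow the SIF recipe of Arora et al. (2017). The NumPy sketch below illustrates the idea only; it is not the code shipped in `lf4_v2.py`. The function names, the random embedding table, and the toy token counts are assumptions made for illustration, and the SVD here is uncentered by choice.

```python
import numpy as np

def sif_weights(token_counts: np.ndarray, a: float = 1e-4) -> np.ndarray:
    """Per-token weight a / (a + p(t)), with p(t) the relative corpus frequency."""
    p = token_counts / token_counts.sum()
    return a / (a + p)

def sif_embed(token_ids: list[int], table: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Weighted mean of the static token vectors for one text."""
    ids = np.asarray(token_ids, dtype=np.int64)
    return (weights[ids, None] * table[ids]).sum(axis=0) / max(len(ids), 1)

def fit_components(embs: np.ndarray, k: int = 8) -> np.ndarray:
    """Top-k right singular vectors of the embedding matrix, shape (k, dim)."""
    _, _, vt = np.linalg.svd(embs, full_matrices=False)
    return vt[:k]

def remove_components(embs: np.ndarray, pcs: np.ndarray) -> np.ndarray:
    """Project the fitted directions out of every row: v - (v P^T) P."""
    return embs - (embs @ pcs.T) @ pcs

# Toy data (hypothetical): 100-token vocabulary, 16-dim vectors, 32 documents.
rng = np.random.default_rng(0)
table = rng.normal(size=(100, 16))
counts = rng.integers(1, 1000, size=100)
weights = sif_weights(counts)

docs = [rng.integers(0, 100, size=20).tolist() for _ in range(32)]
embs = np.stack([sif_embed(d, table, weights) for d in docs])

pcs = fit_components(embs, k=8)
cleaned = remove_components(embs, pcs)
```

Since `a / (a + p(t))` never exceeds 1, the weighting suppresses frequent tokens rather than boosting rare ones. After removal, every row of `cleaned` is orthogonal to the fitted directions, so the same `pcs` matrix has to be applied to query vectors as well.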
## Benchmark
Corpus: 5,168 chunks × 256-dim across 349 files in the Webscout codebase.
Queries: 51 hand-verified natural-language → file-path pairs.
| Model | R@1 | R@5 | R@10 | MRR | enc@1 | enc@64 | search@64 |
|---|---|---|---|---|---|---|---|
| Vortex-Embed v1 (baseline) | 0.314 | 0.667 | 0.863 | 0.478 | 6.2 ms | 227 ms | 4.2 ms |
| **Vortex-Embed v2 (this)** | **0.745** | **0.843** | **0.882** | **0.779** | 6.4 ms | 107 ms | 9.1 ms |
**+137% R@1, +63% MRR.** Encoding a batch of 64 chunks is **2.1× faster**
(227 ms → 107 ms) and runs on the same `torch.scatter_add_` (ATen) and sorted
`reduceat` kernels as v1. The `search@64` time rises from 4.2 ms to 9.1 ms.
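
For readers who want to reproduce the metric columns, here is a small sketch of how recall@k and MRR are commonly computed for query → file-path pairs. It assumes a query counts as a hit when the gold file appears anywhere in the top k results; the benchmark's exact matching rule is not stated here, and the helper names and toy rankings are hypothetical.

```python
def recall_at_k(ranked: list[list[str]], gold: list[str], k: int) -> float:
    """Fraction of queries whose gold file appears in the top-k results."""
    hits = sum(g in r[:k] for r, g in zip(ranked, gold))
    return hits / len(gold)

def mean_reciprocal_rank(ranked: list[list[str]], gold: list[str]) -> float:
    """Mean of 1 / rank of the first gold hit (0 when the gold file is absent)."""
    total = 0.0
    for r, g in zip(ranked, gold):
        if g in r:
            total += 1.0 / (r.index(g) + 1)
    return total / len(gold)

def relative_gain(before: float, after: float) -> float:
    """Relative improvement in percent."""
    return 100.0 * (after - before) / before

# Toy rankings for three queries (hypothetical file names).
ranked = [
    ["a.py", "b.py", "c.py"],
    ["x.py", "y.py", "z.py"],
    ["m.py", "n.py", "o.py"],
]
gold = ["a.py", "z.py", "q.py"]

print(recall_at_k(ranked, gold, 1))           # 1 of 3 queries hit at rank 1
print(mean_reciprocal_rank(ranked, gold))     # (1 + 1/3 + 0) / 3

# Headline deltas from the table above.
print(f"{relative_gain(0.314, 0.745):.0f}%")  # 137%
print(f"{relative_gain(0.478, 0.779):.0f}%")  # 63%
```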
## Usage
```python
from huggingface_hub import snapshot_download
from lf4_v2 import VortexEmbedV2

# Download model + tokenizer + config
path = snapshot_download("VTXAI/Vortex-Embed-v2")

# Load
model = VortexEmbedV2.from_pretrained(path)
print(f"vocab={model.vocab_size}, dim={model.dim}, size={model.model_size_mb:.1f} MB")

# Single-query encode
vec = model.encode("find python json parser", normalize=True)
# vec.shape == (256,)

# Batch encode
docs = [
    "def parse_json(s): return json.loads(s)",
    "class WeatherAPI: pass",
    "import requests",
]
doc_embs = model.encode(docs, normalize=True)  # (3, 256)

# Search
scores, indices = model.search(vec, doc_embs, top_k=3)
# scores.shape == (1, 3), indices.shape == (1, 3)
```
### Codebase retrieval (the real use case)
```python
from pathlib import Path
from lf4_v2 import VortexEmbedV2

# 1. Chunk a codebase (line-based, 40 lines/chunk, 5 line overlap)
CHUNK_LINES, OVERLAP = 40, 5
chunks, texts = [], []
for path in Path("./src").rglob("*.py"):
    lines = path.read_text().splitlines()  # read each file once
    # Consecutive chunks start 35 lines apart, so they share 5 lines.
    for start in range(0, len(lines), CHUNK_LINES - OVERLAP):
        chunk = "\n".join(lines[start:start + CHUNK_LINES])
        chunks.append((str(path), start, chunk))
        texts.append(chunk)

# 2. Load + bind paths (this enables file-path header injection)
model = VortexEmbedV2.from_pretrained("VTXAI/Vortex-Embed-v2")
model.set_file_paths([c[0] for c in chunks])  # critical for v2 quality

# 3. Fit IDF on the corpus (one-time, ~200 ms)
token_lists = [model.tokenizer.encode(t).ids for t in texts]
model.fit_idf(token_lists)

# 4. Encode corpus
corpus_emb = model.encode_batch(texts, normalize=True)  # (n, 256)

# 5. Fit top-K PC on the corpus (one-time, ~300 ms)
model.fit_pc(corpus_emb, k=8)

# 6. Re-encode with PC removal applied
corpus_emb = model.encode_batch(texts, normalize=True)

# 7. Query
query = "where do we parse JSON requests"
q_emb = model.encode(query, normalize=True)
scores, indices = model.search(q_emb, corpus_emb, top_k=10)
for rank, (s, i) in enumerate(zip(scores[0], indices[0]), 1):
    file, line, text = chunks[i]
    print(f"#{rank} ({s:.3f}) {file}:{line}")
```
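
Step 2 binds file paths so that path tokens can be prepended to each chunk before encoding. The sketch below is a rough guess at what such a header could look like at the text level; it is not taken from `lf4_v2.py`. The helper names `path_header` and `with_header`, and the rule for splitting a path into tokens, are assumptions for illustration.

```python
import re
from pathlib import PurePosixPath

def path_header(file_path: str, repeat: int = 15) -> str:
    """Split a path into word-like tokens and repeat the sequence `repeat` times."""
    p = PurePosixPath(file_path)
    # Directories plus the file name without its extension.
    parts = list(p.parts[:-1]) + [p.stem]
    # Keep underscores so that `model_fetcher` stays a single token.
    tokens = [t for part in parts for t in re.split(r"[^A-Za-z0-9_]+", part) if t]
    return " ".join(tokens * repeat)

def with_header(file_path: str, chunk: str, repeat: int = 15) -> str:
    """Prepend the repeated path header to the chunk text."""
    header = path_header(file_path, repeat)
    return f"{header}\n{chunk}" if header else chunk

# Hypothetical path, shown with repeat=2 to keep the output short.
example = with_header("search/engines/model_fetcher.py", "def fetch(): ...", repeat=2)
print(example)
```

Repeating the header raises the share of the pooled vector that comes from path tokens, which is why a lower `header_repeat` reduces their influence.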
## Configuration knobs
All retrieval hyperparameters live in `config.json` and can be overridden
at load time:
```python
model = VortexEmbedV2.from_pretrained(
    "VTXAI/Vortex-Embed-v2",
    sif_a=1e-3,        # SIF smoothing (lower = sharper)
    pc_k=0,            # disable PC removal
    header_repeat=10,  # reduce path-header weight
    py_bonus=0.0,      # disable the .py boost (md_penalty is a separate knob)
)
```
| Knob | Default | Effect |
|---|---|---|
| `sif_a` | 1e-4 | SIF smoothing. Lower = sharper IDF weighting |
| `pc_k` | 8 | Number of principal components to remove |
| `sif_pc` | 1.0 | PC removal strength (0 = disabled) |
| `header_repeat` | 15 | How many times to repeat path-header tokens |
| `py_bonus` | 0.05 | Score boost for `.py` chunks in top-50 |
| `md_penalty` | -0.02 | Score penalty for `.md` chunks in top-50 |
| `bias_top_k` | 50 | Candidate pool size for the bias |
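
To make the last three knobs concrete, here is a minimal NumPy sketch of an extension-biased re-rank over a dense candidate pool. It assumes L2-normalised vectors and a plain dot-product search; `biased_search` is a hypothetical helper, not the `model.search` implementation, and it returns 1-D arrays for a single query.

```python
import numpy as np

def biased_search(q, doc_embs, paths, top_k=10,
                  bias_top_k=50, py_bonus=0.05, md_penalty=-0.02):
    """Dense top-`bias_top_k` retrieval followed by a file-extension score bias."""
    scores = doc_embs @ q  # cosine similarity when inputs are L2-normalised
    pool = np.argsort(-scores, kind="stable")[:bias_top_k]
    adjusted = scores[pool].copy()
    for j, idx in enumerate(pool):
        if paths[idx].endswith(".py"):
            adjusted[j] += py_bonus
        elif paths[idx].endswith(".md"):
            adjusted[j] += md_penalty
    order = np.argsort(-adjusted, kind="stable")[:top_k]
    return adjusted[order], pool[order]

# Toy corpus (hypothetical): the README matches the query slightly better
# than the code file before the bias is applied.
paths = ["README.md", "parser.py", "notes.txt"]
doc_embs = np.array([
    [1.0, 0.0],
    [0.95, np.sqrt(1 - 0.95 ** 2)],
    [0.0, 1.0],
])
q = np.array([1.0, 0.0])

top_scores, top_idx = biased_search(q, doc_embs, paths, top_k=2)
print([paths[i] for i in top_idx])  # ['parser.py', 'README.md']
```

The bias only reorders chunks that already made the dense pool, so a `.py` chunk outside the top `bias_top_k` cannot be promoted by `py_bonus` alone.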
## Files
- `model.safetensors` — 4-bit LF4 packed weights (3.7 MB)
- `embedding_scales` (FP16), `embedding_zeros` (FP16) — per-block quantization params
- `config.json` — model + retrieval config
- `tokenizer.json` — HuggingFace fast tokenizer (29 KB)
- `lf4_v2.py` — self-contained model class (drop-in to any project)
## Citation
The SIF/PC technique is from:
> Arora, Liang, Ma (2017). *A Simple but Tough-to-Beat Baseline for Sentence Embeddings.* ICLR.
The LF4 quantization is from:
> Original Vortex-Embed-4.7M model card on [VTXAI/Vortex-Embed-4.7M](https://huggingface.co/VTXAI/Vortex-Embed-4.7M).
If you use v2 in research, please cite the original Vortex-Embed paper and
this AutoResearch loop (see [Vortex-AutoResearch](https://github.com/VortexAI)).