---
license: apache-2.0
tags:
- sentence-similarity
- feature-extraction
- static-embeddings
- lf4-quantization
- retrieval
- code-search
model_name: Vortex-Embed v2
datasets:
- VTXAI/Vortex-Embed
metrics:
- recall@1
- recall@5
- recall@10
- mrr
---
# Vortex-Embed v2
**Retrieval-optimized 4-bit static embeddings for code search.**

Built on [VTXAI/Vortex-Embed-4.7M](https://huggingface.co/VTXAI/Vortex-Embed-4.7M)
(29528 vocab × 256 dim, 4-bit LF4 packed = **4.7 MB** on disk) with a
set of training-free retrieval upgrades that lift R@1 from 0.314 → **0.745**
on the Webscout codebase benchmark (51 hand-verified code queries,
5,168 chunks across 349 files).
## What changed vs the v1 model
All four upgrades are inference-time only — the underlying 4-bit weights are
bit-identical to the v1 artifact. They are:
1. **SIF IDF weighting.** Each token's contribution is scaled by
`a / (a + p(t))` where `p(t)` is its corpus frequency. Common tokens
("import", "def", "class") are down-weighted; rare tokens are amplified.
2. **Top-8 principal component removal.** The eight dominant common-topic
directions of the corpus are fitted once via SVD and projected out of
every chunk/query vector (Arora et al. 2017).
3. **File-path header injection.** Before encoding each chunk, its file
path tokens (e.g. `model_fetcher`, `search`, `engines`) are prepended
15 times, so the file name effectively acts as a "tag" the chunk can be retrieved on.
4. **Search-time file-extension score bias.** Within the top-50 dense
candidates, `.py` chunks get `+0.05` and `.md` chunks get `-0.02`. This
fixes the common failure where README.md and docs/*.md outrank the
actual code (higher topic overlap but lower action relevance).
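
Upgrades 1 and 2 follow the recipe of Arora et al. (2017) and can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the code in `lf4_v2.py`: the function names (`sif_weights`, `sif_embed`, `fit_pcs`, `remove_pcs`) and the random toy data are hypothetical, and the real model reads its token vectors from the LF4-packed table.

```python
import numpy as np

def sif_weights(token_counts, a=1e-4):
    """SIF weight a / (a + p(t)), where p(t) is the token's relative corpus frequency."""
    counts = np.asarray(token_counts, dtype=np.float64)
    p = counts / counts.sum()
    return a / (a + p)

def sif_embed(token_ids, table, weights):
    """Weighted average of the static token vectors for one text."""
    ids = np.asarray(token_ids)
    return (weights[ids][:, None] * table[ids]).mean(axis=0)

def fit_pcs(embs, k=8):
    """Top-k right singular vectors of the (uncentered) embedding matrix."""
    _, _, vt = np.linalg.svd(embs, full_matrices=False)
    return vt[:k]

def remove_pcs(embs, pcs):
    """Subtract each vector's projection onto the fitted directions."""
    return embs - (embs @ pcs.T) @ pcs

# Toy data standing in for the real vocabulary table and corpus (assumption).
rng = np.random.default_rng(0)
table = rng.normal(size=(100, 16))            # 100 tokens x 16 dims
counts = rng.integers(1, 1000, size=100)      # fake corpus token counts
weights = sif_weights(counts)
docs = [rng.integers(0, 100, size=20) for _ in range(50)]
embs = np.stack([sif_embed(d, table, weights) for d in docs])
pcs = fit_pcs(embs, k=8)
cleaned = remove_pcs(embs, pcs)               # same shape as embs
```

The same `pcs` fitted on the corpus would be applied to query vectors, so that chunks and queries live in the same projected space.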
## Benchmark
Corpus: 5,168 chunks × 256-dim across 349 files in the Webscout codebase.

Queries: 51 hand-verified natural-language → file-path pairs.

| Model | R@1 | R@5 | R@10 | MRR | enc@1 | enc@64 | search@64 |
|---|---|---|---|---|---|---|---|
| Vortex-Embed v1 (baseline) | 0.314 | 0.667 | 0.863 | 0.478 | 6.2 ms | 227 ms | 4.2 ms |
| **Vortex-Embed v2 (this)** | **0.745** | **0.843** | **0.882** | **0.779** | 6.4 ms | 107 ms | 9.1 ms |
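**+137% R@1, +63% MRR.** Encode of 64 chunks is **2.1× faster**
(227 ms → 107 ms) thanks to the same `torch.scatter_add_` (ATen) and
sorted `reduceat` kernels used in v1. Single-text encode is essentially
unchanged (6.2 ms → 6.4 ms), while `search@64` rises from 4.2 ms to 9.1 ms.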
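
For readers who want to reproduce the metric columns, the sketch below shows one common way to compute Recall@k and MRR for query → file-path pairs, and re-derives the headline percentages from the table. It assumes each query has exactly one gold file and that the ranked list has already been reduced to file paths; how the actual benchmark maps chunks to files is not stated here, and the helper names are hypothetical.

```python
def recall_at_k(ranked_files, gold, k):
    """Fraction of queries whose gold file appears in the top-k results."""
    hits = sum(1 for ranked, g in zip(ranked_files, gold) if g in ranked[:k])
    return hits / len(gold)

def mrr(ranked_files, gold):
    """Mean reciprocal rank; a query whose gold file is missing contributes 0."""
    total = 0.0
    for ranked, g in zip(ranked_files, gold):
        if g in ranked:
            total += 1.0 / (ranked.index(g) + 1)
    return total / len(gold)

# Toy example: three queries, two results each (made-up file names).
ranked = [["a.py", "b.py"], ["c.py", "a.py"], ["x.py", "y.py"]]
gold = ["a.py", "a.py", "z.py"]
r1 = recall_at_k(ranked, gold, 1)   # 1 of 3 queries hit at rank 1
r2 = recall_at_k(ranked, gold, 2)   # 2 of 3 queries hit within rank 2
m = mrr(ranked, gold)               # (1 + 1/2 + 0) / 3

# Relative gains quoted above, recomputed from the table values.
r1_gain = round((0.745 / 0.314 - 1) * 100)   # 137
mrr_gain = round((0.779 / 0.478 - 1) * 100)  # 63
enc_speedup = round(227 / 107, 1)            # 2.1
```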
## Usage
```python
from huggingface_hub import snapshot_download
from lf4_v2 import VortexEmbedV2

# Download model + tokenizer + config
path = snapshot_download("VTXAI/Vortex-Embed-v2")

# Load
model = VortexEmbedV2.from_pretrained(path)
print(f"vocab={model.vocab_size}, dim={model.dim}, size={model.model_size_mb:.1f} MB")

# Single-query encode
vec = model.encode("find python json parser", normalize=True)
# vec.shape == (256,)

# Batch encode
docs = [
    "def parse_json(s): return json.loads(s)",
    "class WeatherAPI: pass",
    "import requests",
]
doc_embs = model.encode(docs, normalize=True)  # (3, 256)

# Search
scores, indices = model.search(vec, doc_embs, top_k=3)
# scores.shape == (1, 3), indices.shape == (1, 3)
```
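
Because the embeddings are L2-normalized, cosine similarity reduces to a dot product. The sketch below shows what a dense top-k search of this shape could look like in plain NumPy; `cosine_top_k` is a hypothetical stand-in that mirrors the `(scores, indices)` return shapes above, not the implementation of `model.search`, and it leaves out the file-extension bias.

```python
import numpy as np

def cosine_top_k(queries, docs, top_k=10):
    """Return (scores, indices), each of shape (n_queries, top_k).

    Assumes both inputs are already L2-normalized, so the dot product
    equals cosine similarity.
    """
    q = np.atleast_2d(np.asarray(queries, dtype=np.float32))
    d = np.asarray(docs, dtype=np.float32)
    scores = q @ d.T                                  # (n_queries, n_docs)
    top_k = min(top_k, d.shape[0])
    # Sort descending by score; stable sort keeps ties in index order.
    indices = np.argsort(-scores, axis=1, kind="stable")[:, :top_k]
    return np.take_along_axis(scores, indices, axis=1), indices

# Three orthogonal unit "documents" and a query identical to document 1.
demo_docs = np.eye(3)
demo_scores, demo_indices = cosine_top_k([0.0, 1.0, 0.0], demo_docs, top_k=3)
```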
### Codebase retrieval (the real use case)
```python
from pathlib import Path
from lf4_v2 import VortexEmbedV2

# 1. Chunk a codebase (line-based, 40 lines/chunk, 5-line overlap)
CHUNK_LINES, OVERLAP = 40, 5
chunks, texts = [], []
for path in Path("./src").rglob("*.py"):
    lines = path.read_text().splitlines()  # read each file once
    # Step by 35 lines so consecutive chunks share 5 lines
    for start in range(0, len(lines), CHUNK_LINES - OVERLAP):
        chunk = "\n".join(lines[start:start + CHUNK_LINES])
        chunks.append((str(path), start, chunk))
        texts.append(chunk)

# 2. Load + bind paths (this enables file-path header injection)
model = VortexEmbedV2.from_pretrained("VTXAI/Vortex-Embed-v2")
model.set_file_paths([c[0] for c in chunks])  # critical for v2 quality

# 3. Fit IDF on the corpus (one-time, ~200 ms)
token_lists = [model.tokenizer.encode(t).ids for t in texts]
model.fit_idf(token_lists)

# 4. Encode corpus
corpus_emb = model.encode_batch(texts, normalize=True)  # (n, 256)

# 5. Fit top-K PC on the corpus (one-time, ~300 ms)
model.fit_pc(corpus_emb, k=8)

# 6. Re-encode with PC removal applied
corpus_emb = model.encode_batch(texts, normalize=True)

# 7. Query
query = "where do we parse JSON requests"
q_emb = model.encode(query, normalize=True)
scores, indices = model.search(q_emb, corpus_emb, top_k=10)
for rank, (s, i) in enumerate(zip(scores[0], indices[0]), 1):
    file, start, _ = chunks[i]
    print(f"#{rank} ({s:.3f}) {file}:{start + 1}")  # 1-based first line of the chunk
```
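
To make the file-path header injection from step 2 concrete, here is a minimal sketch of what prepending path tokens to a chunk could look like. The helpers `path_header_tokens` and `inject_header` are hypothetical, the example path is made up, and the sketch assumes POSIX-style paths; the model's own splitting and tokenization of paths may differ.

```python
from pathlib import PurePosixPath

def path_header_tokens(file_path):
    """Directory names plus the file stem, with the extension dropped."""
    p = PurePosixPath(file_path)
    parts = [part for part in p.parent.parts if part not in ("/", ".", "..")]
    return parts + [p.stem]

def inject_header(chunk_text, file_path, repeat=15):
    """Prepend the path tokens `repeat` times ahead of the chunk text.

    For a bag-of-tokens static embedding the order of the repeated tokens
    does not matter; only their count shifts the weighted average.
    """
    header = " ".join(path_header_tokens(file_path) * repeat)
    return f"{header}\n{chunk_text}"

tokens = path_header_tokens("search/engines/model_fetcher.py")
text = inject_header("def fetch(): ...", "search/engines/model_fetcher.py")
```

Repeating the tokens is a simple way to raise their weight in the averaged vector, which is why lowering `header_repeat` reduces the influence of the path.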
## Configuration knobs
All retrieval hyperparameters live in `config.json` and can be overridden
at load time:
```python
model = VortexEmbedV2.from_pretrained(
    "VTXAI/Vortex-Embed-v2",
    sif_a=1e-3,        # SIF smoothing (lower = sharper; default 1e-4)
    pc_k=0,            # disable PC removal
    header_repeat=10,  # reduce path-header weight (default 15)
    py_bonus=0.0,      # disable the .py bonus (md_penalty is a separate knob)
)
```
| Knob | Default | Effect |
|---|---|---|
| `sif_a` | 1e-4 | SIF smoothing. Lower = sharper IDF weighting |
| `pc_k` | 8 | Number of principal components to remove |
| `sif_pc` | 1.0 | PC removal strength (0 = disabled) |
| `header_repeat` | 15 | How many times to repeat path-header tokens |
| `py_bonus` | 0.05 | Score boost for `.py` chunks in top-50 |
| `md_penalty` | -0.02 | Score penalty for `.md` chunks in top-50 |
| `bias_top_k` | 50 | Candidate pool size for the bias |
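
The three bias knobs can be read as a small re-ranking pass over the dense candidates. The following sketch is an assumption about how such a pass might work, using the default values from the table; `apply_extension_bias` and the example file names are hypothetical and not taken from `lf4_v2.py`.

```python
import numpy as np

def apply_extension_bias(scores, paths, py_bonus=0.05, md_penalty=-0.02, bias_top_k=50):
    """Re-rank the top `bias_top_k` dense candidates with a per-extension offset.

    Returns (indices, biased_scores), both sorted by biased score, descending.
    """
    scores = np.asarray(scores, dtype=np.float64)
    order = np.argsort(-scores, kind="stable")[:bias_top_k]  # dense candidates
    biased = scores[order].copy()
    for j, idx in enumerate(order):
        if paths[idx].endswith(".py"):
            biased[j] += py_bonus
        elif paths[idx].endswith(".md"):
            biased[j] += md_penalty   # md_penalty is negative by default
    rerank = np.argsort(-biased, kind="stable")
    return order[rerank], biased[rerank]

# A README that narrowly outranks the code file before the bias is applied.
demo_paths = ["README.md", "parser.py", "utils.py"]
new_order, new_scores = apply_extension_bias([0.60, 0.58, 0.10], demo_paths)
```

In this toy case the `.py` chunk moves from 0.58 to 0.63 and the README drops from 0.60 to 0.58, so the code file ends up ranked first.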
## Files
- `model.safetensors` — 4-bit LF4 packed weights (3.7 MB)
- `embedding_scales` (FP16), `embedding_zeros` (FP16) — per-block quantization params
- `config.json` — model + retrieval config
- `tokenizer.json` — HuggingFace fast tokenizer (29 KB)
- `lf4_v2.py` — self-contained model class (drop-in to any project)
## Citation
The SIF/PC technique is from:

> Arora, Liang, Ma (2017). *A Simple but Tough-to-Beat Baseline for Sentence Embeddings.* ICLR.

The LF4 quantization is from:

> Original Vortex-Embed-4.7M model card on [VTXAI/Vortex-Embed-4.7M](https://huggingface.co/VTXAI/Vortex-Embed-4.7M).

If you use v2 in research, please cite the original Vortex-Embed paper and
this AutoResearch loop (see [Vortex-AutoResearch](https://github.com/VortexAI)).