---
license: apache-2.0
tags:
  - sentence-similarity
  - feature-extraction
  - static-embeddings
  - lf4-quantization
  - retrieval
  - code-search
model_name: Vortex-Embed v2
datasets:
  - VTXAI/Vortex-Embed
metrics:
  - recall@1
  - recall@5
  - recall@10
  - mrr
---

# Vortex-Embed v2

**Retrieval-optimized 4-bit static embeddings for code search.**

Built on [VTXAI/Vortex-Embed-4.7M](https://huggingface.co/VTXAI/Vortex-Embed-4.7M)
(29528 vocab × 256 dim, 4-bit LF4 packed = **4.7 MB** on disk) with a
set of training-free retrieval upgrades that lift R@1 from 0.314 to **0.745**
on the Webscout codebase benchmark (51 hand-verified code queries,
5,168 chunks across 349 files).

## What changed vs the v1 model

All four upgrades are inference-time only — the underlying 4-bit weights are
bit-identical to the v1 artifact. They are:

1. **SIF IDF weighting.** Each token's contribution is scaled by
   `a / (a + p(t))`, where `p(t)` is its relative corpus frequency. Common
   tokens ("import", "def", "class") are down-weighted, while rare tokens
   keep a weight close to 1.
2. **Top-8 principal component removal.** The eight dominant common-topic
   directions of the corpus are fitted once via SVD and projected out of
   every chunk/query vector (Arora et al. 2017).
3. **File-path header injection.** Before encoding each chunk, its file
   path tokens (e.g. `model_fetcher`, `search`, `engines`) are prepended
   15 times, so the file name effectively becomes a "tag" the chunk
   retrieves on.
4. **Search-time file-extension score bias.** Within the top-50 dense
   candidates, `.py` chunks get `+0.05` and `.md` chunks get `-0.02`. This
   fixes the common failure where `README.md` and `docs/*.md` outrank the
   actual code (higher topic overlap but lower action relevance).
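
Upgrades 1 and 2 can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the code in `lf4_v2.py`: the function names, the toy embedding table, and the choice of an uncentered SVD are all hypothetical.

```python
import numpy as np

def sif_weights(token_counts: np.ndarray, a: float = 1e-4) -> np.ndarray:
    """Per-token weight a / (a + p(t)), with p(t) the relative corpus frequency."""
    p = token_counts / token_counts.sum()
    return a / (a + p)

def sif_embed(token_ids, table: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Weighted mean of the static token vectors for one chunk or query."""
    ids = np.asarray(token_ids)
    return (weights[ids, None] * table[ids]).mean(axis=0)

def fit_pcs(embs: np.ndarray, k: int = 8) -> np.ndarray:
    """Top-k right singular vectors of the corpus matrix, shape (k, dim)."""
    _, _, vt = np.linalg.svd(embs, full_matrices=False)
    return vt[:k]

def remove_pcs(embs: np.ndarray, pcs: np.ndarray) -> np.ndarray:
    """Project the fitted directions out of every row."""
    return embs - (embs @ pcs.T) @ pcs

# Toy data standing in for the dequantized table and a tokenized corpus.
rng = np.random.default_rng(0)
table = rng.normal(size=(100, 16))
counts = rng.integers(1, 1000, size=100)
toy_chunks = [rng.integers(0, 100, size=20) for _ in range(50)]

w = sif_weights(counts)
embs = np.stack([sif_embed(c, table, w) for c in toy_chunks])
pcs = fit_pcs(embs, k=8)
cleaned = remove_pcs(embs, pcs)
```

After `remove_pcs`, every row of `cleaned` is orthogonal to the fitted directions. The same `pcs` would be applied to query vectors at search time.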

## Benchmark

Corpus: 5,168 chunks × 256-dim across 349 files in the Webscout codebase.
Queries: 51 hand-verified natural-language → file-path pairs.

| Model | R@1 | R@5 | R@10 | MRR | enc@1 | enc@64 | search@64 |
|---|---|---|---|---|---|---|---|
| Vortex-Embed v1 (baseline) | 0.314 | 0.667 | 0.863 | 0.478 | 6.2 ms | 227 ms | 4.2 ms |
| **Vortex-Embed v2 (this)** | **0.745** | **0.843** | **0.882** | **0.779** | 6.4 ms | 107 ms | 9.1 ms |

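**+137% R@1, +63% MRR.** Encoding 64 chunks is **2.1× faster**
(227 ms → 107 ms), using the same `torch.scatter_add_` (ATen) and sorted
`reduceat` kernels as v1. Single-text encode latency is essentially
unchanged (6.2 ms → 6.4 ms), while search@64 rises from 4.2 ms to 9.1 ms.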
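
For readers who want to reproduce the metric columns, here is a minimal sketch of recall@k and MRR. It assumes a query counts as a hit when its gold file path appears among the top-k retrieved files; the function names and the toy rankings are illustrative, not part of the benchmark harness.

```python
def recall_at_k(ranked_files, gold_files, k):
    """Fraction of queries whose gold file appears in the top-k results."""
    hits = sum(gold in ranked[:k] for ranked, gold in zip(ranked_files, gold_files))
    return hits / len(gold_files)

def mean_reciprocal_rank(ranked_files, gold_files):
    """Mean of 1/rank of the gold file (0 when it is not retrieved)."""
    total = 0.0
    for ranked, gold in zip(ranked_files, gold_files):
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)
    return total / len(gold_files)

# Toy example: three queries, gold file ranked 1st, 2nd, and missing.
ranked = [
    ["a.py", "b.py", "c.py"],
    ["b.py", "a.py", "c.py"],
    ["c.py", "b.py", "a.py"],
]
gold = ["a.py", "a.py", "d.py"]
r1 = recall_at_k(ranked, gold, 1)          # 1/3
mrr = mean_reciprocal_rank(ranked, gold)   # (1 + 1/2 + 0) / 3 = 0.5

# Relative gains quoted above, recomputed from the table.
rel_r1 = (0.745 - 0.314) / 0.314    # about 1.37, i.e. +137%
rel_mrr = (0.779 - 0.478) / 0.478   # about 0.63, i.e. +63%
```

With 51 queries, the R@1 values 0.314 and 0.745 correspond to 16 and 38 queries whose gold file is ranked first.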

## Usage

```python
from huggingface_hub import snapshot_download
from lf4_v2 import VortexEmbedV2

# Download model + tokenizer + config
path = snapshot_download("VTXAI/Vortex-Embed-v2")

# Load
model = VortexEmbedV2.from_pretrained(path)
print(f"vocab={model.vocab_size}, dim={model.dim}, size={model.model_size_mb:.1f} MB")

# Single-query encode
vec = model.encode("find python json parser", normalize=True)
# vec.shape == (256,)

# Batch encode
docs = [
    "def parse_json(s): return json.loads(s)",
    "class WeatherAPI: pass",
    "import requests",
]
doc_embs = model.encode(docs, normalize=True)  # (3, 256)

# Search
scores, indices = model.search(vec, doc_embs, top_k=3)
# scores.shape == (1, 3), indices.shape == (1, 3)
```

### Codebase retrieval (the real use case)

```python
from pathlib import Path
from lf4_v2 import VortexEmbedV2

# 1. Chunk a codebase (line-based, 40 lines/chunk, 5 line overlap)
CHUNK_LINES, OVERLAP = 40, 5
chunks, texts = [], []
for path in Path("./src").rglob("*.py"):
    lines = path.read_text().splitlines()  # read each file once
    for start in range(0, len(lines), CHUNK_LINES - OVERLAP):
        chunk = "\n".join(lines[start:start + CHUNK_LINES])
        chunks.append((str(path), start + 1, chunk))  # 1-based start line
        texts.append(chunk)
        if start + CHUNK_LINES >= len(lines):
            break  # this window already reaches the end of the file

# 2. Load + bind paths (this enables file-path header injection)
model = VortexEmbedV2.from_pretrained("VTXAI/Vortex-Embed-v2")
model.set_file_paths([c[0] for c in chunks])  # critical for v2 quality

# 3. Fit IDF on the corpus (one-time, ~200 ms)
token_lists = [model.tokenizer.encode(t).ids for t in texts]
model.fit_idf(token_lists)

# 4. Encode corpus
corpus_emb = model.encode_batch(texts, normalize=True)  # (n, 256)

# 5. Fit top-K PC on the corpus (one-time, ~300 ms)
model.fit_pc(corpus_emb, k=8)

# 6. Re-encode with PC removal applied
corpus_emb = model.encode_batch(texts, normalize=True)

# 7. Query
query = "where do we parse JSON requests"
q_emb = model.encode(query, normalize=True)
scores, indices = model.search(q_emb, corpus_emb, top_k=10)
for rank, (s, i) in enumerate(zip(scores[0], indices[0]), 1):
    file, line, text = chunks[i]
    print(f"#{rank} ({s:.3f}) {file}:{line}")
```
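
Step 2 above relies on file-path header injection. The sketch below shows one way such a header could be built. It assumes path tokens come from splitting directory names and the file stem on non-identifier characters, and that the whole token sequence is repeated. `path_header` and `with_header` are hypothetical helpers, not functions exported by `lf4_v2.py`.

```python
import re
from pathlib import PurePosixPath

def path_header(path: str, repeat: int = 15) -> str:
    """Build a header string from the directory names and file stem of a path."""
    p = PurePosixPath(path)
    parts = [*p.parts[:-1], p.stem]  # drop the extension from the file name
    tokens = [t for part in parts for t in re.split(r"[^A-Za-z0-9_]+", part) if t]
    return " ".join(tokens * repeat)

def with_header(path: str, chunk: str, repeat: int = 15) -> str:
    """Prepend the repeated path tokens to the chunk text before encoding."""
    return f"{path_header(path, repeat)}\n{chunk}"

header = path_header("search/engines/model_fetcher.py", repeat=2)
text = with_header("search/engines/model_fetcher.py", "def fetch(): ...", repeat=2)
```

Repeating the tokens raises their share of the averaged embedding, which is why `header_repeat` acts as a weight on the file name.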

## Configuration knobs

All retrieval hyperparameters live in `config.json` and can be overridden
at load time:

```python
model = VortexEmbedV2.from_pretrained(
    "VTXAI/Vortex-Embed-v2",
    sif_a=1e-3,           # SIF smoothing (lower = sharper)
    pc_k=0,               # disable PC removal
    header_repeat=10,     # reduce path-header weight
    py_bonus=0.0,         # disable the .py boost (md_penalty still applies)
)
```

| Knob | Default | Effect |
|---|---|---|
| `sif_a` | 1e-4 | SIF smoothing. Lower = sharper IDF weighting |
| `pc_k` | 8 | Number of principal components to remove |
| `sif_pc` | 1.0 | PC removal strength (0 = disabled) |
| `header_repeat` | 15 | How many times to repeat path-header tokens |
| `py_bonus` | 0.05 | Score boost for `.py` chunks in top-50 |
| `md_penalty` | -0.02 | Score penalty for `.md` chunks in top-50 |
| `bias_top_k` | 50 | Candidate pool size for the bias |
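
To make the last three knobs concrete, here is a minimal sketch of how an extension bias could be applied to one query's dense scores. The function name and the exact rerank logic are assumptions for illustration; the shipped `search` method may differ.

```python
import numpy as np

def rerank_with_extension_bias(scores, paths, top_k=10, bias_top_k=50,
                               py_bonus=0.05, md_penalty=-0.02):
    """Adjust scores of the top `bias_top_k` candidates by file extension."""
    scores = np.asarray(scores, dtype=float)
    pool = np.argsort(-scores)[:bias_top_k]  # best dense candidates first
    adjusted = scores[pool].copy()
    for j, idx in enumerate(pool):
        if paths[idx].endswith(".py"):
            adjusted[j] += py_bonus
        elif paths[idx].endswith(".md"):
            adjusted[j] += md_penalty  # md_penalty is negative
    order = np.argsort(-adjusted)[:top_k]
    return pool[order], adjusted[order]

paths = ["README.md", "docs/api.md", "src/parser.py", "src/util.py"]
dense = [0.62, 0.55, 0.58, 0.30]
idx, adj = rerank_with_extension_bias(dense, paths, top_k=3)
```

In this toy case `README.md` has the highest dense score, but after the bias `src/parser.py` moves to rank 1.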

## Files

- `model.safetensors` — 4-bit LF4 packed weights (3.7 MB)
- `embedding_scales` (FP16), `embedding_zeros` (FP16) — per-block quantization params
- `config.json` — model + retrieval config
- `tokenizer.json` — HuggingFace fast tokenizer (29 KB)
- `lf4_v2.py` — self-contained model class (drop-in to any project)
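
As a quick sanity check on the packed weight size, the arithmetic below assumes two 4-bit codes per byte and ignores the FP16 scales and zeros, whose size depends on a block size not stated here.

```python
vocab_size, dim, bits = 29528, 256, 4

# Two 4-bit codes fit in one byte.
packed_bytes = vocab_size * dim * bits // 8
packed_mb = packed_bytes / 1e6  # decimal megabytes

print(f"{packed_bytes:,} bytes = {packed_mb:.2f} MB")
```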

## Citation

The SIF/PC technique is from:
> Arora, Liang, Ma (2017). *A Simple but Tough-to-Beat Baseline for Sentence Embeddings.* ICLR.

The LF4 quantization is described in:
> The original Vortex-Embed-4.7M model card, [VTXAI/Vortex-Embed-4.7M](https://huggingface.co/VTXAI/Vortex-Embed-4.7M).

If you use v2 in research, please cite the original Vortex-Embed paper and
this AutoResearch loop (see [Vortex-AutoResearch](https://github.com/VortexAI)).