---
license: apache-2.0
tags:
- sentence-similarity
- feature-extraction
- static-embeddings
- lf4-quantization
- retrieval
- code-search
model_name: Vortex-Embed v2
datasets:
- VTXAI/Vortex-Embed
metrics:
- recall@1
- recall@5
- recall@10
- mrr
---

# Vortex-Embed v2

**Retrieval-optimized 4-bit static embeddings for code search.**

Built on [VTXAI/Vortex-Embed-4.7M](https://huggingface.co/VTXAI/Vortex-Embed-4.7M)
(29528 vocab × 256 dim, 4-bit LF4 packed = **4.7 MB** on disk) with a
set of training-free retrieval upgrades that lift R@1 from 0.314 → **0.745**
on the Webscout codebase benchmark (51 hand-verified code queries,
5,168 chunks across 349 files).

## What changed vs the v1 model

All four upgrades are inference-time only: the underlying 4-bit weights are
bit-identical to the v1 artifact. They are:

1. **SIF IDF weighting.** Each token's contribution is scaled by
   `a / (a + p(t))`, where `p(t)` is its relative corpus frequency. Common
   tokens ("import", "def", "class") are down-weighted; rare tokens keep a
   weight close to 1, so they dominate the pooled vector.
2. **Top-8 principal component removal.** The eight dominant common-topic
   directions of the corpus are fitted once via SVD and projected out of
   every chunk/query vector (Arora et al. 2017).
3. **File-path header injection.** Before encoding each chunk, its file
   path tokens (e.g. `model_fetcher`, `search`, `engines`) are prepended
   ×15. The file name effectively becomes a "tag" the chunk retrieves on.
4. **Search-time file-extension score bias.** Within the top-50 dense
   candidates, `.py` chunks get `+0.05` and `.md` chunks get `-0.02`. This
   fixes the common failure where `README.md` and `docs/*.md` outrank the
   actual code (higher topic overlap but lower action relevance).
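
The first two upgrades can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the code in `lf4_v2.py`: the helper names (`sif_weights`, `embed`, `fit_pcs`, `remove_pcs`) are hypothetical, `table` stands in for the dequantized embedding matrix, and pooling divides by text length as in Arora et al. (2017).

```python
import numpy as np


def sif_weights(token_counts, a=1e-4):
    """SIF weight a / (a + p(t)) for every vocabulary entry."""
    counts = np.asarray(token_counts, dtype=np.float64)
    p = counts / counts.sum()  # relative corpus frequency p(t)
    return a / (a + p)


def embed(token_ids, table, weights):
    """SIF-weighted average of the static token vectors of one text."""
    ids = np.asarray(token_ids)
    return (weights[ids][:, None] * table[ids]).sum(axis=0) / len(ids)


def fit_pcs(embs, k=8):
    """Top-k right singular vectors of the embedding matrix, shape (k, dim)."""
    _, _, vt = np.linalg.svd(embs, full_matrices=False)
    return vt[:k]


def remove_pcs(embs, pcs):
    """Subtract each vector's projection onto the fitted directions."""
    return embs - (embs @ pcs.T) @ pcs


# Toy corpus (assumed data): 6-token vocabulary, 4-dim vectors, token 0 is very common
rng = np.random.default_rng(0)
table = rng.normal(size=(6, 4))
weights = sif_weights([900, 50, 30, 15, 4, 1], a=1e-3)
embs = np.stack([embed(ids, table, weights) for ids in ([0, 1, 5], [0, 2, 3], [0, 0, 4])])
cleaned = remove_pcs(embs, fit_pcs(embs, k=1))
```

Because the fitted directions are orthonormal, the cleaned vectors have zero projection onto each of them. L2-normalize afterwards if scores are cosine similarities.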

## Benchmark

Corpus: 5,168 chunks × 256-dim across 349 files in the Webscout codebase.
Queries: 51 hand-verified natural-language → file-path pairs.

| | Model | R@1 | R@5 | R@10 | MRR | enc@1 | enc@64 | search@64 | |
| |---|---|---|---|---|---|---|---| |
| | Vortex-Embed v1 (baseline) | 0.314 | 0.667 | 0.863 | 0.478 | 6.2 ms | 227 ms | 4.2 ms | |
| | **Vortex-Embed v2 (this)** | **0.745** | **0.843** | **0.882** | **0.779** | 6.4 ms | 107 ms | 9.1 ms | |

**+137% R@1, +63% MRR.** Encoding 64 chunks is **2.1× faster**
(227 ms → 107 ms) thanks to the same `torch.scatter_add_` (ATen) and sorted
`reduceat` kernels used in v1. enc@1 is roughly flat (6.2 ms → 6.4 ms),
while search@64 roughly doubles (4.2 ms → 9.1 ms).
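
For reference, the accuracy columns and the headline percentages can be computed as below. This sketch assumes one gold file per query (matching the query → file-path pairs above) and a ranked list of retrieved file paths per query. The function names are illustrative, and the benchmark's own evaluation script may differ, for example in how it collapses several chunks from the same file.

```python
def recall_at_k(ranked, gold, k):
    """Fraction of queries whose gold file appears in the top-k results."""
    hits = sum(1 for results, g in zip(ranked, gold) if g in results[:k])
    return hits / len(gold)


def mrr(ranked, gold):
    """Mean reciprocal rank of the gold file (0 when it is not retrieved)."""
    total = 0.0
    for results, g in zip(ranked, gold):
        if g in results:
            total += 1.0 / (results.index(g) + 1)
    return total / len(gold)


def relative_gain(old, new):
    """Relative improvement in percent."""
    return (new / old - 1.0) * 100.0


# Toy example (assumed data): three queries, two retrieved files each
ranked = [["a.py", "b.py"], ["c.py", "d.py"], ["e.py", "f.py"]]
gold = ["a.py", "d.py", "z.py"]
print(recall_at_k(ranked, gold, k=1))  # one of three queries hits at rank 1
print(mrr(ranked, gold))               # (1 + 1/2 + 0) / 3

# Headline numbers from the table: R@1 0.314 -> 0.745, MRR 0.478 -> 0.779
print(round(relative_gain(0.314, 0.745)), round(relative_gain(0.478, 0.779)))
```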

## Usage

```python
from huggingface_hub import snapshot_download
from lf4_v2 import VortexEmbedV2

# Download model + tokenizer + config
path = snapshot_download("VTXAI/Vortex-Embed-v2")

# Load
model = VortexEmbedV2.from_pretrained(path)
print(f"vocab={model.vocab_size}, dim={model.dim}, size={model.model_size_mb:.1f} MB")

# Single-query encode
vec = model.encode("find python json parser", normalize=True)
# vec.shape == (256,)

# Batch encode
docs = [
    "def parse_json(s): return json.loads(s)",
    "class WeatherAPI: pass",
    "import requests",
]
doc_embs = model.encode(docs, normalize=True)  # (3, 256)

# Search
scores, indices = model.search(vec, doc_embs, top_k=3)
# scores.shape == (1, 3), indices.shape == (1, 3)
```

### Codebase retrieval (the real use case)

```python
from pathlib import Path
from lf4_v2 import VortexEmbedV2

# 1. Chunk a codebase (line-based, 40 lines/chunk, 5 line overlap)
CHUNK_LINES, OVERLAP = 40, 5
chunks, texts = [], []
for path in Path("./src").rglob("*.py"):
    # Read each file once instead of once per line
    lines = path.read_text(encoding="utf-8", errors="ignore").splitlines()
    # Step by 35 lines so consecutive chunks share 5 lines
    for start in range(0, len(lines), CHUNK_LINES - OVERLAP):
        chunk = "\n".join(lines[start:start + CHUNK_LINES])
        chunks.append((str(path), start, chunk))
        texts.append(chunk)

# 2. Load + bind paths (this enables file-path header injection)
model = VortexEmbedV2.from_pretrained("VTXAI/Vortex-Embed-v2")
model.set_file_paths([c[0] for c in chunks])  # critical for v2 quality

# 3. Fit IDF on the corpus (one-time, ~200 ms)
token_lists = [model.tokenizer.encode(t).ids for t in texts]
model.fit_idf(token_lists)

# 4. Encode corpus
corpus_emb = model.encode_batch(texts, normalize=True)  # (n, 256)

# 5. Fit top-K PC on the corpus (one-time, ~300 ms)
model.fit_pc(corpus_emb, k=8)

# 6. Re-encode with PC removal applied
corpus_emb = model.encode_batch(texts, normalize=True)

# 7. Query
query = "where do we parse JSON requests"
q_emb = model.encode(query, normalize=True)
scores, indices = model.search(q_emb, corpus_emb, top_k=10)
for rank, (s, i) in enumerate(zip(scores[0], indices[0]), 1):
    file, start, text = chunks[i]
    print(f"#{rank} ({s:.3f}) {file}:{start + 1}")  # start is 0-based
```

## Configuration knobs

All retrieval hyperparameters live in `config.json` and can be overridden
at load time:

```python
model = VortexEmbedV2.from_pretrained(
    "VTXAI/Vortex-Embed-v2",
    sif_a=1e-3,        # SIF smoothing (lower = sharper)
    pc_k=0,            # disable PC removal
    header_repeat=10,  # reduce path-header weight
    py_bonus=0.0,      # disable the .py bonus (the .md penalty is a separate knob, md_penalty)
)
```

| | Knob | Default | Effect | |
| |---|---|---| |
| | `sif_a` | 1e-4 | SIF smoothing. Lower = sharper IDF weighting | |
| | `pc_k` | 8 | Number of principal components to remove | |
| | `sif_pc` | 1.0 | PC removal strength (0 = disabled) | |
| | `header_repeat` | 15 | How many times to repeat path-header tokens | |
| | `py_bonus` | 0.05 | Score boost for `.py` chunks in top-50 | |
| | `md_penalty` | -0.02 | Score penalty for `.md` chunks in top-50 | |
| | `bias_top_k` | 50 | Candidate pool size for the bias | |
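
The last three knobs describe a rerank over the dense candidates. A minimal sketch of that step, assuming `scores` holds one query's similarities against all chunks and `paths` the matching file paths. `apply_extension_bias` is a hypothetical helper, not part of the `lf4_v2` API, and the actual implementation may differ:

```python
import numpy as np


def apply_extension_bias(scores, paths, top_k=10, bias_top_k=50,
                         py_bonus=0.05, md_penalty=-0.02):
    """Rerank the top `bias_top_k` dense candidates with a per-extension offset."""
    scores = np.asarray(scores, dtype=np.float64)
    pool = np.argsort(-scores)[:bias_top_k]  # indices of the best dense candidates
    biased = scores[pool].copy()
    for j, idx in enumerate(pool):
        if paths[idx].endswith(".py"):
            biased[j] += py_bonus    # boost code chunks
        elif paths[idx].endswith(".md"):
            biased[j] += md_penalty  # md_penalty is negative, so this lowers the score
    order = np.argsort(-biased)[:top_k]
    return biased[order], pool[order]


# Toy example (assumed data): the README narrowly outranks the code before the bias
paths = ["README.md", "parser.py", "notes.txt"]
new_scores, new_indices = apply_extension_bias([0.80, 0.78, 0.50], paths, top_k=3)
print([paths[i] for i in new_indices])
```

Chunks outside the candidate pool are never promoted, so the bias can only reorder results that dense retrieval already surfaced.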

## Files

- `model.safetensors`: 4-bit LF4 packed weights (3.7 MB)
- `embedding_scales` (FP16), `embedding_zeros` (FP16): per-block quantization params
- `config.json`: model + retrieval config
- `tokenizer.json`: HuggingFace fast tokenizer (29 KB)
- `lf4_v2.py`: self-contained model class (drop-in to any project)
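
To give a feel for what "4-bit packed weights with per-block scales and zeros" means, here is a generic sketch of 4-bit packing and dequantization. It is not the LF4 layout, which this card does not specify: the nibble order, the block size, and the `(code - zero) * scale` formula are all assumptions, and the function names are hypothetical.

```python
import numpy as np


def pack_4bit(codes):
    """Pack an even-length array of 4-bit codes (0..15) into bytes, low nibble first."""
    codes = np.asarray(codes, dtype=np.uint8)
    return (codes[0::2] | (codes[1::2] << 4)).astype(np.uint8)


def unpack_4bit(packed):
    """Inverse of pack_4bit: two codes per byte."""
    packed = np.asarray(packed, dtype=np.uint8)
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed & 0x0F
    out[1::2] = packed >> 4
    return out


def dequantize(packed, scales, zeros, block_size):
    """Assumed scheme: value = (code - zero) * scale, one scale/zero per block."""
    codes = unpack_4bit(packed).astype(np.float32).reshape(-1, block_size)
    scales = np.asarray(scales, dtype=np.float32)[:, None]
    zeros = np.asarray(zeros, dtype=np.float32)[:, None]
    return ((codes - zeros) * scales).reshape(-1)


# Toy example (assumed data): 8 codes in two blocks of 4
codes = np.array([1, 15, 0, 7, 8, 3, 12, 5])
packed = pack_4bit(codes)  # 4 bytes
values = dequantize(packed, scales=[0.5, 0.25], zeros=[8, 8], block_size=4)

# Packed codes alone for a 29528 x 256 table at 4 bits per weight
packed_bytes = 29528 * 256 // 2
```

At two codes per byte, the packed codes for the 29528 × 256 table take 3,779,584 bytes before the scales and zeros are added.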

## Citation

The SIF/PC technique is from:
> Arora, Liang, Ma (2017). *A Simple but Tough-to-Beat Baseline for Sentence Embeddings.* ICLR.

The LF4 quantization is from:
> Original Vortex-Embed-4.7M model card on [VTXAI/Vortex-Embed-4.7M](https://huggingface.co/VTXAI/Vortex-Embed-4.7M).

If you use v2 in research, please cite the original Vortex-Embed paper and
this AutoResearch loop (see [Vortex-AutoResearch](https://github.com/VortexAI)).