hashintel/vec2slug-v1-openai-small

@@ -15,7 +15,8 @@ pipeline_tag: summarization
 # vec2slug-v1-openai-small
 Generate URL slugs directly from text embeddings, without re-feeding
-source text through a language model.
 | | |
 |---|---|
@@ -27,6 +28,11 @@ source text through a language model.
 | **ONNX size** | 44.3 MiB |
 | **Inference (CPU)** | ~21 ms (M-series), ~89 ms (budget VPS) |
 This is the **smaller** of two variants and is recommended for most deployments: the larger model adds only 0.008 Token F1 at 2× the inference cost.
 See also: [vec2slug-v1-openai-large](https://huggingface.co/hashintel/vec2slug-v1-openai-large)
@@ -61,6 +67,21 @@ predictor = PyTorchPredictor.from_dir(".")
 slugs = predictor.predict(embeddings)
 ```
 ## How it works
 The model is a prefix-conditioned transformer decoder. A precomputed text
@@ -102,9 +123,17 @@ decoding pipeline.
 |---|---|
 | Token F1 (macro) | 0.298 |
 | Exact match | 1.9% |
 | Validity | 100% |
 | Vocab diversity | 97.3% |
 ## Limitations
 - Requires precomputed embeddings from OpenAI `text-embedding-3-small`.
@@ -113,6 +142,9 @@ decoding pipeline.
   may produce generic or inaccurate slugs.
 - Slugs reflect patterns in the training URLs, which include SEO-influenced
   and editorially inconsistent sources.
 ## Links
@@ -123,10 +155,10 @@ decoding pipeline.
 ## Citation
 ```bibtex
-@misc{vec2slug2025,
   title={vec2slug: URL Slug Generation from Text Embeddings},
-  author={Mahmoud, Bilal},
-  year={2025},
   url={https://github.com/hashintel/labs}
 }
 ```

 # vec2slug-v1-openai-small
 Generate URL slugs directly from text embeddings, without re-feeding
+source text through a language model. It is designed to reuse embeddings
+that a system already computes for search or deduplication.
 | | |
 |---|---|
 | **ONNX size** | 44.3 MiB |
 | **Inference (CPU)** | ~21 ms (M-series), ~89 ms (budget VPS) |
+The model is 14 to 19× faster and approximately 85× cheaper than a
+Haiku-class LLM call for the same task, even when the cost of computing a
+fresh embedding is included. With existing embeddings (the intended use
+case), it is approximately 2,000× cheaper.
 This is the **smaller** of two variants and is recommended for most deployments: the larger model adds only 0.008 Token F1 at 2× the inference cost.
 See also: [vec2slug-v1-openai-large](https://huggingface.co/hashintel/vec2slug-v1-openai-large)
 slugs = predictor.predict(embeddings)
 ```
+## Examples
+Predictions on held-out test samples (beam search, width 4). The model
+sees only the 1536-dim embedding, never the source text.
+| Source text | Reference slug | Predicted slug |
+|---|---|---|
+| Children's book about astronomy and living on Mars | `can-we-live-on-mars` | `can-we-live-on-mars` |
+| Teaching resources for Martin Luther King Jr. Day | `celebrating-martin-luther-king-jr-day` | `celebrating-martin-luther-king-jr-day` |
+| Article about Waldorf education practices | `12-things-may-not-know-waldorf-education` | `10-things-you-didnt-know-about-waldorf-education` |
+The third example illustrates the typical case: the model captures the
+topic correctly but diverges in specific wording. The common failure mode
+is overgeneralization rather than incoherence.
 ## How it works
 The model is a prefix-conditioned transformer decoder. A precomputed text
 |---|---|
 | Token F1 (macro) | 0.298 |
 | Exact match | 1.9% |
+| ROUGE-L | 0.277 |
+| BERTScore F1 | 0.869 |
 | Validity | 100% |
 | Vocab diversity | 97.3% |
+Token F1 splits both slugs on hyphens and computes set-overlap F1 (order
+ignored). ROUGE-L measures the longest common subsequence and penalizes
+misordered words. BERTScore computes contextual embedding similarity via
+roberta-large; the floor is high (~0.82) because short English slugs are
+not widely separated in that embedding space.
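The first two metrics are simple enough to sketch in a few lines. The following is a minimal illustration based on the descriptions above, not the project's evaluation code: the function names are hypothetical, and the ROUGE-L variant shown (a balanced F-measure over hyphen-separated tokens) is an assumption.

```python
def tokens(slug: str) -> list[str]:
    """Split a slug on hyphens, dropping empty pieces."""
    return [t for t in slug.split("-") if t]


def token_f1(reference: str, predicted: str) -> float:
    """Set-overlap F1 over hyphen-separated tokens (order ignored)."""
    ref, pred = set(tokens(reference)), set(tokens(predicted))
    overlap = len(ref & pred)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, predicted: str) -> float:
    """ROUGE-L as a balanced F-measure (assumed variant, order-sensitive)."""
    ref, pred = tokens(reference), tokens(predicted)
    lcs = lcs_length(ref, pred)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For the Waldorf example, the reference has 7 tokens, the prediction has 8, and they share 4, so `token_f1` gives 8/15 ≈ 0.53. Reordered slugs such as `a-b-c` and `c-b-a` score 1.0 on Token F1 but only about 0.33 under this ROUGE-L sketch, which is the ordering penalty described above.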
 ## Limitations
 - Requires precomputed embeddings from OpenAI `text-embedding-3-small`.
   may produce generic or inaccurate slugs.
 - Slugs reflect patterns in the training URLs, which include SEO-influenced
   and editorially inconsistent sources.
+- The primary failure mode is overgeneralization: the model captures the
+  topic but may miss specific angles or proper nouns (for example, `asm`
+  instead of `wasm` for a WebAssembly article).
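Because the model only accepts `text-embedding-3-small` vectors, a cheap guard before prediction can catch embeddings from a different model or from a shortened `dimensions` setting. This is a minimal sketch that assumes embeddings arrive as plain Python lists. `check_embeddings` is a hypothetical helper, not part of the package, and a matching length does not prove the vectors came from the right model.

```python
EXPECTED_DIM = 1536  # dimensionality stated in the Examples section


def check_embeddings(embeddings: list[list[float]]) -> None:
    """Raise ValueError unless every row is a 1536-dim numeric vector."""
    if not embeddings:
        raise ValueError("no embeddings given")
    for i, row in enumerate(embeddings):
        if len(row) != EXPECTED_DIM:
            raise ValueError(
                f"row {i}: expected {EXPECTED_DIM} dims, got {len(row)}"
            )
        if not all(isinstance(v, (int, float)) for v in row):
            raise ValueError(f"row {i}: non-numeric value")
```

Calling the helper before passing a batch to the predictor turns a silent quality problem into an explicit error.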
 ## Links
 ## Citation
 ```bibtex
+@misc{vec2slug2026,
   title={vec2slug: URL Slug Generation from Text Embeddings},
+  author={Mahmoud, Bilal and {HASH}},
+  year={2026},
   url={https://github.com/hashintel/labs}
 }
 ```