fix: citation key
README.md
CHANGED
|
@@ -15,7 +15,8 @@ pipeline_tag: summarization
|
|
| 15 |
# vec2slug-v1-openai-small
|
| 16 |
|
| 17 |
Generate URL slugs directly from text embeddings, without re-feeding
|
| 18 |
-
source text through a language model.
|
| 19 |
|
| 20 |
| | |
|
| 21 |
|---|---|
|
|
@@ -27,6 +28,11 @@ source text through a language model.
|
|
| 27 |
| **ONNX size** | 44.3 MiB |
|
| 28 |
| **Inference (CPU)** | ~21ms (M-series), ~89ms (budget VPS) |
|
| 29 |
|
| 30 |
This is the **smaller** of two variants. It is recommended for most deployments: the larger model adds only +0.008 Token F1 at 2x the inference cost.
|
| 31 |
|
| 32 |
See also: [Vec2Slug V1-Openai-Large](https://huggingface.co/hashintel/vec2slug-v1-openai-large)
|
|
@@ -61,6 +67,21 @@ predictor = PyTorchPredictor.from_dir(".")
|
|
| 61 |
slugs = predictor.predict(embeddings)
|
| 62 |
```
|
| 63 |
|
| 64 |
## How it works
|
| 65 |
|
| 66 |
The model is a prefix-conditioned transformer decoder. A precomputed text
|
|
@@ -102,9 +123,17 @@ decoding pipeline.
|
|
| 102 |
|---|---|
|
| 103 |
| Token F1 (macro) | 0.298 |
|
| 104 |
| Exact match | 1.9% |
|
| 105 |
| Validity | 100% |
|
| 106 |
| Vocab diversity | 97.3% |
|
| 107 |
|
| 108 |
## Limitations
|
| 109 |
|
| 110 |
- Requires precomputed embeddings from OpenAI `text-embedding-3-small`.
|
|
@@ -113,6 +142,9 @@ decoding pipeline.
|
|
| 113 |
may produce generic or inaccurate slugs.
|
| 114 |
- Slugs reflect patterns in the training URLs, which include SEO-influenced
|
| 115 |
and editorially inconsistent sources.
|
| 116 |
|
| 117 |
## Links
|
| 118 |
|
|
@@ -123,10 +155,10 @@ decoding pipeline.
|
|
| 123 |
## Citation
|
| 124 |
|
| 125 |
```bibtex
|
| 126 |
-
@misc{
|
| 127 |
title={vec2slug: URL Slug Generation from Text Embeddings},
|
| 128 |
-
author={Mahmoud, Bilal},
|
| 129 |
-
year={
|
| 130 |
url={https://github.com/hashintel/labs}
|
| 131 |
}
|
| 132 |
```
|
|
|
|
| 15 |
# vec2slug-v1-openai-small
|
| 16 |
|
| 17 |
Generate URL slugs directly from text embeddings, without re-feeding
|
| 18 |
+
source text through a language model. Designed to piggyback on embeddings
|
| 19 |
+
a system already has for search or deduplication.
|
| 20 |
|
| 21 |
| | |
|
| 22 |
|---|---|
|
|
|
|
| 28 |
| **ONNX size** | 44.3 MiB |
|
| 29 |
| **Inference (CPU)** | ~21ms (M-series), ~89ms (budget VPS) |
|
| 30 |
|
| 31 |
+
14 to 19× faster and approximately 85× cheaper than a Haiku-class LLM
|
| 32 |
+
call for the same task, including the cost of computing a fresh embedding.
|
| 33 |
+
With existing embeddings (the intended use case), approximately 2,000×
|
| 34 |
+
cheaper.
|
| 35 |
+
|
| 36 |
This is the **smaller** of two variants. It is recommended for most deployments: the larger model adds only +0.008 Token F1 at 2x the inference cost.
|
| 37 |
|
| 38 |
See also: [Vec2Slug V1-Openai-Large](https://huggingface.co/hashintel/vec2slug-v1-openai-large)
|
|
|
|
| 67 |
slugs = predictor.predict(embeddings)
|
| 68 |
```
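The predictor expects embeddings that already exist. For readers who do not yet have them, the following is a minimal sketch of how they might be fetched. It assumes the official `openai` Python client (passed in as `client`), and the helper name `embed_texts` is hypothetical, not part of this repository. The model name and the 1536-dim size come from this README; the input type that `predictor.predict` accepts is not shown here, so a conversion step may be needed.

```python
from typing import Any, List, Sequence

EMBEDDING_MODEL = "text-embedding-3-small"  # model named in this README
EMBEDDING_DIM = 1536  # dimension stated in this README


def embed_texts(client: Any, texts: Sequence[str]) -> List[List[float]]:
    """Return one embedding per input text, in input order.

    `client` is assumed to be an `openai.OpenAI` instance (or any object
    exposing the same `embeddings.create` method).
    """
    response = client.embeddings.create(model=EMBEDDING_MODEL, input=list(texts))
    # Sort by the `index` field so the output order matches the input order.
    items = sorted(response.data, key=lambda item: item.index)
    vectors = [list(item.embedding) for item in items]
    for vector in vectors:
        if len(vector) != EMBEDDING_DIM:
            raise ValueError(f"expected {EMBEDDING_DIM} dims, got {len(vector)}")
    return vectors


# Usage (requires the `openai` package and an API key; not executed here):
#   from openai import OpenAI
#   embeddings = embed_texts(OpenAI(), ["Can we live on Mars?"])
#   slugs = predictor.predict(embeddings)  # may need conversion to an array first
```

In the intended use case these vectors already sit in a search or deduplication index, so this call is skipped entirely.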
|
| 69 |
|
| 70 |
+
## Examples
|
| 71 |
+
|
| 72 |
+
Predictions on held-out test samples (beam search, width 4). The model
|
| 73 |
+
sees only the 1536-dim embedding, never the source text.
|
| 74 |
+
|
| 75 |
+
| Source text | Reference slug | Predicted slug |
|
| 76 |
+
|---|---|---|
|
| 77 |
+
| Children's book about astronomy and living on Mars | `can-we-live-on-mars` | `can-we-live-on-mars` |
|
| 78 |
+
| Teaching resources for Martin Luther King Jr. Day | `celebrating-martin-luther-king-jr-day` | `celebrating-martin-luther-king-jr-day` |
|
| 79 |
+
| Article about Waldorf education practices | `12-things-may-not-know-waldorf-education` | `10-things-you-didnt-know-about-waldorf-education` |
|
| 80 |
+
|
| 81 |
+
The third example illustrates the typical case: the model captures the
|
| 82 |
+
topic correctly but diverges in specific wording. The common failure mode
|
| 83 |
+
is overgeneralization rather than incoherence.
|
| 84 |
+
|
| 85 |
## How it works
|
| 86 |
|
| 87 |
The model is a prefix-conditioned transformer decoder. A precomputed text
|
|
|
|
| 123 |
|---|---|
|
| 124 |
| Token F1 (macro) | 0.298 |
|
| 125 |
| Exact match | 1.9% |
|
| 126 |
+
| ROUGE-L | 0.277 |
|
| 127 |
+
| BERTScore F1 | 0.869 |
|
| 128 |
| Validity | 100% |
|
| 129 |
| Vocab diversity | 97.3% |
|
| 130 |
|
| 131 |
+
Token F1 splits both slugs on hyphens and computes set-overlap F1 (order
|
| 132 |
+
ignored). ROUGE-L measures the longest common subsequence and penalizes
|
| 133 |
+
misordered words. BERTScore computes contextual embedding similarity via
|
| 134 |
+
roberta-large; the floor is high (~0.82) because short English slugs are
|
| 135 |
+
not widely separated in that embedding space.
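The two token-level metrics described above can be sketched in a few lines. This is an illustration under stated assumptions, not the project's evaluation code: the function names are hypothetical, and the ROUGE-L variant shown is the balanced F1 form over hyphen-split tokens, which may differ from the exact configuration behind the reported 0.277.

```python
from typing import List, Sequence


def slug_tokens(slug: str) -> List[str]:
    """Split a slug on hyphens, dropping empty pieces."""
    return [tok for tok in slug.split("-") if tok]


def _f1(matches: int, n_pred: int, n_ref: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when nothing matches."""
    if matches == 0:
        return 0.0
    precision = matches / n_pred
    recall = matches / n_ref
    return 2 * precision * recall / (precision + recall)


def token_f1(reference: str, predicted: str) -> float:
    """Set-overlap F1 over hyphen-split tokens; word order is ignored."""
    ref, pred = set(slug_tokens(reference)), set(slug_tokens(predicted))
    return _f1(len(ref & pred), len(pred), len(ref))


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, via row-by-row DP."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, predicted: str) -> float:
    """LCS-based F1 over token sequences; misordered words lower the score."""
    ref, pred = slug_tokens(reference), slug_tokens(predicted)
    return _f1(lcs_length(ref, pred), len(pred), len(ref))
```

For the Waldorf example in the table above, four tokens (`things`, `know`, `waldorf`, `education`) are shared and appear in the same order, so both functions return 8/15 ≈ 0.53. The metrics diverge only when shared words are reordered: `token_f1("a-b", "b-a")` is 1.0, while `rouge_l_f1("a-b", "b-a")` is 0.5.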
|
| 136 |
+
|
| 137 |
## Limitations
|
| 138 |
|
| 139 |
- Requires precomputed embeddings from OpenAI `text-embedding-3-small`.
|
|
|
|
| 142 |
may produce generic or inaccurate slugs.
|
| 143 |
- Slugs reflect patterns in the training URLs, which include SEO-influenced
|
| 144 |
and editorially inconsistent sources.
|
| 145 |
+
- The primary failure mode is overgeneralization: the model captures the
|
| 146 |
+
topic but may miss specific angles or proper nouns (`asm` instead of
|
| 147 |
+
`wasm` for a WebAssembly article).
|
| 148 |
|
| 149 |
## Links
|
| 150 |
|
|
|
|
| 155 |
## Citation
|
| 156 |
|
| 157 |
```bibtex
|
| 158 |
+
@misc{vec2slug2026,
|
| 159 |
title={vec2slug: URL Slug Generation from Text Embeddings},
|
| 160 |
+
author={Mahmoud, Bilal and {HASH}},
|
| 161 |
+
year={2026},
|
| 162 |
url={https://github.com/hashintel/labs}
|
| 163 |
}
|
| 164 |
```
|