indietyp committed on
Commit
07e9b56
·
verified ·
1 Parent(s): e3d8a3a

fix: citation key

Files changed (1)
  1. README.md +36 -4
README.md CHANGED
@@ -15,7 +15,8 @@ pipeline_tag: summarization
  # vec2slug-v1-openai-small

  Generate URL slugs directly from text embeddings, without re-feeding
- source text through a language model.

  | | |
  |---|---|
@@ -27,6 +28,11 @@ source text through a language model.
  | **ONNX size** | 44.3 MiB |
  | **Inference (CPU)** | ~21ms (M-series), ~89ms (budget VPS) |

  This is the **smaller** of two variants. It is recommended for most deployments: the larger model adds only +0.008 Token F1 at 2x the inference cost.

  See also: [Vec2Slug V1-Openai-Large](https://huggingface.co/hashintel/vec2slug-v1-openai-large)
@@ -61,6 +67,21 @@ predictor = PyTorchPredictor.from_dir(".")
  slugs = predictor.predict(embeddings)
  ```

  ## How it works

  The model is a prefix-conditioned transformer decoder. A precomputed text
@@ -102,9 +123,17 @@ decoding pipeline.
  |---|---|
  | Token F1 (macro) | 0.298 |
  | Exact match | 1.9% |
  | Validity | 100% |
  | Vocab diversity | 97.3% |

  ## Limitations

  - Requires precomputed embeddings from OpenAI `text-embedding-3-small`.
@@ -113,6 +142,9 @@ decoding pipeline.
  may produce generic or inaccurate slugs.
  - Slugs reflect patterns in the training URLs, which include SEO-influenced
  and editorially inconsistent sources.

  ## Links
@@ -123,10 +155,10 @@ decoding pipeline.
  ## Citation

  ```bibtex
- @misc{vec2slug2025,
  title={vec2slug: URL Slug Generation from Text Embeddings},
- author={Mahmoud, Bilal},
- year={2025},
  url={https://github.com/hashintel/labs}
  }
  ```
 
  # vec2slug-v1-openai-small

  Generate URL slugs directly from text embeddings, without re-feeding
+ source text through a language model. Designed to piggyback on embeddings
+ a system already has for search or deduplication.

  | | |
  |---|---|
 
  | **ONNX size** | 44.3 MiB |
  | **Inference (CPU)** | ~21ms (M-series), ~89ms (budget VPS) |

+ 14 to 19× faster and approximately 85× cheaper than a Haiku-class LLM
+ call for the same task, including the cost of computing a fresh embedding.
+ With existing embeddings (the intended use case), approximately 2,000×
+ cheaper.
+
  This is the **smaller** of two variants. It is recommended for most deployments: the larger model adds only +0.008 Token F1 at 2x the inference cost.

  See also: [Vec2Slug V1-Openai-Large](https://huggingface.co/hashintel/vec2slug-v1-openai-large)
 
  slugs = predictor.predict(embeddings)
  ```

+ ## Examples
+
+ Predictions on held-out test samples (beam search, width 4). The model
+ sees only the 1536-dim embedding, never the source text.
+
+ | Source text | Reference slug | Predicted slug |
+ |---|---|---|
+ | Children's book about astronomy and living on Mars | `can-we-live-on-mars` | `can-we-live-on-mars` |
+ | Teaching resources for Martin Luther King Jr. Day | `celebrating-martin-luther-king-jr-day` | `celebrating-martin-luther-king-jr-day` |
+ | Article about Waldorf education practices | `12-things-may-not-know-waldorf-education` | `10-things-you-didnt-know-about-waldorf-education` |
+
+ The third example illustrates the typical case: the model captures the
+ topic correctly but diverges in specific wording. The common failure mode
+ is overgeneralization rather than incoherence.
+
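As a rough illustration of how far the third prediction sits from its reference, the sketch below scores the three table rows with a hyphen-split, set-overlap Token F1, following the metric description added to the README in this commit. The function name `token_f1` is hypothetical, and this is not the project's evaluation code.

```python
def token_f1(reference: str, predicted: str) -> float:
    """Set-overlap F1 between two slugs split on hyphens (order ignored)."""
    ref_tokens = set(reference.split("-"))
    pred_tokens = set(predicted.split("-"))
    overlap = len(ref_tokens & pred_tokens)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# (reference, predicted) pairs copied from the Examples table.
rows = [
    ("can-we-live-on-mars", "can-we-live-on-mars"),
    ("celebrating-martin-luther-king-jr-day",
     "celebrating-martin-luther-king-jr-day"),
    ("12-things-may-not-know-waldorf-education",
     "10-things-you-didnt-know-about-waldorf-education"),
]

scores = [token_f1(ref, pred) for ref, pred in rows]
for (ref, pred), score in zip(rows, scores):
    print(f"{score:.3f}  {pred}")
```

The first two rows score 1.0. The third shares four tokens (`things`, `know`, `waldorf`, `education`) out of 7 reference and 8 predicted tokens, giving 8/15 ≈ 0.533.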
  ## How it works

  The model is a prefix-conditioned transformer decoder. A precomputed text
 
  |---|---|
  | Token F1 (macro) | 0.298 |
  | Exact match | 1.9% |
+ | ROUGE-L | 0.277 |
+ | BERTScore F1 | 0.869 |
  | Validity | 100% |
  | Vocab diversity | 97.3% |

+ Token F1 splits both slugs on hyphens and computes set-overlap F1 (order
+ ignored). ROUGE-L measures the longest common subsequence and penalizes
+ misordered words. BERTScore computes contextual embedding similarity via
+ roberta-large; the floor is high (~0.82) because short English slugs are
+ not widely separated in that embedding space.
+
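The contrast between the order-insensitive Token F1 and the order-sensitive ROUGE-L can be seen in a small sketch. It assumes ROUGE-L is reported as the F1 of LCS-based precision and recall over hyphen-split tokens, which the README does not state. The function names and the reordered slug `mars-on-live` are hypothetical.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, predicted: str) -> float:
    """Assumed form: F1 of LCS precision and recall over hyphen-split tokens."""
    ref, pred = reference.split("-"), predicted.split("-")
    lcs = lcs_length(ref, pred)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def token_f1(reference: str, predicted: str) -> float:
    """Set-overlap F1 over hyphen-split tokens (order ignored)."""
    ref, pred = set(reference.split("-")), set(predicted.split("-"))
    overlap = len(ref & pred)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Same tokens, reversed order (hypothetical prediction).
reordered_token_f1 = token_f1("live-on-mars", "mars-on-live")
reordered_rouge_l = rouge_l_f1("live-on-mars", "mars-on-live")
print(reordered_token_f1, round(reordered_rouge_l, 3))
```

With the same three tokens in reverse order, Token F1 stays at 1.0, while the longest common subsequence has length 1 and the assumed ROUGE-L F1 drops to 1/3.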
  ## Limitations

  - Requires precomputed embeddings from OpenAI `text-embedding-3-small`.
 
  may produce generic or inaccurate slugs.
  - Slugs reflect patterns in the training URLs, which include SEO-influenced
  and editorially inconsistent sources.
+ - The primary failure mode is overgeneralization: the model captures the
+ topic but may miss specific angles or proper nouns (`asm` instead of
+ `wasm` for a WebAssembly article).

  ## Links
 
  ## Citation

  ```bibtex
+ @misc{vec2slug2026,
  title={vec2slug: URL Slug Generation from Text Embeddings},
+ author={Mahmoud, Bilal and {HASH}},
+ year={2026},
  url={https://github.com/hashintel/labs}
  }
  ```