vec2slug-v1-openai-large

Generate URL slugs directly from text embeddings, without re-feeding source text through a language model. Designed to piggyback on embeddings a system already has for search or deduplication.


Parameters	24.8M
Architecture	Transformer decoder, 6L, d=512
Input	OpenAI `text-embedding-3-small` (1536d)
Vocab	BPE, 5000 subwords
Token F1	0.306
ONNX size	95.1 MiB
Inference (CPU)	~41ms (M-series), ~160ms (budget VPS)

14 to 19× faster and approximately 85× cheaper than a Haiku-class LLM call for the same task, including the cost of computing a fresh embedding. With existing embeddings (the intended use case), approximately 2,000× cheaper.

This is the larger of two variants. It achieves the higher Token F1 of the two, at 2× the inference cost of the smaller model.

Quickstart

# install dependencies
pip install onnxruntime numpy

# or run directly with uv
uv run inference.py . --input embeddings.npy

from inference import OnnxPredictor
import numpy as np

predictor = OnnxPredictor.from_dir(".")

# embeddings: [N, 1536] float32 from OpenAI text-embedding-3-small
embeddings = np.load("embeddings.npy").astype(np.float32)
slugs = predictor.predict(embeddings)
# ["how-neural-networks-learn", "climate-change-solutions", ...]

PyTorch inference (requires torch):

from inference import PyTorchPredictor

predictor = PyTorchPredictor.from_dir(".")
slugs = predictor.predict(embeddings)

Examples

Predictions on held-out test samples (beam search, width 4). The model sees only the 1536-dim embedding, never the source text.

Source text	Reference slug	Predicted slug
Children's book about astronomy and living on Mars	`can-we-live-on-mars`	`can-we-live-on-mars`
Teaching resources for Martin Luther King Jr. Day	`celebrating-martin-luther-king-jr-day`	`celebrating-martin-luther-king-jr-day`
Article about Waldorf education practices	`12-things-may-not-know-waldorf-education`	`10-things-you-didnt-know-about-waldorf-education`

The third example illustrates the typical case: the model captures the topic correctly but diverges in specific wording. The common failure mode is overgeneralization rather than incoherence.

How it works

The model is a prefix-conditioned transformer decoder. A precomputed text embedding is linearly projected into the decoder's hidden space and placed at position 0 as a prefix token. The decoder then autoregressively generates BPE subword tokens that form a kebab-case URL slug.
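The prefix conditioning described above can be illustrated with a minimal pure-Python sketch. It is not the code in `inference.py`: the function names, the toy dimensions, and the random weights are all assumptions chosen for illustration, and the real model uses a 1536-dimensional input and a 512-dimensional hidden space.

```python
import random

def project_embedding(embedding, weight, bias):
    """Linear projection. weight is [d_model][d_in], bias is [d_model]."""
    return [
        sum(w * x for w, x in zip(row, embedding)) + b
        for row, b in zip(weight, bias)
    ]

def build_decoder_input(embedding, token_ids, weight, bias, token_table):
    """Projected embedding at position 0, then the embeddings of the tokens generated so far."""
    prefix = project_embedding(embedding, weight, bias)
    return [prefix] + [token_table[t] for t in token_ids]

# Toy sizes (assumed for illustration only)
D_IN, D_MODEL, VOCAB = 8, 4, 10
rng = random.Random(0)
weight = [[rng.gauss(0, 0.1) for _ in range(D_IN)] for _ in range(D_MODEL)]
bias = [0.0] * D_MODEL
token_table = [[rng.gauss(0, 0.1) for _ in range(D_MODEL)] for _ in range(VOCAB)]
embedding = [rng.gauss(0, 1) for _ in range(D_IN)]

# Two tokens generated so far: the decoder sees 1 prefix position + 2 token positions
sequence = build_decoder_input(embedding, [3, 7], weight, bias, token_table)
print(len(sequence), len(sequence[0]))  # 3 4
```

Because the prefix occupies an ordinary sequence position, every generated token can attend to it through causal self-attention, so no separate cross-attention path is needed in this sketch.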

Beam search uses a bounded additive length reward with score-based optimal stopping (Huang et al. 2017). All decoding parameters are stored in `model.json`.
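As a rough sketch of this decoding scheme, the function below scores a hypothesis as its summed log-probability plus a per-token reward that stops growing after a fixed length, and halts once no unfinished hypothesis could still beat the best finished one. The parameter names (`reward`, `max_reward_len`), their values, and the toy `step_fn` are assumptions for illustration; the shipped values live in `model.json` and the actual implementation may differ.

```python
import math

def beam_search(step_fn, eos, beam_width=4, reward=0.5, max_reward_len=5, max_len=12):
    """step_fn(prefix) returns {token: log_prob} for the next token.

    Score = sum(log_probs) + reward * min(n_tokens, max_reward_len).
    Returns (tokens, score) of the best finished hypothesis, or None.
    """
    def score(n_tokens, logp):
        return logp + reward * min(n_tokens, max_reward_len)

    beam = [((), 0.0)]   # unfinished hypotheses: (tokens, summed log-prob)
    best = None          # best finished hypothesis: (tokens, score)
    for _ in range(max_len):
        candidates = []
        for tokens, logp in beam:
            for tok, lp in step_fn(tokens).items():
                new = tokens + (tok,)
                n = len(new) - 1 if tok == eos else len(new)  # EOS earns no reward
                candidates.append((new, logp + lp, score(n, logp + lp)))
        candidates.sort(key=lambda c: c[2], reverse=True)
        beam = []
        for tokens, logp, s in candidates[:beam_width]:
            if tokens[-1] == eos:
                if best is None or s > best[1]:
                    best = (tokens[:-1], s)
            else:
                beam.append((tokens, logp))
        if not beam:
            break
        # Log-probs can only decrease and the reward is capped, so this bounds
        # the score any descendant of the current beam can still reach.
        bound = max(lp for _, lp in beam) + reward * max_reward_len
        if best is not None and best[1] >= bound:
            break
    return best

def toy_step(prefix):
    """Hypothetical next-token model: prefers 'a', then wants to stop after two tokens."""
    if len(prefix) < 2:
        return {"a": math.log(0.6), "b": math.log(0.3), "<eos>": math.log(0.1)}
    return {"a": math.log(0.05), "b": math.log(0.05), "<eos>": math.log(0.9)}

tokens, final_score = beam_search(toy_step, eos="<eos>")
print("-".join(tokens))  # a-a
```

The cap on the reward is what makes the stopping test sound: without it, a longer hypothesis could always accumulate more reward and no finished hypothesis could be declared final.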

Files

File	Description
`model.onnx`	ONNX model (forward pass only)
`model.json`	Sidecar: vocabulary, beam search config, stopwords
`model.pt`	PyTorch weights (`state_dict`)
`tokenizer.json`	BPE tokenizer (HuggingFace `tokenizers` format)
`inference.py`	Standalone inference script (`uv run` compatible)
`manifest.train.json`	Training configuration and results
`manifest.onnx.json`	Export verification (tolerance, argmax agreement)
`history.train.jsonl`	Training loss/metric curves

Training

Trained on 2.3M documents from FineWeb-Edu with slugs extracted from source URLs. The extraction pipeline filters on language, slug format, Gopher repetition, and token count.
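For illustration, slug extraction with a format filter might look like the sketch below. The regular expression, the list of stripped file extensions, and the function name `extract_slug` are assumptions, not the project's actual pipeline, and the language, repetition, and token-count filters are not shown.

```python
import re
from urllib.parse import urlparse, unquote

# Assumed format rule: lowercase alphanumeric words joined by single hyphens, at least two words
SLUG_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)+")

def extract_slug(url):
    """Return the last path segment if it looks like a kebab-case slug, else None."""
    path = unquote(urlparse(url).path).rstrip("/")
    segment = path.rsplit("/", 1)[-1]
    # Drop a common page extension, if any (assumed list)
    segment = re.sub(r"\.(?:html?|php|aspx?)$", "", segment, flags=re.IGNORECASE)
    return segment if SLUG_RE.fullmatch(segment) else None

print(extract_slug("https://example.com/blog/2020/can-we-live-on-mars.html"))  # can-we-live-on-mars
print(extract_slug("https://example.com/?p=123"))                              # None
```

Rejecting non-conforming URLs, rather than normalizing them, keeps the sketch conservative: every slug it returns already has the format the model is expected to generate.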

BPE vocabulary (5,000 subwords) with `-` as a special token. Trained for 36 epochs with label smoothing (0.1) and position-aware EOS loss weighting. Best checkpoint at step 70,560.
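To make the label-smoothing term concrete, here is a minimal per-token sketch. It assumes the common convention in which the target distribution puts `1 - smoothing` on the true token and spreads `smoothing` uniformly over the whole vocabulary; the training code may use a different variant. The position-aware EOS weighting is left out because its exact form is not described here.

```python
import math

def label_smoothed_nll(logits, target, smoothing=0.1):
    """Cross-entropy against a smoothed target: (1 - s) * one_hot + s / K * uniform."""
    m = max(logits)  # subtract the max for a numerically stable log-softmax
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]
    nll = -log_probs[target]                       # loss against the true token
    uniform = -sum(log_probs) / len(log_probs)     # loss against the uniform distribution
    return (1.0 - smoothing) * nll + smoothing * uniform

confident = [10.0, 0.0, 0.0, 0.0]
print(label_smoothed_nll(confident, target=0, smoothing=0.0))  # close to zero
print(label_smoothed_nll(confident, target=0, smoothing=0.1))  # noticeably larger
```

The second value is larger because the smoothed target penalizes a distribution that puts almost all its mass on one token, which discourages overconfident predictions.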

Evaluation

Evaluated on 5,000 held-out test samples using the full beam search decoding pipeline.

Metric	Value
Token F1 (macro)	0.306
Exact match	2.1%
ROUGE-L	0.284
BERTScore F1	0.872
Validity	100%
Vocab diversity	97.8%

Token F1 splits both slugs on hyphens and computes set-overlap F1 (order ignored). ROUGE-L measures the longest common subsequence and penalizes misordered words. BERTScore computes contextual embedding similarity via roberta-large; the floor is high (~0.82) because short English slugs are not widely separated in that embedding space.
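The first two metrics can be sketched in a few lines. This is an illustrative reimplementation based on the description above, not the evaluation script itself; in particular, the ROUGE-L variant shown is the plain F1 over longest-common-subsequence length, which is an assumption.

```python
def _f1(overlap, n_pred, n_ref):
    if overlap == 0:
        return 0.0
    precision, recall = overlap / n_pred, overlap / n_ref
    return 2 * precision * recall / (precision + recall)

def token_f1(reference, predicted):
    """Set-overlap F1 over hyphen-separated tokens; word order is ignored."""
    ref, pred = set(reference.split("-")), set(predicted.split("-"))
    return _f1(len(ref & pred), len(pred), len(ref))

def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(reference, predicted):
    """F1 over LCS length; unlike token_f1, misordered words are penalized."""
    ref, pred = reference.split("-"), predicted.split("-")
    return _f1(lcs_length(ref, pred), len(pred), len(ref))

ref = "12-things-may-not-know-waldorf-education"
pred = "10-things-you-didnt-know-about-waldorf-education"
print(round(token_f1(ref, pred), 3))                 # 0.533
print(token_f1("a-b", "b-a"), rouge_l_f1("a-b", "b-a"))  # 1.0 0.5
```

On the Waldorf example from the table above, four of the seven reference tokens and four of the eight predicted tokens overlap, which gives 8/15. The swapped pair in the last line shows how the two metrics diverge once word order changes.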

Limitations

Requires precomputed embeddings from OpenAI text-embedding-3-small. Other embedding models will produce poor results.
Trained on English web content. Non-English or domain-specific text may produce generic or inaccurate slugs.
Slugs reflect patterns in the training URLs, which include SEO-influenced and editorially inconsistent sources.
The primary failure mode is overgeneralization: the model captures the topic but may miss specific angles or proper nouns (for example, producing `asm` instead of `wasm` for a WebAssembly article).

Citation

@misc{vec2slug2026,
  title={vec2slug: URL Slug Generation from Text Embeddings},
  author={Mahmoud, Bilal and {HASH}},
  year={2026},
  url={https://github.com/hashintel/labs}
}
