# glyph-embedder-v2

A small (6.4M-parameter) glyph-native sentence encoder. Trained from
scratch with a glyph-only byte-level BPE tokenizer, it produces
256-dim L2-normalized embeddings, loads with sentence-transformers,
and converts cleanly to GGUF for llama.cpp / LM Studio / Ollama
with no Python sidecar.
Glyph is APE's compact operator-based notation for causal/structural
claims; see wimpSquad/glyph-translator-v7
for the prose-to-glyph translator that produces the strings this model
embeds.
## Headline result
Mutation-based semantic-discrimination test (the one that actually distinguishes semantic-tracking from surface-matching):
| model | params | semantic win rate |
|---|---|---|
| Random | — | 0.527 (chance) |
| Jaccard surface 3-gram | — | 0.000 |
| Qwen/Qwen3-Embedding-0.6B (off-the-shelf) | 600M | 0.003 |
| glyph-embedder-v2 | 6.4M | 0.860 |
Paraphrase retrieval MRR@10: 0.956.
The 6.4M-parameter model beats Qwen3-0.6B (~94× larger) on the metric
that distinguishes semantic content from surface form. The lesson:
surface-dominance in trained embedders is a training-DATA problem,
not a capacity problem. v2's contrastive training uses synthetic
hard-negatives generated by meaning-flipping mutations of glyph
(operator inversions, ¬-insertion, ⊥-injection), which forces the
encoder to learn semantics rather than surface n-grams.
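For concreteness, here is a minimal sketch of how a win rate on this kind of mutation test is typically scored. This is a reconstruction from the description above, not the released evaluation code, and the triple format is an assumption:

```python
def semantic_win_rate(model, triples):
    """triples: (anchor, paraphrase, mutated) glyph-string tuples,
    where `mutated` is a meaning-flipped variant of `anchor`.
    A win = the paraphrase outranks the mutation for the anchor."""
    wins = 0
    for anchor, paraphrase, mutated in triples:
        a, p, m = model.encode([anchor, paraphrase, mutated])
        # Embeddings are L2-normalized, so dot product == cosine similarity.
        wins += float(a @ p > a @ m)
    return wins / len(triples)
```

A surface matcher fails this test because the mutated string shares almost all its n-grams with the anchor while the paraphrase shares few.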
## Architecture
- Base: `RobertaModel`, trained from scratch (no public-checkpoint init).
- Tokenizer: glyph-native byte-level BPE, vocab 6,000, GPT-2 regex
  pre-tokenizer. 19/20 commonly-used glyph operators are single-token
  merges; ∴ (rare) splits into two byte tokens.
- Hidden: 256, layers: 6, heads: 4, intermediate: 1024.
- Max sequence: 256 tokens (glyph corpus token-length p99 = 209).
- Pooling: mean.
- Head: L2 normalize → 256-dim output.
Total: 6.41M trainable parameters.
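The reported sizes are consistent with a stock RoBERTa config. A minimal sketch (config values read off the list above; the `max_position_embeddings` offset is RoBERTa's usual +2 padding quirk, assumed rather than confirmed here):

```python
from transformers import RobertaConfig, RobertaModel

config = RobertaConfig(
    vocab_size=6_000,
    hidden_size=256,
    num_hidden_layers=6,
    num_attention_heads=4,
    intermediate_size=1024,
    max_position_embeddings=258,  # 256 usable positions + RoBERTa's 2-slot offset
)
model = RobertaModel(config)
print(sum(p.numel() for p in model.parameters()))  # ~6.4M
```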
## Training
- Stage 1 — MLM pretraining: standard masked language modelling
  on the glyph corpus exported from APE's `corpus.db` (39k unique
  glyph strings, 1.2% Qwen-noise dropped).
- Stage 2 — contrastive fine-tune (v22 recipe): 118,817 triples, a
  mix of `rename` (atom rename) and `paraphrase` (semantic rephrase
  in glyph), with synthetic hard negatives generated by mutation rules
  (operator inversions: →↔←, ⇒↔⇐, ∧↔∨, ≡↔≠, ⊃↔⊂, ↑↔↓, ≥↔≤, ∀↔∃,
  ∈↔∉, ∵↔∴, ⊢↔⊣, ↔↔↛; plus ¬-insertion/removal and ⊥-injection).
  4 epochs, batch 128, lr 2e-5, fp16.
The mutation-based hard-negative recipe is reusable for any contrastive encoder over a structured language (formal logic, code, DSLs, math).
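A minimal sketch of that recipe (the inversion table is transcribed from the list above; the function name and ¬-insertion fallback are illustrative, not the released training code):

```python
import random

# Operator inversion pairs from the v22 recipe, made bidirectional.
INVERSIONS = {"→": "←", "⇒": "⇐", "∧": "∨", "≡": "≠", "⊃": "⊂",
              "↑": "↓", "≥": "≤", "∀": "∃", "∈": "∉", "∵": "∴",
              "⊢": "⊣", "↔": "↛"}
INVERSIONS.update({v: k for k, v in list(INVERSIONS.items())})

def hard_negative(glyph: str) -> str:
    """Flip one operator to invert the claim's meaning; the surface
    form stays almost identical, which is what makes it 'hard'."""
    positions = [i for i, ch in enumerate(glyph) if ch in INVERSIONS]
    if not positions:
        return "¬" + glyph  # fall back to ¬-insertion
    i = random.choice(positions)
    return glyph[:i] + INVERSIONS[glyph[i]] + glyph[i + 1:]
```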
## Usage

### sentence-transformers

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("wimpSquad/glyph-embedder-v2")
emb = model.encode([
    "v5⊃compression↑ ⇒ compression⊖attribution_edges",
    "compression⊕retrieval_layer ⇒ attribution_edges",
])
# emb.shape -> (2, 256), L2-normalized
```
### transformers (manual mean-pool)

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("wimpSquad/glyph-embedder-v2")
model = AutoModel.from_pretrained("wimpSquad/glyph-embedder-v2").eval()

def embed(texts):
    # Tokenize, mean-pool over non-padding positions, then L2-normalize.
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1).float()
    pooled = (out * mask).sum(1) / mask.sum(1).clamp(min=1)
    return torch.nn.functional.normalize(pooled, dim=-1)
```
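A quick check, reusing `embed` from above on the same pair as the sentence-transformers example (the printed score is illustrative, not a reported number):

```python
a, b = embed([
    "v5⊃compression↑ ⇒ compression⊖attribution_edges",
    "compression⊕retrieval_layer ⇒ attribution_edges",
])
print((a @ b).item())  # cosine similarity, since both vectors are unit-norm
```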
## Intended use
- Semantic retrieval over glyph corpora.
- Deduplication / clustering of glyph entries inside APE (see the sketch after this list).
- Drop-in encoder for any structured-language pipeline that wants a small mutation-trained embedder rather than a large general one.
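A sketch of the deduplication use case, reusing the `embed` helper from the transformers snippet above (the 0.95 threshold is an illustrative choice, not a tuned default):

```python
def dedupe(glyphs, threshold=0.95):
    """Greedy near-duplicate filter: keep an entry only if it is not
    too similar to any entry already kept."""
    embs = embed(glyphs)    # (N, 256), unit-norm
    sims = embs @ embs.T    # cosine similarity matrix
    keep = []
    for i in range(len(glyphs)):
        if all(sims[i, j] < threshold for j in keep):
            keep.append(i)
    return [glyphs[i] for i in keep]
```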
## Limitations
- English-glyph only. The tokenizer is built on the APE corpus; arbitrary natural-language prose tokenises poorly.
- Direction markers in scalar contexts. ↑/↓ swaps in long-context
  glyph still trip ~14% of mutation tests: mean-pooling dilutes a
  single-token swap over 30+ tokens. The fix path is a bigger model
  plus a CLS-aware pretraining objective; that's v3 work, not a v2 patch.
- Targeted oversampling backfires. v2.5 doubled the ↑/↓ training
  signal; it fixed those operators but tanked the overall win rate
  0.860 → 0.747. v2 (this checkpoint) is the balanced one.
## License
Apache-2.0.