glyph-embedder-v2

A small (6.4M-parameter) glyph-native sentence encoder. Trained from scratch with a glyph-only byte-level BPE tokenizer; produces 256-dim L2-normalized embeddings; loadable with sentence-transformers; converts cleanly to GGUF for llama.cpp / LM Studio / Ollama with no Python sidecar.

Glyph is APE's compact operator-based notation for causal/structural claims; see wimpSquad/glyph-translator-v7 for the prose-to-glyph translator that produces the strings this model embeds.

Headline result

Mutation-based semantic-discrimination test (the one that actually distinguishes semantic-tracking from surface-matching):

model                                        params   semantic win rate
Random                                       n/a      0.527 (chance)
Jaccard surface 3-gram                       n/a      0.000
Qwen/Qwen3-Embedding-0.6B (off-the-shelf)    600M     0.003
glyph-embedder-v2                            6.4M     0.860

Paraphrase retrieval MRR@10: 0.956.

The 6.4M-parameter model beats Qwen3-Embedding-0.6B (~94× larger) on the metric that distinguishes semantic content from surface form. The lesson: surface-dominance in trained embedders is a training-DATA problem, not a capacity problem. v2's contrastive training uses synthetic hard negatives generated by meaning-flipping mutations of glyph (operator inversions, ¬-insertion, -injection), forcing the encoder to learn semantics rather than surface n-grams.
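
How such a win rate can be scored, as a minimal sketch (the triple format and function are illustrative, not the actual eval harness): a trial is won when the anchor embeds closer to its true paraphrase than to the meaning-flipped mutant.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("wimpSquad/glyph-embedder-v2")

def semantic_win_rate(triples):
    # triples: list of (anchor, paraphrase, mutated_negative) glyph strings
    wins = 0
    for anchor, pos, neg in triples:
        a, p, n = model.encode([anchor, pos, neg])
        if a @ p > a @ n:  # embeddings are unit-length, so dot == cosine
            wins += 1
    return wins / len(triples)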

Architecture

  • Base: RobertaModel, trained from scratch (no public-checkpoint init).
  • Tokenizer: glyph-native byte-level BPE, vocab 6,000, GPT-2 regex pre-tokenizer. 19/20 commonly used glyph operators are single-token merges; the remaining (rare) operator splits into two byte tokens.
  • Hidden: 256, layers: 6, heads: 4, intermediate: 1024.
  • Max sequence: 256 tokens (glyph corpus token-length p99 = 209).
  • Pooling: mean.
  • Head: L2 normalize → 256-dim output.

Total: 6.41M trainable parameters (config sketch below).
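
The shape is easy to reconstruct in transformers; a sketch using the values above (the saved config on the Hub is authoritative if any field differs):

from transformers import RobertaConfig, RobertaModel

config = RobertaConfig(
    vocab_size=6000,
    hidden_size=256,
    num_hidden_layers=6,
    num_attention_heads=4,
    intermediate_size=1024,
    max_position_embeddings=256 + 2,  # RoBERTa reserves two offset positions
)
model = RobertaModel(config)
print(sum(p.numel() for p in model.parameters()))  # ~6.4M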

Training

  • Stage 1 — MLM pretraining: standard masked language modelling on the glyph corpus exported from APE's corpus.db (39k unique glyph strings, 1.2% Qwen-noise dropped). A minimal setup sketch follows this list.
  • Stage 2 — contrastive fine-tune (v22 recipe): 118,817 triples, a mix of rename (atom rename) and paraphrase (semantic rephrase in glyph), with synthetic hard negatives generated by mutation rules (operator inversions: →/←, ⇒/⇐, ∧/∨, ≡/≠, ⊃/⊂, ↑/↓, ≥/≤, ∀/∃, ∈/∉, ∵/∴, ⊢/⊣, ↔/↛; plus ¬-insertion/removal and -injection). 4 epochs, batch 128, lr 2e-5, fp16.
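
A minimal sketch of the Stage 1 setup, assuming the corpus is exported to a plain-text file (the filename and hyperparameters here are illustrative, not the published ones):

from datasets import load_dataset
from transformers import (AutoConfig, AutoTokenizer, DataCollatorForLanguageModeling,
                          RobertaForMaskedLM, Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("wimpSquad/glyph-embedder-v2")
config = AutoConfig.from_pretrained("wimpSquad/glyph-embedder-v2")

# hypothetical export of corpus.db: one glyph string per line
ds = load_dataset("text", data_files="glyph_corpus.txt")["train"]
ds = ds.map(lambda ex: tok(ex["text"], truncation=True, max_length=256), batched=True)

model = RobertaForMaskedLM(config)  # fresh weights, no checkpoint init
collator = DataCollatorForLanguageModeling(tok, mlm_probability=0.15)
Trainer(model=model, args=TrainingArguments("mlm_out"),
        data_collator=collator, train_dataset=ds).train()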

The mutation-based hard-negative recipe is reusable for any contrastive encoder over a structured language (formal logic, code, DSLs, math).
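
A minimal sketch of that recipe, assuming single-character operators (the pair table mirrors the inversions listed above):

import random

INVERSIONS = {"→": "←", "←": "→", "⇒": "⇐", "⇐": "⇒", "∧": "∨", "∨": "∧",
              "≡": "≠", "≠": "≡", "⊃": "⊂", "⊂": "⊃", "↑": "↓", "↓": "↑",
              "≥": "≤", "≤": "≥", "∀": "∃", "∃": "∀", "∈": "∉", "∉": "∈",
              "∵": "∴", "∴": "∵", "⊢": "⊣", "⊣": "⊢", "↔": "↛", "↛": "↔"}

def mutate(glyph: str) -> str:
    # Flip one operator occurrence: the surface form stays nearly identical
    # while the meaning inverts, which is exactly what a hard negative needs.
    positions = [i for i, ch in enumerate(glyph) if ch in INVERSIONS]
    if not positions:
        return "¬" + glyph  # fall back to ¬-insertion
    i = random.choice(positions)
    return glyph[:i] + INVERSIONS[glyph[i]] + glyph[i + 1:]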

Usage

sentence-transformers

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("wimpSquad/glyph-embedder-v2")
emb = model.encode([
    "v5⊃compression↑ ⇒ compression⊖attribution_edges",
    "compression⊕retrieval_layer ⇒ attribution_edges",
])
# emb.shape -> (2, 256), L2-normalized
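
Because the embeddings are unit-length, cosine similarity is just a dot product:

scores = emb @ emb.T  # 2x2 cosine-similarity matrix for the strings above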

transformers (manual mean-pool)

import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("wimpSquad/glyph-embedder-v2")
model = AutoModel.from_pretrained("wimpSquad/glyph-embedder-v2").eval()

def embed(texts):
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch).last_hidden_state
    # mean-pool over non-padding tokens, then L2-normalize,
    # matching the sentence-transformers pooling + head
    mask = batch["attention_mask"].unsqueeze(-1).float()
    pooled = (out * mask).sum(1) / mask.sum(1).clamp(min=1)
    return torch.nn.functional.normalize(pooled, dim=-1)
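
A quick shape/normalization check, mirroring the sentence-transformers output:

emb = embed(["v5⊃compression↑ ⇒ compression⊖attribution_edges"])
# emb.shape -> torch.Size([1, 256]), unit-length rows

GGUF / llama.cpp

A hedged sketch of the GGUF path from the intro, assuming a local llama.cpp checkout; script and flag names follow current llama.cpp and can shift between versions:

python convert_hf_to_gguf.py ./glyph-embedder-v2 --outfile glyph-embedder-v2.gguf --outtype f32
./llama-embedding -m glyph-embedder-v2.gguf -p "v5⊃compression↑ ⇒ compression⊖attribution_edges"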

Intended use

  • Semantic retrieval over glyph corpora.
  • Deduplication / clustering of glyph entries inside APE (see the sketch after this list).
  • Drop-in encoder for any structured-language pipeline that wants a small mutation-trained embedder rather than a large general one.
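
For the dedup/clustering use-case, a minimal greedy sketch (the 0.95 cutoff is illustrative, not a tuned threshold):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("wimpSquad/glyph-embedder-v2")

def dedup(entries, threshold=0.95):
    emb = model.encode(entries)  # unit-length rows
    sims = emb @ emb.T           # cosine-similarity matrix
    keep = []
    for i in range(len(entries)):
        # keep entry i only if no earlier kept entry is a near-duplicate
        if all(sims[i, j] < threshold for j in keep):
            keep.append(i)
    return [entries[i] for i in keep]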

Limitations

  • English-glyph only. The tokenizer is built on the APE corpus; arbitrary natural-language prose tokenises poorly.
  • Direction markers in scalar contexts. ↑/↓ swaps in long-context glyph still trip ~14% of mutation tests: mean-pooling dilutes a single-token swap over 30+ tokens. The fix path is a bigger model plus a CLS-aware pretraining objective; that's v3 work, not a v2 patch.
  • Targeted oversampling backfires. v2.5 doubled the ↑/↓ training signal; it fixed those operators but tanked the overall win rate from 0.860 to 0.747. v2 (this checkpoint) is the balanced one.

License

Apache-2.0.
