---
license: mit
base_model:
- openai/clip-vit-base-patch32
datasets:
- AbstractPhil/geometric-vocab-512d
tags:
- experiment
---
# Preface
A first experiment to convert clip-vit-base-patch32 into a geometric model using only a classification head.
Below is GPT-5's auto-generated dictation based on the notebook. I have included the entirety of Notebook 6 for posterity.
The question was simple: can linear layers learn geometric structure?
The answer is... maybe. More research required.
# Reasoning
I used the 32-dim geometric vocab, since it seemed to be the weakest with flow-match Euler-discrete, to test the hypothesis that a low-dimensional geometry could in fact substitute for a high-dimensional geometric variation.
For obvious reasons the 512-dim vocab achieved higher accuracy than the 32d, though not for the reason you would think.
The output model is much larger than I wanted, which defeats the purpose of the overall structure, but it's coupled directly at the knee with clip-vit-base-patch32, so I'll prepare a decoupled version here in a bit.
# Why clip-vit instead of just vit?
I believe the clip-vit variations have more utility overall, so I wanted to ensure a fair target was assessed.
# Notebook-6 · Crystal-CLIP CIFAR-100
One-vector image embeddings (HF CLIP) + pentachora vocabulary anchors → cosine-similarity classifier for CIFAR-100.
This repo hosts the trained crystal classification head (+ run configs/metrics) built in Notebook 6.
---
## Overview
- Vision encoder: openai/clip-vit-base-patch32 (Hugging Face transformers), frozen by default.
  Produces exactly one L2-normalized embedding per image (`image_embeds`, dim=512).
- Vocabulary: AbstractPhil/geometric-vocab-512d (pentachora crystals).
  For CIFAR-100 class names, any missing tokens are deterministically synthesized via the unicode path to guarantee 100/100 coverage and preserve class ordering.
- Head: projects both image embeddings (De=512) and role-selected class anchors (Dv=512) into a shared symbol space (crystal_dims=128), L2-normalizes, and computes cosine logits divided by T (temperature).
- Training: Cross-Entropy on CIFAR-100, AdamW, optional AMP, cosine LR with warmup (see the sketch after this list). Best checkpoint is saved and (optionally) pushed to Hugging Face.
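
To make the training recipe concrete, here is a minimal head-only loop. It is a sketch under assumptions: `head`, `encoder`, and `train_loader` are stand-ins for the objects built later in this card, and the learning rate, warmup length, and schedule horizon are illustrative, not the values from the run's CONFIG.

```python
import torch

optimizer = torch.optim.AdamW(head.parameters(), lr=3e-4, weight_decay=1e-2)  # assumed hyperparams
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.1, total_iters=500)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10_000)
scheduler = torch.optim.lr_scheduler.SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[500])
scaler = torch.cuda.amp.GradScaler()

for pixel_values, labels in train_loader:      # CIFAR-100 batches, CLIP-normalized
    pixel_values, labels = pixel_values.to("cuda"), labels.to("cuda")
    with torch.no_grad():                      # encoder stays frozen
        z = encoder(pixel_values=pixel_values).image_embeds
    with torch.autocast("cuda"):               # optional AMP
        loss = torch.nn.functional.cross_entropy(head(z), labels)
    optimizer.zero_grad(set_to_none=True)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()
```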
---
## Model Card
- Task: Image Classification (CIFAR-100)
- Backbone: openai/clip-vit-base-patch32 (vision-only)
- Head: Crystal projection head (image 512→128, anchor 512→128) + cosine logits (temperature)
- Vocabulary: AbstractPhil/geometric-vocab-512d (wordnet_eng split + deterministic unicode synth for gaps)
- Metrics: Top-1 ≈ 80%, Top-3 > 90% (see `<run_name>_best.metrics.json` for exact numbers)
- License: MIT
---
## Files in This Repo
- `<run_name>_best.safetensors` → weights for:
  - `head::*` (crystal classifier head)
  - `encoder::*` (optional, if you chose to unfreeze/fine-tune)
- `<run_name>_best.config.json` → the full CONFIG used for the run
- `<run_name>_best.metrics.json` → summary metrics for the best epoch
- Optionally: `*_latest.*` variants if you pushed latest per-epoch artifacts.

Note: If you only want to ship the head, you can also include a stripped `crystal_head.safetensors` (head-only state_dict). The snippets below handle either format.
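
As one way to handle both layouts, a small helper can branch on the `head::` prefix. This is a sketch, not code from the repo; `path` is whichever checkpoint file you downloaded.

```python
import safetensors.torch

def load_head_state(path: str) -> dict:
    """Return a head state_dict from either a prefixed multi-module checkpoint
    (keys like 'head::proj_img.weight') or a stripped head-only file."""
    state = safetensors.torch.load_file(path)
    if any(k.startswith("head::") for k in state):
        return {k.split("head::", 1)[1]: v for k, v in state.items() if k.startswith("head::")}
    return state  # assume head-only state_dict (e.g. crystal_head.safetensors)
```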
---
## Quickstart (Inference)

1) Load CLIP vision (frozen) and processor

```python
from transformers import AutoImageProcessor, CLIPVisionModelWithProjection

HF_CLIP_ID = "openai/clip-vit-base-patch32"
processor = AutoImageProcessor.from_pretrained(HF_CLIP_ID)
encoder = CLIPVisionModelWithProjection.from_pretrained(HF_CLIP_ID).eval().to("cuda")
```
2) Build the crystal head (same shape as training)

```python
import torch

image_dim = encoder.config.projection_dim  # 512
crystal_dim = 512   # vocab repo uses 512D anchors
sym_dim = 128       # crystal_dims from CONFIG
temperature = 0.07  # from CONFIG

class CrystalHead(torch.nn.Module):
    def __init__(self, De, Dv, Dsym, T):
        super().__init__()
        self.proj_img = torch.nn.Linear(De, Dsym, bias=True)
        self.proj_anc = torch.nn.Linear(Dv, Dsym, bias=False)
        self.T = T
        self.register_buffer("anchors_vocab", torch.empty(0, Dv), persistent=False)

    def set_anchors(self, anchors):  # [C, Dv]
        self.anchors_vocab = anchors.contiguous()

    def forward(self, image_embeds):  # [B, De] (pre-L2-normalized is fine)
        z = torch.nn.functional.normalize(self.proj_img(image_embeds), dim=-1)
        a = torch.nn.functional.normalize(self.proj_anc(self.anchors_vocab), dim=-1)
        return (z @ a.T) / max(1e-8, self.T)  # [B, C] cosine logits / T

head = CrystalHead(De=image_dim, Dv=crystal_dim, Dsym=sym_dim, T=temperature).to("cuda")
```
3) Load weights (handles prefixed multi-module .safetensors)

```python
import safetensors.torch

state = safetensors.torch.load_file("<run_name>_best.safetensors")
head_state = {k.split("head::", 1)[1]: v for k, v in state.items() if k.startswith("head::")}
head.load_state_dict(head_state, strict=True)
```
4) Prepare anchors from your vocabulary (same order as training)

You likely already exported anchors, or can rebuild them exactly as in Notebook 6.

```python
# anchors: torch.Tensor of shape [100, 512], in the training class order,
# e.g. loaded from the anchors_vocab.pt shipped with the head (see the note below).
anchors = torch.load("anchors_vocab.pt")
head.set_anchors(anchors.to("cuda"))
```
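
If you did not export anchors, the sketch below shows one way to rebuild them from the vocab dataset. It is a hedged illustration: the `wordnet_eng` split name comes from this card, but the `token`/`vector` column names and `class_names` (the CIFAR-100 labels in training order) are assumptions, and the deterministic unicode synth for missing tokens is not reproduced here.

```python
import torch
from datasets import load_dataset

# Hypothetical rebuild; column names "token"/"vector" are assumed, not confirmed.
vocab = load_dataset("AbstractPhil/geometric-vocab-512d", split="wordnet_eng")
by_token = {row["token"]: torch.tensor(row["vector"]) for row in vocab}
anchors = torch.stack([by_token[name] for name in class_names])  # [100, 512]
```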
5) Inference on a batch of images (PIL or ndarray)

```python
from PIL import Image

imgs = [Image.open("example_0.png").convert("RGB"), Image.open("example_1.png").convert("RGB")]
batch = processor(images=imgs, return_tensors="pt").to("cuda")
with torch.no_grad():
    out = encoder(pixel_values=batch["pixel_values"], return_dict=True)
    z = torch.nn.functional.normalize(out.image_embeds, dim=-1)  # [B, 512]
    logits = head(z)  # [B, 100]
pred = logits.argmax(dim=-1).tolist()
print("pred:", pred)
```
Note: The head expects the same class order used at training time. Save and ship `class_names.json` (the CIFAR-100 labels) and the exact `anchors_vocab.pt` you used (or rebuild deterministically with the vocab + synth step).
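
A minimal export sketch, assuming `class_names` and `anchors` are the in-memory objects from the training run (file names match the note above):

```python
import json
import torch

# Ship the label order and the exact anchors next to the head weights.
with open("class_names.json", "w") as f:
    json.dump(class_names, f)
torch.save(anchors.cpu(), "anchors_vocab.pt")
```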
---
## Reproduce (Notebook 6)
1. Config only (single source of truth): image size, CLIP stats, dataset, temperature, crystal dims, etc. (see the sketch below).
2. Cell 5 → HF CLIP vision loader (one embedding per image).
3. Cell 6 → Vocabulary interface; synth any missing CIFAR tokens, cache crystals, select role anchors.
4. Cell 8 → Crystal head (image+anchor projections → cosine logits / T).
5. Cell 9 → Trainer (AdamW + AMP + cosine LR). Saves latest/best, pushes to HF if enabled.
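
A hedged sketch of what that single-source CONFIG might contain; the keys and values are guesses assembled from this card, not the exact Notebook 6 dict:

```python
# Illustrative CONFIG sketch; keys and values are assumptions based on this card.
CONFIG = {
    "clip_id": "openai/clip-vit-base-patch32",
    "dataset": "cifar100",
    "image_size": 224,                                   # CLIP ViT-B/32 input size
    "image_mean": [0.48145466, 0.4578275, 0.40821073],   # CLIP normalization stats
    "image_std": [0.26862954, 0.26130258, 0.27577711],
    "vocab_repo": "AbstractPhil/geometric-vocab-512d",
    "crystal_dims": 128,
    "temperature": 0.07,
    "freeze_encoder": True,
}
```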
---
## Acknowledgements
- CLIP ViT-B/32: OpenAI (openai/clip-vit-base-patch32) via Hugging Face transformers.
- Pentachora Vocabulary: AbstractPhil/geometric-vocab-512d.
- Built in Notebook 6 (CONFIG-first, deterministic synth for gaps, head-only training). |