---
license: mit
base_model:
- openai/clip-vit-base-patch32
datasets:
- AbstractPhil/geometric-vocab-512d
tags:
- experiment
---

# Preface

A first experiment in converting clip-vit-base-patch32 into a geometric model using only a classification head.

Below is GPT 5's auto-generated write-up based on the notebook. I have included the entire Notebook 6 for posterity.

The question was simple: can linear layers learn geometry?

The answer is... maybe. More research is required.

# Reasoning

I used the 32-dim geometric vocab, as it seemed to be the weakest with flow-match Euler-discrete, to test the hypothesis that a low-dimensional geometry could in fact substitute for a higher-dimensional one.

Unsurprisingly, the 512-dim vocab reached a higher accuracy than the 32d; however, not for the reason you would think.

The output model is much larger than I wanted, which defeats the purpose of the overall structure, but it is paired directly at the knee with clip-vit-base-patch32, so I'll prepare a decoupled version here in a bit.


# Why clip-vit instead of just vit?

I believe the clip-vit variants have more utility overall, so I wanted to ensure a fair target was assessed.

# Notebook-6 · Crystal-CLIP CIFAR-100

One-vector image embeddings (HF CLIP) + pentachora vocabulary anchors → cosine-similarity classifier for CIFAR-100.
This repo hosts the trained crystal classification head (+ run configs/metrics) built in Notebook 6.
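
As a compact restatement of the classifier, assuming the head defined in the QUICKSTART section of this card (an affine image projection and a bias-free anchor projection), the logit for image $b$ and class $c$ can be written as follows. The symbols $x_b$, $v_c$, $W_{\text{img}}$, $b_{\text{img}}$ and $W_{\text{anc}}$ are notation introduced here for illustration, not names from the notebook.

```latex
z_b = \frac{W_{\text{img}}\, x_b + b_{\text{img}}}{\lVert W_{\text{img}}\, x_b + b_{\text{img}} \rVert_2},
\qquad
a_c = \frac{W_{\text{anc}}\, v_c}{\lVert W_{\text{anc}}\, v_c \rVert_2},
\qquad
\ell_{b,c} = \frac{z_b^{\top} a_c}{T}
```

Here $x_b \in \mathbb{R}^{512}$ is the CLIP image embedding, $v_c \in \mathbb{R}^{512}$ is the vocabulary anchor for class $c$, both projection matrices map into $\mathbb{R}^{128}$, and $T$ is the temperature. Because $z_b$ and $a_c$ are unit vectors, each logit lies in $[-1/T,\, 1/T]$.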

---

OVERVIEW
- Vision encoder: openai/clip-vit-base-patch32 (Hugging Face transformers), frozen by default.
  Produces exactly one L2-normalized embedding per image (image_embeds, dim=512).
- Vocabulary: AbstractPhil/geometric-vocab-512d (pentachora crystals).
  For CIFAR-100 class names, any missing tokens are deterministically synthesized via the unicode path to guarantee 100/100 coverage and preserve class ordering.
- Head: projects both image embeddings (De=512) and role-selected class anchors (Dv=512) into a shared symbol space (crystal_dims=128), L2-normalizes, and computes cosine logits divided by T (temperature).
- Training: Cross-Entropy on CIFAR-100, AdamW, optional AMP, cosine LR with warmup. Best checkpoint is saved and (optionally) pushed to Hugging Face.
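
The "cosine LR with warmup" part of the training bullet can be sketched in plain Python. This is a minimal illustration under assumptions: the function name `lr_at_step`, the linear warmup shape, and the `min_lr` floor are hypothetical and may differ from what Notebook 6 actually does.

```python
import math


def lr_at_step(step, total_steps, warmup_steps, base_lr, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay towards min_lr.

    Hypothetical helper: the schedule used in Notebook 6 may differ.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Warmup phase: ramp linearly so that the last warmup step reaches base_lr.
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    # Decay phase: progress runs from 0.0 to 1.0 over the remaining steps.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (base_lr - min_lr) * cosine
```

In a PyTorch training loop, a function like this could be divided by `base_lr` and passed as the `lr_lambda` of `torch.optim.lr_scheduler.LambdaLR`, which expects a multiplicative factor rather than an absolute learning rate.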

---

MODEL CARD
- Task: Image Classification (CIFAR-100)
- Backbone: openai/clip-vit-base-patch32 (vision-only)
- Head: Crystal projection head (image 512→128, anchor 512→128) + cosine logits (temperature)
- Vocabulary: AbstractPhil/geometric-vocab-512d (wordnet_eng split + deterministic unicode synth for gaps)
- Metrics: Top-1 = [80~], Top-3 = [90>]
- License: MIT
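
For reference, Top-1 and Top-3 can be computed from a table of logits as in the sketch below. It is written in plain Python for clarity and is not the notebook's evaluation code; `topk_accuracy` is a hypothetical helper name. Logits held in a tensor would first be converted with `.tolist()`.

```python
def topk_accuracy(logits, labels, k=1):
    """Fraction of rows whose true label is among the k highest logits.

    logits: list of per-class score lists, one row per image.
    labels: list of integer class indices, same length as logits.
    """
    if len(logits) != len(labels):
        raise ValueError("logits and labels must have the same length")
    if not labels:
        return 0.0
    hits = 0
    for row, label in zip(logits, labels):
        # Class indices sorted by descending score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)
```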

---

FILES IN THIS REPO
- <run_name>_best.safetensors: weights for:
  - head::* (crystal classifier head)
  - encoder::* (optional, if you chose to unfreeze/fine-tune)
- <run_name>_best.config.json: full CONFIG used for the run
- <run_name>_best.metrics.json: summary metrics for the best epoch
- Optionally: *_latest.* variants if you pushed latest per-epoch artifacts.

Note: If you only want to ship the head, you can also include a stripped crystal_head.safetensors (head-only state_dict). The snippets below handle either format.
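
One way to accept both layouts is to split the loaded key/value mapping by prefix and fall back to the unprefixed keys when no `head::` entries exist. The sketch below works on any plain dict, so it makes no assumption about the tensor library; the helper names `split_by_prefix` and `select_head_state` are hypothetical, and only the `head::` and `encoder::` prefixes come from the file listing above.

```python
def split_by_prefix(state, prefixes=("head::", "encoder::")):
    """Group keys by module prefix; keys without a known prefix are kept apart."""
    groups = {p.rstrip(":"): {} for p in prefixes}
    unprefixed = {}
    for key, value in state.items():
        for p in prefixes:
            if key.startswith(p):
                # Strip the prefix so the key matches the module's own state_dict.
                groups[p.rstrip(":")][key[len(p):]] = value
                break
        else:
            unprefixed[key] = value
    return groups, unprefixed


def select_head_state(state):
    """Return head weights from a prefixed multi-module file or a head-only file."""
    groups, unprefixed = split_by_prefix(state)
    return groups["head"] if groups["head"] else unprefixed
```

The result of `select_head_state` can then be passed to the head's `load_state_dict`.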

---

QUICKSTART (Inference)
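1) Load CLIP vision (frozen) and processor
   import torch, PIL.Image, safetensors.torch
   from transformers import AutoImageProcessor, CLIPVisionModelWithProjection

   HF_CLIP_ID = "openai/clip-vit-base-patch32"
   Processor  = AutoImageProcessor.from_pretrained(HF_CLIP_ID)
   Encoder    = CLIPVisionModelWithProjection.from_pretrained(HF_CLIP_ID).eval().to("cuda")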

2) Build the crystal head (same shape as training)
   image_dim   = Encoder.config.projection_dim    # 512
   crystal_dim = 512                               # vocab repo uses 512D anchors
   sym_dim     = 128                               # crystal_dims from CONFIG
   temperature = 0.07                              # from CONFIG

   class CrystalHead(torch.nn.Module):
       def __init__(self, De, Dv, Dsym, T):
           super().__init__()
           self.proj_img = torch.nn.Linear(De, Dsym, bias=True)
           self.proj_anc = torch.nn.Linear(Dv, Dsym, bias=False)
           self.T = T
           self.register_buffer("anchors_vocab", torch.empty(0, Dv), persistent=False)
       def set_anchors(self, anchors):  # [C, Dv]
           self.anchors_vocab = anchors.contiguous()
       def forward(self, image_embeds):  # [B, De] (L2 ok)
           z = torch.nn.functional.normalize(self.proj_img(image_embeds), dim=-1)
           a = torch.nn.functional.normalize(self.proj_anc(self.anchors_vocab), dim=-1)
           return (z @ a.T) / max(1e-8, self.T)  # [B, C]

   head = CrystalHead(De=image_dim, Dv=crystal_dim, Dsym=sym_dim, T=temperature).to("cuda")

3) Load weights (handles prefixed multi-module and head-only .safetensors)
   state = safetensors.torch.load_file("<run_name>_best.safetensors")
   head_state = {k.split("head::",1)[1]: v for k,v in state.items() if k.startswith("head::")}
   if not head_state:  # head-only file: keys carry no "head::" prefix
       head_state = state
   head.load_state_dict(head_state, strict=True)

4) Prepare anchors from your vocabulary (same order as training)
   # Load the anchors you exported, or rebuild them exactly as in Notebook 6.
   # `anchors` must be a torch.Tensor of shape [100, 512].
   head.set_anchors(anchors.to("cuda"))

5) Inference on a batch of images (PIL or ndarray)
   imgs = [PIL.Image.open("example_0.png").convert("RGB"), PIL.Image.open("example_1.png").convert("RGB")]
   batch = Processor(images=imgs, return_tensors="pt").to("cuda")
   with torch.no_grad():
       out = Encoder(pixel_values=batch["pixel_values"], return_dict=True)
       z   = torch.nn.functional.normalize(out.image_embeds, dim=-1)  # [B, 512]
       logits = head(z)                                               # [B, 100]
       pred   = logits.argmax(dim=-1).tolist()
   print("pred:", pred)

Note: The head expects the same class order used at training time. Save and ship class_names.json (CIFAR-100 labels) and the exact anchors_vocab.pt you used (or rebuild deterministically with the vocab + synth step).
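
A small sketch of shipping and checking the class order follows. It assumes `class_names.json` is a flat JSON list of label strings in training order; that layout and the helper names are assumptions made for illustration, not something the notebook confirms.

```python
import json


def save_class_names(path, names):
    """Write the class names in training order as a JSON list."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(list(names), f, ensure_ascii=False, indent=2)


def load_class_names(path, expected=None):
    """Read the class names back and run basic sanity checks."""
    with open(path, "r", encoding="utf-8") as f:
        names = json.load(f)
    if expected is not None and len(names) != expected:
        raise ValueError(f"expected {expected} classes, found {len(names)}")
    if len(set(names)) != len(names):
        raise ValueError("duplicate class names would make the order ambiguous")
    return names


def decode_predictions(pred_indices, names):
    """Map predicted class indices (e.g. from argmax) to label strings."""
    return [names[i] for i in pred_indices]
```

With the Quickstart variables, `decode_predictions(pred, load_class_names("class_names.json", expected=100))` would turn the printed indices into CIFAR-100 labels.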

---

REPRODUCE (Notebook 6)
1. Config only (single source of truth): image size, CLIP stats, dataset, temperature, crystal dims, etc.
2. Cell 5 – HF CLIP vision loader (one embedding per image).
3. Cell 6 – Vocabulary interface; synth any missing CIFAR tokens, cache crystals, select role anchors.
4. Cell 8 – Crystal head (image+anchor projections → cosine logits / T).
5. Cell 9 – Trainer (AdamW + AMP + cosine LR). Saves latest/best, pushes to HF if enabled.
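
To illustrate the CONFIG-first idea from step 1, here is a hedged sketch of a single config dict with a sanity check. The key names and `validate_config` are hypothetical; the values are only those quoted elsewhere in this card, and the real CONFIG in Notebook 6 contains more fields (image size, CLIP stats, optimizer settings) that are not reproduced here.

```python
# Hypothetical key names; values are the ones quoted in this card.
CONFIG = {
    "clip_id": "openai/clip-vit-base-patch32",
    "vocab_repo": "AbstractPhil/geometric-vocab-512d",
    "dataset": "cifar100",
    "num_classes": 100,
    "image_dim": 512,       # CLIP projection_dim
    "anchor_dim": 512,      # vocabulary anchor dimension
    "crystal_dims": 128,    # shared symbol space
    "temperature": 0.07,
    "freeze_encoder": True,
}


def validate_config(cfg):
    """Fail early on missing keys or nonsensical values."""
    required = {"clip_id", "vocab_repo", "dataset", "num_classes",
                "image_dim", "anchor_dim", "crystal_dims", "temperature"}
    missing = sorted(required - set(cfg))
    if missing:
        raise KeyError(f"missing config keys: {missing}")
    if cfg["temperature"] <= 0:
        raise ValueError("temperature must be positive")
    for key in ("num_classes", "image_dim", "anchor_dim", "crystal_dims"):
        if not isinstance(cfg[key], int) or cfg[key] <= 0:
            raise ValueError(f"{key} must be a positive integer")
    return cfg
```

Reading every downstream value (head shapes, temperature, class count) from one validated dict keeps the encoder, head, and trainer cells from drifting apart.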


The metrics in the MODEL CARD section are placeholders; replace them with your final numbers after the run completes.

---

ACKNOWLEDGEMENTS
- CLIP ViT-B/32: OpenAI (openai/clip-vit-base-patch32) via Hugging Face transformers.
- Pentachora Vocabulary: AbstractPhil/geometric-vocab-512d.
- Built in Notebook 6 (CONFIG-first, deterministic synth for gaps, head-only training).