---
license: mit
base_model:
- openai/clip-vit-base-patch32
datasets:
- AbstractPhil/geometric-vocab-32d
tags:
- experiment
---

# Preface

A first experiment testing whether clip-vit-base-patch32 can be converted into a geometric model using only a classification head.

Below is a GPT-5 auto-generated write-up based on the notebook. I'll add the full notebook shortly.

The question was simple: can linear layers learn geometry?

The answer is... maybe. More research required.

# Reasoning

I used the 32-dimensional geometric vocab because it seemed to be the weakest with flow-match Euler-discrete. That makes it a good test of the hypothesis that a low-dimensional geometry can substitute for a higher-dimensional variant.

The output head is much smaller than I first expected. This checkpoint happens to have the pair packed together, but the updated notebook will show that the separated head is in fact under 100 KB.

# Why clip-vit instead of just vit?

I believe the CLIP-ViT variants have more utility overall, so I wanted to assess a fair target.

# Notebook-6 · Crystal-CLIP CIFAR-100

One-vector image embeddings (HF CLIP) + pentachora vocabulary anchors → cosine-similarity classifier for CIFAR-100.
This repo hosts the trained crystal classification head (+ run configs/metrics) built in Notebook 6.

---

OVERVIEW
- Vision encoder: openai/clip-vit-base-patch32 (Hugging Face transformers), frozen by default.
  Produces exactly one L2-normalized embedding per image (image_embeds, dim=512).
- Vocabulary: AbstractPhil/geometric-vocab-32d (pentachora crystals).
  For CIFAR-100 class names, any missing tokens are deterministically synthesized via the unicode path to guarantee 100/100 coverage and preserve class ordering.
- Head: projects both image embeddings (De=512) and role-selected class anchors (Dv=32) into a shared symbol space (crystal_dims=64), L2-normalizes both, and computes cosine logits divided by the temperature T.
- Training: Cross-Entropy on CIFAR-100, AdamW, optional AMP, cosine LR with warmup. Best checkpoint is saved and (optionally) pushed to Hugging Face.
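
The cosine-logit step described above can be illustrated with a small dependency-free sketch. The two-dimensional vectors are made-up toy values, not real CLIP embeddings or pentachora anchors, and the helper names are hypothetical; only the temperature of 0.07 comes from the Quickstart below.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (guarding against a zero norm)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, 1e-12) for x in vec]

def cosine_logits(image_vec, anchor_vecs, temperature):
    """Cosine similarity between one image vector and each anchor, divided by T."""
    z = l2_normalize(image_vec)
    logits = []
    for anchor in anchor_vecs:
        a = l2_normalize(anchor)
        logits.append(sum(p * q for p, q in zip(z, a)) / temperature)
    return logits

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy data (assumption): one image vector and three class anchors.
image = [1.0, 0.0]
anchors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

logits = cosine_logits(image, anchors, temperature=0.07)
probs = softmax(logits)
predicted_class = probs.index(max(probs))
```

Because cosine similarity is bounded in [-1, 1], dividing by a small temperature stretches the logits so that the softmax used by cross-entropy can become sharply peaked.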

---

MODEL CARD
- Task: Image Classification (CIFAR-100)
- Backbone: openai/clip-vit-base-patch32 (vision-only)
- Head: Crystal projection head (image 512→64, anchor 32→64) + cosine logits (temperature)
- Vocabulary: AbstractPhil/geometric-vocab-32d (wordnet_eng split + deterministic unicode synth for gaps)
- Metrics (approximate): Top-1 ≈ 60%, Top-3 > 80%
- License: MIT
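
For reference, Top-1 and Top-3 accuracy can be computed from a logits matrix roughly as follows. This is a minimal pure-Python sketch with made-up logits and labels; the notebook's own metric code is not shown in this card and may differ.

```python
def topk_accuracy(logits, labels, k):
    """Fraction of samples whose true label is among the k highest logits."""
    hits = 0
    for row, label in zip(logits, labels):
        # Class indices sorted by descending score, truncated to the top k.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += int(label in ranked)
    return hits / len(labels)

# Toy data (assumption): 3 samples, 3 classes.
toy_logits = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
]
toy_labels = [1, 1, 0]

top1 = topk_accuracy(toy_logits, toy_labels, k=1)
top2 = topk_accuracy(toy_logits, toy_labels, k=2)
top3 = topk_accuracy(toy_logits, toy_labels, k=3)
```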

---

FILES IN THIS REPO
- <run_name>_best.safetensors — weights for:
  - head::* (crystal classifier head)
  - encoder::* (optional, if you chose to unfreeze/fine-tune)
- <run_name>_best.config.json — full CONFIG used for the run
- <run_name>_best.metrics.json — summary metrics for the best epoch
- Optionally: *_latest.* variants if you pushed latest per-epoch artifacts.

Note: If you only want to ship the head, you can also include a stripped crystal_head.safetensors (head-only state_dict). The snippets below handle either format.
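
As a rough illustration of the `head::` / `encoder::` key layout, the sketch below groups a flat state dict by prefix so the head part can be exported on its own. The function name and the encoder key are hypothetical, and plain strings stand in for tensors; with real weights you would pass the dict returned by `safetensors.torch.load_file` and write the head group back out with `safetensors.torch.save_file`.

```python
def split_prefixed_state(state, sep="::"):
    """Group a flat {"prefix::name": value} dict into {prefix: {name: value}}.

    Keys without the separator (e.g. from a head-only file) land under "".
    """
    groups = {}
    for key, value in state.items():
        prefix, found, name = key.partition(sep)
        if not found:
            prefix, name = "", key
        groups.setdefault(prefix, {})[name] = value
    return groups

# Placeholder values (assumption): strings stand in for weight tensors.
packed = {
    "head::proj_img.weight": "W_img",
    "head::proj_img.bias": "b_img",
    "head::proj_anc.weight": "W_anc",
    "encoder::example.weight": "W_enc",  # hypothetical encoder key
}

groups = split_prefixed_state(packed)
head_only = groups["head"]  # what a stripped crystal_head.safetensors would hold

# A head-only dict has no prefixes, so everything is grouped under "".
unprefixed = split_prefixed_state(head_only)
```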

---

QUICKSTART (Inference)
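1) Load CLIP vision (frozen) and processor
   import torch, PIL.Image, safetensors.torch
   from transformers import AutoImageProcessor, CLIPVisionModelWithProjection

   HF_CLIP_ID = "openai/clip-vit-base-patch32"
   Processor  = AutoImageProcessor.from_pretrained(HF_CLIP_ID)
   Encoder    = CLIPVisionModelWithProjection.from_pretrained(HF_CLIP_ID).eval().to("cuda")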

2) Build the crystal head (same shape as training)
   image_dim   = Encoder.config.projection_dim    # 512
   crystal_dim = 32                                # geometric-vocab-32d anchors
   sym_dim     = 64                                # crystal_dims; must match CONFIG
   temperature = 0.07                              # from CONFIG

   class CrystalHead(torch.nn.Module):
       def __init__(self, De, Dv, Dsym, T):
           super().__init__()
           self.proj_img = torch.nn.Linear(De, Dsym, bias=True)   # image embedding -> symbol space
           self.proj_anc = torch.nn.Linear(Dv, Dsym, bias=False)  # class anchor -> symbol space
           self.T = T
           # Non-persistent buffer: anchors are not stored in the checkpoint.
           self.register_buffer("anchors_vocab", torch.empty(0, Dv), persistent=False)
       def set_anchors(self, anchors):  # [C, Dv]
           self.anchors_vocab = anchors.contiguous()
       def forward(self, image_embeds):  # [B, De]; already L2-normalized input is fine
           z = torch.nn.functional.normalize(self.proj_img(image_embeds), dim=-1)
           a = torch.nn.functional.normalize(self.proj_anc(self.anchors_vocab), dim=-1)
           return (z @ a.T) / max(1e-8, self.T)  # cosine logits, [B, C]

   head = CrystalHead(De=image_dim, Dv=crystal_dim, Dsym=sym_dim, T=temperature).to("cuda")

3) Load weights (handles a prefixed multi-module .safetensors or a head-only file)
   state = safetensors.torch.load_file("<run_name>_best.safetensors")
   head_state = {k.split("head::", 1)[1]: v for k, v in state.items() if k.startswith("head::")}
   if not head_state:  # head-only file: keys carry no "head::" prefix
       head_state = state
   head.load_state_dict(head_state, strict=True)

4) Prepare anchors from your vocabulary (same order as training)
   # Load the anchors you exported, or rebuild them exactly as in Notebook 6.
   # anchors: torch.Tensor of shape [100, 32], i.e. [num_classes, crystal_dim]
   head.set_anchors(anchors.to("cuda"))

5) Inference on a batch of images (PIL or ndarray)
   imgs = [PIL.Image.open("example_0.png").convert("RGB"), PIL.Image.open("example_1.png").convert("RGB")]
   batch = Processor(images=imgs, return_tensors="pt").to("cuda")
   with torch.no_grad():
       out = Encoder(pixel_values=batch["pixel_values"], return_dict=True)
       z   = torch.nn.functional.normalize(out.image_embeds, dim=-1)  # [B, 512]
       logits = head(z)                                               # [B, 100]
       pred   = logits.argmax(dim=-1).tolist()
   print("pred:", pred)

Note: The head expects the same class order used at training time. Save and ship class_names.json (CIFAR-100 labels) and the exact anchors_vocab.pt you used (or rebuild deterministically with the vocab + synth step).
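
One way to keep the class order stable is to write the label list to class_names.json once and validate it on load. The sketch below is an assumption about how that could look, with hypothetical helper names; it uses only the first three CIFAR-100 labels and a temporary directory for illustration, whereas a real run would store all 100 names in training order.

```python
import json
import os
import tempfile

def save_class_names(path, class_names):
    """Write class names as a JSON list, preserving their order."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(list(class_names), f, ensure_ascii=False, indent=2)

def load_class_names(path, expected_count=None):
    """Read class names back and run basic sanity checks."""
    with open(path, encoding="utf-8") as f:
        names = json.load(f)
    if len(set(names)) != len(names):
        raise ValueError("duplicate class names would make indices ambiguous")
    if expected_count is not None and len(names) != expected_count:
        raise ValueError(f"expected {expected_count} classes, got {len(names)}")
    return names

def decode_predictions(pred_indices, class_names):
    """Map predicted class indices (e.g. from logits.argmax) to label strings."""
    return [class_names[i] for i in pred_indices]

# Illustration only: three labels instead of the full CIFAR-100 list.
demo_names = ["apple", "aquarium_fish", "baby"]
with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = os.path.join(tmp_dir, "class_names.json")
    save_class_names(json_path, demo_names)
    loaded_names = load_class_names(json_path, expected_count=3)

decoded = decode_predictions([2, 0], loaded_names)
```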

---

REPRODUCE (Notebook 6)
1. Config only (single source of truth): image size, CLIP stats, dataset, temperature, crystal dims, etc.
2. Cell 5 – HF CLIP vision loader (one embedding per image).
3. Cell 6 – Vocabulary interface; synth any missing CIFAR tokens, cache crystals, select role anchors.
4. Cell 8 – Crystal head (image+anchor projections → cosine logits / T).
5. Cell 9 – Trainer (AdamW + AMP + cosine LR). Saves latest/best, pushes to HF if enabled.
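
The "cosine LR with warmup" schedule used by the trainer can be sketched as a plain function of the step index. This is a generic formulation, not the notebook's code: the linear warmup shape, the function name, and all numeric values here are assumptions, and the real hyperparameters live in the run's CONFIG.

```python
import math

def lr_at_step(step, total_steps, warmup_steps, base_lr, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    if warmup_steps > 0 and step < warmup_steps:
        # Ramp linearly so the last warmup step reaches base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(max(progress, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical values, for illustration only.
schedule = [
    lr_at_step(s, total_steps=110, warmup_steps=10, base_lr=1e-3)
    for s in range(111)
]
```

In PyTorch, a function like this (rescaled to return a multiplier of the base rate) is commonly wrapped in `torch.optim.lr_scheduler.LambdaLR` around the AdamW optimizer.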


Note: Replace the metrics in the MODEL CARD section with your final numbers after the run completes.

---

ACKNOWLEDGEMENTS
- CLIP ViT-B/32: OpenAI (openai/clip-vit-base-patch32) via Hugging Face transformers.
- Pentachora Vocabulary: AbstractPhil/geometric-vocab-32d.
- Built in Notebook 6 (CONFIG-first, deterministic synth for gaps, head-only training).