AbstractPhil committed 7fc1cb7 (verified, parent 5a73688): Create README.md

README.md added (+145 lines)
---
license: mit
tags:
- geometric-deep-learning
- voxel-classifier
- cross-contrast
- pentachoron
- contrastive-learning
- 3d-classification
pipeline_tag: other
---

# Grid Geometric Classifier Proto

A prototype system for geometric primitive classification and text–geometry alignment. A voxel classifier learns to identify 38 shape classes from 5×5×5 binary occupancy grids using capacity cascades, curvature analysis, differentiation gates, and a rectified flow arbiter. A cross-contrast module then aligns the classifier's learned features with Qwen 2.5-1.5B text embeddings via InfoNCE, producing a shared latent space where geometric structure and natural language descriptions are jointly represented.

This is a research prototype exploring whether a geometric vocabulary learned from pure structure can meaningfully align with linguistic semantics.
## Repository Structure

```
geometric_classifier/          ← Voxel classifier (~1.85M params)
├── config.json                # Architecture: dims, classes, shape catalog
├── training_config.json       # Hyperparams, loss weights, results
└── model.safetensors          # Weights

crosscontrast/                 ← Text↔Voxel alignment heads
├── config.json                # Projection dims, latent space config
├── training_config.json       # Contrastive training params & results
├── text_proj.safetensors      # Text → latent projection
├── voxel_proj.safetensors     # Voxel → latent projection
└── temperature.safetensors    # Learned temperature scalar

qwen_embeddings/               ← Cached Qwen 2.5-1.5B embeddings
├── config.json                # Model name, hidden dim, extraction method
├── embeddings.safetensors     # (38, 1536) class embeddings
└── descriptions.json          # Natural language shape descriptions
```

## Shape Vocabulary: 38 Classes

The vocabulary spans 0D–3D primitives, both rigid and curved, organized by intrinsic dimensionality:

| Dim | Rigid | Curved |
|-----|-------|--------|
| 0D | point | – |
| 1D | line_x, line_y, line_z, line_diag, cross, l_shape, collinear | arc, helix |
| 2D | triangle_xy, triangle_xz, triangle_3d, square_xy, square_xz, rectangle, coplanar, plane | circle, ellipse, disc |
| 3D | tetrahedron, pyramid, pentachoron, cube, cuboid, triangular_prism, octahedron | sphere, hemisphere, cylinder, cone, capsule, torus, shell, tube, bowl, saddle |

Eight curvature types: `none`, `convex`, `concave`, `cylindrical`, `conical`, `toroidal`, `hyperbolic`, `helical`.
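
The procedural generator itself is not included in this repository. As an illustration only, here is a minimal sketch (the function names and grid conventions are assumptions, not the actual generator) of how occupancy grids for two of the simplest classes could be produced:

```python
import torch

def make_point(center=(2, 2, 2)):
    # Single occupied voxel: the 0D "point" class.
    g = torch.zeros(5, 5, 5)
    g[center] = 1.0
    return g

def make_line_x(y=2, z=2):
    # Axis-aligned run of occupied voxels along x: the 1D "line_x" class.
    g = torch.zeros(5, 5, 5)
    g[:, y, z] = 1.0
    return g

point_grid = make_point()
line_grid = make_line_x()
```

Each grid is a 5×5×5 float tensor of 0s and 1s, matching the classifier's expected input format described below.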

## Architecture

### GeometricShapeClassifier (v8)

Input is a 5×5×5 binary voxel grid. The forward pass has four stages:

**1. Tracer Attention.** Five learned tracer tokens attend over the 125 voxel embeddings (occupancy + normalized 3D position → 64-dim via MLP). All C(5,2) = 10 tracer pairs compute interaction features and edge-detection scores via SwiGLU heads. Pool dimension: 320 (5 tracers × 64 dims).
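
The dimension arithmetic above can be checked with a sketch; the single-head attention here is a stand-in for the actual tracer module, and the random tensors stand in for learned parameters:

```python
import torch
from itertools import combinations

n_tracers, n_voxels, d = 5, 125, 64
voxel_emb = torch.randn(1, n_voxels, d)   # stand-in for the per-voxel MLP output
tracers = torch.randn(1, n_tracers, d)    # stand-in for learned tracer tokens

# Tracers attend over the voxel embeddings (scaled dot-product, one head).
attn = torch.softmax(tracers @ voxel_emb.transpose(1, 2) / d**0.5, dim=-1)
traced = attn @ voxel_emb                 # (1, 5, 64) traced features

# All C(5,2) tracer pairs feed the interaction / edge-detection heads.
pairs = list(combinations(range(n_tracers), 2))

# Flattening the traced features gives the 320-dim pool (5 x 64).
pooled = traced.flatten(1)
```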

**2. Capacity Cascade.** Four `CapacityHead` modules with learned capacities (initialized at 0.5, 1.0, 1.5, 2.0) process features sequentially. Each outputs a fill ratio (sigmoid), an overflow signal, and residual features. The cascade partitions representation capacity across intrinsic dimensions (0D→3D), with fill ratios serving as soft dimensionality indicators.
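
A minimal sketch of the cascade's data flow, assuming plain linear heads per stage (the real `CapacityHead` internals are not documented here, so this shows only the fill/overflow/residual interface):

```python
import torch
import torch.nn as nn

class CapacityHead(nn.Module):
    """Sketch of one cascade stage: fill ratio, overflow signal, residual features."""
    def __init__(self, dim, init_capacity):
        super().__init__()
        self.capacity = nn.Parameter(torch.tensor(init_capacity))
        self.fill = nn.Linear(dim, 1)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        fill = torch.sigmoid(self.fill(x))           # soft dimensionality indicator
        overflow = torch.relu(fill - self.capacity)  # signal past learned capacity
        return fill, overflow, x + self.proj(x)      # residual features to next stage

# Four stages with the initial capacities from the text, run sequentially.
heads = [CapacityHead(320, c) for c in (0.5, 1.0, 1.5, 2.0)]
x = torch.randn(2, 320)
fills = []
for head in heads:                                   # 0D -> 3D cascade
    fill, overflow, x = head(x)
    fills.append(fill)
```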

**3. Curvature Analysis.** A `DifferentiationGate` computes radial distance profiles binned into 5 shells, producing sigmoid gates and additive directional features that differentiate convex/concave curvature. A `CurvatureHead` combines rigid features with gated curvature features to predict: is_curved (binary), curvature_type (8-class), and a curvature embedding used downstream.
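
The radial binning step can be made concrete. This is a standalone sketch, not the `DifferentiationGate` implementation; the centroid choice and shell edges are assumptions:

```python
import torch

def radial_shell_profile(grid, n_shells=5):
    """Count occupied voxels in radial shells around the occupied centroid."""
    coords = torch.stack(torch.meshgrid(
        torch.arange(5.0), torch.arange(5.0), torch.arange(5.0),
        indexing="ij"), dim=-1)                  # (5, 5, 5, 3) voxel coordinates
    occ = grid > 0
    center = coords[occ].mean(dim=0)             # centroid of occupied voxels
    r = (coords[occ] - center).norm(dim=-1)      # radial distance per voxel
    edges = torch.linspace(0, r.max() + 1e-6, n_shells + 1)
    return torch.stack([((r >= edges[i]) & (r < edges[i + 1])).float().sum()
                        for i in range(n_shells)])

g = torch.zeros(5, 5, 5)
g[2, 2, 2] = 1.0                                 # a single central point
profile = radial_shell_profile(g)
```

A sphere-like shape concentrates mass in outer shells while a point sits entirely in the innermost one, which is the kind of signal that separates convex from concave structure.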

**4. Rectified Flow Arbiter.** For ambiguous cases, a `RectifiedFlowArbiter` integrates a learned velocity field over 4 flow-matching steps from noise to class prototypes. It produces refined logits, trajectory logits at each step, confidence scores, and a blend weight that gates between initial and refined predictions. Trained with an OT-conditioned flow matching loss.
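
A sketch of the 4-step integration loop, with a toy velocity field standing in for the learned one (the Euler discretization and time conditioning are assumptions):

```python
import torch

def integrate_flow(velocity_fn, z, n_steps=4):
    """Euler-integrate a velocity field from noise toward class prototypes."""
    dt = 1.0 / n_steps
    trajectory = [z]
    for step in range(n_steps):
        t = torch.full((z.shape[0], 1), step * dt)  # time conditioning
        z = z + dt * velocity_fn(z, t)              # one flow-matching step
        trajectory.append(z)
    return z, trajectory

# Toy velocity field: pull toward a fixed prototype (ignores t).
proto = torch.ones(1, 38)
velocity = lambda z, t: proto - z

z0 = torch.randn(2, 38)                             # start from noise
z_final, traj = integrate_flow(velocity, z0)
```

Each intermediate state in `traj` would feed the per-step trajectory logits mentioned above.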

The final class prediction blends initial and arbiter-refined logits via the learned blend weight.

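
The blending step amounts to a convex combination gated by the learned weight; a sketch with random stand-ins (the exact gating form is an assumption):

```python
import torch

initial_logits = torch.randn(2, 38)                 # stand-in classifier output
refined_logits = torch.randn(2, 38)                 # stand-in arbiter output
blend = torch.sigmoid(torch.randn(2, 1))            # per-sample gate in (0, 1)

# Convex combination of the two heads, gated by the learned weight.
final_logits = blend * refined_logits + (1 - blend) * initial_logits
pred = final_logits.argmax(dim=1)
```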
### CrossContrastModel

Two MLP projection heads map frozen voxel features (645-dim) and frozen Qwen text embeddings (1536-dim) into a shared 256-dim latent space. Architecture per head: `Linear → LayerNorm → GELU → Linear → LayerNorm → GELU → Linear`. Trained with symmetric InfoNCE loss and a learned temperature parameter.
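
The objective can be sketched as follows; the L2 normalization and log-temperature parameterization are assumptions, not confirmed details of the training code:

```python
import torch
import torch.nn.functional as F

def symmetric_info_nce(z_text, z_voxel, log_temp):
    """Symmetric InfoNCE over matched text/voxel pairs (sketch)."""
    z_text = F.normalize(z_text, dim=-1)
    z_voxel = F.normalize(z_voxel, dim=-1)
    logits = z_text @ z_voxel.T / log_temp.exp()    # cosine sims / temperature
    targets = torch.arange(z_text.shape[0])         # diagonal = matched pairs
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

z_t, z_v = torch.randn(8, 256), torch.randn(8, 256)  # projected batch stand-ins
log_temp = torch.tensor(0.07).log()                  # init 0.07, learned in training
loss = symmetric_info_nce(z_t, z_v, log_temp)
```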

### Text Embeddings

Class descriptions are encoded by Qwen 2.5-1.5B-Instruct using mean-pooled last hidden states. Each of the 38 classes has a 2-shot geometric description (e.g., *"A flat triangular outline formed by three connected edges lying in the horizontal xy-plane, the simplest polygon"*).
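
The mean-pooling step, shown with random stand-in hidden states rather than an actual Qwen forward pass (the mask-aware pooling form is an assumption about the extraction method recorded in `qwen_embeddings/config.json`):

```python
import torch

def mean_pool(last_hidden, attention_mask):
    """Mask-aware mean over token hidden states -> one vector per sequence."""
    mask = attention_mask.unsqueeze(-1).float()     # (B, T, 1)
    return (last_hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)

# Stand-in for Qwen 2.5-1.5B last hidden states (hidden dim 1536).
hidden = torch.randn(2, 10, 1536)
mask = torch.ones(2, 10, dtype=torch.long)
mask[1, 6:] = 0                                     # second sequence has padding
emb = mean_pool(hidden, mask)                       # (2, 1536) embeddings
```

Running this over all 38 descriptions would produce the cached `(38, 1536)` embedding matrix.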

## Training

### Classifier (Cell 3)

| Parameter | Value |
|-----------|-------|
| Dataset | 500K procedurally generated samples (400K train / 100K val) |
| Grid size | 5×5×5 binary occupancy |
| Batch size | 4,096 |
| Optimizer | AdamW (lr=3e-3, wd=1e-4) |
| Schedule | Cosine with 5-epoch warmup |
| Precision | BF16 autocast (no GradScaler) |
| Compile | torch.compile (default mode) |
| Augmentation | Voxel dropout (5%), random addition (5%), spatial shift (8%) |
| Epochs | 80 |

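
The schedule row can be written out explicitly; a sketch matching the table's learning rate, warmup, and epoch values (the exact warmup shape is an assumption):

```python
import math

def lr_at(epoch, base_lr=3e-3, warmup=5, total=80):
    """Cosine decay with linear warmup, matching the table above (sketch)."""
    if epoch < warmup:
        return base_lr * (epoch + 1) / warmup       # linear ramp over 5 epochs
    progress = (epoch - warmup) / (total - warmup)  # 0 -> 1 over remaining epochs
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))

schedule = [lr_at(e) for e in range(80)]
```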
The classifier is trained with a composite loss: cross-entropy on initial and refined logits, capacity fill-ratio supervision, peak-dimension classification, overflow regularization, capacity diversity, volume regression (log1p MSE), Cayley-Menger determinant sign prediction, curvature binary/type classification, flow matching loss, arbiter confidence calibration, and blend weight supervision. 13 weighted terms total.

### Cross-Contrast (Cell 4)

| Parameter | Value |
|-----------|-------|
| Dataset | Reuses Cell 3 cached dataset |
| Voxel encoder | Frozen GeometricShapeClassifier |
| Text encoder | Frozen Qwen 2.5-1.5B-Instruct |
| Latent dim | 256 |
| Batch size | 4,096 |
| Optimizer | AdamW (lr=2e-3, wd=1e-4) |
| Schedule | Cosine with 3-epoch warmup |
| Loss | Symmetric InfoNCE |
| Temperature | Learned (init 0.07) |
| Epochs | 40 |

## Quick Start

```python
import torch
from safetensors.torch import load_file

# Load classifier weights
weights = load_file("geometric_classifier/model.safetensors")
# Instantiate GeometricShapeClassifier (architecture details are in
# geometric_classifier/config.json), then: model.load_state_dict(weights)

# Load cross-contrast heads
text_proj_w = load_file("crosscontrast/text_proj.safetensors")
voxel_proj_w = load_file("crosscontrast/voxel_proj.safetensors")
temp = load_file("crosscontrast/temperature.safetensors")

# Load cached embeddings
emb = load_file("qwen_embeddings/embeddings.safetensors")
text_embeddings = emb["embeddings"]  # (38, 1536)

# Classify a voxel grid
grid = torch.zeros(1, 5, 5, 5)  # your binary occupancy grid
grid[0, 2, 2, 2] = 1  # single point
with torch.no_grad():
    out = model(grid)
predicted_class = out["class_logits"].argmax(1)
```

## What This Is (and Isn't)

This is a **prototype** exploring geometric–linguistic alignment at small scale. The 5×5×5 grid is intentionally minimal: large enough to represent 38 distinct geometric primitives with curvature distinctions, small enough to train in minutes on a single GPU. The interesting questions are about the structure of the shared latent space: whether text-space confusions mirror geometric failure modes, whether the alignment generalizes beyond the training vocabulary, and what happens at scale.

This is not a production classifier. The procedural dataset is synthetic, the grid resolution is toy-scale, and the cross-contrast vocabulary is fixed at 38 classes.

## License

MIT