---
license: mit
tags:
- geometric-deep-learning
- voxel-classifier
- cross-contrast
- pentachoron
- contrastive-learning
- 3d-classification
pipeline_tag: other
---
# Grid Geometric Classifier Proto
This is a subcomponent experiment for the larger scene-classification experiments, coming full circle back to the original geometric vocabulary soon.
A prototype system for geometric primitive classification and text–geometry alignment. A voxel classifier learns to identify 38 shape classes from 5×5×5 binary occupancy grids using capacity cascades, curvature analysis, differentiation gates, and a rectified flow arbiter. A cross-contrast module then aligns the classifier's learned features with Qwen 2.5-1.5B text embeddings via InfoNCE, producing a shared latent space where geometric structure and natural language descriptions are jointly represented.
This is a research prototype exploring whether a geometric vocabulary learned from pure structure can meaningfully align with linguistic semantics.
## Repository Structure
```
geometric_classifier/          # Voxel classifier (~1.85M params)
├── config.json                # Architecture: dims, classes, shape catalog
├── training_config.json       # Hyperparams, loss weights, results
└── model.safetensors          # Weights
crosscontrast/                 # Text↔Voxel alignment heads
├── config.json                # Projection dims, latent space config
├── training_config.json       # Contrastive training params & results
├── text_proj.safetensors      # Text → latent projection
├── voxel_proj.safetensors     # Voxel → latent projection
└── temperature.safetensors    # Learned temperature scalar
qwen_embeddings/               # Cached Qwen 2.5-1.5B embeddings
├── config.json                # Model name, hidden dim, extraction method
├── embeddings.safetensors     # (38, 1536) class embeddings
└── descriptions.json          # Natural language shape descriptions
```
## Shape Vocabulary: 38 Classes
The vocabulary spans 0D–3D primitives, both rigid and curved, organized by intrinsic dimensionality:
| Dim | Rigid | Curved |
|-----|-------|--------|
| 0D | point | — |
| 1D | line_x, line_y, line_z, line_diag, cross, l_shape, collinear | arc, helix |
| 2D | triangle_xy, triangle_xz, triangle_3d, square_xy, square_xz, rectangle, coplanar, plane | circle, ellipse, disc |
| 3D | tetrahedron, pyramid, pentachoron, cube, cuboid, triangular_prism, octahedron | sphere, hemisphere, cylinder, cone, capsule, torus, shell, tube, bowl, saddle |
Eight curvature types: `none`, `convex`, `concave`, `cylindrical`, `conical`, `toroidal`, `hyperbolic`, `helical`.
## Architecture
### GeometricShapeClassifier (v8)
Input is a 5×5×5 binary voxel grid. The forward pass has four stages:
**1. Tracer Attention** — 5 learned tracer tokens attend over 125 voxel embeddings (occupancy + normalized 3D position → 64-dim via MLP). All C(5,2)=10 tracer pairs compute interaction features and edge detection scores via SwiGLU heads. Pool dimension: 320 (5 tracers × 64-dim).
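A minimal sketch of the tracer-attention stage, assuming a standard multi-head cross-attention and omitting the SwiGLU pair heads; the embedding MLP shape and head count are assumptions, while the 64-dim tokens and 320-dim pool follow the card:

```python
import torch
import torch.nn as nn

class TracerAttention(nn.Module):
    """Sketch of stage 1: 5 learned tracer tokens attend over 125 voxel
    embeddings built from occupancy + normalized 3D position."""
    def __init__(self, dim=64, n_tracers=5):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(4, dim), nn.GELU(), nn.Linear(dim, dim))
        self.tracers = nn.Parameter(torch.randn(n_tracers, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, grid):                      # grid: (B, 5, 5, 5)
        B = grid.shape[0]
        # occupancy + normalized xyz position -> (B, 125, 4) per-voxel features
        coords = torch.stack(torch.meshgrid(
            *[torch.linspace(0, 1, 5)] * 3, indexing="ij"), dim=-1).reshape(-1, 3)
        feats = torch.cat([grid.reshape(B, 125, 1),
                           coords.expand(B, -1, -1)], dim=-1)
        tokens = self.embed(feats)                # (B, 125, 64)
        q = self.tracers.expand(B, -1, -1)        # (B, 5, 64) learned queries
        pooled, _ = self.attn(q, tokens, tokens)  # tracers attend over voxels
        return pooled.reshape(B, -1)              # (B, 320) pooled features

x = torch.zeros(2, 5, 5, 5)
print(TracerAttention()(x).shape)  # torch.Size([2, 320])
```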
**2. Capacity Cascade** — Four `CapacityHead` modules with learned capacities (initialized at 0.5, 1.0, 1.5, 2.0) process features sequentially. Each outputs a fill ratio (sigmoid), overflow signal, and residual features. The cascade partitions representation capacity across intrinsic dimensions (0D–3D), with fill ratios serving as soft dimensionality indicators.
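The cascade interface can be sketched as follows; the internals of each head here are assumptions, and only the capacity initializations and the (fill ratio, overflow, residual) outputs follow the card:

```python
import torch
import torch.nn as nn

class CapacityHead(nn.Module):
    """Sketch of one cascade stage: learned capacity scalar, sigmoid fill
    ratio, overflow signal, and residual features passed to the next stage."""
    def __init__(self, dim=320, init_capacity=1.0):
        super().__init__()
        self.capacity = nn.Parameter(torch.tensor(init_capacity))
        self.fill = nn.Linear(dim, 1)
        self.res = nn.Linear(dim, dim)

    def forward(self, x):
        fill_ratio = torch.sigmoid(self.fill(x))           # (B, 1), in [0, 1]
        overflow = torch.relu(fill_ratio - self.capacity)  # signal past capacity
        residual = x + torch.tanh(self.res(x))             # features flow onward
        return fill_ratio, overflow, residual

# Four heads with the initial capacities from the card, applied sequentially.
heads = [CapacityHead(init_capacity=c) for c in (0.5, 1.0, 1.5, 2.0)]
x = torch.randn(2, 320)
fills = []
for h in heads:
    fill, _, x = h(x)
    fills.append(fill)
print(torch.cat(fills, dim=1).shape)  # torch.Size([2, 4]) soft dimensionality
```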
**3. Curvature Analysis** — A `DifferentiationGate` computes radial distance profiles binned into 5 shells, producing sigmoid gates and additive directional features that differentiate convex/concave curvature. A `CurvatureHead` combines rigid features with gated curvature features to predict: is_curved (binary), curvature_type (8-class), and a curvature embedding used downstream.
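The radial binning that feeds the `DifferentiationGate` can be sketched like this; the gate and head layers are omitted, and the shell-edge placement is an assumption:

```python
import torch

def radial_shell_profile(grid, n_shells=5):
    """Sketch: occupancy mass binned into radial shells around the 5x5x5
    grid centre, giving an (B, n_shells) distance profile."""
    B = grid.shape[0]
    flat = grid.reshape(B, -1)                       # (B, 125)
    idx = torch.stack(torch.meshgrid(
        *[torch.arange(5.0)] * 3, indexing="ij"), dim=-1).reshape(-1, 3)
    dist = (idx - 2.0).norm(dim=-1)                  # (125,) distance from centre
    edges = torch.linspace(0, float(dist.max()) + 1e-6, n_shells + 1)
    shells = [(flat * ((dist >= edges[i]) & (dist < edges[i + 1]))).sum(dim=1)
              for i in range(n_shells)]
    return torch.stack(shells, dim=1)                # (B, n_shells)

g = torch.zeros(1, 5, 5, 5)
g[0, 2, 2, 2] = 1.0  # single centre voxel: all mass in the innermost shell
print(radial_shell_profile(g))  # tensor([[1., 0., 0., 0., 0.]])
```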
**4. Rectified Flow Arbiter** — For ambiguous cases, a `RectifiedFlowArbiter` integrates a learned velocity field over 4 flow-matching steps from noise to class prototypes. Produces refined logits, trajectory logits at each step, confidence scores, and a blend weight that gates between initial and refined predictions. Trained with OT-conditioned flow matching loss.
The final class prediction blends initial and arbiter-refined logits via the learned blend weight.
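The arbiter's integrate-and-blend loop might look like the sketch below; the velocity network, prototype-similarity readout, and blend head are all assumptions, while the 4-step integration from noise and the blend gating follow the card:

```python
import torch
import torch.nn as nn

class RectifiedFlowArbiter(nn.Module):
    """Sketch of stage 4: Euler-integrate a learned velocity field for 4
    steps, read logits off class prototypes, and blend with initial logits."""
    def __init__(self, dim=64, n_classes=38, n_steps=4):
        super().__init__()
        self.n_steps = n_steps
        self.velocity = nn.Sequential(nn.Linear(dim * 2, dim), nn.GELU(),
                                      nn.Linear(dim, dim))
        self.prototypes = nn.Parameter(torch.randn(n_classes, dim))
        self.blend = nn.Linear(dim, 1)

    def forward(self, cond, init_logits):
        x = torch.randn_like(cond)                 # start from noise
        for _ in range(self.n_steps):              # Euler steps, dt = 1/4
            v = self.velocity(torch.cat([x, cond], dim=-1))
            x = x + v / self.n_steps
        refined = x @ self.prototypes.t()          # similarity -> refined logits
        w = torch.sigmoid(self.blend(x))           # learned blend weight
        return w * refined + (1 - w) * init_logits

arb = RectifiedFlowArbiter()
out = arb(torch.randn(2, 64), torch.zeros(2, 38))
print(out.shape)  # torch.Size([2, 38])
```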
### CrossContrastModel
Two MLP projection heads map frozen voxel features (645-dim) and frozen Qwen text embeddings (1536-dim) into a shared 256-dim latent space. Architecture per head: `Linear → LayerNorm → GELU → Linear → LayerNorm → GELU → Linear`. Trained with symmetric InfoNCE loss and a learned temperature parameter.
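A sketch of the stated head architecture; the hidden width is an assumption, since the card gives only the input and latent dimensions:

```python
import torch
import torch.nn as nn

def projection_head(in_dim, latent_dim=256, hidden=512):
    """The per-head layer pattern from the card: Linear -> LayerNorm -> GELU,
    twice, then a final Linear into the shared latent space."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.LayerNorm(hidden), nn.GELU(),
        nn.Linear(hidden, hidden), nn.LayerNorm(hidden), nn.GELU(),
        nn.Linear(hidden, latent_dim))

voxel_proj = projection_head(645)   # frozen classifier features
text_proj = projection_head(1536)   # frozen Qwen embeddings
z_v = voxel_proj(torch.randn(4, 645))
z_t = text_proj(torch.randn(4, 1536))
print(z_v.shape, z_t.shape)  # both (4, 256)
```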
### Text Embeddings
Class descriptions are encoded by Qwen 2.5-1.5B-Instruct using mean-pooled last hidden states. Each of the 38 classes has a 2-shot geometric description (e.g., *"A flat triangular outline formed by three connected edges lying in the horizontal xy-plane, the simplest polygon"*).
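The mean-pooling step can be sketched independently of model loading; `mean_pool` below works on any `(batch, tokens, hidden)` hidden-state tensor with its attention mask, and the dummy tensors stand in for Qwen outputs:

```python
import torch

def mean_pool(last_hidden, attention_mask):
    """Masked mean pooling over last hidden states: average only the real
    (non-padding) token positions, yielding one vector per sequence."""
    mask = attention_mask.unsqueeze(-1).float()        # (B, T, 1)
    return (last_hidden * mask).sum(1) / mask.sum(1).clamp(min=1.0)

hidden = torch.randn(38, 12, 1536)                     # e.g. 38 descriptions
mask = torch.ones(38, 12, dtype=torch.long)
print(mean_pool(hidden, mask).shape)  # torch.Size([38, 1536])
```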
## Training
### Classifier (Cell 3)
| Parameter | Value |
|-----------|-------|
| Dataset | 500K procedurally generated samples (400K train / 100K val) |
| Grid size | 5×5×5 binary occupancy |
| Batch size | 4,096 |
| Optimizer | AdamW (lr=3e-3, wd=1e-4) |
| Schedule | Cosine with 5-epoch warmup |
| Precision | BF16 autocast (no GradScaler) |
| Compile | torch.compile (default mode) |
| Augmentation | Voxel dropout (5%), random addition (5%), spatial shift (8%) |
| Epochs | 80 |
The classifier is trained with a composite loss: cross-entropy on initial and refined logits, capacity fill ratio supervision, peak dimension classification, overflow regularization, capacity diversity, volume regression (log1p MSE), Cayley-Menger determinant sign prediction, curvature binary/type classification, flow matching loss, arbiter confidence calibration, and blend weight supervision. 13 weighted terms total.
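Structurally, the total objective is a weighted sum over named terms; the term names, values, and weights below are placeholders for illustration, not the trained configuration:

```python
import torch

def composite_loss(terms, weights):
    """Combine per-term losses into one scalar via per-term weights."""
    return sum(weights[name] * value for name, value in terms.items())

# Three of the 13 terms, with placeholder values and weights.
terms = {"ce_initial": torch.tensor(2.1), "ce_refined": torch.tensor(1.9),
         "flow_matching": torch.tensor(0.4)}
weights = {"ce_initial": 1.0, "ce_refined": 1.0, "flow_matching": 0.5}
total = composite_loss(terms, weights)
print(f"{float(total):.4f}")  # 2.1 + 1.9 + 0.5*0.4
```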
### Cross-Contrast (Cell 4)
| Parameter | Value |
|-----------|-------|
| Dataset | Reuses Cell 3 cached dataset |
| Voxel encoder | Frozen GeometricShapeClassifier |
| Text encoder | Frozen Qwen 2.5-1.5B-Instruct |
| Latent dim | 256 |
| Batch size | 4,096 |
| Optimizer | AdamW (lr=2e-3, wd=1e-4) |
| Schedule | Cosine with 3-epoch warmup |
| Loss | Symmetric InfoNCE |
| Temperature | Learned (init 0.07) |
| Epochs | 40 |
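The symmetric InfoNCE objective with a learned temperature can be sketched as below; parameterizing the temperature as a log-temperature for positivity is an assumption, while the 0.07 initialization and the symmetric form follow the card:

```python
import torch
import torch.nn.functional as F

def symmetric_info_nce(z_voxel, z_text, log_temp):
    """Symmetric InfoNCE: matched voxel/text pairs share a row index, and
    cross-entropy is applied over both directions of the similarity matrix."""
    z_v = F.normalize(z_voxel, dim=-1)
    z_t = F.normalize(z_text, dim=-1)
    logits = z_v @ z_t.t() / log_temp.exp()       # scaled cosine similarities
    labels = torch.arange(z_v.shape[0])           # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

log_temp = torch.tensor(0.07).log()  # learned in training; fixed here
loss = symmetric_info_nce(torch.randn(8, 256), torch.randn(8, 256), log_temp)
print(loss.item())
```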
## Quick Start
```python
import torch
from safetensors.torch import load_file
# Load classifier weights
weights = load_file("geometric_classifier/model.safetensors")
# Instantiate the model from the repo's class definition, then load the weights:
# model = GeometricShapeClassifier(...)  # constructor args in geometric_classifier/config.json
# model.load_state_dict(weights)
# Load cross-contrast
text_proj_w = load_file("crosscontrast/text_proj.safetensors")
voxel_proj_w = load_file("crosscontrast/voxel_proj.safetensors")
temp = load_file("crosscontrast/temperature.safetensors")
# Load cached embeddings
emb = load_file("qwen_embeddings/embeddings.safetensors")
text_embeddings = emb["embeddings"] # (38, 1536)
# Classify a voxel grid
grid = torch.zeros(1, 5, 5, 5) # your binary occupancy grid
grid[0, 2, 2, 2] = 1 # single point
with torch.no_grad():
    out = model(grid)
    predicted_class = out["class_logits"].argmax(1)
```
## What This Is (and Isn't)
This is a **prototype** exploring geometric–linguistic alignment at small scale. The 5×5×5 grid is intentionally minimal — large enough to represent 38 distinct geometric primitives with curvature distinctions, small enough to train in minutes on a single GPU. The interesting questions are about the structure of the shared latent space: whether text-space confusions mirror geometric failure modes, whether the alignment generalizes beyond the training vocabulary, and what happens at scale.
This is not a production classifier. The procedural dataset is synthetic, the grid resolution is toy-scale, and the cross-contrast vocabulary is fixed at 38 classes.
## License
MIT