---
license: mit
tags:
  - geometric-deep-learning
  - voxel-classifier
  - cross-contrast
  - pentachoron
  - contrastive-learning
  - 3d-classification
pipeline_tag: other
---

# Grid Geometric Classifier Proto

This is a subcomponent experiment within a larger set of scene classification experiments. The work will come full circle back to the original geometric vocabulary soon.

A prototype system for geometric primitive classification and text–geometry alignment. A voxel classifier learns to identify 38 shape classes from 5×5×5 binary occupancy grids using capacity cascades, curvature analysis, differentiation gates, and a rectified flow arbiter. A cross-contrast module then aligns the classifier's learned features with Qwen 2.5-1.5B text embeddings via InfoNCE, producing a shared latent space where geometric structure and natural language descriptions are jointly represented.

This is a research prototype exploring whether a geometric vocabulary learned from pure structure can meaningfully align with linguistic semantics.

## Repository Structure

```
geometric_classifier/          ← Voxel classifier (~1.85M params)
├── config.json                # Architecture: dims, classes, shape catalog
├── training_config.json       # Hyperparams, loss weights, results
└── model.safetensors          # Weights

crosscontrast/                 ← Text↔Voxel alignment heads
├── config.json                # Projection dims, latent space config
├── training_config.json       # Contrastive training params & results
├── text_proj.safetensors      # Text → latent projection
├── voxel_proj.safetensors     # Voxel → latent projection
└── temperature.safetensors    # Learned temperature scalar

qwen_embeddings/               ← Cached Qwen 2.5-1.5B embeddings
├── config.json                # Model name, hidden dim, extraction method
├── embeddings.safetensors     # (38, 1536) class embeddings
└── descriptions.json          # Natural language shape descriptions
```

## Shape Vocabulary: 38 Classes

The vocabulary spans 0D–3D primitives, both rigid and curved, organized by intrinsic dimensionality:

| Dim | Rigid | Curved |
|-----|-------|--------|
| 0D  | point | - |
| 1D  | line_x, line_y, line_z, line_diag, cross, l_shape, collinear | arc, helix |
| 2D  | triangle_xy, triangle_xz, triangle_3d, square_xy, square_xz, rectangle, coplanar, plane | circle, ellipse, disc |
| 3D  | tetrahedron, pyramid, pentachoron, cube, cuboid, triangular_prism, octahedron | sphere, hemisphere, cylinder, cone, capsule, torus, shell, tube, bowl, saddle |

Eight curvature types: `none`, `convex`, `concave`, `cylindrical`, `conical`, `toroidal`, `hyperbolic`, `helical`.
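
As a rough illustration, the vocabulary above can be written down as a plain Python structure. The dict layout and the `all_classes` helper are hypothetical; they are not the schema of `geometric_classifier/config.json`, and the class index order used by the trained model may differ.

```python
# Shape catalog transcribed from the table above.
# Assumption: this layout is illustrative only, not the format of config.json.
SHAPE_CATALOG = {
    0: {"rigid": ["point"], "curved": []},
    1: {
        "rigid": ["line_x", "line_y", "line_z", "line_diag",
                  "cross", "l_shape", "collinear"],
        "curved": ["arc", "helix"],
    },
    2: {
        "rigid": ["triangle_xy", "triangle_xz", "triangle_3d", "square_xy",
                  "square_xz", "rectangle", "coplanar", "plane"],
        "curved": ["circle", "ellipse", "disc"],
    },
    3: {
        "rigid": ["tetrahedron", "pyramid", "pentachoron", "cube",
                  "cuboid", "triangular_prism", "octahedron"],
        "curved": ["sphere", "hemisphere", "cylinder", "cone", "capsule",
                   "torus", "shell", "tube", "bowl", "saddle"],
    },
}

CURVATURE_TYPES = ["none", "convex", "concave", "cylindrical",
                   "conical", "toroidal", "hyperbolic", "helical"]


def all_classes(catalog):
    """Flatten the catalog into (name, intrinsic_dim, is_curved) records."""
    records = []
    for dim, groups in sorted(catalog.items()):
        for kind in ("rigid", "curved"):
            for name in groups[kind]:
                records.append((name, dim, kind == "curved"))
    return records


classes = all_classes(SHAPE_CATALOG)
print(len(classes), len(CURVATURE_TYPES))
```

Flattening the table this way makes it easy to check that the four dimensionality rows add up to the 38 classes the card states.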

## Architecture

### GeometricShapeClassifier (v8)

Input is a 5×5×5 binary voxel grid. The forward pass has four stages:

**1. Tracer Attention**: 5 learned tracer tokens attend over 125 voxel embeddings (occupancy + normalized 3D position → 64-dim via MLP). All C(5,2)=10 tracer pairs compute interaction features and edge detection scores via SwiGLU heads. Pool dimension: 320 (5 tracers × 64-dim).
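
The bookkeeping in this stage can be sketched in a few lines of plain Python. This only builds the per-voxel inputs (occupancy plus position) and enumerates the tracer pairs; it does not reproduce the learned MLP, the attention, or the SwiGLU heads. Scaling positions to [0, 1] is an assumption, since the card says only "normalized", and `voxel_inputs` is a hypothetical helper name.

```python
from itertools import combinations

GRID = 5          # grid side length, from the card
NUM_TRACERS = 5   # learned tracer tokens, from the card
EMBED_DIM = 64    # per-voxel / per-tracer embedding width, from the card


def voxel_inputs(grid):
    """Return one [occupancy, x, y, z] row per voxel.

    Assumption: positions are scaled to [0, 1]; the real model may
    normalize differently (for example to [-1, 1]).
    """
    rows = []
    for i in range(GRID):
        for j in range(GRID):
            for k in range(GRID):
                rows.append([
                    float(grid[i][j][k]),
                    i / (GRID - 1),
                    j / (GRID - 1),
                    k / (GRID - 1),
                ])
    return rows


# A single occupied voxel at the grid center.
grid = [[[0] * GRID for _ in range(GRID)] for _ in range(GRID)]
grid[2][2][2] = 1

rows = voxel_inputs(grid)                                     # 125 rows of 4 features
tracer_pairs = list(combinations(range(NUM_TRACERS), 2))      # all unordered pairs
pool_dim = NUM_TRACERS * EMBED_DIM                            # concatenated tracer outputs

print(len(rows), len(tracer_pairs), pool_dim)
```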

**2. Capacity Cascade**: Four `CapacityHead` modules with learned capacities (initialized at 0.5, 1.0, 1.5, 2.0) process features sequentially. Each outputs a fill ratio (sigmoid), overflow signal, and residual features. The cascade partitions representation capacity across intrinsic dimensions (0D→3D), with fill ratios serving as soft dimensionality indicators.

**3. Curvature Analysis**: A `DifferentiationGate` computes radial distance profiles binned into 5 shells, producing sigmoid gates and additive directional features that differentiate convex/concave curvature. A `CurvatureHead` combines rigid features with gated curvature features to predict: is_curved (binary), curvature_type (8-class), and a curvature embedding used downstream.
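
To make the "radial distance profile binned into 5 shells" concrete, here is a minimal hand-written version. It is a sketch under stated assumptions, not the `DifferentiationGate` itself: distances are measured from the grid center (the real module might use the shape centroid), shells have equal width out to the corner distance, and the profile is the fraction of occupied voxels per shell. `radial_shell_profile` is a hypothetical name.

```python
import math

GRID = 5
NUM_SHELLS = 5


def radial_shell_profile(grid):
    """Fraction of occupied voxels falling in each of 5 equal-width radial shells.

    Assumptions: radius is measured from the grid center, and the outermost
    shell ends at the center-to-corner distance.
    """
    center = (GRID - 1) / 2
    max_r = math.sqrt(3) * center          # center-to-corner distance
    counts = [0] * NUM_SHELLS
    total = 0
    for i in range(GRID):
        for j in range(GRID):
            for k in range(GRID):
                if grid[i][j][k]:
                    r = math.dist((i, j, k), (center, center, center))
                    # Clamp so corner voxels land in the last shell.
                    shell = min(int(r / max_r * NUM_SHELLS), NUM_SHELLS - 1)
                    counts[shell] += 1
                    total += 1
    if total == 0:
        return [0.0] * NUM_SHELLS
    return [c / total for c in counts]


def empty_grid():
    return [[[0] * GRID for _ in range(GRID)] for _ in range(GRID)]


point = empty_grid()
point[2][2][2] = 1                         # mass concentrated at the center

corners = empty_grid()
for i in (0, 4):
    for j in (0, 4):
        for k in (0, 4):
            corners[i][j][k] = 1           # mass concentrated at the boundary

print(radial_shell_profile(point))
print(radial_shell_profile(corners))
```

A solid shape puts mass in the inner shells while a hollow one puts it in the outer shells, which is one plausible reason a radial profile helps separate shapes such as `sphere` and `shell`.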

**4. Rectified Flow Arbiter**: For ambiguous cases, a `RectifiedFlowArbiter` integrates a learned velocity field over 4 flow-matching steps from noise to class prototypes. Produces refined logits, trajectory logits at each step, confidence scores, and a blend weight that gates between initial and refined predictions. Trained with OT-conditioned flow matching loss.

The final class prediction blends initial and arbiter-refined logits via the learned blend weight.
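
The arbiter's integrate-then-blend logic can be illustrated with a toy example. This is a sketch, not the `RectifiedFlowArbiter`: the learned velocity network is replaced by the analytic straight-line velocity toward a known target, the integrator is plain Euler with 4 steps, the integrated state is treated directly as the refined logits, and the blend is assumed to be a convex combination. All function names here are hypothetical.

```python
def integrate_flow(x0, velocity, steps=4):
    """Euler-integrate dx/dt = velocity(x, t) from t=0 to t=1."""
    x = list(x0)
    dt = 1.0 / steps
    for n in range(steps):
        t = n * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def straight_line_velocity(target):
    """Stand-in for the learned field: the velocity that reaches `target` at t=1
    along a straight path, as rectified flow aims for."""
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


def blend_logits(initial, refined, w):
    """Assumed convex blend: w=0 keeps the initial logits, w=1 the refined ones."""
    return [(1.0 - w) * a + w * b for a, b in zip(initial, refined)]


noise = [0.3, -1.2, 0.8]        # toy starting point
prototype = [2.0, 0.0, -1.0]    # toy class prototype
initial = [1.0, 0.5, 0.0]       # toy initial logits

refined = integrate_flow(noise, straight_line_velocity(prototype), steps=4)
final = blend_logits(initial, refined, w=0.25)
print(refined, final)
```

Because the path is straight, a handful of Euler steps is enough to land on the target, which is the usual argument for why rectified flow tolerates very few integration steps.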

### CrossContrastModel

Two MLP projection heads map frozen voxel features (645-dim) and frozen Qwen text embeddings (1536-dim) into a shared 256-dim latent space. Architecture per head: `Linear → LayerNorm → GELU → Linear → LayerNorm → GELU → Linear`. Trained with symmetric InfoNCE loss and a learned temperature parameter.
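
For readers unfamiliar with the objective, a symmetric InfoNCE loss can be written in dependency-free Python as below. This is a generic sketch rather than the training code: it assumes row `i` of the voxel latents is paired with row `i` of the text latents, uses cosine similarity, and takes a fixed temperature where the model learns one. How the real pipeline handles several samples of the same class within a batch is not described in this card.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_info_nce(voxel_latents, text_latents, temperature=0.07):
    """Average of voxel-to-text and text-to-voxel InfoNCE terms."""
    v = [l2_normalize(x) for x in voxel_latents]
    t = [l2_normalize(x) for x in text_latents]
    logits = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
              for vi in v]
    logits_t = [list(col) for col in zip(*logits)]  # transpose for the text side
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = symmetric_info_nce(aligned, aligned)
loss_swapped = symmetric_info_nce(aligned, swapped)
print(loss_aligned, loss_swapped)
```

Matched pairs give a loss near zero and mismatched pairs a large one; a lower temperature sharpens that gap.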

### Text Embeddings

Class descriptions are encoded by Qwen 2.5-1.5B-Instruct using mean-pooled last hidden states. Each of the 38 classes has a 2-shot geometric description (e.g., *"A flat triangular outline formed by three connected edges lying in the horizontal xy-plane, the simplest polygon"*).
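
Mean pooling over last hidden states reduces a sequence of per-token vectors to one vector per description. A minimal sketch follows; masking out padding tokens is an assumption (the card says only "mean-pooled"), and `mean_pool` is a hypothetical helper rather than a function from this repository or from `transformers`.

```python
def mean_pool(hidden_states, attention_mask):
    """Average per-token vectors, skipping positions where the mask is 0.

    hidden_states: list of token vectors (seq_len x hidden_dim)
    attention_mask: list of 0/1 flags, one per token
    """
    dim = len(hidden_states[0])
    sums = [0.0] * dim
    count = 0
    for vec, keep in zip(hidden_states, attention_mask):
        if keep:
            count += 1
            for d in range(dim):
                sums[d] += vec[d]
    # Guard against an all-padding sequence.
    return [s / max(count, 1) for s in sums]


# Two real tokens followed by one padding token (toy 2-dim hidden states).
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)
```

With the real encoder the same operation would run over 1536-dim hidden states, giving the `(38, 1536)` matrix stored in `qwen_embeddings/embeddings.safetensors`.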

## Training

### Classifier (Cell 3)

| Parameter | Value |
|-----------|-------|
| Dataset | 500K procedurally generated samples (400K train / 100K val) |
| Grid size | 5×5×5 binary occupancy |
| Batch size | 4,096 |
| Optimizer | AdamW (lr=3e-3, wd=1e-4) |
| Schedule | Cosine with 5-epoch warmup |
| Precision | BF16 autocast (no GradScaler) |
| Compile | torch.compile (default mode) |
| Augmentation | Voxel dropout (5%), random addition (5%), spatial shift (8%) |
| Epochs | 80 |

The classifier is trained with a composite loss: cross-entropy on initial and refined logits, capacity fill ratio supervision, peak dimension classification, overflow regularization, capacity diversity, volume regression (log1p MSE), Cayley-Menger determinant sign prediction, curvature binary/type classification, flow matching loss, arbiter confidence calibration, and blend weight supervision. 13 weighted terms total.
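
Two of the geometric targets named above can be illustrated directly. The sketch below computes a Cayley-Menger determinant for a set of points, derives the simplex volume from it, and evaluates a log1p-space MSE. It is a reference computation under assumptions, not the training code: how the dataset chooses the points, how the sign label is defined, and whether predictions live in log1p space are not stated in this card, and all function names are hypothetical.

```python
import math


def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [[float(x) for x in row] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0                       # singular matrix
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det                       # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def cayley_menger_det(points):
    """Determinant of the bordered squared-distance matrix of `points`."""
    n = len(points)
    m = [[0.0] + [1.0] * n]
    for p in points:
        m.append([1.0] + [sum((x - y) ** 2 for x, y in zip(p, q)) for q in points])
    return determinant(m)


def simplex_volume(points):
    """Volume of the simplex spanned by k+1 points; 0 if degenerate."""
    k = len(points) - 1
    cm = cayley_menger_det(points)
    vol_sq = (-1) ** (k + 1) * cm / (2 ** k * math.factorial(k) ** 2)
    return math.sqrt(max(vol_sq, 0.0))


def log1p_mse(predictions, volumes):
    """MSE against log1p(volume). Assumption: the model predicts in log1p space."""
    errors = [(p - math.log1p(v)) ** 2 for p, v in zip(predictions, volumes)]
    return sum(errors) / len(errors)


tetra = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
triangle = [(0, 0), (1, 0), (0, 1)]
collinear = [(0, 0), (1, 0), (2, 0)]

print(cayley_menger_det(tetra), simplex_volume(tetra))
print(cayley_menger_det(triangle), simplex_volume(triangle))
print(cayley_menger_det(collinear))
```

The determinant vanishes for degenerate point sets such as collinear triples, so its sign and magnitude carry information about the intrinsic dimension of a shape.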

### Cross-Contrast (Cell 4)

| Parameter | Value |
|-----------|-------|
| Dataset | Reuses Cell 3 cached dataset |
| Voxel encoder | Frozen GeometricShapeClassifier |
| Text encoder | Frozen Qwen 2.5-1.5B-Instruct |
| Latent dim | 256 |
| Batch size | 4,096 |
| Optimizer | AdamW (lr=2e-3, wd=1e-4) |
| Schedule | Cosine with 3-epoch warmup |
| Loss | Symmetric InfoNCE |
| Temperature | Learned (init 0.07) |
| Epochs | 40 |

## Quick Start

```python
import torch
from safetensors.torch import load_file

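# Load classifier weights.
# GeometricShapeClassifier must already be defined or imported in your session;
# this card does not list a Python module for it.
weights = load_file("geometric_classifier/model.safetensors")
# Assumption: pass values from geometric_classifier/config.json if the constructor needs them.
model = GeometricShapeClassifier()
model.load_state_dict(weights)
model.eval()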

# Load cross-contrast
text_proj_w = load_file("crosscontrast/text_proj.safetensors")
voxel_proj_w = load_file("crosscontrast/voxel_proj.safetensors")
temp = load_file("crosscontrast/temperature.safetensors")

# Load cached embeddings
emb = load_file("qwen_embeddings/embeddings.safetensors")
text_embeddings = emb["embeddings"]  # (38, 1536)

# Classify a voxel grid
grid = torch.zeros(1, 5, 5, 5)  # your binary occupancy grid
grid[0, 2, 2, 2] = 1  # single point
with torch.no_grad():
    out = model(grid)
    predicted_class = out["class_logits"].argmax(1)
```

## What This Is (and Isn't)

This is a **prototype** exploring geometric–linguistic alignment at small scale. The 5×5×5 grid is intentionally minimal: large enough to represent 38 distinct geometric primitives with curvature distinctions, small enough to train in minutes on a single GPU. The interesting questions are about the structure of the shared latent space: whether text-space confusions mirror geometric failure modes, whether the alignment generalizes beyond the training vocabulary, and what happens at scale.

This is not a production classifier. The procedural dataset is synthetic, the grid resolution is toy-scale, and the cross-contrast vocabulary is fixed at 38 classes.

## License

MIT