---
license: mit
tags:
- geometric-deep-learning
- voxel-classifier
- cross-contrast
- pentachoron
- contrastive-learning
- 3d-classification
pipeline_tag: other
---

# Grid Geometric Classifier Proto

This is a subcomponent experiment for the larger scene classification experiments. The work will come full circle back to the original geometric vocabulary soon.

A prototype system for geometric primitive classification and text-geometry alignment. A voxel classifier learns to identify 38 shape classes from 5×5×5 binary occupancy grids using capacity cascades, curvature analysis, differentiation gates, and a rectified flow arbiter. A cross-contrast module then aligns the classifier's learned features with Qwen 2.5-1.5B text embeddings via InfoNCE, producing a shared latent space where geometric structure and natural language descriptions are jointly represented.

This is a research prototype exploring whether a geometric vocabulary learned from pure structure can meaningfully align with linguistic semantics.

## Repository Structure

```
geometric_classifier/            # Voxel classifier (~1.85M params)
├── config.json                  # Architecture: dims, classes, shape catalog
├── training_config.json         # Hyperparams, loss weights, results
└── model.safetensors            # Weights

crosscontrast/                   # Text-voxel alignment heads
├── config.json                  # Projection dims, latent space config
├── training_config.json         # Contrastive training params & results
├── text_proj.safetensors        # Text → latent projection
├── voxel_proj.safetensors       # Voxel → latent projection
└── temperature.safetensors      # Learned temperature scalar

qwen_embeddings/                 # Cached Qwen 2.5-1.5B embeddings
├── config.json                  # Model name, hidden dim, extraction method
├── embeddings.safetensors       # (38, 1536) class embeddings
└── descriptions.json            # Natural language shape descriptions
```

## Shape Vocabulary: 38 Classes

The vocabulary spans 0D–3D primitives, both rigid and curved, organized by intrinsic dimensionality:

| Dim | Rigid | Curved |
|-----|-------|--------|
| 0D | point | - |
| 1D | line_x, line_y, line_z, line_diag, cross, l_shape, collinear | arc, helix |
| 2D | triangle_xy, triangle_xz, triangle_3d, square_xy, square_xz, rectangle, coplanar, plane | circle, ellipse, disc |
| 3D | tetrahedron, pyramid, pentachoron, cube, cuboid, triangular_prism, octahedron | sphere, hemisphere, cylinder, cone, capsule, torus, shell, tube, bowl, saddle |

Eight curvature types: `none`, `convex`, `concave`, `cylindrical`, `conical`, `toroidal`, `hyperbolic`, `helical`.

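For quick reference, the table above can be transcribed into a plain Python structure. This is only a sketch: the names `SHAPES`, `CURVATURE_TYPES`, and `CLASS_NAMES` are illustrative, and the flattened ordering is arbitrary, so it may not match the class indices stored in `geometric_classifier/config.json`.

```python
# Shape vocabulary transcribed from the table above, keyed by intrinsic dimension.
# The dict layout and variable names are illustrative, not the repository's own.
SHAPES = {
    0: {"rigid": ["point"], "curved": []},
    1: {"rigid": ["line_x", "line_y", "line_z", "line_diag", "cross", "l_shape", "collinear"],
        "curved": ["arc", "helix"]},
    2: {"rigid": ["triangle_xy", "triangle_xz", "triangle_3d", "square_xy", "square_xz",
                  "rectangle", "coplanar", "plane"],
        "curved": ["circle", "ellipse", "disc"]},
    3: {"rigid": ["tetrahedron", "pyramid", "pentachoron", "cube", "cuboid",
                  "triangular_prism", "octahedron"],
        "curved": ["sphere", "hemisphere", "cylinder", "cone", "capsule", "torus",
                   "shell", "tube", "bowl", "saddle"]},
}
CURVATURE_TYPES = ["none", "convex", "concave", "cylindrical", "conical",
                   "toroidal", "hyperbolic", "helical"]

# Flatten into one class list. The ordering is arbitrary and may differ from
# the class indices used by the trained model.
CLASS_NAMES = [name
               for dim in sorted(SHAPES)
               for kind in ("rigid", "curved")
               for name in SHAPES[dim][kind]]

# Number of classes per intrinsic dimension.
counts_per_dim = {dim: sum(len(names) for names in groups.values())
                  for dim, groups in SHAPES.items()}
print(len(CLASS_NAMES), counts_per_dim)
```

The per-dimension counts (1, 9, 11, and 17) sum to the 38 classes in the section title.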
## Architecture

### GeometricShapeClassifier (v8)

Input is a 5×5×5 binary voxel grid. The forward pass has four stages:

**1. Tracer Attention:** 5 learned tracer tokens attend over 125 voxel embeddings (occupancy + normalized 3D position → 64-dim via MLP). All C(5,2)=10 tracer pairs compute interaction features and edge detection scores via SwiGLU heads. Pool dimension: 320 (5 tracers × 64-dim).

**2. Capacity Cascade:** Four `CapacityHead` modules with learned capacities (initialized at 0.5, 1.0, 1.5, 2.0) process features sequentially. Each outputs a fill ratio (sigmoid), an overflow signal, and residual features. The cascade partitions representation capacity across intrinsic dimensions (0D–3D), with fill ratios serving as soft dimensionality indicators.

**3. Curvature Analysis:** A `DifferentiationGate` computes radial distance profiles binned into 5 shells, producing sigmoid gates and additive directional features that differentiate convex from concave curvature. A `CurvatureHead` combines rigid features with gated curvature features to predict `is_curved` (binary), `curvature_type` (8-class), and a curvature embedding used downstream.

**4. Rectified Flow Arbiter:** For ambiguous cases, a `RectifiedFlowArbiter` integrates a learned velocity field over 4 flow-matching steps from noise to class prototypes. It produces refined logits, trajectory logits at each step, confidence scores, and a blend weight that gates between initial and refined predictions. It is trained with an OT-conditioned flow matching loss.

The final class prediction blends initial and arbiter-refined logits via the learned blend weight.

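The fourth stage and the final blend can be sketched roughly as below. This is a toy stand-in, not the actual `RectifiedFlowArbiter`: the class name `ToyFlowArbiter`, the layer sizes, the way the velocity field is conditioned on features, and the convex-combination blend rule are all assumptions made for illustration. Only the 4 Euler steps, the class prototypes, and the existence of a blend weight come from the description above.

```python
import torch
import torch.nn as nn

NUM_CLASSES, FEAT_DIM, NUM_STEPS = 38, 64, 4  # FEAT_DIM is an illustrative size


class ToyFlowArbiter(nn.Module):
    """Hypothetical stand-in for RectifiedFlowArbiter; all layer sizes are assumptions."""

    def __init__(self):
        super().__init__()
        # Velocity field v(x, t | features): input is [state, time, conditioning features].
        self.velocity = nn.Sequential(
            nn.Linear(2 * FEAT_DIM + 1, 128), nn.SiLU(), nn.Linear(128, FEAT_DIM)
        )
        self.prototypes = nn.Parameter(torch.randn(NUM_CLASSES, FEAT_DIM))
        self.blend_head = nn.Linear(FEAT_DIM, 1)

    def forward(self, features):
        batch = features.size(0)
        x = torch.randn_like(features)  # start from Gaussian noise at t = 0
        dt = 1.0 / NUM_STEPS
        trajectory_logits = []
        for step in range(NUM_STEPS):  # fixed-step Euler integration toward t = 1
            t = torch.full((batch, 1), step * dt, dtype=features.dtype, device=features.device)
            x = x + dt * self.velocity(torch.cat([x, t, features], dim=1))
            trajectory_logits.append(x @ self.prototypes.T)  # score against each class prototype
        blend = torch.sigmoid(self.blend_head(features))  # (batch, 1) gate from the sigmoid
        return trajectory_logits[-1], trajectory_logits, blend


arbiter = ToyFlowArbiter()
features = torch.randn(2, FEAT_DIM)            # stand-in for pooled classifier features
initial_logits = torch.randn(2, NUM_CLASSES)   # stand-in for the classifier's initial logits
with torch.no_grad():
    refined_logits, trajectory, blend = arbiter(features)
    # Assumed blend rule: convex combination gated by the blend weight.
    final_logits = (1.0 - blend) * initial_logits + blend * refined_logits
print(final_logits.shape, len(trajectory))
```

With a blend weight near 0 the prediction stays close to the initial logits, and near 1 it defers to the arbiter, which matches the "gates between initial and refined predictions" wording.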
### CrossContrastModel

Two MLP projection heads map frozen voxel features (645-dim) and frozen Qwen text embeddings (1536-dim) into a shared 256-dim latent space. Architecture per head: `Linear → LayerNorm → GELU → Linear → LayerNorm → GELU → Linear`. Trained with symmetric InfoNCE loss and a learned temperature parameter.

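A minimal sketch of the head layout and loss described above follows. The helper names `make_projection_head` and `symmetric_info_nce`, the hidden width of 512, and the log-space temperature parameterization are assumptions. The sketch also treats row `i` of each batch as the only positive for row `i`; how repeated classes within a batch are handled is not described here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def make_projection_head(in_dim, hidden_dim=512, out_dim=256):
    # Linear -> LayerNorm -> GELU -> Linear -> LayerNorm -> GELU -> Linear.
    # hidden_dim=512 is an assumption; only the 256-dim output is stated above.
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim), nn.LayerNorm(hidden_dim), nn.GELU(),
        nn.Linear(hidden_dim, hidden_dim), nn.LayerNorm(hidden_dim), nn.GELU(),
        nn.Linear(hidden_dim, out_dim),
    )


def symmetric_info_nce(voxel_z, text_z, log_temperature):
    # Cosine-similarity logits scaled by the temperature; row i pairs with column i.
    v = F.normalize(voxel_z, dim=-1)
    t = F.normalize(text_z, dim=-1)
    logits = (v @ t.T) / log_temperature.exp()
    targets = torch.arange(v.size(0), device=v.device)
    # Average the voxel-to-text and text-to-voxel cross-entropies.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))


voxel_proj = make_projection_head(645)
text_proj = make_projection_head(1536)
# Keep the temperature in log space so it stays positive; initial value 0.07.
log_temperature = nn.Parameter(torch.tensor(0.07).log())

voxel_feats = torch.randn(8, 645)    # stand-in for frozen classifier features
text_embeds = torch.randn(8, 1536)   # stand-in for cached Qwen embeddings
voxel_z = voxel_proj(voxel_feats)
text_z = text_proj(text_embeds)
loss = symmetric_info_nce(voxel_z, text_z, log_temperature)
print(voxel_z.shape, text_z.shape, loss.item())
```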
### Text Embeddings

Class descriptions are encoded by Qwen 2.5-1.5B-Instruct using mean-pooled last hidden states. Each of the 38 classes has a 2-shot geometric description (e.g., *"A flat triangular outline formed by three connected edges lying in the horizontal xy-plane, the simplest polygon"*).

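Mean pooling over last hidden states is commonly done as below. This is a sketch on random tensors rather than real Qwen outputs, and the function name `mean_pool` is illustrative. Whether the original extraction masked out padding tokens is not stated, so the masking here is an assumption.

```python
import torch


def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sequence, ignoring padded positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, tokens, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid division by zero
    return summed / counts


# Dummy batch: 2 descriptions, 6 token positions, hidden size 1536.
hidden = torch.randn(2, 6, 1536)
mask = torch.tensor([[1, 1, 1, 1, 0, 0],    # first description has 4 real tokens
                     [1, 1, 1, 1, 1, 1]])   # second has 6
pooled = mean_pool(hidden, mask)
print(pooled.shape)
```

Stacking one pooled vector per class description would give a tensor with the `(38, 1536)` shape of `embeddings.safetensors`.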
## Training

### Classifier (Cell 3)

| Parameter | Value |
|-----------|-------|
| Dataset | 500K procedurally generated samples (400K train / 100K val) |
| Grid size | 5×5×5 binary occupancy |
| Batch size | 4,096 |
| Optimizer | AdamW (lr=3e-3, wd=1e-4) |
| Schedule | Cosine with 5-epoch warmup |
| Precision | BF16 autocast (no GradScaler) |
| Compile | torch.compile (default mode) |
| Augmentation | Voxel dropout (5%), random addition (5%), spatial shift (8%) |
| Epochs | 80 |

The classifier is trained with a composite loss: cross-entropy on initial and refined logits, capacity fill ratio supervision, peak dimension classification, overflow regularization, capacity diversity, volume regression (log1p MSE), Cayley-Menger determinant sign prediction, curvature binary/type classification, flow matching loss, arbiter confidence calibration, and blend weight supervision, for 13 weighted terms in total.

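For readers unfamiliar with the Cayley-Menger determinant named in the loss list: for a k-simplex with k+1 vertices, the squared volume is V² = (-1)^(k+1) · det(CM) / (2^k · (k!)²), where CM is the matrix of squared pairwise distances bordered by a row and column of ones. The sketch below computes it in plain Python. The function names are illustrative, and how the training target is derived from the determinant's sign is not specified here.

```python
from math import factorial


def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n, sign, result = len(m), 1.0, 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        if abs(m[pivot][c]) < 1e-12:
            return 0.0  # singular matrix
        if pivot != c:
            m[c], m[pivot] = m[pivot], m[c]
            sign = -sign  # a row swap flips the sign
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for k in range(c, n):
                m[r][k] -= factor * m[c][k]
        result *= m[c][c]
    return sign * result


def cayley_menger(points):
    """Bordered squared-distance determinant for a list of points."""
    def dist_sq(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    n = len(points)
    m = [[0.0] + [1.0] * n]
    for p in points:
        m.append([1.0] + [dist_sq(p, q) for q in points])
    return det(m)


def simplex_volume_sq(points):
    """Squared volume of the simplex spanned by len(points) vertices."""
    k = len(points) - 1  # simplex dimension
    return (-1) ** (k + 1) * cayley_menger(points) / (2 ** k * factorial(k) ** 2)


tetra = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]   # volume 1/6
flat = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0)]    # four coplanar points
print(simplex_volume_sq(tetra), simplex_volume_sq(flat))
```

A non-degenerate tetrahedron gives a positive squared volume, while coplanar points give zero, which is one reason the determinant is a useful signal for intrinsic dimensionality.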
### Cross-Contrast (Cell 4)

| Parameter | Value |
|-----------|-------|
| Dataset | Reuses Cell 3 cached dataset |
| Voxel encoder | Frozen GeometricShapeClassifier |
| Text encoder | Frozen Qwen 2.5-1.5B-Instruct |
| Latent dim | 256 |
| Batch size | 4,096 |
| Optimizer | AdamW (lr=2e-3, wd=1e-4) |
| Schedule | Cosine with 3-epoch warmup |
| Loss | Symmetric InfoNCE |
| Temperature | Learned (init 0.07) |
| Epochs | 40 |

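One plausible reading of "cosine with 3-epoch warmup" is sketched below using the values from the table. The linear warmup shape, the per-epoch (rather than per-step) update, the decay to zero, and the function name `lr_at_epoch` are assumptions; the table specifies only the schedule type, warmup length, base learning rate, and epoch count.

```python
import math


def lr_at_epoch(epoch, base_lr=2e-3, warmup_epochs=3, total_epochs=40):
    """Learning rate for a 0-indexed epoch: linear warmup, then cosine decay."""
    if epoch < warmup_epochs:
        # Assumed linear ramp that reaches base_lr on the last warmup epoch.
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine decay from base_lr toward zero over the remaining epochs.
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


schedule = [lr_at_epoch(e) for e in range(40)]
for epoch in (0, 2, 3, 20, 39):
    print(epoch, f"{schedule[epoch]:.6f}")
```

The same function with `base_lr=3e-3`, `warmup_epochs=5`, and `total_epochs=80` would correspond to the classifier's settings.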
## Quick Start

```python
import torch
from safetensors.torch import load_file

# Load classifier weights
weights = load_file("geometric_classifier/model.safetensors")
# The GeometricShapeClassifier class is defined in the training code, not in this
# snippet. Once it is importable, build the model and load the weights:
#   model = GeometricShapeClassifier(...)
#   model.load_state_dict(weights)
#   model.eval()

# Load cross-contrast projection heads and temperature
text_proj_w = load_file("crosscontrast/text_proj.safetensors")
voxel_proj_w = load_file("crosscontrast/voxel_proj.safetensors")
temp = load_file("crosscontrast/temperature.safetensors")

# Load cached embeddings
emb = load_file("qwen_embeddings/embeddings.safetensors")
text_embeddings = emb["embeddings"]  # (38, 1536)

# Classify a voxel grid (requires `model` from the step above)
grid = torch.zeros(1, 5, 5, 5)  # your binary occupancy grid
grid[0, 2, 2, 2] = 1  # single point
with torch.no_grad():
    out = model(grid)
    predicted_class = out["class_logits"].argmax(dim=1)
```

## What This Is (and Isn't)

This is a **prototype** exploring geometric-linguistic alignment at small scale. The 5×5×5 grid is intentionally minimal: large enough to represent 38 distinct geometric primitives with curvature distinctions, small enough to train in minutes on a single GPU. The interesting questions are about the structure of the shared latent space: whether text-space confusions mirror geometric failure modes, whether the alignment generalizes beyond the training vocabulary, and what happens at scale.

This is not a production classifier. The procedural dataset is synthetic, the grid resolution is toy-scale, and the cross-contrast vocabulary is fixed at 38 classes.

## License

MIT