	---
	license: mit
	tags:
	- geometric-deep-learning
	- voxel-classifier
	- cross-contrast
	- pentachoron
	- contrastive-learning
	- 3d-classification
	pipeline_tag: other
	---

	# Grid Geometric Classifier Proto

	This is a subcomponent experiment within a larger set of scene classification experiments. The work will soon come full circle, back to the original geometric vocabulary.

	A prototype system for geometric primitive classification and text–geometry alignment. A voxel classifier learns to identify 38 shape classes from 5×5×5 binary occupancy grids using capacity cascades, curvature analysis, differentiation gates, and a rectified flow arbiter. A cross-contrast module then aligns the classifier's learned features with Qwen 2.5-1.5B text embeddings via InfoNCE, producing a shared latent space where geometric structure and natural language descriptions are jointly represented.

	This is a research prototype exploring whether a geometric vocabulary learned from pure structure can meaningfully align with linguistic semantics.

	## Repository Structure

	```
	geometric_classifier/ ← Voxel classifier (~1.85M params)
	├── config.json # Architecture: dims, classes, shape catalog
	├── training_config.json # Hyperparams, loss weights, results
	└── model.safetensors # Weights

	crosscontrast/ ← Text↔Voxel alignment heads
	├── config.json # Projection dims, latent space config
	├── training_config.json # Contrastive training params & results
	├── text_proj.safetensors # Text → latent projection
	├── voxel_proj.safetensors # Voxel → latent projection
	└── temperature.safetensors # Learned temperature scalar

	qwen_embeddings/ ← Cached Qwen 2.5-1.5B embeddings
	├── config.json # Model name, hidden dim, extraction method
	├── embeddings.safetensors # (38, 1536) class embeddings
	└── descriptions.json # Natural language shape descriptions
	```

	## Shape Vocabulary: 38 Classes

	The vocabulary spans 0D–3D primitives, both rigid and curved, organized by intrinsic dimensionality:

	| Dim | Rigid | Curved |
	|-----|-------|--------|
	| 0D | point | - |
	| 1D | line_x, line_y, line_z, line_diag, cross, l_shape, collinear | arc, helix |
	| 2D | triangle_xy, triangle_xz, triangle_3d, square_xy, square_xz, rectangle, coplanar, plane | circle, ellipse, disc |
	| 3D | tetrahedron, pyramid, pentachoron, cube, cuboid, triangular_prism, octahedron | sphere, hemisphere, cylinder, cone, capsule, torus, shell, tube, bowl, saddle |

	Eight curvature types: `none`, `convex`, `concave`, `cylindrical`, `conical`, `toroidal`, `hyperbolic`, `helical`.
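
The table above can be mirrored as a plain Python mapping, which makes the class count easy to check. This is only an illustrative sketch: `SHAPE_VOCAB`, `CURVATURE_TYPES`, and `all_classes` are hypothetical names, and the class index order used by the trained model (stored in `geometric_classifier/config.json`) may differ from the order shown here.

```python
# Illustrative mirror of the vocabulary table. SHAPE_VOCAB is a hypothetical
# name; the index order used by the trained model may differ.
SHAPE_VOCAB = {
    0: {"rigid": ["point"], "curved": []},
    1: {
        "rigid": ["line_x", "line_y", "line_z", "line_diag",
                  "cross", "l_shape", "collinear"],
        "curved": ["arc", "helix"],
    },
    2: {
        "rigid": ["triangle_xy", "triangle_xz", "triangle_3d", "square_xy",
                  "square_xz", "rectangle", "coplanar", "plane"],
        "curved": ["circle", "ellipse", "disc"],
    },
    3: {
        "rigid": ["tetrahedron", "pyramid", "pentachoron", "cube", "cuboid",
                  "triangular_prism", "octahedron"],
        "curved": ["sphere", "hemisphere", "cylinder", "cone", "capsule",
                   "torus", "shell", "tube", "bowl", "saddle"],
    },
}

CURVATURE_TYPES = ["none", "convex", "concave", "cylindrical",
                   "conical", "toroidal", "hyperbolic", "helical"]


def all_classes(vocab):
    """Flatten the mapping into one list, ordered by dimension, rigid first."""
    return [
        name
        for dim in sorted(vocab)
        for kind in ("rigid", "curved")
        for name in vocab[dim][kind]
    ]


classes = all_classes(SHAPE_VOCAB)
print(len(classes), len(CURVATURE_TYPES))
```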

	## Architecture

	### GeometricShapeClassifier (v8)

	Input is a 5×5×5 binary voxel grid. The forward pass has four stages:

	1. Tracer Attention — 5 learned tracer tokens attend over 125 voxel embeddings (occupancy + normalized 3D position → 64-dim via MLP). All C(5,2)=10 tracer pairs compute interaction features and edge detection scores via SwiGLU heads. Pool dimension: 320 (5 tracers × 64-dim).

	2. Capacity Cascade — Four `CapacityHead` modules with learned capacities (initialized at 0.5, 1.0, 1.5, 2.0) process features sequentially. Each outputs a fill ratio (sigmoid), overflow signal, and residual features. The cascade partitions representation capacity across intrinsic dimensions (0D→3D), with fill ratios serving as soft dimensionality indicators.

	3. Curvature Analysis — A `DifferentiationGate` computes radial distance profiles binned into 5 shells, producing sigmoid gates and additive directional features that differentiate convex/concave curvature. A `CurvatureHead` combines rigid features with gated curvature features to predict: is_curved (binary), curvature_type (8-class), and a curvature embedding used downstream.

	4. Rectified Flow Arbiter — For ambiguous cases, a `RectifiedFlowArbiter` integrates a learned velocity field over 4 flow-matching steps from noise to class prototypes. Produces refined logits, trajectory logits at each step, confidence scores, and a blend weight that gates between initial and refined predictions. Trained with OT-conditioned flow matching loss.
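
Two of the inputs described in the stages above can be sketched without any learned weights: the per-voxel features fed to the tracer attention (occupancy plus normalized position) and the 5-shell radial distance profile used by the differentiation gate. This is a rough sketch under stated assumptions, not the repository's code: scaling coordinates to [-1, 1], measuring distance from the grid center, and using equal-width shells are all guesses, and the function names are hypothetical.

```python
import math
from itertools import combinations, product

GRID = 5               # grid edge length from the README
HALF = (GRID - 1) / 2  # index of the center voxel along each axis


def voxel_features(grid):
    """Return 125 rows of [occupancy, x, y, z], coordinates scaled to [-1, 1].

    The [-1, 1] scaling is an assumption; the README only says "normalized".
    """
    return [
        [float(grid[i][j][k]),
         (i - HALF) / HALF, (j - HALF) / HALF, (k - HALF) / HALF]
        for i, j, k in product(range(GRID), repeat=3)
    ]


def radial_profile(grid, n_shells=5):
    """Fraction of occupied voxels in each equal-width shell around the center.

    Centering on the grid center and equal-width bins are assumptions.
    """
    max_r = math.sqrt(3) * HALF  # distance from the center to a corner voxel
    counts = [0] * n_shells
    for i, j, k in product(range(GRID), repeat=3):
        if grid[i][j][k]:
            r = math.dist((i, j, k), (HALF, HALF, HALF))
            shell = min(int(r / max_r * n_shells), n_shells - 1)
            counts[shell] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * n_shells


# A single occupied voxel at the center, as in the Quick Start example.
grid = [[[0] * GRID for _ in range(GRID)] for _ in range(GRID)]
grid[2][2][2] = 1

feats = voxel_features(grid)
tracer_pairs = list(combinations(range(5), 2))  # all unordered tracer pairs
print(len(feats), len(feats[0]), len(tracer_pairs))
print(radial_profile(grid))
```

In the model, each 4-value row would then pass through the MLP that produces the 64-dim voxel embedding; that step is learned and is not reproduced here.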

	The final class prediction blends initial and arbiter-refined logits via the learned blend weight.
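
The arbiter's integration and the final blend can be illustrated with a toy example. The sketch below uses fixed-step Euler integration over 4 steps and a convex combination for the blend; both are assumptions about details the text does not spell out, and `toy_velocity` is a stand-in for the learned velocity field (it simply points from the noise sample to one fixed prototype). All names are hypothetical.

```python
def euler_integrate(velocity, x0, n_steps=4):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / n_steps
    trajectory = [list(x)]
    for step in range(n_steps):
        t = step * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        trajectory.append(list(x))
    return x, trajectory


def blend_logits(initial, refined, w):
    """Convex combination; w=0 keeps the initial logits, w=1 the refined ones.

    Putting the weight on the refined logits is an assumption.
    """
    return [(1.0 - w) * a + w * b for a, b in zip(initial, refined)]


# Toy setup: a rectified (straight-line) path has constant velocity x1 - x0.
noise = [0.5, -1.0, 2.0]
prototype = [1.0, 0.0, 0.0]


def toy_velocity(x, t):
    """Stand-in for the learned field: constant direction noise -> prototype."""
    return [p - n for p, n in zip(prototype, noise)]


end, traj = euler_integrate(toy_velocity, noise)
print(end)        # endpoint after 4 Euler steps
print(len(traj))  # start point plus one state per step

print(blend_logits([2.0, 0.0], [0.0, 4.0], w=0.25))
```

With a perfectly straight path, a few Euler steps already land on the target, which is the usual motivation for pairing rectified flow with a small step count.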

	### CrossContrastModel

	Two MLP projection heads map frozen voxel features (645-dim) and frozen Qwen text embeddings (1536-dim) into a shared 256-dim latent space. Architecture per head: `Linear → LayerNorm → GELU → Linear → LayerNorm → GELU → Linear`. Trained with symmetric InfoNCE loss and a learned temperature parameter.
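
For readers unfamiliar with symmetric InfoNCE, here is a dependency-free sketch of the loss on already-projected vectors. It assumes that row `i` of the voxel batch matches row `i` of the text batch and that every other row is a negative; how the actual training handles batches containing several samples of the same class is not documented here. The function names are hypothetical, and the default temperature is the initial value from the training table.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_info_nce(voxel_z, text_z, temperature=0.07):
    """Average of voxel->text and text->voxel cross-entropy over a batch."""
    v = [l2_normalize(z) for z in voxel_z]
    t = [l2_normalize(z) for z in text_z]
    n = len(v)
    # Cosine similarities scaled by 1 / temperature.
    logits = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Voxel -> text: softmax over each row, target is the diagonal entry.
    v2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text -> voxel: softmax over each column, target is the diagonal entry.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2v = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (v2t + t2v)


aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = symmetric_info_nce(aligned, aligned)
loss_swapped = symmetric_info_nce(aligned, swapped)
print(loss_aligned < loss_swapped)
```

A lower temperature sharpens the softmax, so matched pairs are rewarded more strongly and near-miss negatives are penalized harder.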

	### Text Embeddings

	Class descriptions are encoded by Qwen 2.5-1.5B-Instruct using mean-pooled last hidden states. Each of the 38 classes has a 2-shot geometric description (e.g., "A flat triangular outline formed by three connected edges lying in the horizontal xy-plane, the simplest polygon").
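
Mean pooling reduces a sequence of token vectors to a single vector by averaging over the token axis. The sketch below shows the operation on plain lists so it runs without downloading the model. Whether padding tokens were masked out in the original extraction is an assumption, and `mean_pool` is a hypothetical helper name.

```python
def mean_pool(hidden_states, attention_mask):
    """Average the token vectors whose mask entry is 1.

    hidden_states: list of token vectors, shape [tokens][hidden]
    attention_mask: list of 0/1 flags, one per token (0 marks padding)
    """
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(hidden_states, attention_mask):
        if keep:
            count += 1
            for d in range(dim):
                total[d] += vec[d]
    return [x / max(count, 1) for x in total]


# Three tokens with hidden size 2; the last token is padding and is ignored.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)
```

Applied to the last hidden states of Qwen 2.5-1.5B-Instruct, the same operation yields one 1536-dim vector per class description.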

	## Training

	### Classifier (Cell 3)

	| Parameter | Value |
	|-----------|-------|
	| Dataset | 500K procedurally generated samples (400K train / 100K val) |
	| Grid size | 5×5×5 binary occupancy |
	| Batch size | 4,096 |
	| Optimizer | AdamW (lr=3e-3, wd=1e-4) |
	| Schedule | Cosine with 5-epoch warmup |
	| Precision | BF16 autocast (no GradScaler) |
	| Compile | torch.compile (default mode) |
	| Augmentation | Voxel dropout (5%), random addition (5%), spatial shift (8%) |
	| Epochs | 80 |
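
A cosine schedule with warmup, as listed in the table, can be written as a small function of the epoch index. This is a sketch under assumptions: the warmup is taken to be linear, the schedule is evaluated once per epoch rather than per step, and the learning rate decays to zero. `lr_at_epoch` is a hypothetical name; the defaults are the classifier values from the table.

```python
import math


def lr_at_epoch(epoch, base_lr=3e-3, warmup_epochs=5, total_epochs=80, min_lr=0.0):
    """Learning rate for a 0-indexed epoch: linear warmup, then cosine decay."""
    if epoch < warmup_epochs:
        # Ramp linearly so the last warmup epoch reaches base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for epoch in (0, 4, 5, 40, 79):
    print(epoch, lr_at_epoch(epoch))
```

The cross-contrast stage would use the same shape with `base_lr=2e-3`, `warmup_epochs=3`, and `total_epochs=40`.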

	The classifier is trained with a composite loss: cross-entropy on initial and refined logits, capacity fill ratio supervision, peak dimension classification, overflow regularization, capacity diversity, volume regression (log1p MSE), Cayley-Menger determinant sign prediction, curvature binary/type classification, flow matching loss, arbiter confidence calibration, and blend weight supervision. 13 weighted terms total.
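
The Cayley-Menger determinant mentioned above is built from pairwise squared distances only. For four points in 3D it equals 288 times the squared tetrahedron volume, so it is positive for a genuine tetrahedron and zero when the points are coplanar. The sketch below shows that computation; how the training code turns the determinant into a sign target is not documented here, and the function names are hypothetical.

```python
import math


def det(m):
    """Determinant by Laplace expansion along the first row (fine for 5x5)."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for col, value in enumerate(m[0]):
        minor = [row[:col] + row[col + 1:] for row in m[1:]]
        total += (-1) ** col * value * det(minor)
    return total


def cayley_menger(points):
    """Determinant of the squared-distance matrix bordered by ones."""
    n = len(points)
    size = n + 1
    m = [[0.0] * size for _ in range(size)]
    for i in range(1, size):
        m[0][i] = m[i][0] = 1.0
    for i in range(n):
        for j in range(n):
            m[i + 1][j + 1] = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
    return det(m)


def tetra_volume(points):
    """Volume of a tetrahedron from 4 points, via 288 * V^2 = CM determinant."""
    return math.sqrt(max(cayley_menger(points), 0.0) / 288.0)


corner_tetra = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
flat_square = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0)]

print(cayley_menger(corner_tetra), tetra_volume(corner_tetra))
print(cayley_menger(flat_square))
```

Because it needs only distances, the determinant is invariant to rotation and translation, which makes it a convenient auxiliary target for separating flat shapes from volumetric ones.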

	### Cross-Contrast (Cell 4)

	| Parameter | Value |
	|-----------|-------|
	| Dataset | Reuses Cell 3 cached dataset |
	| Voxel encoder | Frozen GeometricShapeClassifier |
	| Text encoder | Frozen Qwen 2.5-1.5B-Instruct |
	| Latent dim | 256 |
	| Batch size | 4,096 |
	| Optimizer | AdamW (lr=2e-3, wd=1e-4) |
	| Schedule | Cosine with 3-epoch warmup |
	| Loss | Symmetric InfoNCE |
	| Temperature | Learned (init 0.07) |
	| Epochs | 40 |

	## Quick Start

	```python
	import torch
	from safetensors.torch import load_file

	# Load classifier weights
	weights = load_file("geometric_classifier/model.safetensors")
	# The GeometricShapeClassifier class is not included in this snippet.
	# Instantiate it as `model` first, then:
	#   model.load_state_dict(weights)
	#   model.eval()

	# Load cross-contrast projection heads and the learned temperature
	text_proj_w = load_file("crosscontrast/text_proj.safetensors")
	voxel_proj_w = load_file("crosscontrast/voxel_proj.safetensors")
	temp = load_file("crosscontrast/temperature.safetensors")

	# Load cached text embeddings
	emb = load_file("qwen_embeddings/embeddings.safetensors")
	text_embeddings = emb["embeddings"]  # (38, 1536)

	# Classify a voxel grid (`model` is the classifier instantiated above)
	grid = torch.zeros(1, 5, 5, 5)  # your binary occupancy grid
	grid[0, 2, 2, 2] = 1  # single occupied voxel at the center (a point)
	with torch.no_grad():
	    out = model(grid)
	predicted_class = out["class_logits"].argmax(dim=1)
	```

	## What This Is (and Isn't)

	This is a prototype exploring geometric–linguistic alignment at small scale. The 5×5×5 grid is intentionally minimal — large enough to represent 38 distinct geometric primitives with curvature distinctions, small enough to train in minutes on a single GPU. The interesting questions are about the structure of the shared latent space: whether text-space confusions mirror geometric failure modes, whether the alignment generalizes beyond the training vocabulary, and what happens at scale.

	This is not a production classifier. The procedural dataset is synthetic, the grid resolution is toy-scale, and the cross-contrast vocabulary is fixed at 38 classes.

	## License

	MIT