Reading the Geometry of Learned Representations: How Synthetic Primitives Became a Rosetta Stone for VAE Latent Spaces
Author: AbstractPhil
Date: February 2026
Repository: AbstractPhil/grid-geometric-classifier-sliding-proto
Abstract
We present a tool for analyzing the intrinsic geometry of VAE latent spaces using a classifier trained entirely on synthetic geometric primitives. This work emerged from a research program exploring whether geometric structure could replace learned parameters in neural networks.
The journey began with a simple question: Can we classify geometric shapes in binary voxel grids? It led to a discovery: VAE latent spaces contain consistent, classifiable geometric structure.
Applied to four major VAE architectures (SD 1.5, SDXL, Flux.1, Flux.2), we find:
- Three of four VAEs produce saddle-dominated geometry (50-70% saddle primitives)
- Flux.1 is structurally distinct — more diverse geometry with pentachorons (31%), planes (29%), and saddles (15%) more evenly distributed
- Classification confidence of 0.87-0.88 across all models, indicating consistent structure rather than noise
We do not claim to explain why this geometry emerges — only that it exists and can be measured. The synthetic primitives serve as a vocabulary for describing VAE latent structure.
Part I: The Experimental Journey
Chapter 1: The 5×5 Grid — Where It Started
The research began with a minimal setup: a 5×5 2D grid with 16 shape classes. The goal was to build a classifier that could distinguish geometric primitives using a capacity cascade — a novel architecture where evidence fills dimensional "buckets" (0D points → 1D lines → 2D faces → 3D volumes) and overflows when saturated.
First results (16 classes, 2D grid):
- Overall accuracy: 85.6%
- Pentachoron: 100%
- Pyramid: 100%
- Circle: 70.9%
- Ellipse: 60.3%
The circle/ellipse failure was diagnostic. At 5×5 resolution, a circle and an octagon are indistinguishable. The capacity cascade correctly identified dimensionality (fill=[1.00 1.00 0.96 0.03] for circle — edges and faces, no volume), but couldn't separate aspect ratios.
Key insight: The architecture was sound. The resolution was the limit.
Chapter 2: The 5×5×5 Voxel Grid — True 3D
Moving to 5×5×5 = 125 voxels unlocked true volumetric classification. The capacity cascade now had depth to work with.
Architecture: Tracer Attention + Capacity Cascade
Input: (5, 5, 5) binary voxels
↓
5 Learned Tracer Tokens attend over 125 voxel embeddings
↓
10 Tracer Pairs compute interaction features
↓
Capacity Cascade: dim0 → dim1 → dim2 → dim3
(evidence fills, saturates, overflows)
↓
Curvature Head: flat vs curved + curvature type
↓
DifferentiationGate: convex vs concave
↓
Classification
Results after fixing overflow propagation:
- Validation accuracy: 94.5%
- Learned capacities: dim0=0.030, dim1=0.029, dim2=9.475, dim3=8.967
The capacities collapsed to extremes — point and edge detection became binary switches, while face and volume detection needed nuance. The model discovered the natural structure of geometric dimensionality.
Fill profiles became classification signatures:
| Shape | Fill [d0 d1 d2 d3] | Interpretation |
|---|---|---|
| point | [1.00 0.25 0.00 0.00] | "point exists, maybe edge, nothing higher" |
| line | [1.00 1.00 0.00 0.00] | "points + edges, no faces" |
| triangle | [1.00 1.00 0.95 0.00] | "all 2D structure, no volume" |
| tetrahedron | [1.00 1.00 1.00 0.96] | "all four dimensions active" |
| pentachoron | [1.00 1.00 1.00 1.00] | "fully saturated everywhere" |
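The fill-and-overflow behavior behind these profiles can be sketched in a few lines. This is an illustrative reconstruction, not the repository's implementation; the function name, tensor shapes, and the linear overflow rule are assumptions:

```python
import torch

def capacity_cascade(evidence, capacities):
    """Sketch of the cascade idea: evidence fills bucket d up to its learned
    capacity; anything beyond saturation overflows into bucket d+1.
    evidence:   (B, 4) raw evidence for dims 0..3
    capacities: (4,)   learned bucket sizes
    Returns fill levels in [0, 1] per dimension."""
    fills = []
    carry = torch.zeros(evidence.shape[0])
    for d in range(4):
        total = evidence[:, d] + carry
        fills.append(torch.clamp(total / capacities[d], max=1.0))
        carry = torch.relu(total - capacities[d])  # saturated evidence overflows upward
    return torch.stack(fills, dim=1)
```

With unit capacities, a large burst of 0D evidence saturates every bucket in turn, reproducing the "fully saturated everywhere" pentachoron-style profile, while sub-capacity evidence stays in its own bucket.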
Chapter 3: Scaling to 25×25×25 — The v11 Architecture
To achieve near-perfect classification, we scaled to 25×25×25 = 15,625 voxels with hierarchical block decomposition:
25×25×25 grid
↓
Decompose into 5×5×5 macro grid of 5×5×5 local blocks
↓
BlockEncoder (shared) processes each local block
↓
Tracer Attention across macro grid
↓
Capacity Cascade + Curvature Heads
↓
38-class classification
v11 Results:
- Validation accuracy: 97.01%
- Parameters: 7.5M
- 38 shape classes with full curvature taxonomy
The classifier was now good enough to be a tool rather than an experiment.
Chapter 4: The Hypothesis — What If We Applied This to VAE Latents?
The question emerged: If we can classify geometric primitives in synthetic voxel grids, can we classify the geometry that VAEs learn?
A VAE latent from Flux.2 is shaped (C, H, W), for example (32, 64, 64): 32 channels over a 64×64 spatial grid (the spatial size scales with the input image). That's a natural 3D volume if we treat channels as depth.
The pipeline idea:
- Encode images → VAE latent (C, H, W)
- Treat channels as depth dimension
- Extract sliding window patches
- Binarize (top 10% of values → 1, rest → 0)
- Resize to canonical grid size
- Classify with our geometric classifier
- Build a geometric fingerprint of the VAE
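The steps above can be sketched end to end. This is a hedged approximation of the idea, not the released cell4_vae_pipeline.py: the function name, the stride, and the choice of windows already at the canonical 8×16×16 size (so the resize step is omitted) are all assumptions, and it assumes the latent has at least 8 channels:

```python
import torch

def extract_patches(latent, patch=(8, 16, 16), stride=(8, 8, 8), top_frac=0.10):
    """Sketch: slide 3D windows over a (C, H, W) latent, treating channels
    as depth, and binarize each window at its own top-10% threshold."""
    vol = latent.unsqueeze(0).unsqueeze(0)  # (1, 1, C, H, W)
    # sliding 3D windows over (depth=channels, height, width)
    windows = vol.unfold(2, patch[0], stride[0]) \
                 .unfold(3, patch[1], stride[1]) \
                 .unfold(4, patch[2], stride[2])
    windows = windows.reshape(-1, *patch)   # (N, 8, 16, 16)
    flat = windows.reshape(windows.shape[0], -1)
    n = flat.shape[1]
    k = max(1, int(n * top_frac))
    # per-window threshold via selection (kthvalue) rather than a full sort
    thresh = flat.kthvalue(n - k + 1, dim=1).values
    return (flat > thresh.unsqueeze(1)).reshape(-1, *patch).float()
```

The binarized windows would then go straight into the classifier as a batch; one classify call per image (or per batch of images) rather than per patch.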
The key insight: Channel deviances aren't random — they encode the VAE's learned relational structure. A sphere-shaped deviance means two channels disagree uniformly in all directions. A cylindrical deviance means they disagree along an axis. A saddle means they have competing curvature.
Chapter 5: The Non-Cubic Problem — 8×16×16
The 25³ classifier was too slow for extraction (thousands of patches per image). More critically, VAE latents aren't cubic — Flux has 16 or 32 channels but 128×128 spatial resolution. We needed an aspect-ratio-matched classifier.
New canonical shape: 8×16×16 (2,048 voxels)
This preserves the 1:2:2 aspect ratio of VAE latents and is 7.6× smaller than 25³.
First attempt: 3D CNN (~2.5M params)
Standard ResBlock3D architecture. It worked but was slow — the convolutions over non-cubic grids were inefficient.
Second attempt: Patch Cross-Attention (638K params)
Input: (8, 16, 16) binary voxels
↓
Decompose into 64 patches of size 2×4×4
↓
Shared PatchEncoder (MLP + handcrafted features)
↓
3× Cross-Attention Blocks (patches attend to each other)
↓
Global Pool + Classification Heads
Results:
- Validation accuracy: 98.10%
- Parameters: 638,387 (12× smaller than v11)
- Inference: ~50K samples/sec on H100
The cross-attention approach beat the hierarchical tracer architecture at 1/12th the parameters. Patches reasoning about each other was more efficient than tracers reasoning about voxels.
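A minimal sketch of the patch-attention idea using stock PyTorch layers; this is not the released cell2_model.py (it omits the handcrafted patch features and the curvature heads, and the class name, embedding width, and head counts are assumptions):

```python
import torch
import torch.nn as nn

class PatchCrossAttentionSketch(nn.Module):
    """Sketch: (8,16,16) voxels -> 64 patches of 2x4x4 -> attention -> logits."""
    def __init__(self, dim=128, n_classes=38):
        super().__init__()
        self.patch_embed = nn.Linear(2 * 4 * 4, dim)  # 32 voxels per patch
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(dim, nhead=4, dim_feedforward=2 * dim,
                                       batch_first=True)
            for _ in range(3)
        ])
        self.head = nn.Linear(dim, n_classes)

    def forward(self, vox):  # vox: (B, 8, 16, 16) binary
        B = vox.shape[0]
        # carve into a 4x4x4 macro grid of 2x4x4 patches -> 64 tokens
        p = vox.reshape(B, 4, 2, 4, 4, 4, 4)
        p = p.permute(0, 1, 3, 5, 2, 4, 6).reshape(B, 64, 32)
        x = self.patch_embed(p.float())
        for blk in self.blocks:
            x = blk(x)                   # patches attend to each other
        return self.head(x.mean(dim=1))  # global pool -> class logits
```

The structural point survives the simplification: the tokens are patches, not voxels, so attention is over 64 elements instead of 2,048.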
Chapter 6: Speed Optimization — From Minutes to Seconds
Initial extraction: 138 seconds per image. Unacceptable.
Bottlenecks identified:
- `torch.quantile` is O(n log n) — use `torch.kthvalue` instead (average O(n))
- Sequential per-image processing — batch multiple images
- CPU round-trips for channel clustering — GPU-only implementation
- `torch.cuda.empty_cache()` in hot loop — forces sync, remove it
Optimizations applied:
# Before: 138s/img — quantile sorts the whole patch
threshold = torch.quantile(patch, 0.9)
# After: <1s/img — kthvalue selects without a full sort
volume = patch.numel()
k = int(volume * 0.10)
threshold = patch.flatten().kthvalue(volume - k + 1).values
Batched extraction:
- Process 32-64 images simultaneously
- Single mega-batch classify call across all images × all scales
- GPU-only channel clustering (no numpy)
Final speed: 0.33 seconds per image (418× faster)
Chapter 7: The Discovery — Four VAEs Compared
We analyzed 2,074 images through four VAE encoders:
| VAE | Latent Shape | Annotations | Confidence |
|---|---|---|---|
| SD 1.5 | 4×64×64 | 229,618 | 0.880 |
| SDXL | 4×128×128 | 1,092,657 | 0.874 |
| Flux.1 | 16×128×128 | 2,943,743 | 0.878 |
| Flux.2 | 32×128×128 | 4,365,328 | 0.875 |
Class distributions:
SD 1.5: saddle 57% | pentachoron 35% | triangular_prism 3%
SDXL: saddle 53% | pentachoron 30% | triangular_prism 6%
Flux.1: pentachoron 31% | plane 29% | saddle 15% | square_xy 15%
Flux.2: saddle 70% | pentachoron 21% | tetrahedron 4%
The pattern:
- SD 1.5, SDXL, and Flux.2 converge to saddle-dominated hyperbolic manifolds
- Flux.1 breaks the pattern — diverse geometry, no single class above 31%
Cross-VAE similarity (cosine between class distributions):
| | SD 1.5 | SDXL | Flux.1 | Flux.2 |
|---|---|---|---|---|
| SD 1.5 | 1.000 | 0.996 | 0.615 | 0.966 |
| SDXL | 0.996 | 1.000 | 0.641 | 0.970 |
| Flux.1 | 0.615 | 0.641 | 1.000 | 0.499 |
| Flux.2 | 0.966 | 0.970 | 0.499 | 1.000 |
Flux.1 is geometrically distinct. The others form one family.
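The similarity measure is plain cosine between class-frequency vectors. A toy sketch, with distributions abbreviated to four classes from the shares reported above (the residual "other" mass is illustrative, so the resulting value only approximates the full-vocabulary numbers in the table):

```python
import torch

def dist_cosine(p, q):
    """Cosine similarity between two class-frequency distributions."""
    p = torch.tensor(p, dtype=torch.float)
    q = torch.tensor(q, dtype=torch.float)
    return torch.nn.functional.cosine_similarity(p, q, dim=0).item()

# (saddle, pentachoron, plane, other) -- abbreviated, illustrative shares
sd15  = [0.57, 0.35, 0.00, 0.08]
flux1 = [0.15, 0.31, 0.29, 0.25]
```

Even on these truncated vectors, SD 1.5 vs Flux.1 lands well below the within-family similarities, reproducing the outlier pattern.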
Part II: Interpretation
What We Measured
Important: This analysis examines VAE encoder outputs only — the compressed latent representation of images. We did not analyze diffusion trajectories, transformer attention, or the denoising process. The geometry we found is the geometry of learned compression, not generation.
Why Saddles? (Hypotheses, Not Conclusions)
Saddle geometry dominates three of four VAEs. We don't know why from this data alone, but plausible explanations include:
Hypothesis 1: Compression Geometry Autoencoders must preserve enough information to reconstruct while compressing aggressively. Saddle geometry (positive curvature in some directions, negative in others) may be optimal for separating modes in a compressed space — similar images cluster along stable directions while dissimilar images separate along unstable directions.
Hypothesis 2: Training Dynamics VAEs are trained with reconstruction loss + KL divergence. The KL term encourages the latent space toward a unit Gaussian prior. Saddles may emerge at the interface between the prior's spherical geometry and the data's natural structure.
Hypothesis 3: Architectural Bias Convolutional encoders with strided downsampling produce features with specific spatial correlation patterns. These patterns might naturally organize into hyperbolic local structure regardless of the data.
What we can say definitively:
- The geometry is consistent (0.87+ classification confidence)
- It's not noise (same patterns across thousands of images)
- Different VAEs learn measurably different geometries
- Flux.1 is structurally distinct from SD 1.5, SDXL, and Flux.2
What we cannot say:
- Why saddles emerge
- Whether this geometry affects generation quality
- How the diffusion transformer interacts with this structure
- Whether the geometry is optimal or incidental
Pentachorons at Macro Scale
Across all four VAEs, pentachorons dominate at the largest extraction scale (L0). A pentachoron is a 4-simplex — 5 vertices, 10 edges, 10 triangular faces, 5 tetrahedral cells.
This is an observation, not an explanation. Possible interpretations:
- The 5-way structure may reflect how the encoder organizes information at global scale
- It could be an artifact of the 16-channel depth interacting with 64×64 spatial resolution
- It might indicate natural clustering in the learned representation
We note that pentachorons appear consistently across architecturally different VAEs, suggesting this may be a property of visual compression rather than a specific architectural choice.
Why Flux.1 Differs
Flux.1's latent space has measurably different geometry:
- 39% 2D content (planes, squares) vs <3% for others
- More even distribution across primitive classes
- Lower cross-similarity to other VAEs (0.50-0.64 vs 0.97+ within the SD/SDXL/Flux.2 family)
Observations:
- Flux.2 has unused batch normalization weights (`bn.running_var`, `bn.running_mean`) that Flux.1 lacks
- Flux.1 uses 16 channels; Flux.2 uses 32
- The channel groupings differ (Flux.1: pairs, Flux.2: quads)
We cannot conclude that batch norm "collapsed" the geometry or that more channels caused the difference. We only know the geometries differ. Understanding why would require access to training details, ablation studies, or direct communication with the Flux team.
The per-scale breakdown:
| Scale | SD 1.5 | Flux.1 | Flux.2 |
|---|---|---|---|
| L0 (macro) | pentachoron | pentachoron | pentachoron |
| L2 (mid) | saddle | plane 40% | saddle |
| L3 (local) | saddle 59% | saddle | saddle 70% |
Flux.1's mid-levels show planar structure where others show hyperbolic. This is the measurable difference.
Part III: Implications
For VAE Design
If saddle geometry emerges naturally and pentachorons organize macro structure, we can design VAEs with these priors built in:
- Initialize encoders to produce pentachoron-structured outputs
- Regularize toward specific curvature profiles during training
- Use the geometric classifier as a training loss component
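As a sketch of the third idea: using the classifier as a loss requires a differentiable stand-in for the hard top-10% binarization. A sigmoid soft-threshold is one option; nothing in the repository implements this, and the function name, temperature, and loss form are all assumptions:

```python
import torch

def geometric_prior_loss(latent_patches, classifier, target_class, tau=0.1):
    """Hypothetical sketch: push latent patches toward a target primitive
    class. A sigmoid around the per-patch 90th percentile approximates
    top-10% binarization while keeping gradients flowing."""
    flat = latent_patches.reshape(latent_patches.shape[0], -1)
    thresh = flat.quantile(0.9, dim=1, keepdim=True)
    soft = torch.sigmoid((flat - thresh) / tau)   # differentiable "binarize"
    logits = classifier(soft.reshape_as(latent_patches))
    targets = torch.full((logits.shape[0],), target_class, dtype=torch.long)
    return torch.nn.functional.cross_entropy(logits, targets)
```

Added to the usual reconstruction + KL objective with a small weight, this would let an experiment test whether steering the latent geometry helps or hurts reconstruction.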
For Latent Space Manipulation
Knowing the geometry enables targeted intervention:
- Navigate along saddle directions to move between modes
- Stay within planar cross-sections for smooth interpolation
- Identify pentachoron vertices as semantic anchors
For Understanding Learned Representations
This method generalizes beyond VAEs. Any learned representation that can be spatially organized and binarized can be geometrically fingerprinted:
- Transformer attention patterns
- CNN feature maps
- Language model embeddings (via reshaping)
Part IV: Connection to Broader Research
This work is part of a research program on geometric deep learning with pentachoron structures — the hypothesis that geometric structure can replace learned parameters.
Core Results
| System | Parameters | Performance | Notes |
|---|---|---|---|
| MNIST classifier | ~750 | 85% accuracy | Geometry encodes structure |
| ImageNet head | 72KB | Competitive | Crystal vocabulary |
| CIFAR-100 head | 78KB | 92% accuracy | Geometric vocabulary |
| This classifier | 638K | 98.1% accuracy, 0.88 conf on VAEs | Reads learned geometry |
The Observation
The 38 primitives we defined from mathematical first principles — pentachorons, saddles, planes, tetrahedra — appear consistently in VAE latent spaces trained on natural images.
This could mean:
- These geometries are attractors for learned compression
- Our primitive vocabulary happens to span common structures in high-dimensional data
- The binarization and extraction process biases toward certain shapes
We observe correlation, not causation. The VAEs weren't trained to produce pentachorons — but pentachorons appear. Whether this reflects deep structure or methodological artifact requires further investigation.
Reproducibility
All code and data available at:
https://huggingface.co/AbstractPhil/grid-geometric-classifier-sliding-proto
Files
| File | Purpose |
|---|---|
| `cell1_shape_generator.py` | 38-class synthetic shape generation |
| `cell2_model.py` | PatchCrossAttentionClassifier (638K params) |
| `cell3_trainer.py` | Training pipeline |
| `cell4_vae_pipeline.py` | Multi-scale batched extraction |
| `cell5_quad_vae_geometric_analysis.py` | Single VAE analysis |
| `cell6_quad_vae_analysis_mega_liminal.py` | Multi-VAE comparison |
| `best_vae_ca_classifier.pt` | Trained weights |
| `liminal.zip` | Test images (957) |
| `mega_liminal_captioned.zip` | Extended test images (2,074) |
| `multi_vae_comparison_*.json` | Raw results |
Running the Full Pipeline
# In Google Colab with GPU
# Generate shapes and train classifier
%run cell1_shape_generator.py
%run cell2_model.py
%run cell3_trainer.py # → 98.1% accuracy
# Define extraction pipeline
%run cell4_vae_pipeline.py
# Analyze single VAE
%run cell5_quad_vae_geometric_analysis.py
# Compare multiple VAEs
%run cell6_quad_vae_analysis_mega_liminal.py
Timeline of Key Experiments
| Date | Experiment | Result |
|---|---|---|
| 2025 | 5×5 2D grid, 16 classes | 85.6% accuracy, circle/ellipse failure |
| | 5×5×5 voxel grid, capacity cascade | 94.5% accuracy, fill profiles work |
| | 25×25×25 v11, tracer architecture | 97.01% accuracy, 7.5M params |
| Feb 2026 | Hypothesis: apply to VAE latents | — |
| | 8×16×16 non-cubic conversion | Aspect-ratio matched |
| | 3D CNN attempt | Works but slow |
| | PatchCrossAttention | 98.1% accuracy, 638K params |
| | Speed optimization | 138s/img → 0.33s/img |
| | Quad-VAE comparison | Flux.1 outlier discovered |
Future Directions
- Geometric VAE Training: Use the classifier as a regularizer during VAE training
- Cross-VAE Translation: Map representations between VAEs using geometric alignment
- Semantic-Geometry Correspondence: Which geometric classes correspond to which visual features?
- Temporal Analysis: Track geometric structure through diffusion timesteps
- Geometric Conditioning: Steer generation toward specific geometric classes
Acknowledgments
This research emerged from a year of exploration into pentachoron mathematics, crystalline vocabulary systems, and the hypothesis that universal geometric structures underlie efficient neural computation.
The capacity cascade architecture, the tracer attention mechanism, and the geometric primitive taxonomy were developed through extensive iteration, failure analysis, and refinement. Every "failure" — the circle/ellipse confusion at 5×5, the overflow scalar that killed dim1, the slow extraction pipeline — taught us something that made the final system stronger.
Citation
@misc{abstractphil2026geometric,
author = {AbstractPhil},
title = {Reading the Geometry of Learned Representations:
How Synthetic Primitives Became a Rosetta Stone for VAE Latent Spaces},
year = {2026},
publisher = {HuggingFace},
url = {https://huggingface.co/AbstractPhil/grid-geometric-classifier-sliding-proto}
}
"We cannot fight the universe, only exist within it."
The classifier finds consistent geometry in VAE latent spaces. Whether those shapes are fundamental or incidental remains an open question. But they're measurably, repeatably there.