Reading the Geometry of Learned Representations: How Synthetic Primitives Became a Rosetta Stone for VAE Latent Spaces
Author: AbstractPhil
Date: February 2026
Repository: AbstractPhil/grid-geometric-classifier-sliding-proto
Abstract
We present a tool for analyzing the intrinsic geometry of VAE latent spaces using a classifier trained entirely on synthetic geometric primitives. This work emerged from a research program exploring whether geometric structure could replace learned parameters in neural networks.
The journey began with a simple question: Can we classify geometric shapes in binary voxel grids? It led to a discovery: VAE latent spaces contain consistent, classifiable geometric structure.
Applied to four major VAE architectures (SD 1.5, SDXL, Flux.1, Flux.2), we find:
- Three of four VAEs produce saddle-dominated geometry (50-70% saddle primitives)
- Flux.1 is structurally distinct — more diverse geometry with pentachorons (31%), planes (29%), and saddles (15%) more evenly distributed
- Classification confidence of 0.87-0.88 across all models, indicating consistent structure rather than noise
We do not claim to explain why this geometry emerges — only that it exists and can be measured. The synthetic primitives serve as a vocabulary for describing VAE latent structure.
Part I: The Experimental Journey
Chapter 1: The 5×5 Grid — Where It Started
The research began with a minimal setup: a 5×5 2D grid with 16 shape classes. The goal was to build a classifier that could distinguish geometric primitives using a capacity cascade — a novel architecture where evidence fills dimensional "buckets" (0D points → 1D lines → 2D faces → 3D volumes) and overflows when saturated.
First results (16 classes, 2D grid):
- Overall accuracy: 85.6%
- Pentachoron: 100%
- Pyramid: 100%
- Circle: 70.9%
- Ellipse: 60.3%
The circle/ellipse failure was diagnostic. At 5×5 resolution, a circle and an octagon are indistinguishable. The capacity cascade correctly identified dimensionality (fill=[1.00 1.00 0.96 0.03] for circle — edges and faces, no volume), but couldn't separate aspect ratios.
Key insight: The architecture was sound. The resolution was the limit.
Chapter 2: The 5×5×5 Voxel Grid — True 3D
Moving to 5×5×5 = 125 voxels unlocked true volumetric classification. The capacity cascade now had depth to work with.
Architecture: Tracer Attention + Capacity Cascade
Input: (5, 5, 5) binary voxels
↓
5 Learned Tracer Tokens attend over 125 voxel embeddings
↓
10 Tracer Pairs compute interaction features
↓
Capacity Cascade: dim0 → dim1 → dim2 → dim3
(evidence fills, saturates, overflows)
↓
Curvature Head: flat vs curved + curvature type
↓
DifferentiationGate: convex vs concave
↓
Classification
Results after fixing overflow propagation:
- Validation accuracy: 94.5%
- Learned capacities: dim0=0.030, dim1=0.029, dim2=9.475, dim3=8.967
The capacities collapsed to extremes — point and edge detection became binary switches, while face and volume detection needed nuance. The model discovered the natural structure of geometric dimensionality.
Fill profiles became classification signatures:
| Shape | Fill [d0 d1 d2 d3] | Interpretation |
|---|---|---|
| point | [1.00 0.25 0.00 0.00] | "point exists, maybe edge, nothing higher" |
| line | [1.00 1.00 0.00 0.00] | "points + edges, no faces" |
| triangle | [1.00 1.00 0.95 0.00] | "all 2D structure, no volume" |
| tetrahedron | [1.00 1.00 1.00 0.96] | "all four dimensions active" |
| pentachoron | [1.00 1.00 1.00 1.00] | "fully saturated everywhere" |
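The fill-and-overflow behavior behind these profiles can be sketched in a few lines. This is an illustrative reconstruction, not the repository's implementation; the function name, tensor shapes, and the linear overflow rule are assumptions:

```python
import torch

def capacity_cascade(evidence, capacities):
    """Sketch of the cascade idea: evidence fills bucket d up to its learned
    capacity; anything beyond saturation overflows into bucket d+1.
    evidence:   (B, 4) raw evidence for dims 0..3
    capacities: (4,)   learned bucket sizes
    Returns fill levels in [0, 1] per dimension."""
    fills = []
    carry = torch.zeros(evidence.shape[0])
    for d in range(4):
        total = evidence[:, d] + carry
        fills.append(torch.clamp(total / capacities[d], max=1.0))
        carry = torch.relu(total - capacities[d])  # saturated evidence overflows upward
    return torch.stack(fills, dim=1)
```

With unit capacities, a large burst of 0D evidence saturates every bucket in turn, reproducing the "fully saturated everywhere" pentachoron-style profile, while sub-capacity evidence stays in its own bucket.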
Chapter 3: Scaling to 25×25×25 — The v11 Architecture
To achieve near-perfect classification, we scaled to 25×25×25 = 15,625 voxels with hierarchical block decomposition:
25×25×25 grid
↓
Decompose into 5×5×5 macro grid of 5×5×5 local blocks
↓
BlockEncoder (shared) processes each local block
↓
Tracer Attention across macro grid
↓
Capacity Cascade + Curvature Heads
↓
38-class classification
v11 Results:
- Validation accuracy: 97.01%
- Parameters: 7.5M
- 38 shape classes with full curvature taxonomy
The classifier was now good enough to be a tool rather than an experiment.
Chapter 4: The Hypothesis — What If We Applied This to VAE Latents?
The question emerged: If we can classify geometric primitives in synthetic voxel grids, can we classify the geometry that VAEs learn?
A VAE latent from Flux.2 is shaped (C, H, W), for example (32, 64, 64): 32 channels over a 64×64 spatial grid (the spatial size scales with the input image). That's a natural 3D volume if we treat channels as depth.
The pipeline idea:
- Encode images → VAE latent (C, H, W)
- Treat channels as depth dimension
- Extract sliding window patches
- Binarize (top 10% of values → 1, rest → 0)
- Resize to canonical grid size
- Classify with our geometric classifier
- Build a geometric fingerprint of the VAE
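The steps above can be sketched end to end. This is a hedged approximation of the idea, not the released cell4_vae_pipeline.py: the function name, the stride, and the choice of windows already at the canonical 8×16×16 size (so the resize step is omitted) are all assumptions, and it assumes the latent has at least 8 channels:

```python
import torch

def extract_patches(latent, patch=(8, 16, 16), stride=(8, 8, 8), top_frac=0.10):
    """Sketch: slide 3D windows over a (C, H, W) latent, treating channels
    as depth, and binarize each window at its own top-10% threshold."""
    vol = latent.unsqueeze(0).unsqueeze(0)  # (1, 1, C, H, W)
    # sliding 3D windows over (depth=channels, height, width)
    windows = vol.unfold(2, patch[0], stride[0]) \
                 .unfold(3, patch[1], stride[1]) \
                 .unfold(4, patch[2], stride[2])
    windows = windows.reshape(-1, *patch)   # (N, 8, 16, 16)
    flat = windows.reshape(windows.shape[0], -1)
    n = flat.shape[1]
    k = max(1, int(n * top_frac))
    # per-window threshold via selection (kthvalue) rather than a full sort
    thresh = flat.kthvalue(n - k + 1, dim=1).values
    return (flat > thresh.unsqueeze(1)).reshape(-1, *patch).float()
```

The binarized windows would then go straight into the classifier as a batch; one classify call per image (or per batch of images) rather than per patch.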
The key insight: Channel deviances aren't random — they encode the VAE's learned relational structure. A sphere-shaped deviance means two channels disagree uniformly in all directions. A cylindrical deviance means they disagree along an axis. A saddle means they have competing curvature.
Chapter 5: The Non-Cubic Problem — 8×16×16
The 25³ classifier was too slow for extraction (thousands of patches per image). More critically, VAE latents aren't cubic — Flux has 16 or 32 channels but 128×128 spatial resolution. We needed an aspect-ratio-matched classifier.
New canonical shape: 8×16×16 (2,048 voxels)
This preserves the 1:2:2 aspect ratio of VAE latents and is 7.6× smaller than 25³.
First attempt: 3D CNN (~2.5M params)
Standard ResBlock3D architecture. It worked but was slow — the convolutions over non-cubic grids were inefficient.
Second attempt: Patch Cross-Attention (638K params)
Input: (8, 16, 16) binary voxels
↓
Decompose into 64 patches of size 2×4×4
↓
Shared PatchEncoder (MLP + handcrafted features)
↓
3× Cross-Attention Blocks (patches attend to each other)
↓
Global Pool + Classification Heads
Results:
- Validation accuracy: 98.10%
- Parameters: 638,387 (12× smaller than v11)
- Inference: ~50K samples/sec on H100
The cross-attention approach beat the hierarchical tracer architecture at 1/12th the parameters. Patches reasoning about each other was more efficient than tracers reasoning about voxels.
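A minimal sketch of the patch-attention idea using stock PyTorch layers; this is not the released cell2_model.py (it omits the handcrafted patch features and the curvature heads, and the class name, embedding width, and head counts are assumptions):

```python
import torch
import torch.nn as nn

class PatchCrossAttentionSketch(nn.Module):
    """Sketch: (8,16,16) voxels -> 64 patches of 2x4x4 -> attention -> logits."""
    def __init__(self, dim=128, n_classes=38):
        super().__init__()
        self.patch_embed = nn.Linear(2 * 4 * 4, dim)  # 32 voxels per patch
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(dim, nhead=4, dim_feedforward=2 * dim,
                                       batch_first=True)
            for _ in range(3)
        ])
        self.head = nn.Linear(dim, n_classes)

    def forward(self, vox):  # vox: (B, 8, 16, 16) binary
        B = vox.shape[0]
        # carve into a 4x4x4 macro grid of 2x4x4 patches -> 64 tokens
        p = vox.reshape(B, 4, 2, 4, 4, 4, 4)
        p = p.permute(0, 1, 3, 5, 2, 4, 6).reshape(B, 64, 32)
        x = self.patch_embed(p.float())
        for blk in self.blocks:
            x = blk(x)                   # patches attend to each other
        return self.head(x.mean(dim=1))  # global pool -> class logits
```

The structural point survives the simplification: the tokens are patches, not voxels, so attention is over 64 elements instead of 2,048.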
Chapter 6: Speed Optimization — From Minutes to Seconds
Initial extraction: 138 seconds per image. Unacceptable.
Bottlenecks identified:
- `torch.quantile` is O(n log n) — use `torch.kthvalue` instead (average O(n))
- Sequential per-image processing — batch multiple images
- CPU round-trips for channel clustering — GPU-only implementation
- `torch.cuda.empty_cache()` in hot loop — forces sync, remove it
Optimizations applied:
# Before: 138s/img — quantile sorts the whole patch
threshold = torch.quantile(patch, 0.9)
# After: <1s/img — kthvalue selects without a full sort
volume = patch.numel()
k = int(volume * 0.10)
threshold = patch.flatten().kthvalue(volume - k + 1).values
Batched extraction:
- Process 32-64 images simultaneously
- Single mega-batch classify call across all images × all scales
- GPU-only channel clustering (no numpy)
Final speed: 0.33 seconds per image (418× faster)
Chapter 7: The Discovery — Four VAEs Compared
We analyzed 2,074 images through four VAE encoders:
| VAE | Latent Shape | Annotations | Confidence |
|---|---|---|---|
| SD 1.5 | 4×64×64 | 229,618 | 0.880 |
| SDXL | 4×128×128 | 1,092,657 | 0.874 |
| Flux.1 | 16×128×128 | 2,943,743 | 0.878 |
| Flux.2 | 32×128×128 | 4,365,328 | 0.875 |
Class distributions:
SD 1.5: saddle 57% | pentachoron 35% | triangular_prism 3%
SDXL: saddle 53% | pentachoron 30% | triangular_prism 6%
Flux.1: pentachoron 31% | plane 29% | saddle 15% | square_xy 15%
Flux.2: saddle 70% | pentachoron 21% | tetrahedron 4%
The pattern:
- SD 1.5, SDXL, and Flux.2 converge to saddle-dominated hyperbolic manifolds
- Flux.1 breaks the pattern — diverse geometry, no single class above 31%
Cross-VAE similarity (cosine between class distributions):
| | SD 1.5 | SDXL | Flux.1 | Flux.2 |
|---|---|---|---|---|
| SD 1.5 | 1.000 | 0.996 | 0.615 | 0.966 |
| SDXL | 0.996 | 1.000 | 0.641 | 0.970 |
| Flux.1 | 0.615 | 0.641 | 1.000 | 0.499 |
| Flux.2 | 0.966 | 0.970 | 0.499 | 1.000 |
Flux.1 is geometrically distinct. The others form one family.
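The similarity measure is plain cosine between class-frequency vectors. A toy sketch, with distributions abbreviated to four classes from the shares reported above (the residual "other" mass is illustrative, so the resulting value only approximates the full-vocabulary numbers in the table):

```python
import torch

def dist_cosine(p, q):
    """Cosine similarity between two class-frequency distributions."""
    p = torch.tensor(p, dtype=torch.float)
    q = torch.tensor(q, dtype=torch.float)
    return torch.nn.functional.cosine_similarity(p, q, dim=0).item()

# (saddle, pentachoron, plane, other) -- abbreviated, illustrative shares
sd15  = [0.57, 0.35, 0.00, 0.08]
flux1 = [0.15, 0.31, 0.29, 0.25]
```

Even on these truncated vectors, SD 1.5 vs Flux.1 lands well below the within-family similarities, reproducing the outlier pattern.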
Part II: Interpretation
What We Measured
Important: This analysis examines VAE encoder outputs only — the compressed latent representation of images. We did not analyze diffusion trajectories, transformer attention, or the denoising process. The geometry we found is the geometry of learned compression, not generation.
Why Saddles? (Hypotheses, Not Conclusions)
Saddle geometry dominates three of four VAEs. We don't know why from this data alone, but plausible explanations include:
Hypothesis 1: Compression Geometry Autoencoders must preserve enough information to reconstruct while compressing aggressively. Saddle geometry (positive curvature in some directions, negative in others) may be optimal for separating modes in a compressed space — similar images cluster along stable directions while dissimilar images separate along unstable directions.
Hypothesis 2: Training Dynamics VAEs are trained with reconstruction loss + KL divergence. The KL term encourages the latent space toward a unit Gaussian prior. Saddles may emerge at the interface between the prior's spherical geometry and the data's natural structure.
Hypothesis 3: Architectural Bias Convolutional encoders with strided downsampling produce features with specific spatial correlation patterns. These patterns might naturally organize into hyperbolic local structure regardless of the data.
What we can say definitively:
- The geometry is consistent (0.87+ classification confidence)
- It's not noise (same patterns across thousands of images)
- Different VAEs learn measurably different geometries
- Flux.1 is structurally distinct from SD 1.5, SDXL, and Flux.2
What we cannot say:
- Why saddles emerge
- Whether this geometry affects generation quality
- How the diffusion transformer interacts with this structure
- Whether the geometry is optimal or incidental
Pentachorons at Macro Scale
Across all four VAEs, pentachorons dominate at the largest extraction scale (L0). A pentachoron is a 4-simplex — 5 vertices, 10 edges, 10 triangular faces, 5 tetrahedral cells.
This is an observation, not an explanation. Possible interpretations:
- The 5-way structure may reflect how the encoder organizes information at global scale
- It could be an artifact of the 16-channel depth interacting with 64×64 spatial resolution
- It might indicate natural clustering in the learned representation
We note that pentachorons appear consistently across architecturally different VAEs, suggesting this may be a property of visual compression rather than a specific architectural choice.
Why Flux.1 Differs
Flux.1's latent space has measurably different geometry:
- 39% 2D content (planes, squares) vs <3% for others
- More even distribution across primitive classes
- Lower cross-similarity to other VAEs (0.50-0.64 vs 0.97+ within the SD/SDXL/Flux.2 family)
Observations:
- Flux.2 has unused batch normalization weights (`bn.running_var`, `bn.running_mean`) that Flux.1 lacks
- Flux.1 uses 16 channels; Flux.2 uses 32
- The channel groupings differ (Flux.1: pairs, Flux.2: quads)
We cannot conclude that batch norm "collapsed" the geometry or that more channels caused the difference. We only know the geometries differ. Understanding why would require access to training details, ablation studies, or direct communication with the Flux team.
The per-scale breakdown:
| Scale | SD 1.5 | Flux.1 | Flux.2 |
|---|---|---|---|
| L0 (macro) | pentachoron | pentachoron | pentachoron |
| L2 (mid) | saddle | plane 40% | saddle |
| L3 (local) | saddle 59% | saddle | saddle 70% |
Flux.1's mid-levels show planar structure where others show hyperbolic. This is the measurable difference.
Part III: Implications
For VAE Design
If saddle geometry emerges naturally and pentachorons organize macro structure, we can design VAEs with these priors built in:
- Initialize encoders to produce pentachoron-structured outputs
- Regularize toward specific curvature profiles during training
- Use the geometric classifier as a training loss component
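As a sketch of the third idea: using the classifier as a loss requires a differentiable stand-in for the hard top-10% binarization. A sigmoid soft-threshold is one option; nothing in the repository implements this, and the function name, temperature, and loss form are all assumptions:

```python
import torch

def geometric_prior_loss(latent_patches, classifier, target_class, tau=0.1):
    """Hypothetical sketch: push latent patches toward a target primitive
    class. A sigmoid around the per-patch 90th percentile approximates
    top-10% binarization while keeping gradients flowing."""
    flat = latent_patches.reshape(latent_patches.shape[0], -1)
    thresh = flat.quantile(0.9, dim=1, keepdim=True)
    soft = torch.sigmoid((flat - thresh) / tau)   # differentiable "binarize"
    logits = classifier(soft.reshape_as(latent_patches))
    targets = torch.full((logits.shape[0],), target_class, dtype=torch.long)
    return torch.nn.functional.cross_entropy(logits, targets)
```

Added to the usual reconstruction + KL objective with a small weight, this would let an experiment test whether steering the latent geometry helps or hurts reconstruction.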
For Latent Space Manipulation
Knowing the geometry enables targeted intervention:
- Navigate along saddle directions to move between modes
- Stay within planar cross-sections for smooth interpolation
- Identify pentachoron vertices as semantic anchors
For Understanding Learned Representations
This method generalizes beyond VAEs. Any learned representation that can be spatially organized and binarized can be geometrically fingerprinted:
- Transformer attention patterns
- CNN feature maps
- Language model embeddings (via reshaping)
Part IV: Connection to Broader Research
This work is part of a research program on geometric deep learning with pentachoron structures — the hypothesis that geometric structure can replace learned parameters.
Core Results
| System | Parameters | Performance | Notes |
|---|---|---|---|
| MNIST classifier | ~750 | 85% accuracy | Geometry encodes structure |
| ImageNet head | 72KB | Competitive | Crystal vocabulary |
| CIFAR-100 head | 78KB | 92% accuracy | Geometric vocabulary |
| This classifier | 638K | 98.1% accuracy, 0.88 conf on VAEs | Reads learned geometry |
The Observation
The 38 primitives we defined from mathematical first principles — pentachorons, saddles, planes, tetrahedra — appear consistently in VAE latent spaces trained on natural images.
This could mean:
- These geometries are attractors for learned compression
- Our primitive vocabulary happens to span common structures in high-dimensional data
- The binarization and extraction process biases toward certain shapes
We observe correlation, not causation. The VAEs weren't trained to produce pentachorons — but pentachorons appear. Whether this reflects deep structure or methodological artifact requires further investigation.
Reproducibility
All code and data available at:
https://huggingface.co/AbstractPhil/grid-geometric-classifier-sliding-proto
Files
| File | Purpose |
|---|---|
| `cell1_shape_generator.py` | 38-class synthetic shape generation |
| `cell2_model.py` | PatchCrossAttentionClassifier (638K params) |
| `cell3_trainer.py` | Training pipeline |
| `cell4_vae_pipeline.py` | Multi-scale batched extraction |
| `cell5_quad_vae_geometric_analysis.py` | Single VAE analysis |
| `cell6_quad_vae_analysis_mega_liminal.py` | Multi-VAE comparison |
| `best_vae_ca_classifier.pt` | Trained weights |
| `liminal.zip` | Test images (957) |
| `mega_liminal_captioned.zip` | Extended test images (2,074) |
| `multi_vae_comparison_*.json` | Raw results |
Running the Full Pipeline
# In Google Colab with GPU
# Generate shapes and train classifier
%run cell1_shape_generator.py
%run cell2_model.py
%run cell3_trainer.py # → 98.1% accuracy
# Define extraction pipeline
%run cell4_vae_pipeline.py
# Analyze single VAE
%run cell5_quad_vae_geometric_analysis.py
# Compare multiple VAEs
%run cell6_quad_vae_analysis_mega_liminal.py
Timeline of Key Experiments
| Date | Experiment | Result |
|---|---|---|
| 2025 | 5×5 2D grid, 16 classes | 85.6% accuracy, circle/ellipse failure |
| | 5×5×5 voxel grid, capacity cascade | 94.5% accuracy, fill profiles work |
| | 25×25×25 v11, tracer architecture | 97.01% accuracy, 7.5M params |
| Feb 2026 | Hypothesis: apply to VAE latents | — |
| | 8×16×16 non-cubic conversion | Aspect-ratio matched |
| | 3D CNN attempt | Works but slow |
| | PatchCrossAttention | 98.1% accuracy, 638K params |
| | Speed optimization | 138s/img → 0.33s/img |
| | Quad-VAE comparison | Flux.1 outlier discovered |
Future Directions
- Geometric VAE Training: Use the classifier as a regularizer during VAE training
- Cross-VAE Translation: Map representations between VAEs using geometric alignment
- Semantic-Geometry Correspondence: Which geometric classes correspond to which visual features?
- Temporal Analysis: Track geometric structure through diffusion timesteps
- Geometric Conditioning: Steer generation toward specific geometric classes
Acknowledgments
This research emerged from a year of exploration into pentachoron mathematics, crystalline vocabulary systems, and the hypothesis that universal geometric structures underlie efficient neural computation.
The capacity cascade architecture, the tracer attention mechanism, and the geometric primitive taxonomy were developed through extensive iteration, failure analysis, and refinement. Every "failure" — the circle/ellipse confusion at 5×5, the overflow scalar that killed dim1, the slow extraction pipeline — taught us something that made the final system stronger.
Citation
@misc{abstractphil2026geometric,
author = {AbstractPhil},
title = {Reading the Geometry of Learned Representations:
How Synthetic Primitives Became a Rosetta Stone for VAE Latent Spaces},
year = {2026},
publisher = {HuggingFace},
url = {https://huggingface.co/AbstractPhil/grid-geometric-classifier-sliding-proto}
}
"We cannot fight the universe, only exist within it."
The classifier finds consistent geometry in VAE latent spaces. Whether those shapes are fundamental or incidental remains an open question. But they're measurably, repeatably there.