GeoSpatial Prior: Synthetic 3D → Geometric Substrate Training
Incomplete Documentation
Claude provided an early document and it's full of problems that will need smoothing and refining.
This will not behave exactly as Claude says it will and there will be multiple refactors and compromises along the way.
Claude below
Abstract
A system for teaching geometric spatial reasoning to neural networks by rendering deterministic 3D scenes where every spatial relationship (position, occlusion, depth, lighting direction, scale) maps directly to known simplex coordinates. Rather than inferring geometric structure from 2D pixel statistics, we construct ground-truth spatial labels from a sectorized 5×5×5 perspective volume and use those labels to pretrain both a geometric classifier and a geometric CLIP variant. The result is a transferable spatial reasoning backbone that can be finetuned into any vision model, providing compositional understanding that current models lack.
1. Core Concept: The 5×5×5 Perspective Volume
1.1 Sectorized Space
The viewing frustum is divided into a 5×5×5 grid of sectors:
- X axis (horizontal): 5 columns, left to right
- Y axis (vertical): 5 rows, bottom to top
- Z axis (depth): 5 layers, near to far
This produces 125 sectors, each representing a unique spatial address (x, y, z) where x, y, z ∈ {0, 1, 2, 3, 4}.
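For a 125-way classification head these addresses are usually flattened to a single class index. A minimal sketch of one possible flattening convention (x fastest, z slowest; the convention itself is an assumption, not from the source) and its inverse:

```python
GRID = 5  # sectors per axis

def sector_to_index(x: int, y: int, z: int) -> int:
    """Flatten a (x, y, z) sector address to a class index in [0, 124]."""
    assert all(0 <= c < GRID for c in (x, y, z))
    return x + GRID * y + GRID * GRID * z

def index_to_sector(i: int) -> tuple[int, int, int]:
    """Invert sector_to_index."""
    return i % GRID, (i // GRID) % GRID, i // (GRID * GRID)
```

Any fixed convention works; what matters is that the renderer's label generator and the classifier head agree on it.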
1.2 Perspective Scaling
Critical insight: sectors are not uniform cubes in world space. The perspective projection means:
- Near sectors (z=0): Small world-space volume, large screen-space coverage. A coffee cup fills a near sector.
- Far sectors (z=4): Enormous world-space volume, small screen-space coverage. A football stadium fits in a far sector.
Each sector's world-space dimensions scale with depth:
sector_width(z) = 2 * z_distance * tan(fov_h / 2) / 5
sector_height(z) = 2 * z_distance * tan(fov_v / 2) / 5
sector_depth(z) = (z_far - z_near) / 5
This means the same 5×5×5 grid naturally encodes both tabletop scenes (near sectors) and landscape vistas (far sectors) in a single unified coordinate system.
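The scaling formulas above translate directly into code. A small sketch (how z_distance is measured per layer, e.g. at a layer's near face versus its center, is left as an assumption; FOV angles are in radians):

```python
import math

def sector_dims(z_distance, z_near, z_far, fov_h, fov_v, grid=5):
    """World-space (width, height, depth) of one sector at a given camera distance."""
    width = 2 * z_distance * math.tan(fov_h / 2) / grid
    height = 2 * z_distance * math.tan(fov_v / 2) / grid
    depth = (z_far - z_near) / grid  # depth slices are uniform
    return width, height, depth

# A near layer vs a far layer under a 60°/45° FOV: far sectors are much larger.
near = sector_dims(2.0, 1.0, 100.0, math.radians(60), math.radians(45))
far = sector_dims(80.0, 1.0, 100.0, math.radians(60), math.radians(45))
```

This makes the "coffee cup vs football stadium" point concrete: width and height grow linearly with distance while the depth slice stays fixed.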
1.3 Sector Labels
Every object placement generates a deterministic label vector:
| Label | Type | Description |
|---|---|---|
| sector_xyz | (int, int, int) | Primary sector address |
| sector_coverage | list[(int,int,int)] | All sectors the object spans |
| depth_order | int | Front-to-back ordering among all objects |
| occlusion_pct | float | Percentage of the object occluded by nearer objects |
| occluded_by | list[int] | IDs of occluding objects |
| screen_bbox | (x1, y1, x2, y2) | Normalized screen-space bounding box |
| relative_scale | float | Object's screen size relative to its true size |
| lighting_sector | (int, int, int) | Sector of dominant light source |
| shadow_direction | (float, float) | Normalized shadow vector on ground plane |
| viewing_angle | (float, float, float) | Object's rotation relative to camera |
1.4 Simplex Coordinate Mapping
The pentachoron's 5 vertices map to spatial dimensions:
| Vertex | Spatial Meaning |
|---|---|
| v₀ | Horizontal position (x) |
| v₁ | Vertical position (y) |
| v₂ | Depth / distance (z) |
| v₃ | Scale / size relationship |
| v₄ | Viewpoint / rotation encoding |
An object at sector (2, 3, 1) with moderate scale and frontal view maps to a specific barycentric coordinate on the simplex. This mapping is defined, not learned: the training teaches the network to predict these coordinates from pixels.
2. Rendering Pipeline
2.1 Requirements
- Speed: Must generate millions of training pairs. Target: 100+ scenes/second on a single GPU.
- Geometric precision: Exact depth buffers, clean occlusion boundaries, mathematically correct perspective.
- Visual diversity: Varied lighting, materials, and object complexity to prevent shortcut learning.
- NOT photorealism: The signal is spatial structure, not texture fidelity.
2.2 Renderer Selection
Primary: ModernGL (OpenGL via Python)
- GPU-accelerated, 200+ FPS for simple scenes
- Exact depth buffer access via framebuffer objects
- Programmable shaders for lighting control
- Clean Python API, Colab-compatible via EGL offscreen
Fallback: Analytical raymarcher (PyTorch-native)
- Zero external dependencies
- Every pixel's depth/normal is mathematically exact
- Slower but fully differentiable if needed
- Good for validation / ground truth comparison
2.3 Scene Composition Strategy
Each rendered scene is a configuration of:
- Camera: Position, FOV, near/far planes; these define the 5×5×5 frustum
- Objects: 1–8 primitive or composite objects placed in specific sectors
- Lighting: 1–3 lights placed at known sector positions
- Background: Solid color, gradient, or simple environment for depth contrast
Object types (progressive complexity):
| Phase | Objects | Purpose |
|---|---|---|
| Phase 1 | Geometric primitives (sphere, cube, cylinder, cone, torus) | Pure spatial reasoning, no semantic content |
| Phase 2 | Composite primitives (chair = cubes + cylinders, table = box + legs) | Multi-part spatial binding |
| Phase 3 | Low-poly meshes (human figure, car, tree, building) | Scale-appropriate object recognition |
| Phase 4 | Textured meshes with varied materials | Material-independent spatial reasoning |
2.4 Scene Generation Parameters
SceneConfig:
  n_objects: randint(1, 8)
  camera_fov: uniform(40°, 90°)
  camera_distance: log_uniform(2, 100)  # controls near/far content
  lighting:
    n_lights: randint(1, 3)
    light_sectors: random sectors from 5×5×5
    light_colors: random warm/cool/neutral
    ambient: uniform(0.1, 0.4)
  objects[i]:
    type: random from phase vocabulary
    sector: random (x, y, z)
    local_rotation: random euler angles
    scale_factor: uniform(0.5, 2.0) × sector_appropriate_base
    material: random (diffuse color, roughness)
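The sampling scheme above can be sketched with Python's random module. The ranges are the ones listed; the dict layout, the helper name sample_scene_config, and the omission of per-object materials and sector-appropriate base scale are simplifying assumptions:

```python
import math
import random

def sample_scene_config(phase_vocabulary, rng=None):
    """Draw one SceneConfig following the ranges in §2.4 (simplified sketch)."""
    rng = rng or random.Random()
    n_objects = rng.randint(1, 8)
    n_lights = rng.randint(1, 3)
    return {
        "n_objects": n_objects,
        "camera_fov_deg": rng.uniform(40.0, 90.0),
        # log-uniform on [2, 100]: uniform in log space, then exponentiate
        "camera_distance": math.exp(rng.uniform(math.log(2), math.log(100))),
        "lighting": {
            "n_lights": n_lights,
            "light_sectors": [tuple(rng.randrange(5) for _ in range(3))
                              for _ in range(n_lights)],
            "light_colors": [rng.choice(["warm", "cool", "neutral"])
                             for _ in range(n_lights)],
            "ambient": rng.uniform(0.1, 0.4),
        },
        "objects": [{
            "type": rng.choice(phase_vocabulary),
            "sector": tuple(rng.randrange(5) for _ in range(3)),
            "local_rotation": [rng.uniform(0, 2 * math.pi) for _ in range(3)],
            "scale_factor": rng.uniform(0.5, 2.0),
        } for _ in range(n_objects)],
    }
```

Passing an explicit rng makes scenes reproducible, which matters when regenerating hard examples (§4.3).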
2.5 Output Per Scene
| Output | Shape | Description |
|---|---|---|
| rgb | (H, W, 3) | Rendered color image |
| depth | (H, W, 1) | Linear depth buffer |
| normals | (H, W, 3) | Surface normals |
| instance_mask | (H, W, 1) | Per-object instance segmentation |
| sector_map | (H, W, 3) | Per-pixel sector assignment |
| labels | dict | Full label set per object (§1.3) |
| scene_graph | list[dict] | Complete spatial relationships between all objects |
3. Training Architecture
3.1 Stage 1: Geometric Spatial Classifier
Input: RGB image (rendered scene)
Output: Per-object sector predictions, depth ordering, occlusion graph
Architecture:
Image (512×512)
  → Vision backbone (ViT-B/16 or ResNet-50)
  → Sector prediction head:
      For each detected object:
      - Sector classification: (5×5×5) = 125-way softmax
      - Depth order: scalar regression
      - Occlusion: binary matrix (who occludes whom)
      - Scale: relative size regression
      - Viewing angle: (θ, φ, ψ) regression
  → Scene graph head:
      For each object pair:
      - Spatial relation: {in_front, behind, left, right, above, below, overlapping}
      - Distance in sector space
Loss function:
L_total = L_sector + λ₁·L_depth + λ₂·L_occlusion + λ₃·L_scale + λ₄·L_scene_graph
Where L_sector includes a geometric component that penalizes predictions that violate simplex constraints (e.g., predicted sector coordinates must form valid configurations on the pentachoron).
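A hedged sketch of how L_total might be assembled in PyTorch. The head and target names and the simplex_penalty hook are hypothetical; the source defines only the weighted sum and the existence of a geometric component inside L_sector:

```python
import torch
import torch.nn.functional as F

def total_loss(pred, target, lambdas=(1.0, 1.0, 1.0, 1.0), simplex_penalty=None):
    """Multi-task loss: sector + λ1·depth + λ2·occlusion + λ3·scale + λ4·scene_graph."""
    # 125-way sector classification, optionally with a simplex-constraint term
    l_sector = F.cross_entropy(pred["sector_logits"], target["sector_index"])
    if simplex_penalty is not None:
        l_sector = l_sector + simplex_penalty(pred["sector_logits"])
    # Scalar regressions for depth ordering and relative scale
    l_depth = F.mse_loss(pred["depth_order"], target["depth_order"])
    l_scale = F.mse_loss(pred["scale"], target["scale"])
    # Pairwise "who occludes whom" matrix as independent binary decisions
    l_occ = F.binary_cross_entropy_with_logits(pred["occlusion"], target["occlusion"])
    # 7-way spatial relation classification per object pair
    l_graph = F.cross_entropy(pred["relation_logits"], target["relation_index"])
    l1, l2, l3, l4 = lambdas
    return l_sector + l1 * l_depth + l2 * l_occ + l3 * l_scale + l4 * l_graph
```

The simplex_penalty slot is where the CM validity / volume preservation terms of §5.1 would plug in.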
Training data: 1M–10M rendered scenes with exact labels.
3.2 Stage 2: Geometric CLIP
Take the pretrained spatial backbone from Stage 1 and use it as the vision encoder for a CLIP-style contrastive model.
Vision encoder: Stage 1 backbone (frozen or lightly finetuned)
Text encoder: Standard transformer, initialized from existing CLIP text encoder
Contrastive target: Align image embeddings with text descriptions that include spatial language
Training captions are generated from scene labels:
def scene_to_caption(labels):
    """Generate spatial text from ground-truth labels."""
    parts = []
    for obj in labels["objects"]:
        # Position
        x, y, z = obj["sector_xyz"]
        h_pos = ["far left", "left", "center", "right", "far right"][x]
        v_pos = ["bottom", "lower", "middle", "upper", "top"][y]
        d_pos = ["very close", "near", "middle distance", "far", "very far"][z]
        parts.append(f"a {obj['type']} at {h_pos} {v_pos}, {d_pos}")
        # Occlusion (guard against an empty occluder list)
        if obj["occlusion_pct"] > 0.2 and obj["occluded_by"]:
            occluder = labels["objects"][obj["occluded_by"][0]]["type"]
            parts.append(f"partially behind the {occluder}")
        # Scale context
        if z >= 3:
            parts.append("appearing small in the distance")
    # Spatial relations (scene-level, outside the per-object loop)
    for rel in labels["scene_graph"]:
        parts.append(f"the {rel['obj_a']} is {rel['relation']} the {rel['obj_b']}")
    return ", ".join(parts)
Key insight: The geometric CLIP doesn't just learn "cat" → image-of-cat. It learns "cat at upper-right, far distance, partially behind a tree" → specific simplex configuration. Spatial prepositions become geometric operations, not statistical associations.
3.3 Stage 3: Transfer to Real Images
The pretrained geometric backbone transfers to real-world tasks:
Direct finetuning: Replace the CLIP encoder in SD1.5's conditioning path with the geometric CLIP. Now "cup on top of book" activates specific simplex configurations that were grounded in actual 3D relationships.
Inverse embedding: Given a real image, extract simplex coordinates that describe its spatial structure. These become geometric conditioning signals for diffusion models.
Hybrid: Use the geometric backbone as an auxiliary encoder alongside standard CLIP. The geometric channel provides spatial structure; CLIP provides semantic content. The geo_prior blends them on the simplex.
4. Dataset Scaling Strategy
4.1 Procedural Generation Tiers
| Tier | Scenes | Resolution | Objects | Purpose |
|---|---|---|---|---|
| Tier 1 | 1M | 256×256 | Primitives only | Fast pretraining of spatial reasoning |
| Tier 2 | 5M | 512×512 | Composites + varied lighting | Full spatial classifier training |
| Tier 3 | 10M | 512×512 | Low-poly meshes + textures | Geometric CLIP pretraining |
| Tier 4 | 1M | 512×512 | Complex scenes + real textures | Bridge to photorealism |
4.2 Augmentation via Sector Permutation
Because the 5×5×5 grid is symmetric, we can generate 6× more data for free by:
- Horizontal flip: maps sector (x, y, z) → (4-x, y, z)
- Vertical flip: maps sector (x, y, z) → (x, 4-y, z)
- 90° rotation: maps sector (x, y, z) → (y, 4-x, z)
Labels transform deterministically with the augmentation.
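The three permutations can be written as pure functions on sector addresses. A minimal sketch (the rendered image and direction-valued labels such as shadow_direction need matching transforms, omitted here):

```python
def hflip_sector(x, y, z):
    """Horizontal flip: mirror the x axis of the 5×5×5 grid."""
    return 4 - x, y, z

def vflip_sector(x, y, z):
    """Vertical flip: mirror the y axis."""
    return x, 4 - y, z

def rot90_sector(x, y, z):
    """90° in-plane rotation: (x, y) → (y, 4-x), depth unchanged."""
    return y, 4 - x, z
```

Because each map is its own bookkeeping, label vectors stay exact under augmentation; four applications of rot90_sector return the original address.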
4.3 Hard Example Mining
After initial training, identify failure modes:
- Sectors with high confusion rates (e.g., depth ordering at similar z-values)
- Occlusion patterns the model struggles with
- Scale ambiguities (large far objects vs small near objects)
Regenerate scenes specifically targeting these failure modes.
5. Simplex Integration
5.1 Geometric Loss During Pretraining
The existing Cayley-Menger + volume preservation loss from the geo_prior applies directly:
- CM validity: Ensures predicted sector coordinates form valid configurations on the pentachoron
- Volume preservation: Prevents collapse; all 5 spatial dimensions must remain discriminable
- Edge regularity: Maintains uniform spacing between spatial concepts
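The CM validity term has a direct numeric form: five vertex positions form a non-degenerate pentachoron exactly when the Cayley-Menger determinant yields a positive squared 4-volume. A minimal sketch of that check (the eps threshold and function names are assumptions; the loss plumbing around it is omitted):

```python
import math
import numpy as np

def cayley_menger_volume_sq(points):
    """Squared n-volume of the simplex on `points` (shape (n+1, d)) via Cayley-Menger."""
    pts = np.asarray(points, dtype=float)
    n = pts.shape[0] - 1  # simplex dimension: 4 for five vertices
    d2 = np.sum((pts[:, None, :] - pts[None, :, :]) ** 2, axis=-1)  # squared distances
    cm = np.ones((n + 2, n + 2))
    cm[0, 0] = 0.0
    cm[1:, 1:] = d2
    sign = (-1) ** (n + 1)
    return sign * np.linalg.det(cm) / ((2 ** n) * math.factorial(n) ** 2)

def is_valid_pentachoron(points, eps=1e-8):
    """CM validity: positive squared volume means the 5 vertices are non-degenerate."""
    return cayley_menger_volume_sq(points) > eps
```

Degenerate predictions (all five vertices near a common hyperplane) drive the squared volume to zero, which is exactly the collapse the volume-preservation term penalizes.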
5.2 Vertex Assignment as Spatial Binding
The vertex weight entropy findings from the triad study directly inform the architecture:
- Multi-object scenes (like object-relations): Should produce LOW vertex entropy → hard routing of objects to separate vertices
- Single-object attribute scenes (like characters): Should produce HIGH vertex entropy → soft blending of attributes on shared vertices
- The 5×5×5 scenes will produce BOTH patterns depending on object count and spatial configuration
This means the pretraining naturally teaches the prior when to use hard routing vs soft blending: the fundamental compositional operation.
5.3 Sector → Simplex Coordinate Function
The mapping from sector space to simplex space is a learnable function initialized as:
import torch
import torch.nn.functional as F

def sector_to_simplex(sector_xyz, scale, viewpoint, temperature=1.0):
    """
    Map 5×5×5 sector + metadata to pentachoron barycentric coordinates.
    Args:
        sector_xyz: (3,) integers in [0, 4]
        scale: float, relative object scale
        viewpoint: (3,) euler angles
        temperature: softmax temperature, trained alongside the classifier
    Returns:
        (5,) barycentric coordinates summing to 1
    """
    # Normalize inputs to [0, 1]
    x, y, z = torch.as_tensor(sector_xyz, dtype=torch.float32) / 4.0
    s = torch.sigmoid(torch.as_tensor(scale, dtype=torch.float32))
    v = F.normalize(torch.as_tensor(viewpoint, dtype=torch.float32), dim=0).mean()
    # Initial mapping: spread across 5 vertices
    raw = torch.stack([x, y, z, s, v])
    # Softmax yields valid barycentric coordinates
    return F.softmax(raw / temperature, dim=0)
The temperature and any learned transformations are trained alongside the classifier.
6. Implementation Roadmap
Phase 1: Renderer + Data Pipeline [HIGH PRIORITY]
ModernGL offscreen renderer
- EGL context setup (headless / Colab compatible)
- Programmable camera with configurable FOV, near/far
- 5Γ5Γ5 frustum sector calculation from camera params
- Depth buffer extraction
- Instance mask via unique-color rendering pass
- Basic Phong/Lambert lighting with positioned lights
Primitive library
- Sphere, cube, cylinder, cone, torus mesh generators
- UV-mapped for future texture support
- Per-primitive bounding box for sector assignment
Scene composer
- Random object placement within specified sectors
- Collision detection (prevent overlapping placements)
- Automatic occlusion calculation from depth buffer
- Scene graph generation (pairwise spatial relations)
- Perspective-correct sector scaling
Label generator
- Per-object label vector (§1.3)
- Scene-level spatial relation graph
- Caption generator (§3.2)
- Simplex coordinate ground truth
Data pipeline
- Parallel scene generation (multiprocess)
- WebDataset / streaming format for large-scale training
- On-the-fly augmentation (sector permutations)
- HuggingFace dataset upload integration
Phase 2: Geometric Classifier [HIGH PRIORITY]
Model architecture
- Vision backbone selection (ViT-B/16 vs ResNet-50)
- Sector prediction head (125-way classification per object)
- Depth ordering head
- Occlusion prediction head
- Scene graph prediction head
Training pipeline
- Multi-task loss with geometric regularization
- Simplex constraint loss (CM validity on predictions)
- Curriculum: primitives → composites → meshes
- Evaluation metrics: sector accuracy, depth ordering accuracy, occlusion F1
Ablation studies
- With vs without simplex constraint loss
- Effect of vertex count (k=4, k=5, k=6)
- Depth bucket resolution (3×3×3 vs 5×5×5 vs 7×7×7)
Phase 3: Geometric CLIP [MEDIUM PRIORITY]
Architecture
- Vision encoder: frozen Stage 1 backbone + projection
- Text encoder: initialize from OpenAI CLIP text encoder
- Contrastive loss with hard negatives (spatial near-misses)
Training
- Caption generation from scene labels
- Hard negative mining (swap spatial relations in text)
- Spatial preposition evaluation benchmark
- Transfer evaluation: zero-shot spatial classification on real images
Integration with existing pipeline
- Replace SD1.5 CLIP encoder with geometric CLIP
- Measure impact on compositional generation
- Compare geo_prior behavior with geometric vs standard CLIP
Phase 4: Transfer + Real-World Bridge [LOWER PRIORITY]
Domain transfer
- Finetune geometric backbone on COCO with spatial annotations
- Evaluate on spatial reasoning benchmarks (CLEVR, SpatialBench)
- Test compositional generation improvement in SD1.5
Inverse embedding pipeline
- Given real image → extract simplex coordinates
- Use as conditioning signal for diffusion
- Compare with CLIP-only conditioning
Hybrid encoder
- Dual-stream: geometric backbone + CLIP
- Learnable fusion on simplex
- Evaluate on attribute binding + spatial composition jointly
7. Key Hypotheses to Validate
Sector classification transfers to real images: A model trained entirely on synthetic 3D scenes can identify spatial sectors in photographs with >60% accuracy.
Geometric CLIP improves compositional generation: Replacing standard CLIP with geometric CLIP in the SD1.5 pipeline produces measurably better spatial composition (evaluated via CLEVR-style spatial accuracy metrics).
Simplex coordinates are a natural spatial language: The 5-vertex pentachoron provides sufficient dimensionality to encode the spatial relationships humans use in language (in front of, behind, above, below, next to, far away, etc.).
Vertex entropy predicts spatial complexity: Scenes with more objects produce lower vertex entropy (hard routing), while single-object scenes produce higher entropy (attribute binding). This pattern, observed in the triad study, should emerge naturally from synthetic training.
The 5×5×5 grid scales: The same sectorization works for both close-up tabletop scenes and panoramic landscapes by leveraging perspective scaling of sector volumes.
8. Technical Notes
8.1 Why 5Γ5Γ5?
- 5 matches the pentachoron vertex count, giving a direct 1:1 axis-to-vertex mapping
- 125 sectors is fine enough for meaningful spatial discrimination without being computationally prohibitive
- 5 depth layers capture immediate foreground, near, mid, far, and background, matching natural perceptual depth bands
- The number 5 appears consistently across the geometric vocabulary work (the k=4 simplex has 5 vertices, 5 edge dimensions, etc.)
8.2 Why Not Photorealistic Rendering?
- Photorealism introduces texture/material confounds that obscure spatial signal
- Simple rendering is 100–1000× faster, enabling much larger datasets
- Transfer from simple→real is well-studied (sim2real in robotics)
- The geometric prior should learn spatial structure independent of visual style; using simple rendering enforces this
- Phase 4 progressively adds visual complexity once spatial reasoning is established
8.3 Relationship to Existing Work
- CLEVR: Similar synthetic-scene approach, but with a small fixed vocabulary of spatial relations. Our 5×5×5 grid adds explicit depth sectors, scale, and perspective.
- NeRF / 3D Gaussians: Reconstruct 3D from 2D. We go the opposite direction β start with known 3D, teach networks to infer it from 2D.
- Spatial transformers: Learn attention over spatial positions. Our approach provides explicit spatial supervision rather than hoping attention learns spatial structure.
- Scene graphs: Prior work on scene graph prediction from images. Our contribution is grounding scene graphs in simplex geometry rather than abstract relation classification.
9. Resource Estimates
| Component | Compute | Storage |
|---|---|---|
| Tier 1 data generation (1M scenes) | 1 GPU, ~3 hours | ~50 GB |
| Tier 2 data generation (5M scenes) | 1 GPU, ~15 hours | ~250 GB |
| Classifier pretraining | 1 A100, ~24 hours | ~2 GB model |
| Geometric CLIP training | 1-4 A100s, ~48 hours | ~2 GB model |
| Transfer experiments | 1 A100, ~8 hours each | - |
10. Success Criteria
| Milestone | Metric | Target |
|---|---|---|
| Renderer works | Scenes/second | >100 on single GPU |
| Sector classifier | Top-1 sector accuracy (synthetic) | >90% |
| Depth ordering | Kendall's τ (synthetic) | >0.95 |
| Geometric CLIP | Spatial preposition accuracy (synthetic) | >85% |
| Real image transfer | Spatial sector accuracy (COCO) | >50% |
| Compositional generation | Spatial relation accuracy (SD1.5 + geometric CLIP) | >70% |
| Vertex entropy pattern | Matches triad study predictions | Qualitative match |