GeoSpatial Prior: Synthetic 3D → Geometric Substrate Training

Incomplete Documentation

Claude provided an early draft, and it is full of problems that will need smoothing and refining.

This will not behave exactly as Claude says it will and there will be multiple refactors and compromises along the way.

Claude below

Abstract

A system for teaching geometric spatial reasoning to neural networks by rendering deterministic 3D scenes where every spatial relationship (position, occlusion, depth, lighting direction, scale) maps directly to known simplex coordinates. Rather than inferring geometric structure from 2D pixel statistics, we construct ground-truth spatial labels from a sectorized 5×5×5 perspective volume and use those labels to pretrain both a geometric classifier and a geometric CLIP variant. The result is a transferable spatial reasoning backbone that can be finetuned into any vision model, providing compositional understanding that current models lack.


1. Core Concept: The 5×5×5 Perspective Volume

1.1 Sectorized Space

The viewing frustum is divided into a 5×5×5 grid of sectors:

  • X axis (horizontal): 5 columns, left to right
  • Y axis (vertical): 5 rows, bottom to top
  • Z axis (depth): 5 layers, near to far

This produces 125 sectors, each representing a unique spatial address (x, y, z) where x, y, z ∈ {0, 1, 2, 3, 4}.
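Under this scheme, a camera-space point resolves to a sector address by normalizing its position against the frustum extent at its depth. A minimal sketch of the idea (the function name and the looking-down-+z camera convention are illustrative assumptions, not taken from the pipeline):

```python
import math

def point_to_sector(px, py, pz, fov_h, fov_v, z_near, z_far, n=5):
    """Map a camera-space point to its (x, y, z) sector address.

    Assumes the camera looks down +z; px/py are lateral offsets at depth pz.
    Returns None if the point lies outside the frustum or depth range.
    """
    if not (z_near <= pz < z_far):
        return None
    # Frustum half-extent at this depth
    half_w = pz * math.tan(fov_h / 2)
    half_h = pz * math.tan(fov_v / 2)
    if abs(px) >= half_w or abs(py) >= half_h:
        return None
    # Normalize each axis to [0, 1), then bucket into n cells
    u = (px + half_w) / (2 * half_w)
    v = (py + half_h) / (2 * half_h)
    w = (pz - z_near) / (z_far - z_near)
    return (int(u * n), int(v * n), int(w * n))
```

A point on the optical axis at mid-depth lands in the central sector (2, 2, 2), as expected.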

1.2 Perspective Scaling

Critical insight: sectors are not uniform cubes in world space. The perspective projection means:

  • Near sectors (z=0): Small world-space volume, large screen-space coverage. A coffee cup fills a near sector.
  • Far sectors (z=4): Enormous world-space volume, small screen-space coverage. A football stadium fits in a far sector.

Each sector's world-space dimensions scale with depth:

sector_width(z)  = 2 * z_distance * tan(fov_h / 2) / 5
sector_height(z) = 2 * z_distance * tan(fov_v / 2) / 5
sector_depth(z)  = (z_far - z_near) / 5

This means the same 5×5×5 grid naturally encodes both tabletop scenes (near sectors) and landscape vistas (far sectors) in a single unified coordinate system.
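The scaling formulas above can be sketched directly. `sector_dims` is a hypothetical helper; evaluating width/height at a layer's far boundary is one possible convention, not a pipeline requirement:

```python
import math

def sector_dims(z_index, fov_h, fov_v, z_near, z_far, n=5):
    """World-space (width, height, depth) of a sector in depth layer z_index.

    Width and height are evaluated at the layer's far edge, so they grow
    with depth exactly as the formulas above describe.
    """
    layer_depth = (z_far - z_near) / n
    z_distance = z_near + (z_index + 1) * layer_depth  # far edge of the layer
    width = 2 * z_distance * math.tan(fov_h / 2) / n
    height = 2 * z_distance * math.tan(fov_v / 2) / n
    return width, height, layer_depth
```

With a 90° FOV and a [1, 101] depth range, a z=0 sector is about 8.4 units wide while a z=4 sector is about 40.4 units wide, which is the tabletop-vs-vista behavior described above.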

1.3 Sector Labels

Every object placement generates a deterministic label vector:

| Label | Type | Description |
|---|---|---|
| sector_xyz | (int, int, int) | Primary sector address |
| sector_coverage | list[(int, int, int)] | All sectors the object spans |
| depth_order | int | Front-to-back ordering among all objects |
| occlusion_pct | float | Percentage of object occluded by nearer objects |
| occluded_by | list[int] | IDs of occluding objects |
| screen_bbox | (x1, y1, x2, y2) | Normalized screen-space bounding box |
| relative_scale | float | Object's screen size relative to its true size |
| lighting_sector | (int, int, int) | Sector of dominant light source |
| shadow_direction | (float, float) | Normalized shadow vector on ground plane |
| viewing_angle | (float, float, float) | Object's rotation relative to camera |

1.4 Simplex Coordinate Mapping

The pentachoron's 5 vertices map to spatial dimensions:

| Vertex | Spatial Meaning |
|---|---|
| v₀ | Horizontal position (x) |
| v₁ | Vertical position (y) |
| v₂ | Depth / distance (z) |
| v₃ | Scale / size relationship |
| v₄ | Viewpoint / rotation encoding |

An object at sector (2, 3, 1) with moderate scale and frontal view maps to a specific barycentric coordinate on the simplex. This mapping is defined, not learned: the training teaches the network to predict these coordinates from pixels.


2. Rendering Pipeline

2.1 Requirements

  • Speed: Must generate millions of training pairs. Target: 100+ scenes/second on a single GPU.
  • Geometric precision: Exact depth buffers, clean occlusion boundaries, mathematically correct perspective.
  • Visual diversity: Varied lighting, materials, and object complexity to prevent shortcut learning.
  • NOT photorealism: The signal is spatial structure, not texture fidelity.

2.2 Renderer Selection

Primary: ModernGL (OpenGL via Python)

  • GPU-accelerated, 200+ FPS for simple scenes
  • Exact depth buffer access via framebuffer objects
  • Programmable shaders for lighting control
  • Clean Python API, Colab-compatible via EGL offscreen

Fallback: Analytical raymarcher (PyTorch-native)

  • Zero external dependencies
  • Every pixel's depth/normal is mathematically exact
  • Slower but fully differentiable if needed
  • Good for validation / ground truth comparison
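As an illustration of the analytical approach, a closed-form ray-sphere intersection yields mathematically exact depth and normal per pixel. This is a sketch of the idea only, not the actual fallback renderer:

```python
import math

def ray_sphere_depth_normal(origin, direction, center, radius):
    """Exact depth and surface normal for a ray hitting a sphere.

    origin/direction/center are 3-tuples; direction must be unit length.
    Returns (t, normal) for the nearest hit, or None if the ray misses.
    """
    oc = tuple(o - c for o, c in zip(origin, center))
    b = sum(d * e for d, e in zip(direction, oc))
    c = sum(e * e for e in oc) - radius * radius
    disc = b * b - c
    if disc < 0:
        return None  # ray misses the sphere
    t = -b - math.sqrt(disc)  # nearest of the two intersections
    if t < 0:
        return None  # sphere is behind the camera
    hit = tuple(o + t * d for o, d in zip(origin, direction))
    normal = tuple((h - cc) / radius for h, cc in zip(hit, center))
    return t, normal
```

Marching one such query per pixel gives the exact depth/normal buffers used for validating the ModernGL output.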

2.3 Scene Composition Strategy

Each rendered scene is a configuration of:

  1. Camera: Position, FOV, near/far planes → defines the 5×5×5 frustum
  2. Objects: 1–8 primitive or composite objects placed in specific sectors
  3. Lighting: 1–3 lights placed at known sector positions
  4. Background: Solid color, gradient, or simple environment for depth contrast

Object types (progressive complexity):

| Phase | Objects | Purpose |
|---|---|---|
| Phase 1 | Geometric primitives (sphere, cube, cylinder, cone, torus) | Pure spatial reasoning, no semantic content |
| Phase 2 | Composite primitives (chair = cubes + cylinders, table = box + legs) | Multi-part spatial binding |
| Phase 3 | Low-poly meshes (human figure, car, tree, building) | Scale-appropriate object recognition |
| Phase 4 | Textured meshes with varied materials | Material-independent spatial reasoning |

2.4 Scene Generation Parameters

SceneConfig:
  n_objects:        randint(1, 8)
  camera_fov:       uniform(40°, 90°)
  camera_distance:  log_uniform(2, 100)   # controls near/far content
  lighting:
    n_lights:       randint(1, 3)
    light_sectors:  random sectors from 5×5×5
    light_colors:   random warm/cool/neutral
    ambient:        uniform(0.1, 0.4)
  objects[i]:
    type:           random from phase vocabulary
    sector:         random (x, y, z)
    local_rotation: random euler angles
    scale_factor:   uniform(0.5, 2.0) × sector_appropriate_base
    material:       random (diffuse color, roughness)
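A minimal sampler for the configuration above might look like the following sketch (field names follow the pseudocode; the Phase 1 object vocabulary and radian angles are assumptions):

```python
import math
import random

PHASE1_VOCAB = ("sphere", "cube", "cylinder", "cone", "torus")

def sample_scene_config(vocab=PHASE1_VOCAB):
    """Sample one scene configuration following the ranges above."""
    def log_uniform(lo, hi):
        return math.exp(random.uniform(math.log(lo), math.log(hi)))

    n_lights = random.randint(1, 3)
    return {
        "camera_fov": random.uniform(math.radians(40), math.radians(90)),
        "camera_distance": log_uniform(2, 100),  # controls near/far content
        "lighting": {
            "n_lights": n_lights,
            "light_sectors": [tuple(random.randrange(5) for _ in range(3))
                              for _ in range(n_lights)],
            "ambient": random.uniform(0.1, 0.4),
        },
        "objects": [{
            "type": random.choice(vocab),
            "sector": tuple(random.randrange(5) for _ in range(3)),
            "local_rotation": [random.uniform(0, 2 * math.pi) for _ in range(3)],
            "scale_factor": random.uniform(0.5, 2.0),
        } for _ in range(random.randint(1, 8))],
    }
```

Material sampling and the sector-appropriate base scale are omitted here; they plug into the same dict.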

2.5 Output Per Scene

| Output | Shape | Description |
|---|---|---|
| rgb | (H, W, 3) | Rendered color image |
| depth | (H, W, 1) | Linear depth buffer |
| normals | (H, W, 3) | Surface normals |
| instance_mask | (H, W, 1) | Per-object instance segmentation |
| sector_map | (H, W, 3) | Per-pixel sector assignment |
| labels | dict | Full label set per object (§1.3) |
| scene_graph | list[dict] | Complete spatial relationships between all objects |
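One way to derive the occlusion labels is to render each object alone (so its full footprint is known) and test which pixels a nearer object covers. A sketch over per-object {pixel: depth} maps; this representation is an assumption for illustration, not the pipeline's actual buffers:

```python
def occlusion_stats(footprints):
    """Compute occlusion_pct and occluded_by for every object.

    footprints: {obj_id: {pixel: depth}}, where each map is the object's
    FULL screen footprint rendered alone (hidden pixels included).
    A pixel counts as occluded if any other object covers it at a
    strictly smaller depth.
    """
    stats = {}
    for oid, fp in footprints.items():
        hidden = 0
        occluders = set()
        for pixel, depth in fp.items():
            for other, ofp in footprints.items():
                if other != oid and ofp.get(pixel, float("inf")) < depth:
                    hidden += 1
                    occluders.add(other)
                    break  # one nearer object is enough for this pixel
        stats[oid] = {
            "occlusion_pct": hidden / len(fp) if fp else 0.0,
            "occluded_by": sorted(occluders),
        }
    return stats
```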

3. Training Architecture

3.1 Stage 1: Geometric Spatial Classifier

Input: RGB image (rendered scene)
Output: Per-object sector predictions, depth ordering, occlusion graph

Architecture:

Image (512×512)
  → Vision backbone (ViT-B/16 or ResNet-50)
  → Sector prediction head:
      For each detected object:
        - Sector classification: (5×5×5) = 125-way softmax
        - Depth order: scalar regression
        - Occlusion: binary matrix (who occludes whom)
        - Scale: relative size regression
        - Viewing angle: (θ, φ, ψ) regression
  → Scene graph head:
      For each object pair:
        - Spatial relation: {in_front, behind, left, right, above, below, overlapping}
        - Distance in sector space
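The scene graph head's training targets can be derived deterministically from sector addresses. A sketch, assuming the axis conventions of §1.1 and Chebyshev distance as the sector-space metric (both are illustrative choices):

```python
def sector_relations(a, b):
    """Spatial relations of object a relative to object b.

    a, b: (x, y, z) sector tuples. Returns a set drawn from the
    scene-graph vocabulary; 'overlapping' when both share a sector.
    """
    if a == b:
        return {"overlapping"}
    rels = set()
    if a[0] < b[0]: rels.add("left")
    if a[0] > b[0]: rels.add("right")
    if a[1] > b[1]: rels.add("above")
    if a[1] < b[1]: rels.add("below")
    if a[2] < b[2]: rels.add("in_front")
    if a[2] > b[2]: rels.add("behind")
    return rels

def sector_distance(a, b):
    """Chebyshev distance in sector space (one possible metric)."""
    return max(abs(ai - bi) for ai, bi in zip(a, b))
```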

Loss function:

L_total = L_sector + λ₁·L_depth + λ₂·L_occlusion + λ₃·L_scale + λ₄·L_scene_graph

Where L_sector includes a geometric component that penalizes predictions that violate simplex constraints (e.g., predicted sector coordinates must form valid configurations on the pentachoron).

Training data: 1M–10M rendered scenes with exact labels.

3.2 Stage 2: Geometric CLIP

Take the pretrained spatial backbone from Stage 1 and use it as the vision encoder for a CLIP-style contrastive model.

Vision encoder: Stage 1 backbone (frozen or lightly finetuned)
Text encoder: Standard transformer, initialized from existing CLIP text encoder
Contrastive target: Align image embeddings with text descriptions that include spatial language

Training captions are generated from scene labels:

def scene_to_caption(labels):
    """Generate spatial text from ground-truth labels."""
    parts = []
    for obj in labels["objects"]:
        # Position
        x, y, z = obj["sector_xyz"]
        h_pos = ["far left", "left", "center", "right", "far right"][x]
        v_pos = ["bottom", "lower", "middle", "upper", "top"][y]
        d_pos = ["very close", "near", "middle distance", "far", "very far"][z]
        
        parts.append(f"a {obj['type']} at {h_pos} {v_pos}, {d_pos}")
        
        # Occlusion
        if obj["occlusion_pct"] > 0.2:
            occluder = labels["objects"][obj["occluded_by"][0]]["type"]
            parts.append(f"partially behind the {occluder}")
        
        # Scale context
        if z >= 3:
            parts.append("appearing small in the distance")
    
    # Spatial relations
    for rel in labels["scene_graph"]:
        parts.append(f"the {rel['obj_a']} is {rel['relation']} the {rel['obj_b']}")
    
    return ", ".join(parts)

Key insight: The geometric CLIP doesn't just learn "cat" ↔ image-of-cat. It learns "cat at upper-right, far distance, partially behind a tree" ↔ a specific simplex configuration. Spatial prepositions become geometric operations, not statistical associations.
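The hard negatives used later for contrastive training (Phase 3's "swap spatial relations in text") can be generated by flipping a spatial term in a caption. A minimal sketch with an assumed opposites vocabulary:

```python
import re

# Opposite spatial terms; a minimal, assumed vocabulary
OPPOSITES = {
    "left": "right", "right": "left",
    "above": "below", "below": "above",
    "in front of": "behind", "behind": "in front of",
    "near": "far", "far": "near",
}

def spatial_hard_negative(caption):
    """Swap the first spatial term found, producing a contrastive negative."""
    # Try longest phrases first so "in front of" wins over its substrings
    for term in sorted(OPPOSITES, key=len, reverse=True):
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, caption):
            return re.sub(pattern, OPPOSITES[term], caption, count=1)
    return None  # no spatial term to corrupt
```

Because the scene labels are exact, each negative is guaranteed to be spatially false, which is what makes these near-misses informative.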

3.3 Stage 3: Transfer to Real Images

The pretrained geometric backbone transfers to real-world tasks:

  1. Direct finetuning: Replace the geometric CLIP's vision encoder in SD1.5's conditioning path. Now "cup on top of book" activates specific simplex configurations that were grounded in actual 3D relationships.

  2. Inverse embedding: Given a real image, extract simplex coordinates that describe its spatial structure. These become geometric conditioning signals for diffusion models.

  3. Hybrid: Use the geometric backbone as an auxiliary encoder alongside standard CLIP. The geometric channel provides spatial structure; CLIP provides semantic content. The geo_prior blends them on the simplex.


4. Dataset Scaling Strategy

4.1 Procedural Generation Tiers

| Tier | Scenes | Resolution | Objects | Purpose |
|---|---|---|---|---|
| Tier 1 | 1M | 256×256 | Primitives only | Fast pretraining of spatial reasoning |
| Tier 2 | 5M | 512×512 | Composites + varied lighting | Full spatial classifier training |
| Tier 3 | 10M | 512×512 | Low-poly meshes + textures | Geometric CLIP pretraining |
| Tier 4 | 1M | 512×512 | Complex scenes + real textures | Bridge to photorealism |

4.2 Augmentation via Sector Permutation

Because the 5×5×5 grid is symmetric, each rendered scene yields several additional training pairs for free:

  • Horizontal flip: maps sector (x, y, z) → (4-x, y, z)
  • Vertical flip: maps sector (x, y, z) → (x, 4-y, z)
  • 90° rotation: maps sector (x, y, z) → (y, 4-x, z)

Labels transform deterministically with the augmentation.
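The three permutations and their effect on sector-valued labels can be sketched as follows (only the sector-valued fields are shown; screen-space fields like screen_bbox transform analogously):

```python
def hflip(s):
    """Horizontal flip: (x, y, z) -> (4-x, y, z)."""
    return (4 - s[0], s[1], s[2])

def vflip(s):
    """Vertical flip: (x, y, z) -> (x, 4-y, z)."""
    return (s[0], 4 - s[1], s[2])

def rot90(s):
    """90-degree rotation in the x-y plane: (x, y, z) -> (y, 4-x, z)."""
    return (s[1], 4 - s[0], s[2])

def augment_labels(labels, op):
    """Apply one sector permutation to every sector-valued label field."""
    out = dict(labels)
    out["sector_xyz"] = op(labels["sector_xyz"])
    out["sector_coverage"] = [op(s) for s in labels["sector_coverage"]]
    out["lighting_sector"] = op(labels["lighting_sector"])
    return out
```

Note that rot90 applied four times returns every sector to where it started, so the permutations close under composition.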

4.3 Hard Example Mining

After initial training, identify failure modes:

  • Sectors with high confusion rates (e.g., depth ordering at similar z-values)
  • Occlusion patterns the model struggles with
  • Scale ambiguities (large far objects vs small near objects)

Regenerate scenes specifically targeting these failure modes.
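Confusion tracking for this mining step can be as simple as counting mismatched (true, predicted) sector pairs and regenerating scenes around the most frequent ones:

```python
from collections import Counter

def top_confusions(true_pred_pairs, k=3):
    """Rank (true_sector, predicted_sector) confusions by frequency.

    true_pred_pairs: iterable of (true, pred) sector tuples. Correct
    predictions are ignored; the top-k confused pairs seed the next
    round of targeted scene generation.
    """
    errors = Counter(p for p in true_pred_pairs if p[0] != p[1])
    return errors.most_common(k)
```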


5. Simplex Integration

5.1 Geometric Loss During Pretraining

The existing Cayley-Menger + volume preservation loss from the geo_prior applies directly:

  • CM validity: Ensures predicted sector coordinates form valid configurations on the pentachoron
  • Volume preservation: Prevents collapse; all 5 spatial dimensions must remain discriminable
  • Edge regularity: Maintains uniform spacing between spatial concepts

5.2 Vertex Assignment as Spatial Binding

The vertex weight entropy findings from the triad study directly inform the architecture:

  • Multi-object scenes (like object-relations): should produce LOW vertex entropy, i.e. hard routing of objects to separate vertices
  • Single-object attribute scenes (like characters): should produce HIGH vertex entropy, i.e. soft blending of attributes on shared vertices
  • The 5×5×5 scenes will produce BOTH patterns depending on object count and spatial configuration

This means the pretraining naturally teaches the prior when to use hard routing vs soft blending, the fundamental compositional operation.

5.3 Sector β†’ Simplex Coordinate Function

The mapping from sector space to simplex space is a learnable function initialized as:

import torch
import torch.nn.functional as F

def sector_to_simplex(sector_xyz, scale, viewpoint, temperature=1.0):
    """
    Map 5×5×5 sector + metadata to pentachoron barycentric coordinates.

    Args:
        sector_xyz: (3,) integers in [0, 4]
        scale: float, relative object scale
        viewpoint: (3,) euler angles
        temperature: softmax sharpness (learnable in practice)

    Returns:
        (5,) barycentric coordinates summing to 1
    """
    # Normalize inputs to [0, 1]
    x, y, z = (torch.as_tensor(sector_xyz, dtype=torch.float32) / 4.0).tolist()
    s = torch.sigmoid(torch.tensor(float(scale))).item()
    vp = torch.as_tensor(viewpoint, dtype=torch.float32)
    v = torch.mean(vp / (vp.norm() + 1e-8)).item()

    # Initial mapping: one input feature per vertex
    raw = torch.tensor([x, y, z, s, v])

    # Softmax yields valid barycentric coordinates (non-negative, sum to 1)
    return F.softmax(raw / temperature, dim=0)

The temperature and any learned transformations are trained alongside the classifier.


6. Implementation Roadmap

Phase 1: Renderer + Data Pipeline [HIGH PRIORITY]

  • ModernGL offscreen renderer

    • EGL context setup (headless / Colab compatible)
    • Programmable camera with configurable FOV, near/far
    • 5×5×5 frustum sector calculation from camera params
    • Depth buffer extraction
    • Instance mask via unique-color rendering pass
    • Basic Phong/Lambert lighting with positioned lights
  • Primitive library

    • Sphere, cube, cylinder, cone, torus mesh generators
    • UV-mapped for future texture support
    • Per-primitive bounding box for sector assignment
  • Scene composer

    • Random object placement within specified sectors
    • Collision detection (prevent overlapping placements)
    • Automatic occlusion calculation from depth buffer
    • Scene graph generation (pairwise spatial relations)
    • Perspective-correct sector scaling
  • Label generator

    • Per-object label vector (§1.3)
    • Scene-level spatial relation graph
    • Caption generator (§3.2)
    • Simplex coordinate ground truth
  • Data pipeline

    • Parallel scene generation (multiprocess)
    • WebDataset / streaming format for large-scale
    • On-the-fly augmentation (sector permutations)
    • HuggingFace dataset upload integration

Phase 2: Geometric Classifier [HIGH PRIORITY]

  • Model architecture

    • Vision backbone selection (ViT-B/16 vs ResNet-50)
    • Sector prediction head (125-way classification per object)
    • Depth ordering head
    • Occlusion prediction head
    • Scene graph prediction head
  • Training pipeline

    • Multi-task loss with geometric regularization
    • Simplex constraint loss (CM validity on predictions)
    • Curriculum: primitives → composites → meshes
    • Evaluation metrics: sector accuracy, depth ordering accuracy, occlusion F1
  • Ablation studies

    • With vs without simplex constraint loss
    • Effect of vertex count (k=4, k=5, k=6)
    • Depth bucket resolution (3×3×3 vs 5×5×5 vs 7×7×7)

Phase 3: Geometric CLIP [MEDIUM PRIORITY]

  • Architecture

    • Vision encoder: frozen Stage 1 backbone + projection
    • Text encoder: initialize from OpenAI CLIP text encoder
    • Contrastive loss with hard negatives (spatial near-misses)
  • Training

    • Caption generation from scene labels
    • Hard negative mining (swap spatial relations in text)
    • Spatial preposition evaluation benchmark
    • Transfer evaluation: zero-shot spatial classification on real images
  • Integration with existing pipeline

    • Replace SD1.5 CLIP encoder with geometric CLIP
    • Measure impact on compositional generation
    • Compare geo_prior behavior with geometric vs standard CLIP

Phase 4: Transfer + Real-World Bridge [LOWER PRIORITY]

  • Domain transfer

    • Finetune geometric backbone on COCO with spatial annotations
    • Evaluate on spatial reasoning benchmarks (CLEVR, SpatialBench)
    • Test compositional generation improvement in SD1.5
  • Inverse embedding pipeline

    • Given real image → extract simplex coordinates
    • Use as conditioning signal for diffusion
    • Compare with CLIP-only conditioning
  • Hybrid encoder

    • Dual-stream: geometric backbone + CLIP
    • Learnable fusion on simplex
    • Evaluate on attribute binding + spatial composition jointly

7. Key Hypotheses to Validate

  1. Sector classification transfers to real images: A model trained entirely on synthetic 3D scenes can identify spatial sectors in photographs with >60% accuracy.

  2. Geometric CLIP improves compositional generation: Replacing standard CLIP with geometric CLIP in the SD1.5 pipeline produces measurably better spatial composition (evaluated via CLEVR-style spatial accuracy metrics).

  3. Simplex coordinates are a natural spatial language: The 5-vertex pentachoron provides sufficient dimensionality to encode the spatial relationships humans use in language (in front of, behind, above, below, next to, far away, etc.).

  4. Vertex entropy predicts spatial complexity: Scenes with more objects produce lower vertex entropy (hard routing), while single-object scenes produce higher entropy (attribute binding). This pattern, observed in the triad study, should emerge naturally from synthetic training.

  5. The 5×5×5 grid scales: The same sectorization works for both close-up tabletop scenes and panoramic landscapes by leveraging perspective scaling of sector volumes.


8. Technical Notes

8.1 Why 5×5×5?

  • 5 matches the pentachoron vertex count, giving a direct 1:1 axis-to-vertex mapping
  • 125 sectors is fine enough for meaningful spatial discrimination without being computationally prohibitive
  • 5 depth layers capture immediate foreground, near, mid, far, and background, matching natural perceptual depth bands
  • The number 5 appears consistently across the geometric vocabulary work (the k=4 simplex has 5 vertices, 5 edge dimensions, etc.)

8.2 Why Not Photorealistic Rendering?

  • Photorealism introduces texture/material confounds that obscure spatial signal
  • Simple rendering is 100-1000× faster, enabling much larger datasets
  • Transfer from simple→real is well-studied (sim2real in robotics)
  • The geometric prior should learn spatial structure independent of visual style; simple rendering enforces this
  • Phase 4 progressively adds visual complexity once spatial reasoning is established

8.3 Relationship to Existing Work

  • CLEVR: Similar synthetic scene approach but limited to 2D spatial relations. Our 5×5×5 grid adds depth, scale, and perspective.
  • NeRF / 3D Gaussians: Reconstruct 3D from 2D. We go the opposite direction: start with known 3D, teach networks to infer it from 2D.
  • Spatial transformers: Learn attention over spatial positions. Our approach provides explicit spatial supervision rather than hoping attention learns spatial structure.
  • Scene graphs: Prior work on scene graph prediction from images. Our contribution is grounding scene graphs in simplex geometry rather than abstract relation classification.

9. Resource Estimates

| Component | Compute | Storage |
|---|---|---|
| Tier 1 data generation (1M scenes) | 1 GPU, ~3 hours | ~50 GB |
| Tier 2 data generation (5M scenes) | 1 GPU, ~15 hours | ~250 GB |
| Classifier pretraining | 1 A100, ~24 hours | ~2 GB model |
| Geometric CLIP training | 1-4 A100s, ~48 hours | ~2 GB model |
| Transfer experiments | 1 A100, ~8 hours each | - |

10. Success Criteria

| Milestone | Metric | Target |
|---|---|---|
| Renderer works | Scenes/second | >100 on single GPU |
| Sector classifier | Top-1 sector accuracy (synthetic) | >90% |
| Depth ordering | Kendall's τ (synthetic) | >0.95 |
| Geometric CLIP | Spatial preposition accuracy (synthetic) | >85% |
| Real image transfer | Spatial sector accuracy (COCO) | >50% |
| Compositional generation | Spatial relation accuracy (SD1.5 + geometric CLIP) | >70% |
| Vertex entropy pattern | Matches triad study predictions | Qualitative match |
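Kendall's τ for the depth-ordering milestone can be computed directly from pairwise rank agreements; a minimal sketch assuming untied ranks and at least two objects:

```python
from itertools import combinations

def kendall_tau(true_order, pred_order):
    """Kendall's τ between two depth orderings of the same objects.

    Each argument maps object id -> front-to-back rank. τ = 1 means
    perfect agreement, -1 full reversal. Assumes no tied ranks and
    at least two objects.
    """
    ids = list(true_order)
    concordant = discordant = 0
    for a, b in combinations(ids, 2):
        t = true_order[a] - true_order[b]
        p = pred_order[a] - pred_order[b]
        if t * p > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)
```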