GeoSpatial Prior: Synthetic 3D → Geometric Substrate Training

Incomplete Documentation

Claude provided an early draft, and it is full of problems that will need smoothing and refining.

This will not behave exactly as Claude says it will and there will be multiple refactors and compromises along the way.

Claude below

Abstract

A system for teaching geometric spatial reasoning to neural networks by rendering deterministic 3D scenes where every spatial relationship (position, occlusion, depth, lighting direction, scale) maps directly to known simplex coordinates. Rather than inferring geometric structure from 2D pixel statistics, we construct ground-truth spatial labels from a sectorized 5×5×5 perspective volume and use those labels to pretrain both a geometric classifier and a geometric CLIP variant. The result is a transferable spatial reasoning backbone that can be finetuned into any vision model, providing compositional understanding that current models lack.


1. Core Concept: The 5×5×5 Perspective Volume

1.1 Sectorized Space

The viewing frustum is divided into a 5×5×5 grid of sectors:

  • X axis (horizontal): 5 columns, left to right
  • Y axis (vertical): 5 rows, bottom to top
  • Z axis (depth): 5 layers, near to far

This produces 125 sectors, each representing a unique spatial address (x, y, z) where x, y, z ∈ {0, 1, 2, 3, 4}.
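Under this scheme, a camera-space point resolves to a sector address by normalizing its position against the frustum extent at its depth. A minimal sketch of the idea (the function name and the looking-down-+z camera convention are illustrative assumptions, not taken from the pipeline):

```python
import math

def point_to_sector(px, py, pz, fov_h, fov_v, z_near, z_far, n=5):
    """Map a camera-space point to its (x, y, z) sector address.

    Assumes the camera looks down +z; px/py are lateral offsets at depth pz.
    Returns None if the point lies outside the frustum or depth range.
    """
    if not (z_near <= pz < z_far):
        return None
    # Frustum half-extent at this depth
    half_w = pz * math.tan(fov_h / 2)
    half_h = pz * math.tan(fov_v / 2)
    if abs(px) >= half_w or abs(py) >= half_h:
        return None
    # Normalize each axis to [0, 1), then bucket into n cells
    u = (px + half_w) / (2 * half_w)
    v = (py + half_h) / (2 * half_h)
    w = (pz - z_near) / (z_far - z_near)
    return (int(u * n), int(v * n), int(w * n))
```

A point on the optical axis at mid-depth lands in the central sector (2, 2, 2), as expected.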

1.2 Perspective Scaling

Critical insight: sectors are not uniform cubes in world space. The perspective projection means:

  • Near sectors (z=0): Small world-space volume, large screen-space coverage. A coffee cup fills a near sector.
  • Far sectors (z=4): Enormous world-space volume, small screen-space coverage. A football stadium fits in a far sector.

Each sector's world-space dimensions scale with depth:

sector_width(z)  = 2 * z_distance * tan(fov_h / 2) / 5
sector_height(z) = 2 * z_distance * tan(fov_v / 2) / 5
sector_depth(z)  = (z_far - z_near) / 5

This means the same 5×5×5 grid naturally encodes both tabletop scenes (near sectors) and landscape vistas (far sectors) in a single unified coordinate system.
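The scaling formulas above can be sketched directly. `sector_dims` is a hypothetical helper; evaluating width/height at a layer's far boundary is one possible convention, not a pipeline requirement:

```python
import math

def sector_dims(z_index, fov_h, fov_v, z_near, z_far, n=5):
    """World-space (width, height, depth) of a sector in depth layer z_index.

    Width and height are evaluated at the layer's far edge, so they grow
    with depth exactly as the formulas above describe.
    """
    layer_depth = (z_far - z_near) / n
    z_distance = z_near + (z_index + 1) * layer_depth  # far edge of the layer
    width = 2 * z_distance * math.tan(fov_h / 2) / n
    height = 2 * z_distance * math.tan(fov_v / 2) / n
    return width, height, layer_depth
```

With a 90° FOV and a [1, 101] depth range, a z=0 sector is about 8.4 units wide while a z=4 sector is about 40.4 units wide, which is the tabletop-vs-vista behavior described above.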

1.3 Sector Labels

Every object placement generates a deterministic label vector:

| Label | Type | Description |
|---|---|---|
| sector_xyz | (int, int, int) | Primary sector address |
| sector_coverage | list[(int, int, int)] | All sectors the object spans |
| depth_order | int | Front-to-back ordering among all objects |
| occlusion_pct | float | Percentage of object occluded by nearer objects |
| occluded_by | list[int] | IDs of occluding objects |
| screen_bbox | (x1, y1, x2, y2) | Normalized screen-space bounding box |
| relative_scale | float | Object's screen size relative to its true size |
| lighting_sector | (int, int, int) | Sector of dominant light source |
| shadow_direction | (float, float) | Normalized shadow vector on ground plane |
| viewing_angle | (float, float, float) | Object's rotation relative to camera |

1.4 Simplex Coordinate Mapping

The pentachoron's 5 vertices map to spatial dimensions:

| Vertex | Spatial Meaning |
|---|---|
| v₀ | Horizontal position (x) |
| v₁ | Vertical position (y) |
| v₂ | Depth / distance (z) |
| v₃ | Scale / size relationship |
| v₄ | Viewpoint / rotation encoding |

An object at sector (2, 3, 1) with moderate scale and frontal view maps to a specific barycentric coordinate on the simplex. This mapping is defined, not learned: the training teaches the network to predict these coordinates from pixels.


2. Rendering Pipeline

2.1 Requirements

  • Speed: Must generate millions of training pairs. Target: 100+ scenes/second on a single GPU.
  • Geometric precision: Exact depth buffers, clean occlusion boundaries, mathematically correct perspective.
  • Visual diversity: Varied lighting, materials, and object complexity to prevent shortcut learning.
  • NOT photorealism: The signal is spatial structure, not texture fidelity.

2.2 Renderer Selection

Primary: ModernGL (OpenGL via Python)

  • GPU-accelerated, 200+ FPS for simple scenes
  • Exact depth buffer access via framebuffer objects
  • Programmable shaders for lighting control
  • Clean Python API, Colab-compatible via EGL offscreen

Fallback: Analytical raymarcher (PyTorch-native)

  • Zero external dependencies
  • Every pixel's depth/normal is mathematically exact
  • Slower but fully differentiable if needed
  • Good for validation / ground truth comparison
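As an illustration of the analytical approach, a closed-form ray-sphere intersection yields mathematically exact depth and normal per pixel. This is a sketch of the idea only, not the actual fallback renderer:

```python
import math

def ray_sphere_depth_normal(origin, direction, center, radius):
    """Exact depth and surface normal for a ray hitting a sphere.

    origin/direction/center are 3-tuples; direction must be unit length.
    Returns (t, normal) for the nearest hit, or None if the ray misses.
    """
    oc = tuple(o - c for o, c in zip(origin, center))
    b = sum(d * e for d, e in zip(direction, oc))
    c = sum(e * e for e in oc) - radius * radius
    disc = b * b - c
    if disc < 0:
        return None  # ray misses the sphere
    t = -b - math.sqrt(disc)  # nearest of the two intersections
    if t < 0:
        return None  # sphere is behind the camera
    hit = tuple(o + t * d for o, d in zip(origin, direction))
    normal = tuple((h - cc) / radius for h, cc in zip(hit, center))
    return t, normal
```

Marching one such query per pixel gives the exact depth/normal buffers used for validating the ModernGL output.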

2.3 Scene Composition Strategy

Each rendered scene is a configuration of:

  1. Camera: Position, FOV, near/far planes → defines the 5×5×5 frustum
  2. Objects: 1–8 primitive or composite objects placed in specific sectors
  3. Lighting: 1–3 lights placed at known sector positions
  4. Background: Solid color, gradient, or simple environment for depth contrast

Object types (progressive complexity):

| Phase | Objects | Purpose |
|---|---|---|
| Phase 1 | Geometric primitives (sphere, cube, cylinder, cone, torus) | Pure spatial reasoning, no semantic content |
| Phase 2 | Composite primitives (chair = cubes + cylinders, table = box + legs) | Multi-part spatial binding |
| Phase 3 | Low-poly meshes (human figure, car, tree, building) | Scale-appropriate object recognition |
| Phase 4 | Textured meshes with varied materials | Material-independent spatial reasoning |

2.4 Scene Generation Parameters

SceneConfig:
  n_objects:        randint(1, 8)
  camera_fov:       uniform(40°, 90°)
  camera_distance:  log_uniform(2, 100)   # controls near/far content
  lighting:
    n_lights:       randint(1, 3)
    light_sectors:  random sectors from 5×5×5
    light_colors:   random warm/cool/neutral
    ambient:        uniform(0.1, 0.4)
  objects[i]:
    type:           random from phase vocabulary
    sector:         random (x, y, z)
    local_rotation: random euler angles
    scale_factor:   uniform(0.5, 2.0) × sector_appropriate_base
    material:       random (diffuse color, roughness)
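A minimal sampler for the configuration above might look like the following sketch (field names follow the pseudocode; the Phase 1 object vocabulary and radian angles are assumptions):

```python
import math
import random

PHASE1_VOCAB = ("sphere", "cube", "cylinder", "cone", "torus")

def sample_scene_config(vocab=PHASE1_VOCAB):
    """Sample one scene configuration following the ranges above."""
    def log_uniform(lo, hi):
        return math.exp(random.uniform(math.log(lo), math.log(hi)))

    n_lights = random.randint(1, 3)
    return {
        "camera_fov": random.uniform(math.radians(40), math.radians(90)),
        "camera_distance": log_uniform(2, 100),  # controls near/far content
        "lighting": {
            "n_lights": n_lights,
            "light_sectors": [tuple(random.randrange(5) for _ in range(3))
                              for _ in range(n_lights)],
            "ambient": random.uniform(0.1, 0.4),
        },
        "objects": [{
            "type": random.choice(vocab),
            "sector": tuple(random.randrange(5) for _ in range(3)),
            "local_rotation": [random.uniform(0, 2 * math.pi) for _ in range(3)],
            "scale_factor": random.uniform(0.5, 2.0),
        } for _ in range(random.randint(1, 8))],
    }
```

Material sampling and the sector-appropriate base scale are omitted here; they plug into the same dict.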

2.5 Output Per Scene

| Output | Shape | Description |
|---|---|---|
| rgb | (H, W, 3) | Rendered color image |
| depth | (H, W, 1) | Linear depth buffer |
| normals | (H, W, 3) | Surface normals |
| instance_mask | (H, W, 1) | Per-object instance segmentation |
| sector_map | (H, W, 3) | Per-pixel sector assignment |
| labels | dict | Full label set per object (§1.3) |
| scene_graph | list[dict] | Complete spatial relationships between all objects |
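One way to derive the occlusion labels is to render each object alone (so its full footprint is known) and test which pixels a nearer object covers. A sketch over per-object {pixel: depth} maps; this representation is an assumption for illustration, not the pipeline's actual buffers:

```python
def occlusion_stats(footprints):
    """Compute occlusion_pct and occluded_by for every object.

    footprints: {obj_id: {pixel: depth}}, where each map is the object's
    FULL screen footprint rendered alone (hidden pixels included).
    A pixel counts as occluded if any other object covers it at a
    strictly smaller depth.
    """
    stats = {}
    for oid, fp in footprints.items():
        hidden = 0
        occluders = set()
        for pixel, depth in fp.items():
            for other, ofp in footprints.items():
                if other != oid and ofp.get(pixel, float("inf")) < depth:
                    hidden += 1
                    occluders.add(other)
                    break  # one nearer object is enough for this pixel
        stats[oid] = {
            "occlusion_pct": hidden / len(fp) if fp else 0.0,
            "occluded_by": sorted(occluders),
        }
    return stats
```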

3. Training Architecture

3.1 Stage 1: Geometric Spatial Classifier

Input: RGB image (rendered scene)
Output: Per-object sector predictions, depth ordering, occlusion graph

Architecture:

Image (512×512)
  → Vision backbone (ViT-B/16 or ResNet-50)
  → Sector prediction head:
      For each detected object:
        - Sector classification: (5×5×5) = 125-way softmax
        - Depth order: scalar regression
        - Occlusion: binary matrix (who occludes whom)
        - Scale: relative size regression
        - Viewing angle: (θ, φ, ψ) regression
  → Scene graph head:
      For each object pair:
        - Spatial relation: {in_front, behind, left, right, above, below, overlapping}
        - Distance in sector space
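The scene graph head's training targets can be derived deterministically from sector addresses. A sketch, assuming the axis conventions of §1.1 and Chebyshev distance as the sector-space metric (both are illustrative choices):

```python
def sector_relations(a, b):
    """Spatial relations of object a relative to object b.

    a, b: (x, y, z) sector tuples. Returns a set drawn from the
    scene-graph vocabulary; 'overlapping' when both share a sector.
    """
    if a == b:
        return {"overlapping"}
    rels = set()
    if a[0] < b[0]: rels.add("left")
    if a[0] > b[0]: rels.add("right")
    if a[1] > b[1]: rels.add("above")
    if a[1] < b[1]: rels.add("below")
    if a[2] < b[2]: rels.add("in_front")
    if a[2] > b[2]: rels.add("behind")
    return rels

def sector_distance(a, b):
    """Chebyshev distance in sector space (one possible metric)."""
    return max(abs(ai - bi) for ai, bi in zip(a, b))
```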

Loss function:

L_total = L_sector + λ₁·L_depth + λ₂·L_occlusion + λ₃·L_scale + λ₄·L_scene_graph

Where L_sector includes a geometric component that penalizes predictions that violate simplex constraints (e.g., predicted sector coordinates must form valid configurations on the pentachoron).

Training data: 1M–10M rendered scenes with exact labels.

3.2 Stage 2: Geometric CLIP

Take the pretrained spatial backbone from Stage 1 and use it as the vision encoder for a CLIP-style contrastive model.

Vision encoder: Stage 1 backbone (frozen or lightly finetuned)
Text encoder: Standard transformer, initialized from existing CLIP text encoder
Contrastive target: Align image embeddings with text descriptions that include spatial language

Training captions are generated from scene labels:

def scene_to_caption(labels):
    """Generate spatial text from ground-truth labels."""
    parts = []
    for obj in labels["objects"]:
        # Position
        x, y, z = obj["sector_xyz"]
        h_pos = ["far left", "left", "center", "right", "far right"][x]
        v_pos = ["bottom", "lower", "middle", "upper", "top"][y]
        d_pos = ["very close", "near", "middle distance", "far", "very far"][z]
        
        parts.append(f"a {obj['type']} at {h_pos} {v_pos}, {d_pos}")
        
        # Occlusion
        if obj["occlusion_pct"] > 0.2:
            occluder = labels["objects"][obj["occluded_by"][0]]["type"]
            parts.append(f"partially behind the {occluder}")
        
        # Scale context
        if z >= 3:
            parts.append("appearing small in the distance")
    
    # Spatial relations
    for rel in labels["scene_graph"]:
        parts.append(f"the {rel['obj_a']} is {rel['relation']} the {rel['obj_b']}")
    
    return ", ".join(parts)

Key insight: The geometric CLIP doesn't just learn "cat" ↔ image-of-cat. It learns "cat at upper-right, far distance, partially behind a tree" ↔ a specific simplex configuration. Spatial prepositions become geometric operations, not statistical associations.
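The hard negatives used later for contrastive training (Phase 3's "swap spatial relations in text") can be generated by flipping a spatial term in a caption. A minimal sketch with an assumed opposites vocabulary:

```python
import re

# Opposite spatial terms; a minimal, assumed vocabulary
OPPOSITES = {
    "left": "right", "right": "left",
    "above": "below", "below": "above",
    "in front of": "behind", "behind": "in front of",
    "near": "far", "far": "near",
}

def spatial_hard_negative(caption):
    """Swap the first spatial term found, producing a contrastive negative."""
    # Try longest phrases first so "in front of" wins over its substrings
    for term in sorted(OPPOSITES, key=len, reverse=True):
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, caption):
            return re.sub(pattern, OPPOSITES[term], caption, count=1)
    return None  # no spatial term to corrupt
```

Because the scene labels are exact, each negative is guaranteed to be spatially false, which is what makes these near-misses informative.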

3.3 Stage 3: Transfer to Real Images

The pretrained geometric backbone transfers to real-world tasks:

  1. Direct finetuning: Replace the geometric CLIP's vision encoder in SD1.5's conditioning path. Now "cup on top of book" activates specific simplex configurations that were grounded in actual 3D relationships.

  2. Inverse embedding: Given a real image, extract simplex coordinates that describe its spatial structure. These become geometric conditioning signals for diffusion models.

  3. Hybrid: Use the geometric backbone as an auxiliary encoder alongside standard CLIP. The geometric channel provides spatial structure; CLIP provides semantic content. The geo_prior blends them on the simplex.


4. Dataset Scaling Strategy

4.1 Procedural Generation Tiers

| Tier | Scenes | Resolution | Objects | Purpose |
|---|---|---|---|---|
| Tier 1 | 1M | 256×256 | Primitives only | Fast pretraining of spatial reasoning |
| Tier 2 | 5M | 512×512 | Composites + varied lighting | Full spatial classifier training |
| Tier 3 | 10M | 512×512 | Low-poly meshes + textures | Geometric CLIP pretraining |
| Tier 4 | 1M | 512×512 | Complex scenes + real textures | Bridge to photorealism |

4.2 Augmentation via Sector Permutation

Because the 5×5×5 grid is symmetric, each rendered scene yields several additional training pairs for free:

  • Horizontal flip: maps sector (x, y, z) → (4-x, y, z)
  • Vertical flip: maps sector (x, y, z) → (x, 4-y, z)
  • 90° rotation: maps sector (x, y, z) → (y, 4-x, z)

Labels transform deterministically with the augmentation.
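The three permutations and their effect on sector-valued labels can be sketched as follows (only the sector-valued fields are shown; screen-space fields like screen_bbox transform analogously):

```python
def hflip(s):
    """Horizontal flip: (x, y, z) -> (4-x, y, z)."""
    return (4 - s[0], s[1], s[2])

def vflip(s):
    """Vertical flip: (x, y, z) -> (x, 4-y, z)."""
    return (s[0], 4 - s[1], s[2])

def rot90(s):
    """90-degree rotation in the x-y plane: (x, y, z) -> (y, 4-x, z)."""
    return (s[1], 4 - s[0], s[2])

def augment_labels(labels, op):
    """Apply one sector permutation to every sector-valued label field."""
    out = dict(labels)
    out["sector_xyz"] = op(labels["sector_xyz"])
    out["sector_coverage"] = [op(s) for s in labels["sector_coverage"]]
    out["lighting_sector"] = op(labels["lighting_sector"])
    return out
```

Note that rot90 applied four times returns every sector to where it started, so the permutations close under composition.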

4.3 Hard Example Mining

After initial training, identify failure modes:

  • Sectors with high confusion rates (e.g., depth ordering at similar z-values)
  • Occlusion patterns the model struggles with
  • Scale ambiguities (large far objects vs small near objects)

Regenerate scenes specifically targeting these failure modes.
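Confusion tracking for this mining step can be as simple as counting mismatched (true, predicted) sector pairs and regenerating scenes around the most frequent ones:

```python
from collections import Counter

def top_confusions(true_pred_pairs, k=3):
    """Rank (true_sector, predicted_sector) confusions by frequency.

    true_pred_pairs: iterable of (true, pred) sector tuples. Correct
    predictions are ignored; the top-k confused pairs seed the next
    round of targeted scene generation.
    """
    errors = Counter(p for p in true_pred_pairs if p[0] != p[1])
    return errors.most_common(k)
```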


5. Simplex Integration

5.1 Geometric Loss During Pretraining

The existing Cayley-Menger + volume preservation loss from the geo_prior applies directly:

  • CM validity: Ensures predicted sector coordinates form valid configurations on the pentachoron
  • Volume preservation: Prevents collapse; all 5 spatial dimensions must remain discriminable
  • Edge regularity: Maintains uniform spacing between spatial concepts

5.2 Vertex Assignment as Spatial Binding

The vertex weight entropy findings from the triad study directly inform the architecture:

  • Multi-object scenes (like object-relations): should produce LOW vertex entropy, i.e. hard routing of objects to separate vertices
  • Single-object attribute scenes (like characters): should produce HIGH vertex entropy, i.e. soft blending of attributes on shared vertices
  • The 5×5×5 scenes will produce BOTH patterns depending on object count and spatial configuration

This means the pretraining naturally teaches the prior when to use hard routing vs soft blending, the fundamental compositional operation.

5.3 Sector β†’ Simplex Coordinate Function

The mapping from sector space to simplex space is a learnable function initialized as:

import torch
import torch.nn.functional as F

def sector_to_simplex(sector_xyz, scale, viewpoint, temperature=1.0):
    """
    Map 5×5×5 sector + metadata to pentachoron barycentric coordinates.

    Args:
        sector_xyz: (3,) integers in [0, 4]
        scale: float, relative object scale
        viewpoint: (3,) euler angles
        temperature: softmax sharpness (learnable in practice)

    Returns:
        (5,) barycentric coordinates summing to 1
    """
    # Normalize inputs to [0, 1]
    x, y, z = (torch.as_tensor(sector_xyz, dtype=torch.float32) / 4.0).tolist()
    s = torch.sigmoid(torch.tensor(float(scale))).item()
    vp = torch.as_tensor(viewpoint, dtype=torch.float32)
    v = torch.mean(vp / (vp.norm() + 1e-8)).item()

    # Initial mapping: one input feature per vertex
    raw = torch.tensor([x, y, z, s, v])

    # Softmax yields valid barycentric coordinates (non-negative, sum to 1)
    return F.softmax(raw / temperature, dim=0)

The temperature and any learned transformations are trained alongside the classifier.


6. Implementation Roadmap

Phase 1: Renderer + Data Pipeline [HIGH PRIORITY]

  • ModernGL offscreen renderer

    • EGL context setup (headless / Colab compatible)
    • Programmable camera with configurable FOV, near/far
    • 5×5×5 frustum sector calculation from camera params
    • Depth buffer extraction
    • Instance mask via unique-color rendering pass
    • Basic Phong/Lambert lighting with positioned lights
  • Primitive library

    • Sphere, cube, cylinder, cone, torus mesh generators
    • UV-mapped for future texture support
    • Per-primitive bounding box for sector assignment
  • Scene composer

    • Random object placement within specified sectors
    • Collision detection (prevent overlapping placements)
    • Automatic occlusion calculation from depth buffer
    • Scene graph generation (pairwise spatial relations)
    • Perspective-correct sector scaling
  • Label generator

    • Per-object label vector (§1.3)
    • Scene-level spatial relation graph
    • Caption generator (§3.2)
    • Simplex coordinate ground truth
  • Data pipeline

    • Parallel scene generation (multiprocess)
    • WebDataset / streaming format for large-scale
    • On-the-fly augmentation (sector permutations)
    • HuggingFace dataset upload integration

Phase 2: Geometric Classifier [HIGH PRIORITY]

  • Model architecture

    • Vision backbone selection (ViT-B/16 vs ResNet-50)
    • Sector prediction head (125-way classification per object)
    • Depth ordering head
    • Occlusion prediction head
    • Scene graph prediction head
  • Training pipeline

    • Multi-task loss with geometric regularization
    • Simplex constraint loss (CM validity on predictions)
    • Curriculum: primitives → composites → meshes
    • Evaluation metrics: sector accuracy, depth ordering accuracy, occlusion F1
  • Ablation studies

    • With vs without simplex constraint loss
    • Effect of vertex count (k=4, k=5, k=6)
    • Depth bucket resolution (3×3×3 vs 5×5×5 vs 7×7×7)

Phase 3: Geometric CLIP [MEDIUM PRIORITY]

  • Architecture

    • Vision encoder: frozen Stage 1 backbone + projection
    • Text encoder: initialize from OpenAI CLIP text encoder
    • Contrastive loss with hard negatives (spatial near-misses)
  • Training

    • Caption generation from scene labels
    • Hard negative mining (swap spatial relations in text)
    • Spatial preposition evaluation benchmark
    • Transfer evaluation: zero-shot spatial classification on real images
  • Integration with existing pipeline

    • Replace SD1.5 CLIP encoder with geometric CLIP
    • Measure impact on compositional generation
    • Compare geo_prior behavior with geometric vs standard CLIP

Phase 4: Transfer + Real-World Bridge [LOWER PRIORITY]

  • Domain transfer

    • Finetune geometric backbone on COCO with spatial annotations
    • Evaluate on spatial reasoning benchmarks (CLEVR, SpatialBench)
    • Test compositional generation improvement in SD1.5
  • Inverse embedding pipeline

    • Given real image → extract simplex coordinates
    • Use as conditioning signal for diffusion
    • Compare with CLIP-only conditioning
  • Hybrid encoder

    • Dual-stream: geometric backbone + CLIP
    • Learnable fusion on simplex
    • Evaluate on attribute binding + spatial composition jointly

7. Key Hypotheses to Validate

  1. Sector classification transfers to real images: A model trained entirely on synthetic 3D scenes can identify spatial sectors in photographs with >60% accuracy.

  2. Geometric CLIP improves compositional generation: Replacing standard CLIP with geometric CLIP in the SD1.5 pipeline produces measurably better spatial composition (evaluated via CLEVR-style spatial accuracy metrics).

  3. Simplex coordinates are a natural spatial language: The 5-vertex pentachoron provides sufficient dimensionality to encode the spatial relationships humans use in language (in front of, behind, above, below, next to, far away, etc.).

  4. Vertex entropy predicts spatial complexity: Scenes with more objects produce lower vertex entropy (hard routing), while single-object scenes produce higher entropy (attribute binding). This pattern, observed in the triad study, should emerge naturally from synthetic training.

  5. The 5×5×5 grid scales: The same sectorization works for both close-up tabletop scenes and panoramic landscapes by leveraging perspective scaling of sector volumes.


8. Technical Notes

8.1 Why 5×5×5?

  • 5 matches the pentachoron vertex count, giving a direct 1:1 axis-to-vertex mapping
  • 125 sectors is fine enough for meaningful spatial discrimination without being computationally prohibitive
  • 5 depth layers capture immediate foreground, near, mid, far, and background, matching natural perceptual depth bands
  • The number 5 appears consistently across the geometric vocabulary work (the k=4 simplex has 5 vertices, 5 edge dimensions, etc.)

8.2 Why Not Photorealistic Rendering?

  • Photorealism introduces texture/material confounds that obscure spatial signal
  • Simple rendering is 100-1000× faster, enabling much larger datasets
  • Transfer from simple→real is well-studied (sim2real in robotics)
  • The geometric prior should learn spatial structure independent of visual style; simple rendering enforces this
  • Phase 4 progressively adds visual complexity once spatial reasoning is established

8.3 Relationship to Existing Work

  • CLEVR: Similar synthetic scene approach but limited to 2D spatial relations. Our 5×5×5 grid adds depth, scale, and perspective.
  • NeRF / 3D Gaussians: Reconstruct 3D from 2D. We go the opposite direction: start with known 3D, teach networks to infer it from 2D.
  • Spatial transformers: Learn attention over spatial positions. Our approach provides explicit spatial supervision rather than hoping attention learns spatial structure.
  • Scene graphs: Prior work on scene graph prediction from images. Our contribution is grounding scene graphs in simplex geometry rather than abstract relation classification.

9. Resource Estimates

| Component | Compute | Storage |
|---|---|---|
| Tier 1 data generation (1M scenes) | 1 GPU, ~3 hours | ~50 GB |
| Tier 2 data generation (5M scenes) | 1 GPU, ~15 hours | ~250 GB |
| Classifier pretraining | 1 A100, ~24 hours | ~2 GB model |
| Geometric CLIP training | 1-4 A100s, ~48 hours | ~2 GB model |
| Transfer experiments | 1 A100, ~8 hours each | - |

10. Success Criteria

| Milestone | Metric | Target |
|---|---|---|
| Renderer works | Scenes/second | >100 on single GPU |
| Sector classifier | Top-1 sector accuracy (synthetic) | >90% |
| Depth ordering | Kendall's τ (synthetic) | >0.95 |
| Geometric CLIP | Spatial preposition accuracy (synthetic) | >85% |
| Real image transfer | Spatial sector accuracy (COCO) | >50% |
| Compositional generation | Spatial relation accuracy (SD1.5 + geometric CLIP) | >70% |
| Vertex entropy pattern | Matches triad study predictions | Qualitative match |
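Kendall's τ for the depth-ordering milestone can be computed directly from pairwise rank agreements; a minimal sketch assuming untied ranks and at least two objects:

```python
from itertools import combinations

def kendall_tau(true_order, pred_order):
    """Kendall's τ between two depth orderings of the same objects.

    Each argument maps object id -> front-to-back rank. τ = 1 means
    perfect agreement, -1 full reversal. Assumes no tied ranks and
    at least two objects.
    """
    ids = list(true_order)
    concordant = discordant = 0
    for a, b in combinations(ids, 2):
        t = true_order[a] - true_order[b]
        p = pred_order[a] - pred_order[b]
        if t * p > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)
```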