---
license: mit
---
# GeoSpatial Prior: Synthetic 3D → Geometric Substrate Training

# Incomplete Documentation

Claude provided this early draft, and it is full of problems that will need smoothing and refining.

The system will not behave exactly as Claude describes; expect multiple refactors and compromises along the way.

# Claude below


## Abstract

A system for teaching geometric spatial reasoning to neural networks by rendering deterministic 3D scenes where every spatial relationship — position, occlusion, depth, lighting direction, scale — maps directly to known simplex coordinates. Rather than inferring geometric structure from 2D pixel statistics, we construct ground-truth spatial labels from a sectorized 5×5×5 perspective volume and use those labels to pretrain both a geometric classifier and a geometric CLIP variant. The result is a transferable spatial reasoning backbone that can be finetuned into any vision model, providing compositional understanding that current models lack.

---

## 1. Core Concept: The 5×5×5 Perspective Volume

### 1.1 Sectorized Space

The viewing frustum is divided into a 5×5×5 grid of sectors:

- **X axis** (horizontal): 5 columns, left to right
- **Y axis** (vertical): 5 rows, bottom to top  
- **Z axis** (depth): 5 layers, near to far

This produces **125 sectors**, each representing a unique spatial address `(x, y, z)` where `x, y, z ∈ {0, 1, 2, 3, 4}`.

### 1.2 Perspective Scaling

Critical insight: sectors are not uniform cubes in world space. The perspective projection means:

- **Near sectors (z=0)**: Small world-space volume, large screen-space coverage. A coffee cup fills a near sector.
- **Far sectors (z=4)**: Enormous world-space volume, small screen-space coverage. A football stadium fits in a far sector.

Each sector's world-space dimensions scale with depth:

```
sector_width(z)  = 2 * z_distance * tan(fov_h / 2) / 5
sector_height(z) = 2 * z_distance * tan(fov_v / 2) / 5
sector_depth(z)  = (z_far - z_near) / 5
```

This means the same 5×5×5 grid naturally encodes both tabletop scenes (near sectors) and landscape vistas (far sectors) in a single unified coordinate system.
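
The formulas above can be sketched as a small helper (a minimal sketch, assuming a symmetric frustum, `fov` angles in radians, and depth layers of uniform world-space thickness; the function name is illustrative):

```python
import math

def sector_dims(z_index, fov_h, fov_v, z_near, z_far, grid=5):
    """World-space (width, height, depth) of one sector at depth layer z_index (0 = nearest)."""
    layer_depth = (z_far - z_near) / grid
    # Distance from the camera to the center of this depth layer
    z_distance = z_near + (z_index + 0.5) * layer_depth
    width = 2 * z_distance * math.tan(fov_h / 2) / grid
    height = 2 * z_distance * math.tan(fov_v / 2) / grid
    return width, height, layer_depth

# Near sectors are small in world space, far sectors are enormous:
near = sector_dims(0, math.radians(60), math.radians(45), 0.5, 100.0)
far = sector_dims(4, math.radians(60), math.radians(45), 0.5, 100.0)
```

Note that depth thickness is constant in world units under this scheme; only the cross-section grows with distance.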

### 1.3 Sector Labels

Every object placement generates a deterministic label vector:

| Label | Type | Description |
|-------|------|-------------|
| `sector_xyz` | (int, int, int) | Primary sector address |
| `sector_coverage` | list[(int,int,int)] | All sectors the object spans |
| `depth_order` | int | Front-to-back ordering among all objects |
| `occlusion_pct` | float | Fraction of the object occluded by nearer objects, in [0, 1] |
| `occluded_by` | list[int] | IDs of occluding objects |
| `screen_bbox` | (x1, y1, x2, y2) | Normalized screen-space bounding box |
| `relative_scale` | float | Object's screen size relative to its true size |
| `lighting_sector` | (int, int, int) | Sector of dominant light source |
| `shadow_direction` | (float, float) | Normalized shadow vector on ground plane |
| `viewing_angle` | (float, float, float) | Object's rotation relative to camera |
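
For concreteness, one object's label set might look like the following (all values are illustrative, not from a real render):

```python
# One object's ground-truth label vector, as produced by the scene composer.
example_label = {
    "sector_xyz": (2, 3, 1),                     # primary sector address
    "sector_coverage": [(2, 3, 1), (3, 3, 1)],   # object spans two columns
    "depth_order": 0,                            # frontmost object in the scene
    "occlusion_pct": 0.0,                        # fraction occluded, in [0, 1]
    "occluded_by": [],                           # no nearer object covers it
    "screen_bbox": (0.41, 0.18, 0.63, 0.47),     # normalized (x1, y1, x2, y2)
    "relative_scale": 0.8,
    "lighting_sector": (4, 4, 2),                # dominant light's sector
    "shadow_direction": (-0.71, -0.71),          # normalized ground-plane vector
    "viewing_angle": (0.0, 0.52, 0.0),           # radians, relative to camera
}
```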

### 1.4 Simplex Coordinate Mapping

The pentachoron's 5 vertices map to spatial dimensions:

| Vertex | Spatial Meaning |
|--------|----------------|
| v₀ | Horizontal position (x) |
| v₁ | Vertical position (y) |
| v₂ | Depth / distance (z) |
| v₃ | Scale / size relationship |
| v₄ | Viewpoint / rotation encoding |

An object at sector (2, 3, 1) with moderate scale and frontal view maps to a specific barycentric coordinate on the simplex. This mapping is **defined**, not learned — the training teaches the network to predict these coordinates from pixels.

---

## 2. Rendering Pipeline

### 2.1 Requirements

- **Speed**: Must generate millions of training pairs. Target: 100+ scenes/second on a single GPU.
- **Geometric precision**: Exact depth buffers, clean occlusion boundaries, mathematically correct perspective.
- **Visual diversity**: Varied lighting, materials, and object complexity to prevent shortcut learning.
- **NOT photorealism**: The signal is spatial structure, not texture fidelity.

### 2.2 Renderer Selection

**Primary: ModernGL (OpenGL via Python)**

- GPU-accelerated, 200+ FPS for simple scenes
- Exact depth buffer access via framebuffer objects
- Programmable shaders for lighting control
- Clean Python API, Colab-compatible via EGL offscreen

**Fallback: Analytical raymarcher (PyTorch-native)**

- Zero external dependencies
- Every pixel's depth/normal is mathematically exact
- Slower but fully differentiable if needed
- Good for validation / ground truth comparison

### 2.3 Scene Composition Strategy

Each rendered scene is a configuration of:

1. **Camera**: Position, FOV, near/far planes → defines the 5×5×5 frustum
2. **Objects**: 1–8 primitive or composite objects placed in specific sectors
3. **Lighting**: 1–3 lights placed at known sector positions
4. **Background**: Solid color, gradient, or simple environment for depth contrast

Object types (progressive complexity):

| Phase | Objects | Purpose |
|-------|---------|---------|
| Phase 1 | Geometric primitives (sphere, cube, cylinder, cone, torus) | Pure spatial reasoning, no semantic content |
| Phase 2 | Composite primitives (chair = cubes + cylinders, table = box + legs) | Multi-part spatial binding |
| Phase 3 | Low-poly meshes (human figure, car, tree, building) | Scale-appropriate object recognition |
| Phase 4 | Textured meshes with varied materials | Material-independent spatial reasoning |

### 2.4 Scene Generation Parameters

```
SceneConfig:
  n_objects:        randint(1, 8)
  camera_fov:       uniform(40°, 90°)
  camera_distance:  log_uniform(2, 100)   # controls near/far content
  lighting:
    n_lights:       randint(1, 3)
    light_sectors:  random sectors from 5×5×5
    light_colors:   random warm/cool/neutral
    ambient:        uniform(0.1, 0.4)
  objects[i]:
    type:           random from phase vocabulary
    sector:         random (x, y, z)
    local_rotation: random euler angles
    scale_factor:   uniform(0.5, 2.0) × sector_appropriate_base
    material:       random (diffuse color, roughness)
```
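
A hedged sketch of a sampler for these parameters (field names and the `vocab` default are illustrative; the real config would add materials, per-phase vocabularies, and sector-appropriate base scales):

```python
import math
import random

def sample_scene_config(vocab=("sphere", "cube", "cylinder", "cone", "torus")):
    """Sample one SceneConfig per the parameter table above."""
    n_objects = random.randint(1, 8)
    n_lights = random.randint(1, 3)
    return {
        "n_objects": n_objects,
        "camera_fov": random.uniform(40.0, 90.0),        # degrees
        # log-uniform over [2, 100]: controls near/far content
        "camera_distance": math.exp(random.uniform(math.log(2.0), math.log(100.0))),
        "lighting": {
            "n_lights": n_lights,
            "light_sectors": [tuple(random.randrange(5) for _ in range(3))
                              for _ in range(n_lights)],
            "ambient": random.uniform(0.1, 0.4),
        },
        "objects": [{
            "type": random.choice(vocab),
            "sector": tuple(random.randrange(5) for _ in range(3)),
            "local_rotation": tuple(random.uniform(0.0, 2 * math.pi) for _ in range(3)),
            "scale_factor": random.uniform(0.5, 2.0),    # × sector_appropriate_base
        } for _ in range(n_objects)],
    }
```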

### 2.5 Output Per Scene

| Output | Shape | Description |
|--------|-------|-------------|
| `rgb` | (H, W, 3) | Rendered color image |
| `depth` | (H, W, 1) | Linear depth buffer |
| `normals` | (H, W, 3) | Surface normals |
| `instance_mask` | (H, W, 1) | Per-object instance segmentation |
| `sector_map` | (H, W, 3) | Per-pixel sector assignment |
| `labels` | dict | Full label set per object (§1.3) |
| `scene_graph` | list[dict] | Complete spatial relationships between all objects |
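
The per-pixel `sector_map` follows directly from the depth buffer and screen coordinates. A sketch (assuming a linear depth buffer in world units and row 0 at the top of the image; the function name is illustrative):

```python
import numpy as np

def sector_map_from_depth(depth, z_near, z_far, grid=5):
    """Per-pixel (x, y, z) sector assignment from an (H, W) linear depth buffer."""
    h, w = depth.shape
    u = (np.arange(w) + 0.5) / w    # normalized screen x, left to right
    v = (np.arange(h) + 0.5) / h    # normalized screen y, 0 = top row
    sx = np.minimum((u * grid).astype(int), grid - 1)
    sy = np.minimum(((1.0 - v) * grid).astype(int), grid - 1)  # flip: rows count bottom-to-top
    sz = np.clip(((depth - z_near) / (z_far - z_near) * grid).astype(int), 0, grid - 1)
    out = np.empty((h, w, 3), dtype=np.int64)
    out[..., 0] = sx[None, :]
    out[..., 1] = sy[:, None]
    out[..., 2] = sz
    return out
```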

---

## 3. Training Architecture

### 3.1 Stage 1: Geometric Spatial Classifier

**Input**: RGB image (rendered scene)  
**Output**: Per-object sector predictions, depth ordering, occlusion graph

Architecture:

```
Image (512×512) 
  → Vision backbone (ViT-B/16 or ResNet-50)
  → Sector prediction head: 
      For each detected object:
        - Sector classification: (5×5×5) = 125-way softmax
        - Depth order: scalar regression
        - Occlusion: binary matrix (who occludes whom)
        - Scale: relative size regression
        - Viewing angle: (θ, φ, ψ) regression
  → Scene graph head:
      For each object pair:
        - Spatial relation: {in_front, behind, left, right, above, below, overlapping}
        - Distance in sector space
```

**Loss function**:

```
L_total = L_sector + λ₁·L_depth + λ₂·L_occlusion + λ₃·L_scale + λ₄·L_scene_graph
```

Where `L_sector` includes a geometric component that penalizes predictions that violate simplex constraints (e.g., predicted sector coordinates must form valid configurations on the pentachoron).

**Training data**: 1M–10M rendered scenes with exact labels.
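
As a hedged sketch, the multi-task loss might be combined as follows (head names, tensor shapes, and the λ defaults are placeholders; the simplex-constraint component of `L_sector` is omitted here):

```python
import torch
import torch.nn.functional as F

def total_loss(pred, target, lambdas=(1.0, 0.5, 0.5, 0.25)):
    """Combine the §3.1 terms: L_sector + λ₁·L_depth + λ₂·L_occlusion + λ₃·L_scale + λ₄·L_scene_graph."""
    l_sector = F.cross_entropy(pred["sector_logits"], target["sector_idx"])      # 125-way
    l_depth = F.mse_loss(pred["depth_order"], target["depth_order"])             # ordering regression
    l_occ = F.binary_cross_entropy_with_logits(pred["occlusion"], target["occlusion"])
    l_scale = F.mse_loss(pred["scale"], target["scale"])
    l_graph = F.cross_entropy(pred["relation_logits"], target["relation_idx"])   # 7 spatial relations
    l1, l2, l3, l4 = lambdas
    return l_sector + l1 * l_depth + l2 * l_occ + l3 * l_scale + l4 * l_graph
```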

### 3.2 Stage 2: Geometric CLIP

Take the pretrained spatial backbone from Stage 1 and use it as the vision encoder for a CLIP-style contrastive model.

**Vision encoder**: Stage 1 backbone (frozen or lightly finetuned)  
**Text encoder**: Standard transformer, initialized from existing CLIP text encoder  
**Contrastive target**: Align image embeddings with text descriptions that include spatial language

Training captions are generated from scene labels:

```python
def scene_to_caption(labels):
    """Generate spatial text from ground-truth labels."""
    parts = []
    for obj in labels["objects"]:
        # Position
        x, y, z = obj["sector_xyz"]
        h_pos = ["far left", "left", "center", "right", "far right"][x]
        v_pos = ["bottom", "lower", "middle", "upper", "top"][y]
        d_pos = ["very close", "near", "middle distance", "far", "very far"][z]
        
        parts.append(f"a {obj['type']} at {h_pos} {v_pos}, {d_pos}")
        
        # Occlusion (occlusion_pct is a fraction in [0, 1])
        if obj["occlusion_pct"] > 0.2:
            occluder = labels["objects"][obj["occluded_by"][0]]["type"]
            parts.append(f"partially behind the {occluder}")
        
        # Scale context
        if z >= 3:
            parts.append("appearing small in the distance")
    
    # Spatial relations
    for rel in labels["scene_graph"]:
        parts.append(f"the {rel['obj_a']} is {rel['relation']} the {rel['obj_b']}")
    
    return ", ".join(parts)
```

**Key insight**: The geometric CLIP doesn't just learn "cat" ↔ image-of-cat. It learns "cat at upper-right, far distance, partially behind a tree" ↔ specific simplex configuration. Spatial prepositions become geometric operations, not statistical associations.

### 3.3 Stage 3: Transfer to Real Images

The pretrained geometric backbone transfers to real-world tasks:

1. **Direct finetuning**: Replace the geometric CLIP's vision encoder in SD1.5's conditioning path. Now "cup on top of book" activates specific simplex configurations that were grounded in actual 3D relationships.

2. **Inverse embedding**: Given a real image, extract simplex coordinates that describe its spatial structure. These become geometric conditioning signals for diffusion models.

3. **Hybrid**: Use the geometric backbone as an auxiliary encoder alongside standard CLIP. The geometric channel provides spatial structure; CLIP provides semantic content. The geo_prior blends them on the simplex.

---

## 4. Dataset Scaling Strategy

### 4.1 Procedural Generation Tiers

| Tier | Scenes | Resolution | Objects | Purpose |
|------|--------|------------|---------|---------|
| Tier 1 | 1M | 256×256 | Primitives only | Fast pretraining of spatial reasoning |
| Tier 2 | 5M | 512×512 | Composites + varied lighting | Full spatial classifier training |
| Tier 3 | 10M | 512×512 | Low-poly meshes + textures | Geometric CLIP pretraining |
| Tier 4 | 1M | 512×512 | Complex scenes + real textures | Bridge to photorealism |

### 4.2 Augmentation via Sector Permutation

Because the 5×5×5 grid is symmetric, each scene yields several augmented copies essentially for free (the operations below generate the eight dihedral symmetries of the image plane):

- Horizontal flip: maps sector (x, y, z) → (4-x, y, z)
- Vertical flip: maps sector (x, y, z) → (x, 4-y, z)
- 90° rotation: maps sector (x, y, z) → (y, 4-x, z)

Labels transform deterministically with the augmentation.
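
The sector transforms above, as a small sketch (function name illustrative):

```python
def augment_sector(sector_xyz, op):
    """Transform a sector address under one of the image-plane symmetries above."""
    x, y, z = sector_xyz
    if op == "hflip":   # horizontal flip: (x, y, z) -> (4-x, y, z)
        return (4 - x, y, z)
    if op == "vflip":   # vertical flip: (x, y, z) -> (x, 4-y, z)
        return (x, 4 - y, z)
    if op == "rot90":   # 90° rotation: (x, y, z) -> (y, 4-x, z)
        return (y, 4 - x, z)
    raise ValueError(f"unknown op: {op}")
```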

### 4.3 Hard Example Mining

After initial training, identify failure modes:

- Sectors with high confusion rates (e.g., depth ordering at similar z-values)
- Occlusion patterns the model struggles with
- Scale ambiguities (large far objects vs small near objects)

Regenerate scenes specifically targeting these failure modes.

---

## 5. Simplex Integration

### 5.1 Geometric Loss During Pretraining

The existing Cayley-Menger + volume preservation loss from the geo_prior applies directly:

- **CM validity**: Ensures predicted sector coordinates form valid configurations on the pentachoron
- **Volume preservation**: Prevents collapse — all 5 spatial dimensions must remain discriminable
- **Edge regularity**: Maintains uniform spacing between spatial concepts
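
As one concrete instance of the CM-validity check, a simplex's squared volume can be computed from pairwise vertex distances via the Cayley-Menger determinant; near-zero output signals a collapsed configuration. This is a sketch, not the geo_prior's actual loss implementation:

```python
import math
import torch

def cm_squared_volume(points):
    """Squared volume of the simplex spanned by `points` ((n, d) tensor of n vertices),
    via the Cayley-Menger determinant. Near-zero output = degenerate simplex."""
    n = points.shape[0]
    k = n - 1                                    # simplex dimension
    d2 = torch.cdist(points, points) ** 2        # pairwise squared distances
    cm = torch.ones(n + 1, n + 1, dtype=points.dtype)
    cm[0, 0] = 0.0
    cm[1:, 1:] = d2
    # vol² = (-1)^(k+1) / (2^k (k!)²) · det(CM)
    coef = (-1) ** (k + 1) / (2 ** k * math.factorial(k) ** 2)
    return coef * torch.det(cm)
```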

### 5.2 Vertex Assignment as Spatial Binding

The vertex weight entropy findings from the triad study directly inform the architecture:

- **Multi-object scenes** (like object-relations): Should produce LOW vertex entropy — hard routing of objects to separate vertices
- **Single-object attribute scenes** (like characters): Should produce HIGH vertex entropy — soft blending of attributes on shared vertices
- **The 5×5×5 scenes will produce BOTH patterns** depending on object count and spatial configuration

This means the pretraining naturally teaches the prior when to use hard routing vs soft blending — the fundamental compositional operation.
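
Vertex-weight entropy, the quantity referenced here, is just the Shannon entropy of each object's vertex-assignment distribution (a minimal sketch, assuming nonnegative `weights` over the 5 vertices):

```python
import torch

def vertex_entropy(weights, eps=1e-8):
    """Shannon entropy of vertex-assignment weights along the last dim.
    Low entropy = hard routing to one vertex; high entropy = soft blending."""
    p = weights / weights.sum(dim=-1, keepdim=True).clamp_min(eps)
    return -(p * (p + eps).log()).sum(dim=-1)
```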

### 5.3 Sector → Simplex Coordinate Function

The mapping from sector space to simplex space is a learnable function initialized as:

```python
import torch
import torch.nn.functional as F

def sector_to_simplex(sector_xyz, scale, viewpoint, temperature=1.0):
    """
    Map 5×5×5 sector + metadata to pentachoron barycentric coordinates.
    
    Args:
        sector_xyz: (3,) integers in [0, 4]
        scale: float, relative object scale
        viewpoint: (3,) euler angles in radians
        temperature: softmax temperature, trained alongside the classifier
    
    Returns:
        (5,) barycentric coordinates summing to 1
    """
    # Normalize sector indices to [0, 1]
    x, y, z = (torch.as_tensor(sector_xyz, dtype=torch.float32) / 4.0).unbind()
    
    # Squash scale and collapse the viewing angles to single [0, 1] scalars
    s = torch.sigmoid(torch.as_tensor(float(scale)))
    v = (torch.as_tensor(viewpoint, dtype=torch.float32) % (2 * torch.pi)).mean() / (2 * torch.pi)
    
    # Initial mapping: spread across the 5 vertices
    raw = torch.stack([x, y, z, s, v])
    
    # Softmax yields valid barycentric coordinates (positive, summing to 1)
    return F.softmax(raw / temperature, dim=0)
```

The temperature and any learned transformations are trained alongside the classifier.

---

## 6. Implementation Roadmap

### Phase 1: Renderer + Data Pipeline [HIGH PRIORITY]

- [ ] **ModernGL offscreen renderer**
  - [ ] EGL context setup (headless / Colab compatible)
  - [ ] Programmable camera with configurable FOV, near/far
  - [ ] 5×5×5 frustum sector calculation from camera params
  - [ ] Depth buffer extraction
  - [ ] Instance mask via unique-color rendering pass
  - [ ] Basic Phong/Lambert lighting with positioned lights

- [ ] **Primitive library**
  - [ ] Sphere, cube, cylinder, cone, torus mesh generators
  - [ ] UV-mapped for future texture support
  - [ ] Per-primitive bounding box for sector assignment

- [ ] **Scene composer**
  - [ ] Random object placement within specified sectors
  - [ ] Collision detection (prevent overlapping placements)
  - [ ] Automatic occlusion calculation from depth buffer
  - [ ] Scene graph generation (pairwise spatial relations)
  - [ ] Perspective-correct sector scaling

- [ ] **Label generator**
  - [ ] Per-object label vector (§1.3)
  - [ ] Scene-level spatial relation graph
  - [ ] Caption generator (§3.2)
  - [ ] Simplex coordinate ground truth

- [ ] **Data pipeline**
  - [ ] Parallel scene generation (multiprocess)
  - [ ] WebDataset / streaming format for large-scale
  - [ ] On-the-fly augmentation (sector permutations)
  - [ ] HuggingFace dataset upload integration

### Phase 2: Geometric Classifier [HIGH PRIORITY]

- [ ] **Model architecture**
  - [ ] Vision backbone selection (ViT-B/16 vs ResNet-50)
  - [ ] Sector prediction head (125-way classification per object)
  - [ ] Depth ordering head
  - [ ] Occlusion prediction head
  - [ ] Scene graph prediction head

- [ ] **Training pipeline**
  - [ ] Multi-task loss with geometric regularization
  - [ ] Simplex constraint loss (CM validity on predictions)
  - [ ] Curriculum: primitives → composites → meshes
  - [ ] Evaluation metrics: sector accuracy, depth ordering accuracy, occlusion F1

- [ ] **Ablation studies**
  - [ ] With vs without simplex constraint loss
  - [ ] Effect of vertex count (k=4, k=5, k=6)
  - [ ] Depth bucket resolution (3×3×3 vs 5×5×5 vs 7×7×7)

### Phase 3: Geometric CLIP [MEDIUM PRIORITY]

- [ ] **Architecture**
  - [ ] Vision encoder: frozen Stage 1 backbone + projection
  - [ ] Text encoder: initialize from OpenAI CLIP text encoder
  - [ ] Contrastive loss with hard negatives (spatial near-misses)

- [ ] **Training**
  - [ ] Caption generation from scene labels
  - [ ] Hard negative mining (swap spatial relations in text)
  - [ ] Spatial preposition evaluation benchmark
  - [ ] Transfer evaluation: zero-shot spatial classification on real images

- [ ] **Integration with existing pipeline**
  - [ ] Replace SD1.5 CLIP encoder with geometric CLIP
  - [ ] Measure impact on compositional generation
  - [ ] Compare geo_prior behavior with geometric vs standard CLIP

### Phase 4: Transfer + Real-World Bridge [LOWER PRIORITY]

- [ ] **Domain transfer**
  - [ ] Finetune geometric backbone on COCO with spatial annotations
  - [ ] Evaluate on spatial reasoning benchmarks (CLEVR, SpatialBench)
  - [ ] Test compositional generation improvement in SD1.5

- [ ] **Inverse embedding pipeline**
  - [ ] Given real image → extract simplex coordinates
  - [ ] Use as conditioning signal for diffusion
  - [ ] Compare with CLIP-only conditioning

- [ ] **Hybrid encoder**
  - [ ] Dual-stream: geometric backbone + CLIP
  - [ ] Learnable fusion on simplex
  - [ ] Evaluate on attribute binding + spatial composition jointly

---

## 7. Key Hypotheses to Validate

1. **Sector classification transfers to real images**: A model trained entirely on synthetic 3D scenes can identify spatial sectors in photographs with >60% accuracy.

2. **Geometric CLIP improves compositional generation**: Replacing standard CLIP with geometric CLIP in the SD1.5 pipeline produces measurably better spatial composition (evaluated via CLEVR-style spatial accuracy metrics).

3. **Simplex coordinates are a natural spatial language**: The 5-vertex pentachoron provides sufficient dimensionality to encode the spatial relationships humans use in language (in front of, behind, above, below, next to, far away, etc.).

4. **Vertex entropy predicts spatial complexity**: Scenes with more objects produce lower vertex entropy (hard routing), while single-object scenes produce higher entropy (attribute binding). This pattern, observed in the triad study, should emerge naturally from synthetic training.

5. **The 5×5×5 grid scales**: The same sectorization works for both close-up tabletop scenes and panoramic landscapes by leveraging perspective scaling of sector volumes.

---

## 8. Technical Notes

### 8.1 Why 5×5×5?

- 5 matches the pentachoron vertex count — direct 1:1 axis-to-vertex mapping
- 125 sectors is fine enough for meaningful spatial discrimination without being computationally prohibitive
- 5 depth layers capture: immediate foreground, near, mid, far, background — matching natural perceptual depth bands
- The number 5 appears consistently across the geometric vocabulary work (the k=4 simplex, the pentachoron, has 5 vertices and 5 tetrahedral cells)

### 8.2 Why Not Photorealistic Rendering?

- Photorealism introduces texture/material confounds that obscure spatial signal
- Simple rendering is 100-1000× faster, enabling much larger datasets
- Transfer from simple→real is well-studied (sim2real in robotics)
- The geometric prior should learn spatial structure independent of visual style — using simple rendering enforces this
- Phase 4 progressively adds visual complexity once spatial reasoning is established

### 8.3 Relationship to Existing Work

- **CLEVR**: Similar synthetic scene approach but limited to 2D spatial relations. Our 5×5×5 grid adds depth, scale, and perspective.
- **NeRF / 3D Gaussians**: Reconstruct 3D from 2D. We go the opposite direction — start with known 3D, teach networks to infer it from 2D.
- **Spatial transformers**: Learn attention over spatial positions. Our approach provides explicit spatial supervision rather than hoping attention learns spatial structure.
- **Scene graphs**: Prior work on scene graph prediction from images. Our contribution is grounding scene graphs in simplex geometry rather than abstract relation classification.

---

## 9. Resource Estimates

| Component | Compute | Storage |
|-----------|---------|---------|
| Tier 1 data generation (1M scenes) | 1 GPU, ~3 hours | ~50 GB |
| Tier 2 data generation (5M scenes) | 1 GPU, ~15 hours | ~250 GB |
| Classifier pretraining | 1 A100, ~24 hours | ~2 GB model |
| Geometric CLIP training | 1-4 A100s, ~48 hours | ~2 GB model |
| Transfer experiments | 1 A100, ~8 hours each | - |

---

## 10. Success Criteria

| Milestone | Metric | Target |
|-----------|--------|--------|
| Renderer works | Scenes/second | >100 on single GPU |
| Sector classifier | Top-1 sector accuracy (synthetic) | >90% |
| Depth ordering | Kendall's τ (synthetic) | >0.95 |
| Geometric CLIP | Spatial preposition accuracy (synthetic) | >85% |
| Real image transfer | Spatial sector accuracy (COCO) | >50% |
| Compositional generation | Spatial relation accuracy (SD1.5 + geometric CLIP) | >70% |
| Vertex entropy pattern | Matches triad study predictions | Qualitative match |