|
|
---
license: mit
---
|
|
# GeoSpatial Prior: Synthetic 3D → Geometric Substrate Training |
|
|
|
|
|
# Incomplete Documentation |
|
|
|
|
|
Claude provided an early draft of this document, and it is full of problems that will need smoothing and refining.
|
|
|
|
|
The system will not behave exactly as Claude describes it, and there will be multiple refactors and compromises along the way.
|
|
|
|
|
# Claude's draft below
|
|
|
|
|
|
|
|
## Abstract |
|
|
|
|
|
A system for teaching geometric spatial reasoning to neural networks by rendering deterministic 3D scenes where every spatial relationship — position, occlusion, depth, lighting direction, scale — maps directly to known simplex coordinates. Rather than inferring geometric structure from 2D pixel statistics, we construct ground-truth spatial labels from a sectorized 5×5×5 perspective volume and use those labels to pretrain both a geometric classifier and a geometric CLIP variant. The result is a transferable spatial reasoning backbone that can be finetuned into any vision model, providing compositional understanding that current models lack. |
|
|
|
|
|
--- |
|
|
|
|
|
## 1. Core Concept: The 5×5×5 Perspective Volume |
|
|
|
|
|
### 1.1 Sectorized Space |
|
|
|
|
|
The viewing frustum is divided into a 5×5×5 grid of sectors: |
|
|
|
|
|
- **X axis** (horizontal): 5 columns, left to right |
|
|
- **Y axis** (vertical): 5 rows, bottom to top |
|
|
- **Z axis** (depth): 5 layers, near to far |
|
|
|
|
|
This produces **125 sectors**, each representing a unique spatial address `(x, y, z)` where `x, y, z ∈ {0, 1, 2, 3, 4}`. |
|
|
|
|
|
### 1.2 Perspective Scaling |
|
|
|
|
|
Critical insight: sectors are not uniform cubes in world space. The perspective projection means: |
|
|
|
|
|
- **Near sectors (z=0)**: Small world-space volume, large screen-space coverage. A coffee cup fills a near sector. |
|
|
- **Far sectors (z=4)**: Enormous world-space volume, small screen-space coverage. A football stadium fits in a far sector. |
|
|
|
|
|
Each sector's world-space dimensions scale with depth: |
|
|
|
|
|
``` |
|
|
sector_width(z) = 2 * z_distance * tan(fov_h / 2) / 5 |
|
|
sector_height(z) = 2 * z_distance * tan(fov_v / 2) / 5 |
|
|
sector_depth(z) = (z_far - z_near) / 5 |
|
|
``` |
|
|
|
|
|
This means the same 5×5×5 grid naturally encodes both tabletop scenes (near sectors) and landscape vistas (far sectors) in a single unified coordinate system. |
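Going the other way, assigning a camera-space point to its sector follows directly from these formulas. A minimal sketch, assuming camera coordinates with +z pointing into the scene and a symmetric frustum; `world_to_sector` is an illustrative helper, not an existing API:

```python
import numpy as np

def world_to_sector(p_cam, fov_h, fov_v, z_near, z_far, n=5):
    """
    Assign a camera-space point to a sector of the n×n×n frustum grid.
    p_cam: (3,) point in camera coordinates, +z pointing into the scene.
    Returns (x, y, z) sector indices in [0, n-1], or None if outside the frustum.
    """
    px, py, pz = p_cam
    if not (z_near <= pz <= z_far):
        return None
    # Frustum half-extents at this depth (matches the sector_width/height formulas above)
    half_w = pz * np.tan(fov_h / 2)
    half_h = pz * np.tan(fov_v / 2)
    # Normalize each axis to [0, 1], then bucket into n bins
    u = (px + half_w) / (2 * half_w)
    v = (py + half_h) / (2 * half_h)
    w = (pz - z_near) / (z_far - z_near)
    if not (0 <= u <= 1 and 0 <= v <= 1):
        return None
    to_bin = lambda t: min(int(t * n), n - 1)
    return (to_bin(u), to_bin(v), to_bin(w))
```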
|
|
|
|
|
### 1.3 Sector Labels |
|
|
|
|
|
Every object placement generates a deterministic label vector: |
|
|
|
|
|
| Label | Type | Description | |
|
|
|-------|------|-------------| |
|
|
| `sector_xyz` | (int, int, int) | Primary sector address | |
|
|
| `sector_coverage` | list[(int,int,int)] | All sectors the object spans | |
|
|
| `depth_order` | int | Front-to-back ordering among all objects | |
|
|
| `occlusion_pct` | float | Percentage of object occluded by nearer objects | |
|
|
| `occluded_by` | list[int] | IDs of occluding objects | |
|
|
| `screen_bbox` | (x1, y1, x2, y2) | Normalized screen-space bounding box | |
|
|
| `relative_scale` | float | Object's screen size relative to its true size | |
|
|
| `lighting_sector` | (int, int, int) | Sector of dominant light source | |
|
|
| `shadow_direction` | (float, float) | Normalized shadow vector on ground plane | |
|
|
| `viewing_angle` | (float, float, float) | Object's rotation relative to camera | |
|
|
|
|
|
### 1.4 Simplex Coordinate Mapping |
|
|
|
|
|
The pentachoron's 5 vertices map to spatial dimensions: |
|
|
|
|
|
| Vertex | Spatial Meaning | |
|
|
|--------|----------------| |
|
|
| v₀ | Horizontal position (x) | |
|
|
| v₁ | Vertical position (y) | |
|
|
| v₂ | Depth / distance (z) | |
|
|
| v₃ | Scale / size relationship | |
|
|
| v₄ | Viewpoint / rotation encoding | |
|
|
|
|
|
An object at sector (2, 3, 1) with moderate scale and frontal view maps to a specific barycentric coordinate on the simplex. This mapping is **defined**, not learned: training teaches the network to predict these coordinates from pixels (§5.3 gives the initialization of this mapping).
|
|
|
|
|
--- |
|
|
|
|
|
## 2. Rendering Pipeline |
|
|
|
|
|
### 2.1 Requirements |
|
|
|
|
|
- **Speed**: Must generate millions of training pairs. Target: 100+ scenes/second on a single GPU. |
|
|
- **Geometric precision**: Exact depth buffers, clean occlusion boundaries, mathematically correct perspective. |
|
|
- **Visual diversity**: Varied lighting, materials, and object complexity to prevent shortcut learning. |
|
|
- **NOT photorealism**: The signal is spatial structure, not texture fidelity. |
|
|
|
|
|
### 2.2 Renderer Selection |
|
|
|
|
|
**Primary: ModernGL (OpenGL via Python)** |
|
|
|
|
|
- GPU-accelerated, 200+ FPS for simple scenes |
|
|
- Exact depth buffer access via framebuffer objects |
|
|
- Programmable shaders for lighting control |
|
|
- Clean Python API, Colab-compatible via EGL offscreen (see the sketch below)
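A minimal sketch of the offscreen setup, following the documented moderngl headless-EGL path; the actual render step is elided, and the depth-texture read layout is an assumption worth verifying:

```python
import numpy as np
import moderngl

SIZE = (512, 512)

# Headless EGL context (documented moderngl path for Colab-style hosts);
# drop `backend="egl"` to use the platform default windowing system.
ctx = moderngl.create_context(standalone=True, backend="egl")
ctx.enable(moderngl.DEPTH_TEST)

color = ctx.texture(SIZE, components=4)
depth = ctx.depth_texture(SIZE)
fbo = ctx.framebuffer(color_attachments=[color], depth_attachment=depth)
fbo.use()
fbo.clear(0.1, 0.1, 0.1, 1.0)

# ... compile shaders, bind VAOs, and draw the scene's objects here ...

# Read back color (uint8 RGB) and the raw depth buffer (float32 in [0, 1])
rgb = np.frombuffer(fbo.read(components=3), dtype=np.uint8).reshape(SIZE[1], SIZE[0], 3)
zbuf = np.frombuffer(depth.read(), dtype=np.float32).reshape(SIZE[1], SIZE[0])

def linearize_depth(z_buf, z_near, z_far):
    """Map the nonlinear [0, 1] depth buffer back to linear eye-space distance."""
    z_ndc = 2.0 * z_buf - 1.0
    return 2.0 * z_near * z_far / (z_far + z_near - z_ndc * (z_far - z_near))
```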
|
|
|
|
|
**Fallback: Analytical raymarcher (PyTorch-native)** |
|
|
|
|
|
- Zero external dependencies |
|
|
- Every pixel's depth/normal is mathematically exact |
|
|
- Slower but fully differentiable if needed |
|
|
- Good for validation / ground truth comparison |
|
|
|
|
|
### 2.3 Scene Composition Strategy |
|
|
|
|
|
Each rendered scene is a configuration of: |
|
|
|
|
|
1. **Camera**: Position, FOV, near/far planes → defines the 5×5×5 frustum |
|
|
2. **Objects**: 1–8 primitive or composite objects placed in specific sectors |
|
|
3. **Lighting**: 1–3 lights placed at known sector positions |
|
|
4. **Background**: Solid color, gradient, or simple environment for depth contrast |
|
|
|
|
|
Object types (progressive complexity): |
|
|
|
|
|
| Phase | Objects | Purpose | |
|
|
|-------|---------|---------| |
|
|
| Phase 1 | Geometric primitives (sphere, cube, cylinder, cone, torus) | Pure spatial reasoning, no semantic content | |
|
|
| Phase 2 | Composite primitives (chair = cubes + cylinders, table = box + legs) | Multi-part spatial binding | |
|
|
| Phase 3 | Low-poly meshes (human figure, car, tree, building) | Scale-appropriate object recognition | |
|
|
| Phase 4 | Textured meshes with varied materials | Material-independent spatial reasoning | |
|
|
|
|
|
### 2.4 Scene Generation Parameters |
|
|
|
|
|
``` |
|
|
SceneConfig: |
|
|
n_objects: randint(1, 8) |
|
|
camera_fov: uniform(40°, 90°) |
|
|
camera_distance: log_uniform(2, 100) # controls near/far content |
|
|
lighting: |
|
|
n_lights: randint(1, 3) |
|
|
light_sectors: random sectors from 5×5×5 |
|
|
light_colors: random warm/cool/neutral |
|
|
ambient: uniform(0.1, 0.4) |
|
|
objects[i]: |
|
|
type: random from phase vocabulary |
|
|
sector: random (x, y, z) |
|
|
local_rotation: random euler angles |
|
|
scale_factor: uniform(0.5, 2.0) × sector_appropriate_base |
|
|
material: random (diffuse color, roughness) |
|
|
``` |
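A runnable counterpart to this spec; a minimal sketch in which `sample_scene_config` and its dict layout are hypothetical stand-ins for the eventual SceneConfig implementation:

```python
import math
import random

PRIMITIVES = ("sphere", "cube", "cylinder", "cone", "torus")  # Phase 1 vocabulary

def sample_scene_config(vocab=PRIMITIVES):
    """Draw one scene configuration from the distributions above."""
    log_uniform = lambda lo, hi: math.exp(random.uniform(math.log(lo), math.log(hi)))
    rand_sector = lambda: tuple(random.randrange(5) for _ in range(3))
    n_lights = random.randint(1, 3)
    return {
        "camera_fov": random.uniform(40.0, 90.0),     # degrees
        "camera_distance": log_uniform(2.0, 100.0),   # controls near/far content
        "lighting": {
            "n_lights": n_lights,
            "light_sectors": [rand_sector() for _ in range(n_lights)],
            "light_colors": [random.choice(("warm", "cool", "neutral")) for _ in range(n_lights)],
            "ambient": random.uniform(0.1, 0.4),
        },
        "objects": [
            {
                "type": random.choice(vocab),
                "sector": rand_sector(),
                "local_rotation": tuple(random.uniform(0, 2 * math.pi) for _ in range(3)),
                "scale_factor": random.uniform(0.5, 2.0),  # × sector-appropriate base
                "material": {
                    "diffuse": tuple(random.random() for _ in range(3)),
                    "roughness": random.random(),
                },
            }
            for _ in range(random.randint(1, 8))
        ],
    }
```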
|
|
|
|
|
### 2.5 Output Per Scene |
|
|
|
|
|
| Output | Shape | Description | |
|
|
|--------|-------|-------------| |
|
|
| `rgb` | (H, W, 3) | Rendered color image | |
|
|
| `depth` | (H, W, 1) | Linear depth buffer | |
|
|
| `normals` | (H, W, 3) | Surface normals | |
|
|
| `instance_mask` | (H, W, 1) | Per-object instance segmentation | |
|
|
| `sector_map` | (H, W, 3) | Per-pixel sector assignment | |
|
|
| `labels` | dict | Full label set per object (§1.3) | |
|
|
| `scene_graph` | list[dict] | Complete spatial relationships between all objects | |
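At dataset scale it helps to pack each scene's outputs into one record. A minimal sketch using `.npz` for illustration (the §6 roadmap targets WebDataset for streaming); the dtypes in the comments are assumptions:

```python
import json
import numpy as np

def save_scene(path, rgb, depth, normals, instance_mask, sector_map, labels, scene_graph):
    """Write one scene's arrays plus JSON-encoded labels to a single .npz record."""
    np.savez_compressed(
        path,
        rgb=rgb,                      # (H, W, 3) uint8
        depth=depth,                  # (H, W, 1) float32, linear depth
        normals=normals,              # (H, W, 3) float32
        instance_mask=instance_mask,  # (H, W, 1) int32 object IDs
        sector_map=sector_map,        # (H, W, 3) int8 sector indices
        labels=json.dumps(labels),            # stored as a 0-d string array
        scene_graph=json.dumps(scene_graph),
    )
```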
|
|
|
|
|
--- |
|
|
|
|
|
## 3. Training Architecture |
|
|
|
|
|
### 3.1 Stage 1: Geometric Spatial Classifier |
|
|
|
|
|
**Input**: RGB image (rendered scene) |
|
|
**Output**: Per-object sector predictions, depth ordering, occlusion graph |
|
|
|
|
|
Architecture: |
|
|
|
|
|
``` |
|
|
Image (512×512) |
|
|
→ Vision backbone (ViT-B/16 or ResNet-50) |
|
|
→ Sector prediction head: |
|
|
For each detected object: |
|
|
- Sector classification: (5×5×5) = 125-way softmax |
|
|
- Depth order: scalar regression |
|
|
- Occlusion: binary matrix (who occludes whom) |
|
|
- Scale: relative size regression |
|
|
- Viewing angle: (θ, φ, ψ) regression |
|
|
→ Scene graph head: |
|
|
For each object pair: |
|
|
- Spatial relation: {in_front, behind, left, right, above, below, overlapping} |
|
|
- Distance in sector space |
|
|
``` |
|
|
|
|
|
**Loss function**: |
|
|
|
|
|
``` |
|
|
L_total = L_sector + λ₁·L_depth + λ₂·L_occlusion + λ₃·L_scale + λ₄·L_scene_graph |
|
|
``` |
|
|
|
|
|
Where `L_sector` includes a geometric component that penalizes predictions that violate simplex constraints (e.g., predicted sector coordinates must form valid configurations on the pentachoron). |
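A minimal sketch of how these terms combine, assuming dict-keyed head outputs; the key names (`sector_logits`, `relation_logits`, etc.) and λ defaults are illustrative, and the simplex-constraint component of `L_sector` is sketched separately in §5.1:

```python
import torch.nn.functional as F

def multitask_loss(pred, target, lam=(1.0, 1.0, 0.5, 0.5)):
    """Combine the per-head losses from the L_total formula above."""
    l_sector = F.cross_entropy(pred["sector_logits"], target["sector_idx"])      # 125-way
    l_depth = F.mse_loss(pred["depth_order"], target["depth_order"])
    l_occl = F.binary_cross_entropy_with_logits(pred["occlusion"], target["occlusion"])
    l_scale = F.mse_loss(pred["scale"], target["scale"])
    l_graph = F.cross_entropy(pred["relation_logits"], target["relation_idx"])   # 7 relations
    return l_sector + lam[0] * l_depth + lam[1] * l_occl + lam[2] * l_scale + lam[3] * l_graph
```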
|
|
|
|
|
**Training data**: 1M–10M rendered scenes with exact labels. |
|
|
|
|
|
### 3.2 Stage 2: Geometric CLIP |
|
|
|
|
|
Take the pretrained spatial backbone from Stage 1 and use it as the vision encoder for a CLIP-style contrastive model. |
|
|
|
|
|
**Vision encoder**: Stage 1 backbone (frozen or lightly finetuned) |
|
|
**Text encoder**: Standard transformer, initialized from existing CLIP text encoder |
|
|
**Contrastive target**: Align image embeddings with text descriptions that include spatial language |
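The contrastive objective itself can be the standard symmetric InfoNCE used by CLIP; a minimal sketch over a batch of matched image/caption embeddings (the temperature value is illustrative):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: matched pairs sit on the diagonal of the similarity matrix."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature                    # (B, B) cosine similarities
    targets = torch.arange(len(logits), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```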
|
|
|
|
|
Training captions are generated from scene labels: |
|
|
|
|
|
```python
def scene_to_caption(labels):
    """Generate spatial text from ground-truth labels."""
    parts = []
    for obj in labels["objects"]:
        # Position: map sector indices to natural-language bins
        x, y, z = obj["sector_xyz"]
        h_pos = ["far left", "left", "center", "right", "far right"][x]
        v_pos = ["bottom", "lower", "middle", "upper", "top"][y]
        d_pos = ["very close", "near", "middle distance", "far", "very far"][z]

        parts.append(f"a {obj['type']} at {h_pos} {v_pos}, {d_pos}")

        # Occlusion: occlusion_pct is a fraction in [0, 1]; IDs in
        # `occluded_by` are assumed to index into labels["objects"]
        if obj["occlusion_pct"] > 0.2 and obj.get("occluded_by"):
            occluder = labels["objects"][obj["occluded_by"][0]]["type"]
            parts.append(f"partially behind the {occluder}")

        # Scale context for the two farthest depth layers
        if z >= 3:
            parts.append("appearing small in the distance")

    # Spatial relations (scene level, outside the per-object loop)
    for rel in labels["scene_graph"]:
        parts.append(f"the {rel['obj_a']} is {rel['relation']} the {rel['obj_b']}")

    return ", ".join(parts)
```
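For example, the single object from §1.4 yields:

```python
labels = {
    "objects": [{"type": "sphere", "sector_xyz": (2, 3, 1), "occlusion_pct": 0.0}],
    "scene_graph": [],
}
print(scene_to_caption(labels))  # -> "a sphere at center upper, near"
```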
|
|
|
|
|
**Key insight**: The geometric CLIP doesn't just learn "cat" ↔ image-of-cat. It learns "cat at upper-right, far distance, partially behind a tree" ↔ specific simplex configuration. Spatial prepositions become geometric operations, not statistical associations. |
|
|
|
|
|
### 3.3 Stage 3: Transfer to Real Images |
|
|
|
|
|
The pretrained geometric backbone transfers to real-world tasks: |
|
|
|
|
|
1. **Direct finetuning**: Swap the geometric CLIP in as the encoder in SD1.5's conditioning path. Now "cup on top of book" activates specific simplex configurations that were grounded in actual 3D relationships.
|
|
|
|
|
2. **Inverse embedding**: Given a real image, extract simplex coordinates that describe its spatial structure. These become geometric conditioning signals for diffusion models. |
|
|
|
|
|
3. **Hybrid**: Use the geometric backbone as an auxiliary encoder alongside standard CLIP. The geometric channel provides spatial structure; CLIP provides semantic content. The geo_prior blends them on the simplex. |
|
|
|
|
|
--- |
|
|
|
|
|
## 4. Dataset Scaling Strategy |
|
|
|
|
|
### 4.1 Procedural Generation Tiers |
|
|
|
|
|
| Tier | Scenes | Resolution | Objects | Purpose | |
|
|
|------|--------|------------|---------|---------| |
|
|
| Tier 1 | 1M | 256×256 | Primitives only | Fast pretraining of spatial reasoning | |
|
|
| Tier 2 | 5M | 512×512 | Composites + varied lighting | Full spatial classifier training | |
|
|
| Tier 3 | 10M | 512×512 | Low-poly meshes + textures | Geometric CLIP pretraining | |
|
|
| Tier 4 | 1M | 512×512 | Complex scenes + real textures | Bridge to photorealism | |
|
|
|
|
|
### 4.2 Augmentation via Sector Permutation |
|
|
|
|
|
Because the 5×5×5 grid is symmetric, each scene yields extra labeled examples for free; the transforms below, together with their compositions, form the eight dihedral symmetries of the image plane (up to 8× data):
|
|
|
|
|
- Horizontal flip: maps sector (x, y, z) → (4-x, y, z) |
|
|
- Vertical flip: maps sector (x, y, z) → (x, 4-y, z) |
|
|
- 90° rotation: maps sector (x, y, z) → (y, 4-x, z) |
|
|
|
|
|
Labels transform deterministically with the augmentation. |
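A minimal sketch of one such augmentation (horizontal flip), using the label fields from §1.3; vertical flip and 90° rotation follow the same pattern, and `hflip_scene` is an illustrative name:

```python
import numpy as np

def hflip_scene(rgb, objects):
    """Horizontal flip: mirror the image and remap labels (x, y, z) -> (4 - x, y, z)."""
    flipped = []
    for obj in objects:
        obj = dict(obj)  # shallow copy; field names follow §1.3
        x, y, z = obj["sector_xyz"]
        obj["sector_xyz"] = (4 - x, y, z)
        obj["sector_coverage"] = [(4 - cx, cy, cz) for cx, cy, cz in obj["sector_coverage"]]
        lx, ly, lz = obj["lighting_sector"]
        obj["lighting_sector"] = (4 - lx, ly, lz)
        sx, sy = obj["shadow_direction"]
        obj["shadow_direction"] = (-sx, sy)          # mirror the ground-plane shadow vector
        x1, y1, x2, y2 = obj["screen_bbox"]          # normalized, x grows rightward
        obj["screen_bbox"] = (1.0 - x2, y1, 1.0 - x1, y2)
        flipped.append(obj)
    return np.flip(rgb, axis=1).copy(), flipped
```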
|
|
|
|
|
### 4.3 Hard Example Mining |
|
|
|
|
|
After initial training, identify failure modes: |
|
|
|
|
|
- Sectors with high confusion rates (e.g., depth ordering at similar z-values) |
|
|
- Occlusion patterns the model struggles with |
|
|
- Scale ambiguities (large far objects vs small near objects) |
|
|
|
|
|
Regenerate scenes specifically targeting these failure modes. |
|
|
|
|
|
--- |
|
|
|
|
|
## 5. Simplex Integration |
|
|
|
|
|
### 5.1 Geometric Loss During Pretraining |
|
|
|
|
|
The existing Cayley-Menger + volume preservation loss from the geo_prior applies directly: |
|
|
|
|
|
- **CM validity**: Ensures predicted sector coordinates form valid configurations on the pentachoron |
|
|
- **Volume preservation**: Prevents collapse; all 5 spatial dimensions must remain discriminable (a sketch follows this list)
|
|
- **Edge regularity**: Maintains uniform spacing between spatial concepts |
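A minimal sketch of the volume term, assuming the prior exposes its five vertex positions as a `(5, d)` tensor; the function names are illustrative. The Cayley-Menger identity for a 4-simplex gives V² = -det(CM) / (2⁴·(4!)²) = -det(CM) / 9216:

```python
import torch

def pentachoron_volume_sq(v):
    """
    Squared 4-volume of a pentachoron from its 5 vertices via the
    Cayley-Menger determinant. v: (5, d) vertex coordinates, d >= 4.
    """
    d2 = torch.cdist(v, v).pow(2)          # (5, 5) squared pairwise distances
    cm = torch.ones(6, 6, dtype=v.dtype, device=v.device)
    cm[0, 0] = 0.0
    cm[1:, 1:] = d2
    return -torch.det(cm) / 9216.0

def volume_preservation_loss(v, v_min=1e-3):
    """Penalize simplex collapse: keep the squared volume above a floor."""
    return torch.relu(v_min - pentachoron_volume_sq(v))
```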
|
|
|
|
|
### 5.2 Vertex Assignment as Spatial Binding |
|
|
|
|
|
The vertex weight entropy findings from the triad study directly inform the architecture: |
|
|
|
|
|
- **Multi-object scenes** (like object-relations): Should produce LOW vertex entropy — hard routing of objects to separate vertices |
|
|
- **Single-object attribute scenes** (like characters): Should produce HIGH vertex entropy — soft blending of attributes on shared vertices |
|
|
- **The 5×5×5 scenes will produce BOTH patterns** depending on object count and spatial configuration |
|
|
|
|
|
This means the pretraining naturally teaches the prior when to use hard routing vs soft blending — the fundamental compositional operation. |
|
|
|
|
|
### 5.3 Sector → Simplex Coordinate Function |
|
|
|
|
|
The mapping from sector space to simplex space is a learnable function initialized as: |
|
|
|
|
|
```python
import torch
import torch.nn.functional as F

def sector_to_simplex(sector_xyz, scale, viewpoint, temperature=1.0):
    """
    Map 5×5×5 sector + metadata to pentachoron barycentric coordinates.

    Args:
        sector_xyz: (3,) integers in [0, 4]
        scale: float, relative object scale
        viewpoint: (3,) euler angles in radians

    Returns:
        (5,) barycentric coordinates summing to 1
    """
    # Normalize inputs to [0, 1]
    x, y, z = (torch.as_tensor(sector_xyz, dtype=torch.float32) / 4.0).unbind()
    s = torch.sigmoid(torch.as_tensor(scale, dtype=torch.float32))
    # One simple choice for viewpoint normalization: wrap angles to [0, 1) and average
    v = (torch.as_tensor(viewpoint, dtype=torch.float32) / (2 * torch.pi)).remainder(1.0).mean()

    # Initial mapping: spread across the 5 vertices (v0..v4 as in §1.4)
    raw = torch.stack([x, y, z, s, v])

    # Softmax yields valid barycentric coordinates (positive, summing to 1)
    return F.softmax(raw / temperature, dim=0)
```
|
|
|
|
|
The temperature and any learned transformations are trained alongside the classifier. |
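As a sanity check, the §1.4 example object lands on a valid barycentric point:

```python
coords = sector_to_simplex((2, 3, 1), scale=0.0, viewpoint=(0.0, 0.0, 0.0))
print(coords)  # tensor of 5 weights, one per vertex v0..v4
assert torch.isclose(coords.sum(), torch.tensor(1.0))
```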
|
|
|
|
|
--- |
|
|
|
|
|
## 6. Implementation Roadmap |
|
|
|
|
|
### Phase 1: Renderer + Data Pipeline [HIGH PRIORITY] |
|
|
|
|
|
- [ ] **ModernGL offscreen renderer** |
|
|
- [ ] EGL context setup (headless / Colab compatible) |
|
|
- [ ] Programmable camera with configurable FOV, near/far |
|
|
- [ ] 5×5×5 frustum sector calculation from camera params |
|
|
- [ ] Depth buffer extraction |
|
|
- [ ] Instance mask via unique-color rendering pass |
|
|
- [ ] Basic Phong/Lambert lighting with positioned lights |
|
|
|
|
|
- [ ] **Primitive library** |
|
|
- [ ] Sphere, cube, cylinder, cone, torus mesh generators |
|
|
- [ ] UV-mapped for future texture support |
|
|
- [ ] Per-primitive bounding box for sector assignment |
|
|
|
|
|
- [ ] **Scene composer** |
|
|
- [ ] Random object placement within specified sectors |
|
|
- [ ] Collision detection (prevent overlapping placements) |
|
|
- [ ] Automatic occlusion calculation from depth buffer |
|
|
- [ ] Scene graph generation (pairwise spatial relations) |
|
|
- [ ] Perspective-correct sector scaling |
|
|
|
|
|
- [ ] **Label generator** |
|
|
- [ ] Per-object label vector (§1.3) |
|
|
- [ ] Scene-level spatial relation graph |
|
|
- [ ] Caption generator (§3.2) |
|
|
- [ ] Simplex coordinate ground truth |
|
|
|
|
|
- [ ] **Data pipeline** |
|
|
- [ ] Parallel scene generation (multiprocess) |
|
|
- [ ] WebDataset / streaming format for large-scale |
|
|
- [ ] On-the-fly augmentation (sector permutations) |
|
|
- [ ] HuggingFace dataset upload integration |
|
|
|
|
|
### Phase 2: Geometric Classifier [HIGH PRIORITY] |
|
|
|
|
|
- [ ] **Model architecture** |
|
|
- [ ] Vision backbone selection (ViT-B/16 vs ResNet-50) |
|
|
- [ ] Sector prediction head (125-way classification per object) |
|
|
- [ ] Depth ordering head |
|
|
- [ ] Occlusion prediction head |
|
|
- [ ] Scene graph prediction head |
|
|
|
|
|
- [ ] **Training pipeline** |
|
|
- [ ] Multi-task loss with geometric regularization |
|
|
- [ ] Simplex constraint loss (CM validity on predictions) |
|
|
- [ ] Curriculum: primitives → composites → meshes |
|
|
- [ ] Evaluation metrics: sector accuracy, depth ordering accuracy, occlusion F1 |
|
|
|
|
|
- [ ] **Ablation studies** |
|
|
- [ ] With vs without simplex constraint loss |
|
|
- [ ] Effect of vertex count (k=4, k=5, k=6) |
|
|
- [ ] Depth bucket resolution (3×3×3 vs 5×5×5 vs 7×7×7) |
|
|
|
|
|
### Phase 3: Geometric CLIP [MEDIUM PRIORITY] |
|
|
|
|
|
- [ ] **Architecture** |
|
|
- [ ] Vision encoder: frozen Stage 1 backbone + projection |
|
|
- [ ] Text encoder: initialize from OpenAI CLIP text encoder |
|
|
- [ ] Contrastive loss with hard negatives (spatial near-misses) |
|
|
|
|
|
- [ ] **Training** |
|
|
- [ ] Caption generation from scene labels |
|
|
- [ ] Hard negative mining (swap spatial relations in text) |
|
|
- [ ] Spatial preposition evaluation benchmark |
|
|
- [ ] Transfer evaluation: zero-shot spatial classification on real images |
|
|
|
|
|
- [ ] **Integration with existing pipeline** |
|
|
- [ ] Replace SD1.5 CLIP encoder with geometric CLIP |
|
|
- [ ] Measure impact on compositional generation |
|
|
- [ ] Compare geo_prior behavior with geometric vs standard CLIP |
|
|
|
|
|
### Phase 4: Transfer + Real-World Bridge [LOWER PRIORITY] |
|
|
|
|
|
- [ ] **Domain transfer** |
|
|
- [ ] Finetune geometric backbone on COCO with spatial annotations |
|
|
- [ ] Evaluate on spatial reasoning benchmarks (CLEVR, SpatialBench) |
|
|
- [ ] Test compositional generation improvement in SD1.5 |
|
|
|
|
|
- [ ] **Inverse embedding pipeline** |
|
|
- [ ] Given real image → extract simplex coordinates |
|
|
- [ ] Use as conditioning signal for diffusion |
|
|
- [ ] Compare with CLIP-only conditioning |
|
|
|
|
|
- [ ] **Hybrid encoder** |
|
|
- [ ] Dual-stream: geometric backbone + CLIP |
|
|
- [ ] Learnable fusion on simplex |
|
|
- [ ] Evaluate on attribute binding + spatial composition jointly |
|
|
|
|
|
--- |
|
|
|
|
|
## 7. Key Hypotheses to Validate |
|
|
|
|
|
1. **Sector classification transfers to real images**: A model trained entirely on synthetic 3D scenes can identify spatial sectors in photographs with >60% accuracy. |
|
|
|
|
|
2. **Geometric CLIP improves compositional generation**: Replacing standard CLIP with geometric CLIP in the SD1.5 pipeline produces measurably better spatial composition (evaluated via CLEVR-style spatial accuracy metrics). |
|
|
|
|
|
3. **Simplex coordinates are a natural spatial language**: The 5-vertex pentachoron provides sufficient dimensionality to encode the spatial relationships humans use in language (in front of, behind, above, below, next to, far away, etc.). |
|
|
|
|
|
4. **Vertex entropy predicts spatial complexity**: Scenes with more objects produce lower vertex entropy (hard routing), while single-object scenes produce higher entropy (attribute binding). This pattern, observed in the triad study, should emerge naturally from synthetic training. |
|
|
|
|
|
5. **The 5×5×5 grid scales**: The same sectorization works for both close-up tabletop scenes and panoramic landscapes by leveraging perspective scaling of sector volumes. |
|
|
|
|
|
|
|
--- |
|
|
|
|
|
## 8. Technical Notes |
|
|
|
|
|
### 8.1 Why 5×5×5? |
|
|
|
|
|
- 5 matches the pentachoron vertex count — direct 1:1 axis-to-vertex mapping |
|
|
- 125 sectors is fine enough for meaningful spatial discrimination without being computationally prohibitive |
|
|
- 5 depth layers capture: immediate foreground, near, mid, far, background — matching natural perceptual depth bands |
|
|
- The number 5 appears consistently across the geometric vocabulary work (k=4 simplex has 5 vertices, 5 edge dimensions, etc.) |
|
|
|
|
|
### 8.2 Why Not Photorealistic Rendering? |
|
|
|
|
|
- Photorealism introduces texture/material confounds that obscure spatial signal |
|
|
- Simple rendering is 100-1000× faster, enabling much larger datasets |
|
|
- Transfer from simple→real is well-studied (sim2real in robotics) |
|
|
- The geometric prior should learn spatial structure independent of visual style — using simple rendering enforces this |
|
|
- Phase 4 progressively adds visual complexity once spatial reasoning is established |
|
|
|
|
|
### 8.3 Relationship to Existing Work |
|
|
|
|
|
- **CLEVR**: Similar synthetic scene approach but limited to 2D spatial relations. Our 5×5×5 grid adds depth, scale, and perspective. |
|
|
- **NeRF / 3D Gaussians**: Reconstruct 3D from 2D. We go the opposite direction — start with known 3D, teach networks to infer it from 2D. |
|
|
- **Spatial transformers**: Learn attention over spatial positions. Our approach provides explicit spatial supervision rather than hoping attention learns spatial structure. |
|
|
- **Scene graphs**: Prior work on scene graph prediction from images. Our contribution is grounding scene graphs in simplex geometry rather than abstract relation classification. |
|
|
|
|
|
--- |
|
|
|
|
|
## 9. Resource Estimates |
|
|
|
|
|
| Component | Compute | Storage | |
|
|
|-----------|---------|---------| |
|
|
| Tier 1 data generation (1M scenes) | 1 GPU, ~3 hours | ~50 GB | |
|
|
| Tier 2 data generation (5M scenes) | 1 GPU, ~15 hours | ~250 GB | |
|
|
| Classifier pretraining | 1 A100, ~24 hours | ~2 GB model | |
|
|
| Geometric CLIP training | 1-4 A100s, ~48 hours | ~2 GB model | |
|
|
| Transfer experiments | 1 A100, ~8 hours each | - | |
|
|
|
|
|
--- |
|
|
|
|
|
## 10. Success Criteria |
|
|
|
|
|
| Milestone | Metric | Target | |
|
|
|-----------|--------|--------| |
|
|
| Renderer works | Scenes/second | >100 on single GPU | |
|
|
| Sector classifier | Top-1 sector accuracy (synthetic) | >90% | |
|
|
| Depth ordering | Kendall's τ (synthetic) | >0.95 | |
|
|
| Geometric CLIP | Spatial preposition accuracy (synthetic) | >85% | |
|
|
| Real image transfer | Spatial sector accuracy (COCO) | >50% | |
|
|
| Compositional generation | Spatial relation accuracy (SD1.5 + geometric CLIP) | >70% | |
|
|
| Vertex entropy pattern | Matches triad study predictions | Qualitative match | |