Engram-Guided Structural Memory for Single-View 3D Reconstruction in Humanoid Robotics
Author: GrooveJ (Danny Lee)
Affiliation: Independent Researcher (with the assistance of many LLMs)
Conceptual paper (no experimental results).
This article presents a complete mathematical specification with intuitive explanations of all operators and symbols, designed for future implementation and empirical validation.
Abstract
Single-view 3D reconstruction under occlusion is a critical bottleneck in humanoid robotics: functional structures—handles, joints, supports—are frequently hidden, causing grasp failures and unsafe actions. While Large Reconstruction Models (LRMs) produce visually plausible shapes, they lack mechanisms for reusing structural knowledge across objects. Inspired by the principle that decoupling memory from computation yields significant reasoning gains, we propose Engram-Guided 2D-to-3D (EG-3D): a framework that integrates structure-only 3D memory with LLM-conditioned perception and Vision-Language-Action (VLA) control.
Given an RGB image and a task prompt, an LLM-conditioned encoder extracts part-level structural keys via Slot Attention, anchored to functional parts through a prioritized supervision hierarchy (contact signals → geometric priors → relational consistency). These keys retrieve Modular Engrams—decomposed into geometric, relational, symmetry, and affordance subcomponents—each injected through dedicated pathways. We enforce structural constraints via kinematic consistency loss ensuring relational tokens are not ignored by the generator. For single-view pose estimation without ground truth, we leverage gravity-aligned priors (from robot IMU) combined with multi-view consistency. We also introduce a staged memory construction protocol that enables deployment with minimal initial annotation (geometry + symmetry only), progressively adding relations and affordances.
This paper provides: (1) a complete mathematical specification with detailed explanations for every equation, (2) a minimal "EG-3D Lite" configuration for practical deployment, (3) explicit mechanisms preventing generators from ignoring structural constraints, and (4) clear differentiation from prior retrieval-based methods.
1. Introduction
1.1 The Problem: Occlusion-Induced Failures in Manipulation
Humanoid robots increasingly rely on monocular head-mounted cameras. However, real-world manipulation involves inherent underspecification: crucial functional elements—door handles, hinges, support legs, tool tips, internal joints—are frequently occluded by the object itself or by the robot's body. When these structures are misreconstructed or not inferred at all, robots select incorrect grasp points, apply unsafe forces, collide unexpectedly, or fail tasks entirely.
1.2 Industry Context: Why Now?
The strategic importance of single-view 3D reconstruction for robotics has become increasingly apparent. Recent acquisitions and investments around single-image-to-3D asset generation highlight the urgency of robust 2D→3D pipelines for embodied systems. Major technology companies are actively acquiring startups in this space, recognizing that perception-to-action pipelines are critical bottlenecks for deploying robots at scale.
1.3 The Gap: Missing Structural Memory
Recent LRMs (TripoSR, LRM, Wonder3D) scale transformers and diffusion to generate plausible geometry from single images. Yet they exhibit systematic structural errors: wrong part counts, symmetry collapse, hallucinated connections, or missing functional subparts. We argue that a central missing ingredient is reusable structural memory.
1.4 Inspiration: Conditional Memory as a Modeling Primitive
Work on conditional memory demonstrates that decoupling memory from computation yields significant reasoning gains. We extend this principle to 3D perception.
1.5 Key Technical Contributions
Modular Engram Architecture with explicit composition operators and kinematic consistency loss ensuring relational constraints are enforced
Prioritized Slot-Part Alignment: contact/interaction signals (strongest) → geometric priors → relational consistency (weakest)
Gravity-Aligned Pose Estimation: leveraging robot IMU for coarse alignment without ground-truth poses
Staged Memory Construction: minimal viable memory (geo + sym) → progressive enrichment (rel → aff)
Clear Differentiation from retrieval-based priors: EG-3D is a structural assembly system, not a retrieval prior
2. Related Work and Differentiation
2.1 Retrieval-Based 3D Priors vs. Structural Assembly Memory
A key question is: How does EG-3D differ from existing retrieval-based 3D priors?
| Aspect | Retrieval-Based Priors | EG-3D (Ours) |
|---|---|---|
| Retrieval unit | Global shape or exemplar | Part-level structural motifs |
| Memory content | Full shapes (with texture/appearance) | Structure-only (geo + rel + sym + aff) |
| Composition | Single retrieval or blending | SE(3)-aware multi-part assembly |
| Key space | Category/appearance-based | Structure-only (invariance-regularized) |
| Constraints | Implicit (hope generator learns) | Explicit (kinematic consistency loss) |
| Affordance | Not modeled | First-class component |
| Action loop | Not considered | VLA integration |
Our key insight: EG-3D is not "retrieval + generation" but "structural assembly + generation". We retrieve parts, align them via SE(3), enforce kinematic constraints, and compose them with explicit operators.
2.2 Memory in Vision-Language-Action Models
Recent VLA models incorporate memory:
- MemoryVLA: Temporal memory for long-horizon tasks
- IVE: Episodic memory for exploration
Our work addresses structural memory (reusable 3D motifs across objects), complementary to temporal/episodic memory.
3. Notation and Preliminaries
3.1 Inputs and Embeddings
| Symbol | Type | Description |
|---|---|---|
| \(I\) | \(\mathbb{R}^{H \times W \times 3}\) | Input RGB image with height \(H\), width \(W\), 3 color channels |
| \(P\) | token sequence | Natural-language prompt as sequence of tokens from vocabulary \(\mathcal{V}\) |
| \(Z = \{z_t\}_{t=1}^{N}\) | each \(z_t \in \mathbb{R}^{d}\) | Fused token sequence: visual tokens + text tokens, each \(d\)-dimensional |
| \(d\) | \(\mathbb{N}\) | Embedding dimension (typical: 768 or 1024) |
3.2 Part-Level Queries (Slots)
| Symbol | Type | Description |
|---|---|---|
| \(J\) | \(\mathbb{N}\) | Number of part-level query slots (typical: 8–16) |
| \(k_j\) | \(\mathbb{R}^{d}\) | Structure-only query key for slot \(j\) |
| \(a_{j,t}\) | \([0, 1]\) | Attention weight: how much slot \(j\) attends to token \(t\) |
3.3 Modular Engram Structure
Each Engram is a modular tuple:
\[
E_m = \big(e_m^{\text{geo}},\; e_m^{\text{rel}},\; e_m^{\text{sym}},\; e_m^{\text{aff}},\; T_m^{\text{can}}\big)
\]
Symbol-by-symbol explanation:
- \(m\): Index of the Engram in memory (\(m = 1, \ldots, M\))
- \(e_m^{\text{geo}}\): Geometric component — latent vector encoding 3D shape
- \(e_m^{\text{rel}}\): Relational component — tokens encoding part connectivity
- \(e_m^{\text{sym}}\): Symmetry component — descriptor of symmetry type and axes
- \(e_m^{\text{aff}}\): Affordance component — markers for interaction points
- \(T_m^{\text{can}}\): Canonical frame — reference pose for spatial alignment
| Component | Injection Pathway | Required in Minimal Config? |
|---|---|---|
| \(e^{\text{geo}}\) | AdaLN | ✅ Yes |
| \(e^{\text{rel}}\) | Cross-Attention + Kinematic Loss | ❌ Optional (Stage 2) |
| \(e^{\text{sym}}\) | Symmetry-Aware Decoder | ✅ Yes |
| \(e^{\text{aff}}\) | Auxiliary Head | ❌ Optional (Stage 3) |
| \(T^{\text{can}}\) | SE(3) Alignment | ✅ Yes (coarse) |
3.4 Memory Structure
| Symbol | Type | Description |
|---|---|---|
| \(M\) | \(\mathbb{N}\) | Total number of Engrams in memory |
| \(\mathcal{M} = \{(\kappa_m, E_m)\}_{m=1}^{M}\) | — | Memory bank as key-Engram pairs |
| \(\kappa_m\) | \(\mathbb{R}^{d}\) | Retrieval key for Engram \(m\) |
| \(w_{j,m}\) | \([0, 1]\) | Retrieval weight: how much slot \(j\) retrieves Engram \(m\) |
| \(\tau\) | \(\mathbb{R}_{>0}\) | Temperature for softmax retrieval (typical: 0.05–0.2) |
3.5 3D Generator and Action
| Symbol | Type | Description |
|---|---|---|
| \(G\) | — | Continuous 3D representation (NeRF, SDF, or 3DGS) |
| \(G_\theta\) | — | 3D generator network with parameters \(\theta\) |
| \(\hat{I}_v\) | \(\mathbb{R}^{H \times W \times 3}\) | Differentiable rendering of \(G\) from viewpoint \(v\) |
| \(a\) | \(\mathbb{R}^{K}\) | Action vector (\(K\) dimensions, e.g., 7 for 6-DoF + gripper) |
| \(\pi_{\text{VLA}}\) | — | Vision-Language-Action policy |
4. Core Framework
4.1 Pipeline Overview
The complete pipeline consists of five stages:
Equation (1): Visual-Language Encoding
\[
Z = E_{\text{VL}}(I, P), \qquad \bar{z} = \text{Pool}(Z)
\]
Symbol-by-symbol:
- \(E_{\text{VL}}\): Visual-Language encoder (any VLM backbone)
- \(P\): Input text prompt
- \(I\): Input RGB image
- \(Z = \{z_t\}_{t=1}^{N}\): Output token sequence (\(N\) visual + text tokens)
- \(\text{Pool}(\cdot)\): Pooling operation (mean or attention pooling)
- \(\bar{z}\): Global embedding summarizing the entire input
Intuitive Explanation:
This is the robot's "perception + understanding" step. The encoder looks at the image and reads the instruction, producing a sequence of tokens that represent both visual content and linguistic intent. The pooled vector captures the overall context.
Why this matters:
Joint visual-language encoding allows the system to focus on task-relevant parts. For "grab the handle," the encoder emphasizes handle-related visual regions.
Equation (2): Structure-Only Key Extraction
\[
\{k_j\}_{j=1}^{J} = F_{\text{key}}(Z)
\]
Symbol-by-symbol:
- \(F_{\text{key}}\): Key extraction function (implemented via Slot Attention)
- \(Z\): Input token sequence from encoder
- \(k_j\): Output key for slot \(j\)
- \(J\): Number of slots (part hypotheses)
Expanded form (Slot Attention):
\[
k_j = \frac{\sum_{t=1}^{N} a_{j,t}\, z_t}{\sum_{t=1}^{N} a_{j,t}}
\]
where attention weights are computed as:
\[
a_{j,t} = \frac{\exp\big(q_j^\top z_t / \sqrt{d}\big)}{\sum_{j'=1}^{J} \exp\big(q_{j'}^\top z_t / \sqrt{d}\big)}
\]
Critical constraint (competition across slots):
\[
\sum_{j=1}^{J} a_{j,t} = 1 \quad \text{for every token } t
\]
Intuitive Explanation:
Slot Attention splits the scene into parts. Each slot "competes" to explain different tokens—one slot might capture "the handle," another "the door body," another "the hinge." The key \(k_j\) summarizes what slot \(j\) has captured.
Why the competition constraint?
Without it, all slots might attend to everything equally, producing identical keys. Competition forces specialization.
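The slot competition above can be sketched in a few lines. This is a minimal, single-iteration NumPy illustration (the function name `slot_keys` and the tensor shapes are illustrative, not part of the specification); the essential detail is that the softmax normalizes over slots, not tokens.

```python
import numpy as np

def slot_keys(Z, Q, d):
    """Sketch of competitive slot attention (single iteration).

    Z: (N, d) fused visual-language tokens.
    Q: (J, d) slot queries.
    Returns (J, d) slot keys and the (J, N) attention map.
    """
    logits = Q @ Z.T / np.sqrt(d)                  # (J, N) similarities
    # Competition: softmax over SLOTS (axis 0), so each token is
    # "won" by a few slots rather than shared equally by all of them.
    a = np.exp(logits - logits.max(axis=0, keepdims=True))
    a = a / a.sum(axis=0, keepdims=True)           # columns sum to 1
    # Weighted-mean readout per slot (renormalize over tokens).
    w = a / (a.sum(axis=1, keepdims=True) + 1e-8)
    K = w @ Z                                      # (J, d) slot keys
    return K, a

rng = np.random.default_rng(0)
Z = rng.normal(size=(32, 16))   # 32 tokens, d = 16
Q = rng.normal(size=(4, 16))    # 4 slot queries
K, a = slot_keys(Z, Q, 16)
assert np.allclose(a.sum(axis=0), 1.0)   # competition constraint holds
```

In a full implementation the queries would be updated iteratively (GRU-style, as in Slot Attention); this sketch only shows the normalization direction that creates specialization.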
Equation (3): Modular Engram Retrieval with SE(3) Alignment
\[
\hat{Z}_E = \big(\hat{e}^{\text{geo}}, \hat{e}^{\text{rel}}, \hat{e}^{\text{sym}}, \hat{e}^{\text{aff}}\big) = F_{\text{ret}}\big(\{k_j\}_{j=1}^{J}, \mathcal{M}\big)
\]
Symbol-by-symbol:
- \(F_{\text{ret}}\): Retrieval function with SE(3) alignment
- \(\{k_j\}\): Query keys from slots
- \(\mathcal{M}\): Memory bank
- \(\hat{Z}_E\): Retrieved and composed Engram components
Expanded form (retrieval weights):
\[
w_{j,m} = \frac{\exp\big(\cos(k_j, \kappa_m) / \tau\big)}{\sum_{m'=1}^{M} \exp\big(\cos(k_j, \kappa_{m'}) / \tau\big)}
\]
Symbol-by-symbol:
- \(\cos(k_j, \kappa_m)\): Cosine similarity between query and memory key
- \(\tau\): Temperature (lower = sharper retrieval, higher = softer)
- \(w_{j,m}\): Normalized retrieval weight (sums to 1 over \(m\))
Intuitive Explanation:
For each slot's key, we find the most similar Engrams in memory. The softmax converts similarities into probability-like weights. Temperature controls sharpness: \(\tau \to 0\) gives winner-take-all, \(\tau \to \infty\) gives uniform weighting.
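The temperature-controlled retrieval is a standard cosine-similarity softmax; a minimal NumPy sketch (function name and shapes are illustrative):

```python
import numpy as np

def retrieval_weights(k, kappa, tau):
    """Softmax retrieval weights for one slot key.

    k: (d,) query key; kappa: (M, d) memory keys; tau: temperature.
    Returns (M,) weights summing to 1."""
    k = k / np.linalg.norm(k)
    kn = kappa / np.linalg.norm(kappa, axis=1, keepdims=True)
    sims = kn @ k                 # cosine similarities, in [-1, 1]
    logits = sims / tau
    logits -= logits.max()        # shift for numerical stability
    w = np.exp(logits)
    return w / w.sum()

rng = np.random.default_rng(1)
kappa = rng.normal(size=(5, 8))
k = kappa[2] + 0.01 * rng.normal(size=8)        # query close to entry 2
sharp = retrieval_weights(k, kappa, tau=0.05)   # near winner-take-all
soft = retrieval_weights(k, kappa, tau=10.0)    # near uniform
assert sharp.argmax() == 2 and sharp.max() > soft.max()
```

Lowering `tau` concentrates the weight mass on the best-matching Engram; raising it spreads the mass, which is useful early in training when keys are noisy.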
Equation (4): Multi-Pathway 3D Generation with Kinematic Constraints
\[
G = G_\theta\big(I, \hat{Z}_E\big) \quad \text{subject to} \quad \mathcal{L}_{\text{kin}}\big(G, \hat{e}^{\text{rel}}\big) \le \epsilon
\]
Symbol-by-symbol:
- \(G_\theta\): 3D generator network with parameters \(\theta\)
- \(I\): Input image (provides appearance guidance)
- \(\hat{Z}_E\): Composed Engram components (structural guidance)
- \(\mathcal{L}_{\text{kin}}\): Kinematic consistency loss
- \(\epsilon\): Constraint tolerance
Intuitive Explanation:
The generator produces 3D geometry guided by both the image (for appearance) and retrieved Engrams (for structure). The kinematic constraint ensures the output respects relational information (e.g., joint axes, connection points).
Equation (5): Action Inference
\[
a = \pi_{\text{VLA}}\big(\{\hat{I}_v\}_v, P\big)
\]
Symbol-by-symbol:
- \(\pi_{\text{VLA}}\): Vision-Language-Action policy
- \(\{\hat{I}_v\}\): Rendered images from 3D reconstruction
- \(P\): Original task prompt
- \(a\): Output action vector
Intuitive Explanation:
Given the reconstructed 3D (rendered as images) and the task instruction, the VLA policy outputs robot actions. The explicit 3D structure helps the policy reason about occluded parts.
4.2 Slot-Part Alignment with Prioritized Supervision
Critical Challenge: Slot Attention often produces "semantic blobs" rather than functional parts.
Our Solution: Prioritized Supervision Hierarchy
| Priority | Signal Source | Weight | Description |
|---|---|---|---|
| 1 (Highest) | Contact/interaction | \(w_1\) | Where humans/robots actually touch |
| 2 | Geometric priors | \(w_2\) | Cylinders are handles, planes are supports |
| 3 | Relational consistency | \(w_3\) | Connected parts should be spatially close |
| 4 (Lowest) | Diversity regularization | \(w_4\) | Slots should differ from each other |
Combined Slot Alignment Loss:
\[
\mathcal{L}_{\text{slot}} = w_1 \mathcal{L}_{\text{contact}} + w_2 \mathcal{L}_{\text{geo-aff}} + w_3 \mathcal{L}_{\text{rel-cons}} + w_4 \mathcal{L}_{\text{div}}
\]
Symbol-by-symbol:
- \(w_1, w_2, w_3, w_4\): Priority weights (\(w_1 > w_2 > w_3 > w_4\))
- \(\mathcal{L}_{\text{contact}}\): Contact signal loss
- \(\mathcal{L}_{\text{geo-aff}}\): Geometric affordance prior loss
- \(\mathcal{L}_{\text{rel-cons}}\): Relational consistency loss
- \(\mathcal{L}_{\text{div}}\): Slot diversity regularization
Intuitive Explanation:
We stack multiple supervision signals by priority. Contact signals (where people actually grab objects) are strongest because they directly indicate functional parts. Lower-priority signals fill in when higher-priority signals are unavailable.
Priority 1: Contact Signal Loss
\[
\mathcal{L}_{\text{contact}} = \sum_{j=1}^{J} \text{BCE}\big(A_j, C_j\big)
\]
Symbol-by-symbol:
- \(A_j\): Attention mask for slot \(j\) (where it attends in the image)
- \(C_j\): Ground-truth contact regions (from HOI data)
- \(\text{BCE}\): Binary Cross-Entropy loss
Expanded BCE (over the pixel set \(\Omega\)):
\[
\text{BCE}(A, C) = -\frac{1}{|\Omega|} \sum_{p \in \Omega} \big[ C(p) \log A(p) + (1 - C(p)) \log(1 - A(p)) \big]
\]
Intuitive Explanation:
If we know where humans touch objects (from videos of people using objects), we train slots to attend to those regions. A slot attending to a handle should align with hand-contact regions on handles.
Why highest priority?
Contact directly indicates functional relevance. A handle is a handle because people grab it.
Priority 2: Geometric Affordance Prior Loss
\[
\mathcal{L}_{\text{geo-aff}} = -\sum_{j=1}^{J} \sum_{\alpha \in \mathcal{A}} \mathbb{1}\big[A_j \cap R_\alpha \neq \emptyset\big] \log p_\alpha(k_j)
\]
Symbol-by-symbol:
- \(\mathcal{A}\): Set of affordance types {grasp, support, hinge, button, ...}
- \(R_\alpha\): Regions satisfying geometric prior for affordance \(\alpha\)
- \(\mathbb{1}[\cdot]\): Indicator function (1 if condition true, 0 otherwise)
- \(p_\alpha(k_j)\): Predicted probability of affordance \(\alpha\) given slot key \(k_j\)
Geometric priors (examples):
- Grasp (handle): Elongated cylindrical regions where \(\lambda_1 \gg \lambda_2 \approx \lambda_3\) (PCA eigenvalues)
- Support: Horizontal surfaces with upward normal: \(n^\top \hat{z} > 0.9\)
- Hinge: Concave junctions between planar surfaces
- Button: Small convex protrusions on flat surfaces
Intuitive Explanation:
If a slot attends to a cylindrical region, it should predict "grasp" affordance. This loss encourages geometric consistency between attention patterns and affordance predictions.
Priority 3: Relational Consistency Loss
\[
\mathcal{L}_{\text{rel-cons}} = \sum_{(i,j) \in \mathcal{E}} \big\| (c_i - c_j) - \Delta_{ij} \big\|_2^2
\]
Symbol-by-symbol:
- \(\mathcal{E}\): Set of slot pairs connected by relational tokens
- \(c_i\): 2D centroid of slot \(i\)'s attention mask
- \(\Delta_{ij}\): Expected relative position from relational token (projected to 2D)
Intuitive Explanation:
If relational tokens say "slot \(i\) connects to slot \(j\) with offset \(\Delta_{ij}\)," then their attention centroids should reflect this spatial relationship. A handle connected to a door should attend to adjacent regions.
Priority 4: Slot Diversity Regularization
\[
\mathcal{L}_{\text{div}} = -\sum_{j=1}^{J} H(a_j) + \gamma \sum_{i \neq j} \max\big(0, \cos(k_i, k_j) - \delta\big)
\]
Symbol-by-symbol:
- \(H(a_j)\): Entropy of slot \(j\)'s attention distribution
- \(\cos(k_i, k_j)\): Cosine similarity between slot keys
- \(\delta\): Similarity threshold (typical: 0.5)
- \(\gamma\): Repulsion strength
Intuitive Explanation:
The entropy term encourages each slot to attend broadly (not collapse to a single token). The repulsion term pushes slot keys apart, preventing multiple slots from representing the same part.
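The two terms can be computed directly from the attention maps and keys. A minimal NumPy sketch (the function name `diversity_loss` and default values are illustrative):

```python
import numpy as np

def diversity_loss(a, K, delta=0.5, gamma=1.0):
    """Entropy + key-repulsion regularizer.

    a: (J, N) per-slot attention distributions (rows sum to 1).
    K: (J, d) slot keys."""
    # Entropy term: maximizing per-slot entropy = minimizing its negative.
    ent = -(a * np.log(a + 1e-8)).sum(axis=1)          # (J,) entropies
    l_ent = -ent.mean()
    # Repulsion term: penalize key pairs more similar than delta.
    Kn = K / np.linalg.norm(K, axis=1, keepdims=True)
    cos = Kn @ Kn.T
    J = K.shape[0]
    off_diag = ~np.eye(J, dtype=bool)
    l_rep = np.maximum(0.0, cos[off_diag] - delta).sum()
    return l_ent + gamma * l_rep

a = np.full((3, 10), 0.1)                # uniform attention, rows sum to 1
K_same = np.ones((3, 4))                 # collapsed: identical keys
K_diff = np.eye(3, 4)                    # distinct, orthogonal keys
assert diversity_loss(a, K_same) > diversity_loss(a, K_diff)
```

Identical keys incur the full repulsion penalty, while orthogonal keys incur none, which is exactly the pressure against redundant slots described above.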
4.3 Structure-Only Key Learning via Invariance
Problem: Keys might encode texture/color, causing retrieval to match appearance rather than structure.
Solution: Invariance regularization forces keys to ignore appearance.
Combined Invariance Loss:
\[
\mathcal{L}_{\text{inv}} = \mathcal{L}_{\text{tex}} + \mathcal{L}_{\text{prompt}} + \mathcal{L}_{\text{cross}}
\]
Texture Augmentation Invariance:
\[
\mathcal{L}_{\text{tex}} = \sum_{j=1}^{J} \big\| k_j(I) - k_j(\mathcal{T}(I)) \big\|_2^2
\]
Symbol-by-symbol:
- \(k_j(I)\): Key from slot \(j\) given original image \(I\)
- \(\mathcal{T}(I)\): Texture-augmented image (hue shift, saturation change, style transfer)
- \(\|\cdot\|_2^2\): Squared L2 norm
Intuitive Explanation:
The same object with different colors/textures should produce identical keys. We augment textures and penalize key differences.
Prompt Invariance (Selective):
\[
\mathcal{L}_{\text{prompt}} = \mathbb{1}\big[\text{same-structure}(P_1, P_2)\big] \sum_{j=1}^{J} \big\| k_j(I, P_1) - k_j(I, P_2) \big\|_2^2
\]
Symbol-by-symbol:
- \(P_1, P_2\): Two different prompts
- \(\text{same-structure}(P_1, P_2)\): True if prompts describe same structure with different appearance
Examples:
- \(P_1\) = "red wooden mug" vs \(P_2\) = "blue ceramic mug" → Same structure, apply invariance
- \(P_1\) = "mug with handle" vs \(P_2\) = "mug without handle" → Different structure, skip invariance
Intuitive Explanation:
"Red wooden chair" and "blue metal chair" should retrieve the same structural Engrams. But "chair with armrests" and "chair without armrests" are structurally different.
Cross-Instance Structural Alignment:
\[
\mathcal{L}_{\text{cross}} = \sum_{(k, k^{+}) \in \mathcal{P}^{+}} \big\| k - k^{+} \big\|_2^2 + \sum_{(k, k^{-}) \in \mathcal{P}^{-}} \max\big(0,\; \mu - \| k - k^{-} \|_2\big)^2
\]
Symbol-by-symbol:
- \(\mathcal{P}^{+}\): Pairs of keys from same structure (different instances/textures)
- \(\mathcal{P}^{-}\): Pairs of keys from different structures
- \(\mu\): Margin threshold
Intuitive Explanation:
This is contrastive learning for structure. Keys from structurally similar objects should be close; keys from different structures should be far.
4.4 Memory Key Generation
Strategy A (Recommended): Shared Encoder Pipeline
\[
\kappa_m = \frac{1}{|\mathcal{V}_m||\mathcal{T}_m|} \sum_{v \in \mathcal{V}_m} \sum_{t \in \mathcal{T}_m} \text{Pool}\Big( F_{\text{key}}\big( E_{\text{VL}}\big( \mathcal{R}(S_m, v, t), P_0 \big) \big) \Big)
\]
Symbol-by-symbol:
- \(S_m\): Source 3D shape for Engram \(m\)
- \(\mathcal{V}_m\): Set of viewpoints (e.g., 8 canonical views)
- \(\mathcal{T}_m\): Set of texture variations
- \(\mathcal{R}(S_m, v, t)\): Rendered image from viewpoint \(v\) with texture \(t\)
- \(P_0\): Structure-focused prompt (e.g., "a 3D object")
Intuitive Explanation:
Memory keys are generated using the same pipeline as query keys, ensuring they live in the same space. Averaging over viewpoints and textures ensures the key captures structure, not appearance or viewpoint.
4.5 Modular Engram Components (Detailed Specifications)
4.5.1 Relational Token Format
Storage Format (23-dimensional raw vector):
| Field | Dims | Type | Description |
|---|---|---|---|
| `part_i_idx` | 1 | Integer | Index of first connected part |
| `part_j_idx` | 1 | Integer | Index of second connected part |
| `joint_type` | 6 | One-hot | {fixed, revolute, prismatic, spherical, planar, free} |
| `axis` | 3 | Unit vector | Joint rotation/translation axis |
| `rel_rotation` | 6 | 6D continuous | Relative rotation (Zhou et al. representation) |
| `rel_translation` | 3 | 3D vector | Relative translation |
| `constraint_limits` | 2 | Floats | \([\theta_{\min}, \theta_{\max}]\) in radians |
| `valid_mask` | 1 | Binary | 1 if valid, 0 if padding |
Model Input Format (projected to \(d\) dimensions):
\[
r = \text{MLP}\big( \big[ \text{Emb}(\text{part\_i\_idx}) \,;\, \text{Emb}(\text{part\_j\_idx}) \,;\, x_{\text{cont}} \big] \big) \in \mathbb{R}^{d}
\]
Symbol-by-symbol:
- \(\text{Emb}(\cdot)\): Learned embedding table for integer indices
- \(x_{\text{cont}}\): The 21 continuous dimensions (joint_type through valid_mask; 23 total minus the two integer indices)
- \(\text{MLP}\): Multi-layer perceptron projecting to \(d\) dimensions
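A minimal sketch of this projection, assuming 21 continuous fields (the 23-dim token minus the two integer indices); the table sizes, weight matrices, and target dimension here are illustrative stand-ins, not the specified hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical sizes: 16 max parts, 8-dim index embeddings,
# 21 continuous dims (joint_type ... valid_mask), model dim d = 32.
EMB = rng.normal(size=(16, 8)) * 0.1          # index embedding table
W1 = rng.normal(size=(8 + 8 + 21, 64)) * 0.1  # stand-in MLP weights
W2 = rng.normal(size=(64, 32)) * 0.1

def encode_relational_token(part_i, part_j, x_cont):
    """Embed the two integer part indices, concatenate with the
    continuous fields, and project to a d-dim model-input token."""
    h = np.concatenate([EMB[part_i], EMB[part_j], x_cont])
    h = np.tanh(h @ W1)
    return h @ W2                              # (32,) relational token

x_cont = np.zeros(21)
x_cont[1] = 1.0                   # joint_type one-hot: revolute
x_cont[6:9] = [0.0, 0.0, 1.0]     # joint axis = z
x_cont[20] = 1.0                  # valid_mask
r = encode_relational_token(0, 1, x_cont)
assert r.shape == (32,)
```

Embedding the integer indices (rather than feeding raw indices to the MLP) keeps part identity categorical, which is the usual choice for discrete fields.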
4.5.2 Symmetry Component Format
| Field | Dims | Description |
|---|---|---|
| `sym_type` | 5 | One-hot: {none, bilateral, radial, translational, helical} |
| `sym_axis` | 3 | Primary symmetry axis (unit vector) |
| `sym_center` | 3 | Center of symmetry (3D point) |
| `sym_count` | 1 | Repetition count (e.g., 4 for chair legs) |
| `sym_spacing` | 1 | Spacing for translational symmetry |
| Total | 13 | — |
4.5.3 Affordance Component Format
| Field | Dims | Description |
|---|---|---|
| `position` | 3 | 3D location in canonical frame |
| `aff_type` | 8 | One-hot: {grasp, support, hinge, button, slider, socket, lid, none} |
| `approach_dir` | 3 | Approach/interaction direction |
| `contact_normal` | 3 | Surface normal at contact point |
| `valid_mask` | 1 | 1 if valid, 0 if padding |
| Total | 18 | Per marker (\(N_a = 5\) markers) |
4.6 Modular Engram Composition
Geometric Component (weighted average):
\[
\hat{e}_j^{\text{geo}} = \sum_{m=1}^{M} w_{j,m}\, \mathcal{T}_{\Delta T_{j,m}}\big( e_m^{\text{geo}} \big)
\]
Symbol-by-symbol:
- \(w_{j,m}\): Retrieval weight (slot \(j\), Engram \(m\))
- \(e_m^{\text{geo}}\): Geometric latent of Engram \(m\)
- \(\Delta T_{j,m}\): Relative transformation from canonical to target pose
- \(\mathcal{T}_{\Delta T}\): Transformation operator (MLP approximation in latent space)
Intuitive Explanation:
Each retrieved geometric Engram is transformed to its target pose, then all are blended by retrieval weights.
Relational Component (concatenation with threshold):
\[
\hat{e}^{\text{rel}} = \bigoplus_{\{(j,m)\,:\; w_{j,m} > \eta\}} e_m^{\text{rel}}
\]
where \(\bigoplus\) denotes token concatenation and \(\eta\) is the confidence threshold.
Intuitive Explanation:
We concatenate relational tokens from high-confidence retrievals. Thresholding prevents noisy low-weight retrievals from polluting the relation set.
Symmetry Component (weighted voting):
\[
\hat{e}_j^{\text{sym}} = \arg\max_{s} \sum_{m=1}^{M} w_{j,m}\, \mathbb{1}\big[ \text{sym\_type}(e_m^{\text{sym}}) = s \big]
\]
Intuitive Explanation:
Symmetry is discrete (an object is either bilaterally symmetric or not). We take a weighted vote over retrieved symmetry types.
Affordance Component (transform and merge):
\[
\hat{e}^{\text{aff}} = \bigcup_{\{(j,m)\,:\; w_{j,m} > \eta\}} \mathcal{T}_{\Delta T_{j,m}}\big( e_m^{\text{aff}} \big)
\]
Intuitive Explanation:
Affordance markers (grasp points, etc.) are transformed to target poses and merged. The union collects all relevant interaction points.
4.7 Enforcing Relational Constraints: Kinematic Consistency Loss
Problem: Cross-attention alone allows the generator to "politely ignore" relational tokens.
Solution: Explicit loss penalizing kinematic violations.
Combined Kinematic Loss:
\[
\mathcal{L}_{\text{kin}} = \mathcal{L}_{\text{joint-pos}} + \mathcal{L}_{\text{axis}} + \mathcal{L}_{\text{limit}}
\]
Joint Position Consistency:
\[
\mathcal{L}_{\text{joint-pos}} = \sum_{(i,j) \in \mathcal{E}} d_{SE(3)}\big( T_i^{-1} T_j,\; T_{ij}^{\text{rel}} \big)^2
\]
Symbol-by-symbol:
- \(\mathcal{E}\): Set of connected part pairs from relational tokens
- \(T_i, T_j\): Estimated poses of slots \(i\) and \(j\)
- \(T_i^{-1} T_j\): Relative pose from \(i\) to \(j\)
- \(T_{ij}^{\text{rel}}\): Expected relative pose from relational token
- \(d_{SE(3)}\): Geodesic distance on SE(3)
Geodesic distance on SE(3):
\[
d_{SE(3)}(T_1, T_2) = \big\| \log\big( R_1^\top R_2 \big) \big\|_F + \lambda \big\| t_1 - t_2 \big\|_2
\]
where \(\log(\cdot)\) is the matrix logarithm mapping rotation to axis-angle.
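For rotations, the norm of the matrix logarithm reduces to the rotation angle, which the trace formula gives directly. A minimal NumPy sketch (function names and the rotation-plus-translation weighting are illustrative):

```python
import numpy as np

def rotation_angle(R):
    """Rotation angle of R, i.e. the axis-angle magnitude of log(R),
    computed via the trace formula cos(theta) = (tr(R) - 1) / 2."""
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    return np.arccos(cos_theta)

def se3_geodesic(T1, T2, lam=1.0):
    """Geodesic-style distance between two 4x4 poses: rotation angle
    of R1^T R2 plus a weighted translation difference."""
    R1, t1 = T1[:3, :3], T1[:3, 3]
    R2, t2 = T2[:3, :3], T2[:3, 3]
    return rotation_angle(R1.T @ R2) + lam * np.linalg.norm(t1 - t2)

def rot_z(theta):
    """Homogeneous 4x4 pose: rotation by theta about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    return T

d = se3_geodesic(rot_z(0.0), rot_z(np.pi / 2))
assert abs(d - np.pi / 2) < 1e-9   # pure 90-degree rotation, no translation
```

Libraries such as SciPy's `Rotation` expose the same log map (`as_rotvec`), but the trace formula above is sufficient for the scalar distance used in the loss.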
Intuitive Explanation:
If a relational token says "the handle is attached to the door with relative pose \(T_{ij}^{\text{rel}}\)," the generated geometry must satisfy this. Deviation is penalized.
Joint Axis Alignment:
\[
\mathcal{L}_{\text{axis}} = \sum_{(i,j) \in \mathcal{E}} \Big( 1 - \big| u_{ij}^\top \text{PC}_1\big( G_{ij} \big) \big| \Big)
\]
Symbol-by-symbol:
- \(u_{ij}\): Joint axis from relational token (unit vector)
- \(G_{ij}\): Geometry near the joint between parts \(i\) and \(j\)
- \(\text{PC}_1(\cdot)\): First principal component (dominant direction)
Intuitive Explanation:
For a revolute joint (like a door hinge), the local geometry should be aligned with the joint axis. A poorly-aligned hinge would have high loss.
Joint Limit Enforcement:
\[
\mathcal{L}_{\text{limit}} = \sum_{(i,j) \in \mathcal{E}} \Big[ \text{ReLU}\big( \theta_{ij} - \theta_{\max} \big) + \text{ReLU}\big( \theta_{\min} - \theta_{ij} \big) \Big]
\]
Symbol-by-symbol:
- \(\theta_{ij}\): Current joint angle between parts \(i\) and \(j\)
- \([\theta_{\min}, \theta_{\max}]\): Joint limits from relational token
- \(\text{ReLU}\): ReLU (only penalize violations)
Intuitive Explanation:
A door can only open so far. If the generated geometry implies the door is open beyond its limit, this is penalized.
Relational Tokens also Feed Pose Refinement:
\[
T_j^{\text{ref}} = \text{PoseRefine}\big( T_j^{\text{coarse}},\; k_j,\; \hat{e}^{\text{rel}} \big)
\]
Intuitive Explanation:
Beyond cross-attention in the generator, relational tokens also directly inform pose estimation. This creates multiple pathways for relational information, making it harder to ignore.
4.8 Gravity-Aligned Pose Estimation (No Ground Truth)
Challenge: Single-view pose estimation usually needs GT supervision.
Solution: Use robot IMU + multi-view consistency.
Gravity Prior Loss:
\[
\mathcal{L}_{\text{grav}} = \sum_{j=1}^{J} m_j \Big( 1 - \big( R_j \hat{u} \big)^\top (-g) \Big)
\]
Symbol-by-symbol:
- \(g\): Gravity direction from IMU (unit vector)
- \(R_j\): Estimated rotation for slot \(j\)
- \(\hat{u}\): Canonical up-vector
- \(m_j\): 1 if slot \(j\) should be gravity-aligned (supports, tables, floors), else 0
Intuitive Explanation:
A table surface should be horizontal (aligned with gravity). We use the robot's IMU to know which way is "up" and penalize misalignment.
Multi-View Consistency Loss:
\[
\mathcal{L}_{\text{mv}} = \sum_{v \neq v'} \big\| \hat{I}_v - \mathcal{W}_{v' \to v}\big( \hat{I}_{v'} \big) \big\|_1
\]
Symbol-by-symbol:
- \(\hat{I}_v\): Rendering from viewpoint \(v\)
- \(\mathcal{W}_{v' \to v}\): Image warping from \(v'\) to \(v\) using estimated depth/pose
Intuitive Explanation:
If the 3D reconstruction is correct, rendering from different viewpoints and warping should produce consistent images. Inconsistencies indicate pose or geometry errors.
Combined Pose Loss (No GT):
\[
\mathcal{L}_{\text{pose}} = \lambda_{\text{grav}} \mathcal{L}_{\text{grav}} + \lambda_{\text{mv}} \mathcal{L}_{\text{mv}} + \lambda_{\text{kin}} \mathcal{L}_{\text{kin}}
\]
Intuitive Explanation:
Without ground-truth poses, we combine: (1) gravity alignment (from IMU), (2) multi-view consistency (self-supervised), (3) kinematic constraints (from relational tokens). Together, these provide sufficient signal to learn reasonable poses.
4.9 Multi-Pathway Generator Conditioning
Pathway A: AdaLN for Geometry
\[
h^{(l+1)} = \gamma\big( \hat{e}^{\text{geo}} \big) \odot \frac{h^{(l)} - \mu}{\sigma} + \beta\big( \hat{e}^{\text{geo}} \big)
\]
Symbol-by-symbol:
- \(h^{(l)} \in \mathbb{R}^{B \times C}\): Activation at layer \(l\) (batch \(B\), channels \(C\))
- \((h - \mu)/\sigma\): Layer normalization
- \(\mu, \sigma\): Mean and std (per-sample, across channels for MLPs; per-token for Transformers)
- \(\gamma(\hat{e}^{\text{geo}})\): Scale from geometric prior
- \(\beta(\hat{e}^{\text{geo}})\): Shift from geometric prior
- \(\odot\): Element-wise multiplication
Intuitive Explanation:
AdaLN injects the geometric prior by controlling the scale and shift of activations at every layer. This is a strong form of conditioning—the prior directly modulates the "communication channels" in the network.
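The modulation can be sketched in a few lines of NumPy; the projection matrices standing in for \(\gamma(\cdot)\) and \(\beta(\cdot)\) are random placeholders here, where a real model would learn them:

```python
import numpy as np

def ada_ln(h, e_geo, W_gamma, W_beta, eps=1e-5):
    """Adaptive LayerNorm: normalize activations per sample, then
    scale and shift using parameters predicted from the geometric
    prior e_geo.

    h: (B, C) activations; e_geo: (B, d) composed geometric latent;
    W_gamma, W_beta: (d, C) projections standing in for learned heads."""
    mu = h.mean(axis=1, keepdims=True)
    sigma = h.std(axis=1, keepdims=True)
    h_norm = (h - mu) / (sigma + eps)     # standard LayerNorm
    gamma = e_geo @ W_gamma               # (B, C) per-channel scale
    beta = e_geo @ W_beta                 # (B, C) per-channel shift
    return gamma * h_norm + beta

rng = np.random.default_rng(3)
B, C, d = 2, 16, 8
h = rng.normal(size=(B, C))
e_geo = rng.normal(size=(B, d))
out = ada_ln(h, e_geo, rng.normal(size=(d, C)), rng.normal(size=(d, C)))
assert out.shape == (B, C)
```

Because \(\gamma\) and \(\beta\) multiply and shift every channel at every layer, the geometric prior cannot be bypassed the way an attended-to token can; this is what makes AdaLN the "strong" conditioning pathway.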
Pathway B: Cross-Attention for Relations
Expanded Cross-Attention:
\[
\text{Attn}(Q, K, V) = \text{softmax}\!\left( \frac{Q K^\top}{\sqrt{d}} + M \right) V
\]
where queries \(Q\) come from generator features, keys/values \(K, V\) come from relational tokens \(\hat{e}^{\text{rel}}\), and \(M_t = -\infty\) if token \(t\) is invalid (padding).
Intuitive Explanation:
The generator can "ask questions" about relational structure. "Is there a joint here? What type? What axis?" The cross-attention mechanism retrieves relevant relational information.
Pathway C: Per-Slot Symmetry
\[
f_j(x) = \frac{1}{n_s} \sum_{i=0}^{n_s - 1} f_j^{\text{base}}\big( R_{2\pi i / n_s}\, x_j^{\text{loc}} \big)
\]
Symbol-by-symbol:
- \(f_j(x)\): Final field value for slot \(j\) at point \(x\)
- \(f_j^{\text{base}}\): Base generator output (before symmetry)
- \(n_s\): Symmetry count (e.g., 4 for chair legs)
- \(R_{2\pi i / n_s}\): Rotation by \(2\pi i / n_s\) around symmetry axis
- \(x_j^{\text{loc}}\): Point transformed to slot's local frame
- \(S\): Reflection across symmetry plane (replaces the rotations for bilateral symmetry)
Intuitive Explanation:
Symmetry is enforced by averaging the field over symmetric transformations. For 4-way radial symmetry (like chair legs), we evaluate the base field at 4 rotated positions and average. This guarantees perfect symmetry.
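Group-averaging guarantees symmetry by construction: the averaged field is invariant to any rotation in the symmetry group, even when the base field is not. A minimal NumPy sketch (function names are illustrative):

```python
import numpy as np

def rot_about_axis(axis, theta):
    """Rodrigues rotation matrix about a unit axis."""
    a = np.asarray(axis, dtype=float)
    K = np.array([[0, -a[2], a[1]],
                  [a[2], 0, -a[0]],
                  [-a[1], a[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def symmetrized_field(f_base, x, axis, n_s):
    """Enforce n_s-fold radial symmetry by averaging the base field
    over the n_s rotated copies of the query point."""
    vals = [f_base(rot_about_axis(axis, 2 * np.pi * i / n_s) @ x)
            for i in range(n_s)]
    return np.mean(vals)

# A deliberately asymmetric base field ...
f_base = lambda p: p[0] + 0.1 * p[1] ** 2
axis = np.array([0.0, 0.0, 1.0])
x = np.array([1.0, 0.5, 0.2])
v1 = symmetrized_field(f_base, x, axis, n_s=4)
# ... whose symmetrized version is invariant to 90° rotations of x.
v2 = symmetrized_field(f_base, rot_about_axis(axis, np.pi / 2) @ x, axis, 4)
assert abs(v1 - v2) < 1e-9
```

The same construction with a reflection matrix in place of the rotations yields bilateral symmetry.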
Pathway D: Affordance Head
Intuitive Explanation:
A dedicated head predicts affordance heatmaps (grasp points, support surfaces) using both generator features and retrieved affordance markers. This provides explicit manipulation-relevant outputs.
4.10 Composition Operators (Generator-Specific)
SDF Composition (Smooth Minimum):
\[
f(x) = -\frac{1}{\beta} \log \sum_{j=1}^{J} \exp\big( -\beta f_j(x) \big)
\]
Symbol-by-symbol:
- \(f_j(x)\): SDF value from slot \(j\) at point \(x\) (negative inside, positive outside)
- \(\beta\): Sharpness parameter (larger = closer to hard min)
Limit behavior:
- \(\beta \to \infty\): \(f(x) \to \min_j f_j(x)\) (hard union)
- Small \(\beta\): smooth blend across part boundaries
- : (smooth blend)
Intuitive Explanation:
SDFs represent shapes as distance fields. To combine shapes (union), we take the minimum. Soft-min provides smooth transitions at part boundaries.
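The soft-min is a negative log-sum-exp, which should be computed with a max/min shift for numerical stability. A minimal NumPy sketch:

```python
import numpy as np

def soft_min(values, beta):
    """Smooth minimum via negative log-sum-exp.

    Large beta approaches the hard min (sharp union of SDFs);
    small beta blends the parts smoothly."""
    v = np.asarray(values, dtype=float)
    m = v.min()                                        # shift for stability
    return m - np.log(np.exp(-beta * (v - m)).sum()) / beta

sdfs = [0.3, -0.1, 0.5]          # per-slot SDF values at one query point
hard = min(sdfs)
assert abs(soft_min(sdfs, beta=1000.0) - hard) < 1e-3  # ~hard union
assert soft_min(sdfs, beta=2.0) < hard                 # smooth lower bound
```

Note the soft-min always lies at or below the hard min, so the blended surface slightly inflates the union near part boundaries; this is the usual trade-off for differentiability.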
NeRF Composition (Density Mixture):
\[
\sigma(x) = \sum_{j=1}^{J} m_j(x)\, \sigma_j(x), \qquad c(x) = \frac{\sum_{j} m_j(x)\, \sigma_j(x)\, c_j(x)}{\sum_{j} m_j(x)\, \sigma_j(x) + \varepsilon}
\]
Symbol-by-symbol:
- \(\sigma_j(x)\): Density from slot \(j\) at point \(x\)
- \(c_j(x)\): Color from slot \(j\)
- \(m_j(x)\): Spatial weight for slot \(j\) (from lifted attention mask)
- \(\varepsilon\): Small constant for numerical stability
Intuitive Explanation:
NeRF represents scenes as density and color fields. We combine slot contributions using spatial weights, with colors weighted by density (denser regions contribute more to final color).
3DGS Composition (Concatenation + Pruning):
\[
\mathcal{G} = \text{Prune}_{\alpha_{\min}}\Big( \bigcup_{j=1}^{J} \mathcal{G}_j \Big)
\]
Symbol-by-symbol:
- \(\mathcal{G}_j\): Set of Gaussians from slot \(j\)
- \(\bigcup\): Set union (concatenation)
- \(\text{Prune}_{\alpha_{\min}}\): Remove Gaussians with opacity below \(\alpha_{\min}\)
Intuitive Explanation:
3D Gaussian Splatting represents scenes as collections of Gaussian primitives. We simply concatenate Gaussians from all slots, then prune low-opacity ones to remove redundancy.
5. Staged Memory Construction Protocol
Key Insight: Full Engram annotation is expensive. We propose staged construction for practical deployment.
5.1 Stage 1: Minimal Viable Memory (EG-3D Lite)
Required: \(e^{\text{geo}}\), \(e^{\text{sym}}\), \(T^{\text{can}}\) (coarse)
Annotation effort: Low (automatic from 3D scans)
- Geometry: PointNet++ encoder on mesh/point cloud
- Symmetry: PCA + RANSAC detection (fully automatic)
- Canonical frame: Gravity + principal axis alignment (automatic)
Capabilities: Basic shape retrieval, symmetry enforcement, rigid object reconstruction
5.2 Stage 2: Add Relational Structure
Added: \(e^{\text{rel}}\)
Annotation effort: Medium (semi-automatic)
- Source: PartNet-Mobility dataset, or learned from articulation videos
Capabilities: Articulated object reconstruction, kinematic constraint enforcement
5.3 Stage 3: Add Affordance
Added: \(e^{\text{aff}}\)
Annotation effort: Medium-High (from interaction data)
- Source: HOI videos, robot demonstrations, geometric priors
Capabilities: Full affordance-grounded reconstruction, direct grasp point prediction
5.4 Staged Training Protocol
Phase 1: Train with geo + sym only
→ Learn basic retrieval and composition
Phase 2: Freeze geo/sym modules, add rel
→ Learn relational cross-attention
→ Enable kinematic consistency loss
Phase 3: Freeze geo/sym/rel, add aff
→ Learn affordance head
→ Fine-tune with prioritized slot alignment
6. EG-3D Lite: Minimal Configuration
| Component | EG-3D Full | EG-3D Lite |
|---|---|---|
| \(e^{\text{geo}}\) (AdaLN) | ✅ | ✅ |
| \(e^{\text{sym}}\) (symmetry decoder) | ✅ | ✅ |
| \(e^{\text{rel}}\) (cross-attention) | ✅ | ❌ |
| \(e^{\text{aff}}\) | ✅ | ❌ |
| Kinematic loss | ✅ | ❌ |
| Affordance head | ✅ | ❌ |
EG-3D Lite Loss:
\[
\mathcal{L}_{\text{Lite}} = \mathcal{L}_{\text{rec}} + \lambda_{\text{inv}} \mathcal{L}_{\text{inv}} + \lambda_{\text{sym}} \mathcal{L}_{\text{sym}}
\]
7. Full Training Objective
Reconstruction Loss:
\[
\mathcal{L}_{\text{rec}} = \big\| \hat{I} - I^{*} \big\|_1 + \lambda_{\text{CD}}\, \text{CD}\big( X, X^{*} \big)
\]
Chamfer Distance:
\[
\text{CD}(X, Y) = \frac{1}{|X|} \sum_{x \in X} \min_{y \in Y} \| x - y \|_2^2 + \frac{1}{|Y|} \sum_{y \in Y} \min_{x \in X} \| x - y \|_2^2
\]
Intuitive Explanation:
Reconstruction loss has two parts: (1) rendered images should match target images (L1 loss), (2) reconstructed geometry should match ground-truth geometry (Chamfer distance measures point cloud similarity).
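The Chamfer term is straightforward to compute for small point clouds with a full pairwise distance matrix (larger clouds would use a KD-tree); a minimal NumPy sketch:

```python
import numpy as np

def chamfer_distance(X, Y):
    """Symmetric Chamfer distance between point clouds X (N, 3) and
    Y (M, 3): mean squared distance from each point to its nearest
    neighbor in the other cloud, summed over both directions."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)  # (N, M)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

X = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
assert chamfer_distance(X, X) == 0.0            # identical clouds
Y = X + np.array([0.1, 0.0, 0.0])               # shifted by 0.1 along x
assert abs(chamfer_distance(X, Y) - 0.02) < 1e-12
```

The symmetric form penalizes both missing geometry (ground-truth points with no nearby reconstruction) and hallucinated geometry (reconstructed points far from the ground truth).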
Surface Regularization (Eikonal, for SDF):
\[
\mathcal{L}_{\text{eik}} = \mathbb{E}_{x}\Big[ \big( \| \nabla_x f(x) \|_2 - 1 \big)^2 \Big]
\]
Intuitive Explanation:
A valid SDF should have unit gradient magnitude (\(\|\nabla f\| = 1\)) almost everywhere. This regularization prevents degenerate solutions.
8. Computational Considerations
| Component | Complexity | Notes |
|---|---|---|
| Slot Attention | \(O(NJd)\) | Linear in tokens |
| Memory Retrieval | \(O(JMd)\) | Parallelizable; use approximate NN for large \(M\) |
| Cross-Attention | \(O(N_{\text{rel}}\, d)\) per query | Modest overhead |
| Kinematic Loss | \(O(|\mathcal{E}|)\) | Per-edge computation |
| Composition | \(O(J)\) per query point | Soft-min/mixture |
9. Failure Modes and Mitigations
9.1 Slot Collapse to Semantic Blobs
Mitigations (by priority):
- Contact signal supervision (highest priority)
- Geometric affordance priors
- Relational consistency
- Diversity regularization
9.2 Generator Ignoring Relational Tokens
Mitigations:
- Kinematic consistency loss (explicit constraint)
- Feed relational tokens to pose refinement head
- Monitor cross-attention weight mass on relational tokens; alert if too small
9.3 Pose Estimation Without GT
Mitigations:
- IMU gravity prior
- Multi-view consistency
- Kinematic loss provides indirect supervision
10. Hypotheses and Future Work
10.1 Hypotheses (To Be Validated)
- Prioritized supervision → more functionally meaningful slots
- Kinematic consistency loss → improved articulated object reconstruction
- EG-3D Lite → meaningful gains even without rel/aff
- Staged construction → practical deployment with minimal effort
10.2 Proposed Experiments
| Experiment | Metric |
|---|---|
| Slot alignment quality | IoU with functional part GT |
| Kinematic consistency | Joint error on articulated objects |
| EG-3D Lite vs Full | Ablation study |
| Retrieval vs assembly | Comparison with exemplar methods |
11. Conclusion
We presented Engram-Guided 2D-to-3D (EG-3D), a structural assembly framework for single-view 3D reconstruction. Key contributions:
- Modular Engram Architecture with explicit composition operators
- Prioritized Slot-Part Alignment: contact → geometric → relational → diversity
- Kinematic Consistency Loss enforcing relational constraints
- Gravity-Aligned Pose Estimation for GT-free settings
- Staged Memory Construction enabling incremental deployment
- Detailed mathematical specifications with intuitive explanations for every equation
This paper provides a complete foundation for implementation and empirical validation.
12. Engram-Guided Training Cycle
# Algorithm 1: Engram-Guided Training Cycle
# Input: EngramBank M, Dataset D, Stages S = {Lite, +Rel, +Aff}
# Output: Trained EG-3D model and VLA policy
# =======================================================
# STAGE-WISE TRAINING OF EG-3D
# =======================================================
for stage_s in S:                                      # Iterate over Lite → +Rel → +Aff stages
    freeze_unused_components(stage_s)                  # Freeze modules not used in the current stage
    for (I, P, G_star) in D:                           # Sample a minibatch (image, prompt, GT 3D)
        # ---------------------------------------------------
        # Visual-language encoding and part-level slot extraction
        # ---------------------------------------------------
        Z_tokens = E_VL(I, P)                          # Fused visual-language tokens {z_t}
        K_slots = SlotAttention(Z_tokens)              # Extract part-level structural keys {k_j}
        # ---------------------------------------------------
        # Engram retrieval and modular structural composition
        # ---------------------------------------------------
        Z_e = RetrieveAndCompose(                      # Retrieve and assemble Engram components
            K_slots, M, stage_s)                       # Z_e contains geo/sym/(rel)/(aff) by stage
        # ---------------------------------------------------
        # 3D generation and coarse pose estimation
        # ---------------------------------------------------
        G, T_coarse = G_theta(I, Z_e)                  # Predict 3D geometry G and coarse slot poses {T_j^coarse}
        # ---------------------------------------------------
        # Pose refinement using relational tokens (if enabled)
        # ---------------------------------------------------
        if stage_s >= "+Rel":                          # Only refine poses when relational structure is available
            T_ref = PoseRefine(                        # Refine poses with relational and slot information
                T_coarse, K_slots, Z_e.rel)            # Output refined poses {T_j^ref}
        else:
            T_ref = T_coarse                           # Use coarse poses in Lite stage
        # ---------------------------------------------------
        # Compute stage-dependent training losses
        # ---------------------------------------------------
        L = (L_rec(G, G_star)                          # Reconstruction loss on geometry and rendering
             + λ_inv * L_inv(K_slots)                  # Invariance loss to enforce structure-only keys
             + λ_sym * L_sym(G, Z_e.sym))              # Symmetry regularization loss
        if stage_s >= "+Rel":                          # Additional losses for relational stages
            L += (λ_slot * L_slot_align(K_slots)       # Slot-part alignment loss
                  + λ_kin * L_kin(G, T_ref, Z_e.rel)   # Kinematic consistency loss
                  + λ_pose * L_pose(T_ref, I))         # Pose loss (gravity + multi-view + kinematic)
        if stage_s == "+Aff":                          # Additional loss for affordance stage
            L += λ_aff * L_aff(G, Z_e.aff)             # Affordance supervision loss
        # ---------------------------------------------------
        # Update only the parameters active in the current stage
        # ---------------------------------------------------
        update(params_of(stage_s), loss=L)             # Stage-wise parameter update
# =======================================================
# TRAINING VLA POLICY WITH FROZEN EG-3D
# =======================================================
freeze(EG3D)                                           # Freeze all EG-3D modules during policy learning
for episode in RL_episodes:                            # Iterate over reinforcement learning episodes
    I_t, P = observe()                                 # Observe current image and task prompt
    # -------------------------------------------------------
    # Forward pass through frozen EG-3D for structure-aware state
    # -------------------------------------------------------
    G_t, Z_e_t = EG3D(I_t, P)                          # Obtain reconstructed 3D and retrieved structure tokens
                                                       # Z_e_t.rel / Z_e_t.aff may be empty depending on stage
    # -------------------------------------------------------
    # Vision-Language-Action policy inference and execution
    # -------------------------------------------------------
    A_t = π_VLA(Render(G_t), P)                        # Predict action from rendered 3D and prompt
    execute(A_t)                                       # Execute action in simulator or robot
    # -------------------------------------------------------
    # Compute reward with structural shaping and safety penalties
    # -------------------------------------------------------
    r_t = (r_success()                                 # Task success reward
           + α * r_struct(Z_e_t.rel, Z_e_t.aff, A_t)   # Structural shaping reward using rel/aff tokens
           - β * r_collision()                         # Collision penalty
           - γ * r_torque())                           # Excessive torque penalty
    # -------------------------------------------------------
    # Update VLA policy using PPO or SAC
    # -------------------------------------------------------
    update(π_VLA, algorithm="PPO/SAC", reward=r_t)
References
[1] D. Tochilkin et al. "TripoSR." arXiv:2403.02151, 2024.
[2] Y. Hong et al. "LRM." arXiv:2311.04400, 2023.
[3] X. Long et al. "Wonder3D." arXiv:2311.00005, 2023.
[4] M. J. Kim et al. "OpenVLA." arXiv:2406.09246, 2024.
[5] H. Shi et al. "MemoryVLA." arXiv:2508.19236, 2025.
[6] F. Locatello et al. "Slot Attention." NeurIPS, 2020.
[7] K. Mo et al. "PartNet." CVPR, 2019.
[8] Y. Zhou et al. "Continuity of Rotation Representations." CVPR, 2019.
[9] B. Mildenhall et al. "NeRF." ECCV, 2020.
[10] J. J. Park et al. "DeepSDF." CVPR, 2019.
[11] B. Kerbl et al. "3D Gaussian Splatting." ACM TOG, 2023.
[12] W. Peebles and S. Xie. "DiT." CVPR, 2023.
[13] K. Grauman et al. "Ego4D." CVPR, 2022.
[14] S. Brahmbhatt et al. "ContactDB." CVPR, 2019.
