Engram-Guided Structural Memory for Single-View 3D Reconstruction in Humanoid Robotics

Community Article Published January 26, 2026

Author: GrooveJ (Danny Lee)
Affiliation: Independent Researcher (With the assistance of Many LLMs)

Conceptual paper (no experimental results).
This article presents a complete mathematical specification with intuitive explanations of all operators and symbols, designed for future implementation and empirical validation.



Abstract

Single-view 3D reconstruction under occlusion is a critical bottleneck in humanoid robotics: functional structures—handles, joints, supports—are frequently hidden, causing grasp failures and unsafe actions. While Large Reconstruction Models (LRMs) produce visually plausible shapes, they lack mechanisms for reusing structural knowledge across objects. Inspired by the principle that decoupling memory from computation yields significant reasoning gains, we propose Engram-Guided 2D-to-3D (EG-3D): a framework that integrates structure-only 3D memory with LLM-conditioned perception and Vision-Language-Action (VLA) control.

Given an RGB image and a task prompt, an LLM-conditioned encoder extracts part-level structural keys via Slot Attention, anchored to functional parts through a prioritized supervision hierarchy (contact signals → geometric priors → relational consistency). These keys retrieve Modular Engrams—decomposed into geometric, relational, symmetry, and affordance subcomponents—each injected through dedicated pathways. We enforce structural constraints via kinematic consistency loss ensuring relational tokens are not ignored by the generator. For single-view pose estimation without ground truth, we leverage gravity-aligned priors (from robot IMU) combined with multi-view consistency. We also introduce a staged memory construction protocol that enables deployment with minimal initial annotation (geometry + symmetry only), progressively adding relations and affordances.

This paper provides: (1) a complete mathematical specification with detailed explanations for every equation, (2) a minimal "EG-3D Lite" configuration for practical deployment, (3) explicit mechanisms preventing generators from ignoring structural constraints, and (4) clear differentiation from prior retrieval-based methods.


1. Introduction

1.1 The Problem: Occlusion-Induced Failures in Manipulation

Humanoid robots increasingly rely on monocular head-mounted cameras. However, real-world manipulation involves inherent underspecification: crucial functional elements—door handles, hinges, support legs, tool tips, internal joints—are frequently occluded by the object itself or by the robot's body. When these structures are misreconstructed or not inferred at all, robots select incorrect grasp points, apply unsafe forces, collide unexpectedly, or fail tasks entirely.

1.2 Industry Context: Why Now?

The strategic importance of single-view 3D reconstruction for robotics has become increasingly apparent. Recent acquisitions and investments around single-image-to-3D asset generation highlight the urgency of robust 2D→3D pipelines for embodied systems. Major technology companies are actively acquiring startups in this space, recognizing that perception-to-action pipelines are critical bottlenecks for deploying robots at scale.

1.3 The Gap: Missing Structural Memory

Recent LRMs (TripoSR, LRM, Wonder3D) scale transformers and diffusion to generate plausible geometry from single images. Yet they exhibit systematic structural errors: wrong part counts, symmetry collapse, hallucinated connections, or missing functional subparts. We argue that a central missing ingredient is reusable structural memory.

1.4 Inspiration: Conditional Memory as a Modeling Primitive

The Engram-Guided Structural Memory concept demonstrates that decoupling memory from computation yields significant reasoning gains. We extend this principle to 3D perception.

1.5 Key Technical Contributions

  1. Modular Engram Architecture with explicit composition operators and kinematic consistency loss ensuring relational constraints are enforced

  2. Prioritized Slot-Part Alignment: contact/interaction signals (strongest) → geometric priors → relational consistency (weakest)

  3. Gravity-Aligned Pose Estimation: leveraging robot IMU for coarse alignment without ground-truth poses

  4. Staged Memory Construction: minimal viable memory (geo + sym) → progressive enrichment (rel → aff)

  5. Clear Differentiation from retrieval-based priors: EG-3D is a structural assembly system, not a retrieval prior


2. Related Work and Differentiation

2.1 Retrieval-Based 3D Priors vs. Structural Assembly Memory

A key question is: How does EG-3D differ from existing retrieval-based 3D priors?

| Aspect | Retrieval-Based Priors | EG-3D (Ours) |
|---|---|---|
| Retrieval unit | Global shape or exemplar | Part-level structural motifs |
| Memory content | Full shapes (with texture/appearance) | Structure-only (geo + rel + sym + aff) |
| Composition | Single retrieval or blending | SE(3)-aware multi-part assembly |
| Key space | Category/appearance-based | Structure-only (invariance-regularized) |
| Constraints | Implicit (hope generator learns) | Explicit (kinematic consistency loss) |
| Affordance | Not modeled | First-class component |
| Action loop | Not considered | VLA integration |

Our key insight: EG-3D is not "retrieval + generation" but "structural assembly + generation". We retrieve parts, align them via SE(3), enforce kinematic constraints, and compose them with explicit operators.

2.2 Memory in Vision-Language-Action Models

Recent VLA models incorporate memory:

  • MemoryVLA: Temporal memory for long-horizon tasks
  • IVE: Episodic memory for exploration

Our work addresses structural memory (reusable 3D motifs across objects), complementary to temporal/episodic memory.


3. Notation and Preliminaries

3.1 Inputs and Embeddings

| Symbol | Type | Description |
|---|---|---|
| \(\mathcal{I}\) | \(\mathbb{R}^{H \times W \times 3}\) | Input RGB image with height \(H\), width \(W\), 3 color channels |
| \(P\) | \(\mathcal{V}^*\) | Natural-language prompt as a sequence of tokens from vocabulary \(\mathcal{V}\) |
| \(\{z_t\}_{t=1}^{N+T}\) | \(\mathbb{R}^D\) each | Fused token sequence: \(N\) visual tokens + \(T\) text tokens, each \(D\)-dimensional |
| \(D\) | \(\mathbb{N}\) | Embedding dimension (typical: 768 or 1024) |

3.2 Part-Level Queries (Slots)

| Symbol | Type | Description |
|---|---|---|
| \(J\) | \(\mathbb{N}\) | Number of part-level query slots (typical: 8–16) |
| \(k_j\) | \(\mathbb{R}^D\) | Structure-only query key for slot \(j\) |
| \(a_{j,t}\) | \([0,1]\) | Attention weight: how much slot \(j\) attends to token \(t\) |

3.3 Modular Engram Structure

Each Engram \(\mathbf{E}_m\) is a modular tuple:

\[ \mathbf{E}_m = (e_m^{\text{geo}}, e_m^{\text{rel}}, e_m^{\text{sym}}, e_m^{\text{aff}}, T_m^c) \]

Symbol-by-symbol explanation:

  • \(m\): Index of the Engram in memory (\(m = 1, \ldots, M\))
  • \(e_m^{\text{geo}} \in \mathbb{R}^{d_g}\): Geometric component — latent vector encoding 3D shape
  • \(e_m^{\text{rel}} \in \mathbb{R}^{N_r \times d_r}\): Relational component — \(N_r\) tokens encoding part connectivity
  • \(e_m^{\text{sym}} \in \mathbb{R}^{d_s}\): Symmetry component — descriptor of symmetry type and axes
  • \(e_m^{\text{aff}} \in \mathbb{R}^{N_a \times d_a}\): Affordance component — \(N_a\) markers for interaction points
  • \(T_m^c \in SE(3)\): Canonical frame — reference pose for spatial alignment

| Component | Injection Pathway | Required in Minimal Config? |
|---|---|---|
| \(e_m^{\text{geo}}\) | AdaLN | ✅ Yes |
| \(e_m^{\text{rel}}\) | Cross-Attention + Kinematic Loss | ❌ Optional (Stage 2) |
| \(e_m^{\text{sym}}\) | Symmetry-Aware Decoder | ✅ Yes |
| \(e_m^{\text{aff}}\) | Auxiliary Head | ❌ Optional (Stage 3) |
| \(T_m^c\) | SE(3) Alignment | ✅ Yes (coarse) |

3.4 Memory Structure

| Symbol | Type | Description |
|---|---|---|
| \(M\) | \(\mathbb{N}\) | Total number of Engrams in memory |
| \(\mathcal{M}\) | \(\{(k_m, \mathbf{E}_m)\}_{m=1}^{M}\) | Memory bank as key–Engram pairs |
| \(k_m\) | \(\mathbb{R}^D\) | Retrieval key for Engram \(m\) |
| \(\alpha_{j,m}\) | \([0,1]\) | Retrieval weight: how much slot \(j\) retrieves Engram \(m\) |
| \(\tau\) | \(\mathbb{R}^+\) | Temperature for softmax retrieval (typical: 0.05–0.2) |

3.5 3D Generator and Action

| Symbol | Type | Description |
|---|---|---|
| \(G\) | — | Continuous 3D representation (NeRF, SDF, or 3DGS) |
| \(G_\theta\) | — | 3D generator network with parameters \(\theta\) |
| \(\text{Render}(G)\) | \(\mathbb{R}^{H' \times W' \times 3}\) | Differentiable rendering of \(G\) |
| \(A\) | \(\mathbb{R}^{K}\) | Action vector (\(K\) dimensions, e.g., 7 for 6-DoF + gripper) |
| \(\pi_{\text{VLA}}\) | — | Vision-Language-Action policy |

4. Core Framework

4.1 Pipeline Overview

The complete pipeline consists of five stages, given by Equations (1)–(5) below.


Equation (1): Visual-Language Encoding

\[ \{z_t\}_{t=1}^{N+T} = E_{\text{VL}}(P, \mathcal{I}), \quad z = \text{Pool}(\{z_t\}) \]

Symbol-by-symbol:

  • \(E_{\text{VL}}\): Visual-Language encoder (any VLM backbone)
  • \(P\): Input text prompt
  • \(\mathcal{I}\): Input RGB image
  • \(\{z_t\}_{t=1}^{N+T}\): Output token sequence (\(N\) visual + \(T\) text tokens)
  • \(\text{Pool}(\cdot)\): Pooling operation (mean or attention pooling)
  • \(z\): Global embedding summarizing the entire input

Intuitive Explanation:
This is the robot's "perception + understanding" step. The encoder looks at the image and reads the instruction, producing a sequence of tokens that represent both visual content and linguistic intent. The pooled vector \(z\) captures the overall context.

Why this matters:
Joint visual-language encoding allows the system to focus on task-relevant parts. For "grab the handle," the encoder emphasizes handle-related visual regions.


Equation (2): Structure-Only Key Extraction

\[ \{k_j\}_{j=1}^{J} = \mathcal{K}(\{z_t\}) \]

Symbol-by-symbol:

  • \(\mathcal{K}\): Key extraction function (implemented via Slot Attention)
  • \(\{z_t\}\): Input token sequence from encoder
  • \(k_j \in \mathbb{R}^D\): Output key for slot \(j\)
  • \(J\): Number of slots (part hypotheses)

Expanded form (Slot Attention):

\[ k_j = \sum_{t=1}^{N+T} a_{j,t} \cdot z_t \]

where attention weights are computed as:

\[ a_{j,t} = \frac{\exp(q_j^\top k_t / \sqrt{D})}{\sum_{j'=1}^{J} \exp(q_{j'}^\top k_t / \sqrt{D})} \]

Critical constraint (competition across slots):

\[ \sum_{j=1}^{J} a_{j,t} = 1 \quad \text{for each token } t \]

Intuitive Explanation:
Slot Attention splits the scene into parts. Each slot "competes" to explain different tokens—one slot might capture "the handle," another "the door body," another "the hinge." The key \(k_j\) summarizes what slot \(j\) has captured.

Why the competition constraint?
Without it, all slots might attend to everything equally, producing identical keys. Competition forces specialization.
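The competition-normalized read-out above can be sketched in a few lines of NumPy. This is a simplified, hypothetical sketch (real Slot Attention iterates this with per-slot renormalization and a GRU update); it shows only the softmax-over-slots normalization and the weighted sum defining \(k_j\):

```python
import numpy as np

def slot_readout(q, tokens):
    """One slot read-out step: softmax over the SLOT axis (competition),
    then k_j = sum_t a[j, t] * z_t.
    q: (J, D) slot queries; tokens: (N+T, D) fused tokens z_t."""
    D = q.shape[1]
    logits = q @ tokens.T / np.sqrt(D)            # (J, N+T) scaled dot products
    logits -= logits.max(axis=0, keepdims=True)   # numerical stability
    a = np.exp(logits)
    a /= a.sum(axis=0, keepdims=True)             # sum_j a[j, t] == 1 per token
    keys = a @ tokens                             # (J, D) slot keys
    return keys, a

rng = np.random.default_rng(0)
keys, a = slot_readout(rng.normal(size=(8, 16)), rng.normal(size=(40, 16)))
assert np.allclose(a.sum(axis=0), 1.0)            # competition constraint holds
```

Normalizing over the slot axis (rather than over tokens, as in standard attention) is exactly what makes slots compete for tokens.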


Equation (3): Modular Engram Retrieval with SE(3) Alignment

\[ \mathbf{Z}_e = \mathcal{R}_{\text{SE(3)}}(\{k_j\}; \mathcal{M}) \]

Symbol-by-symbol:

  • \(\mathcal{R}_{\text{SE(3)}}\): Retrieval function with SE(3) alignment
  • \(\{k_j\}\): Query keys from slots
  • \(\mathcal{M}\): Memory bank
  • \(\mathbf{Z}_e\): Retrieved and composed Engram components

Expanded form (retrieval weights):

\[ s_{j,m} = \frac{k_j^\top k_m}{\|k_j\|_2 \|k_m\|_2} \]

\[ \alpha_{j,m} = \frac{\exp(s_{j,m}/\tau)}{\sum_{l=1}^{M}\exp(s_{j,l}/\tau)} \]

Symbol-by-symbol:

  • \(s_{j,m} \in [-1, 1]\): Cosine similarity between query \(k_j\) and memory key \(k_m\)
  • \(\tau > 0\): Temperature (lower = sharper retrieval, higher = softer)
  • \(\alpha_{j,m} \in [0, 1]\): Normalized retrieval weight (sums to 1 over \(m\))

Intuitive Explanation:
For each slot's key, we find the most similar Engrams in memory. The softmax converts similarities into probability-like weights. Temperature controls sharpness: \(\tau \to 0\) gives winner-take-all, \(\tau \to \infty\) gives uniform weighting.
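The retrieval weights are straightforward to compute. A minimal NumPy sketch of the cosine-plus-temperature-softmax step (helper name is ours, not part of the framework):

```python
import numpy as np

def retrieval_weights(slot_keys, mem_keys, tau=0.1):
    """alpha[j, m]: cosine similarity between slot key j and memory key m,
    followed by a temperature softmax over the M memory entries."""
    q = slot_keys / np.linalg.norm(slot_keys, axis=1, keepdims=True)
    k = mem_keys / np.linalg.norm(mem_keys, axis=1, keepdims=True)
    s = q @ k.T                                   # s[j, m] in [-1, 1]
    e = np.exp((s - s.max(axis=1, keepdims=True)) / tau)
    return e / e.sum(axis=1, keepdims=True)       # each row sums to 1
```

Lowering `tau` drives each row toward winner-take-all; raising it flattens the weights toward uniform, matching the intuition above.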


Equation (4): Multi-Pathway 3D Generation with Kinematic Constraints

\[ G = G_\theta(\mathcal{I}, \mathbf{Z}_e), \quad \text{subject to } \mathcal{L}_{\text{kin}}(G, Z_e^{\text{rel}}) < \epsilon \]

Symbol-by-symbol:

  • \(G_\theta\): 3D generator network with parameters \(\theta\)
  • \(\mathcal{I}\): Input image (provides appearance guidance)
  • \(\mathbf{Z}_e\): Composed Engram components (structural guidance)
  • \(\mathcal{L}_{\text{kin}}\): Kinematic consistency loss
  • \(\epsilon\): Constraint tolerance

Intuitive Explanation:
The generator produces 3D geometry guided by both the image (for appearance) and retrieved Engrams (for structure). The kinematic constraint ensures the output respects relational information (e.g., joint axes, connection points).


Equation (5): Action Inference

\[ A = \pi_{\text{VLA}}(\text{Render}(G), P) \]

Symbol-by-symbol:

  • \(\pi_{\text{VLA}}\): Vision-Language-Action policy
  • \(\text{Render}(G)\): Rendered images from 3D reconstruction
  • \(P\): Original task prompt
  • \(A \in \mathbb{R}^K\): Output action vector

Intuitive Explanation:
Given the reconstructed 3D (rendered as images) and the task instruction, the VLA policy outputs robot actions. The explicit 3D structure helps the policy reason about occluded parts.


4.2 Slot-Part Alignment with Prioritized Supervision

Critical Challenge: Slot Attention often produces "semantic blobs" rather than functional parts.

Our Solution: Prioritized Supervision Hierarchy

| Priority | Signal Source | Weight | Description |
|---|---|---|---|
| 1 (Highest) | Contact/interaction | \(w_1 = 1.0\) | Where humans/robots actually touch |
| 2 | Geometric priors | \(w_2 = 0.5\) | Cylinders are handles, planes are supports |
| 3 | Relational consistency | \(w_3 = 0.3\) | Connected parts should be spatially close |
| 4 (Lowest) | Diversity regularization | \(w_4 = 0.1\) | Slots should differ from each other |

Combined Slot Alignment Loss:

\[ \mathcal{L}_{\text{slot-align}} = w_1 \mathcal{L}_{\text{contact}} + w_2 \mathcal{L}_{\text{geo-prior}} + w_3 \mathcal{L}_{\text{rel-consist}} + w_4 \mathcal{L}_{\text{diversity}} \]

Symbol-by-symbol:

  • \(w_1, w_2, w_3, w_4\): Priority weights (\(w_1 > w_2 > w_3 > w_4\))
  • \(\mathcal{L}_{\text{contact}}\): Contact signal loss
  • \(\mathcal{L}_{\text{geo-prior}}\): Geometric affordance prior loss
  • \(\mathcal{L}_{\text{rel-consist}}\): Relational consistency loss
  • \(\mathcal{L}_{\text{diversity}}\): Slot diversity regularization

Intuitive Explanation:
We stack multiple supervision signals by priority. Contact signals (where people actually grab objects) are strongest because they directly indicate functional parts. Lower-priority signals fill in when higher-priority signals are unavailable.


Priority 1: Contact Signal Loss

\[ \mathcal{L}_{\text{contact}} = \sum_{j=1}^{J} \text{BCE}\big(\text{SlotMask}_j, \text{ContactMask}\big) \]

Symbol-by-symbol:

  • \(\text{SlotMask}_j \in [0,1]^{H \times W}\): Attention mask for slot \(j\) (where it attends in the image)
  • \(\text{ContactMask} \in \{0,1\}^{H \times W}\): Ground-truth contact regions (from HOI data)
  • \(\text{BCE}\): Binary Cross-Entropy loss

Expanded BCE:

\[ \text{BCE}(p, q) = -\frac{1}{HW}\sum_{i,j}\big[q_{ij}\log p_{ij} + (1-q_{ij})\log(1-p_{ij})\big] \]

Intuitive Explanation:
If we know where humans touch objects (from videos of people using objects), we train slots to attend to those regions. A slot attending to a handle should align with hand-contact regions on handles.

Why highest priority?
Contact directly indicates functional relevance. A handle is a handle because people grab it.
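The expanded BCE above maps directly to a few lines of NumPy. A minimal sketch (function name is ours; the clipping epsilon is an implementation detail we add to avoid \(\log 0\)):

```python
import numpy as np

def contact_bce(slot_mask, contact_mask, eps=1e-7):
    """Mean per-pixel binary cross-entropy between a slot's soft attention
    mask (values in [0, 1]) and a binary contact mask (H x W each)."""
    p = np.clip(slot_mask, eps, 1.0 - eps)   # avoid log(0)
    q = contact_mask
    return -np.mean(q * np.log(p) + (1.0 - q) * np.log(1.0 - p))
```

A slot mask that exactly matches the contact region yields a near-zero loss; an inverted mask is penalized heavily.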


Priority 2: Geometric Affordance Prior Loss

\[ \mathcal{L}_{\text{geo-prior}} = -\sum_{j=1}^{J} \sum_{a \in \mathcal{A}} \mathbb{1}[\text{SlotMask}_j \cap \text{GeoPrior}_a \neq \emptyset] \cdot \log p(a \mid k_j) \]

Symbol-by-symbol:

  • \(\mathcal{A}\): Set of affordance types {grasp, support, hinge, button, ...}
  • \(\text{GeoPrior}_a\): Regions satisfying the geometric prior for affordance \(a\)
  • \(\mathbb{1}[\cdot]\): Indicator function (1 if the condition holds, 0 otherwise)
  • \(p(a \mid k_j)\): Predicted probability of affordance \(a\) given slot key \(k_j\)

Geometric priors (examples):

  • Grasp (handle): Elongated cylindrical regions where \(\lambda_1 \gg \lambda_2 \approx \lambda_3\) (PCA eigenvalues)
  • Support: Horizontal surfaces with upward normal: \(\mathbf{n} \cdot \hat{z} > 0.9\)
  • Hinge: Concave junctions between planar surfaces
  • Button: Small convex protrusions on flat surfaces

Intuitive Explanation:
If a slot attends to a cylindrical region, it should predict "grasp" affordance. This loss encourages geometric consistency between attention patterns and affordance predictions.


Priority 3: Relational Consistency Loss

\[ \mathcal{L}_{\text{rel-consist}} = \sum_{(i,j) \in \mathcal{E}_{\text{rel}}} \| \text{Centroid}(\text{SlotMask}_i) - \text{Centroid}(\text{SlotMask}_j) - \Delta t_{ij}^{\text{rel}} \|_2^2 \]

Symbol-by-symbol:

  • \(\mathcal{E}_{\text{rel}}\): Set of slot pairs connected by relational tokens
  • \(\text{Centroid}(\text{SlotMask}_i) \in \mathbb{R}^2\): 2D centroid of slot \(i\)'s attention mask
  • \(\Delta t_{ij}^{\text{rel}} \in \mathbb{R}^2\): Expected relative position from the relational token (projected to 2D)

Intuitive Explanation:
If relational tokens say "slot \(i\) connects to slot \(j\) with offset \(\Delta t\)," then their attention centroids should reflect this spatial relationship. A handle connected to a door should attend to adjacent regions.


Priority 4: Slot Diversity Regularization

\[ \mathcal{L}_{\text{diversity}} = \underbrace{-\sum_{j=1}^{J} \mathcal{H}(a_j)}_{\text{entropy maximization}} + \underbrace{\lambda_{\text{rep}} \sum_{i \neq j} \max(0, \cos(k_i, k_j) - \delta)}_{\text{repulsion}} \]

Symbol-by-symbol:

  • \(\mathcal{H}(a_j) = -\sum_t a_{j,t} \log a_{j,t}\): Entropy of slot \(j\)'s attention distribution
  • \(\cos(k_i, k_j)\): Cosine similarity between slot keys
  • \(\delta\): Similarity threshold (typical: 0.5)
  • \(\lambda_{\text{rep}}\): Repulsion strength

Intuitive Explanation:
The entropy term encourages each slot to attend broadly (not collapse to a single token). The repulsion term pushes slot keys apart, preventing multiple slots from representing the same part.
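Both terms of the diversity regularizer are simple to implement. A NumPy sketch under our own naming (the small epsilon inside the log is an added numerical guard):

```python
import numpy as np

def diversity_loss(a, keys, delta=0.5, lam_rep=0.1):
    """Priority 4: negative attention entropy per slot plus a hinge
    repulsion on pairwise key cosine similarity above delta.
    a: (J, T) attention rows; keys: (J, D) slot keys."""
    ent = -np.sum(a * np.log(a + 1e-12), axis=1)        # H(a_j) per slot
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    cos = k @ k.T                                       # pairwise cosines
    off = ~np.eye(len(keys), dtype=bool)                # exclude i == j
    rep = np.maximum(0.0, cos[off] - delta).sum()
    return -ent.sum() + lam_rep * rep
```

Duplicate slot keys raise the repulsion term, so the loss is higher for identical slots than for orthogonal ones.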


4.3 Structure-Only Key Learning via Invariance

Problem: Keys might encode texture/color, causing retrieval to match appearance rather than structure.

Solution: Invariance regularization forces keys to ignore appearance.


Combined Invariance Loss:

\[ \mathcal{L}_{\text{inv}} = \mathcal{L}_{\text{inv}}^{\text{tex}} + \lambda_p \mathcal{L}_{\text{inv}}^{\text{prompt}} + \lambda_c \mathcal{L}_{\text{inv}}^{\text{cross}} \]


Texture Augmentation Invariance:

\[ \mathcal{L}_{\text{inv}}^{\text{tex}} = \sum_{j=1}^{J} \| k_j(\mathcal{I}) - k_j(\text{Aug}_{\text{tex}}(\mathcal{I})) \|_2^2 \]

Symbol-by-symbol:

  • \(k_j(\mathcal{I})\): Key from slot \(j\) given the original image
  • \(\text{Aug}_{\text{tex}}(\mathcal{I})\): Texture-augmented image (hue shift, saturation change, style transfer)
  • \(\|\cdot\|_2^2\): Squared L2 norm

Intuitive Explanation:
The same object with different colors/textures should produce identical keys. We augment textures and penalize key differences.


Prompt Invariance (Selective):

\[ \mathcal{L}_{\text{inv}}^{\text{prompt}} = \sum_{j=1}^{J} \| k_j(P_1, \mathcal{I}) - k_j(P_2, \mathcal{I}) \|_2^2 \cdot \mathbb{1}[\text{SameStructure}(P_1, P_2)] \]

Symbol-by-symbol:

  • \(P_1, P_2\): Two different prompts
  • \(\text{SameStructure}(P_1, P_2)\): True if the prompts describe the same structure with different appearance

Examples:

  • \(P_1\) = "red wooden mug" vs \(P_2\) = "blue ceramic mug" → Same structure, apply invariance
  • \(P_1\) = "mug with handle" vs \(P_2\) = "mug without handle" → Different structure, skip invariance

Intuitive Explanation:
"Red wooden chair" and "blue metal chair" should retrieve the same structural Engrams. But "chair with armrests" and "chair without armrests" are structurally different.


Cross-Instance Structural Alignment:

\[ \mathcal{L}_{\text{inv}}^{\text{cross}} = \underbrace{-\sum_{(i,j) \in \mathcal{S}_{\text{same}}} \cos(k_i, k_j)}_{\text{attract same-structure pairs}} + \underbrace{\sum_{(i,j) \in \mathcal{S}_{\text{diff}}} \max(0, \cos(k_i, k_j) - \delta)}_{\text{repel different-structure pairs}} \]

Symbol-by-symbol:

  • \(\mathcal{S}_{\text{same}}\): Pairs of keys from the same structure (different instances/textures)
  • \(\mathcal{S}_{\text{diff}}\): Pairs of keys from different structures
  • \(\delta\): Margin threshold

Intuitive Explanation:
This is contrastive learning for structure. Keys from structurally similar objects should be close; keys from different structures should be far.
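The attract/repel structure above can be sketched directly from the equation. A minimal NumPy version (function name and pair encoding as index tuples are our own conventions):

```python
import numpy as np

def cross_instance_loss(keys, same_pairs, diff_pairs, delta=0.5):
    """Contrastive structural alignment: negative cosine for same-structure
    key pairs (attract), hinge above margin delta for different-structure
    pairs (repel).  keys: (n, D); pairs: lists of (i, j) index tuples."""
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    attract = -sum(k[i] @ k[j] for i, j in same_pairs)
    repel = sum(max(0.0, k[i] @ k[j] - delta) for i, j in diff_pairs)
    return attract + repel
```

When same-structure keys are nearly parallel and different-structure keys are orthogonal, the loss is strongly negative — the desired optimum.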


4.4 Memory Key Generation

Strategy A (Recommended): Shared Encoder Pipeline

\[ k_m = \frac{1}{|\mathcal{V}| \cdot |\mathcal{T}|} \sum_{v \in \mathcal{V}} \sum_{\text{tex} \in \mathcal{T}} \mathcal{K}\big(E_{\text{VL}}(P_{\text{neutral}}, \text{Render}_{v}(S_m, \text{tex}))\big) \]

Symbol-by-symbol:

  • \(S_m\): Source 3D shape for Engram \(m\)
  • \(\mathcal{V}\): Set of viewpoints (e.g., 8 canonical views)
  • \(\mathcal{T}\): Set of texture variations
  • \(\text{Render}_{v}(S_m, \text{tex})\): Rendered image from viewpoint \(v\) with texture \(\text{tex}\)
  • \(P_{\text{neutral}}\): Structure-focused prompt (e.g., "a 3D object")

Intuitive Explanation:
Memory keys are generated using the same pipeline as query keys, ensuring they live in the same space. Averaging over viewpoints and textures ensures the key captures structure, not appearance or viewpoint.
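Strategy A reduces to averaging encoder outputs over a render grid. A schematic sketch where `render` and `encode` are hypothetical callables standing in for \(\text{Render}_v(S_m, \text{tex})\) and \(\mathcal{K}(E_{\text{VL}}(P_{\text{neutral}}, \cdot))\):

```python
import numpy as np

def build_memory_key(shape, views, textures, render, encode):
    """Average structure keys over all (viewpoint, texture) renders so the
    stored key k_m is view- and appearance-invariant.  `render` and
    `encode` are injected stand-ins for the real pipeline stages."""
    keys = [encode(render(shape, v, tex)) for v in views for tex in textures]
    return np.mean(keys, axis=0)
```

The averaging cancels view- and texture-specific components of the key, leaving what is common to all renders — the structure.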


4.5 Modular Engram Components (Detailed Specifications)

4.5.1 Relational Token Format

Storage Format (23-dimensional raw vector):

| Field | Dims | Type | Description |
|---|---|---|---|
| part_i_idx | 1 | Integer | Index of first connected part |
| part_j_idx | 1 | Integer | Index of second connected part |
| joint_type | 6 | One-hot | {fixed, revolute, prismatic, spherical, planar, free} |
| axis | 3 | Unit vector | Joint rotation/translation axis |
| rel_rotation | 6 | 6D continuous | Relative rotation (Zhou et al. representation) |
| rel_translation | 3 | 3D vector | Relative translation |
| constraint_limits | 2 | Floats | \([\theta_{\min}, \theta_{\max}]\) in radians |
| valid_mask | 1 | Binary | 1 if valid, 0 if padding |

Model Input Format (projected to \(d_r\) dimensions):

\[ e_m^{\text{rel, proj}} = \text{MLP}\big([\text{Embed}(\text{part\_i}); \text{Embed}(\text{part\_j}); \text{continuous\_fields}]\big) \]

Symbol-by-symbol:

  • \(\text{Embed}(\cdot)\): Learned embedding table for integer indices
  • \(\text{continuous\_fields}\): The remaining 21 dimensions (joint_type through valid_mask)
  • \(\text{MLP}\): Multi-layer perceptron projecting to \(d_r\) dimensions

4.5.2 Symmetry Component Format

| Field | Dims | Description |
|---|---|---|
| sym_type | 5 | One-hot: {none, bilateral, radial, translational, helical} |
| sym_axis | 3 | Primary symmetry axis (unit vector) |
| sym_center | 3 | Center of symmetry (3D point) |
| sym_count | 1 | Repetition count (e.g., 4 for chair legs) |
| sym_spacing | 1 | Spacing for translational symmetry |
| Total | 13 | |

4.5.3 Affordance Component Format

| Field | Dims | Description |
|---|---|---|
| position | 3 | 3D location in canonical frame |
| aff_type | 8 | One-hot: {grasp, support, hinge, button, slider, socket, lid, none} |
| approach_dir | 3 | Approach/interaction direction |
| contact_normal | 3 | Surface normal at contact point |
| valid_mask | 1 | 1 if valid, 0 if padding |
| Total | 18 | Per marker (\(N_a = 5\) markers) |

4.6 Modular Engram Composition

Geometric Component (weighted average):

\[ Z_e^{\text{geo}} = \sum_{j=1}^{J} \sum_{m=1}^{M} \alpha_{j,m} \cdot \text{SE3Transform}(e_m^{\text{geo}}, \Delta T_{j,m}) \]

Symbol-by-symbol:

  • \(\alpha_{j,m}\): Retrieval weight (slot \(j\), Engram \(m\))
  • \(e_m^{\text{geo}}\): Geometric latent of Engram \(m\)
  • \(\Delta T_{j,m} = T_j \cdot (T_m^c)^{-1}\): Relative transformation from canonical to target pose
  • \(\text{SE3Transform}\): Transformation operator (MLP approximation in latent space)

Intuitive Explanation:
Each retrieved geometric Engram is transformed to its target pose, then all are blended by retrieval weights.


Relational Component (concatenation with threshold):

\[ Z_e^{\text{rel}} = \text{Concat}_{j,m: \alpha_{j,m} > \theta_{\text{rel}}} \big[ e_{j,m}^{\text{rel, proj}} \big] \]

Intuitive Explanation:
We concatenate relational tokens from high-confidence retrievals. Thresholding prevents noisy low-weight retrievals from polluting the relation set.


Symmetry Component (weighted voting):

\[ Z_e^{\text{sym}} = \arg\max_{\text{type}} \sum_{j,m} \alpha_{j,m} \cdot \mathbb{1}[e_m^{\text{sym}}.\text{type} = \text{type}] \]

Intuitive Explanation:
Symmetry is discrete (an object is either bilaterally symmetric or not). We take a weighted vote over retrieved symmetry types.
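The weighted vote is a small accumulation over retrieval mass. A minimal sketch (function name ours; `sym_types` holds the discrete type label of each Engram in memory):

```python
import numpy as np

def vote_symmetry(alpha, sym_types):
    """Weighted vote for Z_e^sym: accumulate retrieval mass alpha[:, m]
    per discrete symmetry type, return the argmax type.
    alpha: (J, M) retrieval weights; sym_types: list of M type labels."""
    scores = {}
    for m, t in enumerate(sym_types):
        scores[t] = scores.get(t, 0.0) + alpha[:, m].sum()
    return max(scores, key=scores.get)
```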


Affordance Component (transform and merge):

\[ Z_e^{\text{aff}} = \bigcup_{j,m: \alpha_{j,m} > \theta_{\text{aff}}} \text{Transform}(e_m^{\text{aff}}, \Delta T_{j,m}) \]

Intuitive Explanation:
Affordance markers (grasp points, etc.) are transformed to target poses and merged. The union collects all relevant interaction points.


4.7 Enforcing Relational Constraints: Kinematic Consistency Loss

Problem: Cross-attention alone allows the generator to "politely ignore" relational tokens.

Solution: Explicit loss penalizing kinematic violations.


Combined Kinematic Loss:

\[ \mathcal{L}_{\text{kin}} = \mathcal{L}_{\text{kin}}^{\text{joint}} + \lambda_{\text{axis}} \mathcal{L}_{\text{kin}}^{\text{axis}} + \lambda_{\text{limit}} \mathcal{L}_{\text{kin}}^{\text{limit}} \]


Joint Position Consistency:

\[ \mathcal{L}_{\text{kin}}^{\text{joint}} = \sum_{(i,j) \in \mathcal{E}} \| (T_i^{-1} \cdot T_j) - T_{ij}^{\text{rel}} \|_{\text{geo}}^2 \]

Symbol-by-symbol:

  • \(\mathcal{E}\): Set of connected part pairs from relational tokens
  • \(T_i, T_j \in SE(3)\): Estimated poses of slots \(i\) and \(j\)
  • \(T_i^{-1} \cdot T_j\): Relative pose from \(i\) to \(j\)
  • \(T_{ij}^{\text{rel}}\): Expected relative pose from the relational token
  • \(\|\cdot\|_{\text{geo}}\): Geodesic distance on SE(3)

Geodesic distance on SE(3):

\[ \|T_1 - T_2\|_{\text{geo}} = \|\log(R_1^\top R_2)\|_F + \lambda_t \|t_1 - t_2\|_2 \]

where \(\log(\cdot)\) is the matrix logarithm mapping a rotation to its axis-angle form.

Intuitive Explanation:
If a relational token says "the handle is attached to the door with relative pose \(T_{ij}^{\text{rel}}\)," the generated geometry must satisfy this. Deviation is penalized.
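The SE(3) geodesic distance can be computed without an explicit matrix logarithm, since for a rotation with angle \(\theta\), \(\|\log(R)\|_F = \sqrt{2}\,|\theta|\), and \(\theta\) is recoverable from the trace. A NumPy sketch under that identity (function name ours; poses as 4×4 homogeneous matrices):

```python
import numpy as np

def se3_geodesic(T1, T2, lam_t=1.0):
    """Geodesic distance on SE(3): rotation term sqrt(2)*theta (equal to
    ||log(R1^T R2)||_F), plus a weighted Euclidean translation term."""
    R1, t1 = T1[:3, :3], T1[:3, 3]
    R2, t2 = T2[:3, :3], T2[:3, 3]
    c = np.clip((np.trace(R1.T @ R2) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(c)                    # relative rotation angle
    return np.sqrt(2.0) * theta + lam_t * np.linalg.norm(t1 - t2)
```

Identical poses give distance zero; a pure 1 m translation with `lam_t=1.0` gives distance 1.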


Joint Axis Alignment:

\[ \mathcal{L}_{\text{kin}}^{\text{axis}} = \sum_{(i,j): \text{revolute/prismatic}} \big(1 - |\text{axis}_{ij}^\top \cdot \text{PCA}_1(G_{\text{joint}}^{ij})|\big) \]

Symbol-by-symbol:

  • \(\text{axis}_{ij}\): Joint axis from the relational token (unit vector)
  • \(G_{\text{joint}}^{ij}\): Geometry near the joint between parts \(i\) and \(j\)
  • \(\text{PCA}_1(\cdot)\): First principal component (dominant direction)

Intuitive Explanation:
For a revolute joint (like a door hinge), the local geometry should be aligned with the joint axis. A poorly-aligned hinge would have high loss.


Joint Limit Enforcement:

\[ \mathcal{L}_{\text{kin}}^{\text{limit}} = \sum_{(i,j)} \big[\max(0, \theta_{ij} - \theta_{\max}) + \max(0, \theta_{\min} - \theta_{ij})\big] \]

Symbol-by-symbol:

  • \(\theta_{ij}\): Current joint angle between parts \(i\) and \(j\)
  • \(\theta_{\min}, \theta_{\max}\): Joint limits from the relational token
  • \(\max(0, \cdot)\): ReLU (only penalize violations)

Intuitive Explanation:
A door can only open so far. If the generated geometry implies the door is open beyond its limit, this is penalized.
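The limit term is a two-sided hinge. In Python (helper name ours):

```python
def joint_limit_penalty(theta, theta_min, theta_max):
    """Hinge penalty from L_kin^limit: zero inside [theta_min, theta_max],
    growing linearly with the size of any violation outside it."""
    return max(0.0, theta - theta_max) + max(0.0, theta_min - theta)
```

For a door with limits \([0, 1.5]\) rad, an implied angle of 2.0 rad incurs a penalty of 0.5; any angle within limits incurs none.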


Relational Tokens also Feed Pose Refinement:

\[ T_j^{\text{refined}} = \text{PoseRefineHead}(T_j^{\text{coarse}}, k_j, Z_e^{\text{rel}}) \]

Intuitive Explanation:
Beyond cross-attention in the generator, relational tokens also directly inform pose estimation. This creates multiple pathways for relational information, making it harder to ignore.


4.8 Gravity-Aligned Pose Estimation (No Ground Truth)

Challenge: Single-view pose estimation usually needs GT supervision.

Solution: Use robot IMU + multi-view consistency.


Gravity Prior Loss:

\[ \mathcal{L}_{\text{gravity}} = \sum_{j=1}^{J} \big(1 - |\mathbf{g}^\top \cdot R_j \cdot \hat{z}|\big) \cdot \mathbb{1}[\text{GravityAligned}(j)] \]

Symbol-by-symbol:

  • \(\mathbf{g} \in \mathbb{R}^3\): Gravity direction from the IMU (unit vector)
  • \(R_j \in SO(3)\): Estimated rotation for slot \(j\)
  • \(\hat{z} = [0, 0, 1]^\top\): Canonical up-vector
  • \(\mathbb{1}[\text{GravityAligned}(j)]\): 1 if slot \(j\) should be gravity-aligned (supports, tables, floors)

Intuitive Explanation:
A table surface should be horizontal (aligned with gravity). We use the robot's IMU to know which way is "up" and penalize misalignment.
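The gravity prior is a one-liner per slot. A NumPy sketch (function name and the boolean flag list are our own conventions for \(\mathbb{1}[\text{GravityAligned}(j)]\)):

```python
import numpy as np

def gravity_loss(g, rotations, gravity_aligned):
    """1 - |g . (R_j z_hat)| summed over slots flagged as gravity-aligned;
    zero when each flagged slot's up-axis matches the IMU gravity vector."""
    z_hat = np.array([0.0, 0.0, 1.0])
    return sum(1.0 - abs(g @ (R @ z_hat))
               for R, flag in zip(rotations, gravity_aligned) if flag)
```

An upright slot (identity rotation) costs nothing; a slot tipped 90° from vertical costs the maximum of 1 per slot.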


Multi-View Consistency Loss:

\[ \mathcal{L}_{\text{MV}} = \sum_{v_1 \neq v_2} \| \text{Render}_{v_1}(G) - \text{Warp}_{v_1 \leftarrow v_2}(\text{Render}_{v_2}(G)) \|_1 \]

Symbol-by-symbol:

  • \(\text{Render}_{v}(G)\): Rendering from viewpoint \(v\)
  • \(\text{Warp}_{v_1 \leftarrow v_2}\): Image warping from \(v_2\) to \(v_1\) using estimated depth/pose

Intuitive Explanation:
If the 3D reconstruction is correct, rendering from different viewpoints and warping should produce consistent images. Inconsistencies indicate pose or geometry errors.


Combined Pose Loss (No GT):

\[ \mathcal{L}_{\text{pose}}^{\text{no-GT}} = \lambda_g \mathcal{L}_{\text{gravity}} + \lambda_{\text{MV}} \mathcal{L}_{\text{MV}} + \lambda_{\text{kin}} \mathcal{L}_{\text{kin}} \]

Intuitive Explanation:
Without ground-truth poses, we combine: (1) gravity alignment (from IMU), (2) multi-view consistency (self-supervised), (3) kinematic constraints (from relational tokens). Together, these provide sufficient signal to learn reasonable poses.


4.9 Multi-Pathway Generator Conditioning

Pathway A: AdaLN for Geometry

\[ h'_l = \gamma_l(Z_e^{\text{geo}}) \odot \text{LN}(h_l) + \beta_l(Z_e^{\text{geo}}) \]

Symbol-by-symbol:

  • \(h_l \in \mathbb{R}^{B \times C}\): Activation at layer \(l\) (batch \(B\), channels \(C\))
  • \(\text{LN}(h_l) = \frac{h_l - \mu_l}{\sigma_l}\): Layer normalization
  • \(\mu_l, \sigma_l\): Mean and std (per-sample, across channels for MLPs; per-token for Transformers)
  • \(\gamma_l(Z_e^{\text{geo}}) = W_\gamma^l Z_e^{\text{geo}} + b_\gamma^l\): Scale from the geometric prior
  • \(\beta_l(Z_e^{\text{geo}}) = W_\beta^l Z_e^{\text{geo}} + b_\beta^l\): Shift from the geometric prior
  • \(\odot\): Element-wise multiplication

Intuitive Explanation:
AdaLN injects the geometric prior by controlling the scale and shift of activations at every layer. This is a strong form of conditioning—the prior directly modulates the "communication channels" in the network.
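Pathway A can be sketched as one NumPy function. This is a schematic, not the actual generator layer: the linear maps producing \(\gamma\) and \(\beta\) are passed in as explicit weight matrices, and the normalization epsilon is an added guard:

```python
import numpy as np

def adaln(h, z_geo, W_gamma, b_gamma, W_beta, b_beta, eps=1e-5):
    """AdaLN: layer-normalize h per sample, then scale/shift with gamma
    and beta predicted linearly from the geometric prior z_geo.
    h: (B, C) activations; z_geo: (B, d_g) composed geometric prior."""
    mu = h.mean(axis=-1, keepdims=True)
    sigma = h.std(axis=-1, keepdims=True)
    h_norm = (h - mu) / (sigma + eps)          # LN(h)
    gamma = z_geo @ W_gamma + b_gamma          # (B, C) scale
    beta = z_geo @ W_beta + b_beta             # (B, C) shift
    return gamma * h_norm + beta
```

Because \(\gamma\) and \(\beta\) multiply and offset every channel, the geometric prior modulates the entire layer rather than contributing a single additive token.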


Pathway B: Cross-Attention for Relations

hl=hl+CrossAttn(Q=hl,K=Zerel,V=Zerel,M=Mvalid)h'_l = h_l + \text{CrossAttn}(Q=h_l, K=Z_e^{\text{rel}}, V=Z_e^{\text{rel}}, M=M_{\text{valid}})

Expanded Cross-Attention:

$$\text{CrossAttn}(Q, K, V, M) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d}} + M\right) V$$

where \( M_{ij} = -\infty \) if token \( j \) is invalid (padding).

Intuitive Explanation:
The generator can "ask questions" about relational structure. "Is there a joint here? What type? What axis?" The cross-attention mechanism retrieves relevant relational information.
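A minimal single-head sketch of this masked cross-attention, in plain Python (lists of floats; a real implementation would be batched tensor code):

```python
import math

def masked_cross_attention(Q, K, V, valid):
    """Single-head cross-attention with a padding mask.
    Q: [n_q][d] queries, K/V: [n_k][d] keys/values,
    valid: [n_k] booleans. Invalid keys get -inf logits,
    so softmax assigns them exactly zero weight."""
    d = len(K[0])
    out = []
    for q in Q:
        logits = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            if ok else float("-inf")
            for k, ok in zip(K, valid)
        ]
        m = max(logits)  # shift for numerical stability
        exps = [math.exp(l - m) for l in logits]
        Z = sum(exps)
        w = [e / Z for e in exps]
        # Weighted sum of values
        out.append([sum(wi * v[j] for wi, v in zip(w, V))
                    for j in range(len(V[0]))])
    return out
```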


Pathway C: Per-Slot Symmetry

$$G_j(x) = \begin{cases} \dfrac{1}{n_j} \displaystyle\sum_{i=0}^{n_j-1} G_{\text{base}}^j\big(R_{\text{axis}}^{2\pi i/n_j} \cdot x_{\text{local}}\big) & \text{if radial symmetry} \\[2mm] \dfrac{1}{2}\big(G_{\text{base}}^j(x_{\text{local}}) + G_{\text{base}}^j(\text{Reflect}(x_{\text{local}}))\big) & \text{if bilateral symmetry} \\[2mm] G_{\text{base}}^j(x_{\text{local}}) & \text{if no symmetry} \end{cases}$$

Symbol-by-symbol:

  • \( G_j(x) \): Final field value for slot \( j \) at point \( x \)
  • \( G_{\text{base}}^j \): Base generator output (before symmetrization)
  • \( n_j \): Symmetry count (e.g., 4 for chair legs)
  • \( R_{\text{axis}}^{\theta} \): Rotation by \( \theta \) around the symmetry axis
  • \( x_{\text{local}} = (T_j^c)^{-1} \cdot x \): Point transformed into the slot's local frame
  • \( \text{Reflect}(\cdot) \): Reflection across the symmetry plane

Intuitive Explanation:
Symmetry is enforced by averaging the field over symmetric transformations. For 4-way radial symmetry (like chair legs), we evaluate the base field at 4 rotated positions and average. This guarantees perfect symmetry.
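The averaging trick is easy to verify in a toy 2D setting. The sketch below symmetrizes a scalar field about the origin; the full per-slot operator additionally maps points into the slot's local frame and handles bilateral reflection:

```python
import math

def radial_symmetrize(base_field, n, point):
    """Average a 2D scalar field over n rotations about the origin
    (toy analogue of the radial case of the per-slot symmetry operator;
    n plays the role of the symmetry count n_j)."""
    x, y = point
    total = 0.0
    for i in range(n):
        t = 2.0 * math.pi * i / n
        # Rotate the query point by 2*pi*i/n before evaluating the field
        xr = math.cos(t) * x - math.sin(t) * y
        yr = math.sin(t) * x + math.cos(t) * y
        total += base_field(xr, yr)
    return total / n
```

Even though `base_field` below is asymmetric, the averaged field is exactly invariant under rotation by \( 2\pi/n \), which is the guarantee the text refers to.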


Pathway D: Affordance Head

$$A_{\text{pred}} = \text{AffordanceHead}(G_{\text{features}}, Z_e^{\text{aff}})$$

Intuitive Explanation:
A dedicated head predicts affordance heatmaps (grasp points, support surfaces) using both generator features and retrieved affordance markers. This provides explicit manipulation-relevant outputs.


4.10 Composition Operators (Generator-Specific)

SDF Composition (Smooth Minimum):

$$f(x) = \text{SoftMin}_j\big(f_j(x)\big) = -\frac{1}{\beta}\log\sum_{j=1}^{J} \exp\big(-\beta f_j(x)\big)$$

Symbol-by-symbol:

  • \( f_j(x) \): SDF value from slot \( j \) at point \( x \) (negative inside, positive outside)
  • \( \beta > 0 \): Sharpness parameter (larger means closer to a hard minimum)

Limit behavior:

  • \( \beta \to \infty \): \( \text{SoftMin} \to \min \) (hard union)
  • \( \beta \to 0 \): \( \text{SoftMin} \to \text{mean} \) (smooth blend)

Intuitive Explanation:
SDFs represent shapes as distance fields. To combine shapes (union), we take the minimum. Soft-min provides smooth transitions at part boundaries.
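A direct Python sketch of the soft-min composition (with the standard max-shift for numerical stability, which the equation above leaves implicit):

```python
import math

def soft_min(sdf_values, beta=32.0):
    """Smooth union of per-slot SDF values:
    -(1/beta) * log(sum_j exp(-beta * f_j)).
    Shifting by the minimum keeps the exponentials from overflowing."""
    m = min(sdf_values)
    s = sum(math.exp(-beta * (f - m)) for f in sdf_values)
    return m - math.log(s) / beta
```

Note that the soft-min is always less than or equal to the hard minimum, and approaches it as `beta` grows, matching the limit behavior listed above.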


NeRF Composition (Density Mixture):

$$\sigma(x) = \sum_{j=1}^{J} w_j(x)\, \sigma_j(x)$$

$$c(x) = \frac{\sum_{j=1}^{J} w_j(x)\, \sigma_j(x)\, c_j(x)}{\sum_{j=1}^{J} w_j(x)\, \sigma_j(x) + \epsilon}$$

Symbol-by-symbol:

  • \( \sigma_j(x) \): Density from slot \( j \) at point \( x \)
  • \( c_j(x) \): Color from slot \( j \)
  • \( w_j(x) \): Spatial weight for slot \( j \) (from the lifted attention mask)
  • \( \epsilon \): Small constant for numerical stability

Intuitive Explanation:
NeRF represents scenes as density and color fields. We combine slot contributions using spatial weights, with colors weighted by density (denser regions contribute more to final color).
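The two equations above, evaluated at a single query point, look like this in Python (plain lists; colors are RGB triples):

```python
def compose_density_color(densities, colors, weights, eps=1e-8):
    """Blend per-slot NeRF outputs at one query point.
    Density is a weighted sum; color is density-weighted, so
    denser slots dominate the final color."""
    sigma = sum(w * s for w, s in zip(weights, densities))
    den = sum(w * s for w, s in zip(weights, densities)) + eps
    color = [
        sum(w * s * c[k] for w, s, c in zip(weights, densities, colors)) / den
        for k in range(len(colors[0]))
    ]
    return sigma, color
```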


3DGS Composition (Concatenation + Pruning):

$$\mathcal{G}_{\text{final}} = \text{Prune}\bigg(\bigcup_{j=1}^{J} \mathcal{G}_j,\; \theta_{\text{opacity}}\bigg)$$

Symbol-by-symbol:

  • \( \mathcal{G}_j \): Set of Gaussians from slot \( j \)
  • \( \bigcup \): Set union (concatenation)
  • \( \text{Prune}(\cdot, \theta) \): Remove Gaussians with opacity below \( \theta \)

Intuitive Explanation:
3D Gaussian Splatting represents scenes as collections of Gaussian primitives. We simply concatenate Gaussians from all slots, then prune low-opacity ones to remove redundancy.
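Since this operator is just concatenation followed by a threshold, the sketch is short. Representing each Gaussian as a dict with an `opacity` key is an illustrative choice, not the paper's data format:

```python
def compose_gaussians(slot_gaussians, opacity_threshold=0.01):
    """Concatenate per-slot Gaussian lists, then drop Gaussians
    whose opacity falls below the pruning threshold."""
    merged = [g for slot in slot_gaussians for g in slot]
    return [g for g in merged if g["opacity"] >= opacity_threshold]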


5. Staged Memory Construction Protocol

Key Insight: Full Engram annotation is expensive. We propose staged construction for practical deployment.

5.1 Stage 1: Minimal Viable Memory (EG-3D Lite)

Required: \( e^{\text{geo}} \), \( e^{\text{sym}} \), \( T^c \) (coarse)

Annotation effort: Low (automatic from 3D scans)

  • Geometry: PointNet++ encoder on mesh/point cloud
  • Symmetry: PCA + RANSAC detection (fully automatic)
  • Canonical frame: Gravity + principal axis alignment (automatic)

Capabilities: Basic shape retrieval, symmetry enforcement, rigid object reconstruction


5.2 Stage 2: Add Relational Structure

Added: \( e^{\text{rel}} \)

Annotation effort: Medium (semi-automatic)

  • Source: PartNet-Mobility dataset, or learned from articulation videos

Capabilities: Articulated object reconstruction, kinematic constraint enforcement


5.3 Stage 3: Add Affordance

Added: \( e^{\text{aff}} \)

Annotation effort: Medium-High (from interaction data)

  • Source: HOI videos, robot demonstrations, geometric priors

Capabilities: Full affordance-grounded reconstruction, direct grasp point prediction


5.4 Staged Training Protocol

Phase 1: Train with geo + sym only
         → Learn basic retrieval and composition

Phase 2: Freeze geo/sym modules, add rel
         → Learn relational cross-attention
         → Enable kinematic consistency loss

Phase 3: Freeze geo/sym/rel, add aff
         → Learn affordance head
         → Fine-tune with prioritized slot alignment
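The freezing schedule above can be expressed as a small stage-gating helper. This is a sketch of the bookkeeping only; module names are illustrative labels, not actual network parameters:

```python
STAGES = ["Lite", "+Rel", "+Aff"]           # ordered curriculum
STAGE_MODULES = {                            # modules introduced at each stage
    "Lite": {"geo", "sym"},
    "+Rel": {"rel"},
    "+Aff": {"aff"},
}

def trainable_modules(stage):
    """Return (active, frozen) module sets for a training stage:
    only modules introduced at `stage` are trainable; everything
    introduced in earlier stages is frozen."""
    idx = STAGES.index(stage)
    frozen = set().union(*(STAGE_MODULES[s] for s in STAGES[:idx]))
    return set(STAGE_MODULES[stage]), frozen
```

In a framework like PyTorch, the `frozen` set would translate into setting `requires_grad = False` on the corresponding parameters before each phase.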

6. EG-3D Lite: Minimal Configuration

| Component | EG-3D Full | EG-3D Lite |
| --- | --- | --- |
| \( e^{\text{geo}} \) | ✓ | ✓ |
| \( e^{\text{sym}} \) | ✓ | ✓ |
| \( e^{\text{rel}} \) | ✓ | ✗ |
| \( e^{\text{aff}} \) | ✓ | ✗ |
| Kinematic loss | ✓ | ✗ |
| Affordance head | ✓ | ✗ |

EG-3D Lite Loss:

$$\mathcal{L}_{\text{Lite}} = \mathcal{L}_{\text{rec}} + \lambda_{\text{inv}} \mathcal{L}_{\text{inv}} + \lambda_{\text{sym}} \mathcal{L}_{\text{sym}} + \lambda_{\text{gravity}} \mathcal{L}_{\text{gravity}}$$


7. Full Training Objective

$$\mathcal{L} = \mathcal{L}_{\text{rec}} + \lambda_{\text{inv}} \mathcal{L}_{\text{inv}} + \lambda_{\text{slot}} \mathcal{L}_{\text{slot-align}} + \lambda_{\text{kin}} \mathcal{L}_{\text{kin}} + \lambda_{\text{pose}} \mathcal{L}_{\text{pose}} + \lambda_{\text{aff}} \mathcal{L}_{\text{aff}}$$


Reconstruction Loss:

$$\mathcal{L}_{\text{rec}} = \underbrace{\|R - R^*\|_1}_{\text{image reconstruction}} + \lambda_{\text{CD}} \underbrace{\text{CD}(G, G^*)}_{\text{geometry reconstruction}}$$

Chamfer Distance:

$$\text{CD}(G, G^*) = \frac{1}{|G|}\sum_{x \in G} \min_{y \in G^*} \|x - y\|_2^2 + \frac{1}{|G^*|}\sum_{y \in G^*} \min_{x \in G} \|x - y\|_2^2$$

Intuitive Explanation:
Reconstruction loss has two parts: (1) rendered images should match target images (L1 loss), (2) reconstructed geometry should match ground-truth geometry (Chamfer distance measures point cloud similarity).
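The Chamfer term is straightforward to implement for small point sets (a brute-force sketch; practical code would use a KD-tree or batched GPU distances):

```python
def chamfer_distance(G, G_star):
    """Symmetric Chamfer distance between two point sets,
    given as lists of coordinate tuples, matching the equation above."""
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    d_fwd = sum(min(sq_dist(x, y) for y in G_star) for x in G) / len(G)
    d_bwd = sum(min(sq_dist(x, y) for x in G) for y in G_star) / len(G_star)
    return d_fwd + d_bwd
```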


Surface Regularization (Eikonal, for SDF):

$$\mathcal{L}_{\text{surf}} = \mathbb{E}_{x \sim U(\Omega)} \big[ (\|\nabla_x f(x)\|_2 - 1)^2 \big]$$

Intuitive Explanation:
A valid SDF should have unit gradient magnitude (\(|\nabla f| = 1\)) almost everywhere. This regularization prevents degenerate solutions.
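The per-point Eikonal penalty can be checked numerically with finite differences. The sketch below is a validation tool, not the training loss, which would use autograd gradients instead:

```python
import math

def eikonal_residual(f, x, h=1e-5):
    """Squared deviation of the numerical gradient norm of f from 1
    at point x (central finite differences)."""
    grad = []
    for i in range(len(x)):
        xp = list(x); xp[i] += h
        xm = list(x); xm[i] -= h
        grad.append((f(xp) - f(xm)) / (2.0 * h))
    grad_norm = math.sqrt(sum(g * g for g in grad))
    return (grad_norm - 1.0) ** 2
```

For an exact SDF such as a sphere, the residual is essentially zero; a field with the wrong gradient scale is penalized.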


8. Computational Considerations

| Component | Complexity | Notes |
| --- | --- | --- |
| Slot Attention | \( O(J (N+T) D) \) | Linear in tokens |
| Memory Retrieval | \( O(J M) \) | Parallelizable; use approximate NN for large \( M \) |
| Cross-Attention | \( O(L H N_r) \) | \( N_r \leq 8 \), modest overhead |
| Kinematic Loss | \( O(\lvert\mathcal{E}\rvert) \) | Per-edge computation |
| Composition | \( O(J) \) per query point | Soft-min / mixture |

9. Failure Modes and Mitigations

9.1 Slot Collapse to Semantic Blobs

Mitigations (by priority):

  1. Contact signal supervision (highest priority)
  2. Geometric affordance priors
  3. Relational consistency
  4. Diversity regularization

9.2 Generator Ignoring Relational Tokens

Mitigations:

  1. Kinematic consistency loss (explicit constraint)
  2. Feed relational tokens to pose refinement head
  3. Monitor \( \|\nabla_{\text{rel}} \mathcal{L}\| \); alert if it stays too small
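The third mitigation can be sketched as a training-time monitor. Here the gradient is estimated with finite differences on a toy loss for self-containment; real training code would read the autograd gradient of the relational tokens directly:

```python
import math

def rel_grad_norm(loss_fn, z_rel, h=1e-5):
    """Finite-difference estimate of the gradient norm of the loss
    with respect to the relational tokens z_rel. A persistently tiny
    norm suggests the generator is ignoring the relational pathway."""
    grad = []
    for i in range(len(z_rel)):
        zp = list(z_rel); zp[i] += h
        zm = list(z_rel); zm[i] -= h
        grad.append((loss_fn(zp) - loss_fn(zm)) / (2.0 * h))
    return math.sqrt(sum(g * g for g in grad))
```

A monitor would compare this norm against a threshold each logging step and raise an alert when it collapses toward zero.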

9.3 Pose Estimation Without GT

Mitigations:

  1. IMU gravity prior
  2. Multi-view consistency
  3. Kinematic loss provides indirect supervision

10. Hypotheses and Future Work

10.1 Hypotheses (To Be Validated)

  1. Prioritized supervision → more functionally meaningful slots
  2. Kinematic consistency loss → improved articulated object reconstruction
  3. EG-3D Lite → meaningful gains even without rel/aff
  4. Staged construction → practical deployment with minimal effort

10.2 Proposed Experiments

| Experiment | Metric |
| --- | --- |
| Slot alignment quality | IoU with functional-part ground truth |
| Kinematic consistency | Joint error on articulated objects |
| EG-3D Lite vs. Full | Ablation study |
| Retrieval vs. assembly | Comparison with exemplar-based methods |

11. Conclusion

We presented Engram-Guided 2D-to-3D (EG-3D), a structural assembly framework for single-view 3D reconstruction. Key contributions:

  1. Modular Engram Architecture with explicit composition operators
  2. Prioritized Slot-Part Alignment: contact → geometric → relational → diversity
  3. Kinematic Consistency Loss enforcing relational constraints
  4. Gravity-Aligned Pose Estimation for GT-free settings
  5. Staged Memory Construction enabling incremental deployment
  6. Detailed mathematical specifications with intuitive explanations for every equation

This paper provides a complete foundation for implementation and empirical validation.


12. Engram-Guided Training Cycle

# Algorithm 1: Engram-Guided Training Cycle
# Input: EngramBank M, Dataset D, Stages S = {Lite, +Rel, +Aff}
# Output: Trained EG-3D model and VLA policy

# =======================================================
# STAGE-WISE TRAINING OF EG-3D
# =======================================================
for stage_s in S:                             # Iterate over Lite → +Rel → +Aff stages
    freeze_unused_components(stage_s)        # Freeze modules not used in the current stage

    for (I, P, G_star) in D:                 # Sample a minibatch (image, prompt, GT 3D)

        # -------------------------------------------------------
        # Visual-language encoding and part-level slot extraction
        # -------------------------------------------------------
        Z_tokens = E_VL(I, P)                # Fused visual-language tokens {z_t}
        K_slots  = SlotAttention(Z_tokens)  # Extract part-level structural keys {k_j}

        # -------------------------------------------------------
        # Engram retrieval and modular structural composition
        # -------------------------------------------------------
        Z_e = RetrieveAndCompose(            # Retrieve and assemble Engram components
                K_slots, M, stage_s)         # Z_e contains geo/sym/(rel)/(aff) by stage

        # -------------------------------------------------------
        # 3D generation and coarse pose estimation
        # -------------------------------------------------------
        G, T_coarse = G_theta(I, Z_e)        # Predict 3D geometry G and coarse slot poses {T_j^coarse}

        # -------------------------------------------------------
        # Pose refinement using relational tokens (if enabled)
        # -------------------------------------------------------
        if stage_s in ("+Rel", "+Aff"):      # Only refine poses once relational structure is available
            T_ref = PoseRefine(              # Refine poses with relational and slot information
                        T_coarse, K_slots, Z_e.rel)   # Output refined poses {T_j^ref}
        else:
            T_ref = T_coarse                # Use coarse poses in Lite stage

        # -------------------------------------------------------
        # Compute stage-dependent training losses
        # -------------------------------------------------------
        L = ( L_rec(G, G_star)              # Reconstruction loss on geometry and rendering
            + λ_inv * L_inv(K_slots)        # Invariance loss to enforce structure-only keys
            + λ_sym * L_sym(G, Z_e.sym) )   # Symmetry regularization loss

        if stage_s in ("+Rel", "+Aff"):     # Additional losses for relational stages
            L += ( λ_slot * L_slot_align(K_slots)     # Slot-part alignment loss
                 + λ_kin  * L_kin(G, T_ref, Z_e.rel)  # Kinematic consistency loss
                 + λ_pose * L_pose(T_ref, I) )        # Pose loss (gravity + multi-view + kinematic)

        if stage_s == "+Aff":               # Additional loss for affordance stage
            L += λ_aff * L_aff(G, Z_e.aff)  # Affordance supervision loss

        # -------------------------------------------------------
        # Update only the parameters active in the current stage
        # -------------------------------------------------------
        update(params_of(stage_s), loss=L)  # Stage-wise parameter update


# =======================================================
# TRAINING VLA POLICY WITH FROZEN EG-3D
# =======================================================
freeze(EG3D)                                # Freeze all EG-3D modules during policy learning

for episode in RL_episodes:                # Iterate over reinforcement learning episodes
    I_t, P = observe()                     # Observe current image and task prompt

    # -------------------------------------------------------
    # Forward pass through frozen EG-3D for structure-aware state
    # -------------------------------------------------------
    G_t, Z_e_t = EG3D(I_t, P)               # Obtain reconstructed 3D and retrieved structure tokens
                                             # Z_e_t.rel / Z_e_t.aff may be empty depending on stage

    # -------------------------------------------------------
    # Vision-Language-Action policy inference and execution
    # -------------------------------------------------------
    A_t = π_VLA(Render(G_t), P)             # Predict action from rendered 3D and prompt
    execute(A_t)                           # Execute action in simulator or robot

    # -------------------------------------------------------
    # Compute reward with structural shaping and safety penalties
    # -------------------------------------------------------
    r_t = ( r_success()                    # Task success reward
          + α * r_struct(                  # Structural shaping reward using rel/aff tokens
                Z_e_t.rel, Z_e_t.aff, A_t)
          - β * r_collision()              # Collision penalty
          - γ * r_torque() )               # Excessive torque penalty

    # -------------------------------------------------------
    # Update VLA policy using PPO or SAC
    # -------------------------------------------------------
    update(π_VLA, algorithm="PPO/SAC", reward=r_t)

References

[1] D. Tochilkin et al. "TripoSR." arXiv:2403.02151, 2024.
[2] Y. Hong et al. "LRM." arXiv:2311.04400, 2023.
[3] X. Long et al. "Wonder3D." arXiv:2311.00005, 2023.
[4] M. J. Kim et al. "OpenVLA." arXiv:2406.09246, 2024.
[5] H. Shi et al. "MemoryVLA." arXiv:2508.19236, 2025.
[6] F. Locatello et al. "Slot Attention." NeurIPS, 2020.
[7] K. Mo et al. "PartNet." CVPR, 2019.
[8] Y. Zhou et al. "Continuity of Rotation Representations." CVPR, 2019.
[9] B. Mildenhall et al. "NeRF." ECCV, 2020.
[10] J. J. Park et al. "DeepSDF." CVPR, 2019.
[11] B. Kerbl et al. "3D Gaussian Splatting." ACM TOG, 2023.
[12] W. Peebles and S. Xie. "DiT." CVPR, 2023.
[13] K. Grauman et al. "Ego4D." CVPR, 2022.
[14] S. Brahmbhatt et al. "ContactDB." CVPR, 2019.

