Engram-Guided Structural Memory for Single-View 3D Reconstruction in Humanoid Robotics

Community Article Published January 26, 2026

Author: GrooveJ (Danny Lee)
Affiliation: Independent Researcher (With the assistance of Many LLMs)

Conceptual paper (no experimental results).
This article presents a complete mathematical specification with intuitive explanations of all operators and symbols, designed for future implementation and empirical validation.



Abstract

Single-view 3D reconstruction under occlusion is a critical bottleneck in humanoid robotics: functional structures—handles, joints, supports—are frequently hidden, causing grasp failures and unsafe actions. While Large Reconstruction Models (LRMs) produce visually plausible shapes, they lack mechanisms for reusing structural knowledge across objects. Inspired by the principle that decoupling memory from computation yields significant reasoning gains, we propose Engram-Guided 2D-to-3D (EG-3D): a framework that integrates structure-only 3D memory with LLM-conditioned perception and Vision-Language-Action (VLA) control.

Given an RGB image and a task prompt, an LLM-conditioned encoder extracts part-level structural keys via Slot Attention, anchored to functional parts through a prioritized supervision hierarchy (contact signals → geometric priors → relational consistency). These keys retrieve Modular Engrams—decomposed into geometric, relational, symmetry, and affordance subcomponents—each injected through dedicated pathways. We enforce structural constraints via kinematic consistency loss ensuring relational tokens are not ignored by the generator. For single-view pose estimation without ground truth, we leverage gravity-aligned priors (from robot IMU) combined with multi-view consistency. We also introduce a staged memory construction protocol that enables deployment with minimal initial annotation (geometry + symmetry only), progressively adding relations and affordances.

This paper provides: (1) a complete mathematical specification with detailed explanations for every equation, (2) a minimal "EG-3D Lite" configuration for practical deployment, (3) explicit mechanisms preventing generators from ignoring structural constraints, and (4) clear differentiation from prior retrieval-based methods.


1. Introduction

1.1 The Problem: Occlusion-Induced Failures in Manipulation

Humanoid robots increasingly rely on monocular head-mounted cameras. However, real-world manipulation involves inherent underspecification: crucial functional elements—door handles, hinges, support legs, tool tips, internal joints—are frequently occluded by the object itself or by the robot's body. When these structures are misreconstructed or not inferred at all, robots select incorrect grasp points, apply unsafe forces, collide unexpectedly, or fail tasks entirely.

1.2 Industry Context: Why Now?

The strategic importance of single-view 3D reconstruction for robotics has become increasingly apparent. Recent acquisitions and investments around single-image-to-3D asset generation highlight the urgency of robust 2D→3D pipelines for embodied systems. Major technology companies are actively acquiring startups in this space, recognizing that perception-to-action pipelines are critical bottlenecks for deploying robots at scale.

1.3 The Gap: Missing Structural Memory

Recent LRMs (TripoSR, LRM, Wonder3D) scale transformers and diffusion to generate plausible geometry from single images. Yet they exhibit systematic structural errors: wrong part counts, symmetry collapse, hallucinated connections, or missing functional subparts. We argue that a central missing ingredient is reusable structural memory.

1.4 Inspiration: Conditional Memory as a Modeling Primitive

The Engram-Guided Structural Memory concept demonstrates that decoupling memory from computation yields significant reasoning gains. We extend this principle to 3D perception.

1.5 Key Technical Contributions

  1. Modular Engram Architecture with explicit composition operators and kinematic consistency loss ensuring relational constraints are enforced

  2. Prioritized Slot-Part Alignment: contact/interaction signals (strongest) → geometric priors → relational consistency (weakest)

  3. Gravity-Aligned Pose Estimation: leveraging robot IMU for coarse alignment without ground-truth poses

  4. Staged Memory Construction: minimal viable memory (geo + sym) → progressive enrichment (rel → aff)

  5. Clear Differentiation from retrieval-based priors: EG-3D is a structural assembly system, not a retrieval prior


2. Related Work and Differentiation

2.1 Retrieval-Based 3D Priors vs. Structural Assembly Memory

A key question is: How does EG-3D differ from existing retrieval-based 3D priors?

| Aspect | Retrieval-Based Priors | EG-3D (Ours) |
|---|---|---|
| Retrieval unit | Global shape or exemplar | Part-level structural motifs |
| Memory content | Full shapes (with texture/appearance) | Structure-only (geo + rel + sym + aff) |
| Composition | Single retrieval or blending | SE(3)-aware multi-part assembly |
| Key space | Category/appearance-based | Structure-only (invariance-regularized) |
| Constraints | Implicit (hope generator learns) | Explicit (kinematic consistency loss) |
| Affordance | Not modeled | First-class component |
| Action loop | Not considered | VLA integration |

Our key insight: EG-3D is not "retrieval + generation" but "structural assembly + generation". We retrieve parts, align them via SE(3), enforce kinematic constraints, and compose them with explicit operators.

2.2 Memory in Vision-Language-Action Models

Recent VLA models incorporate memory:

  • MemoryVLA: Temporal memory for long-horizon tasks
  • IVE: Episodic memory for exploration

Our work addresses structural memory (reusable 3D motifs across objects), complementary to temporal/episodic memory.


3. Notation and Preliminaries

3.1 Inputs and Embeddings

| Symbol | Type | Description |
|---|---|---|
| \(\mathcal{I}\) | \(\mathbb{R}^{H \times W \times 3}\) | Input RGB image with height \(H\), width \(W\), 3 color channels |
| \(P\) | \(\mathcal{V}^*\) | Natural-language prompt as a sequence of tokens from vocabulary \(\mathcal{V}\) |
| \(\{z_t\}_{t=1}^{N+T}\) | \(\mathbb{R}^D\) each | Fused token sequence: \(N\) visual tokens + \(T\) text tokens, each \(D\)-dimensional |
| \(D\) | \(\mathbb{N}\) | Embedding dimension (typical: 768 or 1024) |

3.2 Part-Level Queries (Slots)

| Symbol | Type | Description |
|---|---|---|
| \(J\) | \(\mathbb{N}\) | Number of part-level query slots (typical: 8–16) |
| \(k_j\) | \(\mathbb{R}^D\) | Structure-only query key for slot \(j\) |
| \(a_{j,t}\) | \([0,1]\) | Attention weight: how much slot \(j\) attends to token \(t\) |

3.3 Modular Engram Structure

Each Engram \(\mathbf{E}_m\) is a modular tuple:

\[ \mathbf{E}_m = (e_m^{\text{geo}}, e_m^{\text{rel}}, e_m^{\text{sym}}, e_m^{\text{aff}}, T_m^c) \]

Symbol-by-symbol explanation:

  • \(m\): Index of the Engram in memory (\(m = 1, \ldots, M\))
  • \(e_m^{\text{geo}} \in \mathbb{R}^{d_g}\): Geometric component — latent vector encoding 3D shape
  • \(e_m^{\text{rel}} \in \mathbb{R}^{N_r \times d_r}\): Relational component — \(N_r\) tokens encoding part connectivity
  • \(e_m^{\text{sym}} \in \mathbb{R}^{d_s}\): Symmetry component — descriptor of symmetry type and axes
  • \(e_m^{\text{aff}} \in \mathbb{R}^{N_a \times d_a}\): Affordance component — \(N_a\) markers for interaction points
  • \(T_m^c \in SE(3)\): Canonical frame — reference pose for spatial alignment

| Component | Injection Pathway | Required in Minimal Config? |
|---|---|---|
| \(e_m^{\text{geo}}\) | AdaLN | ✅ Yes |
| \(e_m^{\text{rel}}\) | Cross-Attention + Kinematic Loss | ❌ Optional (Stage 2) |
| \(e_m^{\text{sym}}\) | Symmetry-Aware Decoder | ✅ Yes |
| \(e_m^{\text{aff}}\) | Auxiliary Head | ❌ Optional (Stage 3) |
| \(T_m^c\) | SE(3) Alignment | ✅ Yes (coarse) |

3.4 Memory Structure

| Symbol | Type | Description |
|---|---|---|
| \(M\) | \(\mathbb{N}\) | Total number of Engrams in memory |
| \(\mathcal{M}\) | \(\{(k_m, \mathbf{E}_m)\}_{m=1}^{M}\) | Memory bank as key–Engram pairs |
| \(k_m\) | \(\mathbb{R}^D\) | Retrieval key for Engram \(m\) |
| \(\alpha_{j,m}\) | \([0,1]\) | Retrieval weight: how much slot \(j\) retrieves Engram \(m\) |
| \(\tau\) | \(\mathbb{R}^+\) | Temperature for softmax retrieval (typical: 0.05–0.2) |

3.5 3D Generator and Action

| Symbol | Type | Description |
|---|---|---|
| \(G\) | — | Continuous 3D representation (NeRF, SDF, or 3DGS) |
| \(G_\theta\) | — | 3D generator network with parameters \(\theta\) |
| \(\text{Render}(G)\) | \(\mathbb{R}^{H' \times W' \times 3}\) | Differentiable rendering of \(G\) |
| \(A\) | \(\mathbb{R}^{K}\) | Action vector (\(K\) dimensions, e.g., 7 for 6-DoF + gripper) |
| \(\pi_{\text{VLA}}\) | — | Vision-Language-Action policy |

4. Core Framework

4.1 Pipeline Overview

The complete pipeline consists of five stages, given by Equations (1)–(5) below.


Equation (1): Visual-Language Encoding

\[ \{z_t\}_{t=1}^{N+T} = E_{\text{VL}}(P, \mathcal{I}), \quad z = \text{Pool}(\{z_t\}) \]

Symbol-by-symbol:

  • \(E_{\text{VL}}\): Visual-Language encoder (any VLM backbone)
  • \(P\): Input text prompt
  • \(\mathcal{I}\): Input RGB image
  • \(\{z_t\}_{t=1}^{N+T}\): Output token sequence (\(N\) visual + \(T\) text tokens)
  • \(\text{Pool}(\cdot)\): Pooling operation (mean or attention pooling)
  • \(z\): Global embedding summarizing the entire input

Intuitive Explanation:
This is the robot's "perception + understanding" step. The encoder looks at the image and reads the instruction, producing a sequence of tokens that represent both visual content and linguistic intent. The pooled vector \(z\) captures the overall context.

Why this matters:
Joint visual-language encoding allows the system to focus on task-relevant parts. For "grab the handle," the encoder emphasizes handle-related visual regions.


Equation (2): Structure-Only Key Extraction

\[ \{k_j\}_{j=1}^{J} = \mathcal{K}(\{z_t\}) \]

Symbol-by-symbol:

  • \(\mathcal{K}\): Key extraction function (implemented via Slot Attention)
  • \(\{z_t\}\): Input token sequence from encoder
  • \(k_j \in \mathbb{R}^D\): Output key for slot \(j\)
  • \(J\): Number of slots (part hypotheses)

Expanded form (Slot Attention):

\[ k_j = \sum_{t=1}^{N+T} a_{j,t} \cdot z_t \]

where attention weights are computed as:

\[ a_{j,t} = \frac{\exp(q_j^\top k_t / \sqrt{D})}{\sum_{j'=1}^{J} \exp(q_{j'}^\top k_t / \sqrt{D})} \]

Critical constraint (competition across slots):

\[ \sum_{j=1}^{J} a_{j,t} = 1 \quad \text{for each token } t \]

Intuitive Explanation:
Slot Attention splits the scene into parts. Each slot "competes" to explain different tokens—one slot might capture "the handle," another "the door body," another "the hinge." The key \(k_j\) summarizes what slot \(j\) has captured.

Why the competition constraint?
Without it, all slots might attend to everything equally, producing identical keys. Competition forces specialization.
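The competition-normalized read-out above can be sketched in a few lines of NumPy. This is a simplified, hypothetical sketch (real Slot Attention iterates this with per-slot renormalization and a GRU update); it shows only the softmax-over-slots normalization and the weighted sum defining \(k_j\):

```python
import numpy as np

def slot_readout(q, tokens):
    """One slot read-out step: softmax over the SLOT axis (competition),
    then k_j = sum_t a[j, t] * z_t.
    q: (J, D) slot queries; tokens: (N+T, D) fused tokens z_t."""
    D = q.shape[1]
    logits = q @ tokens.T / np.sqrt(D)            # (J, N+T) scaled dot products
    logits -= logits.max(axis=0, keepdims=True)   # numerical stability
    a = np.exp(logits)
    a /= a.sum(axis=0, keepdims=True)             # sum_j a[j, t] == 1 per token
    keys = a @ tokens                             # (J, D) slot keys
    return keys, a

rng = np.random.default_rng(0)
keys, a = slot_readout(rng.normal(size=(8, 16)), rng.normal(size=(40, 16)))
assert np.allclose(a.sum(axis=0), 1.0)            # competition constraint holds
```

Normalizing over the slot axis (rather than over tokens, as in standard attention) is exactly what makes slots compete for tokens.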


Equation (3): Modular Engram Retrieval with SE(3) Alignment

\[ \mathbf{Z}_e = \mathcal{R}_{\text{SE(3)}}(\{k_j\}; \mathcal{M}) \]

Symbol-by-symbol:

  • \(\mathcal{R}_{\text{SE(3)}}\): Retrieval function with SE(3) alignment
  • \(\{k_j\}\): Query keys from slots
  • \(\mathcal{M}\): Memory bank
  • \(\mathbf{Z}_e\): Retrieved and composed Engram components

Expanded form (retrieval weights):

\[ s_{j,m} = \frac{k_j^\top k_m}{\|k_j\|_2 \|k_m\|_2} \]

\[ \alpha_{j,m} = \frac{\exp(s_{j,m}/\tau)}{\sum_{l=1}^{M}\exp(s_{j,l}/\tau)} \]

Symbol-by-symbol:

  • \(s_{j,m} \in [-1, 1]\): Cosine similarity between query \(k_j\) and memory key \(k_m\)
  • \(\tau > 0\): Temperature (lower = sharper retrieval, higher = softer)
  • \(\alpha_{j,m} \in [0, 1]\): Normalized retrieval weight (sums to 1 over \(m\))

Intuitive Explanation:
For each slot's key, we find the most similar Engrams in memory. The softmax converts similarities into probability-like weights. Temperature controls sharpness: \(\tau \to 0\) gives winner-take-all, \(\tau \to \infty\) gives uniform weighting.
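The retrieval weights are straightforward to compute. A minimal NumPy sketch of the cosine-plus-temperature-softmax step (helper name is ours, not part of the framework):

```python
import numpy as np

def retrieval_weights(slot_keys, mem_keys, tau=0.1):
    """alpha[j, m]: cosine similarity between slot key j and memory key m,
    followed by a temperature softmax over the M memory entries."""
    q = slot_keys / np.linalg.norm(slot_keys, axis=1, keepdims=True)
    k = mem_keys / np.linalg.norm(mem_keys, axis=1, keepdims=True)
    s = q @ k.T                                   # s[j, m] in [-1, 1]
    e = np.exp((s - s.max(axis=1, keepdims=True)) / tau)
    return e / e.sum(axis=1, keepdims=True)       # each row sums to 1
```

Lowering `tau` drives each row toward winner-take-all; raising it flattens the weights toward uniform, matching the intuition above.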


Equation (4): Multi-Pathway 3D Generation with Kinematic Constraints

\[ G = G_\theta(\mathcal{I}, \mathbf{Z}_e), \quad \text{subject to } \mathcal{L}_{\text{kin}}(G, Z_e^{\text{rel}}) < \epsilon \]

Symbol-by-symbol:

  • \(G_\theta\): 3D generator network with parameters \(\theta\)
  • \(\mathcal{I}\): Input image (provides appearance guidance)
  • \(\mathbf{Z}_e\): Composed Engram components (structural guidance)
  • \(\mathcal{L}_{\text{kin}}\): Kinematic consistency loss
  • \(\epsilon\): Constraint tolerance

Intuitive Explanation:
The generator produces 3D geometry guided by both the image (for appearance) and retrieved Engrams (for structure). The kinematic constraint ensures the output respects relational information (e.g., joint axes, connection points).


Equation (5): Action Inference

\[ A = \pi_{\text{VLA}}(\text{Render}(G), P) \]

Symbol-by-symbol:

  • \(\pi_{\text{VLA}}\): Vision-Language-Action policy
  • \(\text{Render}(G)\): Rendered images from 3D reconstruction
  • \(P\): Original task prompt
  • \(A \in \mathbb{R}^K\): Output action vector

Intuitive Explanation:
Given the reconstructed 3D (rendered as images) and the task instruction, the VLA policy outputs robot actions. The explicit 3D structure helps the policy reason about occluded parts.


4.2 Slot-Part Alignment with Prioritized Supervision

Critical Challenge: Slot Attention often produces "semantic blobs" rather than functional parts.

Our Solution: Prioritized Supervision Hierarchy

| Priority | Signal Source | Weight | Description |
|---|---|---|---|
| 1 (Highest) | Contact/interaction | \(w_1 = 1.0\) | Where humans/robots actually touch |
| 2 | Geometric priors | \(w_2 = 0.5\) | Cylinders are handles, planes are supports |
| 3 | Relational consistency | \(w_3 = 0.3\) | Connected parts should be spatially close |
| 4 (Lowest) | Diversity regularization | \(w_4 = 0.1\) | Slots should differ from each other |

Combined Slot Alignment Loss:

\[ \mathcal{L}_{\text{slot-align}} = w_1 \mathcal{L}_{\text{contact}} + w_2 \mathcal{L}_{\text{geo-prior}} + w_3 \mathcal{L}_{\text{rel-consist}} + w_4 \mathcal{L}_{\text{diversity}} \]

Symbol-by-symbol:

  • \(w_1, w_2, w_3, w_4\): Priority weights (\(w_1 > w_2 > w_3 > w_4\))
  • \(\mathcal{L}_{\text{contact}}\): Contact signal loss
  • \(\mathcal{L}_{\text{geo-prior}}\): Geometric affordance prior loss
  • \(\mathcal{L}_{\text{rel-consist}}\): Relational consistency loss
  • \(\mathcal{L}_{\text{diversity}}\): Slot diversity regularization

Intuitive Explanation:
We stack multiple supervision signals by priority. Contact signals (where people actually grab objects) are strongest because they directly indicate functional parts. Lower-priority signals fill in when higher-priority signals are unavailable.


Priority 1: Contact Signal Loss

\[ \mathcal{L}_{\text{contact}} = \sum_{j=1}^{J} \text{BCE}\big(\text{SlotMask}_j, \text{ContactMask}\big) \]

Symbol-by-symbol:

  • \(\text{SlotMask}_j \in [0,1]^{H \times W}\): Attention mask for slot \(j\) (where it attends in the image)
  • \(\text{ContactMask} \in \{0,1\}^{H \times W}\): Ground-truth contact regions (from HOI data)
  • \(\text{BCE}\): Binary Cross-Entropy loss

Expanded BCE:

\[ \text{BCE}(p, q) = -\frac{1}{HW}\sum_{i,j}\big[q_{ij}\log p_{ij} + (1-q_{ij})\log(1-p_{ij})\big] \]

Intuitive Explanation:
If we know where humans touch objects (from videos of people using objects), we train slots to attend to those regions. A slot attending to a handle should align with hand-contact regions on handles.

Why highest priority?
Contact directly indicates functional relevance. A handle is a handle because people grab it.
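The expanded BCE above maps directly to a few lines of NumPy. A minimal sketch (function name is ours; the clipping epsilon is an implementation detail we add to avoid \(\log 0\)):

```python
import numpy as np

def contact_bce(slot_mask, contact_mask, eps=1e-7):
    """Mean per-pixel binary cross-entropy between a slot's soft attention
    mask (values in [0, 1]) and a binary contact mask (H x W each)."""
    p = np.clip(slot_mask, eps, 1.0 - eps)   # avoid log(0)
    q = contact_mask
    return -np.mean(q * np.log(p) + (1.0 - q) * np.log(1.0 - p))
```

A slot mask that exactly matches the contact region yields a near-zero loss; an inverted mask is penalized heavily.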


Priority 2: Geometric Affordance Prior Loss

\[ \mathcal{L}_{\text{geo-prior}} = -\sum_{j=1}^{J} \sum_{a \in \mathcal{A}} \mathbb{1}[\text{SlotMask}_j \cap \text{GeoPrior}_a \neq \emptyset] \cdot \log p(a \mid k_j) \]

Symbol-by-symbol:

  • \(\mathcal{A}\): Set of affordance types {grasp, support, hinge, button, ...}
  • \(\text{GeoPrior}_a\): Regions satisfying the geometric prior for affordance \(a\)
  • \(\mathbb{1}[\cdot]\): Indicator function (1 if the condition holds, 0 otherwise)
  • \(p(a \mid k_j)\): Predicted probability of affordance \(a\) given slot key \(k_j\)

Geometric priors (examples):

  • Grasp (handle): Elongated cylindrical regions where \(\lambda_1 \gg \lambda_2 \approx \lambda_3\) (PCA eigenvalues)
  • Support: Horizontal surfaces with upward normal: \(\mathbf{n} \cdot \hat{z} > 0.9\)
  • Hinge: Concave junctions between planar surfaces
  • Button: Small convex protrusions on flat surfaces

Intuitive Explanation:
If a slot attends to a cylindrical region, it should predict "grasp" affordance. This loss encourages geometric consistency between attention patterns and affordance predictions.


Priority 3: Relational Consistency Loss

\[ \mathcal{L}_{\text{rel-consist}} = \sum_{(i,j) \in \mathcal{E}_{\text{rel}}} \| \text{Centroid}(\text{SlotMask}_i) - \text{Centroid}(\text{SlotMask}_j) - \Delta t_{ij}^{\text{rel}} \|_2^2 \]

Symbol-by-symbol:

  • \(\mathcal{E}_{\text{rel}}\): Set of slot pairs connected by relational tokens
  • \(\text{Centroid}(\text{SlotMask}_i) \in \mathbb{R}^2\): 2D centroid of slot \(i\)'s attention mask
  • \(\Delta t_{ij}^{\text{rel}} \in \mathbb{R}^2\): Expected relative position from the relational token (projected to 2D)

Intuitive Explanation:
If relational tokens say "slot \(i\) connects to slot \(j\) with offset \(\Delta t\)," then their attention centroids should reflect this spatial relationship. A handle connected to a door should attend to adjacent regions.


Priority 4: Slot Diversity Regularization

\[ \mathcal{L}_{\text{diversity}} = \underbrace{-\sum_{j=1}^{J} \mathcal{H}(a_j)}_{\text{entropy maximization}} + \underbrace{\lambda_{\text{rep}} \sum_{i \neq j} \max(0, \cos(k_i, k_j) - \delta)}_{\text{repulsion}} \]

Symbol-by-symbol:

  • \(\mathcal{H}(a_j) = -\sum_t a_{j,t} \log a_{j,t}\): Entropy of slot \(j\)'s attention distribution
  • \(\cos(k_i, k_j)\): Cosine similarity between slot keys
  • \(\delta\): Similarity threshold (typical: 0.5)
  • \(\lambda_{\text{rep}}\): Repulsion strength

Intuitive Explanation:
The entropy term encourages each slot to attend broadly (not collapse to a single token). The repulsion term pushes slot keys apart, preventing multiple slots from representing the same part.
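Both terms of the diversity regularizer are simple to implement. A NumPy sketch under our own naming (the small epsilon inside the log is an added numerical guard):

```python
import numpy as np

def diversity_loss(a, keys, delta=0.5, lam_rep=0.1):
    """Priority 4: negative attention entropy per slot plus a hinge
    repulsion on pairwise key cosine similarity above delta.
    a: (J, T) attention rows; keys: (J, D) slot keys."""
    ent = -np.sum(a * np.log(a + 1e-12), axis=1)        # H(a_j) per slot
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    cos = k @ k.T                                       # pairwise cosines
    off = ~np.eye(len(keys), dtype=bool)                # exclude i == j
    rep = np.maximum(0.0, cos[off] - delta).sum()
    return -ent.sum() + lam_rep * rep
```

Duplicate slot keys raise the repulsion term, so the loss is higher for identical slots than for orthogonal ones.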


4.3 Structure-Only Key Learning via Invariance

Problem: Keys might encode texture/color, causing retrieval to match appearance rather than structure.

Solution: Invariance regularization forces keys to ignore appearance.


Combined Invariance Loss:

\[ \mathcal{L}_{\text{inv}} = \mathcal{L}_{\text{inv}}^{\text{tex}} + \lambda_p \mathcal{L}_{\text{inv}}^{\text{prompt}} + \lambda_c \mathcal{L}_{\text{inv}}^{\text{cross}} \]


Texture Augmentation Invariance:

\[ \mathcal{L}_{\text{inv}}^{\text{tex}} = \sum_{j=1}^{J} \| k_j(\mathcal{I}) - k_j(\text{Aug}_{\text{tex}}(\mathcal{I})) \|_2^2 \]

Symbol-by-symbol:

  • \(k_j(\mathcal{I})\): Key from slot \(j\) given the original image
  • \(\text{Aug}_{\text{tex}}(\mathcal{I})\): Texture-augmented image (hue shift, saturation change, style transfer)
  • \(\|\cdot\|_2^2\): Squared L2 norm

Intuitive Explanation:
The same object with different colors/textures should produce identical keys. We augment textures and penalize key differences.


Prompt Invariance (Selective):

\[ \mathcal{L}_{\text{inv}}^{\text{prompt}} = \sum_{j=1}^{J} \| k_j(P_1, \mathcal{I}) - k_j(P_2, \mathcal{I}) \|_2^2 \cdot \mathbb{1}[\text{SameStructure}(P_1, P_2)] \]

Symbol-by-symbol:

  • \(P_1, P_2\): Two different prompts
  • \(\text{SameStructure}(P_1, P_2)\): True if the prompts describe the same structure with different appearance

Examples:

  • \(P_1\) = "red wooden mug" vs \(P_2\) = "blue ceramic mug" → Same structure, apply invariance
  • \(P_1\) = "mug with handle" vs \(P_2\) = "mug without handle" → Different structure, skip invariance

Intuitive Explanation:
"Red wooden chair" and "blue metal chair" should retrieve the same structural Engrams. But "chair with armrests" and "chair without armrests" are structurally different.


Cross-Instance Structural Alignment:

\[ \mathcal{L}_{\text{inv}}^{\text{cross}} = \underbrace{-\sum_{(i,j) \in \mathcal{S}_{\text{same}}} \cos(k_i, k_j)}_{\text{attract same-structure pairs}} + \underbrace{\sum_{(i,j) \in \mathcal{S}_{\text{diff}}} \max(0, \cos(k_i, k_j) - \delta)}_{\text{repel different-structure pairs}} \]

Symbol-by-symbol:

  • \(\mathcal{S}_{\text{same}}\): Pairs of keys from the same structure (different instances/textures)
  • \(\mathcal{S}_{\text{diff}}\): Pairs of keys from different structures
  • \(\delta\): Margin threshold

Intuitive Explanation:
This is contrastive learning for structure. Keys from structurally similar objects should be close; keys from different structures should be far.
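The attract/repel structure above can be sketched directly from the equation. A minimal NumPy version (function name and pair encoding as index tuples are our own conventions):

```python
import numpy as np

def cross_instance_loss(keys, same_pairs, diff_pairs, delta=0.5):
    """Contrastive structural alignment: negative cosine for same-structure
    key pairs (attract), hinge above margin delta for different-structure
    pairs (repel).  keys: (n, D); pairs: lists of (i, j) index tuples."""
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    attract = -sum(k[i] @ k[j] for i, j in same_pairs)
    repel = sum(max(0.0, k[i] @ k[j] - delta) for i, j in diff_pairs)
    return attract + repel
```

When same-structure keys are nearly parallel and different-structure keys are orthogonal, the loss is strongly negative — the desired optimum.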


4.4 Memory Key Generation

Strategy A (Recommended): Shared Encoder Pipeline

\[ k_m = \frac{1}{|\mathcal{V}| \cdot |\mathcal{T}|} \sum_{v \in \mathcal{V}} \sum_{\text{tex} \in \mathcal{T}} \mathcal{K}\big(E_{\text{VL}}(P_{\text{neutral}}, \text{Render}_{v}(S_m, \text{tex}))\big) \]

Symbol-by-symbol:

  • \(S_m\): Source 3D shape for Engram \(m\)
  • \(\mathcal{V}\): Set of viewpoints (e.g., 8 canonical views)
  • \(\mathcal{T}\): Set of texture variations
  • \(\text{Render}_{v}(S_m, \text{tex})\): Rendered image from viewpoint \(v\) with texture \(\text{tex}\)
  • \(P_{\text{neutral}}\): Structure-focused prompt (e.g., "a 3D object")

Intuitive Explanation:
Memory keys are generated using the same pipeline as query keys, ensuring they live in the same space. Averaging over viewpoints and textures ensures the key captures structure, not appearance or viewpoint.
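Strategy A reduces to averaging encoder outputs over a render grid. A schematic sketch where `render` and `encode` are hypothetical callables standing in for \(\text{Render}_v(S_m, \text{tex})\) and \(\mathcal{K}(E_{\text{VL}}(P_{\text{neutral}}, \cdot))\):

```python
import numpy as np

def build_memory_key(shape, views, textures, render, encode):
    """Average structure keys over all (viewpoint, texture) renders so the
    stored key k_m is view- and appearance-invariant.  `render` and
    `encode` are injected stand-ins for the real pipeline stages."""
    keys = [encode(render(shape, v, tex)) for v in views for tex in textures]
    return np.mean(keys, axis=0)
```

The averaging cancels view- and texture-specific components of the key, leaving what is common to all renders — the structure.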


4.5 Modular Engram Components (Detailed Specifications)

4.5.1 Relational Token Format

Storage Format (23-dimensional raw vector):

| Field | Dims | Type | Description |
|---|---|---|---|
| part_i_idx | 1 | Integer | Index of first connected part |
| part_j_idx | 1 | Integer | Index of second connected part |
| joint_type | 6 | One-hot | {fixed, revolute, prismatic, spherical, planar, free} |
| axis | 3 | Unit vector | Joint rotation/translation axis |
| rel_rotation | 6 | 6D continuous | Relative rotation (Zhou et al. representation) |
| rel_translation | 3 | 3D vector | Relative translation |
| constraint_limits | 2 | Floats | \([\theta_{\min}, \theta_{\max}]\) in radians |
| valid_mask | 1 | Binary | 1 if valid, 0 if padding |

Model Input Format (projected to \(d_r\) dimensions):

\[ e_m^{\text{rel, proj}} = \text{MLP}\big([\text{Embed}(\text{part\_i}); \text{Embed}(\text{part\_j}); \text{continuous\_fields}]\big) \]

Symbol-by-symbol:

  • \(\text{Embed}(\cdot)\): Learned embedding table for integer indices
  • \(\text{continuous\_fields}\): The remaining 21 dimensions (joint_type through valid_mask)
  • \(\text{MLP}\): Multi-layer perceptron projecting to \(d_r\) dimensions

4.5.2 Symmetry Component Format

| Field | Dims | Description |
|---|---|---|
| sym_type | 5 | One-hot: {none, bilateral, radial, translational, helical} |
| sym_axis | 3 | Primary symmetry axis (unit vector) |
| sym_center | 3 | Center of symmetry (3D point) |
| sym_count | 1 | Repetition count (e.g., 4 for chair legs) |
| sym_spacing | 1 | Spacing for translational symmetry |
| Total | 13 | |

4.5.3 Affordance Component Format

| Field | Dims | Description |
|---|---|---|
| position | 3 | 3D location in canonical frame |
| aff_type | 8 | One-hot: {grasp, support, hinge, button, slider, socket, lid, none} |
| approach_dir | 3 | Approach/interaction direction |
| contact_normal | 3 | Surface normal at contact point |
| valid_mask | 1 | 1 if valid, 0 if padding |
| Total | 18 | Per marker (\(N_a = 5\) markers) |

4.6 Modular Engram Composition

Geometric Component (weighted average):

\[ Z_e^{\text{geo}} = \sum_{j=1}^{J} \sum_{m=1}^{M} \alpha_{j,m} \cdot \text{SE3Transform}(e_m^{\text{geo}}, \Delta T_{j,m}) \]

Symbol-by-symbol:

  • \(\alpha_{j,m}\): Retrieval weight (slot \(j\), Engram \(m\))
  • \(e_m^{\text{geo}}\): Geometric latent of Engram \(m\)
  • \(\Delta T_{j,m} = T_j \cdot (T_m^c)^{-1}\): Relative transformation from canonical to target pose
  • \(\text{SE3Transform}\): Transformation operator (MLP approximation in latent space)

Intuitive Explanation:
Each retrieved geometric Engram is transformed to its target pose, then all are blended by retrieval weights.


Relational Component (concatenation with threshold):

\[ Z_e^{\text{rel}} = \text{Concat}_{j,m: \alpha_{j,m} > \theta_{\text{rel}}} \big[ e_{j,m}^{\text{rel, proj}} \big] \]

Intuitive Explanation:
We concatenate relational tokens from high-confidence retrievals. Thresholding prevents noisy low-weight retrievals from polluting the relation set.


Symmetry Component (weighted voting):

\[ Z_e^{\text{sym}} = \arg\max_{\text{type}} \sum_{j,m} \alpha_{j,m} \cdot \mathbb{1}[e_m^{\text{sym}}.\text{type} = \text{type}] \]

Intuitive Explanation:
Symmetry is discrete (an object is either bilaterally symmetric or not). We take a weighted vote over retrieved symmetry types.
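The weighted vote is a small accumulation over retrieval mass. A minimal sketch (function name ours; `sym_types` holds the discrete type label of each Engram in memory):

```python
import numpy as np

def vote_symmetry(alpha, sym_types):
    """Weighted vote for Z_e^sym: accumulate retrieval mass alpha[:, m]
    per discrete symmetry type, return the argmax type.
    alpha: (J, M) retrieval weights; sym_types: list of M type labels."""
    scores = {}
    for m, t in enumerate(sym_types):
        scores[t] = scores.get(t, 0.0) + alpha[:, m].sum()
    return max(scores, key=scores.get)
```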


Affordance Component (transform and merge):

\[ Z_e^{\text{aff}} = \bigcup_{j,m: \alpha_{j,m} > \theta_{\text{aff}}} \text{Transform}(e_m^{\text{aff}}, \Delta T_{j,m}) \]

Intuitive Explanation:
Affordance markers (grasp points, etc.) are transformed to target poses and merged. The union collects all relevant interaction points.


4.7 Enforcing Relational Constraints: Kinematic Consistency Loss

Problem: Cross-attention alone allows the generator to "politely ignore" relational tokens.

Solution: Explicit loss penalizing kinematic violations.


Combined Kinematic Loss:

\[ \mathcal{L}_{\text{kin}} = \mathcal{L}_{\text{kin}}^{\text{joint}} + \lambda_{\text{axis}} \mathcal{L}_{\text{kin}}^{\text{axis}} + \lambda_{\text{limit}} \mathcal{L}_{\text{kin}}^{\text{limit}} \]


Joint Position Consistency:

\[ \mathcal{L}_{\text{kin}}^{\text{joint}} = \sum_{(i,j) \in \mathcal{E}} \| (T_i^{-1} \cdot T_j) - T_{ij}^{\text{rel}} \|_{\text{geo}}^2 \]

Symbol-by-symbol:

  • \(\mathcal{E}\): Set of connected part pairs from relational tokens
  • \(T_i, T_j \in SE(3)\): Estimated poses of slots \(i\) and \(j\)
  • \(T_i^{-1} \cdot T_j\): Relative pose from \(i\) to \(j\)
  • \(T_{ij}^{\text{rel}}\): Expected relative pose from the relational token
  • \(\|\cdot\|_{\text{geo}}\): Geodesic distance on SE(3)

Geodesic distance on SE(3):

\[ \|T_1 - T_2\|_{\text{geo}} = \|\log(R_1^\top R_2)\|_F + \lambda_t \|t_1 - t_2\|_2 \]

where \(\log(\cdot)\) is the matrix logarithm mapping a rotation to its axis-angle form.

Intuitive Explanation:
If a relational token says "the handle is attached to the door with relative pose \(T_{ij}^{\text{rel}}\)," the generated geometry must satisfy this. Deviation is penalized.
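The SE(3) geodesic distance can be computed without an explicit matrix logarithm, since for a rotation with angle \(\theta\), \(\|\log(R)\|_F = \sqrt{2}\,|\theta|\), and \(\theta\) is recoverable from the trace. A NumPy sketch under that identity (function name ours; poses as 4×4 homogeneous matrices):

```python
import numpy as np

def se3_geodesic(T1, T2, lam_t=1.0):
    """Geodesic distance on SE(3): rotation term sqrt(2)*theta (equal to
    ||log(R1^T R2)||_F), plus a weighted Euclidean translation term."""
    R1, t1 = T1[:3, :3], T1[:3, 3]
    R2, t2 = T2[:3, :3], T2[:3, 3]
    c = np.clip((np.trace(R1.T @ R2) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(c)                    # relative rotation angle
    return np.sqrt(2.0) * theta + lam_t * np.linalg.norm(t1 - t2)
```

Identical poses give distance zero; a pure 1 m translation with `lam_t=1.0` gives distance 1.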


Joint Axis Alignment:

\[ \mathcal{L}_{\text{kin}}^{\text{axis}} = \sum_{(i,j): \text{revolute/prismatic}} \big(1 - |\text{axis}_{ij}^\top \cdot \text{PCA}_1(G_{\text{joint}}^{ij})|\big) \]

Symbol-by-symbol:

  • \(\text{axis}_{ij}\): Joint axis from the relational token (unit vector)
  • \(G_{\text{joint}}^{ij}\): Geometry near the joint between parts \(i\) and \(j\)
  • \(\text{PCA}_1(\cdot)\): First principal component (dominant direction)

Intuitive Explanation:
For a revolute joint (like a door hinge), the local geometry should be aligned with the joint axis. A poorly-aligned hinge would have high loss.


Joint Limit Enforcement:

\[ \mathcal{L}_{\text{kin}}^{\text{limit}} = \sum_{(i,j)} \big[\max(0, \theta_{ij} - \theta_{\max}) + \max(0, \theta_{\min} - \theta_{ij})\big] \]

Symbol-by-symbol:

  • \(\theta_{ij}\): Current joint angle between parts \(i\) and \(j\)
  • \(\theta_{\min}, \theta_{\max}\): Joint limits from the relational token
  • \(\max(0, \cdot)\): ReLU (only penalize violations)

Intuitive Explanation:
A door can only open so far. If the generated geometry implies the door is open beyond its limit, this is penalized.
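The limit term is a two-sided hinge. In Python (helper name ours):

```python
def joint_limit_penalty(theta, theta_min, theta_max):
    """Hinge penalty from L_kin^limit: zero inside [theta_min, theta_max],
    growing linearly with the size of any violation outside it."""
    return max(0.0, theta - theta_max) + max(0.0, theta_min - theta)
```

For a door with limits \([0, 1.5]\) rad, an implied angle of 2.0 rad incurs a penalty of 0.5; any angle within limits incurs none.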


Relational Tokens also Feed Pose Refinement:

\[ T_j^{\text{refined}} = \text{PoseRefineHead}(T_j^{\text{coarse}}, k_j, Z_e^{\text{rel}}) \]

Intuitive Explanation:
Beyond cross-attention in the generator, relational tokens also directly inform pose estimation. This creates multiple pathways for relational information, making it harder to ignore.


4.8 Gravity-Aligned Pose Estimation (No Ground Truth)

Challenge: Single-view pose estimation usually needs GT supervision.

Solution: Use robot IMU + multi-view consistency.


Gravity Prior Loss:

\[ \mathcal{L}_{\text{gravity}} = \sum_{j=1}^{J} \big(1 - |\mathbf{g}^\top \cdot R_j \cdot \hat{z}|\big) \cdot \mathbb{1}[\text{GravityAligned}(j)] \]

Symbol-by-symbol:

  • \(\mathbf{g} \in \mathbb{R}^3\): Gravity direction from the IMU (unit vector)
  • \(R_j \in SO(3)\): Estimated rotation for slot \(j\)
  • \(\hat{z} = [0, 0, 1]^\top\): Canonical up-vector
  • \(\mathbb{1}[\text{GravityAligned}(j)]\): 1 if slot \(j\) should be gravity-aligned (supports, tables, floors)

Intuitive Explanation:
A table surface should be horizontal (aligned with gravity). We use the robot's IMU to know which way is "up" and penalize misalignment.
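The gravity prior is a one-liner per slot. A NumPy sketch (function name and the boolean flag list are our own conventions for \(\mathbb{1}[\text{GravityAligned}(j)]\)):

```python
import numpy as np

def gravity_loss(g, rotations, gravity_aligned):
    """1 - |g . (R_j z_hat)| summed over slots flagged as gravity-aligned;
    zero when each flagged slot's up-axis matches the IMU gravity vector."""
    z_hat = np.array([0.0, 0.0, 1.0])
    return sum(1.0 - abs(g @ (R @ z_hat))
               for R, flag in zip(rotations, gravity_aligned) if flag)
```

An upright slot (identity rotation) costs nothing; a slot tipped 90° from vertical costs the maximum of 1 per slot.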


Multi-View Consistency Loss:

\[ \mathcal{L}_{\text{MV}} = \sum_{v_1 \neq v_2} \| \text{Render}_{v_1}(G) - \text{Warp}_{v_1 \leftarrow v_2}(\text{Render}_{v_2}(G)) \|_1 \]

Symbol-by-symbol:

  • \(\text{Render}_{v}(G)\): Rendering from viewpoint \(v\)
  • \(\text{Warp}_{v_1 \leftarrow v_2}\): Image warping from \(v_2\) to \(v_1\) using estimated depth/pose

Intuitive Explanation:
If the 3D reconstruction is correct, rendering from different viewpoints and warping should produce consistent images. Inconsistencies indicate pose or geometry errors.


Combined Pose Loss (No GT):

\[ \mathcal{L}_{\text{pose}}^{\text{no-GT}} = \lambda_g \mathcal{L}_{\text{gravity}} + \lambda_{\text{MV}} \mathcal{L}_{\text{MV}} + \lambda_{\text{kin}} \mathcal{L}_{\text{kin}} \]

Intuitive Explanation:
Without ground-truth poses, we combine: (1) gravity alignment (from IMU), (2) multi-view consistency (self-supervised), (3) kinematic constraints (from relational tokens). Together, these provide sufficient signal to learn reasonable poses.


4.9 Multi-Pathway Generator Conditioning

Pathway A: AdaLN for Geometry

\[ h'_l = \gamma_l(Z_e^{\text{geo}}) \odot \text{LN}(h_l) + \beta_l(Z_e^{\text{geo}}) \]

Symbol-by-symbol:

  • \(h_l \in \mathbb{R}^{B \times C}\): Activation at layer \(l\) (batch \(B\), channels \(C\))
  • \(\text{LN}(h_l) = \frac{h_l - \mu_l}{\sigma_l}\): Layer normalization
  • \(\mu_l, \sigma_l\): Mean and std (per-sample, across channels for MLPs; per-token for Transformers)
  • \(\gamma_l(Z_e^{\text{geo}}) = W_\gamma^l Z_e^{\text{geo}} + b_\gamma^l\): Scale from the geometric prior
  • \(\beta_l(Z_e^{\text{geo}}) = W_\beta^l Z_e^{\text{geo}} + b_\beta^l\): Shift from the geometric prior
  • \(\odot\): Element-wise multiplication

Intuitive Explanation:
AdaLN injects the geometric prior by controlling the scale and shift of activations at every layer. This is a strong form of conditioning—the prior directly modulates the "communication channels" in the network.
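Pathway A can be sketched as one NumPy function. This is a schematic, not the actual generator layer: the linear maps producing \(\gamma\) and \(\beta\) are passed in as explicit weight matrices, and the normalization epsilon is an added guard:

```python
import numpy as np

def adaln(h, z_geo, W_gamma, b_gamma, W_beta, b_beta, eps=1e-5):
    """AdaLN: layer-normalize h per sample, then scale/shift with gamma
    and beta predicted linearly from the geometric prior z_geo.
    h: (B, C) activations; z_geo: (B, d_g) composed geometric prior."""
    mu = h.mean(axis=-1, keepdims=True)
    sigma = h.std(axis=-1, keepdims=True)
    h_norm = (h - mu) / (sigma + eps)          # LN(h)
    gamma = z_geo @ W_gamma + b_gamma          # (B, C) scale
    beta = z_geo @ W_beta + b_beta             # (B, C) shift
    return gamma * h_norm + beta
```

Because \(\gamma\) and \(\beta\) multiply and offset every channel, the geometric prior modulates the entire layer rather than contributing a single additive token.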


Pathway B: Cross-Attention for Relations

hl=hl+CrossAttn(Q=hl,K=Zerel,V=Zerel,M=Mvalid)h'_l = h_l + \text{CrossAttn}(Q=h_l, K=Z_e^{\text{rel}}, V=Z_e^{\text{rel}}, M=M_{\text{valid}})

Expanded Cross-Attention:

$$\text{CrossAttn}(Q, K, V, M) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d}} + M\right) V$$

where \( M_{ij} = -\infty \) if token \( j \) is invalid (padding).

Intuitive Explanation:
The generator can "ask questions" about relational structure. "Is there a joint here? What type? What axis?" The cross-attention mechanism retrieves relevant relational information.
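A minimal single-head sketch of this masked cross-attention, in plain Python (lists of floats; a real implementation would be batched tensor code):

```python
import math

def masked_cross_attention(Q, K, V, valid):
    """Single-head cross-attention with a padding mask.
    Q: [n_q][d] queries, K/V: [n_k][d] keys/values,
    valid: [n_k] booleans. Invalid keys get -inf logits,
    so softmax assigns them exactly zero weight."""
    d = len(K[0])
    out = []
    for q in Q:
        logits = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            if ok else float("-inf")
            for k, ok in zip(K, valid)
        ]
        m = max(logits)  # shift for numerical stability
        exps = [math.exp(l - m) for l in logits]
        Z = sum(exps)
        w = [e / Z for e in exps]
        # Weighted sum of values
        out.append([sum(wi * v[j] for wi, v in zip(w, V))
                    for j in range(len(V[0]))])
    return out
```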


Pathway C: Per-Slot Symmetry

$$G_j(x) = \begin{cases} \dfrac{1}{n_j} \displaystyle\sum_{i=0}^{n_j-1} G_{\text{base}}^j\big(R_{\text{axis}}^{2\pi i/n_j} \cdot x_{\text{local}}\big) & \text{if radial symmetry} \\[2mm] \dfrac{1}{2}\big(G_{\text{base}}^j(x_{\text{local}}) + G_{\text{base}}^j(\text{Reflect}(x_{\text{local}}))\big) & \text{if bilateral symmetry} \\[2mm] G_{\text{base}}^j(x_{\text{local}}) & \text{if no symmetry} \end{cases}$$

Symbol-by-symbol:

  • \( G_j(x) \): Final field value for slot \( j \) at point \( x \)
  • \( G_{\text{base}}^j \): Base generator output (before symmetrization)
  • \( n_j \): Symmetry count (e.g., 4 for chair legs)
  • \( R_{\text{axis}}^{\theta} \): Rotation by \( \theta \) around the symmetry axis
  • \( x_{\text{local}} = (T_j^c)^{-1} \cdot x \): Point transformed into the slot's local frame
  • \( \text{Reflect}(\cdot) \): Reflection across the symmetry plane

Intuitive Explanation:
Symmetry is enforced by averaging the field over symmetric transformations. For 4-way radial symmetry (like chair legs), we evaluate the base field at 4 rotated positions and average. This guarantees perfect symmetry.
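The averaging trick is easy to verify in a toy 2D setting. The sketch below symmetrizes a scalar field about the origin; the full per-slot operator additionally maps points into the slot's local frame and handles bilateral reflection:

```python
import math

def radial_symmetrize(base_field, n, point):
    """Average a 2D scalar field over n rotations about the origin
    (toy analogue of the radial case of the per-slot symmetry operator;
    n plays the role of the symmetry count n_j)."""
    x, y = point
    total = 0.0
    for i in range(n):
        t = 2.0 * math.pi * i / n
        # Rotate the query point by 2*pi*i/n before evaluating the field
        xr = math.cos(t) * x - math.sin(t) * y
        yr = math.sin(t) * x + math.cos(t) * y
        total += base_field(xr, yr)
    return total / n
```

Even though `base_field` below is asymmetric, the averaged field is exactly invariant under rotation by \( 2\pi/n \), which is the guarantee the text refers to.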


Pathway D: Affordance Head

$$A_{\text{pred}} = \text{AffordanceHead}(G_{\text{features}}, Z_e^{\text{aff}})$$

Intuitive Explanation:
A dedicated head predicts affordance heatmaps (grasp points, support surfaces) using both generator features and retrieved affordance markers. This provides explicit manipulation-relevant outputs.


4.10 Composition Operators (Generator-Specific)

SDF Composition (Smooth Minimum):

$$f(x) = \text{SoftMin}_j\big(f_j(x)\big) = -\frac{1}{\beta}\log\sum_{j=1}^{J} \exp\big(-\beta f_j(x)\big)$$

Symbol-by-symbol:

  • \( f_j(x) \): SDF value from slot \( j \) at point \( x \) (negative inside, positive outside)
  • \( \beta > 0 \): Sharpness parameter (larger means closer to a hard minimum)

Limit behavior:

  • \( \beta \to \infty \): \( \text{SoftMin} \to \min \) (hard union)
  • \( \beta \to 0 \): \( \text{SoftMin} \to \text{mean} \) (smooth blend)

Intuitive Explanation:
SDFs represent shapes as distance fields. To combine shapes (union), we take the minimum. Soft-min provides smooth transitions at part boundaries.
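A direct Python sketch of the soft-min composition (with the standard max-shift for numerical stability, which the equation above leaves implicit):

```python
import math

def soft_min(sdf_values, beta=32.0):
    """Smooth union of per-slot SDF values:
    -(1/beta) * log(sum_j exp(-beta * f_j)).
    Shifting by the minimum keeps the exponentials from overflowing."""
    m = min(sdf_values)
    s = sum(math.exp(-beta * (f - m)) for f in sdf_values)
    return m - math.log(s) / beta
```

Note that the soft-min is always less than or equal to the hard minimum, and approaches it as `beta` grows, matching the limit behavior listed above.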


NeRF Composition (Density Mixture):

$$\sigma(x) = \sum_{j=1}^{J} w_j(x)\, \sigma_j(x)$$

$$c(x) = \frac{\sum_{j=1}^{J} w_j(x)\, \sigma_j(x)\, c_j(x)}{\sum_{j=1}^{J} w_j(x)\, \sigma_j(x) + \epsilon}$$

Symbol-by-symbol:

  • \( \sigma_j(x) \): Density from slot \( j \) at point \( x \)
  • \( c_j(x) \): Color from slot \( j \)
  • \( w_j(x) \): Spatial weight for slot \( j \) (from the lifted attention mask)
  • \( \epsilon \): Small constant for numerical stability

Intuitive Explanation:
NeRF represents scenes as density and color fields. We combine slot contributions using spatial weights, with colors weighted by density (denser regions contribute more to final color).
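The two equations above, evaluated at a single query point, look like this in Python (plain lists; colors are RGB triples):

```python
def compose_density_color(densities, colors, weights, eps=1e-8):
    """Blend per-slot NeRF outputs at one query point.
    Density is a weighted sum; color is density-weighted, so
    denser slots dominate the final color."""
    sigma = sum(w * s for w, s in zip(weights, densities))
    den = sum(w * s for w, s in zip(weights, densities)) + eps
    color = [
        sum(w * s * c[k] for w, s, c in zip(weights, densities, colors)) / den
        for k in range(len(colors[0]))
    ]
    return sigma, color
```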


3DGS Composition (Concatenation + Pruning):

$$\mathcal{G}_{\text{final}} = \text{Prune}\bigg(\bigcup_{j=1}^{J} \mathcal{G}_j,\; \theta_{\text{opacity}}\bigg)$$

Symbol-by-symbol:

  • \( \mathcal{G}_j \): Set of Gaussians from slot \( j \)
  • \( \bigcup \): Set union (concatenation)
  • \( \text{Prune}(\cdot, \theta) \): Remove Gaussians with opacity below \( \theta \)

Intuitive Explanation:
3D Gaussian Splatting represents scenes as collections of Gaussian primitives. We simply concatenate Gaussians from all slots, then prune low-opacity ones to remove redundancy.
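Since this operator is just concatenation followed by a threshold, the sketch is short. Representing each Gaussian as a dict with an `opacity` key is an illustrative choice, not the paper's data format:

```python
def compose_gaussians(slot_gaussians, opacity_threshold=0.01):
    """Concatenate per-slot Gaussian lists, then drop Gaussians
    whose opacity falls below the pruning threshold."""
    merged = [g for slot in slot_gaussians for g in slot]
    return [g for g in merged if g["opacity"] >= opacity_threshold]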


5. Staged Memory Construction Protocol

Key Insight: Full Engram annotation is expensive. We propose staged construction for practical deployment.

5.1 Stage 1: Minimal Viable Memory (EG-3D Lite)

Required: \( e^{\text{geo}} \), \( e^{\text{sym}} \), \( T^c \) (coarse)

Annotation effort: Low (automatic from 3D scans)

  • Geometry: PointNet++ encoder on mesh/point cloud
  • Symmetry: PCA + RANSAC detection (fully automatic)
  • Canonical frame: Gravity + principal axis alignment (automatic)

Capabilities: Basic shape retrieval, symmetry enforcement, rigid object reconstruction


5.2 Stage 2: Add Relational Structure

Added: \( e^{\text{rel}} \)

Annotation effort: Medium (semi-automatic)

  • Source: PartNet-Mobility dataset, or learned from articulation videos

Capabilities: Articulated object reconstruction, kinematic constraint enforcement


5.3 Stage 3: Add Affordance

Added: \( e^{\text{aff}} \)

Annotation effort: Medium-High (from interaction data)

  • Source: HOI videos, robot demonstrations, geometric priors

Capabilities: Full affordance-grounded reconstruction, direct grasp point prediction


5.4 Staged Training Protocol

Phase 1: Train with geo + sym only
         → Learn basic retrieval and composition

Phase 2: Freeze geo/sym modules, add rel
         → Learn relational cross-attention
         → Enable kinematic consistency loss

Phase 3: Freeze geo/sym/rel, add aff
         → Learn affordance head
         → Fine-tune with prioritized slot alignment
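The freezing schedule above can be expressed as a small stage-gating helper. This is a sketch of the bookkeeping only; module names are illustrative labels, not actual network parameters:

```python
STAGES = ["Lite", "+Rel", "+Aff"]           # ordered curriculum
STAGE_MODULES = {                            # modules introduced at each stage
    "Lite": {"geo", "sym"},
    "+Rel": {"rel"},
    "+Aff": {"aff"},
}

def trainable_modules(stage):
    """Return (active, frozen) module sets for a training stage:
    only modules introduced at `stage` are trainable; everything
    introduced in earlier stages is frozen."""
    idx = STAGES.index(stage)
    frozen = set().union(*(STAGE_MODULES[s] for s in STAGES[:idx]))
    return set(STAGE_MODULES[stage]), frozen
```

In a framework like PyTorch, the `frozen` set would translate into setting `requires_grad = False` on the corresponding parameters before each phase.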

6. EG-3D Lite: Minimal Configuration

| Component | EG-3D Full | EG-3D Lite |
| --- | --- | --- |
| \( e^{\text{geo}} \) | ✓ | ✓ |
| \( e^{\text{sym}} \) | ✓ | ✓ |
| \( e^{\text{rel}} \) | ✓ | ✗ |
| \( e^{\text{aff}} \) | ✓ | ✗ |
| Kinematic loss | ✓ | ✗ |
| Affordance head | ✓ | ✗ |

EG-3D Lite Loss:

$$\mathcal{L}_{\text{Lite}} = \mathcal{L}_{\text{rec}} + \lambda_{\text{inv}} \mathcal{L}_{\text{inv}} + \lambda_{\text{sym}} \mathcal{L}_{\text{sym}} + \lambda_{\text{gravity}} \mathcal{L}_{\text{gravity}}$$


7. Full Training Objective

$$\mathcal{L} = \mathcal{L}_{\text{rec}} + \lambda_{\text{inv}} \mathcal{L}_{\text{inv}} + \lambda_{\text{slot}} \mathcal{L}_{\text{slot-align}} + \lambda_{\text{kin}} \mathcal{L}_{\text{kin}} + \lambda_{\text{pose}} \mathcal{L}_{\text{pose}} + \lambda_{\text{aff}} \mathcal{L}_{\text{aff}}$$


Reconstruction Loss:

$$\mathcal{L}_{\text{rec}} = \underbrace{\|R - R^*\|_1}_{\text{image reconstruction}} + \lambda_{\text{CD}} \underbrace{\text{CD}(G, G^*)}_{\text{geometry reconstruction}}$$

Chamfer Distance:

$$\text{CD}(G, G^*) = \frac{1}{|G|}\sum_{x \in G} \min_{y \in G^*} \|x - y\|_2^2 + \frac{1}{|G^*|}\sum_{y \in G^*} \min_{x \in G} \|x - y\|_2^2$$

Intuitive Explanation:
Reconstruction loss has two parts: (1) rendered images should match target images (L1 loss), (2) reconstructed geometry should match ground-truth geometry (Chamfer distance measures point cloud similarity).
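The Chamfer term is straightforward to implement for small point sets (a brute-force sketch; practical code would use a KD-tree or batched GPU distances):

```python
def chamfer_distance(G, G_star):
    """Symmetric Chamfer distance between two point sets,
    given as lists of coordinate tuples, matching the equation above."""
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    d_fwd = sum(min(sq_dist(x, y) for y in G_star) for x in G) / len(G)
    d_bwd = sum(min(sq_dist(x, y) for x in G) for y in G_star) / len(G_star)
    return d_fwd + d_bwd
```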


Surface Regularization (Eikonal, for SDF):

$$\mathcal{L}_{\text{surf}} = \mathbb{E}_{x \sim U(\Omega)} \big[ (\|\nabla_x f(x)\|_2 - 1)^2 \big]$$

Intuitive Explanation:
A valid SDF should have unit gradient magnitude (\(|\nabla f| = 1\)) almost everywhere. This regularization prevents degenerate solutions.
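The per-point Eikonal penalty can be checked numerically with finite differences. The sketch below is a validation tool, not the training loss, which would use autograd gradients instead:

```python
import math

def eikonal_residual(f, x, h=1e-5):
    """Squared deviation of the numerical gradient norm of f from 1
    at point x (central finite differences)."""
    grad = []
    for i in range(len(x)):
        xp = list(x); xp[i] += h
        xm = list(x); xm[i] -= h
        grad.append((f(xp) - f(xm)) / (2.0 * h))
    grad_norm = math.sqrt(sum(g * g for g in grad))
    return (grad_norm - 1.0) ** 2
```

For an exact SDF such as a sphere, the residual is essentially zero; a field with the wrong gradient scale is penalized.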


8. Computational Considerations

| Component | Complexity | Notes |
| --- | --- | --- |
| Slot Attention | \( O(J (N+T) D) \) | Linear in tokens |
| Memory Retrieval | \( O(J M) \) | Parallelizable; use approximate NN for large \( M \) |
| Cross-Attention | \( O(L H N_r) \) | \( N_r \leq 8 \), modest overhead |
| Kinematic Loss | \( O(\lvert\mathcal{E}\rvert) \) | Per-edge computation |
| Composition | \( O(J) \) per query point | Soft-min / mixture |

9. Failure Modes and Mitigations

9.1 Slot Collapse to Semantic Blobs

Mitigations (by priority):

  1. Contact signal supervision (highest priority)
  2. Geometric affordance priors
  3. Relational consistency
  4. Diversity regularization

9.2 Generator Ignoring Relational Tokens

Mitigations:

  1. Kinematic consistency loss (explicit constraint)
  2. Feed relational tokens to pose refinement head
  3. Monitor \( \|\nabla_{\text{rel}} \mathcal{L}\| \); alert if it stays too small
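The third mitigation can be sketched as a training-time monitor. Here the gradient is estimated with finite differences on a toy loss for self-containment; real training code would read the autograd gradient of the relational tokens directly:

```python
import math

def rel_grad_norm(loss_fn, z_rel, h=1e-5):
    """Finite-difference estimate of the gradient norm of the loss
    with respect to the relational tokens z_rel. A persistently tiny
    norm suggests the generator is ignoring the relational pathway."""
    grad = []
    for i in range(len(z_rel)):
        zp = list(z_rel); zp[i] += h
        zm = list(z_rel); zm[i] -= h
        grad.append((loss_fn(zp) - loss_fn(zm)) / (2.0 * h))
    return math.sqrt(sum(g * g for g in grad))
```

A monitor would compare this norm against a threshold each logging step and raise an alert when it collapses toward zero.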

9.3 Pose Estimation Without GT

Mitigations:

  1. IMU gravity prior
  2. Multi-view consistency
  3. Kinematic loss provides indirect supervision

10. Hypotheses and Future Work

10.1 Hypotheses (To Be Validated)

  1. Prioritized supervision → more functionally meaningful slots
  2. Kinematic consistency loss → improved articulated object reconstruction
  3. EG-3D Lite → meaningful gains even without rel/aff
  4. Staged construction → practical deployment with minimal effort

10.2 Proposed Experiments

| Experiment | Metric |
| --- | --- |
| Slot alignment quality | IoU with functional-part ground truth |
| Kinematic consistency | Joint error on articulated objects |
| EG-3D Lite vs. Full | Ablation study |
| Retrieval vs. assembly | Comparison with exemplar-based methods |

11. Conclusion

We presented Engram-Guided 2D-to-3D (EG-3D), a structural assembly framework for single-view 3D reconstruction. Key contributions:

  1. Modular Engram Architecture with explicit composition operators
  2. Prioritized Slot-Part Alignment: contact → geometric → relational → diversity
  3. Kinematic Consistency Loss enforcing relational constraints
  4. Gravity-Aligned Pose Estimation for GT-free settings
  5. Staged Memory Construction enabling incremental deployment
  6. Detailed mathematical specifications with intuitive explanations for every equation

This paper provides a complete foundation for implementation and empirical validation.


12. Engram-Guided Training Cycle

# Algorithm 1: Engram-Guided Training Cycle
# Input: EngramBank M, Dataset D, Stages S = {Lite, +Rel, +Aff}
# Output: Trained EG-3D model and VLA policy

# =======================================================
# STAGE-WISE TRAINING OF EG-3D
# =======================================================
for stage_s in S:                             # Iterate over Lite → +Rel → +Aff stages
    freeze_unused_components(stage_s)        # Freeze modules not used in the current stage

    for (I, P, G_star) in D:                 # Sample a minibatch (image, prompt, GT 3D)

        # -------------------------------------------------------
        # Visual-language encoding and part-level slot extraction
        # -------------------------------------------------------
        Z_tokens = E_VL(I, P)                # Fused visual-language tokens {z_t}
        K_slots  = SlotAttention(Z_tokens)  # Extract part-level structural keys {k_j}

        # -------------------------------------------------------
        # Engram retrieval and modular structural composition
        # -------------------------------------------------------
        Z_e = RetrieveAndCompose(            # Retrieve and assemble Engram components
                K_slots, M, stage_s)         # Z_e contains geo/sym/(rel)/(aff) by stage

        # -------------------------------------------------------
        # 3D generation and coarse pose estimation
        # -------------------------------------------------------
        G, T_coarse = G_theta(I, Z_e)        # Predict 3D geometry G and coarse slot poses {T_j^coarse}

        # -------------------------------------------------------
        # Pose refinement using relational tokens (if enabled)
        # -------------------------------------------------------
        if stage_s in ("+Rel", "+Aff"):      # Only refine poses once relational structure is available
            T_ref = PoseRefine(              # Refine poses with relational and slot information
                        T_coarse, K_slots, Z_e.rel)   # Output refined poses {T_j^ref}
        else:
            T_ref = T_coarse                # Use coarse poses in Lite stage

        # -------------------------------------------------------
        # Compute stage-dependent training losses
        # -------------------------------------------------------
        L = ( L_rec(G, G_star)              # Reconstruction loss on geometry and rendering
            + λ_inv * L_inv(K_slots)        # Invariance loss to enforce structure-only keys
            + λ_sym * L_sym(G, Z_e.sym) )   # Symmetry regularization loss

        if stage_s in ("+Rel", "+Aff"):     # Additional losses for relational stages
            L += ( λ_slot * L_slot_align(K_slots)     # Slot-part alignment loss
                 + λ_kin  * L_kin(G, T_ref, Z_e.rel)  # Kinematic consistency loss
                 + λ_pose * L_pose(T_ref, I) )        # Pose loss (gravity + multi-view + kinematic)

        if stage_s == "+Aff":               # Additional loss for affordance stage
            L += λ_aff * L_aff(G, Z_e.aff)  # Affordance supervision loss

        # -------------------------------------------------------
        # Update only the parameters active in the current stage
        # -------------------------------------------------------
        update(params_of(stage_s), loss=L)  # Stage-wise parameter update


# =======================================================
# TRAINING VLA POLICY WITH FROZEN EG-3D
# =======================================================
freeze(EG3D)                                # Freeze all EG-3D modules during policy learning

for episode in RL_episodes:                # Iterate over reinforcement learning episodes
    I_t, P = observe()                     # Observe current image and task prompt

    # -------------------------------------------------------
    # Forward pass through frozen EG-3D for structure-aware state
    # -------------------------------------------------------
    G_t, Z_e_t = EG3D(I_t, P)               # Obtain reconstructed 3D and retrieved structure tokens
                                             # Z_e_t.rel / Z_e_t.aff may be empty depending on stage

    # -------------------------------------------------------
    # Vision-Language-Action policy inference and execution
    # -------------------------------------------------------
    A_t = π_VLA(Render(G_t), P)             # Predict action from rendered 3D and prompt
    execute(A_t)                           # Execute action in simulator or robot

    # -------------------------------------------------------
    # Compute reward with structural shaping and safety penalties
    # -------------------------------------------------------
    r_t = ( r_success()                    # Task success reward
          + α * r_struct(                  # Structural shaping reward using rel/aff tokens
                Z_e_t.rel, Z_e_t.aff, A_t)
          - β * r_collision()              # Collision penalty
          - γ * r_torque() )               # Excessive torque penalty

    # -------------------------------------------------------
    # Update VLA policy using PPO or SAC
    # -------------------------------------------------------
    update(π_VLA, algorithm="PPO/SAC", reward=r_t)

References

[1] D. Tochilkin et al. "TripoSR." arXiv:2403.02151, 2024.
[2] Y. Hong et al. "LRM." arXiv:2311.04400, 2023.
[3] X. Long et al. "Wonder3D." arXiv:2311.00005, 2023.
[4] M. J. Kim et al. "OpenVLA." arXiv:2406.09246, 2024.
[5] H. Shi et al. "MemoryVLA." arXiv:2508.19236, 2025.
[6] F. Locatello et al. "Slot Attention." NeurIPS, 2020.
[7] K. Mo et al. "PartNet." CVPR, 2019.
[8] Y. Zhou et al. "Continuity of Rotation Representations." CVPR, 2019.
[9] B. Mildenhall et al. "NeRF." ECCV, 2020.
[10] J. J. Park et al. "DeepSDF." CVPR, 2019.
[11] B. Kerbl et al. "3D Gaussian Splatting." ACM TOG, 2023.
[12] W. Peebles and S. Xie. "DiT." CVPR, 2023.
[13] K. Grauman et al. "Ego4D." CVPR, 2022.
[14] S. Brahmbhatt et al. "ContactDB." CVPR, 2019.

