SeeThrough3D: Occlusion-Aware 3D Control in Text-to-Image Generation
Abstract
SeeThrough3D generates scenes conditioned on 3D layouts with explicit occlusion modeling, using translucent 3D boxes and visual tokens derived from their rendered representation.
We identify occlusion reasoning as a fundamental yet overlooked aspect of 3D layout-conditioned generation: it is essential for synthesizing partially occluded objects with depth-consistent geometry and scale. While existing methods can generate realistic scenes that follow input layouts, they often fail to model precise inter-object occlusions. We propose SeeThrough3D, a model for 3D layout-conditioned generation that explicitly models occlusions. We introduce an occlusion-aware 3D scene representation (OSCR), in which objects are depicted as translucent 3D boxes placed within a virtual environment and rendered from the desired camera viewpoint. The transparency encodes hidden object regions, enabling the model to reason about occlusions, while the rendered viewpoint provides explicit camera control during generation. We condition a pretrained flow-based text-to-image generation model on a set of visual tokens derived from our rendered 3D representation. Furthermore, we apply masked self-attention to bind each object bounding box to its corresponding textual description, enabling accurate generation of multiple objects without attribute mixing. To train the model, we construct a synthetic dataset of diverse multi-object scenes with strong inter-object occlusions. SeeThrough3D generalizes effectively to unseen object categories and enables precise 3D layout control with realistic occlusions and consistent camera control.
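The abstract describes three mechanisms: a rendered translucent-box scene representation, conditioning via visual tokens, and masked self-attention that binds each box to its own text description. The page carries no code, so as a rough illustration of the last idea only, here is a minimal PyTorch sketch of how such a binding mask could be constructed; the function name, the token layout (image tokens, then per-object box tokens, then per-object text tokens), and the span format are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def binding_attention_mask(seq_len, object_spans):
    """Boolean self-attention mask (True = attend) that allows every token pair
    EXCEPT box<->text pairs belonging to different objects.

    object_spans: per object, a pair of half-open index ranges
    ((box_start, box_end), (txt_start, txt_end)) into one concatenated
    sequence, e.g. [image tokens | box tokens | text tokens].
    NOTE: this layout and API are assumptions, not the paper's code.
    """
    mask = torch.ones(seq_len, seq_len, dtype=torch.bool)
    box_idx = [torch.arange(s, e) for (s, e), _ in object_spans]
    txt_idx = [torch.arange(s, e) for _, (s, e) in object_spans]
    all_box, all_txt = torch.cat(box_idx), torch.cat(txt_idx)

    # Block every box<->text pair, then re-open same-object pairs only.
    mask[all_box[:, None], all_txt[None, :]] = False
    mask[all_txt[:, None], all_box[None, :]] = False
    for b, t in zip(box_idx, txt_idx):
        mask[b[:, None], t[None, :]] = True
        mask[t[:, None], b[None, :]] = True
    return mask

# Toy usage: 16 image tokens, two objects with 4 box tokens and 8 text tokens each.
spans = [((16, 20), (24, 32)), ((20, 24), (32, 40))]
mask = binding_attention_mask(40, spans)
q = k = v = torch.randn(1, 4, 40, 64)  # (batch, heads, tokens, head_dim)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```

Blocking only cross-object box-to-text pairs leaves image-image and image-text attention intact, which is one plausible way to prevent attribute mixing between objects while preserving global scene context.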
Community
The following papers were recommended by the Semantic Scholar API:
- POCI-Diff: Position Objects Consistently and Interactively with 3D-Layout Guided Diffusion (2026)
- 3D Space as a Scratchpad for Editable Text-to-Image Generation (2026)
- GenCAMO: Scene-Graph Contextual Decoupling for Environment-aware and Mask-free Camouflage Image-Dense Annotation Generation (2026)
- RoamScene3D: Immersive Text-to-3D Scene Generation via Adaptive Object-aware Roaming (2026)
- PLACID: Identity-Preserving Multi-Object Compositing via Video Diffusion with Synthetic Trajectories (2026)
- RefAny3D: 3D Asset-Referenced Diffusion Models for Image Generation (2026)
- Ctrl&Shift: High-Quality Geometry-Aware Object Manipulation in Visual Generation (2026)
