---
license: apache-2.0
library_name: diffusers
pipeline_tag: text-to-image
base_model: black-forest-labs/FLUX.1-dev
datasets:
- va1bhavagrawa1/seethrough3d-data
language:
- en
tags:
- controllable text-to-image generation
- diffusion models
- 3D layout control
- occlusion reasoning
---

# [CVPR-26 🎉] SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation

[![arXiv](https://img.shields.io/badge/arXiv-2602.23359-b31b1b.svg)](https://arxiv.org/abs/2602.23359)
[![Project Page](https://img.shields.io/badge/Project-Page-blue.svg)](https://seethrough3d.github.io)
[![GitHub](https://img.shields.io/badge/GitHub-Repository-black.svg)](https://github.com/va1bhavagrawal/seethrough3d.git)
[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-yellow)](https://huggingface.co/va1bhavagrawa1/seethrough3d)
[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-orange)](https://huggingface.co/datasets/va1bhavagrawa1/seethrough3d-data/tree/main)


<div align="center">
  <img src="assets/teaser_camera_ready.png" width="50%">
</div>

## 🚀 Getting Started

We recommend creating a `conda` environment named `st3d` with Python 3.11:

```bash
conda create -n st3d python=3.11
conda activate st3d
```

Install the dependencies using the provided `requirements.txt`, then install the project itself in editable mode:

```bash
pip install -r requirements.txt
pip install -e .
```

## 🎨 Inference

Inference with this model requires ~38 GB of GPU VRAM. Note that inference runs Blender in EEVEE mode, which is faster on workstation GPUs such as the NVIDIA RTX A6000 than on data-center GPUs such as the NVIDIA H100.

### 🌐 Download the Pre-Trained Checkpoint 

We use [FLUX.1-dev](https://huggingface.co/black-forest-labs/FLUX.1-dev) as the base model. To download the SeeThrough3D LoRA checkpoint, run:
```bash
conda activate st3d 
python3 download_checkpoint.py 
``` 
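Since the checkpoint is a LoRA on top of FLUX.1-dev, loading the weights programmatically should look roughly like the standard 🤗 diffusers LoRA flow sketched below. Note this is only a sketch of weight loading: the 3D-layout conditioning inputs are prepared by the project's inference code (the Gradio app and notebook described below), so a plain text-to-image call will not exercise the layout control.

```python
import torch
from diffusers import FluxPipeline

# Load the base model in bfloat16, then attach the SeeThrough3D LoRA.
# A plain pipe(prompt) call will NOT provide 3D-layout control; the layout
# conditioning is assembled by the project's own inference code.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
pipe.load_lora_weights("va1bhavagrawa1/seethrough3d")
pipe.to("cuda")
```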
<div align="center">
  <img src="assets/gradio_demo.png" width="100%">
</div>

### 🤗 Gradio Interface

The easiest way to perform inference is through the 🤗 Gradio interface, which makes it simple to specify 3D layouts. To launch the interface, run:
```bash
cd inference 
conda activate st3d 
python3 app.py 
``` 

For a detailed guide on how to use the Gradio interface, please refer to the [wiki](gradio_wiki.md).

Created 3D layouts can be saved by clicking the `💾 Save Scene` button, which stores the 3D layout along with other information (seed, image size, prompt, etc.) in a pickle file. We also provide various example layouts under the `🖼️ Examples` section.
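Because saved scenes are ordinary Python pickles, they can be inspected or manipulated outside the interface. A minimal self-contained sketch follows; the field names in the dummy scene are hypothetical stand-ins, since the real schema is whatever the `💾 Save Scene` button writes:

```python
import pickle

# Dummy stand-in for a scene saved by the "Save Scene" button.
# The real field names/schema may differ; these keys are illustrative only.
scene = {
    "prompt": "a red car parked behind a tree",
    "seed": 42,
    "image_size": (512, 512),
}

with open("scene.pkl", "wb") as f:
    pickle.dump(scene, f)

# Load it back, e.g. to inspect or re-run a saved layout.
with open("scene.pkl", "rb") as f:
    loaded = pickle.load(f)

print(loaded["prompt"])
```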


The interface requires a few available ports on the host machine; these can be configured in `inference/config.py`.

### 📒 Notebook Inference

The inference notebook is located at `infer.ipynb`. It can load a scene saved from the 🤗 Gradio interface (described above), visualize the model's inputs (shown below), and perform inference. The notebook also requires a few available ports on the host machine; these can be configured in `inference/config.py`.

<div align="center">
  <img src="assets/input_vis.png" width="100%">
</div>

## πŸ‹ Training  

### 🌐 Download the Dataset
By default, the data is downloaded in the `dataset` directory. To change the download location, edit the `LOCAL_DIR` variable in `dataset/download.py`.  

```bash
cd dataset 
conda activate st3d 
./setup_data.sh 
```

We are working on making the data compatible with the 🤗 Datasets library for easier visualization and streaming; see [`va1bhavagrawa1/seethrough3d-data`](https://huggingface.co/datasets/va1bhavagrawa1/seethrough3d-data/tree/main).

### πŸƒ Run Training

Edit `train/train.sh` to specify the downloaded dataset path. 

We train the model for a single epoch at resolution 512 with an effective batch size of 2 (~25K steps). This requires 2x 80 GB GPUs (one image per GPU).
```bash
cd train 
# edit the default_config.yaml to specify GPU configuration  
conda activate st3d 
./train.sh 
```
The training takes ~6 hours on 2x NVIDIA H100 GPUs. 

We optionally run a second fine-tuning stage (~5K steps) at resolution 1024, which improves control and realism at inference time. The number of steps for this stage is set via the `--stage2_steps` flag. This stage requires 2x 96 GB GPUs.

The training VRAM requirement for the first stage can be reduced to 2x 48 GB GPUs by **caching text embeddings**. To cache the text embeddings, run:
```bash
cd train/caching 
# change `DATASET_JSONL` global var in `train/caching/cache.py` to point to training dataset jsonl  
conda activate st3d 
./cache_text_embeddings.sh 
``` 
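Conceptually, caching amounts to running every training prompt through the frozen text encoders once and persisting the embeddings, so the encoders need not stay resident in GPU memory during training. A self-contained sketch of the idea follows; the hash-based encoder is a stand-in for FLUX's actual text encoders, and the file layout is illustrative, not the cache script's actual format:

```python
import hashlib
import pickle

def dummy_text_encoder(prompt: str) -> list[float]:
    # Deterministic stand-in for the frozen text encoders; the real cache
    # script runs FLUX's text encoders instead.
    digest = hashlib.sha256(prompt.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:8]]

prompts = ["a red car behind a tree", "two chairs beside a table"]

# Encode each prompt once and persist the embeddings to disk.
cache = {p: dummy_text_encoder(p) for p in prompts}
with open("text_embeds.pkl", "wb") as f:
    pickle.dump(cache, f)

# During training, look up cached embeddings instead of running the encoders.
with open("text_embeds.pkl", "rb") as f:
    cached = pickle.load(f)
```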

Now, set the flag `--inference_embeds_dir` in `train/train.sh` to the location of the cached text embeddings.

> **Note:** The VRAM requirements can be reduced further with training-time optimizations such as gradient checkpointing. We plan to implement this in the future and also welcome PRs in this direction.

## πŸ… Citation

If you find this work useful, please cite:

```bibtex
@misc{agrawal2026seethrough3docclusionaware3d,
      title={SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation}, 
      author={Vaibhav Agrawal and Rishubh Parihar and Pradhaan Bhat and Ravi Kiran Sarvadevabhatla and R. Venkatesh Babu},
      year={2026},
      eprint={2602.23359},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.23359}, 
}
```