---
base_model:
- black-forest-labs/FLUX.1-Fill-dev
- microsoft/TRELLIS-image-large
pipeline_tag: image-to-image
tags:
- object-insertion
- 3d-aware
- pose-controllable-generation
- image-to-image
---

# DIRECT: Direct 3D-Aware Object Insertion via Decomposed Visual Proxies

This repository contains the model weights for **DIRECT**, presented in the paper [Direct 3D-Aware Object Insertion via Decomposed Visual Proxies](https://huggingface.co/papers/2606.06601).

**Authors**: Jingbo Gong, Yikai Wang, Yushi Lan, Yuhao Wan, Ziheng Ouyang, Rui Zhao, Ming-Ming Cheng, Qibin Hou, and Chen Change Loy.

[**Project Page**](https://gong1130.github.io/DIRECT/) | [**Paper (ArXiv)**](https://arxiv.org/abs/2606.06601) | [**Code**](https://github.com/Gong1130/DIRECT)

## Overview

DIRECT (Decomposed Injection for Reference Composition and Target-integration) is a framework that enables pose-controllable object insertion. It integrates interactive pose manipulation with high-fidelity 2D image synthesis by decomposing insertion conditions into three visual proxies:

- **Appearance guidance**: Captures visual details from the reference object image.
- **Geometry guidance**: Derived from a user-adjusted 3D proxy rendered from a reconstructed 3D object.
- **Context guidance**: From the target background scene.

By injecting these through separate pathways, DIRECT preserves reference appearance, follows user-specified poses, and adapts the object naturally to the target scene.

## Usage

Please refer to the [official GitHub repository](https://github.com/Gong1130/DIRECT) for installation instructions.

You can run the interactive demo with the following command:

```bash
python demo/demo.py --gradio_port 7860 --viser_port 8081
```

The demo allows you to segment a reference object, reconstruct it in 3D, and interactively manipulate its pose within the background image.

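Before launching the demo, it can help to confirm that the DIRECT-specific weight files are present locally. The following is a minimal sketch, not part of the official code: the file names are copied from the Model Details list in this README, while the helper `missing_weight_files` and the `checkpoints/DIRECT` directory are hypothetical and assume all files sit in a single folder. The installation instructions in the GitHub repository define the actual layout.

```python
from pathlib import Path

# File names copied from the "Model Details" list in this README.
EXPECTED_FILES = (
    "lora.safetensors",
    "condition_embedder.safetensors",
    "x_embedder.safetensors",
    "time_text_embed.safetensors",
    "pooled_image_projector.safetensors",
    "image_projector.safetensors",
    "config.json",
)


def missing_weight_files(weights_dir):
    """Return the expected file names that are absent from weights_dir."""
    root = Path(weights_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # Hypothetical location (an assumption); point this at your own download folder.
    missing = missing_weight_files("checkpoints/DIRECT")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All DIRECT-specific weight files are present.")
```

The check only looks at file names in one directory; it does not validate file contents or the external foundation models, which are downloaded separately.
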
## Model Details

This repository contains **DIRECT-specific** weights only:

- `lora.safetensors`
- `condition_embedder.safetensors`
- `x_embedder.safetensors`
- `time_text_embed.safetensors`
- `pooled_image_projector.safetensors`
- `image_projector.safetensors`
- `config.json`

The framework requires the following **external** foundation models:

- [black-forest-labs/FLUX.1-Fill-dev](https://huggingface.co/black-forest-labs/FLUX.1-Fill-dev)
- [google/siglip2-so400m-patch14-384](https://huggingface.co/google/siglip2-so400m-patch14-384)
- [microsoft/TRELLIS-image-large](https://huggingface.co/microsoft/TRELLIS-image-large)
- [briaai/RMBG-2.0](https://huggingface.co/briaai/RMBG-2.0) (for background removal in the demo)

## Citation

```bibtex
@inproceedings{gong2026direct,
  title     = {Direct 3D-Aware Object Insertion via Decomposed Visual Proxies},
  author    = {Jingbo Gong and Yikai Wang and Yushi Lan and Yuhao Wan and Ziheng Ouyang and Rui Zhao and Ming-Ming Cheng and Qibin Hou and Chen Change Loy},
  booktitle = {ICML},
  year      = {2026}
}
```