---
base_model:
- black-forest-labs/FLUX.1-Fill-dev
- microsoft/TRELLIS-image-large
pipeline_tag: image-to-image
tags:
- object-insertion
- 3d-aware
- pose-controllable-generation
- image-to-image
---

# DIRECT: Direct 3D-Aware Object Insertion via Decomposed Visual Proxies

This repository contains the model weights for **DIRECT**, presented in the paper [Direct 3D-Aware Object Insertion via Decomposed Visual Proxies](https://huggingface.co/papers/2606.06601).

**Authors**: Jingbo Gong, Yikai Wang, Yushi Lan, Yuhao Wan, Ziheng Ouyang, Rui Zhao, Ming-Ming Cheng, Qibin Hou, and Chen Change Loy.

[**Project Page**](https://gong1130.github.io/DIRECT/) | [**Paper (arXiv)**](https://arxiv.org/abs/2606.06601) | [**Code**](https://github.com/Gong1130/DIRECT)

## Overview

DIRECT (Decomposed Injection for Reference Composition and Target-integration) is a framework for pose-controllable object insertion. It combines interactive pose manipulation with high-fidelity 2D image synthesis by decomposing the insertion conditions into three visual proxies:
- **Appearance guidance**: Captures visual details from the reference object image.
- **Geometry guidance**: Derived from a user-adjusted 3D proxy rendered from a reconstructed 3D object.
- **Context guidance**: Taken from the target background scene.

By injecting these proxies through separate pathways, DIRECT preserves the reference appearance, follows user-specified poses, and adapts the object naturally to the target scene.

## Usage

See the [official GitHub repository](https://github.com/Gong1130/DIRECT) for installation instructions. Once installed, launch the interactive demo with:

```bash
python demo/demo.py --gradio_port 7860 --viser_port 8081
```

The demo allows you to segment a reference object, reconstruct it in 3D, and interactively manipulate its pose within the background image.
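Since the command above takes two port numbers, it can help to confirm that both are free before launching. The following is a minimal sketch, not part of the DIRECT codebase: the helper `port_is_free` is a hypothetical name, and it assumes the demo binds on the local machine (`127.0.0.1`).

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing is currently bound to host:port.

    Hypothetical helper for illustration; not provided by DIRECT.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            # Binding fails with OSError when another process holds the port.
            sock.bind((host, port))
        except OSError:
            return False
    return True


if __name__ == "__main__":
    # Port numbers taken from the demo command shown above.
    for flag, port in (("--gradio_port", 7860), ("--viser_port", 8081)):
        status = "free" if port_is_free(port) else "in use"
        print(f"{flag} {port}: {status}")
```

If either port is reported as in use, pass a different value to the corresponding flag.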

## Model Details

This repository contains **DIRECT-specific** weights only:
- `lora.safetensors`
- `condition_embedder.safetensors`
- `x_embedder.safetensors`
- `time_text_embed.safetensors`
- `pooled_image_projector.safetensors`
- `image_projector.safetensors`
- `config.json`

The framework requires the following **external** foundation models:
- [black-forest-labs/FLUX.1-Fill-dev](https://huggingface.co/black-forest-labs/FLUX.1-Fill-dev)
- [google/siglip2-so400m-patch14-384](https://huggingface.co/google/siglip2-so400m-patch14-384)
- [microsoft/TRELLIS-image-large](https://huggingface.co/microsoft/TRELLIS-image-large)
- [briaai/RMBG-2.0](https://huggingface.co/briaai/RMBG-2.0) (for background removal in the demo)
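
After downloading this repository, the DIRECT-specific file list above can be checked with a short script. This is a minimal sketch under stated assumptions: the function name `missing_weight_files` and the default `./weights` directory are hypothetical, and the script only checks that the files exist, not that they load correctly.

```python
import sys
from pathlib import Path

# File names copied from the list of DIRECT-specific weights above.
EXPECTED_FILES = (
    "lora.safetensors",
    "condition_embedder.safetensors",
    "x_embedder.safetensors",
    "time_text_embed.safetensors",
    "pooled_image_projector.safetensors",
    "image_projector.safetensors",
    "config.json",
)


def missing_weight_files(weights_dir):
    """Return the expected file names that are absent from weights_dir."""
    root = Path(weights_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # "./weights" is a placeholder; pass the real download directory instead.
    target = sys.argv[1] if len(sys.argv) > 1 else "./weights"
    missing = missing_weight_files(target)
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All DIRECT-specific files found in", target)
```

The external foundation models are not covered by this check, since they are hosted in their own repositories.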

## Citation

```bibtex
@inproceedings{gong2026direct,
  title     = {Direct 3D-Aware Object Insertion via Decomposed Visual Proxies},
  author    = {Jingbo Gong and Yikai Wang and Yushi Lan and Yuhao Wan and Ziheng Ouyang and Rui Zhao and Ming-Ming Cheng and Qibin Hou and Chen Change Loy},
  booktitle = {ICML},
  year      = {2026}
}
```