---
base_model:
- black-forest-labs/FLUX.1-Fill-dev
- microsoft/TRELLIS-image-large
pipeline_tag: image-to-image
tags:
- object-insertion
- 3d-aware
- pose-controllable-generation
- image-to-image
---
# DIRECT: Direct 3D-Aware Object Insertion via Decomposed Visual Proxies
This repository contains the model weights for **DIRECT**, presented in the paper [Direct 3D-Aware Object Insertion via Decomposed Visual Proxies](https://huggingface.co/papers/2606.06601).
**Authors**: Jingbo Gong, Yikai Wang, Yushi Lan, Yuhao Wan, Ziheng Ouyang, Rui Zhao, Ming-Ming Cheng, Qibin Hou, and Chen Change Loy.
[**Project Page**](https://gong1130.github.io/DIRECT/) | [**Paper (arXiv)**](https://arxiv.org/abs/2606.06601) | [**Code**](https://github.com/Gong1130/DIRECT)
## Overview
DIRECT (Decomposed Injection for Reference Composition and Target-integration) is a framework that enables pose-controllable object insertion. It integrates interactive pose manipulation with high-fidelity 2D image synthesis by decomposing insertion conditions into three visual proxies:
- **Appearance guidance**: Captures visual details from the reference object image.
- **Geometry guidance**: Derived from a user-adjusted 3D proxy rendered from a reconstructed 3D object.
- **Context guidance**: Taken from the target background scene.
By injecting these three proxies through separate pathways, DIRECT preserves the reference appearance, follows the user-specified pose, and adapts the object naturally to the target scene.
## Usage
Please refer to the [official GitHub repository](https://github.com/Gong1130/DIRECT) for installation instructions. You can run the interactive demo with the following command:
```bash
python demo/demo.py --gradio_port 7860 --viser_port 8081
```
The demo allows you to segment a reference object, reconstruct it in 3D, and interactively manipulate its pose within the background image.
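The command above binds two local ports, one for `--gradio_port` and one for `--viser_port`. If either is already in use, the launch may fail. The following is a minimal sketch of a pre-launch check using only the Python standard library; `port_is_free` is a hypothetical helper, not part of the DIRECT code, and it assumes the demo listens on the local machine.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Another process (or an earlier demo run) already holds the port.
            return False
    return True


if __name__ == "__main__":
    # Ports taken from the example command above.
    for name, port in (("gradio", 7860), ("viser", 8081)):
        status = "free" if port_is_free(port) else "in use"
        print(f"{name} port {port}: {status}")
```

If a port is reported as in use, pass a different value to the corresponding flag.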
## Model Details
This repository contains **DIRECT-specific** weights only:
- `lora.safetensors`
- `condition_embedder.safetensors`
- `x_embedder.safetensors`
- `time_text_embed.safetensors`
- `pooled_image_projector.safetensors`
- `image_projector.safetensors`
- `config.json`
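After downloading, it can help to confirm that all seven files are present before starting the demo. The sketch below assumes the files sit flat in a single local directory; the layout the DIRECT code actually expects is documented in the GitHub repository, not here. `EXPECTED_FILES` and `missing_weight_files` are hypothetical names introduced for illustration.

```python
from pathlib import Path

# File names copied from the list above.
EXPECTED_FILES = (
    "lora.safetensors",
    "condition_embedder.safetensors",
    "x_embedder.safetensors",
    "time_text_embed.safetensors",
    "pooled_image_projector.safetensors",
    "image_projector.safetensors",
    "config.json",
)


def missing_weight_files(weights_dir):
    """Return the expected file names that are absent from weights_dir.

    Assumption: all files live directly inside weights_dir (no subfolders).
    """
    root = Path(weights_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list means every listed file was found; otherwise the returned names show which downloads to repeat.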
The framework requires the following **external** foundation models:
- [black-forest-labs/FLUX.1-Fill-dev](https://huggingface.co/black-forest-labs/FLUX.1-Fill-dev)
- [google/siglip2-so400m-patch14-384](https://huggingface.co/google/siglip2-so400m-patch14-384)
- [microsoft/TRELLIS-image-large](https://huggingface.co/microsoft/TRELLIS-image-large)
- [briaai/RMBG-2.0](https://huggingface.co/briaai/RMBG-2.0) (for background removal in the demo)
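The external models can be fetched ahead of time with `snapshot_download` from the `huggingface_hub` package. This is only a sketch: the `checkpoints` root and the `local_dir_for` naming scheme are assumptions made for illustration, and the DIRECT code may expect a different layout or may download these models itself. Some of these repositories may also require accepting a license on the Hub and logging in first.

```python
from pathlib import Path

# Repository IDs copied from the list above.
EXTERNAL_REPOS = (
    "black-forest-labs/FLUX.1-Fill-dev",
    "google/siglip2-so400m-patch14-384",
    "microsoft/TRELLIS-image-large",
    "briaai/RMBG-2.0",
)


def local_dir_for(repo_id, root):
    """Map 'org/name' to '<root>/org--name' (hypothetical naming scheme)."""
    return Path(root) / repo_id.replace("/", "--")


def download_external_models(root="checkpoints"):
    """Download every external repo and return {repo_id: local path}."""
    # Imported lazily so the helpers above work without huggingface_hub.
    from huggingface_hub import snapshot_download

    paths = {}
    for repo_id in EXTERNAL_REPOS:
        target = local_dir_for(repo_id, root)
        paths[repo_id] = snapshot_download(repo_id=repo_id, local_dir=target)
    return paths
```

Calling `download_external_models()` needs network access and a large amount of free disk space.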
## Citation
```bibtex
@inproceedings{gong2026direct,
title = {Direct 3D-Aware Object Insertion via Decomposed Visual Proxies},
author = {Jingbo Gong and Yikai Wang and Yushi Lan and Yuhao Wan and Ziheng Ouyang and Rui Zhao and Ming-Ming Cheng and Qibin Hou and Chen Change Loy},
booktitle = {ICML},
year = {2026}
}
```