	---
	base_model:
	- black-forest-labs/FLUX.1-Fill-dev
	- microsoft/TRELLIS-image-large
	pipeline_tag: image-to-image
	tags:
	- object-insertion
	- 3d-aware
	- pose-controllable-generation
	- image-to-image
	---

	# DIRECT: Direct 3D-Aware Object Insertion via Decomposed Visual Proxies

	This repository contains the model weights for DIRECT, presented in the paper [Direct 3D-Aware Object Insertion via Decomposed Visual Proxies](https://huggingface.co/papers/2606.06601).

	Authors: Jingbo Gong, Yikai Wang, Yushi Lan, Yuhao Wan, Ziheng Ouyang, Rui Zhao, Ming-Ming Cheng, Qibin Hou, and Chen Change Loy.

	[Project Page](https://gong1130.github.io/DIRECT/) | [Paper (arXiv)](https://arxiv.org/abs/2606.06601) | [Code](https://github.com/Gong1130/DIRECT)

	## Overview

	DIRECT (Decomposed Injection for Reference Composition and Target-integration) is a framework that enables pose-controllable object insertion. It integrates interactive pose manipulation with high-fidelity 2D image synthesis by decomposing insertion conditions into three visual proxies:
	- Appearance guidance: Captures visual details from the reference object image.
	- Geometry guidance: Derived from a user-adjusted 3D proxy rendered from a reconstructed 3D object.
	- Context guidance: Taken from the target background scene.

	By injecting these through separate pathways, DIRECT preserves reference appearance, follows user-specified poses, and adapts the object naturally to the target scene.

	## Usage

	Please refer to the [official GitHub repository](https://github.com/Gong1130/DIRECT) for installation instructions. You can run the interactive demo with the following command:

	```bash
	python demo/demo.py --gradio_port 7860 --viser_port 8081
	```

	The demo allows you to segment a reference object, reconstruct it in 3D, and interactively manipulate its pose within the background image.
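
The demo command binds two local ports, so a launch can fail if either one is already taken. The following is a minimal sketch of a pre-flight check, not part of the official repository: the helper names `port_is_free`, `build_demo_command`, and `launch_demo` are hypothetical, and the sketch assumes it is run from the root of the cloned GitHub repository, with the demo listening on the local machine. Only the script path and the two flags come from the command above.

```python
import socket
import subprocess
import sys


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is currently accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 when the connection succeeds, i.e. the port is in use.
        return sock.connect_ex((host, port)) != 0


def build_demo_command(gradio_port: int = 7860, viser_port: int = 8081) -> list[str]:
    """Assemble the demo invocation shown above as an argument list."""
    return [
        sys.executable, "demo/demo.py",
        "--gradio_port", str(gradio_port),
        "--viser_port", str(viser_port),
    ]


def launch_demo(gradio_port: int = 7860, viser_port: int = 8081) -> None:
    """Hypothetical wrapper: refuse to start if either port is occupied."""
    for port in (gradio_port, viser_port):
        if not port_is_free(port):
            raise RuntimeError(f"Port {port} is already in use; choose another one.")
    # Assumes the current working directory is the root of the cloned repository.
    subprocess.run(build_demo_command(gradio_port, viser_port), check=True)
```

Calling `launch_demo()` uses the default ports from the command above. Passing other values, for example `launch_demo(7861, 8082)`, is one way to avoid a conflict.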

	## Model Details

	This repository contains DIRECT-specific weights only:
	- `lora.safetensors`
	- `condition_embedder.safetensors`
	- `x_embedder.safetensors`
	- `time_text_embed.safetensors`
	- `pooled_image_projector.safetensors`
	- `image_projector.safetensors`
	- `config.json`
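
After downloading, it can help to confirm that every file in the list above is present before starting the demo. The sketch below is one way to do that with the standard library. The function name `missing_weight_files` and the directory layout (all files in one flat folder) are assumptions, not something the repository specifies.

```python
from pathlib import Path

# File names copied from the list above.
EXPECTED_FILES = (
    "lora.safetensors",
    "condition_embedder.safetensors",
    "x_embedder.safetensors",
    "time_text_embed.safetensors",
    "pooled_image_projector.safetensors",
    "image_projector.safetensors",
    "config.json",
)


def missing_weight_files(checkpoint_dir: str) -> list[str]:
    """Return the expected file names that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    # Assumes a flat layout: every file sits directly inside checkpoint_dir.
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

For example, `missing_weight_files("checkpoints/DIRECT")` returns an empty list when the download is complete. The path here is a placeholder, so substitute the folder where the weights were saved.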

	The framework requires the following external foundation models:
	- [black-forest-labs/FLUX.1-Fill-dev](https://huggingface.co/black-forest-labs/FLUX.1-Fill-dev)
	- [google/siglip2-so400m-patch14-384](https://huggingface.co/google/siglip2-so400m-patch14-384)
	- [microsoft/TRELLIS-image-large](https://huggingface.co/microsoft/TRELLIS-image-large)
	- [briaai/RMBG-2.0](https://huggingface.co/briaai/RMBG-2.0) (for background removal in the demo)

	## Citation

	```bibtex
	@inproceedings{gong2026direct,
	title = {Direct 3D-Aware Object Insertion via Decomposed Visual Proxies},
	author = {Jingbo Gong and Yikai Wang and Yushi Lan and Yuhao Wan and Ziheng Ouyang and Rui Zhao and Ming-Ming Cheng and Qibin Hou and Chen Change Loy},
	booktitle = {ICML},
	year = {2026}
	}
	```