More Than Generation: Unifying Generation and Depth Estimation via Text-to-Image Diffusion Models
Paper: arXiv 2510.23574
How to use hongk1998/merge-large-depth-v1 with Diffusers:
pip install -U diffusers transformers accelerate
import torch
from diffusers import DiffusionPipeline
# Switch device_map to "mps" on Apple silicon devices
pipe = DiffusionPipeline.from_pretrained("hongk1998/merge-large-depth-v1", dtype=torch.bfloat16, device_map="cuda")
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt).images[0]

Hongkai Lin, Dingkang Liang, Mingyang Du, Xin Zhou, Xiang Bai†
Huazhong University of Science & Technology
(†) Corresponding author.
We present MERGE, a simple unified diffusion model for image generation and depth estimation. Its core idea is to leverage streamlined converters together with the rich visual priors stored in generative image models. Our model, built from a frozen generative image model and pluggable converters fine-tuned on synthetic data, exhibits powerful zero-shot depth estimation capability.
Please refer to this page.
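The snippet above hard-codes `device_map="cuda"`, which assumes an NVIDIA GPU. A minimal device-selection sketch that falls back to Apple MPS or CPU when CUDA is unavailable (assumes only that `torch` is installed; the pipeline load itself is omitted since it requires downloading the model weights):

```python
import torch

# Pick the best available accelerator: CUDA, then Apple MPS, then CPU.
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

# The selected string can then be passed as device_map=device (or via
# pipe.to(device)) when loading the pipeline.
print(device)
```

bfloat16 is well supported on recent CUDA GPUs; on MPS or CPU you may prefer `torch.float32` if you hit dtype-related errors.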