Multiview DROID image-edit (FLUX.2-klein-4B)

Full finetune of black-forest-labs/FLUX.2-klein-4B for the multiview DROID image-edit task. Given a wrist-camera reference (cond 1) and a side-camera target (cond 2), the model inserts the prompted object into the side view with plausible location, scale, orientation, and occlusion, using the wrist view as reference.

Use from the Diffusers library

pip install -U diffusers transformers accelerate

import torch
from diffusers import DiffusionPipeline
from diffusers.utils import load_image

# switch to "mps" on Apple devices
pipe = DiffusionPipeline.from_pretrained("tennyyyin/multiview_droid_v0", dtype=torch.bfloat16, device_map="cuda")

prompt = "Turn this cat into a dog"
input_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png")

image = pipe(image=input_image, prompt=prompt).images[0]

Note that this repository ships only the transformer weights (see Files below), so the generic pipeline snippet above may not load as-is; the sketch under Files shows one way to assemble the full pipeline.

Files

  • config.json – Flux2Transformer2DModel architecture config.
  • diffusion_pytorch_model.safetensors – finetuned transformer weights.

Load the transformer directly with Flux2Transformer2DModel.from_pretrained("<this repo or local dir>").
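
A minimal loading sketch, assuming a recent diffusers release with FLUX.2 support; the component-override and image-list call patterns below are standard diffusers usage but are not confirmed by this card:

import torch
from diffusers import DiffusionPipeline, Flux2Transformer2DModel
from diffusers.utils import load_image

# Load the finetuned transformer from this repo (or a local dir).
transformer = Flux2Transformer2DModel.from_pretrained(
    "tennyyyin/multiview_droid_v0", dtype=torch.bfloat16
)

# Assemble the full pipeline around the base model, swapping in the
# finetuned transformer via the usual diffusers component override.
pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-klein-4B",
    transformer=transformer,
    dtype=torch.bfloat16,
).to("cuda")

# Condition images as in the Usage section: wrist-camera reference
# (cond 1) and side-camera target (cond 2). Passing both as a list is
# an assumption about the pipeline's multi-image interface.
wrist = load_image("assets/droid/wrist_banana.jpg")
side = load_image("assets/droid/side1.jpg")

# The repo's inference script applies the training-time prompt template;
# passing the bare object phrase here is a simplification.
image = pipe(image=[wrist, side], prompt="banana").images[0]
image.save("banana_edit.png")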

Usage

See the IROM docs and inference script in the source repo:

# Download
uv run examples/inference/download_multiview_droid.py

# Inference
uv run examples/inference/multiview_droid.py \
    --checkpoint_path checkpoints/multiview_droid_v0 \
    --wrist_cond assets/droid/wrist_banana.jpg \
    --side_cond  assets/droid/side1.jpg \
    --prompt     "banana" \
    --output_dir outputs/multiview_droid/banana

--prompt is the object phrase (e.g. "banana", "blue mug"); it is slotted into the prompt template the model was trained on. Pass --full_prompt to supply the complete prompt verbatim, bypassing the template.
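
For example, to bypass the template (the full prompt string here is hypothetical; the actual training template is not shown on this card):

# hypothetical full prompt; substitute your own phrasing
uv run examples/inference/multiview_droid.py \
    --checkpoint_path checkpoints/multiview_droid_v0 \
    --wrist_cond assets/droid/wrist_banana.jpg \
    --side_cond  assets/droid/side1.jpg \
    --full_prompt "Place the banana from the wrist view onto the table in the side view" \
    --output_dir outputs/multiview_droid/banana_full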
