# PICS: Pairwise Image Compositing with Spatial Interactions

Check out our Project Page for more visual demos!
## ⏩ Updates
**02/08/2026**
- Release training and inference code.
- Release training data.

**03/01/2025**
- Release checkpoints.
## 🚧 TODO List
- Release training and inference code for pairwise image compositing
- Release datasets (LVIS, Objects365, etc.) in WebDataset format
- Release pretrained models
- Release any-object compositing code
## 📦 Installation
### Prerequisites
- OS: Linux (Tested on Ubuntu 20.04/22.04).
- Python: 3.10 or higher.
- Package Manager: Conda is recommended.
### Hardware Requirements
| Stage | GPU (VRAM) | System RAM | Batch Size |
|---|---|---|---|
| Training | NVIDIA H100 (80GB) | 120GB | 16 |
| Inference | NVIDIA RTX A6000 (48GB) | 64GB | 1 |
### Environment setup
Create a new conda environment named `PICS` and install the dependencies:
```bash
conda env create --file=PICS.yml
conda activate PICS
```
### Weights preparation
DINOv2: Download the ViT-g/14 checkpoint and place it at `checkpoints/dinov2_vitg14_pretrain.pth`.
## 🤗 Pretrained Models
We provide the following pretrained models (place them in the same directory as the DINOv2 weights):
| Model | Description | Size | Download |
|---|---|---|---|
| PICS | Full model | 18.45GB | Download |
### Minimal Example for Inference
Here is an example of how to use the pretrained model for pairwise image compositing. To run the two-object compositing mode:
```bash
python run_test.py \
    --input "sample" \
    --output "results/sample" \
    --obj_thr 2
```
## 📊 Dataset
Our training set is a mixture of LVIS, VITON-HD, Objects365, Cityscapes, Mapillary Vistas, and BDD100K. We provide the processed two-object compositing data in WebDataset format (.tar shards) below:
| Dataset | #Samples | Size | Download |
|---|---|---|---|
| LVIS | 34,160 | 7.98GB | Download |
| VITON-HD | 11,647 | 2.53GB | Download |
| Objects365 | 940,764 | 243GB | Download |
| Cityscapes | 536 | 1.21GB | Download |
| Mapillary Vistas | 603 | 582MB | Download |
| BDD100K | 1,012 | 204MB | Download |
### Data organization
```
PICS/
└── data/
    └── train/
        ├── LVIS/
        │   ├── 00000.tar
        │   └── ...
        ├── VITONHD/
        ├── Objects365/
        ├── Cityscapes/
        ├── MapillaryVistas/
        └── BDD100K/
```
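Each `.tar` shard above follows the WebDataset convention: files that belong to one training sample share a basename (the part before the first dot) and differ only in extension. A minimal stdlib sketch for inspecting a shard, useful for sanity-checking a download (the exact per-sample extensions, e.g. `.jpg`/`.json`, are an assumption and may differ in the released shards):

```python
import tarfile
from collections import defaultdict

def list_samples(shard_path):
    """Group tar members by basename, which is the WebDataset sample key.

    Returns {sample_key: {extension: file_size_in_bytes}}.
    """
    samples = defaultdict(dict)
    with tarfile.open(shard_path) as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            # WebDataset splits names at the first dot: "000123.mask.png"
            # belongs to sample "000123" under key "mask.png".
            key, _, ext = member.name.partition(".")
            samples[key][ext] = member.size
    return dict(samples)
```

For example, `list_samples("data/train/LVIS/00000.tar")` should return one dictionary entry per composited sample.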
### Data preparation instructions
We provide a script that uses SAM to extract high-quality object silhouettes for the Objects365 dataset. To process a specific range of data shards, run:
```bash
python scripts/annotate_sam.py --is_train --index_low 00000 --index_high 10000
```
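The `--index_low`/`--index_high` arguments suggest that shards are addressed by zero-padded indices matching the `00000.tar` naming in the data layout. A small sketch of how such a range could map to shard filenames (the naming scheme is our assumption from the directory layout, not the script's documented behavior):

```python
def shard_names(index_low: str, index_high: str):
    """Expand a zero-padded, half-open index range into shard filenames.

    The padding width is taken from index_low (e.g. "00000" -> 5 digits).
    """
    width = len(index_low)
    return [f"{i:0{width}d}.tar" for i in range(int(index_low), int(index_high))]
```

For instance, `shard_names("00000", "00003")` yields `["00000.tar", "00001.tar", "00002.tar"]`.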
To process raw data (e.g., LVIS), run the following command, replacing `/path/to/raw_data` with your actual local data path:
```bash
python -m datasets.lvis \
    --dataset_dir "/path/to/raw_data" \
    --construct_dataset_dir "data/train/LVIS" \
    --area_ratio 0.02 \
    --is_build_data \
    --is_train
```
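The `--area_ratio 0.02` flag presumably drops objects whose mask covers less than 2% of the image, so that tiny objects never enter the compositing set. A minimal sketch of that filter under this assumption (the function name and threshold semantics are ours, not the repository's actual implementation):

```python
def keep_object(mask_area: int, image_width: int, image_height: int,
                area_ratio: float = 0.02) -> bool:
    """Keep an object only if its mask covers at least `area_ratio`
    of the total image area (assumed semantics of --area_ratio)."""
    return mask_area / (image_width * image_height) >= area_ratio
```

For example, a 100x100 image with a 100-pixel mask (1% coverage) would be discarded at the default threshold, while a 2,000-pixel mask (20%) would be kept.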
## Training
To train a model on the whole dataset:
```bash
python run_train.py \
    --root_dir 'LOGS/whole_data' \
    --batch_size 16 \
    --logger_freq 1000 \
    --is_joint
```
## ⚖️ License
This project is licensed under the terms of the MIT license.
## 🙏 Acknowledgements
We would like to thank the contributors to the AnyDoor repository for their open research.
## Contact Us
For any inquiries, feel free to open a GitHub issue or reach out via email.