<h1 align="center">PICS: Pairwise Image Compositing with Spatial Interactions</h1>
<p align="center"><img src="assets/figure.jpg" width="100%"></p>
***Check out our [Project Page](https://ryanhangzhou.github.io/pics/) for more visual demos!***
<!-- Updates -->
## ⏩ Updates
**02/08/2026**
- Release training and inference code.
- Release training data.
**03/01/2025**
- Release checkpoints.
<!-- TODO List -->
## 🔧 TODO List
- [x] Release training and inference code for pairwise image compositing
- [x] Release datasets (LVIS, Objects365, etc. in WebDataset format)
- [x] Release pretrained models
- [ ] Release any-object compositing code
<!-- Installation -->
## 📦 Installation
### Prerequisites
- **OS**: Linux (Tested on Ubuntu 20.04/22.04).
- **Python**: 3.10 or higher.
- **Package Manager**: [Conda](https://docs.anaconda.com/miniconda/install/#quick-command-line-install) is recommended.
**Hardware Requirements**
| Stage | GPU (VRAM) | System RAM | Batch Size |
| --- | --- | --- | --- |
| Training | NVIDIA H100 (80GB) | 120GB | 16 |
| Inference | NVIDIA RTX A6000 (48GB) | 64GB | 1 |
### Environment setup
Create a new conda environment named `PICS` and install the dependencies:
```
conda env create --file=PICS.yml
conda activate PICS
```
### Weights preparation
***DINOv2***: Download [ViT-g/14](https://dl.fbaipublicfiles.com/dinov2/dinov2_vitg14/dinov2_vitg14_pretrain.pth) and place it at `checkpoints/dinov2_vitg14_pretrain.pth`.
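If you prefer to script the download, here is a minimal sketch using only the standard library (the URL is the official one above; `dinov2_target` and `download_dinov2` are hypothetical helper names, not part of this repo):

```python
from pathlib import Path
from urllib.request import urlretrieve

# Official DINOv2 ViT-g/14 weights URL (same as in the README).
DINOV2_URL = "https://dl.fbaipublicfiles.com/dinov2/dinov2_vitg14/dinov2_vitg14_pretrain.pth"

def dinov2_target(checkpoint_dir="checkpoints"):
    """Path where PICS expects the DINOv2 weights."""
    return Path(checkpoint_dir) / DINOV2_URL.rsplit("/", 1)[1]

def download_dinov2(checkpoint_dir="checkpoints"):
    """Download the weights if they are not already present."""
    target = dinov2_target(checkpoint_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        urlretrieve(DINOV2_URL, target)  # ~4.5GB; this takes a while
    return target
```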
<!-- Pretrained Models -->
## 🤗 Pretrained Models
<!-- Coming soon! We are currently finalizing the model weights for public release. -->
We provide the following pretrained models (place them in the same directory as the DINOv2 weights):
| Model | Description | Size | Download |
| --- | --- | --- | --- |
| PICS | Full model | 18.45GB | [Download](https://drive.google.com/file/d/17JpvhRvHFjfqQDiV9RFfgjGa0iLropXK/view?usp=sharing) |
## Minimal Example for Inference
Here is an [example](run_test.py) of how to use the pretrained models for pairwise image compositing.
Run two-object compositing mode:
```
python run_test.py \
--input "sample" \
--output "results/sample" \
--obj_thr 2
```
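To run the script over several input folders, a small wrapper can assemble the command for `subprocess.run`. This is a hypothetical convenience helper (not part of the repo), assuming only the three flags shown above:

```python
def build_cmd(input_dir, output_dir, obj_thr=2):
    """Assemble a run_test.py invocation for one input folder."""
    return [
        "python", "run_test.py",
        "--input", input_dir,
        "--output", output_dir,
        "--obj_thr", str(obj_thr),
    ]

# Usage (uncomment to actually run inference per folder):
# import subprocess
# for name in ["sample"]:
#     subprocess.run(build_cmd(name, f"results/{name}"), check=True)
```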
<!-- Dataset -->
## 📊 Dataset
Our training set is a mixture of [LVIS](https://www.lvisdataset.org/), [VITON-HD](https://www.kaggle.com/datasets/marquis03/high-resolution-viton-zalando-dataset), [Objects365](https://www.objects365.org/overview.html), [Cityscapes](https://www.cityscapes-dataset.com/), [Mapillary Vistas](https://www.mapillary.com/dataset/vistas) and [BDD100K](https://bair.berkeley.edu/blog/2018/05/30/bdd/).
We provide the processed ***two-object compositing data*** in WebDataset format (.tar shards) below:
| Dataset | #Samples | Size | Download |
| --- | --- | --- | --- |
| LVIS | 34,160 | 7.98GB | [Download](https://drive.google.com/drive/folders/1Ir1cwR7K8HALNJiS6kTTlMgKIn8f18XX?usp=sharing) |
| VITON-HD | 11,647 | 2.53GB | [Download](https://drive.google.com/drive/folders/1317fJvvc7J1OTdbiM_Rst0C9AewIcNr2?usp=sharing) |
| Objects365 | 940,764 | 243GB | [Download](https://drive.google.com/drive/folders/1xKLoGv8e5wkGkjdxEGpz5i9TH08vd1AA?usp=sharing) |
| Cityscapes | 536 | 1.21GB | [Download](https://drive.google.com/drive/folders/1HYgEgZcknvEMbK2XZf2isY0pYcluGoKU?usp=sharing) |
| Mapillary Vistas | 603 | 582MB | [Download](https://drive.google.com/drive/folders/1a0756wc2bvvHJ_8a01N0tZ_Kb_BkRZv1?usp=sharing) |
| BDD100K | 1,012 | 204MB | [Download](https://drive.google.com/drive/folders/1zS60KPfZioU4tW1ngDK1KahE7T-TeIim?usp=sharing) |
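WebDataset shards are ordinary tar archives, so you can inspect one with the standard library before wiring up a loader. A minimal sketch (the sample keys and extensions used in the test are hypothetical; actual shard contents may differ):

```python
import tarfile
from collections import defaultdict

def list_samples(shard_path):
    """Group a WebDataset shard's members by sample key.

    WebDataset groups files by basename: everything before the first
    dot is the sample key, the rest is the field extension.
    """
    samples = defaultdict(list)
    with tarfile.open(shard_path) as tar:
        for member in tar.getmembers():
            if member.isfile():
                key, _, ext = member.name.partition(".")
                samples[key].append(ext)
    return dict(samples)
```

For actual training pipelines, the `webdataset` Python package provides streaming decoders over these same tar shards.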
### Data organization
```
PICS/
└── data/
    └── train/
        ├── LVIS/
        │   ├── 00000.tar
        │   └── ...
        ├── VITONHD/
        ├── Objects365/
        ├── Cityscapes/
        ├── MapillaryVistas/
        └── BDD100K/
```
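Once the shards are downloaded, a quick sanity check can count the `.tar` shards per dataset (a hypothetical helper, assuming the layout above):

```python
from pathlib import Path

def count_shards(root="data/train"):
    """Map each dataset directory under root to its number of .tar shards."""
    root = Path(root)
    return {
        d.name: len(list(d.glob("*.tar")))
        for d in sorted(root.iterdir())
        if d.is_dir()
    }
```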
### Data preparation instructions
We provide a script that uses SAM (Segment Anything) to extract high-quality object silhouettes for the Objects365 dataset.
To process a specific range of data shards, run:
```
python scripts/annotate_sam.py --is_train --index_low 00000 --index_high 10000
```
To process raw data (e.g., LVIS), run the following command, replacing `/path/to/raw_data` with your actual local data path:
```
python -m datasets.lvis \
--dataset_dir "/path/to/raw_data" \
--construct_dataset_dir "data/train/LVIS" \
--area_ratio 0.02 \
--is_build_data \
--is_train
```
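Our reading of `--area_ratio 0.02` (an assumption, not confirmed by the script itself) is a minimum object-to-image area filter: objects whose segmentation mask covers less than 2% of the image are discarded. Sketched:

```python
def keep_object(mask_area, image_width, image_height, area_ratio=0.02):
    """Presumed meaning of --area_ratio: keep an object only if its mask
    covers at least `area_ratio` of the full image area."""
    return mask_area / (image_width * image_height) >= area_ratio
```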
## Training
To train a model on the whole dataset:
```
python run_train.py \
--root_dir 'LOGS/whole_data' \
--batch_size 16 \
--logger_freq 1000 \
--is_joint
```
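As a back-of-envelope check, the per-dataset sample counts in the table above sum to 988,722, so one epoch at batch size 16 is roughly 61.8k steps. The drop-last behavior below is an assumption about the data loader, not repo code:

```python
# Per-dataset sample counts, copied from the dataset table above.
SAMPLES = {
    "LVIS": 34_160,
    "VITON-HD": 11_647,
    "Objects365": 940_764,
    "Cityscapes": 536,
    "MapillaryVistas": 603,
    "BDD100K": 1_012,
}

def steps_per_epoch(batch_size=16, drop_last=True):
    """Optimizer steps per epoch over the full training mixture."""
    total = sum(SAMPLES.values())  # 988,722 samples
    if drop_last:
        return total // batch_size
    return -(-total // batch_size)  # ceiling division
```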
<!-- License -->
## ⚖️ License
This project is licensed under the terms of the MIT license.
<!-- Citation -->
<!-- ## 📜 Citation -->
<!-- If you find this work helpful, please consider citing our paper: -->
<!-- ```bibtex
@inproceedings{zhou2025bootplace,
title={BOOTPLACE: Bootstrapped Object Placement with Detection Transformers},
author={Zhou, Hang and Zuo, Xinxin and Ma, Rui and Cheng, Li},
booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
pages={19294--19303},
year={2025}
}
``` -->
## 🙏 Acknowledgements
We would like to thank the contributors to the [AnyDoor](https://huggingface.co/papers/2307.09481) repository for their open research.
## Contact Us
For any inquiries, feel free to open a GitHub issue or reach out via email.