<h1 align="center">PICS: Pairwise Image Compositing with Spatial Interactions</h1>
<p align="center"><img src="assets/figure.jpg" width="100%"></p>

***Check out our [Project Page](https://ryanhangzhou.github.io/pics/) for more visual demos!***

<!-- Updates -->
## ⏩ Updates

**02/08/2026**
- Released training and inference code.
- Released training data.

**03/01/2025**
- Released checkpoints.

<!-- TODO List -->
## 🚧 TODO List
- [x] Release training and inference code for pairwise image compositing
- [x] Release datasets (LVIS, Objects365, etc. in WebDataset format)
- [x] Release pretrained models
- [ ] Release any-object compositing code

<!-- Installation -->
## πŸ“¦ Installation

### Prerequisites
- **OS**: Linux (tested on Ubuntu 20.04/22.04).
- **Python**: 3.10 or higher.
- **Package Manager**: [Conda](https://docs.anaconda.com/miniconda/install/#quick-command-line-install) is recommended.

**Hardware Requirements**
| Stage | GPU (VRAM) | System RAM | Batch Size |
| --- | --- | --- | --- |
| Training | NVIDIA H100 (80GB) | 120GB | 16 |
| Inference | NVIDIA RTX A6000 (48GB) | 64GB | 1 |

### Environment setup
Create a new conda environment named `PICS` and install the dependencies:
```bash
conda env create --file=PICS.yml
conda activate PICS
```

### Weights preparation
***DINOv2***: Download [ViT-g/14](https://dl.fbaipublicfiles.com/dinov2/dinov2_vitg14/dinov2_vitg14_pretrain.pth) and place it at `checkpoints/dinov2_vitg14_pretrain.pth`.
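Before launching training or inference, it can help to verify that the weights are where the code expects them. Below is a minimal sketch under the layout described above; `dinov2_weight_path` and `check_weights` are hypothetical helpers for illustration, not part of the repository:

```python
from pathlib import Path

# Hypothetical helpers (not part of the repo): resolve and verify the expected
# DINOv2 weight location under the repository root.
def dinov2_weight_path(repo_root: str) -> Path:
    return Path(repo_root) / "checkpoints" / "dinov2_vitg14_pretrain.pth"

def check_weights(repo_root: str = ".") -> bool:
    """Return True if the pretrained DINOv2 weights are in place."""
    return dinov2_weight_path(repo_root).is_file()
```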
<!-- Pretrained Models -->
## πŸ€– Pretrained Models
We provide the following pretrained models (place them in the same directory as the DINOv2 weights):

| Model | Description | Size | Download |
| --- | --- | --- | --- |
| PICS | Full model | 18.45GB | [Download](https://drive.google.com/file/d/17JpvhRvHFjfqQDiV9RFfgjGa0iLropXK/view?usp=sharing) |


## Minimal Example for Inference

Here is an [example](run_test.py) of how to use the pretrained models for pairwise image compositing.
Run it in two-object compositing mode:
```bash
python run_test.py \
    --input "sample" \
    --output "results/sample" \
    --obj_thr 2
```


<!-- Dataset -->
## πŸ“š Dataset
Our training set is a mixture of [LVIS](https://www.lvisdataset.org/), [VITON-HD](https://www.kaggle.com/datasets/marquis03/high-resolution-viton-zalando-dataset), [Objects365](https://www.objects365.org/overview.html), [Cityscapes](https://www.cityscapes-dataset.com/), [Mapillary Vistas](https://www.mapillary.com/dataset/vistas) and [BDD100K](https://bair.berkeley.edu/blog/2018/05/30/bdd/).
We provide the processed ***two-object compositing data*** in WebDataset format (.tar shards) below:
| Dataset | #Samples | Size | Download |
| --- | --- | --- | --- |
| LVIS | 34,160 | 7.98GB | [Download](https://drive.google.com/drive/folders/1Ir1cwR7K8HALNJiS6kTTlMgKIn8f18XX?usp=sharing) |
| VITON-HD | 11,647 | 2.53GB | [Download](https://drive.google.com/drive/folders/1317fJvvc7J1OTdbiM_Rst0C9AewIcNr2?usp=sharing) |
| Objects365 | 940,764 | 243GB | [Download](https://drive.google.com/drive/folders/1xKLoGv8e5wkGkjdxEGpz5i9TH08vd1AA?usp=sharing) |
| Cityscapes | 536 | 1.21GB | [Download](https://drive.google.com/drive/folders/1HYgEgZcknvEMbK2XZf2isY0pYcluGoKU?usp=sharing) |
| Mapillary Vistas | 603 | 582MB | [Download](https://drive.google.com/drive/folders/1a0756wc2bvvHJ_8a01N0tZ_Kb_BkRZv1?usp=sharing) |
| BDD100K | 1,012 | 204MB | [Download](https://drive.google.com/drive/folders/1zS60KPfZioU4tW1ngDK1KahE7T-TeIim?usp=sharing) |
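For orientation, WebDataset shards are plain tar archives in which all files of one sample share a filename stem (the sample key). The stdlib sketch below shows how samples are grouped inside a shard; the actual training pipeline presumably uses the `webdataset` library, and the file extensions here are illustrative:

```python
import tarfile
from itertools import groupby

def iter_samples(shard_path):
    """Yield (key, {extension: bytes}) for each sample in a WebDataset .tar shard.

    Files belonging to one sample share the same filename stem, e.g.
    "000123.jpg" and "000123.json" together form one sample.
    """
    with tarfile.open(shard_path) as tar:
        members = sorted((m for m in tar.getmembers() if m.isfile()),
                         key=lambda m: m.name)
        for key, group in groupby(members, key=lambda m: m.name.split(".", 1)[0]):
            yield key, {m.name.split(".", 1)[1]: tar.extractfile(m).read()
                        for m in group}
```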
### Data organization
```
PICS/
β”œβ”€β”€ data/
β”‚   └── train/
β”‚       β”œβ”€β”€ LVIS/
β”‚       β”‚   β”œβ”€β”€ 00000.tar
β”‚       β”‚   └── ...
β”‚       β”œβ”€β”€ VITONHD/
β”‚       β”œβ”€β”€ Objects365/
β”‚       β”œβ”€β”€ Cityscapes/
β”‚       β”œβ”€β”€ MapillaryVistas/
β”‚       └── BDD100K/
```
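The shard files follow a zero-padded five-digit naming convention (`00000.tar`, `00001.tar`, ...), which is also the format the `--index_low`/`--index_high` arguments in the preparation scripts expect. A one-line sketch, assuming five digits as in the tree above:

```python
def shard_name(index: int) -> str:
    # Zero-padded five-digit shard filename, e.g. 0 -> "00000.tar".
    # Width assumed from the directory listing above.
    return f"{index:05d}.tar"
```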
### Data preparation instructions
We provide a script that uses SAM to extract high-quality object silhouettes from the Objects365 dataset.
To process a specific range of data shards, run:
```bash
python scripts/annotate_sam.py --is_train --index_low 00000 --index_high 10000
```
To process raw data (e.g., LVIS), run the following command, replacing `/path/to/raw_data` with your actual local data path:
```bash
python -m datasets.lvis \
    --dataset_dir "/path/to/raw_data" \
    --construct_dataset_dir "data/train/LVIS" \
    --area_ratio 0.02 \
    --is_build_data \
    --is_train
```

## Training

To train a model on the whole dataset:
```bash
python run_train.py \
    --root_dir 'LOGS/whole_data' \
    --batch_size 16 \
    --logger_freq 1000 \
    --is_joint
```


<!-- License -->
## βš–οΈ License

This project is licensed under the terms of the MIT license.


<!-- Citation -->
<!-- ## πŸ“œ Citation -->

<!-- If you find this work helpful, please consider citing our paper: -->

<!-- ```bibtex
@inproceedings{zhou2025bootplace,
  title={BOOTPLACE: Bootstrapped Object Placement with Detection Transformers},
  author={Zhou, Hang and Zuo, Xinxin and Ma, Rui and Cheng, Li},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={19294--19303},
  year={2025}
}
``` -->

## πŸ™Œ Acknowledgements
We thank the contributors to the [AnyDoor](https://huggingface.co/papers/2307.09481) repository for their open research.

## Contact Us
For any inquiries, feel free to open a GitHub issue or reach out via email.