---
license: mit
tags:
- change captioning
- vision-language
- image-to-text
- procedural reasoning
- multimodal
- pytorch
datasets:
- clevr-change
- image-editing-request
- spot-the-diff
metrics:
- bleu
- meteor
- rouge
pipeline_tag: image-to-text
---

# ProCap: Experiment Materials

This repository contains the **official experimental materials** for the paper:

> **Imagine How to Change: Explicit Procedure Modeling for Change Captioning**

It provides **processed datasets**, **pre-trained model weights**, and **evaluation tools** for reproducing the results reported in the paper.

📦 All materials are also available via [Baidu Netdisk](https://pan.baidu.com/s/1t_YXB6J_vkuPxByn2hat2A)
**Extraction Code:** `5h7w`

---

## Contents

- [Data](#data)
- [Model Weights](#model-weights)
- [Evaluation](#evaluation)
- [Usage](#usage)
- [License](#license)

---

## Data

All datasets are preprocessed into **pseudo-sequence format** (`.h5` files) generated by [VFIformer](https://github.com/JIA-Lab-research/VFIformer).

### Included Datasets

- **`CLEVR-data`**
  Processed pseudo-sequences for the **CLEVR-Change** dataset
- **`edit-data`**
  Processed pseudo-sequences for the **Image-Editing-Request** dataset
- **`spot-data`**
  Processed pseudo-sequences for the **Spot-the-Diff** dataset
- **`filter_files`**
  Confidence scores computed using [CLIP4IDC](https://github.com/sushizixin/CLIP4IDC)
- **`filtered-spot-captions`**
  Refined captions for the Spot-the-Diff dataset

---

## Model Weights

This repository provides pre-trained weights for both stages described in the paper.

### Explicit Procedure Modeling (Stage 1)

- `pretrained_vqgan` – VQGAN models for each dataset
- `stage1_clevr_best`
- `stage1_edit_best`
- `stage1_spot_best`

### Implicit Procedure Captioning (Stage 2)

- `clevr_best`
- `edit_best`
- `spot_best`

> **Note:** Stage 1 checkpoints can be directly reused to initialize Stage 2 training.
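As a minimal sketch, reusing a Stage 1 checkpoint for Stage 2 typically amounts to making the weight file visible at the path the training script expects. The directory names below (`checkpoints/`, `experiments/`, `dalle.pt` as the checkpoint filename) are illustrative assumptions, not the repository's guaranteed layout:

```shell
# Illustrative sketch only: all paths and filenames below are assumptions.
mkdir -p checkpoints/stage1_clevr_best experiments/stage2_clevr
touch checkpoints/stage1_clevr_best/dalle.pt   # stand-in for the real Stage 1 weight

# Symlink the Stage 1 checkpoint into the Stage 2 run directory so the
# training script can pick it up via its symlink_path setting.
ln -sf "$PWD/checkpoints/stage1_clevr_best/dalle.pt" \
       experiments/stage2_clevr/dalle.pt
```

Using a symlink rather than a copy keeps a single canonical copy of each checkpoint on disk; adjust the paths to wherever you unpacked the downloaded materials.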
---

## Evaluation

- **`densevid_eval`**
  Evaluation tools used for quantitative assessment

---

## Usage

### 1. Data Preparation

1. Move the caption files in `filtered-spot-captions` to the original caption directory of the **Spot-the-Diff** dataset.
2. Copy the processed data folders to the original dataset root and rename them as follows:

   | Dataset | Folder | Rename To |
   |------|------|------|
   | CLEVR-Change | `CLEVR-data` | `CLEVR_processed` |
   | Image-Editing-Request | `edit-data` | `edit_processed` |
   | Spot-the-Diff | `spot-data` | `spot_processed` |

3. Place `filter_files` in the project root directory.

---

### 2. Model Weights

- Place `pretrained_vqgan` in the project root directory.
- To reuse Stage 1 weights during training, set `symlink_path` in the training scripts as:

  ```bash
  symlink_path="/path/to/stage1/weight/dalle.pt"
  ```

- To evaluate with pre-trained checkpoints, set `resume_path` in the evaluation scripts as:

  ```bash
  resume_path="/path/to/pretrained/model/model.chkpt"
  ```

### 3. Evaluation Tool

Place the `densevid_eval` directory in the project root before evaluation.

---

## Citation

If you find our work or this repository useful, please consider citing our paper:

```bibtex
@inproceedings{sun2026imagine,
  title={Imagine How To Change: Explicit Procedure Modeling for Change Captioning},
  author={Sun, Jiayang and Guo, Zixin and Cao, Min and Zhu, Guibo and Laaksonen, Jorma},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026}
}
```

---

## License

This repository is released under the MIT License.