---
license: mit
tags:
- change captioning
- vision-language
- image-to-text
- procedural reasoning
- multimodal
- pytorch
datasets:
- clevr-change
- image-editing-request
- spot-the-diff
metrics:
- bleu
- meteor
- rouge
pipeline_tag: image-to-text
---

# ProCap: Experiment Materials

This repository contains the **official experimental materials** for the paper:

> **Imagine How to Change: Explicit Procedure Modeling for Change Captioning**

It provides **processed datasets**, **pre-trained model weights**, and **evaluation tools** for reproducing the results reported in the paper.

📦 All materials are also available via [Baidu Netdisk](https://pan.baidu.com/s/1t_YXB6J_vkuPxByn2hat2A)
**Extraction Code:** `5h7w`

---

## Contents

- [Data](#data)
- [Model Weights](#model-weights)
- [Evaluation](#evaluation)
- [Usage](#usage)
- [Citation](#citation)
- [License](#license)

---
|
| | ## Data |
| |
|
| | All datasets are preprocessed into **pseudo-sequence format** (`.h5` files) generated by [VFIformer](https://github.com/JIA-Lab-research/VFIformer). |
| |
|
| | ### Included Datasets |
| |
|
| | - **`CLEVR-data`** |
| | Processed pseudo-sequences for the **CLEVR-Change** dataset |
| |
|
| | - **`edit-data`** |
| | Processed pseudo-sequences for the **Image-Editing-Request** dataset |
| |
|
| | - **`spot-data`** |
| | Processed pseudo-sequences for the **Spot-the-Diff** dataset |
| |
|
| | - **`filter_files`** |
| | Confidence scores computed using [CLIP4IDC](https://github.com/sushizixin/CLIP4IDC) |
| | |
| | - **`filtered-spot-captions`** |
| | Refined captions for the Spot-the-Diff dataset |
| | |
| | --- |

## Model Weights

This repository provides pre-trained weights for both stages described in the paper.

### Explicit Procedure Modeling (Stage 1)

- `pretrained_vqgan` – VQGAN models for each dataset
- `stage1_clevr_best`
- `stage1_edit_best`
- `stage1_spot_best`

### Implicit Procedure Captioning (Stage 2)

- `clevr_best`
- `edit_best`
- `spot_best`

> **Note:** Stage 1 checkpoints can be reused directly to initialize Stage 2 training.

---

## Evaluation

- **`densevid_eval`**
  Evaluation tools used for quantitative assessment

---

## Usage

### 1. Data Preparation

1. Move the caption files in `filtered-spot-captions` into the original caption directory of the **Spot-the-Diff** dataset.
2. Copy the processed data folders into each original dataset root and rename them as follows:

   | Dataset | Folder | Rename To |
   |------|------|------|
   | CLEVR-Change | `CLEVR-data` | `CLEVR_processed` |
   | Image-Editing-Request | `edit-data` | `edit_processed` |
   | Spot-the-Diff | `spot-data` | `spot_processed` |

3. Place `filter_files` in the project root directory.
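
The three steps above can be sketched as a shell script. Note that every path below is a placeholder, not the repository's actual layout: the `datasets/*` roots and the `captions/` subdirectory are assumptions, so substitute your own download locations.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Placeholder layout so the sketch runs end-to-end; in practice these
# directories come from the Baidu Netdisk download and the original datasets.
mkdir -p CLEVR-data edit-data spot-data filtered-spot-captions
CLEVR_ROOT=datasets/clevr-change;         mkdir -p "$CLEVR_ROOT"
EDIT_ROOT=datasets/image-editing-request; mkdir -p "$EDIT_ROOT"
SPOT_ROOT=datasets/spot-the-diff;         mkdir -p "$SPOT_ROOT/captions"

# 1. Move the refined captions into Spot-the-Diff's caption directory
#    (assumed here to be $SPOT_ROOT/captions).
cp -r filtered-spot-captions/. "$SPOT_ROOT/captions/"

# 2. Copy the processed folders into each dataset root under the expected names.
cp -r CLEVR-data "$CLEVR_ROOT/CLEVR_processed"
cp -r edit-data  "$EDIT_ROOT/edit_processed"
cp -r spot-data  "$SPOT_ROOT/spot_processed"

# 3. filter_files stays in the project root, so nothing to move
#    if you extracted the download there.
```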

---

### 2. Model Weights

- Place `pretrained_vqgan` in the project root directory.
- To reuse Stage 1 weights during training, set `symlink_path` in the training scripts:

```bash
symlink_path="/path/to/stage1/weight/dalle.pt"
```

- To evaluate with pre-trained checkpoints, set `resume_path` in the evaluation scripts:

```bash
resume_path="/path/to/pretrained/model/model.chkpt"
```
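
As a quick sanity check before editing these variables into the scripts, you can verify that both checkpoint paths point at real files. This is an optional pre-flight sketch, not part of the repository's tooling; the paths are the same placeholders shown above.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: confirm each checkpoint exists before
# launching training or evaluation. Replace the placeholder paths.
symlink_path="/path/to/stage1/weight/dalle.pt"
resume_path="/path/to/pretrained/model/model.chkpt"

for ckpt in "$symlink_path" "$resume_path"; do
  if [ -f "$ckpt" ]; then
    echo "ok: $ckpt"
  else
    echo "missing: $ckpt"
  fi
done
```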

### 3. Evaluation Tool

Place the `densevid_eval` directory in the project root before running evaluation.

---

## Citation

If you find our work or this repository useful, please consider citing our paper:

```bibtex
@inproceedings{sun2026imagine,
  title={Imagine How To Change: Explicit Procedure Modeling for Change Captioning},
  author={Sun, Jiayang and Guo, Zixin and Cao, Min and Zhu, Guibo and Laaksonen, Jorma},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026}
}
```

---

## License

This repository is released under the MIT License.