---
license: mit
tags:
- change captioning
- vision-language
- image-to-text
- procedural reasoning
- multimodal
- pytorch
datasets:
- clevr-change
- image-editing-request
- spot-the-diff
metrics:
- bleu
- meteor
- rouge
pipeline_tag: image-to-text
---

# ProCap: Experiment Materials

This repository contains the **official experimental materials** for the paper:

> **Imagine How to Change: Explicit Procedure Modeling for Change Captioning**

It provides **processed datasets**, **pre-trained model weights**, and **evaluation tools** for reproducing the results reported in the paper.

📦 All materials are also available via [Baidu Netdisk](https://pan.baidu.com/s/1t_YXB6J_vkuPxByn2hat2A)
**Extraction Code:** `5h7w`

---

## Contents

- [Data](#data)
- [Model Weights](#model-weights)
- [Evaluation](#evaluation)
- [Usage](#usage)
- [Citation](#citation)
- [License](#license)

---

## Data

All datasets are preprocessed into **pseudo-sequence format** (`.h5` files).
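If you want to sanity-check a downloaded file, the standard HDF5 command-line tools can list its internal layout. This is only an inspection sketch: `CLEVR-data/example.h5` is a placeholder name, not a file guaranteed to ship under that exact path.

```bash
# Recursively list the groups and datasets stored in one of the
# processed .h5 files; the path below is a placeholder.
h5ls -r CLEVR-data/example.h5
```
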
### Included Datasets

- **`CLEVR-data`**
  Processed pseudo-sequences for the **CLEVR-Change** dataset

- **`edit-data`**
  Processed pseudo-sequences for the **Image-Editing-Request** dataset

- **`spot-data`**
  Processed pseudo-sequences for the **Spot-the-Diff** dataset

- **`filter_files`**
  Confidence scores computed using [CLIP4IDC](https://github.com/sushizixin/CLIP4IDC)

- **`filtered-spot-captions`**
  Refined captions for the Spot-the-Diff dataset

---

## Model Weights

This repository provides pre-trained weights for both stages described in the paper.

### Explicit Procedure Modeling (Stage 1)

- `pretrained_vqgan` – VQGAN models for each dataset
- `stage1_clevr_best`
- `stage1_edit_best`
- `stage1_spot_best`

### Implicit Procedure Captioning (Stage 2)

- `clevr_best`
- `edit_best`
- `spot_best`

> **Note:** Stage 1 checkpoints can be reused directly to initialize Stage 2 training (see `symlink_path` under [Usage](#usage)).

---

## Evaluation

- **`densevid_eval`**
  Evaluation tools used for quantitative assessment

---

## Usage

### 1. Data Preparation

1. Move the caption files in `filtered-spot-captions` into the original caption directory of the **Spot-the-Diff** dataset.
2. Copy the processed data folders into the corresponding original dataset roots and rename them as follows:

| Dataset | Folder | Rename To |
|------|------|------|
| CLEVR-Change | `CLEVR-data` | `CLEVR_processed` |
| Image-Editing-Request | `edit-data` | `edit_processed` |
| Spot-the-Diff | `spot-data` | `spot_processed` |

3. Place `filter_files` in the project root directory. (A shell sketch of all three steps follows below.)
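Concretely, the three steps above might look like the following commands. This is a sketch under assumptions, not part of the official pipeline: every `/path/to/...` location is a placeholder, and the `captions/` directory name for Spot-the-Diff is an assumption to adapt to your local copy.

```bash
# Placeholder dataset locations -- adjust to your machine.
CLEVR_ROOT=/path/to/clevr-change
EDIT_ROOT=/path/to/image-editing-request
SPOT_ROOT=/path/to/spot-the-diff

# 1. Refined captions replace the original Spot-the-Diff captions
#    ("captions/" is an assumed directory name).
cp filtered-spot-captions/* "$SPOT_ROOT/captions/"

# 2. Processed folders go into each dataset root under the expected names.
cp -r CLEVR-data "$CLEVR_ROOT/CLEVR_processed"
cp -r edit-data "$EDIT_ROOT/edit_processed"
cp -r spot-data "$SPOT_ROOT/spot_processed"

# 3. CLIP4IDC confidence scores stay in the project root.
cp -r filter_files /path/to/project/root/
```
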
---

### 2. Model Weights

- Place `pretrained_vqgan` in the project root directory.
- To reuse Stage 1 weights during training, set `symlink_path` in the training scripts:

```bash
symlink_path="/path/to/stage1/weight/dalle.pt"
```

- To evaluate with pre-trained checkpoints, set `resume_path` in the evaluation scripts:

```bash
resume_path="/path/to/pretrained/model/model.chkpt"
```

---

### 3. Evaluation Tool

Place the `densevid_eval` directory in the project root before evaluation.
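Putting the setup steps together, the project root would contain roughly the following top-level directories; this is a sketch based only on the items named in this README.

```bash
# Quick check that the materials sit in the expected places
# (run from the project root; names are taken from this README):
#
#   filter_files/       CLIP4IDC confidence scores
#   pretrained_vqgan/   Stage 1 VQGAN weights
#   densevid_eval/      evaluation toolkit
#
# The processed data folders (CLEVR_processed, edit_processed,
# spot_processed) live inside their original dataset roots instead.
ls -d filter_files pretrained_vqgan densevid_eval
```
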

---

## Citation

If you find our work or this repository useful, please consider citing our paper:

```bibtex
@inproceedings{sun2026imagine,
  title={Imagine How To Change: Explicit Procedure Modeling for Change Captioning},
  author={Sun, Jiayang and Guo, Zixin and Cao, Min and Zhu, Guibo and Laaksonen, Jorma},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
}
```

---

## License

This repository is released under the MIT License.