---
license: mit
tags:
- change captioning
- vision-language
- image-to-text
- procedural reasoning
- multimodal
- pytorch
datasets:
- clevr-change
- image-editing-request
- spot-the-diff
metrics:
- bleu
- meteor
- rouge
pipeline_tag: image-to-text
---

# ProCap: Experiment Materials

This repository contains the **official experimental materials** for the paper:

> **Imagine How to Change: Explicit Procedure Modeling for Change Captioning**

It provides **processed datasets**, **pre-trained model weights**, and **evaluation tools** for reproducing the results reported in the paper.

📦 All materials are also available via [Baidu Netdisk](https://pan.baidu.com/s/1t_YXB6J_vkuPxByn2hat2A)
**Extraction Code:** `5h7w`

---

## Contents

- [Data](#data)
- [Model Weights](#model-weights)
- [Evaluation](#evaluation)
- [Usage](#usage)
- [License](#license)

---

## Data

All datasets are preprocessed into **pseudo-sequence format** (`.h5` files) generated by [VFIformer](https://github.com/JIA-Lab-research/VFIformer).

### Included Datasets

- **`CLEVR-data`**
  Processed pseudo-sequences for the **CLEVR-Change** dataset
- **`edit-data`**
  Processed pseudo-sequences for the **Image-Editing-Request** dataset
- **`spot-data`**
  Processed pseudo-sequences for the **Spot-the-Diff** dataset
- **`filter_files`**
  Confidence scores computed using [CLIP4IDC](https://github.com/sushizixin/CLIP4IDC)
- **`filtered-spot-captions`**
  Refined captions for the Spot-the-Diff dataset

---

## Model Weights

This repository provides pre-trained weights for both stages described in the paper.

### Explicit Procedure Modeling (Stage 1)

- `pretrained_vqgan` – VQGAN models for each dataset
- `stage1_clevr_best`
- `stage1_edit_best`
- `stage1_spot_best`

### Implicit Procedure Captioning (Stage 2)

- `clevr_best`
- `edit_best`
- `spot_best`

> **Note:** Stage 1 checkpoints can be directly reused to initialize Stage 2 training.
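As a minimal sketch, reusing a Stage 1 checkpoint for Stage 2 typically amounts to making the weight file visible at the path the training script expects. The directory names below (`checkpoints/`, `experiments/`, `dalle.pt` as the checkpoint filename) are illustrative assumptions, not the repository's guaranteed layout:

```shell
# Illustrative sketch only: all paths and filenames below are assumptions.
mkdir -p checkpoints/stage1_clevr_best experiments/stage2_clevr
touch checkpoints/stage1_clevr_best/dalle.pt   # stand-in for the real Stage 1 weight

# Symlink the Stage 1 checkpoint into the Stage 2 run directory so the
# training script can pick it up via its symlink_path setting.
ln -sf "$PWD/checkpoints/stage1_clevr_best/dalle.pt" \
       experiments/stage2_clevr/dalle.pt
```

Using a symlink rather than a copy keeps a single canonical copy of each checkpoint on disk; adjust the paths to wherever you unpacked the downloaded materials.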
---

## Evaluation

- **`densevid_eval`**
  Evaluation tools used for quantitative assessment

---

## Usage

### 1. Data Preparation

1. Move the caption files in `filtered-spot-captions` to the original caption directory of the **Spot-the-Diff** dataset.
2. Copy the processed data folders to the original dataset root and rename them as follows:

   | Dataset | Folder | Rename To |
   |------|------|------|
   | CLEVR-Change | `CLEVR-data` | `CLEVR_processed` |
   | Image-Editing-Request | `edit-data` | `edit_processed` |
   | Spot-the-Diff | `spot-data` | `spot_processed` |

3. Place `filter_files` in the project root directory.

---

### 2. Model Weights

- Place `pretrained_vqgan` in the project root directory.
- To reuse Stage 1 weights during training, set `symlink_path` in the training scripts as:

  ```bash
  symlink_path="/path/to/stage1/weight/dalle.pt"
  ```

- To evaluate with pre-trained checkpoints, set `resume_path` in the evaluation scripts as:

  ```bash
  resume_path="/path/to/pretrained/model/model.chkpt"
  ```

### 3. Evaluation Tool

Place the `densevid_eval` directory in the project root before evaluation.

---

## Citation

If you find our work or this repository useful, please consider citing our paper:

```bibtex
@inproceedings{sun2026imagine,
  title={Imagine How To Change: Explicit Procedure Modeling for Change Captioning},
  author={Sun, Jiayang and Guo, Zixin and Cao, Min and Zhu, Guibo and Laaksonen, Jorma},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026}
}
```

---

## License

This repository is released under the MIT License.