---
license: mit
tags:
- change captioning
- vision-language
- image-to-text
- procedural reasoning
- multimodal
- pytorch
datasets:
- clevr-change
- image-editing-request
- spot-the-diff
metrics:
- bleu
- meteor
- rouge
pipeline_tag: image-to-text
---
# ProCap: Experiment Materials
This repository contains the **official experimental materials** for the paper:
> **Imagine How to Change: Explicit Procedure Modeling for Change Captioning**
It provides **processed datasets**, **pre-trained model weights**, and **evaluation tools** for reproducing the results reported in the paper.
📦 All materials are also available via [Baidu Netdisk](https://pan.baidu.com/s/1t_YXB6J_vkuPxByn2hat2A)
**Extraction Code:** `5h7w`
---
## Contents
- [Data](#data)
- [Model Weights](#model-weights)
- [Evaluation](#evaluation)
- [Usage](#usage)
- [License](#license)
---
## Data
All datasets are preprocessed into **pseudo-sequence format** (`.h5` files), generated with [VFIformer](https://github.com/JIA-Lab-research/VFIformer).
### Included Datasets
- **`CLEVR-data`**
Processed pseudo-sequences for the **CLEVR-Change** dataset
- **`edit-data`**
Processed pseudo-sequences for the **Image-Editing-Request** dataset
- **`spot-data`**
Processed pseudo-sequences for the **Spot-the-Diff** dataset
- **`filter_files`**
Confidence scores computed using [CLIP4IDC](https://github.com/sushizixin/CLIP4IDC)
- **`filtered-spot-captions`**
Refined captions for the Spot-the-Diff dataset
---
## Model Weights
This repository provides pre-trained weights for both training stages described in the paper.
### Explicit Procedure Modeling (Stage 1)
- `pretrained_vqgan` – VQGAN models for each dataset
- `stage1_clevr_best`
- `stage1_edit_best`
- `stage1_spot_best`
### Implicit Procedure Captioning (Stage 2)
- `clevr_best`
- `edit_best`
- `spot_best`
> **Note:** Stage 1 checkpoints can be directly reused to initialize Stage 2 training.
---
## Evaluation
- **`densevid_eval`**
Evaluation tools used for quantitative assessment (BLEU, METEOR, ROUGE)
---
## Usage
### 1. Data Preparation
1. Move the caption files in `filtered-spot-captions` into the original caption directory of the **Spot-the-Diff** dataset.
2. Copy the processed data folders to the original dataset root and rename them as follows:
| Dataset | Folder | Rename To |
|------|------|------|
| CLEVR-Change | `CLEVR-data` | `CLEVR_processed` |
| Image-Editing-Request | `edit-data` | `edit_processed` |
| Spot-the-Diff | `spot-data` | `spot_processed` |
3. Place `filter_files` in the project root directory.
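The three preparation steps above can be sketched as a small shell script. `PROCAP_ROOT`, `DATASET_ROOT`, and the per-dataset subdirectory names are placeholders for illustration only; adjust them to your local layout.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical locations: where this repo was downloaded and where the
# original datasets live. Adjust both to your setup.
PROCAP_ROOT="${PROCAP_ROOT:-./ProCap}"
DATASET_ROOT="${DATASET_ROOT:-./datasets}"

# Copy a folder to a new location (renaming it) only when the source exists.
copy_if_present() {
  if [ -d "$1" ]; then
    mkdir -p "$(dirname "$2")"
    cp -r "$1" "$2"
  fi
}

# 1. Refined captions go into Spot-the-Diff's original caption directory
#    (directory name below is a placeholder).
if [ -d "$PROCAP_ROOT/filtered-spot-captions" ]; then
  mkdir -p "$DATASET_ROOT/spot-the-diff/captions"
  cp -r "$PROCAP_ROOT/filtered-spot-captions/." "$DATASET_ROOT/spot-the-diff/captions/"
fi

# 2. Processed folders are copied to each dataset root and renamed
#    per the table above.
copy_if_present "$PROCAP_ROOT/CLEVR-data" "$DATASET_ROOT/clevr-change/CLEVR_processed"
copy_if_present "$PROCAP_ROOT/edit-data"  "$DATASET_ROOT/image-editing-request/edit_processed"
copy_if_present "$PROCAP_ROOT/spot-data"  "$DATASET_ROOT/spot-the-diff/spot_processed"

# 3. filter_files sits in the project root.
copy_if_present "$PROCAP_ROOT/filter_files" "./filter_files"
```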
---
### 2. Model Weights
- Place `pretrained_vqgan` in the project root directory.
- To reuse Stage 1 weights during training, set `symlink_path` in the training scripts to:
```bash
symlink_path="/path/to/stage1/weight/dalle.pt"
```
- To evaluate with pre-trained checkpoints, set `resume_path` in the evaluation scripts to:
```bash
resume_path="/path/to/pretrained/model/model.chkpt"
```
### 3. Evaluation Tool
Place the `densevid_eval` directory in the project root before evaluation.
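As a final sanity check, the materials placed in the project root by the steps above can be verified with a short script. This is a minimal sketch: the folder list only reflects the items named in this README, not an exhaustive requirement.

```shell
#!/usr/bin/env bash
# Report which of the folders described in this README are present in the
# current directory (assumed to be the project root).
check_layout() {
  local missing=0
  for d in filter_files pretrained_vqgan densevid_eval; do
    if [ -d "$d" ]; then
      echo "ok: $d"
    else
      echo "MISSING: $d"
      missing=1
    fi
  done
  return "$missing"
}

check_layout || echo "some materials are not in place yet"
```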
## Citation
If you find our work or this repository useful, please consider citing our paper:
```bibtex
@inproceedings{sun2026imagine,
title={Imagine How To Change: Explicit Procedure Modeling for Change Captioning},
author={Sun, Jiayang and Guo, Zixin and Cao, Min and Zhu, Guibo and Laaksonen, Jorma},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
}
```
---
## License
This repository is released under the MIT License.