---
license: mit
tags:
- change captioning
- vision-language
- image-to-text
- procedural reasoning
- multimodal
- pytorch
datasets:
- clevr-change
- image-editing-request
- spot-the-diff
metrics:
- bleu
- meteor
- rouge
pipeline_tag: image-to-text
---

# ProCap: Experiment Materials

This repository contains the **official experimental materials** for the paper:

> **Imagine How to Change: Explicit Procedure Modeling for Change Captioning**

It provides **processed datasets**, **pre-trained model weights**, and **evaluation tools** for reproducing the results reported in the paper.

📦 All materials are also available via [Baidu Netdisk](https://pan.baidu.com/s/1t_YXB6J_vkuPxByn2hat2A) 
**Extraction Code:** `5h7w`

---

## Contents

- [Data](#data)
- [Model Weights](#model-weights)
- [Evaluation](#evaluation)
- [Usage](#usage)
- [License](#license)

---

## Data

All datasets are preprocessed into **pseudo-sequence format** (`.h5` files) generated by [VFIformer](https://github.com/JIA-Lab-research/VFIformer).

### Included Datasets

- **`CLEVR-data`**  
  Processed pseudo-sequences for the **CLEVR-Change** dataset

- **`edit-data`**  
  Processed pseudo-sequences for the **Image-Editing-Request** dataset

- **`spot-data`**  
  Processed pseudo-sequences for the **Spot-the-Diff** dataset

- **`filter_files`**  
  Confidence scores computed using [CLIP4IDC](https://github.com/sushizixin/CLIP4IDC)

- **`filtered-spot-captions`**  
  Refined captions for the Spot-the-Diff dataset
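
The `.h5` containers can be inspected with `h5py` to discover their contents. The snippet below is a minimal sketch that writes and reads a toy file in the same container format; the key name `pseudo_sequence` and the `(T, C, H, W)` array shape are illustrative assumptions, not the dataset's actual schema.

```python
import numpy as np
import h5py

# Create a toy .h5 file mimicking a pseudo-sequence container.
# The key "pseudo_sequence" and shape (4, 3, 224, 224) are
# illustrative assumptions, not the released datasets' schema.
with h5py.File("toy_pseudo_seq.h5", "w") as f:
    f.create_dataset(
        "pseudo_sequence",
        data=np.zeros((4, 3, 224, 224), dtype=np.float32),
    )

# Open any released .h5 file the same way to list its real keys.
with h5py.File("toy_pseudo_seq.h5", "r") as f:
    keys = list(f.keys())
    seq = f["pseudo_sequence"][...]

print(keys)       # ['pseudo_sequence']
print(seq.shape)  # (4, 3, 224, 224)
```

Listing `f.keys()` on the actual files is the quickest way to see which arrays each dataset provides before wiring them into a dataloader.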

---

## Model Weights

This repository provides pre-trained weights for both stages described in the paper.

### Explicit Procedure Modeling (Stage 1)

- `pretrained_vqgan` – VQGAN models for each dataset
- `stage1_clevr_best`
- `stage1_edit_best`
- `stage1_spot_best`

### Implicit Procedure Captioning (Stage 2)

- `clevr_best`
- `edit_best`
- `spot_best`

> **Note:** Stage 1 checkpoints can be directly reused to initialize Stage 2 training.

---

## Evaluation

- **`densevid_eval`**  
  Evaluation tools used for quantitative assessment

---

## Usage

### 1. Data Preparation

1. Move caption files in `filtered-spot-captions` to the original caption directory of the **Spot-the-Diff** dataset.
2. Copy the processed data folders to the original dataset root and rename them as follows:

| Dataset | Folder | Rename To |
|------|------|------|
| CLEVR-Change | `CLEVR-data` | `CLEVR_processed` |
| Image-Editing-Request | `edit-data` | `edit_processed` |
| Spot-the-Diff | `spot-data` | `spot_processed` |

3. Place `filter_files` in the project root directory.
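
The copy-and-rename steps above can be sketched with Python's standard library. All paths below are placeholders for your local layout, not paths shipped with this repository:

```python
import shutil
from pathlib import Path

# Placeholder paths -- adjust to your local layout.
DOWNLOAD_ROOT = Path("downloads")  # where the released folders were extracted
DATASET_ROOT = Path("datasets")    # original dataset root
PROJECT_ROOT = Path(".")           # project root

# Folder -> rename-to mapping from the table above.
RENAMES = {
    "CLEVR-data": "CLEVR_processed",
    "edit-data": "edit_processed",
    "spot-data": "spot_processed",
}

# Copy each processed folder into the dataset root under its new name.
for src_name, dst_name in RENAMES.items():
    src = DOWNLOAD_ROOT / src_name
    if src.exists():
        shutil.copytree(src, DATASET_ROOT / dst_name, dirs_exist_ok=True)

# filter_files goes in the project root.
filters = DOWNLOAD_ROOT / "filter_files"
if filters.exists():
    shutil.copytree(filters, PROJECT_ROOT / "filter_files", dirs_exist_ok=True)
```

`dirs_exist_ok=True` (Python 3.8+) makes the script safe to re-run if a destination folder already exists.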

---

### 2. Model Weights

- Place `pretrained_vqgan` in the project root directory.
- To reuse Stage 1 weights during training, set `symlink_path` in training scripts as:

```bash
symlink_path="/path/to/stage1/weight/dalle.pt"
```

- To evaluate with pre-trained checkpoints, set `resume_path` in evaluation scripts as:

```bash
resume_path="/path/to/pretrained/model/model.chkpt"
```

### 3. Evaluation Tool

Place the `densevid_eval` directory in the project root before running evaluation.

---

## Citation

If you find our work or this repository useful, please consider citing our paper: 
```bibtex
@inproceedings{sun2026imagine,
  title={Imagine How To Change: Explicit Procedure Modeling for Change Captioning},
  author={Sun, Jiayang and Guo, Zixin and Cao, Min and Zhu, Guibo and Laaksonen, Jorma},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
}
```

---

## License

This repository is released under the MIT License.