---
license: mit
tags:
- change captioning
- vision-language
- image-to-text
- procedural reasoning
- multimodal
- pytorch
datasets:
- clevr-change
- image-editing-request
- spot-the-diff
metrics:
- bleu
- meteor
- rouge
pipeline_tag: image-to-text
---
# ProCap: Experiment Materials
This repository contains the **official experimental materials** for the paper:
> **Imagine How to Change: Explicit Procedure Modeling for Change Captioning**
It provides **processed datasets**, **pre-trained model weights**, and **evaluation tools** for reproducing the results reported in the paper.
📦 All materials are also available via [Baidu Netdisk](https://pan.baidu.com/s/1t_YXB6J_vkuPxByn2hat2A)
**Extraction Code:** `5h7w`
---
## Contents
- [Data](#data)
- [Model Weights](#model-weights)
- [Evaluation](#evaluation)
- [Usage](#usage)
- [License](#license)
---
## Data
All datasets are preprocessed into **pseudo-sequence format** (`.h5` files) generated by [VFIformer](https://github.com/JIA-Lab-research/VFIformer).
### Included Datasets
- **`CLEVR-data`**
Processed pseudo-sequences for the **CLEVR-Change** dataset
- **`edit-data`**
Processed pseudo-sequences for the **Image-Editing-Request** dataset
- **`spot-data`**
Processed pseudo-sequences for the **Spot-the-Diff** dataset
- **`filter_files`**
Confidence scores computed using [CLIP4IDC](https://github.com/sushizixin/CLIP4IDC)
- **`filtered-spot-captions`**
Refined captions for the Spot-the-Diff dataset
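The processed `.h5` files can be inspected with `h5py` to see which pseudo-sequence arrays they contain. The sketch below is illustrative only (not part of this release): the internal dataset names are not documented here, so the helper simply lists whatever datasets a file holds.

```python
# Minimal sketch: list every dataset (name -> shape) inside one of the
# processed .h5 files. Requires the h5py package; the file path is a
# placeholder -- point it at e.g. CLEVR_processed/<file>.h5.
import h5py


def list_h5_datasets(path: str) -> dict:
    """Return a mapping of dataset name -> array shape for an HDF5 file."""
    out = {}
    with h5py.File(path, "r") as f:
        def visit(name, obj):
            if isinstance(obj, h5py.Dataset):
                out[name] = obj.shape
        f.visititems(visit)
    return out
```

Running this over a pseudo-sequence file shows the stored array layouts before training, which is a quick sanity check after copying the data folders into place.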
---
## Model Weights
This repository provides pre-trained weights for both stages of the method described in the paper.
### Explicit Procedure Modeling (Stage 1)
- `pretrained_vqgan` – VQGAN models for each dataset
- `stage1_clevr_best`
- `stage1_edit_best`
- `stage1_spot_best`
### Implicit Procedure Captioning (Stage 2)
- `clevr_best`
- `edit_best`
- `spot_best`
> **Note:** Stage 1 checkpoints can be directly reused to initialize Stage 2 training.
---
## Evaluation
- **`densevid_eval`**
  Caption evaluation toolkit used for the quantitative results (BLEU, METEOR, ROUGE)
---
## Usage
### 1. Data Preparation
1. Move the caption files from `filtered-spot-captions` into the original caption directory of the **Spot-the-Diff** dataset.
2. Copy the processed data folders to the original dataset root and rename them as follows:
| Dataset | Folder | Rename To |
|------|------|------|
| CLEVR-Change | `CLEVR-data` | `CLEVR_processed` |
| Image-Editing-Request | `edit-data` | `edit_processed` |
| Spot-the-Diff | `spot-data` | `spot_processed` |
3. Place `filter_files` in the project root directory.
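
Step 2 above can be sketched as follows. This is a hedged example, not a script from the repository: `download_dir` (the unpacked release) and `dataset_root` (where a dataset lives) are placeholder paths, and `stage_processed_data` is an illustrative helper name.

```python
# Sketch of step 2: copy each processed data folder into the dataset root
# under its renamed directory name. All paths are placeholders -- adjust
# them to your own layout.
import shutil
from pathlib import Path

# Folder renames taken from the table in the README.
RENAMES = {
    "CLEVR-data": "CLEVR_processed",
    "edit-data": "edit_processed",
    "spot-data": "spot_processed",
}


def stage_processed_data(download_dir: str, dataset_root: str) -> None:
    """Copy every processed folder found in download_dir into dataset_root,
    renamed according to RENAMES. Skips folders that are absent or already staged."""
    for src_name, dst_name in RENAMES.items():
        src = Path(download_dir) / src_name
        dst = Path(dataset_root) / dst_name
        if src.is_dir() and not dst.exists():
            shutil.copytree(src, dst)
```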
---
### 2. Model Weights
- Place `pretrained_vqgan` in the project root directory.
- To reuse Stage 1 weights during training, set `symlink_path` in training scripts as:
```bash
symlink_path="/path/to/stage1/weight/dalle.pt"
```
- To evaluate with pre-trained checkpoints, set `resume_path` in evaluation scripts as:
```bash
resume_path="/path/to/pretrained/model/model.chkpt"
```
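
If it helps to wire these paths up programmatically, the snippet below is a minimal sketch (not from the repository) that symlinks a Stage 1 `dalle.pt` checkpoint into a run directory, matching the `symlink_path` convention above. The function name `link_stage1_weights` and the `run_dir` location are illustrative assumptions.

```python
# Hedged sketch: symlink a Stage 1 checkpoint into a Stage 2 run directory
# so training can pick it up at a fixed local name. The checkpoint filename
# `dalle.pt` matches the README; everything else is a placeholder.
import os
from pathlib import Path


def link_stage1_weights(stage1_ckpt: str, run_dir: str) -> str:
    """Create run_dir/dalle.pt as a symlink to the Stage 1 checkpoint
    and return the link path. Existing links are left untouched."""
    link = Path(run_dir) / "dalle.pt"
    if not link.exists():
        os.symlink(Path(stage1_ckpt).resolve(), link)
    return str(link)
```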
### 3. Evaluation Tool
Place the `densevid_eval` directory in the project root before evaluation.
## Citation
If you find our work or this repository useful, please consider citing our paper:
```bibtex
@inproceedings{
sun2026imagine,
title={Imagine How To Change: Explicit Procedure Modeling for Change Captioning},
author={Sun, Jiayang and Guo, Zixin and Cao, Min and Zhu, Guibo and Laaksonen, Jorma},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
}
```
---
## License
This repository is released under the MIT License.