---
license: mit
datasets:
- hoskerelab/CSeg
language:
- en
tags:
- changedetection
- scd
- cd
---
# ViewDelta: Text-Conditioned Scene Change Detection

ViewDelta is a generalized framework for Scene Change Detection (SCD) that uses natural language prompts to define what changes are relevant. Unlike traditional change detection methods that implicitly learn what constitutes a "relevant" change from dataset labels, ViewDelta allows users to explicitly specify at runtime what types of changes they care about through text prompts.

## Overview

Given two images captured at different times and a text prompt describing the type of change to detect (e.g., "vehicle", "driveway", or "all changes"), ViewDelta produces a binary segmentation mask highlighting the relevant changes. The model is trained jointly on multiple datasets (CSeg, PSCD, SYSU-CD, VL-CMU-CD) and can:

- Detect user-specified changes via natural language prompts
- Handle unaligned image pairs with viewpoint variations
- Generalize across diverse domains (street-view, satellite, indoor/outdoor scenes)
- Detect all changes or specific semantic categories

For more details, see the paper: [ViewDelta: Scaling Scene Change Detection through Text-Conditioning](https://arxiv.org/abs/2412.07612)
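Conceptually, the model is a function from an image pair plus a prompt to a binary mask. A minimal sketch of the expected inputs and outputs (the function name and body here are placeholders for illustration, not the actual network):

```python
import numpy as np

def detect_changes(image_a: np.ndarray, image_b: np.ndarray, prompt: str) -> np.ndarray:
    """Placeholder for the ViewDelta interface: two RGB images and a text
    prompt in, one binary change mask out. The real model embeds the prompt
    and images and decodes a mask; this stub just returns an all-zero mask."""
    height, width = image_a.shape[:2]
    return np.zeros((height, width), dtype=np.uint8)

before = np.zeros((256, 256, 3), dtype=np.uint8)        # image at time t0
after = np.full((256, 256, 3), 255, dtype=np.uint8)     # image at time t1
mask = detect_changes(before, after, "vehicle")         # 1 = relevant change
print(mask.shape)  # (256, 256)
```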

## Installation

### Prerequisites

**Note:** ViewDelta has only been tested on Linux with the following setup:

- Python 3.10
- CUDA 12.1 (for GPU acceleration)
- NVIDIA GPU (tested on RTX 4090, L40S, and A100; other GPUs may work)
- [Pixi package manager](https://pixi.sh/latest/)

### Clone Repository

```bash
git clone https://github.com/drags99/viewdelta-scd.git
cd viewdelta-scd
```

### Install Pixi

First, install the Pixi package manager:

```bash
# On Linux
curl -fsSL https://pixi.sh/install.sh | bash
```

For more installation options, visit: https://pixi.sh/latest/installation/

### Install ViewDelta Dependencies

Once Pixi is installed, install the dependencies from the repository root:

```bash
pixi install
```

This will automatically set up the environment with all required dependencies including PyTorch, transformers, and other libraries.

## Running the Model

### Download ViewDelta Model Weights

```bash
wget https://huggingface.co/hoskerelab/ViewDelta/resolve/main/viewdelta_checkpoint.pth
```

### Basic Usage

The repository includes an [inference.py](inference.py) script for running the model on image pairs. Here's how to use it:

1. **Prepare your images**: Place two images you want to compare in the repository directory.

2. **Download a pre-trained checkpoint**: You'll need a model checkpoint file (e.g., the `viewdelta_checkpoint.pth` downloaded above).

3. **Edit the inference script**: Modify [inference.py](inference.py) to specify your images and text prompt:

```python
image_a_list = ["before_image.jpg"]
image_b_list = ["after_image.jpg"]
text_list = ["vehicle"]  # or "all" for all changes, or specific objects like "building", "tree", etc.

# Path to your checkpoint (e.g., the viewdelta_checkpoint.pth downloaded above)
PATH_TO_CHECKPOINT = "viewdelta_checkpoint.pth"
```

4. **Run inference**:

```bash
pixi run python inference.py
```

### Output

The script writes the following outputs:
- `{image_name}_mask_{text}.png`: The binary segmentation mask
- `{image_name}_image_a_overlay.png`: First image with changes highlighted
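If you want to build your own visualization from the raw binary mask, a blend like the overlay output can be sketched in a few lines of numpy (the red tint and the 50% blend factor are illustrative choices, not necessarily what the script uses):

```python
import numpy as np

def overlay_mask(image: np.ndarray, mask: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Blend a red highlight onto pixels where the binary mask is nonzero."""
    out = image.astype(np.float32)
    red = np.array([255.0, 0.0, 0.0])
    out[mask > 0] = (1 - alpha) * out[mask > 0] + alpha * red
    return out.astype(np.uint8)

image_a = np.full((4, 4, 3), 100, dtype=np.uint8)  # tiny stand-in image
mask = np.zeros((4, 4), dtype=np.uint8)
mask[:2, :2] = 1  # pretend the top-left quadrant changed
overlay = overlay_mask(image_a, mask)
print(overlay[0, 0])  # masked pixel, blended toward red
print(overlay[3, 3])  # unmasked pixel, unchanged
```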

### Text Prompt Examples

ViewDelta supports various types of text prompts:

- **Detect all changes**: `"What are the differences?"`, `"Find any differences"`
- **Specific objects**: `"vehicle"`, `"building"`, `"tree"`, `"person"`
- **Multiple objects**: `"vehicle, sign, barrier"`, `"cars and pedestrians"`
- **Natural language**: `"Has any construction equipment been added?"`, `"What buildings have changed?"`
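Because [inference.py](inference.py) takes parallel lists, one image pair can be queried with several prompts in a single run; an illustrative configuration (filenames are placeholders, and it is assumed the three lists must be index-aligned):

```python
# One image pair queried with three different prompts.
image_a_list = ["street_before.jpg"] * 3
image_b_list = ["street_after.jpg"] * 3
text_list = ["vehicle", "building", "What are the differences?"]

# Each index i pairs image_a_list[i] and image_b_list[i] with text_list[i].
assert len(image_a_list) == len(image_b_list) == len(text_list)
```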

### Model Configuration

The model uses:
- **Text embeddings**: SigLIP (chosen for its vision-language alignment)
- **Image embeddings**: DINOv2 (frozen pretrained features)
- **Architecture**: Vision Transformer (ViT) with 12 layers
- **Input resolution**: Images are automatically resized to 256×256
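Since inference runs at 256×256, a predicted mask may need to be resized back to the source resolution before overlaying it on the original image. A nearest-neighbor upsampling sketch in plain numpy (the actual script may handle this step itself):

```python
import numpy as np

def upsample_mask(mask: np.ndarray, height: int, width: int) -> np.ndarray:
    """Nearest-neighbor resize of a binary mask to (height, width)."""
    rows = np.arange(height) * mask.shape[0] // height
    cols = np.arange(width) * mask.shape[1] // width
    return mask[rows[:, None], cols]

small = np.zeros((256, 256), dtype=np.uint8)
small[:128, :] = 1  # top half flagged as changed
big = upsample_mask(small, 1024, 768)  # back to the source resolution
print(big.shape)  # (1024, 768)
```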

## Citation

```bibtex
@misc{Varghese2024ViewDeltaSS,
  title={ViewDelta: Scaling Scene Change Detection through Text-Conditioning},
  author={Subin Varghese and Joshua Gao and Vedhus Hoskere},
  year={2024},
  eprint={2412.07612},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2412.07612}
}
```