---
license: apache-2.0
pipeline_tag: video-classification
tags:
- concept-bottleneck
- interpretability
- video-understanding
arxiv: 2509.20899
language:
- en
---

# MoTIF — Concepts in Motion

[**Read the Paper (arXiv)**](https://arxiv.org/pdf/2509.20899) | [**GitHub Repository**](https://github.com/patrick-knab/MoTIF)

## Abstract

Concept Bottleneck Models (CBMs) enable interpretable image classification by structuring predictions around human-understandable concepts, but extending this paradigm to video remains challenging due to the difficulty of extracting concepts and modeling them over time. In this paper, we introduce MoTIF (Moving Temporal Interpretable Framework), a transformer-based concept architecture that operates on sequences of temporally grounded concept activations, employing per-concept temporal self-attention to model when individual concepts recur and how their temporal patterns contribute to predictions. Central to the framework is an agentic concept discovery module that automatically extracts object- and action-centric textual concepts from videos, yielding temporally expressive concept sets without manual supervision. Across multiple video benchmarks, this combination substantially narrows the performance gap between interpretable and black-box video models while maintaining faithful and temporally grounded concept explanations.

---

## Key Features

- **Concept Bottlenecks for Video**: map frames/clips to a shared image–text space and obtain concept activations by cosine similarity.
- **Per‑Channel Temporal Self‑Attention**: concept channels stay independent; attention happens over time within each concept.
- **Three Explanation Views**: global concept relevance, local window concepts, and attention‑based temporal maps.
- **Plug‑and‑Play Backbones**: designed to work with CLIP and related vision–language models.
- **Multiple Datasets**: examples provided for UCF‑101, HMDB‑51, Something‑Something v2, and Breakfast Actions.
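
As a minimal illustration of the concept-bottleneck step above, cosine similarity between L2-normalised frame and concept embeddings yields a (time × concepts) activation matrix. The random vectors below are stand-ins for real CLIP image/text embeddings:

```python
import numpy as np

# Random stand-ins for CLIP outputs: per-frame image embeddings and
# text embeddings of concept phrases (illustration only).
rng = np.random.default_rng(0)
T, C, D = 8, 5, 512                    # frames, concepts, embedding dim
frame_emb = rng.normal(size=(T, D))
concept_emb = rng.normal(size=(C, D))

# L2-normalise so the dot product equals cosine similarity
frame_emb /= np.linalg.norm(frame_emb, axis=1, keepdims=True)
concept_emb /= np.linalg.norm(concept_emb, axis=1, keepdims=True)

# (T, C) matrix of temporally grounded concept activations
activations = frame_emb @ concept_emb.T
print(activations.shape)  # (8, 5)
```

Each row of `activations` is one time step's concept vector; MoTIF's temporal module consumes these sequences.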

---

## Getting Started

### 1) Environment

- Python 3.10+ (tested with 3.13.5)
- CUDA‑enabled GPU recommended (checkpoints and scripts assume a GPU environment)

Create and activate an environment, then install requirements:

```bash
pip install -r requirements.txt
```

### 2) Data

Place your datasets under `Datasets/` (see the folder structure below). If you want to generate small demo clips or frames, you can use:

```bash
python save_videos.py
```

### 3) Create Embeddings

Compute (or recompute) the video/frame embeddings used by MoTIF:

```bash
python embedding.py
```
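
Embedding a video typically starts by sampling a fixed number of frames per clip before encoding them with the backbone. The helper below is an illustrative sketch of uniform frame sampling, not the actual logic inside `embedding.py`:

```python
def uniform_frame_indices(num_frames_in_video, num_samples):
    """Evenly spaced frame indices, centred within each segment --
    a common preprocessing step before encoding frames with a
    vision-language backbone. Illustrative only."""
    if num_samples >= num_frames_in_video:
        return list(range(num_frames_in_video))
    step = num_frames_in_video / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]

print(uniform_frame_indices(100, 8))  # [6, 18, 31, 43, 56, 68, 81, 93]
```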


### 4) Train MoTIF

MoTIF’s training entry point is:

```bash
python train_MoTIF.py
```

Adjust hyperparameters in the script or via CLI flags (if exposed).
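
To make the per-concept temporal self-attention idea concrete, here is a minimal NumPy sketch: attention runs over time independently within each concept channel, so no information mixes across concepts. The projection weights are random stand-ins for learned parameters, not MoTIF's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def per_concept_temporal_attention(acts, d_k=16, seed=0):
    """acts: (T, C) concept activations over T time steps.

    For each concept channel c, a (T, T) attention map is computed
    over time from that channel alone; channels never interact."""
    T, C = acts.shape
    rng = np.random.default_rng(seed)
    Wq, Wk, Wv = (rng.normal(size=(1, d_k)) for _ in range(3))
    Wo = rng.normal(size=(d_k, 1))            # project back to one scalar
    out = np.empty_like(acts)
    attn = np.empty((C, T, T))
    for c in range(C):
        x = acts[:, c:c + 1]                  # (T, 1): one concept's series
        q, k, v = x @ Wq, x @ Wk, x @ Wv      # (T, d_k) each
        attn[c] = softmax(q @ k.T / np.sqrt(d_k))  # (T, T) temporal map
        out[:, c:c + 1] = attn[c] @ v @ Wo
    return out, attn

acts = np.random.default_rng(1).normal(size=(8, 5))
out, attn = per_concept_temporal_attention(acts)
print(out.shape, attn.shape)  # (8, 5) (5, 8, 8)
```

The per-concept `attn` maps are exactly the kind of attention-based temporal explanations the notebook visualizes.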

### 5) Explore and Visualize

- Open `MoTIF.ipynb` to visualize concept activations, attention over time, and example predictions.
- Place model checkpoints in `Models/` (see the notebook and code comments for expected paths).

---

## Pretrained Checkpoints

Pre-trained MoTIF checkpoints for all model variants are available on [Hugging Face](https://huggingface.co/P4ddyki/MoTIF/tree/main). The checkpoints include models trained on the Breakfast, HMDB-51, and UCF-101 datasets with the PE-L/14 backbone. Additional checkpoints will be uploaded soon.

To use a pre-trained checkpoint, download it from the Hugging Face repository and place it in the `Models/` directory. The notebook `MoTIF.ipynb` will automatically load the appropriate checkpoint based on the dataset and backbone you specify.

---

## Backbones and Datasets

### Vision–Language Backbones
- CLIP ViT‑B/32 — [Hugging Face: openai/clip‑vit‑base‑patch32](https://huggingface.co/openai/clip-vit-base-patch32)
- CLIP ViT‑B/16 — [Hugging Face: openai/clip‑vit‑base‑patch16](https://huggingface.co/openai/clip-vit-base-patch16)
- CLIP ViT‑L/14 — [Hugging Face: openai/clip‑vit‑large‑patch14](https://huggingface.co/openai/clip-vit-large-patch14)
- (Optional) SigLIP L/14 — [Hugging Face: google/siglip‑so400m‑patch14‑384](https://huggingface.co/google/siglip-so400m-patch14-384)
- Perception Encoder (PE‑L/14) — [Official Repo on GitHub](https://github.com/facebookresearch/perception_models)

### Datasets
- UCF‑101 — [Project page](https://www.crcv.ucf.edu/data/UCF101.php)
- HMDB‑51 — [Project page](https://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/)
- Something‑Something v2 — [20BN dataset page](https://www.qualcomm.com/developer/software/something-something-v-2-dataset)
- Breakfast Actions — [Dataset page](https://serre-lab.clps.brown.edu/resource/breakfast-actions-dataset/)

Please follow each dataset’s license and terms of use.

Note: If you use other datasets, you will need to adapt the dataset logic in the code (e.g., train/val/test splits, preprocessing, and loaders). Relevant places include `utils/core/data/` (e.g., `data.py`, `preprocessor.py`, `dataloader.py`) and any dataset‑specific handling in `embedding.py` and `train_MoTIF.py`.
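
When adapting the split logic to a new dataset, the core task is mapping clips to labels and train/val assignments. The sketch below is illustrative only; `VideoRecord` and `build_splits` are hypothetical names, not the classes in `utils/core/data/` (the real code also handles test splits, decoding, and preprocessing):

```python
from dataclasses import dataclass

@dataclass
class VideoRecord:              # hypothetical record type, for illustration
    path: str
    label: int
    split: str                  # "train" or "val"

def build_splits(clips_by_class, val_every=5):
    """Deterministically send every val_every-th clip of each class
    to validation; a stand-in for dataset-specific split logic."""
    records = []
    for label, (name, clips) in enumerate(sorted(clips_by_class.items())):
        for i, clip in enumerate(sorted(clips)):
            split = "val" if i % val_every == 0 else "train"
            records.append(VideoRecord(f"{name}/{clip}", label, split))
    return records

demo = {"pour_milk": [f"clip{i:02d}.mp4" for i in range(10)],
        "stir": [f"clip{i:02d}.mp4" for i in range(5)]}
records = build_splits(demo)
print(sum(r.split == "val" for r in records))  # 3 of 15 clips go to validation
```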

---

## Folder Structure

- `Datasets/` — dataset placeholders
- `Embeddings/` — generated embeddings (created by scripts)
- `Models/` — trained model checkpoints
- `Videos/` — example videos used in the paper/one‑pager
- `utils/` — library code (vision encoder, projector, dataloaders, transforms, etc.)
- `index.html` — minimal one‑pager describing MoTIF (open locally in a browser)
- `embedding.py`, `save_videos.py`, `train_MoTIF.py` — main scripts
- `MoTIF.ipynb` — notebook for inspection and visualization

---

## Quick Tips

- If you change the dataset or backbone, regenerate embeddings before training.
- The attention visualizations are concept‑wise and time‑wise; they should not mix information across concepts.
- GPU memory usage depends on the number of concepts and the temporal window length.

---

## Citation

If you use MoTIF in your research, please consider citing:

```bibtex
@misc{knab2025conceptsmotiontemporalbottlenecks,
      title={Concepts in Motion: Temporal Bottlenecks for Interpretable Video Classification}, 
      author={Patrick Knab and Sascha Marton and Philipp J. Schubert and Drago Guggiana and Christian Bartelt},
      year={2025},
      eprint={2509.20899},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.20899}, 
}
```

---

## Acknowledgements

- Parts of the `utils/core` codebase are adapted from the Perception Encoder framework.
- Thanks to the CORE research group at TU Clausthal and Ramblr.ai Research for support.

---

## Contact

For questions and discussion, please open an issue or contact the authors.