Add metadata and improve discoverability
#1 opened by nielsr (HF Staff)
README.md
CHANGED
|
@@ -1,10 +1,20 @@
|
| 1 |
# MoTIF — Concepts in Motion
|
| 2 |
|
| 3 |
-
[**Read the Paper (arXiv)**](https://arxiv.org/pdf/2509.20899)
|
| 4 |
|
| 5 |
## Abstract
|
| 6 |
|
| 7 |
-
Conceptual models such as Concept Bottleneck Models (CBMs) have driven substantial progress in improving interpretability for image classification by leveraging human‑interpretable concepts. However, extending these models from static images to sequences of images, such as video data, introduces a significant challenge due to the temporal dependencies inherent in videos, which are essential for capturing actions and events. In this work, we introduce MoTIF (Moving Temporal Interpretable Framework), an architectural design inspired by a transformer that adapts the concept bottleneck framework for video classification and handles sequences of arbitrary length. Within the video domain, concepts refer to semantic entities such as objects, attributes, or higher‑level components (e.g., "bow", "mount", "shoot") that reoccur across time—forming motifs collectively describing and explaining actions. Our design explicitly enables three complementary perspectives: global concept importance across the entire video, local concept relevance within specific windows, and temporal dependencies of a concept over time.
|
| 8 |
|
| 9 |
---
|
| 10 |
|
|
@@ -33,7 +43,7 @@ pip install -r requirements.txt
|
|
| 33 |
|
| 34 |
### 2) Data
|
| 35 |
|
| 36 |
-
Place your datasets under `Datasets/`
|
| 37 |
|
| 38 |
```bash
|
| 39 |
python save_videos.py
|
|
@@ -67,7 +77,7 @@ Adjust hyperparameters in the script or via CLI flags (if exposed).
|
|
| 67 |
|
| 68 |
## Pretrained Checkpoints
|
| 69 |
|
| 70 |
-
Pre-trained MoTIF checkpoints for all model variants are available on [Hugging Face](https://huggingface.co/P4ddyki/MoTIF/tree/main). The checkpoints include models trained on Breakfast, HMDB-51, and UCF-101 datasets with PE-L/14 backbone.
|
| 71 |
|
| 72 |
To use a pre-trained checkpoint, download it from the Hugging Face repository and place it in the `Models/` directory. The notebook `MoTIF.ipynb` will automatically load the appropriate checkpoint based on the dataset and backbone you specify.
|
| 73 |
|
|
@@ -88,31 +98,6 @@ To use a pre-trained checkpoint, download it from the Hugging Face repository an
|
|
| 88 |
- Something‑Something v2 — [20BN dataset page](https://www.qualcomm.com/developer/software/something-something-v-2-dataset)
|
| 89 |
- Breakfast Actions — [Dataset page](https://serre-lab.clps.brown.edu/resource/breakfast-actions-dataset/)
|
| 90 |
|
| 91 |
-
Please follow each dataset’s license and terms of use.
|
| 92 |
-
|
| 93 |
-
Note: If you use other datasets, you will need to adapt the dataset logic in the code (e.g., train/val/test splits, preprocessing, and loaders). Relevant places include `utils/core/data/` (e.g., `data.py`, `preprocessor.py`, `dataloader.py`) and any dataset‑specific handling in `embedding.py` and `train_MoTIF.py`.
|
| 94 |
-
|
| 95 |
-
---
|
| 96 |
-
|
| 97 |
-
## Folder Structure
|
| 98 |
-
|
| 99 |
-
- `Datasets/` — dataset placeholders
|
| 100 |
-
- `Embeddings/` — generated embeddings (created by scripts)
|
| 101 |
-
- `Models/` — trained model checkpoints
|
| 102 |
-
- `Videos/` — example videos used in the paper/one‑pager
|
| 103 |
-
- `utils/` — library code (vision encoder, projector, dataloaders, transforms, etc.)
|
| 104 |
-
- `index.html` — minimal one‑pager describing MoTIF (open locally in a browser)
|
| 105 |
-
- `embedding.py`, `save_videos.py`, `train_MoTIF.py` — main scripts
|
| 106 |
-
- `MoTIF.ipynb` — notebook for inspection and visualization
|
| 107 |
-
|
| 108 |
-
---
|
| 109 |
-
|
| 110 |
-
## Quick Tips
|
| 111 |
-
|
| 112 |
-
- If you change the dataset or backbone, regenerate embeddings before training.
|
| 113 |
-
- The attention visualizations are concept‑wise and time‑wise; they should not mix information across concepts.
|
| 114 |
-
- GPU memory usage depends on the number of concepts and the temporal window length.
|
| 115 |
-
|
| 116 |
---
|
| 117 |
|
| 118 |
## Citation
|
|
@@ -136,10 +121,4 @@ If you use MoTIF in your research, please consider citing:
|
|
| 136 |
## Acknowledgements
|
| 137 |
|
| 138 |
- Parts of the `utils/core` codebase are adapted from the Perception Encoder framework.
|
| 139 |
-
- Thanks to the CORE research group at TU Clausthal and Ramblr.ai Research for support.
|
| 140 |
-
|
| 141 |
-
---
|
| 142 |
-
|
| 143 |
-
## Contact
|
| 144 |
-
|
| 145 |
-
For questions and discussion, please open an issue or contact the authors.
|
|
|
|
| 1 |
+
---
|
| 2 |
+
license: apache-2.0
|
| 3 |
+
pipeline_tag: video-classification
|
| 4 |
+
tags:
|
| 5 |
+
- concept-bottleneck
|
| 6 |
+
- interpretability
|
| 7 |
+
- video-understanding
|
| 8 |
+
arxiv: 2509.20899
|
| 9 |
+
---
|
| 10 |
+
|
| 11 |
# MoTIF — Concepts in Motion
|
| 12 |
|
| 13 |
+
[**Read the Paper (arXiv)**](https://arxiv.org/pdf/2509.20899) | [**GitHub Repository**](https://github.com/patrick-knab/MoTIF)
|
| 14 |
|
| 15 |
## Abstract
|
| 16 |
|
| 17 |
+
Conceptual models such as Concept Bottleneck Models (CBMs) have driven substantial progress in improving interpretability for image classification by leveraging human‑interpretable concepts. However, extending these models from static images to sequences of images, such as video data, introduces a significant challenge due to the temporal dependencies inherent in videos, which are essential for capturing actions and events. In this work, we introduce MoTIF (Moving Temporal Interpretable Framework), an architectural design inspired by a transformer that adapts the concept bottleneck framework for video classification and handles sequences of arbitrary length. Within the video domain, concepts refer to semantic entities such as objects, attributes, or higher‑level components (e.g., "bow", "mount", "shoot") that reoccur across time—forming motifs collectively describing and explaining actions. Our design explicitly enables three complementary perspectives: global concept importance across the entire video, local concept relevance within specific windows, and temporal dependencies of a concept over time.
|
| 18 |
|
| 19 |
---
|
| 20 |
|
|
|
|
| 43 |
|
| 44 |
### 2) Data
|
| 45 |
|
| 46 |
+
Place your datasets under `Datasets/`. If you want to generate small demo clips or frames, you can use:
|
| 47 |
|
| 48 |
```bash
|
| 49 |
python save_videos.py
|
|
|
|
| 77 |
|
| 78 |
## Pretrained Checkpoints
|
| 79 |
|
| 80 |
+
Pre-trained MoTIF checkpoints for all model variants are available on [Hugging Face](https://huggingface.co/P4ddyki/MoTIF/tree/main). The checkpoints include models trained on Breakfast, HMDB-51, and UCF-101 datasets with PE-L/14 backbone.
|
| 81 |
|
| 82 |
To use a pre-trained checkpoint, download it from the Hugging Face repository and place it in the `Models/` directory. The notebook `MoTIF.ipynb` will automatically load the appropriate checkpoint based on the dataset and backbone you specify.
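
The download step described above could be scripted roughly as follows. This is a minimal sketch, not part of the MoTIF codebase: it assumes the `huggingface_hub` package is installed, that the script runs from the repository root, and that mirroring the whole Hub repository into `Models/` gives the layout `MoTIF.ipynb` expects. The helper names `prepare_models_dir` and `download_checkpoints` are hypothetical.

```python
from pathlib import Path

# Repository id taken from the checkpoint link in this section.
REPO_ID = "P4ddyki/MoTIF"
# Directory name the README asks checkpoints to be placed in.
MODELS_DIRNAME = "Models"


def prepare_models_dir(project_root="."):
    """Create (if needed) and return the Models/ directory under project_root."""
    models_dir = Path(project_root) / MODELS_DIRNAME
    models_dir.mkdir(parents=True, exist_ok=True)
    return models_dir


def download_checkpoints(project_root=".", repo_id=REPO_ID):
    """Mirror the Hub repository into Models/ (hypothetical helper).

    Assumption: the files in the Hub repository can be used as-is from
    Models/; adjust the target path if the notebook expects another layout.
    """
    # Imported lazily so prepare_models_dir works without the dependency.
    from huggingface_hub import snapshot_download

    models_dir = prepare_models_dir(project_root)
    # snapshot_download returns the local folder containing the files.
    return snapshot_download(repo_id=repo_id, local_dir=str(models_dir))
```

Calling `download_checkpoints()` fetches every file in the repository; to fetch a single checkpoint, `hf_hub_download` from the same package takes a `filename` argument instead.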
|
| 83 |
|
|
|
|
| 98 |
- Something‑Something v2 — [20BN dataset page](https://www.qualcomm.com/developer/software/something-something-v-2-dataset)
|
| 99 |
- Breakfast Actions — [Dataset page](https://serre-lab.clps.brown.edu/resource/breakfast-actions-dataset/)
|
| 100 |
|
|
| 101 |
---
|
| 102 |
|
| 103 |
## Citation
|
|
|
|
| 121 |
## Acknowledgements
|
| 122 |
|
| 123 |
- Parts of the `utils/core` codebase are adapted from the Perception Encoder framework.
|
| 124 |
+
- Thanks to the CORE research group at TU Clausthal and Ramblr.ai Research for support.
|