---
license: apache-2.0
language:
- en
pipeline_tag: video-classification
tags:
- video-classification
- concept-bottleneck-model
- interpretability
---
# MoTIF — Concepts in Motion

[**Read the Paper (arXiv)**](https://arxiv.org/pdf/2509.20899)

[**GitHub Repo**](https://github.com/patrick-knab/MoTIF)

## Abstract

Concept Bottleneck Models (CBMs) enable interpretable image classification by structuring predictions around human-understandable concepts, but extending this paradigm to video remains challenging due to the difficulty of extracting concepts and modeling them over time. In this paper, we introduce MoTIF (Moving Temporal Interpretable Framework), a transformer-based concept architecture that operates on sequences of temporally grounded concept activations, employing per-concept temporal self-attention to model when individual concepts recur and how their temporal patterns contribute to predictions. Central to the framework is an agentic concept discovery module that automatically extracts object- and action-centric textual concepts from videos, yielding temporally expressive concept sets without manual supervision. Across multiple video benchmarks, this combination substantially narrows the performance gap between interpretable and black-box video models while maintaining faithful and temporally grounded concept explanations.
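
The core mechanism described above, per-concept temporal self-attention over temporally grounded concept activations, can be sketched in a few lines of PyTorch. This is a minimal illustration of the idea, not the released implementation: the class name, dimensions, mean-pooling, and linear head are all assumptions made here for demonstration; see the GitHub repo for the actual code.

```python
import torch
import torch.nn as nn

class PerConceptTemporalAttention(nn.Module):
    """Toy sketch (not the official MoTIF code): run temporal self-attention
    independently over each concept's activation track, pool over time,
    then classify."""

    def __init__(self, num_concepts: int, num_classes: int,
                 d_model: int = 32, n_heads: int = 4):
        super().__init__()
        # Embed each scalar concept activation into a d_model-dim token.
        self.proj = nn.Linear(1, d_model)
        self.encoder = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        # A linear head over pooled per-concept summaries keeps the
        # concept-to-class contributions inspectable.
        self.head = nn.Linear(num_concepts * d_model, num_classes)

    def forward(self, acts: torch.Tensor) -> torch.Tensor:
        # acts: (batch, time, num_concepts) temporally grounded activations
        B, T, C = acts.shape
        x = acts.permute(0, 2, 1).reshape(B * C, T, 1)  # one sequence per concept
        x = self.encoder(self.proj(x))       # temporal self-attention per concept
        x = x.mean(dim=1).reshape(B, C, -1)  # pool each concept over time
        return self.head(x.flatten(1))

model = PerConceptTemporalAttention(num_concepts=50, num_classes=10)
logits = model(torch.randn(2, 16, 50))  # 2 videos, 16 timesteps, 50 concepts
print(logits.shape)                     # torch.Size([2, 10])
```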

---