Improve model card: Update license, image paths, and HF link

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +22 -14
README.md CHANGED
@@ -1,31 +1,32 @@
  ---
- license: cc-by-4.0
  language:
  - en
  metrics:
  - roc_auc
  - matthews_correlation
  pipeline_tag: video-classification
  ---
  # Simplifying Traffic Anomaly Detection with Video Foundation Models

  Svetlana Orlova, Tommie Kerssies, Brunó B. Englert, Gijs Dubbelman \
  Eindhoven University of Technology

  [![arXiv](https://img.shields.io/badge/cs.CV-2507.09338-b31b1b?logo=arxiv&logoColor=red)](https://arxiv.org/abs/2507.09338)
- [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Models-SimpleTAD-blue)](https://huggingface.co/tue-mps/simple-tad/tree/main/models)
  [![Code](https://img.shields.io/badge/Code-simple--tad-black?logo=github)](https://github.com/tue-mps/simple-tad)

  <table>
  <tr>
  <td>
  <a href="https://youtu.be/hY2hUlTNhCU" target="_blank">
- <img src="illustrations/videos/PYL3JcSsS6o_004036_small-dapt.gif" width="100%">
  </a>
  </td>
  <td>
  <a href="https://youtu.be/tKe2nTIHf9k" target="_blank">
- <img src="illustrations/videos/Sihe6aeyLHg_000602_small-dapt.gif" width="100%">
  </a>
  </td>
  </tr>
@@ -33,11 +34,11 @@ Eindhoven University of Technology

  Recent methods for ego-centric Traffic Anomaly Detection (TAD) often rely on complex multi-stage or multi-representation fusion architectures, yet it remains unclear whether such complexity is necessary. Recent findings in visual perception suggest that foundation models, enabled by advanced pre-training, allow simple yet flexible architectures to outperform specialized designs. Therefore, in this work, we investigate an architecturally simple encoder-only approach using plain Video Vision Transformers (Video ViTs) and study how pre-training enables strong TAD performance. We find that: (i) advanced pre-training enables simple encoder-only models to match or even surpass the performance of specialized state-of-the-art TAD methods, while also being significantly more efficient; (ii) although weakly- and fully-supervised pre-training are advantageous on standard benchmarks, we find them less effective for TAD. Instead, self-supervised Masked Video Modeling (MVM) provides the strongest signal; and (iii) Domain-Adaptive Pre-Training (DAPT) on unlabeled driving videos further improves downstream performance, without requiring anomalous examples. Our findings highlight the importance of pre-training and show that effective, efficient, and scalable TAD models can be built with minimal architectural complexity.

- ![Simple_Main](illustrations/main.png)

  ### ✨ DoTA and DADA-2000 results

- ![Simple_Results](illustrations/results.png)

  Video ViT-based encoder-only models set a new state of the art
  on both datasets, while being significantly more efficient than top-performing specialized methods. FPS measured using NVIDIA A100
@@ -111,6 +112,14 @@ We used fragments of original implementations of
  [InternVideo2](https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2/single_modality),
  and [UMT](https://github.com/OpenGVLab/unmasked_teacher/tree/main/single_modality) to integrate these models with our codebase.

  ## ✏️ Citation

  If you think this project is helpful, please feel free to like us ❤️ and cite our paper:
@@ -137,12 +146,12 @@ If you think this project is helpful, please feel free to like us ❤️ and cit
  <tr>
  <td>
  <a href="https://youtu.be/vPBKj9SF9yg" target="_blank">
- <img src="illustrations/videos/0RJPQ_97dcs_004503_large-dapt.gif" width="100%">
  </a>
  </td>
  <td>
  <a href="https://youtu.be/vGaYPZEuv5k" target="_blank">
- <img src="illustrations/videos/y4Evv5By6sg_004171_large-dapt.gif" width="100%">
  </a>
  </td>
  </tr>
@@ -152,12 +161,12 @@ If you think this project is helpful, please feel free to like us ❤️ and cit
  <tr>
  <td>
  <a href="https://youtu.be/7rH0QP18zsk" target="_blank">
- <img src="illustrations/videos/PEwiwzyTjX0_000589large-dapt.gif" width="100%">
  </a>
  </td>
  <td>
  <a href="https://youtu.be/5ZNYwDGmOZI" target="_blank">
- <img src="illustrations/videos/RASKiMoxhOE_000246_large-dapt.gif" width="100%">
  </a>
  </td>
  </tr>
@@ -167,14 +176,13 @@ If you think this project is helpful, please feel free to like us ❤️ and cit
  <tr>
  <td>
  <a href="https://youtu.be/X7Ij1sc4yCE" target="_blank">
- <img src="illustrations/videos/T7TkJVmGyts_001011_large-dapt.gif" width="100%">
  </a>
  </td>
  <td>
  <a href="https://youtu.be/S5m2ooY6CGc" target="_blank">
- <img src="illustrations/videos/TNZv-NBcV5U_000066large-dapt.gif" width="100%">
  </a>
  </td>
  </tr>
- </table>
-
 
  ---
  language:
  - en
+ license: cc-by-nc-4.0
  metrics:
  - roc_auc
  - matthews_correlation
  pipeline_tag: video-classification
  ---
+
  # Simplifying Traffic Anomaly Detection with Video Foundation Models

  Svetlana Orlova, Tommie Kerssies, Brunó B. Englert, Gijs Dubbelman \
  Eindhoven University of Technology

  [![arXiv](https://img.shields.io/badge/cs.CV-2507.09338-b31b1b?logo=arxiv&logoColor=red)](https://arxiv.org/abs/2507.09338)
+ [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Models-SimpleTAD-blue)](https://huggingface.co/tue-mps/simple-tad)
  [![Code](https://img.shields.io/badge/Code-simple--tad-black?logo=github)](https://github.com/tue-mps/simple-tad)

  <table>
  <tr>
  <td>
  <a href="https://youtu.be/hY2hUlTNhCU" target="_blank">
+ <img src="figs/videos/PYL3JcSsS6o_004036_small-dapt.gif" width="100%">
  </a>
  </td>
  <td>
  <a href="https://youtu.be/tKe2nTIHf9k" target="_blank">
+ <img src="figs/videos/Sihe6aeyLHg_000602_small-dapt.gif" width="100%">
  </a>
  </td>
  </tr>
 

  Recent methods for ego-centric Traffic Anomaly Detection (TAD) often rely on complex multi-stage or multi-representation fusion architectures, yet it remains unclear whether such complexity is necessary. Recent findings in visual perception suggest that foundation models, enabled by advanced pre-training, allow simple yet flexible architectures to outperform specialized designs. Therefore, in this work, we investigate an architecturally simple encoder-only approach using plain Video Vision Transformers (Video ViTs) and study how pre-training enables strong TAD performance. We find that: (i) advanced pre-training enables simple encoder-only models to match or even surpass the performance of specialized state-of-the-art TAD methods, while also being significantly more efficient; (ii) although weakly- and fully-supervised pre-training are advantageous on standard benchmarks, we find them less effective for TAD. Instead, self-supervised Masked Video Modeling (MVM) provides the strongest signal; and (iii) Domain-Adaptive Pre-Training (DAPT) on unlabeled driving videos further improves downstream performance, without requiring anomalous examples. Our findings highlight the importance of pre-training and show that effective, efficient, and scalable TAD models can be built with minimal architectural complexity.

+ ![Simple_Main](figs/main.png)

  ### ✨ DoTA and DADA-2000 results

+ ![Simple_Results](figs/results.png)

  Video ViT-based encoder-only models set a new state of the art
  on both datasets, while being significantly more efficient than top-performing specialized methods. FPS measured using NVIDIA A100
 
  [InternVideo2](https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2/single_modality),
  and [UMT](https://github.com/OpenGVLab/unmasked_teacher/tree/main/single_modality) to integrate these models with our codebase.

+ ## 🔒 License
+
+ The majority of this project is released under the CC-BY-NC 4.0 license as found in the [LICENSE](https://github.com/MCG-NJU/VideoMAE/blob/main/LICENSE) file.
+ Portions of the project are available under separate license terms:
+ [ViViT](https://github.com/google-research/scenic/blob/main/scenic/projects/vivit/README.md), [InternVideo2](https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2/single_modality), [SlowFast](https://github.com/facebookresearch/SlowFast) and [pytorch-image-models](https://github.com/rwightman/pytorch-image-models) are licensed under the Apache 2.0 license.
+ [VideoMAE2](https://github.com/OpenGVLab/VideoMAEv2), [SMILE](https://github.com/fmthoker/SMILE), [MGMAE](https://github.com/MCG-NJU/MGMAE), [UMT](https://github.com/OpenGVLab/unmasked_teacher/tree/main/single_modality), and [BEiT](https://github.com/microsoft/unilm/tree/master/beit) are licensed under the MIT license.
+ [SIGMA](https://github.com/QUVA-Lab/SIGMA/) is licensed under the BSD 3-Clause Clear license
+
  ## ✏️ Citation

  If you think this project is helpful, please feel free to like us ❤️ and cite our paper:
 
  <tr>
  <td>
  <a href="https://youtu.be/vPBKj9SF9yg" target="_blank">
+ <img src="figs/videos/0RJPQ_97dcs_004503_large-dapt.gif" width="100%">
  </a>
  </td>
  <td>
  <a href="https://youtu.be/vGaYPZEuv5k" target="_blank">
+ <img src="figs/videos/y4Evv5By6sg_004171_large-dapt.gif" width="100%">
  </a>
  </td>
  </tr>
 
  <tr>
  <td>
  <a href="https://youtu.be/7rH0QP18zsk" target="_blank">
+ <img src="figs/videos/PEwiwzyTjX0_000589large-dapt.gif" width="100%">
  </a>
  </td>
  <td>
  <a href="https://youtu.be/5ZNYwDGmOZI" target="_blank">
+ <img src="figs/videos/RASKiMoxhOE_000246_large-dapt.gif" width="100%">
  </a>
  </td>
  </tr>
 
  <tr>
  <td>
  <a href="https://youtu.be/X7Ij1sc4yCE" target="_blank">
+ <img src="figs/videos/T7TkJVmGyts_001011_large-dapt.gif" width="100%">
  </a>
  </td>
  <td>
  <a href="https://youtu.be/S5m2ooY6CGc" target="_blank">
+ <img src="figs/videos/TNZv-NBcV5U_000066large-dapt.gif" width="100%">
  </a>
  </td>
  </tr>
+ </table>