
	---
	license: cc-by-nc-4.0
	base_model:
	- openai/clip-vit-base-patch16
	pipeline_tag: video-text-to-text
	tags:
	- clip
	- tc-clip
	- action-recognition
	- video-understanding
	---

	<h3 align="center"><a href="https://arxiv.org/abs/2404.09490">[ECCV 2024] Leveraging Temporal Contextualization for Video Action Recognition</a></h3>


	<div align="center">
	<img width="1000" alt="teaser" src="https://cdn-uploads.huggingface.co/production/uploads/66e345c9596fcff3e4b22e5a/O4kbJ5psnH78cTYMPQ3xT.png">
	</div>

	<h5 align="center"> Official model checkpoints for TC-CLIP </h5>
	<h5 align="center"> If you like our project, please give us a star ⭐ on <a href="https://github.com/naver-ai/tc-clip">GitHub</a> for the latest updates. </h5>


	## Introduction

	We present Temporally Contextualized CLIP (TC-CLIP), a novel video understanding framework that leverages holistic video information within its encoding process.
	1. Temporal Contextualization (TC): Unlike prior approaches that access only a limited number of tokens, TC allows global interactions by summarizing informative tokens from the entire video into _context tokens_ and leveraging them during the feature encoding process.
	2. Video-conditional Prompting (VP): Based on the summarized context tokens from the visual domain, VP generates instance-level textual prompts that compensate for the lack of textual semantics in action recognition datasets.
	3. Solid performance: TC-CLIP achieves state-of-the-art performance across zero-shot, few-shot, base-to-novel, and fully-supervised settings on five video action recognition benchmarks.

	This repository contains all model checkpoints used in our experiments.
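
Since each experiment's checkpoints live in their own folder of this repository, it can be convenient to fetch a single folder rather than everything. The snippet below is a minimal sketch, assuming the `huggingface_hub` package is installed; the helper names `folder_pattern` and `download_checkpoints` and the `checkpoints` output directory are illustrative choices, not part of the official TC-CLIP code. The repository ID and the folder name come from the checkpoint links in the tables of this README.

```python
from pathlib import Path

# Repository that the "Ckpt" links in this README point to.
REPO_ID = "byminji/TC-CLIP"


def folder_pattern(folder: str) -> str:
    """Build a glob pattern that matches every file under one checkpoint folder."""
    return f"{folder.strip('/')}/*"


def download_checkpoints(folder: str, local_dir: str = "checkpoints") -> Path:
    """Download a single checkpoint folder (hypothetical helper, not from the TC-CLIP repo)."""
    # Imported lazily so folder_pattern() can be used without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    root = snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=[folder_pattern(folder)],  # skip all other folders
        local_dir=local_dir,
    )
    return Path(root) / folder


# Example (needs network access):
# download_checkpoints("zero_shot_k400")
```

Restricting the download with `allow_patterns` avoids pulling checkpoints for settings you do not plan to evaluate.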

	## Models
	We use CLIP ViT-B/16 for all experiments below.
	* (LLM) denotes that the models are using LLM-rephrased category names from [FROSTER](https://github.com/Visual-AI/FROSTER). Note that experiments on the SSv2 dataset do not involve LLM-rephrasing.
	* (P) denotes that the models are first pretrained on Kinetics-400 and subsequently fine-tuned on each dataset. Otherwise, models are directly fine-tuned from CLIP. See Appendix A in the paper.

	#### Zero-shot action recognition

	| Scripts | HMDB-51 | UCF-101 | Kinetics-600 | Ckpt |
	|---|:---:|:---:|:---:|:---:|
	| [TC-CLIP](https://github.com/naver-ai/tc-clip/blob/main/scripts/train/zero_shot/train_tc_clip_zero_shot.sh) | 54.2 ± 0.7 | 82.9 ± 0.6 | 75.8 ± 0.5 | [Link](https://huggingface.co/byminji/TC-CLIP/tree/main/zero_shot_k400) |
	| [TC-CLIP (LLM)](https://github.com/naver-ai/tc-clip/blob/main/scripts/train/zero_shot/train_tc_clip_zero_shot_llm.sh) | 56.0 ± 0.3 | 85.4 ± 0.8 | 78.1 ± 1.0 | [Link](https://huggingface.co/byminji/TC-CLIP/tree/main/zero_shot_k400_llm) |


	#### Few-shot action recognition

	| Scripts | HMDB-51 | UCF-101 | SSv2 | Ckpt |
	|---|:---:|:---:|:---:|:---:|
	| | K=2 / K=4 / K=8 / K=16 | K=2 / K=4 / K=8 / K=16 | K=2 / K=4 / K=8 / K=16 | |
	| [TC-CLIP](https://github.com/naver-ai/tc-clip/blob/main/scripts/train/few_shot/train_tc_clip_few_shot.sh) | 57.3 / 62.3 / 67.3 / 68.6 | 85.9 / 89.9 / 92.5 / 94.6 | 7.3 / 8.6 / 9.3 / 14.0 | [Link](https://huggingface.co/byminji/TC-CLIP/tree/main/few_shot) |
	| [TC-CLIP (LLM)](https://github.com/naver-ai/tc-clip/blob/main/scripts/train/few_shot/train_tc_clip_few_shot_llm.sh) | 58.6 / 63.3 / 65.5 / 68.8 | 86.8 / 90.1 / 92.0 / 94.3 | 7.3 / 8.6 / 9.3 / 14.0 | [Link](https://huggingface.co/byminji/TC-CLIP/tree/main/few_shot_llm) |
	| [TC-CLIP (P)](https://github.com/naver-ai/tc-clip/blob/main/scripts/train/few_shot/train_tc_clip_few_shot_pretrained.sh) | 65.3 / 68.5 / 71.4 / 73.0 | 94.1 / 95.6 / 96.6 / 97.3 | 8.7 / 10.1 / 12.1 / 15.2 | [Link](https://huggingface.co/byminji/TC-CLIP/tree/main/few_shot_pretrained) |

	#### Base-to-novel generalization

	| Scripts | K-400 | HMDB-51 | UCF-101 | SSv2 | Ckpt |
	|---|:---:|:---:|:---:|:---:|:---:|
	| | Base / Novel / HM | Base / Novel / HM | Base / Novel / HM | Base / Novel / HM | |
	| [TC-CLIP](https://github.com/naver-ai/tc-clip/blob/main/scripts/train/base2novel/train_tc_clip_base2novel.sh) | 78.9 / 63.6 / 70.4 | 73.3 / 54.1 / 62.2 | 95.5 / 78.0 / 85.9 | 17.5 / 13.4 / 15.2 | [Link](https://huggingface.co/byminji/TC-CLIP/tree/main/base2novel) |
	| [TC-CLIP (LLM)](https://github.com/naver-ai/tc-clip/blob/main/scripts/train/base2novel/train_tc_clip_base2novel_llm.sh) | 79.1 / 65.4 / 71.6 | 73.3 / 59.1 / 65.5 | 95.4 / 81.6 / 88.0 | 17.5 / 13.4 / 15.2 | [Link](https://huggingface.co/byminji/TC-CLIP/tree/main/base2novel_llm) |
	| [TC-CLIP (P)](https://github.com/naver-ai/tc-clip/blob/main/scripts/train/base2novel/train_tc_clip_base2novel_pretrained.sh) | N/A | 79.4 / 58.3 / 67.2 | 97.5 / 84.5 / 90.5 | 19.6 / 15.6 / 17.4 | [Link](https://huggingface.co/byminji/TC-CLIP/tree/main/base2novel_pretrained) |
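
The HM column above is not defined in this README; the sketch below assumes it is the harmonic mean of the base and novel accuracies, as is common in base-to-novel evaluation. The function name `harmonic_mean` is illustrative. Because the table entries are already rounded, recomputing HM from them may differ from the reported value in the last digit.

```python
def harmonic_mean(base: float, novel: float) -> float:
    """Harmonic mean of base and novel accuracy (assumed meaning of the HM column)."""
    if base + novel == 0:
        return 0.0  # avoid division by zero when both accuracies are zero
    return 2 * base * novel / (base + novel)


# TC-CLIP on K-400 from the table above: Base 78.9, Novel 63.6
print(round(harmonic_mean(78.9, 63.6), 1))  # 70.4
```

The harmonic mean stays low unless both accuracies are high, so it penalizes models that trade novel-class accuracy for base-class accuracy.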


	#### Fully-supervised action recognition

	| Scripts | K-400 (Top-1) | K-400 (Top-5) | Ckpt |
	|---|:---:|:---:|:---:|
	| [TC-CLIP](https://github.com/naver-ai/tc-clip/blob/main/scripts/train/fully_supervised/train_tc_clip_fully_supervised.sh) | 85.2 | 96.9 | [Link](https://huggingface.co/byminji/TC-CLIP/tree/main/fully_supervised_k400) |


	## Citation
	If you find TC-CLIP useful in your research, please consider citing our paper:
	```
	@article{kim2024tcclip,
	title={Leveraging Temporal Contextualization for Video Action Recognition},
	author={Kim, Minji and Han, Dongyoon and Kim, Taekyung and Han, Bohyung},
	journal={European Conference on Computer Vision (ECCV)},
	year={2024}
	}
	```