---
license: mit
tags:
- video-classification
- timesformer
- retnet
- action-recognition
- ucf101
- hmdb51
- transformers
- efficient-models
datasets:
- ucf101
- hmdb51
---

# TimeSformer + RetNet Hybrid for Efficient Video Action Recognition

This project presents a **hybrid architecture** that replaces the temporal attention mechanism in TimeSformer with **RetNet**, achieving:

- Faster training
- Lower memory usage
- Comparable or improved accuracy

---

## Model Variants

We trained and evaluated **4 configurations**:

| Model | Dataset |
|-------|---------|
| TimeSformer (Baseline) | UCF101 |
| TimeSformer (Baseline) | HMDB51 |
| **TimeSformer + RetNet (Hybrid)** | UCF101 |
| **TimeSformer + RetNet (Hybrid)** | HMDB51 |

---

## Proposed Architecture

### Baseline
- **TimeSformer**
- Full spatio-temporal attention

### Hybrid Model (Proposed)
- Spatial Attention → TimeSformer
- Temporal Modeling → **RetNet**

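The division of labor above can be sketched as a single encoder block. This is an illustrative PyTorch sketch, not the project's exact training code: the module layout, the single-head retention, and the fixed decay `gamma` are assumptions.

```python
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    """Hypothetical hybrid block: spatial self-attention within each
    frame (TimeSformer-style divided attention), then a retention-style
    recurrence over frames in place of temporal self-attention."""

    def __init__(self, dim, heads=8, gamma=0.9):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.gamma = gamma  # exponential decay of the temporal state
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x):  # x: (batch, frames, patches, dim)
        b, t, p, d = x.shape
        # Spatial attention: attend across patches, frame by frame.
        xs = self.norm1(x).reshape(b * t, p, d)
        attn, _ = self.spatial_attn(xs, xs, xs)
        x = x + attn.reshape(b, t, p, d)
        # Temporal retention: one O(1) state update per frame,
        # so the whole clip costs O(t) rather than O(t^2).
        h = self.norm2(x)
        q, k, v = self.q(h), self.k(h), self.v(h)
        S = torch.zeros(b, p, d, d, device=x.device)
        out = []
        for ti in range(t):
            S = self.gamma * S + k[:, ti].unsqueeze(-1) * v[:, ti].unsqueeze(-2)
            out.append(torch.einsum('bpd,bpde->bpe', q[:, ti], S))
        return x + torch.stack(out, dim=1)
```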
RetNet replaces temporal self-attention to reduce complexity from:
- **Quadratic → Linear time**

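The linear-time claim comes from retention's dual forms: the same output can be computed as a decayed causal attention matrix (O(n²)) or as a running state update (O(n) in sequence length). A minimal single-head sketch, with the multi-head structure and normalization of full RetNet omitted for clarity:

```python
import torch

def retention_recurrent(q, k, v, gamma=0.9):
    """Recurrent form: the (dim x dim) state S is updated once per
    step, so cost is linear in sequence length n."""
    n, d = q.shape
    S = torch.zeros(d, d)
    out = []
    for t in range(n):
        S = gamma * S + torch.outer(k[t], v[t])  # decay, then accumulate
        out.append(q[t] @ S)                     # read out at step t
    return torch.stack(out)

def retention_parallel(q, k, v, gamma=0.9):
    """Equivalent parallel form: a causal attention matrix whose
    entries decay as gamma^(i-j) -- O(n^2), like self-attention."""
    n = q.shape[0]
    idx = torch.arange(n)
    decay = gamma ** (idx[:, None] - idx[None, :]).clamp(min=0).float()
    mask = (idx[:, None] >= idx[None, :]).float()
    return ((q @ k.T) * decay * mask) @ v
```

Both functions produce the same outputs; only the recurrent form is used at linear cost over the frame axis.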
---

## Hybrid Model Training Results (UCF101)

| | Epoch | Train Loss | Train Acc | Val Loss | Val Acc | F1 | |
| |------|------------|-----------|----------|---------|-----| |
| | 1 | 4.5275 | 0.0458 | 4.1596 | 0.3542 | 0.3076 | |
| | 2 | 3.6647 | 0.4089 | 2.6496 | 0.7550 | 0.7214 | |
| | 3 | 2.4221 | 0.6995 | 1.5313 | 0.8623 | 0.8509 | |
| | 4 | 1.8874 | 0.7841 | 1.2290 | 0.8961 | 0.8918 | |
| | 5 | 1.7268 | 0.8104 | 1.1584 | 0.9075 | 0.9040 | |
| | 6 | 1.6615 | 0.8145 | 1.1088 | 0.9167 | 0.9142 | |
| | 7 | 1.6076 | 0.8191 | 1.0962 | 0.9202 | 0.9168 | |
| | 8 | 1.5100 | 0.8234 | 1.0865 | **0.9260** | **0.9233** | |
| | 9 | 1.4704 | 0.8232 | 1.0812 | 0.9260 | 0.9226 | |

---

## Best Performance (Hybrid Model)

- **Validation Accuracy:** **92.60%**
- **F1 Score:** 0.9233
- Achieved at Epoch 8

---

## Efficiency Comparison

| Metric | TimeSformer | Hybrid (RetNet) |
|--------|-------------|-----------------|
| Peak GPU Memory | ~9.3–9.8 GB | **~7.2 GB** |
| Training Speed | Slower | **Faster** |
| Temporal Complexity | O(n²) | **O(n)** |

**~25% memory reduction** with comparable performance.

---

## Training Strategy

Due to Kaggle's **12-hour runtime limit**, training was performed in stages:

- Initial training
- Save the best checkpoint
- Resume from the `.safetensors` checkpoint
- Continue training

---

## Training Details

- Mixed Precision Training (`torch.cuda.amp`)
- Checkpoint-based training
- Per-class evaluation reports
- GPU: Kaggle environment

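A minimal sketch of one mixed-precision step with `torch.cuda.amp` (the function and argument names are illustrative; `autocast` and `GradScaler` transparently fall back to full precision when no GPU is present):

```python
import torch
from torch.cuda.amp import GradScaler, autocast

def train_step(model, frames, labels, optimizer, scaler, loss_fn, use_amp=True):
    optimizer.zero_grad()
    # Forward pass under autocast so eligible ops run in float16.
    with autocast(enabled=use_amp):
        loss = loss_fn(model(frames), labels)
    # Scale the loss to avoid float16 gradient underflow; the scaler
    # unscales gradients before the optimizer update.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```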
---

## Base Model

- `facebook/timesformer-base-finetuned-k400`

---

## Usage

```bash
pip install torch torchvision transformers
```
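
After installing the dependencies, the base checkpoint can be loaded through `transformers`. A hedged sketch: `classify_clip` and the dummy-frame shapes are illustrative, and the first call downloads the pretrained weights.

```python
import numpy as np
import torch
from transformers import AutoImageProcessor, TimesformerForVideoClassification

MODEL_ID = "facebook/timesformer-base-finetuned-k400"

def classify_clip(frames, model_id=MODEL_ID):
    """Classify a clip given as a list of HxWx3 uint8 frames.

    Returns the predicted Kinetics-400 label string.
    """
    processor = AutoImageProcessor.from_pretrained(model_id)
    model = TimesformerForVideoClassification.from_pretrained(model_id)
    inputs = processor(frames, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return model.config.id2label[int(logits.argmax(-1))]

# Example (downloads weights; 8 random frames stand in for a decoded clip):
# clip = [np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8) for _ in range(8)]
# print(classify_clip(clip))
```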