# TimeSformer + RetNet Hybrid for Efficient Video Action Recognition
This project presents a hybrid architecture that replaces the temporal attention mechanism in TimeSformer with RetNet, achieving:

- Faster training
- Lower memory usage
- Comparable or improved accuracy
## Model Variants
We trained and evaluated 4 configurations:
| Model | Dataset |
|---|---|
| TimeSformer (Baseline) | UCF101 |
| TimeSformer (Baseline) | HMDB51 |
| TimeSformer + RetNet (Hybrid) | UCF101 |
| TimeSformer + RetNet (Hybrid) | HMDB51 |
## Proposed Architecture

### Baseline

- TimeSformer
- Full spatio-temporal attention

### Hybrid Model (Proposed)

- Spatial Attention → TimeSformer
- Temporal Modeling → RetNet
RetNet replaces temporal self-attention, reducing temporal complexity from quadratic, O(n²), to linear, O(n), in the number of frames.
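The linear-time behavior comes from RetNet's recurrent formulation of retention: rather than attending over all past frames, each step folds the past into a fixed-size state. A minimal single-head sketch in PyTorch (names, shapes, and the decay value are illustrative, not the project's actual code):

```python
import torch

def recurrent_retention(q, k, v, gamma=0.9):
    """Single-head retention over a sequence, O(n) in sequence length.

    q, k, v: (seq_len, d) tensors; gamma: scalar decay in (0, 1).
    Keeps a (d, d) state instead of attending over all past steps.
    """
    d = q.shape[-1]
    state = torch.zeros(d, d)          # fixed-size memory of the past
    outputs = []
    for t in range(q.shape[0]):
        # fold the current key/value pair into the decayed state
        state = gamma * state + torch.outer(k[t], v[t])
        outputs.append(q[t] @ state)   # read out with the query
    return torch.stack(outputs)        # (seq_len, d)
```

Per step the cost is O(d²) regardless of how many frames came before, so a sequence of n frames costs O(n·d²), versus O(n²·d) for full temporal self-attention.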
## Hybrid Model Training Results (UCF101)
| Epoch | Train Loss | Train Acc | Val Loss | Val Acc | F1 |
|---|---|---|---|---|---|
| 1 | 4.5275 | 0.0458 | 4.1596 | 0.3542 | 0.3076 |
| 2 | 3.6647 | 0.4089 | 2.6496 | 0.7550 | 0.7214 |
| 3 | 2.4221 | 0.6995 | 1.5313 | 0.8623 | 0.8509 |
| 4 | 1.8874 | 0.7841 | 1.2290 | 0.8961 | 0.8918 |
| 5 | 1.7268 | 0.8104 | 1.1584 | 0.9075 | 0.9040 |
| 6 | 1.6615 | 0.8145 | 1.1088 | 0.9167 | 0.9142 |
| 7 | 1.6076 | 0.8191 | 1.0962 | 0.9202 | 0.9168 |
| 8 | 1.5100 | 0.8234 | 1.0865 | 0.9260 | 0.9233 |
| 9 | 1.4704 | 0.8232 | 1.0812 | 0.9260 | 0.9226 |
## Best Performance (Hybrid Model)
- Validation Accuracy: 92.60%
- F1 Score: 0.9233
- Achieved at Epoch 8
## Efficiency Comparison
| Metric | TimeSformer | Hybrid (RetNet) |
|---|---|---|
| Peak GPU Memory | ~9.3–9.8 GB | ~7.2 GB |
| Training Speed | Slower | Faster |
| Temporal Complexity | O(n²) | O(n) |
~25% peak-memory reduction with comparable accuracy.
## Training Strategy
Due to Kaggle's 12-hour runtime limit, training was performed in stages:
- Initial training
- Save best checkpoint
- Resume from `.safetensors`
- Continue training
## Training Details
- Mixed Precision Training (`torch.cuda.amp`)
- Checkpoint-based training
- Per-class evaluation reports
- GPU: Kaggle environment
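A minimal mixed-precision training step with `torch.cuda.amp` looks roughly like the following. The model and loss here are dummies standing in for the hybrid network; on a machine without CUDA the autocast and scaler paths simply no-op:

```python
import torch

use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"

model = torch.nn.Linear(16, 101).to(device)   # 101 = UCF101 classes
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

def train_step(x, y):
    optimizer.zero_grad()
    # forward pass runs in reduced precision where supported
    with torch.cuda.amp.autocast(enabled=use_cuda):
        loss = torch.nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()   # scale to avoid fp16 gradient underflow
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```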
## Base Model
`facebook/timesformer-base-finetuned-k400`
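To experiment with the backbone without downloading the full checkpoint, a small randomly initialized TimeSformer can be built from a `transformers` config. The tiny sizes below are illustrative only; the released weights above use 224×224 frames and a 768-dim hidden size:

```python
import torch
from transformers import TimesformerConfig, TimesformerModel

# tiny illustrative config (random weights, not the pretrained checkpoint)
config = TimesformerConfig(
    image_size=32, patch_size=16, num_frames=2,
    hidden_size=32, num_hidden_layers=2,
    num_attention_heads=2, intermediate_size=64,
)
model = TimesformerModel(config)

# pixel_values shape: (batch, num_frames, channels, height, width)
video = torch.randn(1, 2, 3, 32, 32)
features = model(pixel_values=video).last_hidden_state
```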
## Usage

```bash
pip install torch torchvision transformers
```