🎬 TimeSformer + RetNet Hybrid for Efficient Video Action Recognition

This project presents a hybrid architecture that replaces the temporal attention mechanism in TimeSformer with RetNet, achieving:

  • ⚡ Faster training
  • 🧠 Lower memory usage
  • 🎯 Comparable or improved accuracy

🚀 Model Variants

We trained and evaluated 4 configurations:

Model                          Dataset
TimeSformer (Baseline)         UCF101
TimeSformer (Baseline)         HMDB51
TimeSformer + RetNet (Hybrid)  UCF101
TimeSformer + RetNet (Hybrid)  HMDB51

🧠 Proposed Architecture

🔹 Baseline

  • TimeSformer
  • Full spatio-temporal attention

🔹 Hybrid Model (Proposed)

  • Spatial Attention → TimeSformer
  • Temporal Modeling → RetNet

👉 RetNet replaces temporal self-attention, reducing the complexity over the frame sequence from:

  • Quadratic O(n²) → Linear O(n) time
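The quadratic-to-linear reduction can be sketched numerically. The NumPy snippet below is an illustrative single-head, unnormalized version of RetNet-style retention (the function names and setup are assumptions for illustration, not this repo's code): the parallel form builds a full decay-masked n×n matrix, while the recurrent form produces the same outputs in one O(n) pass over a fixed-size state.

```python
import numpy as np

def retention_parallel(Q, K, V, gamma):
    # O(n^2) parallel form: full decay-masked QK^T matrix
    n = Q.shape[0]
    idx = np.arange(n)
    # D[t, s] = gamma^(t-s) for s <= t, else 0 (causal exponential decay)
    D = np.where(idx[:, None] >= idx[None, :],
                 gamma ** (idx[:, None] - idx[None, :]), 0.0)
    return (Q @ K.T * D) @ V

def retention_recurrent(Q, K, V, gamma):
    # O(n) recurrent form: state S_t = gamma * S_{t-1} + k_t v_t^T
    d_k, d_v = K.shape[1], V.shape[1]
    S = np.zeros((d_k, d_v))
    out = np.empty((Q.shape[0], d_v))
    for t in range(Q.shape[0]):
        S = gamma * S + np.outer(K[t], V[t])
        out[t] = Q[t] @ S  # o_t = q_t S_t
    return out
```

Both forms compute identical outputs; the recurrent one is what makes temporal modeling linear in the number of frames.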

📊 Hybrid Model Training Results (UCF101)

Epoch  Train Loss  Train Acc  Val Loss  Val Acc  F1
1      4.5275      0.0458     4.1596    0.3542   0.3076
2      3.6647      0.4089     2.6496    0.7550   0.7214
3      2.4221      0.6995     1.5313    0.8623   0.8509
4      1.8874      0.7841     1.2290    0.8961   0.8918
5      1.7268      0.8104     1.1584    0.9075   0.9040
6      1.6615      0.8145     1.1088    0.9167   0.9142
7      1.6076      0.8191     1.0962    0.9202   0.9168
8      1.5100      0.8234     1.0865    0.9260   0.9233
9      1.4704      0.8232     1.0812    0.9260   0.9226

πŸ† Best Performance (Hybrid Model)

  • Validation Accuracy: 92.60%
  • F1 Score: 0.9233
  • Achieved at Epoch 8

⚡ Efficiency Comparison

Metric               TimeSformer   Hybrid (RetNet)
Peak GPU Memory      ~9.3–9.8 GB   ~7.2 GB ✅
Training Speed       Slower        Faster ✅
Temporal Complexity  O(n²)         O(n) ✅

👉 ~25% memory reduction with comparable performance.


πŸ” Training Strategy

Due to Kaggle’s 12-hour runtime limit, training was performed in stages:

  • Initial training
  • Save best checkpoint
  • Resume from .safetensors
  • Continue training

βš™οΈ Training Details

  • Mixed Precision Training (torch.cuda.amp)
  • Checkpoint-based training
  • Per-class evaluation reports
  • GPU: Kaggle environment
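The mixed-precision step looks roughly like the generic `torch.cuda.amp` pattern below (a sketch, not this repo's exact training loop; the function name `train_step` is illustrative). On a CPU-only machine the scaler and autocast simply run disabled:

```python
import torch

def train_step(model, frames, labels, optimizer, scaler, loss_fn):
    use_cuda = torch.cuda.is_available()
    optimizer.zero_grad(set_to_none=True)
    # Forward pass runs in reduced precision inside autocast
    with torch.autocast(device_type="cuda" if use_cuda else "cpu", enabled=use_cuda):
        loss = loss_fn(model(frames), labels)
    # GradScaler rescales the loss so fp16 gradients don't underflow
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```

The scaler itself is created once, e.g. `scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())`, and reused across steps.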

📦 Base Model

  • facebook/timesformer-base-finetuned-k400

🚀 Usage

pip install torch torchvision transformers
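Once the dependencies are installed, the base model loads through `transformers`. The sketch below builds a tiny random `TimesformerConfig` locally so it runs without downloading weights; in real usage you would instead call `TimesformerForVideoClassification.from_pretrained("facebook/timesformer-base-finetuned-k400")` (the small config values here are purely illustrative):

```python
import torch
from transformers import TimesformerConfig, TimesformerForVideoClassification

# Tiny illustrative config; real usage loads
# "facebook/timesformer-base-finetuned-k400" via from_pretrained(...)
config = TimesformerConfig(
    image_size=32, patch_size=16, num_frames=4,
    hidden_size=64, num_hidden_layers=2,
    num_attention_heads=2, intermediate_size=128,
    num_labels=101,  # UCF101 has 101 action classes
)
model = TimesformerForVideoClassification(config).eval()

# TimeSformer expects video input as (batch, frames, channels, height, width)
video = torch.randn(1, 4, 3, 32, 32)
with torch.no_grad():
    logits = model(pixel_values=video).logits  # shape: (batch, num_labels)
```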