🎬 RetFormer: Efficient TimeSformer + RetNet for Video Action Recognition

RetFormer is a hybrid video classification model that replaces the temporal attention in TimeSformer with RetNet, achieving:

  • ⚑ Lower memory usage
  • πŸš€ Faster training
  • 🎯 Competitive accuracy

🧠 Model Architecture

🔹 RetFormer (Proposed)

  • Spatial Modeling → TimeSformer
  • Temporal Modeling → RetNet

👉 This replaces quadratic temporal attention, O(n²) in sequence length n, with linear-time temporal modeling, O(n).
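
This card does not show how RetNet is wired into the temporal branch, so the following is only a minimal, framework-free sketch of single-head retention over per-frame feature vectors. It compares the parallel form, which scores every pair of frames, with the recurrent form, which carries a fixed-size state from frame to frame and therefore does work linear in the number of frames. The function names, the decay value `gamma = 0.9`, and the toy inputs are assumptions for illustration; a full RetNet layer also has learned projections, per-head decay rates, normalization, and gating, all omitted here.

```python
# Illustrative sketch only: single-head retention on plain Python lists.
# Not the RetFormer implementation; names and gamma are assumptions.

def retention_parallel(q, k, v, gamma):
    """Parallel form: out[t] = sum over s <= t of gamma**(t-s) * (q[t].k[s]) * v[s]."""
    n, d_v = len(q), len(v[0])
    out = []
    for t in range(n):
        row = [0.0] * d_v
        for s in range(t + 1):  # causal: only frames up to t contribute
            score = sum(a * b for a, b in zip(q[t], k[s])) * gamma ** (t - s)
            for j in range(d_v):
                row[j] += score * v[s][j]
        out.append(row)
    return out


def retention_recurrent(q, k, v, gamma):
    """Recurrent form: S_t = gamma * S_{t-1} + outer(k_t, v_t); out[t] = q_t @ S_t."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # fixed-size d_k x d_v state
    out = []
    for q_t, k_t, v_t in zip(q, k, v):  # one pass over the frames
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = gamma * state[i][j] + k_t[i] * v_t[j]
        out.append([sum(q_t[i] * state[i][j] for i in range(d_k))
                    for j in range(d_v)])
    return out


# Toy input: 4 frames, 2-dimensional queries/keys/values (made-up numbers).
q = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 2.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]

par = retention_parallel(q, k, v, gamma=0.9)
rec = retention_recurrent(q, k, v, gamma=0.9)
max_diff = max(abs(a - b) for rp, rr in zip(par, rec) for a, b in zip(rp, rr))
print(f"max difference between forms: {max_diff:.2e}")
```

Both forms give the same outputs up to floating-point rounding. The recurrent one never materializes an n × n score matrix, which is where the memory saving over attention comes from.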


📊 Dataset

  • HMDB51
    • 51 human action classes
    • Complex motion patterns
    • Smaller and more challenging than UCF101

🔁 Training Strategy

Training was performed in multiple stages due to runtime limits:

  • Initial training (Epochs 1–10)
  • Checkpoint saving
  • Resumed training (Epochs 11–14)
  • Early stopping applied
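
The staged schedule above can be sketched as an early-stopping tracker whose state is saved with each checkpoint, so the patience counter survives a restart. This is a minimal sketch under stated assumptions, not the training script used for this model: the `EarlyStopping` class, the JSON checkpoint file, the 30-epoch cap, and `patience=3` are all hypothetical. A patience of 3 on validation accuracy happens to be consistent with the log, where the best epoch is 11 and training stops after epoch 14. Training and evaluation are replaced by a lookup into the logged validation accuracies.

```python
import json
import os
import tempfile

# Validation accuracy per epoch, copied from the training results table.
VAL_ACC = [0.0967, 0.3654, 0.5150, 0.5869, 0.6255, 0.6242, 0.6418,
           0.6359, 0.6556, 0.6614, 0.6699, 0.6621, 0.6595, 0.6588]


class EarlyStopping:
    """Hypothetical helper: stop after `patience` epochs without a new best."""

    def __init__(self, patience=3):  # patience=3 is an assumption
        self.patience = patience
        self.best_score = float("-inf")
        self.best_epoch = 0
        self.bad_epochs = 0

    def step(self, epoch, score):
        """Record one epoch's metric; return True when training should stop."""
        if score > self.best_score:
            self.best_score, self.best_epoch, self.bad_epochs = score, epoch, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

    def state_dict(self):
        return dict(vars(self))

    def load_state_dict(self, state):
        vars(self).update(state)


def run(start_epoch, end_epoch, stopper, ckpt_path):
    """Replay epochs start..end, checkpointing after each; return the last epoch run."""
    for epoch in range(start_epoch, end_epoch + 1):
        stop = stopper.step(epoch, VAL_ACC[epoch - 1])  # stand-in for train + eval
        with open(ckpt_path, "w") as f:
            json.dump({"epoch": epoch, "stopper": stopper.state_dict()}, f)
        if stop:
            return epoch
    return end_epoch


ckpt_path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")

# Stage 1: epochs 1-10, then the session hits its runtime limit.
run(1, 10, EarlyStopping(), ckpt_path)

# Stage 2: a fresh session restores the checkpoint and continues from epoch 11.
with open(ckpt_path) as f:
    ckpt = json.load(f)
stopper = EarlyStopping()
stopper.load_state_dict(ckpt["stopper"])
last_epoch = run(ckpt["epoch"] + 1, 30, stopper, ckpt_path)  # 30 is an assumed cap

print(f"stopped after epoch {last_epoch}, best epoch {stopper.best_epoch}")
```

In a real PyTorch run the checkpoint would also need the model, optimizer, and gradient-scaler state; only the bookkeeping is shown here.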

📈 Training Results (Epochs 1–14)

| Epoch | Train Loss | Train Acc | Val Loss | Val Acc | F1     |
|-------|------------|-----------|----------|---------|--------|
| 1     | 3.9312     | 0.0350    | 3.8099   | 0.0967  | 0.0855 |
| 2     | 3.6330     | 0.1791    | 3.2948   | 0.3654  | 0.3149 |
| 3     | 3.0989     | 0.3691    | 2.6927   | 0.5150  | 0.4579 |
| 4     | 2.6278     | 0.5048    | 2.2879   | 0.5869  | 0.5503 |
| 5     | 2.3198     | 0.5782    | 2.0438   | 0.6255  | 0.5961 |
| 6     | 2.1387     | 0.6194    | 1.9152   | 0.6242  | 0.6074 |
| 7     | 1.9876     | 0.6657    | 1.8369   | 0.6418  | 0.6308 |
| 8     | 1.9140     | 0.6936    | 1.7966   | 0.6359  | 0.6188 |
| 9     | 1.8539     | 0.7041    | 1.7619   | 0.6556  | 0.6426 |
| 10    | 1.8149     | 0.7244    | 1.7523   | 0.6614  | 0.6512 |
| 11    | 1.7325     | 0.7524    | 1.7315   | 0.6699  | 0.6614 |
| 12    | 1.7036     | 0.7584    | 1.7469   | 0.6621  | 0.6515 |
| 13    | 1.6682     | 0.7717    | 1.7504   | 0.6595  | 0.6496 |
| 14    | 1.6344     | 0.7785    | 1.7488   | 0.6588  | 0.6494 |

🏆 Best Performance

  • Validation Accuracy: 66.99%
  • F1 Score: 0.6614
  • Achieved at Epoch 11

⚙️ Training Details

  • Peak GPU Memory: ~7.2 GB
  • Training Time per Epoch: ~52 minutes
  • Evaluation Time: ~8 minutes
  • Mixed Precision Training (torch.cuda.amp)
  • Early stopping triggered after Epoch 14

📌 Observations

  • Validation accuracy improved until Epoch 11, with minor dips at Epochs 6 and 8
  • Slight decline afterward while training accuracy kept rising → early overfitting
  • Lower accuracy than baseline (expected for hybrid trade-off)
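
To make the overfitting observation concrete, the short sketch below computes the gap between training and validation accuracy for epochs 10 to 14. The numbers are copied from the results table; the variable names are illustrative.

```python
# (epoch, train_acc, val_acc) copied from the training results table.
LOG = [
    (10, 0.7244, 0.6614),
    (11, 0.7524, 0.6699),
    (12, 0.7584, 0.6621),
    (13, 0.7717, 0.6595),
    (14, 0.7785, 0.6588),
]

# Train-minus-validation accuracy per epoch, rounded to the table's precision.
gaps = {epoch: round(train - val, 4) for epoch, train, val in LOG}
best_epoch = max(LOG, key=lambda row: row[2])[0]  # epoch with highest val acc

for epoch, gap in gaps.items():
    marker = " (best val acc)" if epoch == best_epoch else ""
    print(f"epoch {epoch}: train-val gap = {gap:.4f}{marker}")
```

The gap widens from about 0.08 at Epoch 11 to about 0.12 at Epoch 14 while validation accuracy falls, which is the pattern the early-stopping rule is meant to catch.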

⚑ Efficiency Advantage

Metric TimeSformer RetFormer
Peak GPU Memory ~9.3 GB ~7.2 GB βœ…
Complexity O(nΒ²) O(n) βœ…
Speed Slower Faster

👉 ~23% reduction in peak GPU memory (~9.3 GB → ~7.2 GB)
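
The relative saving follows directly from the two peak-memory figures in the table. As a quick check, using the approximate values reported above:

```python
timesformer_gb = 9.3  # peak GPU memory reported for TimeSformer
retformer_gb = 7.2    # peak GPU memory reported for RetFormer

saved_gb = timesformer_gb - retformer_gb
reduction = saved_gb / timesformer_gb  # fraction of the baseline that is saved
print(f"saved {saved_gb:.1f} GB, reduction {reduction:.1%}")
# prints: saved 2.1 GB, reduction 22.6%
```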


🔍 Key Insight

RetFormer demonstrates that:

  • Efficient temporal modeling can significantly reduce memory usage
  • Performance remains competitive with baseline models
  • Trade-off exists between efficiency and maximum accuracy

🚀 Usage

pip install torch torchvision transformers