sumit7488/RetFormerTrainedOnHDMB51

Video Classification

action-recognition

efficient-models

sumit7488 committed on Mar 20

Commit a42b837 · verified · 1 Parent(s): 68c08e9

Create README.md

Files changed (1)

README.md +127 -0

	@@ -0,0 +1,127 @@

+---
+license: apache-2.0
+tags:
+- video-classification
+- timesformer
+- retnet
+- action-recognition
+- hmdb51
+- efficient-models
+- transformers
+datasets:
+- hmdb51
+---
+# 🎬 RetFormer: Efficient TimeSformer + RetNet for Video Action Recognition
+RetFormer is a hybrid video classification model that replaces the **temporal attention** in TimeSformer with **RetNet**, achieving:
+- ⚡ Lower memory usage
+- 🚀 Faster training
+- 🎯 Competitive accuracy
+---
+## 🧠 Model Architecture
+### 🔹 RetFormer (Proposed)
+- Spatial Modeling → TimeSformer
+- Temporal Modeling → **RetNet**
+
+👉 This replaces quadratic temporal attention with **linear-time temporal modeling (O(n))**, where n is the temporal sequence length
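A minimal sketch of why retention can scale linearly in the temporal sequence length, assuming a single head with a fixed scalar decay `gamma` and omitting the rotation, normalization, and gating used in the full RetNet layer. The function names and toy inputs are illustrative and are not taken from the RetFormer code. The recurrent form carries one small state matrix from step to step, while the parallel form scores every pair of time steps; both give the same output.

```python
def retention_recurrent(q, k, v, gamma):
    """O(n) form: carry a d_k x d_v state S_t = gamma * S_{t-1} + k_t^T v_t."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q_t, k_t, v_t in zip(q, k, v):
        # Decay the running state, then add the outer product k_t^T v_t.
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = gamma * state[i][j] + k_t[i] * v_t[j]
        # Read out o_t = q_t S_t.
        outputs.append(
            [sum(q_t[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs


def retention_parallel(q, k, v, gamma):
    """O(n^2) form: o_t = sum over s <= t of gamma^(t-s) * (q_t . k_s) * v_s."""
    n, d_v = len(q), len(v[0])
    outputs = []
    for t in range(n):
        o_t = [0.0] * d_v
        for s in range(t + 1):
            score = gamma ** (t - s) * sum(a * b for a, b in zip(q[t], k[s]))
            for j in range(d_v):
                o_t[j] += score * v[s][j]
        outputs.append(o_t)
    return outputs


# Toy inputs: 4 time steps (frames), feature size 2. Values are arbitrary.
Q = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]
K = [[0.5, 1.0], [1.0, 0.0], [0.2, 0.3], [0.0, 1.0]]
V = [[1.0, 2.0], [0.0, 1.0], [3.0, 0.5], [1.0, 1.0]]

recurrent = retention_recurrent(Q, K, V, gamma=0.9)
parallel = retention_parallel(Q, K, V, gamma=0.9)
max_diff = max(
    abs(a - b) for row_r, row_p in zip(recurrent, parallel) for a, b in zip(row_r, row_p)
)
print("max difference between the two forms:", max_diff)
```

The recurrent loop does a constant amount of work per time step, so its cost grows linearly with the number of steps, whereas the parallel loop visits every (t, s) pair.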
+---
+## 📊 Dataset
+- **HMDB51**
+  - 51 human action classes
+  - Complex motion patterns
+  - Smaller and more challenging than UCF101
+---
+## 🔁 Training Strategy
+Training was performed in multiple stages due to runtime limits:
+- Initial training (Epochs 1–10)
+- Checkpoint saved at the end of the first stage
+- Training resumed from the checkpoint (Epochs 11–14)
+- Early stopping applied
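The staged schedule above can be sketched as a loop that saves a checkpoint after every epoch, resumes from the last saved epoch, and stops once validation accuracy has not improved for `patience` epochs. This is a hedged illustration only: `EarlyStopper`, `run_stage`, the JSON checkpoint layout, and `patience=3` are assumptions, not the author's training script. The validation accuracies are the ones reported in this card, used as a stand-in for real training and evaluation.

```python
import json
import os
import tempfile


class EarlyStopper:
    """Stop when validation accuracy has not improved for `patience` epochs."""

    def __init__(self, patience=3):
        self.patience = patience  # assumed value, not stated in this card
        self.best = float("-inf")
        self.best_epoch = 0
        self.bad_epochs = 0

    def step(self, epoch, val_acc):
        if val_acc > self.best:
            self.best, self.best_epoch, self.bad_epochs = val_acc, epoch, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience  # True means "stop now"


def run_stage(val_acc_by_epoch, ckpt_path, last_epoch, patience=3):
    """Run epochs up to `last_epoch`, resuming from `ckpt_path` if it exists."""
    stopper = EarlyStopper(patience)
    start_epoch = 1
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            saved = json.load(f)
        start_epoch = saved["epoch"] + 1
        stopper.__dict__.update(saved["stopper"])  # restore early-stopping state
    for epoch in range(start_epoch, last_epoch + 1):
        val_acc = val_acc_by_epoch[epoch - 1]  # stand-in for train + evaluate
        stop = stopper.step(epoch, val_acc)
        # A real run would also save model and optimizer state here.
        with open(ckpt_path, "w") as f:
            json.dump({"epoch": epoch, "stopper": vars(stopper)}, f)
        if stop:
            return epoch, True, stopper
    return last_epoch, False, stopper


# Validation accuracy for Epochs 1-14, as reported in this card.
VAL_ACC = [0.0967, 0.3654, 0.5150, 0.5869, 0.6255, 0.6242, 0.6418,
           0.6359, 0.6556, 0.6614, 0.6699, 0.6621, 0.6595, 0.6588]

with tempfile.TemporaryDirectory() as tmp:
    ckpt = os.path.join(tmp, "ckpt.json")
    run_stage(VAL_ACC, ckpt, last_epoch=10)  # stage 1
    final_epoch, stopped, stopper = run_stage(VAL_ACC, ckpt, last_epoch=len(VAL_ACC))  # stage 2

print(final_epoch, stopped, stopper.best_epoch)  # prints: 14 True 11
```

With the assumed patience of 3, the sketch stops after Epoch 14 and keeps Epoch 11 as the best epoch, which is consistent with the reported run but does not confirm the setting actually used.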
+---
+## 📈 Training Results (Epoch 1–14)
+| Epoch | Train Loss | Train Acc | Val Loss | Val Acc | F1 |
+|------|------------|-----------|----------|---------|-----|
+| 1 | 3.9312 | 0.0350 | 3.8099 | 0.0967 | 0.0855 |
+| 2 | 3.6330 | 0.1791 | 3.2948 | 0.3654 | 0.3149 |
+| 3 | 3.0989 | 0.3691 | 2.6927 | 0.5150 | 0.4579 |
+| 4 | 2.6278 | 0.5048 | 2.2879 | 0.5869 | 0.5503 |
+| 5 | 2.3198 | 0.5782 | 2.0438 | 0.6255 | 0.5961 |
+| 6 | 2.1387 | 0.6194 | 1.9152 | 0.6242 | 0.6074 |
+| 7 | 1.9876 | 0.6657 | 1.8369 | 0.6418 | 0.6308 |
+| 8 | 1.9140 | 0.6936 | 1.7966 | 0.6359 | 0.6188 |
+| 9 | 1.8539 | 0.7041 | 1.7619 | 0.6556 | 0.6426 |
+| 10 | 1.8149 | 0.7244 | 1.7523 | 0.6614 | 0.6512 |
+| 11 | 1.7325 | 0.7524 | 1.7315 | **0.6699** | **0.6614** |
+| 12 | 1.7036 | 0.7584 | 1.7469 | 0.6621 | 0.6515 |
+| 13 | 1.6682 | 0.7717 | 1.7504 | 0.6595 | 0.6496 |
+| 14 | 1.6344 | 0.7785 | 1.7488 | 0.6588 | 0.6494 |
+---
+## 🏆 Best Performance
+- **Validation Accuracy:** **66.99%**
+- **F1 Score:** 0.6614
+- Achieved at **Epoch 11**
+---
+## ⚙️ Training Details
+- Peak GPU Memory: **~7.2 GB**
+- Training Time per Epoch: ~52 minutes
+- Evaluation Time: ~8 minutes
+- Mixed Precision Training (`torch.cuda.amp`)
+- Early stopping triggered after Epoch 14
+---
+## 📌 Observations
+- Stable improvement until **Epoch 11**
+- Slight decline in validation accuracy afterward, suggesting the onset of overfitting
+- Lower accuracy than the baseline (an expected trade-off of the hybrid design)
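The overfitting reading can be checked against the logged metrics: after the best epoch, training accuracy keeps rising while validation accuracy drifts down, so the gap between them widens. A small sketch using the Epoch 11–14 rows of the results table (the `gaps` helper is illustrative):

```python
# (epoch, train_acc, val_acc) copied from the training results table.
ROWS = [
    (11, 0.7524, 0.6699),
    (12, 0.7584, 0.6621),
    (13, 0.7717, 0.6595),
    (14, 0.7785, 0.6588),
]


def gaps(rows):
    """Return (epoch, train_acc - val_acc) for each row, rounded to 4 decimals."""
    return [(epoch, round(train - val, 4)) for epoch, train, val in rows]


for epoch, gap in gaps(ROWS):
    print(f"epoch {epoch}: train - val = {gap:.4f}")
```

A gap that grows every epoch while validation accuracy stays flat or falls is the usual signature of a model starting to fit the training set more than the task.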
+---
+## ⚡ Efficiency Advantage
+| Metric | TimeSformer | RetFormer |
+|-------|------------|----------|
+| Peak GPU Memory | ~9.3 GB | **~7.2 GB** ✅ |
+| Complexity | O(n²) | **O(n)** ✅ |
+| Speed | Slower | Faster |
+
+👉 **~23% reduction in peak GPU memory** (9.3 GB → 7.2 GB)
+---
+## 🔍 Key Insight
+RetFormer demonstrates that:
+- Efficient temporal modeling can **significantly reduce memory usage**
+- Performance remains **competitive with baseline models**
+- Trade-off exists between **efficiency and maximum accuracy**
+---
+## 🚀 Usage
+```bash
+pip install torch torchvision transformers