sumit7488/TimesNet

Video Classification

action-recognition

efficient-models

sumit7488 committed on Mar 19

Commit 1293b6c · verified · 1 Parent(s): d86bf87

Create README.md

Files changed (1)

README.md +120 -0

README.md ADDED

	@@ -0,0 +1,120 @@

+---
+license: mit
+tags:
+- video-classification
+- timesformer
+- retnet
+- action-recognition
+- ucf101
+- hmdb51
+- transformers
+- efficient-models
+datasets:
+- ucf101
+- hmdb51
+---
+# 🎬 TimeSformer + RetNet Hybrid for Efficient Video Action Recognition
+This project presents a **hybrid architecture** that replaces TimeSformer's temporal self-attention with the retention mechanism from **RetNet**, achieving:
+- ⚡ Faster training
+- 🧠 Lower memory usage
+- 🎯 Comparable or improved accuracy
+---
+## 🚀 Model Variants
+We trained and evaluated **4 configurations**:
+| Model | Dataset |
+|------|--------|
+| TimeSformer (Baseline) | UCF101 |
+| TimeSformer (Baseline) | HMDB51 |
+| **TimeSformer + RetNet (Hybrid)** | UCF101 |
+| **TimeSformer + RetNet (Hybrid)** | HMDB51 |
+---
+## 🧠 Proposed Architecture
+### 🔹 Baseline
+- **TimeSformer**
+- Self-attention used for both spatial and temporal modeling
+### 🔹 Hybrid Model (Proposed)
+- Spatial Attention → TimeSformer
+- Temporal Modeling → **RetNet**
+👉 RetNet's retention replaces temporal self-attention, reducing the cost of temporal modeling in the number of frames from:
+- **Quadratic → Linear time**
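The quadratic-to-linear claim can be illustrated with a minimal, dependency-free sketch of single-head retention over a sequence of frame tokens. It is not the project's code: the function names are hypothetical, and it assumes a single fixed decay `gamma` with no position rotation or normalization. The first function revisits all earlier frames for every output; the second carries a fixed-size state forward and gives the same result in one pass.

```python
# Illustrative single-head retention over T frame tokens.
# Assumptions (not from the README): one decay value `gamma`,
# no position rotation, no normalization.

def retention_parallel(q, k, v, gamma):
    """Quadratic form: o_n = sum over m <= n of gamma**(n-m) * (q_n . k_m) * v_m."""
    num_frames, d_v = len(q), len(v[0])
    out = []
    for n in range(num_frames):
        o_n = [0.0] * d_v
        for m in range(n + 1):  # every earlier frame is revisited: O(T^2)
            score = gamma ** (n - m) * sum(a * b for a, b in zip(q[n], k[m]))
            for j in range(d_v):
                o_n[j] += score * v[m][j]
        out.append(o_n)
    return out


def retention_recurrent(q, k, v, gamma):
    """Linear form: S_n = gamma * S_{n-1} + k_n^T v_n, then o_n = q_n S_n."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # fixed-size d_k x d_v state
    out = []
    for q_n, k_n, v_n in zip(q, k, v):  # single pass over the frames: O(T)
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = gamma * state[i][j] + k_n[i] * v_n[j]
        out.append([sum(q_n[i] * state[i][j] for i in range(d_k))
                    for j in range(d_v)])
    return out
```

Because the recurrent state has a fixed size regardless of clip length, per-frame cost stays constant as more frames are added.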
+---
+## 📊 Hybrid Model Training Results (UCF101)
+| Epoch | Train Loss | Train Acc | Val Loss | Val Acc | F1 |
+|------|------------|-----------|----------|---------|-----|
+| 1 | 4.5275 | 0.0458 | 4.1596 | 0.3542 | 0.3076 |
+| 2 | 3.6647 | 0.4089 | 2.6496 | 0.7550 | 0.7214 |
+| 3 | 2.4221 | 0.6995 | 1.5313 | 0.8623 | 0.8509 |
+| 4 | 1.8874 | 0.7841 | 1.2290 | 0.8961 | 0.8918 |
+| 5 | 1.7268 | 0.8104 | 1.1584 | 0.9075 | 0.9040 |
+| 6 | 1.6615 | 0.8145 | 1.1088 | 0.9167 | 0.9142 |
+| 7 | 1.6076 | 0.8191 | 1.0962 | 0.9202 | 0.9168 |
+| 8 | 1.5100 | 0.8234 | 1.0865 | **0.9260** | **0.9233** |
+| 9 | 1.4704 | 0.8232 | 1.0812 | 0.9260 | 0.9226 |
+---
+## 🏆 Best Performance (Hybrid Model)
+- **Validation Accuracy:** **92.60%**
+- **F1 Score:** 0.9233
+- Achieved at Epoch 8
+---
+## ⚡ Efficiency Comparison
+| Metric | TimeSformer | Hybrid (RetNet) |
+|-------|------------|----------------|
+| Peak GPU Memory | ~9.3–9.8 GB | **~7.2 GB** ✅ |
+| Training Speed | Slower | **Faster** ✅ |
+| Temporal Complexity (n = number of frames) | O(n²) | **O(n)** ✅ |
+👉 **~25% reduction in peak GPU memory** with comparable performance.
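As a quick check of the memory figure, the reduction can be recomputed from the table's own numbers. The helper name below is hypothetical; the inputs are the peak-memory values reported above.

```python
def reduction_pct(baseline_gb, hybrid_gb):
    """Percentage of the baseline's peak memory saved by the hybrid model."""
    return 100.0 * (baseline_gb - hybrid_gb) / baseline_gb


# Baseline peak memory is reported as a range, so compute both ends.
low = reduction_pct(9.3, 7.2)
high = reduction_pct(9.8, 7.2)
print(f"{low:.1f}% to {high:.1f}%")  # prints "22.6% to 26.5%"
```

The rounded ~25% figure sits inside this range.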
+---
+## 🔁 Training Strategy
+Due to Kaggle’s **12-hour runtime limit**, training was performed in stages:
+- Initial training
+- Save best checkpoint
+- Resume from `.safetensors`
+- Continue training
+---
+## ⚙️ Training Details
+- Mixed Precision Training (`torch.cuda.amp`)
+- Checkpoint-based training
+- Per-class evaluation reports
+- GPU: Kaggle environment
+---
+## 📦 Base Model
+- `facebook/timesformer-base-finetuned-k400`
+---
+## 🚀 Usage
+```bash
+pip install torch torchvision transformers