---
license: mit
tags:
- video-classification
- timesformer
- retnet
- action-recognition
- ucf101
- hmdb51
- transformers
- efficient-models
datasets:
- ucf101
- hmdb51
---

# 🎬 TimeSformer + RetNet Hybrid for Efficient Video Action Recognition

This project presents a **hybrid architecture** that replaces the temporal attention mechanism in TimeSformer with **RetNet**, achieving:

- ⚡ Faster training
- 🧠 Lower memory usage
- 🎯 Comparable or improved accuracy

---

## 🚀 Model Variants

We trained and evaluated **4 configurations**:

| Model | Dataset |
|-------|---------|
| TimeSformer (Baseline) | UCF101 |
| TimeSformer (Baseline) | HMDB51 |
| **TimeSformer + RetNet (Hybrid)** | UCF101 |
| **TimeSformer + RetNet (Hybrid)** | HMDB51 |

---

## 🧠 Proposed Architecture

### 🔹 Baseline

- **TimeSformer**
- Full spatio-temporal attention

### 🔹 Hybrid Model (Proposed)

- Spatial Attention → TimeSformer
- Temporal Modeling → **RetNet**

👉 RetNet replaces temporal self-attention, reducing temporal complexity from **quadratic to linear** in the number of frames.

---

## 📊 Hybrid Model Training Results (UCF101)

| Epoch | Train Loss | Train Acc | Val Loss | Val Acc | F1 |
|-------|------------|-----------|----------|---------|-----|
| 1 | 4.5275 | 0.0458 | 4.1596 | 0.3542 | 0.3076 |
| 2 | 3.6647 | 0.4089 | 2.6496 | 0.7550 | 0.7214 |
| 3 | 2.4221 | 0.6995 | 1.5313 | 0.8623 | 0.8509 |
| 4 | 1.8874 | 0.7841 | 1.2290 | 0.8961 | 0.8918 |
| 5 | 1.7268 | 0.8104 | 1.1584 | 0.9075 | 0.9040 |
| 6 | 1.6615 | 0.8145 | 1.1088 | 0.9167 | 0.9142 |
| 7 | 1.6076 | 0.8191 | 1.0962 | 0.9202 | 0.9168 |
| 8 | 1.5100 | 0.8234 | 1.0865 | **0.9260** | **0.9233** |
| 9 | 1.4704 | 0.8232 | 1.0812 | 0.9260 | 0.9226 |

---

## 🏆 Best Performance (Hybrid Model)

- **Validation Accuracy:** **92.60%**
- **F1 Score:** 0.9233
- Achieved at Epoch 8

---

## ⚡ Efficiency Comparison

| Metric | TimeSformer | Hybrid (RetNet) |
|--------|-------------|-----------------|
| Peak GPU Memory | ~9.3–9.8 GB | **~7.2 GB** ✅ |
| Training Speed | Slower | **Faster** ✅ |
| Temporal Complexity | O(n²) | **O(n)** ✅ |

👉 **~25% memory reduction** with comparable performance.

---

## 🔁 Training Strategy

Due to Kaggle's **12-hour runtime limit**, training was performed in stages:

- Initial training
- Save best checkpoint
- Resume from `.safetensors`
- Continue training

---

## ⚙️ Training Details

- Mixed Precision Training (`torch.cuda.amp`)
- Checkpoint-based training
- Per-class evaluation reports
- GPU: Kaggle environment

---

## 📦 Base Model

- `facebook/timesformer-base-finetuned-k400`

---

## 🚀 Usage

```bash
pip install torch torchvision transformers
```