# TimeSformer + RetNet Hybrid for Efficient Video Action Recognition
This project presents a hybrid architecture that replaces the temporal attention mechanism in TimeSformer with RetNet, achieving:

- Faster training
- Lower memory usage
- Comparable or improved accuracy
## Model Variants
We trained and evaluated 4 configurations:
| Model | Dataset |
|---|---|
| TimeSformer (Baseline) | UCF101 |
| TimeSformer (Baseline) | HMDB51 |
| TimeSformer + RetNet (Hybrid) | UCF101 |
| TimeSformer + RetNet (Hybrid) | HMDB51 |
## Proposed Architecture

### Baseline

- TimeSformer
- Full spatio-temporal attention

### Hybrid Model (Proposed)

- Spatial Attention → TimeSformer
- Temporal Modeling → RetNet
RetNet replaces temporal self-attention, reducing temporal complexity from quadratic, O(n²), to linear, O(n), in the number of frames.
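The linear-time behavior comes from RetNet's recurrent formulation of retention: rather than attending over all past frames, each step folds the past into a fixed-size state. A minimal single-head sketch in PyTorch (names, shapes, and the decay value are illustrative, not the project's actual code):

```python
import torch

def recurrent_retention(q, k, v, gamma=0.9):
    """Single-head retention over a sequence, O(n) in sequence length.

    q, k, v: (seq_len, d) tensors; gamma: scalar decay in (0, 1).
    Keeps a (d, d) state instead of attending over all past steps.
    """
    d = q.shape[-1]
    state = torch.zeros(d, d)          # fixed-size memory of the past
    outputs = []
    for t in range(q.shape[0]):
        # fold the current key/value pair into the decayed state
        state = gamma * state + torch.outer(k[t], v[t])
        outputs.append(q[t] @ state)   # read out with the query
    return torch.stack(outputs)        # (seq_len, d)
```

Per step the cost is O(d²) regardless of how many frames came before, so a sequence of n frames costs O(n·d²), versus O(n²·d) for full temporal self-attention.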
## Hybrid Model Training Results (UCF101)
| Epoch | Train Loss | Train Acc | Val Loss | Val Acc | F1 |
|---|---|---|---|---|---|
| 1 | 4.5275 | 0.0458 | 4.1596 | 0.3542 | 0.3076 |
| 2 | 3.6647 | 0.4089 | 2.6496 | 0.7550 | 0.7214 |
| 3 | 2.4221 | 0.6995 | 1.5313 | 0.8623 | 0.8509 |
| 4 | 1.8874 | 0.7841 | 1.2290 | 0.8961 | 0.8918 |
| 5 | 1.7268 | 0.8104 | 1.1584 | 0.9075 | 0.9040 |
| 6 | 1.6615 | 0.8145 | 1.1088 | 0.9167 | 0.9142 |
| 7 | 1.6076 | 0.8191 | 1.0962 | 0.9202 | 0.9168 |
| 8 | 1.5100 | 0.8234 | 1.0865 | 0.9260 | 0.9233 |
| 9 | 1.4704 | 0.8232 | 1.0812 | 0.9260 | 0.9226 |
## Best Performance (Hybrid Model)
- Validation Accuracy: 92.60%
- F1 Score: 0.9233
- Achieved at Epoch 8
## Efficiency Comparison
| Metric | TimeSformer | Hybrid (RetNet) |
|---|---|---|
| Peak GPU Memory | ~9.3–9.8 GB | ~7.2 GB |
| Training Speed | Slower | Faster |
| Temporal Complexity | O(n²) | O(n) |
~25% peak-memory reduction with comparable accuracy.
## Training Strategy
Due to Kaggle's 12-hour runtime limit, training was performed in stages:
- Initial training
- Save best checkpoint
- Resume from `.safetensors`
- Continue training
## Training Details
- Mixed Precision Training (`torch.cuda.amp`)
- Checkpoint-based training
- Per-class evaluation reports
- GPU: Kaggle environment
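A minimal mixed-precision training step with `torch.cuda.amp` looks roughly like the following. The model and loss here are dummies standing in for the hybrid network; on a machine without CUDA the autocast and scaler paths simply no-op:

```python
import torch

use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"

model = torch.nn.Linear(16, 101).to(device)   # 101 = UCF101 classes
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

def train_step(x, y):
    optimizer.zero_grad()
    # forward pass runs in reduced precision where supported
    with torch.cuda.amp.autocast(enabled=use_cuda):
        loss = torch.nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()   # scale to avoid fp16 gradient underflow
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```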
## Base Model
`facebook/timesformer-base-finetuned-k400`
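To experiment with the backbone without downloading the full checkpoint, a small randomly initialized TimeSformer can be built from a `transformers` config. The tiny sizes below are illustrative only; the released weights above use 224×224 frames and a 768-dim hidden size:

```python
import torch
from transformers import TimesformerConfig, TimesformerModel

# tiny illustrative config (random weights, not the pretrained checkpoint)
config = TimesformerConfig(
    image_size=32, patch_size=16, num_frames=2,
    hidden_size=32, num_hidden_layers=2,
    num_attention_heads=2, intermediate_size=64,
)
model = TimesformerModel(config)

# pixel_values shape: (batch, num_frames, channels, height, width)
video = torch.randn(1, 2, 3, 32, 32)
features = model(pixel_values=video).last_hidden_state
```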
## Usage

```bash
pip install torch torchvision transformers
```