---
license: mit
tags:
- video-classification
- timesformer
- retnet
- action-recognition
- ucf101
- hmdb51
- transformers
- efficient-models
datasets:
- ucf101
- hmdb51
---
# TimeSformer + RetNet Hybrid for Efficient Video Action Recognition
This project presents a **hybrid architecture** that replaces the temporal attention mechanism in TimeSformer with **RetNet**, achieving:
- Faster training
- Lower memory usage
- Comparable or improved accuracy
---
## Model Variants
We trained and evaluated **4 configurations**:
| Model | Dataset |
|------|--------|
| TimeSformer (Baseline) | UCF101 |
| TimeSformer (Baseline) | HMDB51 |
| **TimeSformer + RetNet (Hybrid)** | UCF101 |
| **TimeSformer + RetNet (Hybrid)** | HMDB51 |
---
## Proposed Architecture
### Baseline
- **TimeSformer**
- Full spatio-temporal attention
### Hybrid Model (Proposed)
- Spatial Attention → TimeSformer
- Temporal Modeling → **RetNet**
RetNet replaces temporal self-attention, reducing temporal complexity from **quadratic to linear** in the number of frames.
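The quadratic → linear trade-off can be illustrated with retention's two equivalent forms. The sketch below is a hypothetical single-head NumPy illustration (not this project's code): the recurrent form keeps only a fixed-size `d × d` state per step, so cost grows linearly with sequence length, while the parallel form materializes the full `n × n` decay-masked score matrix.

```python
import numpy as np

def retention_recurrent(q, k, v, gamma=0.9):
    """O(n) recurrent retention: one fixed-size state update per step."""
    n, d = q.shape
    state = np.zeros((d, d))  # S_t: size independent of sequence length
    out = np.empty_like(v)
    for t in range(n):
        state = gamma * state + np.outer(k[t], v[t])  # S_t = g*S_{t-1} + k_t v_t^T
        out[t] = q[t] @ state                         # o_t = q_t S_t
    return out

def retention_parallel(q, k, v, gamma=0.9):
    """Equivalent O(n^2) parallel form with an explicit (n, n) decay mask."""
    n = q.shape[0]
    idx = np.arange(n)
    decay = np.tril(gamma ** (idx[:, None] - idx[None, :]))  # D[i, j] = g^(i-j), i >= j
    return (decay * (q @ k.T)) @ v
```

Both forms produce identical outputs; only the recurrent one avoids the quadratic score matrix.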
---
## Hybrid Model Training Results (UCF101)
| Epoch | Train Loss | Train Acc | Val Loss | Val Acc | F1 |
|------|------------|-----------|----------|---------|-----|
| 1 | 4.5275 | 0.0458 | 4.1596 | 0.3542 | 0.3076 |
| 2 | 3.6647 | 0.4089 | 2.6496 | 0.7550 | 0.7214 |
| 3 | 2.4221 | 0.6995 | 1.5313 | 0.8623 | 0.8509 |
| 4 | 1.8874 | 0.7841 | 1.2290 | 0.8961 | 0.8918 |
| 5 | 1.7268 | 0.8104 | 1.1584 | 0.9075 | 0.9040 |
| 6 | 1.6615 | 0.8145 | 1.1088 | 0.9167 | 0.9142 |
| 7 | 1.6076 | 0.8191 | 1.0962 | 0.9202 | 0.9168 |
| 8 | 1.5100 | 0.8234 | 1.0865 | **0.9260** | **0.9233** |
| 9 | 1.4704 | 0.8232 | 1.0812 | 0.9260 | 0.9226 |
---
## Best Performance (Hybrid Model)
- **Validation Accuracy:** **92.60%**
- **F1 Score:** 0.9233
- Achieved at Epoch 8
---
## Efficiency Comparison
| Metric | TimeSformer | Hybrid (RetNet) |
|--------|-------------|-----------------|
| Peak GPU Memory | ~9.3–9.8 GB | **~7.2 GB** |
| Training Speed | Slower | **Faster** |
| Temporal Complexity | O(n²) | **O(n)** |

**~25% memory reduction** with comparable performance.
---
## Training Strategy
Due to Kaggle's **12-hour runtime limit**, training was performed in stages:
- Initial training
- Save best checkpoint
- Resume from `.safetensors`
- Continue training
---
## Training Details
- Mixed Precision Training (`torch.cuda.amp`)
- Checkpoint-based training
- Per-class evaluation reports
- GPU: Kaggle environment
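A minimal mixed-precision training step with `torch.cuda.amp`, as listed above, might look like this. The tiny `Linear` model is a stand-in for the actual hybrid network, and the code falls back to full precision on CPU.

```python
import torch
from torch import nn

use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"
model = nn.Linear(8, 4).to(device)  # stand-in for the hybrid model
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)  # no-op on CPU

def train_step(x, y):
    x, y = x.to(device), y.to(device)
    opt.zero_grad()
    with torch.cuda.amp.autocast(enabled=use_cuda):  # fp16 forward on GPU
        loss = nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()  # scale to avoid fp16 gradient underflow
    scaler.step(opt)               # unscales grads; skips step on inf/nan
    scaler.update()
    return loss.item()
```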
---
## Base Model
- `facebook/timesformer-base-finetuned-k400`
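This checkpoint loads through the `transformers` Timesformer classes. The sketch below instantiates a tiny, randomly initialized config instead of downloading weights, so it runs anywhere; a real run would swap in `TimesformerForVideoClassification.from_pretrained("facebook/timesformer-base-finetuned-k400")` and fine-tune on the target dataset.

```python
import torch
from transformers import TimesformerConfig, TimesformerForVideoClassification

# Tiny illustrative config (random weights); the hyperparameters here are
# deliberately small and do NOT match the base checkpoint.
config = TimesformerConfig(
    image_size=32, patch_size=16, num_frames=4,
    hidden_size=64, num_hidden_layers=2, num_attention_heads=4,
    intermediate_size=128,
    num_labels=101,  # e.g. UCF101 classes
)
model = TimesformerForVideoClassification(config).eval()

# Timesformer expects (batch, frames, channels, height, width)
video = torch.randn(1, 4, 3, 32, 32)
with torch.no_grad():
    logits = model(pixel_values=video).logits  # (batch, num_labels)
```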
---
## Usage
```bash
pip install torch torchvision transformers
```