---
license: mit
tags:
- video-classification
- timesformer
- retnet
- action-recognition
- ucf101
- hmdb51
- transformers
- efficient-models
datasets:
- ucf101
- hmdb51
---

# TimeSformer + RetNet Hybrid for Efficient Video Action Recognition

This project presents a **hybrid architecture** that replaces the temporal attention mechanism in TimeSformer with **RetNet**, achieving:

- Faster training
- Lower memory usage
- Comparable or improved accuracy

---

## Model Variants

We trained and evaluated **4 configurations**:

| Model | Dataset |
|-------|---------|
| TimeSformer (Baseline) | UCF101 |
| TimeSformer (Baseline) | HMDB51 |
| **TimeSformer + RetNet (Hybrid)** | UCF101 |
| **TimeSformer + RetNet (Hybrid)** | HMDB51 |

---

## Proposed Architecture

### Baseline
- **TimeSformer**
- Full spatio-temporal attention

### Hybrid Model (Proposed)
- Spatial Attention → TimeSformer
- Temporal Modeling → **RetNet**

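The division of labor above can be sketched as a single encoder block. This is an illustrative PyTorch sketch, not the project's exact training code: the module layout, the single-head retention, and the fixed decay `gamma` are assumptions.

```python
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    """Hypothetical hybrid block: spatial self-attention within each
    frame (TimeSformer-style divided attention), then a retention-style
    recurrence over frames in place of temporal self-attention."""

    def __init__(self, dim, heads=8, gamma=0.9):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.gamma = gamma  # exponential decay of the temporal state
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x):  # x: (batch, frames, patches, dim)
        b, t, p, d = x.shape
        # Spatial attention: attend across patches, frame by frame.
        xs = self.norm1(x).reshape(b * t, p, d)
        attn, _ = self.spatial_attn(xs, xs, xs)
        x = x + attn.reshape(b, t, p, d)
        # Temporal retention: one O(1) state update per frame,
        # so the whole clip costs O(t) rather than O(t^2).
        h = self.norm2(x)
        q, k, v = self.q(h), self.k(h), self.v(h)
        S = torch.zeros(b, p, d, d, device=x.device)
        out = []
        for ti in range(t):
            S = self.gamma * S + k[:, ti].unsqueeze(-1) * v[:, ti].unsqueeze(-2)
            out.append(torch.einsum('bpd,bpde->bpe', q[:, ti], S))
        return x + torch.stack(out, dim=1)
```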
RetNet replaces temporal self-attention to reduce complexity from:
- **Quadratic → Linear time**

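The linear-time claim comes from retention's dual forms: the same output can be computed as a decayed causal attention matrix (O(n²)) or as a running state update (O(n) in sequence length). A minimal single-head sketch, with the multi-head structure and normalization of full RetNet omitted for clarity:

```python
import torch

def retention_recurrent(q, k, v, gamma=0.9):
    """Recurrent form: the (dim x dim) state S is updated once per
    step, so cost is linear in sequence length n."""
    n, d = q.shape
    S = torch.zeros(d, d)
    out = []
    for t in range(n):
        S = gamma * S + torch.outer(k[t], v[t])  # decay, then accumulate
        out.append(q[t] @ S)                     # read out at step t
    return torch.stack(out)

def retention_parallel(q, k, v, gamma=0.9):
    """Equivalent parallel form: a causal attention matrix whose
    entries decay as gamma^(i-j) -- O(n^2), like self-attention."""
    n = q.shape[0]
    idx = torch.arange(n)
    decay = gamma ** (idx[:, None] - idx[None, :]).clamp(min=0).float()
    mask = (idx[:, None] >= idx[None, :]).float()
    return ((q @ k.T) * decay * mask) @ v
```

Both functions produce the same outputs; only the recurrent form is used at linear cost over the frame axis.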
---

## Hybrid Model Training Results (UCF101)

| | Epoch | Train Loss | Train Acc | Val Loss | Val Acc | F1 | |
| |------|------------|-----------|----------|---------|-----| |
| | 1 | 4.5275 | 0.0458 | 4.1596 | 0.3542 | 0.3076 | |
| | 2 | 3.6647 | 0.4089 | 2.6496 | 0.7550 | 0.7214 | |
| | 3 | 2.4221 | 0.6995 | 1.5313 | 0.8623 | 0.8509 | |
| | 4 | 1.8874 | 0.7841 | 1.2290 | 0.8961 | 0.8918 | |
| | 5 | 1.7268 | 0.8104 | 1.1584 | 0.9075 | 0.9040 | |
| | 6 | 1.6615 | 0.8145 | 1.1088 | 0.9167 | 0.9142 | |
| | 7 | 1.6076 | 0.8191 | 1.0962 | 0.9202 | 0.9168 | |
| | 8 | 1.5100 | 0.8234 | 1.0865 | **0.9260** | **0.9233** | |
| | 9 | 1.4704 | 0.8232 | 1.0812 | 0.9260 | 0.9226 | |

---

## Best Performance (Hybrid Model)

- **Validation Accuracy:** **92.60%**
- **F1 Score:** 0.9233
- Achieved at Epoch 8

---

## Efficiency Comparison

| Metric | TimeSformer | Hybrid (RetNet) |
|--------|-------------|-----------------|
| Peak GPU Memory | ~9.3–9.8 GB | **~7.2 GB** |
| Training Speed | Slower | **Faster** |
| Temporal Complexity | O(n²) | **O(n)** |

**~25% memory reduction** with comparable performance.

---

## Training Strategy

Due to Kaggle's **12-hour runtime limit**, training was performed in stages:

- Initial training
- Save the best checkpoint
- Resume from the `.safetensors` checkpoint
- Continue training

---

## Training Details

- Mixed Precision Training (`torch.cuda.amp`)
- Checkpoint-based training
- Per-class evaluation reports
- GPU: Kaggle environment

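A minimal sketch of one mixed-precision step with `torch.cuda.amp` (the function and argument names are illustrative; `autocast` and `GradScaler` transparently fall back to full precision when no GPU is present):

```python
import torch
from torch.cuda.amp import GradScaler, autocast

def train_step(model, frames, labels, optimizer, scaler, loss_fn, use_amp=True):
    optimizer.zero_grad()
    # Forward pass under autocast so eligible ops run in float16.
    with autocast(enabled=use_amp):
        loss = loss_fn(model(frames), labels)
    # Scale the loss to avoid float16 gradient underflow; the scaler
    # unscales gradients before the optimizer update.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```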
---

## Base Model

- `facebook/timesformer-base-finetuned-k400`

---

## Usage

```bash
pip install torch torchvision transformers
```
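
After installing the dependencies, the base checkpoint can be loaded through `transformers`. A hedged sketch: `classify_clip` and the dummy-frame shapes are illustrative, and the first call downloads the pretrained weights.

```python
import numpy as np
import torch
from transformers import AutoImageProcessor, TimesformerForVideoClassification

MODEL_ID = "facebook/timesformer-base-finetuned-k400"

def classify_clip(frames, model_id=MODEL_ID):
    """Classify a clip given as a list of HxWx3 uint8 frames.

    Returns the predicted Kinetics-400 label string.
    """
    processor = AutoImageProcessor.from_pretrained(model_id)
    model = TimesformerForVideoClassification.from_pretrained(model_id)
    inputs = processor(frames, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return model.config.id2label[int(logits.argmax(-1))]

# Example (downloads weights; 8 random frames stand in for a decoded clip):
# clip = [np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8) for _ in range(8)]
# print(classify_clip(clip))
```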