---
license: mit
tags:
- video-classification
- timesformer
- retnet
- action-recognition
- ucf101
- hmdb51
- transformers
- efficient-models
datasets:
- ucf101
- hmdb51
---

# 🎬 TimeSformer + RetNet Hybrid for Efficient Video Action Recognition

This project presents a **hybrid architecture** that replaces the temporal attention mechanism in TimeSformer with **RetNet**, achieving:

- ⚡ Faster training
- 🧠 Lower memory usage
- 🎯 Comparable or improved accuracy

---

## 🚀 Model Variants

We trained and evaluated **4 configurations**:

| Model | Dataset |
|------|--------|
| TimeSformer (Baseline) | UCF101 |
| TimeSformer (Baseline) | HMDB51 |
| **TimeSformer + RetNet (Hybrid)** | UCF101 |
| **TimeSformer + RetNet (Hybrid)** | HMDB51 |

---

## 🧠 Proposed Architecture

### 🔹 Baseline
- **TimeSformer**
- Full spatio-temporal attention

### 🔹 Hybrid Model (Proposed)
- Spatial Attention β†’ TimeSformer
- Temporal Modeling β†’ **RetNet**

👉 RetNet replaces temporal self-attention, reducing temporal complexity from **quadratic to linear** in the number of frames.
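To make the complexity claim concrete, here is a minimal single-head sketch of retention in its recurrent form (NumPy, illustrative only; the function name and the fixed decay `gamma` are assumptions, not this project's code). Instead of a T×T attention matrix over frames, a constant-size state is updated once per time step, so cost grows linearly with sequence length:

```python
import numpy as np

def retention_recurrent(q, k, v, gamma=0.9):
    """O(T) recurrent form of single-head retention over T time steps.

    q, k, v: arrays of shape (T, d). Computes, for each step t,
        sum_{m <= t} gamma**(t - m) * (q[t] @ k[m]) * v[m],
    but with a (d, d) running state instead of a T x T matrix.
    """
    T, d = q.shape
    state = np.zeros((d, d))                          # constant-size summary of the past
    out = np.empty_like(v)
    for t in range(T):
        state = gamma * state + np.outer(k[t], v[t])  # decay, then accumulate
        out[t] = q[t] @ state                         # read out with the query
    return out
```

The same quantity computed naively needs a double loop over time (the quadratic "parallel" form); the recurrent form above is what keeps per-step cost and memory constant in sequence length.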

---

## 📊 Hybrid Model Training Results (UCF101)

| Epoch | Train Loss | Train Acc | Val Loss | Val Acc | F1 |
|------|------------|-----------|----------|---------|-----|
| 1 | 4.5275 | 0.0458 | 4.1596 | 0.3542 | 0.3076 |
| 2 | 3.6647 | 0.4089 | 2.6496 | 0.7550 | 0.7214 |
| 3 | 2.4221 | 0.6995 | 1.5313 | 0.8623 | 0.8509 |
| 4 | 1.8874 | 0.7841 | 1.2290 | 0.8961 | 0.8918 |
| 5 | 1.7268 | 0.8104 | 1.1584 | 0.9075 | 0.9040 |
| 6 | 1.6615 | 0.8145 | 1.1088 | 0.9167 | 0.9142 |
| 7 | 1.6076 | 0.8191 | 1.0962 | 0.9202 | 0.9168 |
| 8 | 1.5100 | 0.8234 | 1.0865 | **0.9260** | **0.9233** |
| 9 | 1.4704 | 0.8232 | 1.0812 | 0.9260 | 0.9226 |

---

## πŸ† Best Performance (Hybrid Model)

- **Validation Accuracy:** **92.60%**
- **F1 Score:** 0.9233  
- Achieved at Epoch 8  

---

## ⚡ Efficiency Comparison

| Metric | TimeSformer | Hybrid (RetNet) |
|-------|------------|----------------|
| Peak GPU Memory | ~9.3–9.8 GB | **~7.2 GB** ✅ |
| Training Speed | Slower | **Faster** ✅ |
| Temporal Complexity | O(n²) | **O(n)** ✅ |

👉 **~25% peak-memory reduction** with comparable accuracy.

---

## πŸ” Training Strategy

Due to Kaggle’s **12-hour runtime limit**, training was performed in stages:

- Initial training
- Save best checkpoint
- Resume from `.safetensors`
- Continue training
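The staged loop can be sketched as follows (illustrative only; `run_stage`, the two callbacks, and the JSON metadata file are assumptions — the project itself persists model weights as `.safetensors` alongside such bookkeeping):

```python
import json
import os

def run_stage(train_epoch, val_acc_fn, ckpt_path, epochs_this_stage):
    """One bounded training session: resume from checkpoint metadata if it
    exists, train a fixed number of epochs, persist progress every epoch."""
    start_epoch, best_acc = 0, 0.0
    if os.path.exists(ckpt_path):                 # resume a previous session
        with open(ckpt_path) as f:
            meta = json.load(f)
        start_epoch, best_acc = meta["next_epoch"], meta["best_val_acc"]
    for epoch in range(start_epoch, start_epoch + epochs_this_stage):
        train_epoch(epoch)
        acc = val_acc_fn(epoch)
        best_acc = max(best_acc, acc)             # save best weights here in practice
        with open(ckpt_path, "w") as f:
            json.dump({"next_epoch": epoch + 1, "best_val_acc": best_acc}, f)
    return best_acc
```

Calling `run_stage` twice against the same checkpoint path continues the epoch count exactly where the previous session stopped.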

---

## βš™οΈ Training Details

- Mixed-precision training (`torch.cuda.amp`)
- Staged, checkpoint-based training
- Per-class evaluation reports
- Hardware: Kaggle GPU environment
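The mixed-precision setup follows the standard `torch.cuda.amp` recipe (a generic sketch, not this project's training loop): run the forward pass under autocast and scale the loss so small fp16 gradients do not underflow. On a CPU-only machine the same step works with bfloat16 autocast and the scaler disabled:

```python
import torch
from torch import nn

def amp_train_step(model, batch, labels, optimizer, scaler, device_type):
    """One mixed-precision training step: forward under autocast,
    backward through a GradScaler."""
    optimizer.zero_grad()
    autocast_dtype = torch.float16 if device_type == "cuda" else torch.bfloat16
    with torch.autocast(device_type=device_type, dtype=autocast_dtype):
        loss = nn.functional.cross_entropy(model(batch), labels)
    scaler.scale(loss).backward()   # scale up to protect small fp16 grads
    scaler.step(optimizer)          # unscales grads, then optimizer.step()
    scaler.update()                 # adapt the scale factor for the next step
    return loss.item()
```

Constructing the scaler as `torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())` lets the same step function run unchanged on CPU and GPU.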

---

## 📦 Base Model

- `facebook/timesformer-base-finetuned-k400`

---

## 🚀 Usage

```bash
pip install torch torchvision transformers
```