sumit7488 committed on
Commit 1293b6c · verified · 1 Parent(s): d86bf87

Create README.md

Files changed (1)
  1. README.md +120 -0
README.md ADDED
@@ -0,0 +1,120 @@
+ ---
+ license: mit
+ tags:
+ - video-classification
+ - timesformer
+ - retnet
+ - action-recognition
+ - ucf101
+ - hmdb51
+ - transformers
+ - efficient-models
+ datasets:
+ - ucf101
+ - hmdb51
+ ---
+
+ # 🎬 TimeSformer + RetNet Hybrid for Efficient Video Action Recognition
+
+ This project presents a **hybrid architecture** that replaces the temporal attention mechanism in TimeSformer with **RetNet**, achieving:
+
+ - ⚡ Faster training
+ - 🧠 Lower memory usage
+ - 🎯 Comparable or improved accuracy
+
+ ---
+
+ ## 🚀 Model Variants
+
+ We trained and evaluated **4 configurations**:
+
+ | Model | Dataset |
+ |------|--------|
+ | TimeSformer (Baseline) | UCF101 |
+ | TimeSformer (Baseline) | HMDB51 |
+ | **TimeSformer + RetNet (Hybrid)** | UCF101 |
+ | **TimeSformer + RetNet (Hybrid)** | HMDB51 |
+
+ ---
+
+ ## 🧠 Proposed Architecture
+
+ ### 🔹 Baseline
+ - **TimeSformer**
+ - Full spatio-temporal attention
+
+ ### 🔹 Hybrid Model (Proposed)
+ - Spatial Attention → TimeSformer
+ - Temporal Modeling → **RetNet**
+
+ 👉 RetNet replaces temporal self-attention, reducing the cost of temporal modeling from **quadratic to linear** in the temporal sequence length.
+
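To illustrate why swapping temporal attention for retention can make the cost linear in the number of frames, here is a minimal pure-Python sketch of single-head retention in its two equivalent forms. It is not this project's implementation: the function names, the scalar decay `gamma`, and the toy inputs are assumptions, and RetNet's rotation, gating, and normalization are left out.

```python
# Simplified single-head retention over T time steps (e.g. frames).
# Assumption: one scalar decay `gamma`; no xPos rotation, gating, or norm.

def retention_parallel(Q, K, V, gamma):
    """Quadratic form: out[n] = sum over m <= n of gamma**(n-m) * (Q[n].K[m]) * V[m]."""
    T, d_v = len(Q), len(V[0])
    out = []
    for n in range(T):
        row = [0.0] * d_v
        for m in range(n + 1):  # every step looks back at all earlier steps
            score = sum(q * k for q, k in zip(Q[n], K[m])) * gamma ** (n - m)
            for j in range(d_v):
                row[j] += score * V[m][j]
        out.append(row)
    return out


def retention_recurrent(Q, K, V, gamma):
    """Linear form: S_n = gamma * S_{n-1} + K[n]^T V[n], then out[n] = Q[n] S_n."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # fixed-size state, independent of T
    out = []
    for q, k, v in zip(Q, K, V):
        for i in range(d_k):
            for j in range(d_v):
                S[i][j] = gamma * S[i][j] + k[i] * v[j]
        out.append([sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)])
    return out


# Toy inputs: 4 time steps, 2-dimensional queries, keys, and values.
Q = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5], [-0.2, 0.4]]
K = [[0.5, 0.1], [-0.3, 0.2], [0.4, 0.4], [0.1, -0.6]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 2.0]]

par = retention_parallel(Q, K, V, gamma=0.9)
rec = retention_recurrent(Q, K, V, gamma=0.9)
max_diff = max(abs(a - b) for pr, rr in zip(par, rec) for a, b in zip(pr, rr))
print("max |parallel - recurrent| =", max_diff)
```

The recurrent form does a constant amount of work per time step, because the state `S` has a fixed size. The parallel form revisits every earlier step, which is where the quadratic term comes from.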
+ ---
+
+ ## 📊 Hybrid Model Training Results (UCF101)
+
+ | Epoch | Train Loss | Train Acc | Val Loss | Val Acc | F1 |
+ |------|------------|-----------|----------|---------|-----|
+ | 1 | 4.5275 | 0.0458 | 4.1596 | 0.3542 | 0.3076 |
+ | 2 | 3.6647 | 0.4089 | 2.6496 | 0.7550 | 0.7214 |
+ | 3 | 2.4221 | 0.6995 | 1.5313 | 0.8623 | 0.8509 |
+ | 4 | 1.8874 | 0.7841 | 1.2290 | 0.8961 | 0.8918 |
+ | 5 | 1.7268 | 0.8104 | 1.1584 | 0.9075 | 0.9040 |
+ | 6 | 1.6615 | 0.8145 | 1.1088 | 0.9167 | 0.9142 |
+ | 7 | 1.6076 | 0.8191 | 1.0962 | 0.9202 | 0.9168 |
+ | 8 | 1.5100 | 0.8234 | 1.0865 | **0.9260** | **0.9233** |
+ | 9 | 1.4704 | 0.8232 | 1.0812 | 0.9260 | 0.9226 |
+
+ ---
+
+ ## 🏆 Best Performance (Hybrid Model)
+
+ - **Validation Accuracy:** **92.60%**
+ - **F1 Score:** 0.9233
+ - Achieved at epoch 8 on UCF101
+
+ ---
+
+ ## ⚡ Efficiency Comparison
+
+ | Metric | TimeSformer | Hybrid (RetNet) |
+ |-------|------------|----------------|
+ | Peak GPU Memory | ~9.3–9.8 GB | **~7.2 GB** ✅ |
+ | Training Speed | Slower | **Faster** ✅ |
+ | Temporal Complexity | O(n²) | **O(n)** ✅ |
+
+ 👉 **~25% memory reduction** with comparable performance.
+
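As a quick check of the "~25%" figure, the short calculation below derives the relative reduction from the peak-memory values in the table above. The variable names are illustrative; the only inputs are the approximate GB values reported there.

```python
# Approximate peak GPU memory values taken from the table above (in GB).
baseline_gb_low, baseline_gb_high = 9.3, 9.8
hybrid_gb = 7.2

# Relative reduction = 1 - hybrid / baseline, at both ends of the baseline range.
reduction_low = 1 - hybrid_gb / baseline_gb_low
reduction_high = 1 - hybrid_gb / baseline_gb_high

# Same calculation against the midpoint of the baseline range.
baseline_gb_mid = (baseline_gb_low + baseline_gb_high) / 2
reduction_mid = 1 - hybrid_gb / baseline_gb_mid

print(f"reduction range: {reduction_low:.1%} to {reduction_high:.1%}")
print(f"reduction at midpoint: {reduction_mid:.1%}")
```

The reduction falls roughly between 23% and 27% depending on which baseline value is used, which is consistent with the rounded ~25% claim.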
+ ---
+
+ ## 🔁 Training Strategy
+
+ Due to Kaggle’s **12-hour runtime limit**, training was performed in stages:
+
+ - Initial training
+ - Save best checkpoint
+ - Resume from `.safetensors`
+ - Continue training
+
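The staged workflow above can be sketched without any framework as follows. This is a hypothetical illustration, not the project's training script: `run_stage`, its parameters, and the JSON state file are assumptions, and the JSON file only stands in for the `.safetensors` weights file that a real run would save and reload.

```python
import json
import os
import time


def run_stage(ckpt_path, total_epochs, budget_seconds, train_one_epoch,
              clock=time.monotonic):
    """Train until all epochs are done or the time budget for this session is spent.

    `train_one_epoch(epoch)` is a caller-supplied function returning the
    validation accuracy for that epoch. Progress is written to `ckpt_path`
    after every epoch so that the next session can resume from it.
    """
    state = {"epoch": 0, "best_val_acc": 0.0}
    if os.path.exists(ckpt_path):
        # Resume: pick up where the previous session stopped.
        with open(ckpt_path) as f:
            state = json.load(f)

    start = clock()
    while state["epoch"] < total_epochs:
        if clock() - start >= budget_seconds:
            break  # out of time; the next session continues from the checkpoint
        val_acc = train_one_epoch(state["epoch"])
        state["epoch"] += 1
        # A real run would also write the model weights here when val_acc improves.
        state["best_val_acc"] = max(state["best_val_acc"], val_acc)
        with open(ckpt_path, "w") as f:
            json.dump(state, f)
    return state
```

Calling `run_stage` once per session with the same `ckpt_path` repeats the train, save, resume cycle until `total_epochs` is reached. The `clock` parameter exists only so the time budget can be simulated in tests.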
+ ---
+
+ ## ⚙️ Training Details
+
+ - Mixed Precision Training (`torch.cuda.amp`)
+ - Checkpoint-based training
+ - Per-class evaluation reports
+ - GPU: Kaggle environment
+
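For readers unfamiliar with per-class reports, the sketch below computes precision, recall, and F1 for each class from lists of true and predicted labels. It is a generic illustration under assumptions: the function names and toy labels are made up, and the macro average shown here may differ from the averaging behind the F1 column in the results table, which this README does not specify.

```python
from collections import Counter


def per_class_report(y_true, y_pred):
    """Return {class: {"precision", "recall", "f1", "support"}} for each class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    report = {}
    for c in sorted(set(y_true) | set(y_pred)):
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[c] = {"precision": precision, "recall": recall,
                     "f1": f1, "support": tp[c] + fn[c]}
    return report


def macro_f1(report):
    """Unweighted mean of the per-class F1 scores (an assumed averaging choice)."""
    return sum(r["f1"] for r in report.values()) / len(report)


# Toy labels for illustration only; not results from this project.
y_true = ["Archery", "Archery", "Basketball", "Basketball", "Biking"]
y_pred = ["Archery", "Basketball", "Basketball", "Basketball", "Biking"]
report = per_class_report(y_true, y_pred)
for name, row in report.items():
    print(name, {k: round(v, 3) for k, v in row.items()})
print("macro F1:", round(macro_f1(report), 3))
```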
+ ---
+
+ ## 📦 Base Model
+
+ - `facebook/timesformer-base-finetuned-k400`
+
+ ---
+
+ ## 🚀 Usage
+
+ ```bash
+ pip install torch torchvision transformers
+ ```