sumit7488 committed · Commit a42b837 · verified · 1 Parent(s): 68c08e9

Create README.md

Files changed (1)
  1. README.md +127 -0
README.md ADDED
@@ -0,0 +1,127 @@
+ ---
+ license: apache-2.0
+ tags:
+ - video-classification
+ - timesformer
+ - retnet
+ - action-recognition
+ - hmdb51
+ - efficient-models
+ - transformers
+ datasets:
+ - hmdb51
+ ---
+
+ # 🎬 RetFormer: Efficient TimeSformer + RetNet for Video Action Recognition
+
+ RetFormer is a hybrid video classification model that replaces the **temporal attention** in TimeSformer with **RetNet**, achieving:
+
+ - ⚡ Lower memory usage
+ - 🚀 Faster training
+ - 🎯 Competitive accuracy
+
+ ---
+
+ ## 🧠 Model Architecture
+
+ ### 🔹 RetFormer (Proposed)
+ - Spatial Modeling → TimeSformer
+ - Temporal Modeling → **RetNet**
+
+ 👉 This replaces quadratic temporal attention, O(n²) in the number of frames n, with **linear-time temporal modeling (O(n))**
+
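The linear-time claim can be illustrated with a minimal sketch. It assumes the temporal module uses RetNet-style retention with a single decay factor `gamma`; the README does not show the actual implementation, so the function names, sizes, and decay value below are illustrative only. The sketch leaves out RetNet's multi-scale heads, positional rotation, and normalization, and only shows that a recurrent pass with one state update per frame gives the same output as the quadratic masked form.

```python
import random


def retention_recurrent(q, k, v, gamma):
    """Recurrent retention: one state update per frame, O(n) in frame count."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # decayed running sum of k^T v
    outputs = []
    for q_t, k_t, v_t in zip(q, k, v):
        # S_t = gamma * S_{t-1} + k_t^T v_t
        state = [[gamma * state[i][j] + k_t[i] * v_t[j] for j in range(d_v)]
                 for i in range(d_k)]
        # o_t = q_t S_t
        outputs.append([sum(q_t[i] * state[i][j] for i in range(d_k))
                        for j in range(d_v)])
    return outputs


def retention_parallel(q, k, v, gamma):
    """Parallel retention with an explicit causal decay mask, O(n^2) in frame count."""
    d_v = len(v[0])
    outputs = []
    for t in range(len(q)):
        out_t = [0.0] * d_v
        for s in range(t + 1):  # causal: only frames s <= t contribute
            score = gamma ** (t - s) * sum(a * b for a, b in zip(q[t], k[s]))
            for j in range(d_v):
                out_t[j] += score * v[s][j]
        outputs.append(out_t)
    return outputs


random.seed(0)
frames, d_k, d_v = 8, 4, 4  # illustrative sizes, not the model's real dimensions
q = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(frames)]
k = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(frames)]
v = [[random.gauss(0, 1) for _ in range(d_v)] for _ in range(frames)]

out_rec = retention_recurrent(q, k, v, gamma=0.9)
out_par = retention_parallel(q, k, v, gamma=0.9)
max_diff = max(abs(a - b) for ra, rb in zip(out_rec, out_par) for a, b in zip(ra, rb))
print(f"max difference between the two forms: {max_diff:.2e}")
```

The recurrent form keeps a fixed-size state regardless of clip length, which is where the memory saving over a full frame-by-frame attention matrix would come from.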
+ ---
+
+ ## 📊 Dataset
+
+ - **HMDB51**
+ - 51 human action classes
+ - Complex motion patterns
+ - Smaller and more challenging than UCF101
+
+ ---
+
+ ## 🔁 Training Strategy
+
+ Training was performed in multiple stages due to runtime limits:
+
+ - Initial training (Epoch 1–10)
+ - Checkpoint saving
+ - Resumed training (Epoch 11–14)
+ - Early stopping applied
+
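The staged schedule can be sketched as follows. This is a hedged illustration, not the author's training script: the `EarlyStopping` class is hypothetical, and `patience=3` is an assumption that happens to be consistent with the results table (best epoch 11, stop after epoch 14). The validation accuracies are copied from that table. A real checkpoint would also need to hold the model and optimizer state; only the early-stopping bookkeeping is serialized here.

```python
import json
from dataclasses import asdict, dataclass

# Validation accuracy per epoch (1-14), copied from the results table
VAL_ACC = [0.0967, 0.3654, 0.5150, 0.5869, 0.6255, 0.6242, 0.6418,
           0.6359, 0.6556, 0.6614, 0.6699, 0.6621, 0.6595, 0.6588]


@dataclass
class EarlyStopping:
    patience: int = 3  # assumed value; the README does not state it
    best_score: float = float("-inf")
    best_epoch: int = 0
    bad_epochs: int = 0

    def step(self, epoch: int, score: float) -> bool:
        """Record one epoch's validation score; return True when training should stop."""
        if score > self.best_score:
            self.best_score, self.best_epoch, self.bad_epochs = score, epoch, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Stage 1: epochs 1-10, then save the bookkeeping state as a checkpoint
stopper = EarlyStopping()
for epoch in range(1, 11):
    stopper.step(epoch, VAL_ACC[epoch - 1])
checkpoint = json.dumps(asdict(stopper))

# Stage 2: restore the state and resume from epoch 11
stopper = EarlyStopping(**json.loads(checkpoint))
stopped_at = None
for epoch in range(11, len(VAL_ACC) + 1):
    if stopper.step(epoch, VAL_ACC[epoch - 1]):
        stopped_at = epoch
        break

print(f"best epoch: {stopper.best_epoch}, stopped after epoch: {stopped_at}")
```

Restoring the counter along with the weights matters: if the early-stopping state were reset on resume, the stop point would shift relative to an uninterrupted run.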
+ ---
+
+ ## 📈 Training Results (Epoch 1–14)
+
+ | Epoch | Train Loss | Train Acc | Val Loss | Val Acc | F1 |
+ |------|------------|-----------|----------|---------|-----|
+ | 1 | 3.9312 | 0.0350 | 3.8099 | 0.0967 | 0.0855 |
+ | 2 | 3.6330 | 0.1791 | 3.2948 | 0.3654 | 0.3149 |
+ | 3 | 3.0989 | 0.3691 | 2.6927 | 0.5150 | 0.4579 |
+ | 4 | 2.6278 | 0.5048 | 2.2879 | 0.5869 | 0.5503 |
+ | 5 | 2.3198 | 0.5782 | 2.0438 | 0.6255 | 0.5961 |
+ | 6 | 2.1387 | 0.6194 | 1.9152 | 0.6242 | 0.6074 |
+ | 7 | 1.9876 | 0.6657 | 1.8369 | 0.6418 | 0.6308 |
+ | 8 | 1.9140 | 0.6936 | 1.7966 | 0.6359 | 0.6188 |
+ | 9 | 1.8539 | 0.7041 | 1.7619 | 0.6556 | 0.6426 |
+ | 10 | 1.8149 | 0.7244 | 1.7523 | 0.6614 | 0.6512 |
+ | 11 | 1.7325 | 0.7524 | 1.7315 | **0.6699** | **0.6614** |
+ | 12 | 1.7036 | 0.7584 | 1.7469 | 0.6621 | 0.6515 |
+ | 13 | 1.6682 | 0.7717 | 1.7504 | 0.6595 | 0.6496 |
+ | 14 | 1.6344 | 0.7785 | 1.7488 | 0.6588 | 0.6494 |
+
+ ---
+
+ ## 🏆 Best Performance
+
+ - **Validation Accuracy:** **66.99%**
+ - **F1 Score:** 0.6614
+ - Achieved at **Epoch 11**
+
+ ---
+
+ ## ⚙️ Training Details
+
+ - Peak GPU Memory: **~7.2 GB**
+ - Training Time per Epoch: ~52 minutes
+ - Evaluation Time: ~8 minutes
+ - Mixed Precision Training (`torch.cuda.amp`)
+ - Early stopping triggered after Epoch 14
+
+ ---
+
+ ## 📌 Observations
+
+ - Validation accuracy improves overall until **Epoch 11**, with minor dips at Epochs 6 and 8
+ - Slight decline afterward while training loss keeps falling → early overfitting
+ - Lower accuracy than baseline (expected for hybrid trade-off)
+
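The overfitting reading can be checked against the results table. The short sketch below copies the rows for epochs 11 to 14 and computes the gap between training and validation accuracy; the helper name `generalization_gap` is illustrative and not part of any library.

```python
# (epoch, train_loss, train_acc, val_loss, val_acc) copied from the results table
ROWS = [
    (11, 1.7325, 0.7524, 1.7315, 0.6699),
    (12, 1.7036, 0.7584, 1.7469, 0.6621),
    (13, 1.6682, 0.7717, 1.7504, 0.6595),
    (14, 1.6344, 0.7785, 1.7488, 0.6588),
]


def generalization_gap(row):
    """Train accuracy minus validation accuracy for one epoch."""
    _, _, train_acc, _, val_acc = row
    return train_acc - val_acc


gaps = {row[0]: round(generalization_gap(row), 4) for row in ROWS}
for epoch, gap in gaps.items():
    print(f"epoch {epoch}: train/val accuracy gap = {gap:.4f}")
```

A gap that widens every epoch while validation accuracy stays below its epoch 11 peak is the usual signature of a model starting to fit the training set rather than generalizing.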
+ ---
+
+ ## ⚡ Efficiency Advantage
+
+ | Metric | TimeSformer | RetFormer |
+ |-------|------------|----------|
+ | Peak GPU Memory | ~9.3 GB | **~7.2 GB** ✅ |
+ | Temporal Complexity | O(n²) | **O(n)** ✅ |
+ | Speed | Slower | Faster |
+
+ 👉 **~23% reduction in GPU memory** (~9.3 GB → ~7.2 GB)
+
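The relative saving can be checked from the two peak-memory figures in the table. Both figures are approximate, so the result is approximate too.

```python
baseline_gb = 9.3   # TimeSformer peak GPU memory from the table (approximate)
retformer_gb = 7.2  # RetFormer peak GPU memory from the table (approximate)

saved_gb = baseline_gb - retformer_gb
reduction = saved_gb / baseline_gb  # fraction of the baseline that is saved

print(f"{saved_gb:.1f} GB saved, {reduction:.1%} relative reduction")
```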
+ ---
+
+ ## 🔍 Key Insight
+
+ RetFormer demonstrates that:
+
+ - Efficient temporal modeling can **significantly reduce memory usage**
+ - Performance remains **competitive with baseline models**
+ - Trade-off exists between **efficiency and maximum accuracy**
+
+ ---
+
+ ## 🚀 Usage
+
+ ```bash
+ pip install torch torchvision transformers