---
license: mit
tags:
- video-classification
- timesformer
- retnet
- action-recognition
- ucf101
- hmdb51
- transformers
- efficient-models
datasets:
- ucf101
- hmdb51
---
# 🎬 TimeSformer + RetNet Hybrid for Efficient Video Action Recognition
This project presents a **hybrid architecture** that replaces the temporal attention mechanism in TimeSformer with **RetNet**, achieving:
- ⚡ Faster training
- 🧠 Lower memory usage
- 🎯 Comparable or improved accuracy
---
## 🚀 Model Variants
We trained and evaluated **4 configurations**:
| Model | Dataset |
|------|--------|
| TimeSformer (Baseline) | UCF101 |
| TimeSformer (Baseline) | HMDB51 |
| **TimeSformer + RetNet (Hybrid)** | UCF101 |
| **TimeSformer + RetNet (Hybrid)** | HMDB51 |
---
## 🧠 Proposed Architecture
### 🔹 Baseline
- **TimeSformer**
- Full spatio-temporal attention
### 🔹 Hybrid Model (Proposed)
- Spatial Attention → TimeSformer
- Temporal Modeling → **RetNet**

👉 RetNet replaces temporal self-attention, reducing the temporal complexity from **quadratic to linear** in the number of frames.
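The linear-time form comes from rewriting retention's causal, decay-weighted attention as a recurrence over frames. A minimal NumPy sketch of this equivalence (illustrative only; it omits RetNet's per-head multi-scale decays, rotary embeddings, and normalization):

```python
import numpy as np

def retention_parallel(Q, K, V, gamma):
    """Parallel (training) form: O = (Q K^T * D) V with a causal decay mask D."""
    n = Q.shape[0]
    idx = np.arange(n)
    # D[i, j] = gamma^(i-j) for j <= i, else 0 (causal, exponentially decayed)
    D = np.tril(gamma ** (idx[:, None] - idx[None, :]))
    return (Q @ K.T * D) @ V

def retention_recurrent(Q, K, V, gamma):
    """Recurrent (inference) form: O(n) time, constant-size state per step."""
    S = np.zeros((Q.shape[1], V.shape[1]))
    out = np.empty_like(V)
    for t in range(Q.shape[0]):
        S = gamma * S + np.outer(K[t], V[t])  # decayed state update
        out[t] = Q[t] @ S
    return out

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 8, 4))
assert np.allclose(retention_parallel(Q, K, V, 0.9),
                   retention_recurrent(Q, K, V, 0.9))
```

Both forms compute the same outputs; the recurrent one is what makes temporal modeling linear in sequence length.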
---
## 📊 Hybrid Model Training Results (UCF101)
| Epoch | Train Loss | Train Acc | Val Loss | Val Acc | F1 |
|------|------------|-----------|----------|---------|-----|
| 1 | 4.5275 | 0.0458 | 4.1596 | 0.3542 | 0.3076 |
| 2 | 3.6647 | 0.4089 | 2.6496 | 0.7550 | 0.7214 |
| 3 | 2.4221 | 0.6995 | 1.5313 | 0.8623 | 0.8509 |
| 4 | 1.8874 | 0.7841 | 1.2290 | 0.8961 | 0.8918 |
| 5 | 1.7268 | 0.8104 | 1.1584 | 0.9075 | 0.9040 |
| 6 | 1.6615 | 0.8145 | 1.1088 | 0.9167 | 0.9142 |
| 7 | 1.6076 | 0.8191 | 1.0962 | 0.9202 | 0.9168 |
| 8 | 1.5100 | 0.8234 | 1.0865 | **0.9260** | **0.9233** |
| 9 | 1.4704 | 0.8232 | 1.0812 | 0.9260 | 0.9226 |
---
## πŸ† Best Performance (Hybrid Model)
- **Validation Accuracy:** **92.60%**
- **F1 Score:** 0.9233
- Achieved at Epoch 8
---
## ⚡ Efficiency Comparison
| Metric | TimeSformer | Hybrid (RetNet) |
|-------|------------|----------------|
| Peak GPU Memory | ~9.3–9.8 GB | **~7.2 GB** ✅ |
| Training Speed | Slower | **Faster** ✅ |
| Temporal Complexity | O(n²) | **O(n)** ✅ |

👉 **~25% memory reduction** with comparable performance.
---
## πŸ” Training Strategy
Due to Kaggle's **12-hour runtime limit**, training was performed in stages:
- Initial training
- Save best checkpoint
- Resume from `.safetensors`
- Continue training
---
## βš™οΈ Training Details
- Mixed Precision Training (`torch.cuda.amp`)
- Checkpoint-based training
- Per-class evaluation reports
- GPU: Kaggle environment
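A minimal mixed-precision training step with `torch.cuda.amp` (a device-agnostic toy; the model, data, and hyperparameters are placeholders, not the project's actual setup):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(32, 10).to(device)  # stand-in for the hybrid model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=device == "cuda")

x = torch.randn(8, 32, device=device)
y = torch.randint(0, 10, (8,), device=device)

optimizer.zero_grad()
# Autocast runs the forward pass in fp16 where safe (a no-op on CPU here).
with torch.autocast(device_type=device, enabled=device == "cuda"):
    loss = torch.nn.functional.cross_entropy(model(x), y)
scaler.scale(loss).backward()  # scale loss to avoid fp16 gradient underflow
scaler.step(optimizer)         # unscales grads; skips the step on inf/nan
scaler.update()
```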
---
## 📦 Base Model
- `facebook/timesformer-base-finetuned-k400`
---
## 🚀 Usage
```bash
pip install torch torchvision transformers
```
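
Since the card doesn't name a published repo for the hybrid weights, the sketch below loads the baseline `facebook/timesformer-base-finetuned-k400` model via 🤗 `transformers`; the frame-sampling helper and the dummy frames are illustrative:

```python
import numpy as np

def sample_frame_indices(num_total_frames: int, clip_len: int = 8) -> list[int]:
    """Pick `clip_len` evenly spaced frame indices from a decoded video."""
    return np.linspace(0, num_total_frames - 1, clip_len).astype(int).tolist()

if __name__ == "__main__":
    import torch
    from transformers import AutoImageProcessor, TimesformerForVideoClassification

    ckpt = "facebook/timesformer-base-finetuned-k400"
    processor = AutoImageProcessor.from_pretrained(ckpt)
    model = TimesformerForVideoClassification.from_pretrained(ckpt)

    # 8 dummy RGB frames; in practice, decode a clip and pick frames
    # with sample_frame_indices.
    video = [np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)
             for _ in range(8)]
    inputs = processor(video, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    print(model.config.id2label[logits.argmax(-1).item()])
```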