---
license: mit
tags:
- video-classification
- timesformer
- retnet
- action-recognition
- ucf101
- hmdb51
- transformers
- efficient-models
datasets:
- ucf101
- hmdb51
---

# 🎬 TimeSformer + RetNet Hybrid for Efficient Video Action Recognition

This project presents a **hybrid architecture** that replaces the temporal attention mechanism in TimeSformer with **RetNet**, achieving:

- ⚡ Faster training
- 🧠 Lower memory usage
- 🎯 Comparable or improved accuracy

---

## 🚀 Model Variants

We trained and evaluated **4 configurations**:

| Model | Dataset |
|-------|---------|
| TimeSformer (Baseline) | UCF101 |
| TimeSformer (Baseline) | HMDB51 |
| **TimeSformer + RetNet (Hybrid)** | UCF101 |
| **TimeSformer + RetNet (Hybrid)** | HMDB51 |

---

## 🧠 Proposed Architecture

### 🔹 Baseline

- **TimeSformer**
- Full spatio-temporal attention

### 🔹 Hybrid Model (Proposed)

- Spatial Attention → TimeSformer
- Temporal Modeling → **RetNet**

👉 RetNet replaces temporal self-attention, reducing temporal complexity from **quadratic to linear** in the number of frames.

---

## 📊 Hybrid Model Training Results (UCF101)

| Epoch | Train Loss | Train Acc | Val Loss | Val Acc | F1 |
|-------|------------|-----------|----------|---------|-----|
| 1 | 4.5275 | 0.0458 | 4.1596 | 0.3542 | 0.3076 |
| 2 | 3.6647 | 0.4089 | 2.6496 | 0.7550 | 0.7214 |
| 3 | 2.4221 | 0.6995 | 1.5313 | 0.8623 | 0.8509 |
| 4 | 1.8874 | 0.7841 | 1.2290 | 0.8961 | 0.8918 |
| 5 | 1.7268 | 0.8104 | 1.1584 | 0.9075 | 0.9040 |
| 6 | 1.6615 | 0.8145 | 1.1088 | 0.9167 | 0.9142 |
| 7 | 1.6076 | 0.8191 | 1.0962 | 0.9202 | 0.9168 |
| 8 | 1.5100 | 0.8234 | 1.0865 | **0.9260** | **0.9233** |
| 9 | 1.4704 | 0.8232 | 1.0812 | 0.9260 | 0.9226 |

---

## 🏆 Best Performance (Hybrid Model)

- **Validation Accuracy:** **92.60%**
- **F1 Score:** 0.9233
- Achieved at Epoch 8

---

## ⚡ Efficiency Comparison

| Metric | TimeSformer | Hybrid (RetNet) |
|--------|-------------|-----------------|
| Peak GPU Memory | ~9.3–9.8 GB | **~7.2 GB** ✅ |
| Training Speed | Slower | **Faster** ✅ |
| Temporal Complexity | O(n²) | **O(n)** ✅ |

👉 **~25% memory reduction** with comparable performance.

---

## 🔁 Training Strategy

Due to Kaggle's **12-hour runtime limit**, training was performed in stages:

- Initial training
- Save best checkpoint
- Resume from `.safetensors`
- Continue training

---

## ⚙️ Training Details

- Mixed Precision Training (`torch.cuda.amp`)
- Checkpoint-based training
- Per-class evaluation reports
- GPU: Kaggle environment

---

## 📦 Base Model

- `facebook/timesformer-base-finetuned-k400`

---

## 🚀 Usage

```bash
pip install torch torchvision transformers
```