🎬 RetFormer: Efficient TimeSformer + RetNet for Video Action Recognition

RetFormer is a hybrid video classification model that replaces the temporal attention in TimeSformer with RetNet, achieving:

🧠 Model Architecture

👉 This replaces quadratic temporal attention, O(n²) in the number of frames n, with linear-time temporal modeling, O(n)
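To make the complexity claim concrete, here is a minimal sketch of single-head retention over the frame axis, written in plain Python. It is not RetFormer's actual code: the function names `retention_parallel` and `retention_recurrent` are hypothetical, the decay `gamma` is an assumed scalar hyperparameter, and the positional rotation, normalization, and gating used in full RetNet are left out. The parallel form touches every (frame, earlier frame) pair, while the recurrent form carries a fixed-size state and does one update per frame.

```python
from typing import List

Matrix = List[List[float]]


def dot(a: List[float], b: List[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retention_parallel(q: Matrix, k: Matrix, v: Matrix, gamma: float) -> Matrix:
    """Quadratic form: o_t = sum over s <= t of gamma^(t-s) * (q_t . k_s) * v_s."""
    out = []
    for t in range(len(q)):
        acc = [0.0] * len(v[0])
        for s in range(t + 1):  # causal: only current and earlier frames
            weight = (gamma ** (t - s)) * dot(q[t], k[s])
            acc = [a + weight * x for a, x in zip(acc, v[s])]
        out.append(acc)
    return out


def retention_recurrent(q: Matrix, k: Matrix, v: Matrix, gamma: float) -> Matrix:
    """Linear form: S_t = gamma * S_(t-1) + k_t^T v_t, then o_t = q_t S_t."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # fixed size, independent of n
    out = []
    for q_t, k_t, v_t in zip(q, k, v):
        state = [
            [gamma * state[i][j] + k_t[i] * v_t[j] for j in range(d_v)]
            for i in range(d_k)
        ]
        out.append([sum(q_t[i] * state[i][j] for i in range(d_k)) for j in range(d_v)])
    return out


if __name__ == "__main__":
    # Toy input: 4 frames, 2-dimensional queries, keys, and values.
    q = [[1.0, 0.5], [0.2, -0.3], [0.7, 0.1], [-0.4, 0.9]]
    k = [[0.3, 0.8], [-0.5, 0.4], [0.6, -0.2], [0.1, 0.7]]
    v = [[1.0, 2.0], [0.5, -1.0], [-0.3, 0.4], [0.9, 0.2]]
    print(retention_parallel(q, k, v, gamma=0.9))
    print(retention_recurrent(q, k, v, gamma=0.9))
```

Both functions return the same per-frame outputs up to floating-point error. The recurrent version does a constant amount of work per frame, which is where the O(n) cost in the number of frames comes from.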

HMDB51
- 51 human action classes
- Complex motion patterns
- Smaller and more challenging than UCF101

Training was performed in multiple stages due to runtime limits:

| Epoch | Train Loss | Train Acc | Val Loss | Val Acc | F1 |
|-------|------------|-----------|----------|---------|--------|
| 1 | 3.9312 | 0.0350 | 3.8099 | 0.0967 | 0.0855 |
| 2 | 3.6330 | 0.1791 | 3.2948 | 0.3654 | 0.3149 |
| 3 | 3.0989 | 0.3691 | 2.6927 | 0.5150 | 0.4579 |
| 4 | 2.6278 | 0.5048 | 2.2879 | 0.5869 | 0.5503 |
| 5 | 2.3198 | 0.5782 | 2.0438 | 0.6255 | 0.5961 |
| 6 | 2.1387 | 0.6194 | 1.9152 | 0.6242 | 0.6074 |
| 7 | 1.9876 | 0.6657 | 1.8369 | 0.6418 | 0.6308 |
| 8 | 1.9140 | 0.6936 | 1.7966 | 0.6359 | 0.6188 |
| 9 | 1.8539 | 0.7041 | 1.7619 | 0.6556 | 0.6426 |
| 10 | 1.8149 | 0.7244 | 1.7523 | 0.6614 | 0.6512 |
| 11 | 1.7325 | 0.7524 | 1.7315 | 0.6699 | 0.6614 |
| 12 | 1.7036 | 0.7584 | 1.7469 | 0.6621 | 0.6515 |
| 13 | 1.6682 | 0.7717 | 1.7504 | 0.6595 | 0.6496 |
| 14 | 1.6344 | 0.7785 | 1.7488 | 0.6588 | 0.6494 |
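One way to read this log is to pick the checkpoint with the best validation accuracy and compare it with the final epoch. The short sketch below does that with the numbers copied from the table. The names `LOG` and `best_epoch` are hypothetical helpers for illustration, not part of the RetFormer code.

```python
# Rows copied from the training log above:
# (epoch, train_loss, train_acc, val_loss, val_acc, f1)
LOG = [
    (1, 3.9312, 0.0350, 3.8099, 0.0967, 0.0855),
    (2, 3.6330, 0.1791, 3.2948, 0.3654, 0.3149),
    (3, 3.0989, 0.3691, 2.6927, 0.5150, 0.4579),
    (4, 2.6278, 0.5048, 2.2879, 0.5869, 0.5503),
    (5, 2.3198, 0.5782, 2.0438, 0.6255, 0.5961),
    (6, 2.1387, 0.6194, 1.9152, 0.6242, 0.6074),
    (7, 1.9876, 0.6657, 1.8369, 0.6418, 0.6308),
    (8, 1.9140, 0.6936, 1.7966, 0.6359, 0.6188),
    (9, 1.8539, 0.7041, 1.7619, 0.6556, 0.6426),
    (10, 1.8149, 0.7244, 1.7523, 0.6614, 0.6512),
    (11, 1.7325, 0.7524, 1.7315, 0.6699, 0.6614),
    (12, 1.7036, 0.7584, 1.7469, 0.6621, 0.6515),
    (13, 1.6682, 0.7717, 1.7504, 0.6595, 0.6496),
    (14, 1.6344, 0.7785, 1.7488, 0.6588, 0.6494),
]

VAL_ACC = 4  # column index of validation accuracy


def best_epoch(log, column=VAL_ACC):
    """Return the row with the highest value in the given column."""
    return max(log, key=lambda row: row[column])


best = best_epoch(LOG)
final = LOG[-1]
# Gap between train and validation accuracy at the last logged epoch.
final_gap = final[2] - final[4]

print(f"best epoch: {best[0]}, val acc: {best[4]:.4f}, F1: {best[5]:.4f}")
print(f"final train/val accuracy gap: {final_gap:.4f}")
```

In this log, validation accuracy peaks at epoch 11 (0.6699, F1 0.6614), which is also the epoch with the lowest validation loss. Training accuracy keeps rising afterwards, so the later epochs mostly widen the train/validation gap.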

👉 ~25% reduction in GPU memory

RetFormer demonstrates that:

pip install torch torchvision transformers

Model size: 0.1B params (Safetensors). Tensor types: F64, F32.
