Less is More: Recursive Reasoning with Tiny Networks
Paper: arXiv 2510.04871
Extends the ViT-TRM architecture from HMDB51 (51 classes) to UCF-101 (101 action classes) via transfer learning.
Video Frames → ViT (per-frame) → Mean Pool → Positional Encoding
→ TRM Reasoning (H=2 cycles, L=2 shared layers) → Mean Pool → Classifier (101 classes)
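The flow above can be sketched in plain PyTorch. This is a shape-level illustration under stated assumptions, not the repository's ViTTRMVideo. A single patch-embedding convolution stands in for the pretrained vit_tiny_patch16_224 backbone. The class name TinyRecursiveVideoSketch and its arguments are hypothetical. The first "Mean Pool" is read as pooling patch tokens into one vector per frame, and "H=2 cycles, L=2 shared layers" is read as applying the same two layers twice.

```python
import torch
import torch.nn as nn


class TinyRecursiveVideoSketch(nn.Module):
    """Hypothetical stand-in that mirrors the pipeline's tensor shapes."""

    def __init__(self, dim=192, num_frames=16, num_classes=101, H=2, L=2):
        super().__init__()
        # Stand-in for the per-frame ViT: 16x16 patch embedding only (assumption).
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        # Learned temporal positional encoding, one vector per frame index.
        self.pos = nn.Parameter(torch.zeros(1, num_frames, dim))
        # L layers whose weights are reused on each of the H cycles.
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(
                dim, nhead=3, dim_feedforward=4 * dim, batch_first=True
            )
            for _ in range(L)
        ])
        self.H = H
        self.head = nn.Linear(dim, num_classes)

    def forward(self, video):
        b, t, c, h, w = video.shape
        # Fold frames into the batch so each frame is encoded independently.
        x = self.patch_embed(video.reshape(b * t, c, h, w))  # (b*t, dim, 14, 14)
        x = x.flatten(2).mean(dim=2)                         # pool patches -> (b*t, dim)
        x = x.reshape(b, t, -1) + self.pos[:, :t]            # (b, t, dim)
        for _ in range(self.H):                              # H reasoning cycles
            for layer in self.layers:                        # same L layers every cycle
                x = layer(x)
        return self.head(x.mean(dim=1))                      # pool frames -> (b, classes)


model = TinyRecursiveVideoSketch().eval()
with torch.no_grad():
    logits = model(torch.randn(2, 4, 3, 224, 224))
print(logits.shape)
```

Reusing the same layers on every cycle adds effective depth without adding parameters, which is the idea the referenced paper builds on.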
Backbone: vit_tiny_patch16_224 (ImageNet pretrained)
pip install torch torchvision pytorch-lightning timm torchmetrics datasets
# Transfer from HMDB51 (recommended)
python train_ucf101.py --pretrained_ckpt vit-trm-epoch=29-val_acc=0.7113.ckpt
# From scratch
python train_ucf101.py
# Smoke test
python train_ucf101.py --fast_dev_run --max_videos 50
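How train_ucf101.py consumes --pretrained_ckpt is not shown here. A common pattern for moving from a 51-class to a 101-class head is to copy only the tensors whose names and shapes match, leaving the new classifier randomly initialized. The sketch below illustrates that pattern on toy models. The helper load_matching_weights and the make_model stand-ins are hypothetical and are not part of the repository.

```python
import torch
import torch.nn as nn


def load_matching_weights(model, state_dict):
    """Copy tensors whose name and shape match; return the keys that were skipped."""
    own = model.state_dict()
    compatible = {
        k: v for k, v in state_dict.items()
        if k in own and v.shape == own[k].shape
    }
    # strict=False tolerates the keys we filtered out.
    model.load_state_dict(compatible, strict=False)
    return sorted(set(state_dict) - set(compatible))


def make_model(num_classes):
    # Toy stand-in: a shared body followed by a classifier head.
    return nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, num_classes))


source = make_model(51)    # plays the role of the HMDB51 model
target = make_model(101)   # plays the role of the UCF-101 model
skipped = load_matching_weights(target, source.state_dict())
print(skipped)  # only the head's weight and bias differ in shape
```

For a PyTorch Lightning checkpoint, the weights sit under the "state_dict" key of the dictionary returned by torch.load, so that dictionary would be passed in place of source.state_dict().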
import torch
from vit_trm_video import ViTTRMVideo

# Load the trained checkpoint and switch to inference mode
model = ViTTRMVideo.load_from_checkpoint("checkpoints/best.ckpt", strict=False)
model.eval()

video = torch.randn(1, 16, 3, 224, 224)  # (batch, frames, C, H, W)
with torch.no_grad():
    logits = model(video)
    pred = logits.argmax(dim=-1)  # predicted class index per video
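To report more than the single best class, the logits can be turned into probabilities and ranked. This is a minimal sketch that assumes the model returns a (batch, 101) logit tensor, as the snippet above suggests. The helper top_k_predictions is hypothetical, and random logits stand in for real model output.

```python
import torch


def top_k_predictions(logits, k=5):
    """Return, per video, a list of (class_index, probability) sorted by probability."""
    probs = torch.softmax(logits, dim=-1)
    conf, idx = probs.topk(k, dim=-1)
    return [list(zip(i.tolist(), c.tolist())) for i, c in zip(idx, conf)]


logits = torch.randn(1, 101)  # stand-in for model(video)
top5 = top_k_predictions(logits)
for class_index, prob in top5[0]:
    print(class_index, round(prob, 4))
```

Mapping the indices to action names requires the label order used during training, which is not given here.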
Base model
adelabdalla221/vit-trm-hmdb51