ViT-TRM on UCF-101

Extends the ViT-TRM architecture from HMDB51 (51 classes) to UCF-101 (101 action classes) via transfer learning.

Architecture

Video Frames β†’ ViT (per-frame) β†’ Mean Pool β†’ Positional Encoding
    β†’ TRM Reasoning (H=2 cycles, L=2 shared layers) β†’ Mean Pool β†’ Classifier (101 classes)
  • Backbone: vit_tiny_patch16_224 (ImageNet pretrained)
  • TRM: 2 cycles Γ— 2 shared transformer layers, 4 heads (~6M params)
  • Dataset: UCF-101 β€” 101 action classes, 13K+ videos
  • Transfer: Backbone + TRM weights initialized from HMDB51 checkpoint (69% accuracy)
  • Paper: Less is More: Recursive Reasoning with Tiny Networks
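The recursion described above (H=2 outer cycles over L=2 weight-shared transformer layers) can be sketched as follows. This is an illustrative stand-in, not the repository's implementation: the class name `TinyRecursiveReasoner` is hypothetical, and 192 is assumed as the embedding dim because that is what `vit_tiny_patch16_224` produces.

```python
import torch
import torch.nn as nn

class TinyRecursiveReasoner(nn.Module):
    """Sketch of TRM reasoning: H cycles over the same L transformer layers."""

    def __init__(self, dim=192, heads=4, n_layers=2, n_cycles=2):
        super().__init__()
        self.n_cycles = n_cycles
        # L=2 layers whose weights are reused on every cycle.
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(
                d_model=dim, nhead=heads,
                dim_feedforward=dim * 4, batch_first=True)
            for _ in range(n_layers)
        ])

    def forward(self, x):
        # H=2 cycles: the same parameters process the sequence repeatedly,
        # which is what keeps the reasoning module small (~6M params total).
        for _ in range(self.n_cycles):
            for layer in self.layers:
                x = layer(x)
        return x

trm = TinyRecursiveReasoner()
tokens = torch.randn(1, 16, 192)  # (batch, frames, dim) per-frame ViT features
out = trm(tokens)                 # shape preserved: (1, 16, 192)
```

The key point is weight sharing: unrolling H Γ— L = 4 layer applications costs only 2 layers' worth of parameters.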

Setup

pip install torch torchvision pytorch-lightning timm torchmetrics datasets

Training

# Transfer from HMDB51 (recommended)
python train_ucf101.py --pretrained_ckpt vit-trm-epoch=29-val_acc=0.7113.ckpt

# From scratch
python train_ucf101.py

# Smoke test
python train_ucf101.py --fast_dev_run --max_videos 50
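Transfer from HMDB51 implies swapping the 51-way classifier for a 101-way one while keeping the backbone and TRM weights. A minimal sketch of that head swap, assuming the model exposes its head as a `classifier` attribute (hypothetical name; `DummyModel` stands in for the real checkpointed model):

```python
import torch.nn as nn

class DummyModel(nn.Module):
    """Hypothetical stand-in for the HMDB51-pretrained model (51-way head)."""
    def __init__(self):
        super().__init__()
        self.classifier = nn.Linear(192, 51)

def swap_head(model, num_classes=101):
    """Replace only the final classifier; backbone and TRM weights are kept."""
    in_features = model.classifier.in_features
    model.classifier = nn.Linear(in_features, num_classes)
    return model

model = swap_head(DummyModel(), num_classes=101)
```

Loading with `strict=False` (as in the Usage section below) tolerates this shape mismatch, so only the new head trains from scratch.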

Usage

import torch
from vit_trm_video import ViTTRMVideo

model = ViTTRMVideo.load_from_checkpoint("checkpoints/best.ckpt", strict=False)
model.eval()

video = torch.randn(1, 16, 3, 224, 224)  # (batch, frames, C, H, W)
with torch.no_grad():
    logits = model(video)
    pred = logits.argmax(dim=-1)
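The model expects a fixed 16-frame clip, so raw videos need temporal subsampling first. One common approach is uniform sampling across the clip; this sketch assumes a decoded `(T, C, H, W)` tensor, and the actual training pipeline's sampling strategy may differ:

```python
import torch

def sample_frames(video, num_frames=16):
    """Uniformly pick num_frames indices from a (T, C, H, W) clip tensor."""
    total = video.shape[0]
    idx = torch.linspace(0, total - 1, num_frames).long()
    return video[idx]

clip = sample_frames(torch.randn(120, 3, 224, 224))  # -> (16, 3, 224, 224)
batch = clip.unsqueeze(0)  # add batch dim: (1, 16, 3, 224, 224)
```

The result matches the `(batch, frames, C, H, W)` layout the forward pass above expects.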