Paper: A Closer Look at Spatiotemporal Convolutions for Action Recognition (arXiv:1711.11248)
A lightweight 3D ResNet-18 (R3D-18) fine-tuned on UCF-101. It is a good baseline for action recognition with fast inference.
| Metric | Value |
|---|---|
| Accuracy | 83.80% |
| F1 Score | 0.828 |
| Precision | 0.842 |
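The card does not say how precision and F1 are averaged across the UCF-101 classes. As an illustration only, the sketch below computes accuracy together with macro-averaged precision and F1 from true and predicted labels. The macro averaging and the helper name `classification_metrics` are assumptions, not the author's evaluation code.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Return (accuracy, macro precision, macro F1) for integer class labels.

    Hypothetical helper: macro averaging is an assumption, the model card
    does not state which averaging mode produced its numbers.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    # Per-class true positives, false positives, and false negatives
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1

    classes = sorted(set(y_true) | set(y_pred))
    precisions, f1s = [], []
    for c in classes:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        f1s.append(f1)

    accuracy = sum(tp.values()) / len(y_true)
    return accuracy, sum(precisions) / len(classes), sum(f1s) / len(classes)
```

Calling `classification_metrics(labels, predictions)` on the test split would give numbers comparable to the table only if the original evaluation used the same averaging.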
Comparison:
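```python
import torch
from huggingface_hub import hf_hub_download
from torchvision.models.video import r3d_18
from torchvision.transforms import CenterCrop, Compose, Normalize, Resize, ToTensor

# Load from HuggingFace
model_path = hf_hub_download(repo_id="dronefreak/r3d-18-ucf101", filename="r3d18-ufc101-split-1.pth")
checkpoint = torch.load(model_path, map_location="cpu")
if isinstance(checkpoint, torch.nn.Module):
    model = checkpoint  # the file stores the full pickled model
else:
    # Assumption: the file stores a plain state_dict for torchvision's r3d_18 with 101 classes
    model = r3d_18(num_classes=101)
    model.load_state_dict(checkpoint)
model.eval()  # disable dropout and use running batch-norm statistics

# Prepare video (16 frames, C×T×H×W)
transform = Compose([
    Resize((128, 171)),
    CenterCrop(112),
    ToTensor(),
    Normalize(mean=[0.43216, 0.394666, 0.37645],
              std=[0.22803, 0.22145, 0.216989]),
])
# `frames` is a placeholder for 16 RGB PIL images sampled from one clip (loading not shown)
video_tensor = torch.stack([transform(f) for f in frames], dim=1).unsqueeze(0)  # 1×C×T×H×W

# Inference
with torch.no_grad():
    output = model(video_tensor)
    prediction = output.argmax(dim=1)
```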
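The inference snippet expects a 16-frame clip but does not show how those frames are chosen. One common option, assumed here rather than taken from the model card, is to pick 16 evenly spaced frame indices across the video. `sample_frame_indices` is a hypothetical helper written for this illustration.

```python
def sample_frame_indices(num_frames, clip_len=16):
    """Pick clip_len evenly spaced frame indices from a video of num_frames frames.

    Hypothetical helper: uniform sampling is an assumption, not necessarily
    the strategy used when this checkpoint was trained or evaluated.
    """
    if num_frames <= 0 or clip_len <= 0:
        raise ValueError("num_frames and clip_len must be positive")
    if num_frames < clip_len:
        # Short video: keep every frame and pad by repeating the last one
        return list(range(num_frames)) + [num_frames - 1] * (clip_len - num_frames)
    # Take the frame at the centre of each of clip_len equal-length segments
    return [int((i + 0.5) * num_frames / clip_len) for i in range(clip_len)]
```

The returned indices can be used to select frames from a decoded video before each frame is passed through the transform and the results are stacked along the time dimension.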
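✅ Best for: baselines, prototyping, and fast inference.
⚠️ Consider alternatives for: maximum accuracy (MC3-18 reaches 87.05% in the comparison table).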
| Model | Accuracy | Speed | Use Case |
|---|---|---|---|
| R3D-18 | 83.80% | Fast | Baseline, prototyping |
| MC3-18 | 87.05% | Moderate | Best performance |
```bibtex
@misc{r3d_18_ucf101,
  author = {Saumya Saksena},
  title = {R3D-18 for UCF-101 Action Recognition},
  year = {2024},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/dronefreak/r3d-18-ucf101}}
}
```
License: Apache-2.0