TrackMAE: Video Representation Learning via Track Mask and Predict
Paper • 2603.27268 • Published
Fine-tuned video classification checkpoints from TrackMAE: Video Representation Learning via Track Mask and Predict. The weights are PyTorch state dictionaries compatible with the official TrackMAE codebase.
| File | Backbone | Pretraining | Fine-tuning |
|---|---|---|---|
| `vit_b/k400_finetuned_vit_b_with_k400_pretraining_trackmae.pth` | ViT-B | Kinetics-400 | Kinetics-400 |
| `vit_b/ssv2_finetuned_vit_b_with_k400_pretraining_trackmae.pth` | ViT-B | Kinetics-400 | Something-Something V2 |
| `vit_l/ssv2_finetuned_vit_l_with_k700_pretraining_trackmae.pth` | ViT-L | Kinetics-700 | Something-Something V2 |
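
The three filenames above appear to encode the fine-tuning dataset, the backbone, and the pretraining dataset, in that order. The sketch below recovers those fields from a filename. The pattern is inferred only from the three listed files, and `parse_checkpoint_name` is a hypothetical helper, not part of the TrackMAE codebase.

```python
import re

# Pattern inferred from the three filenames in the table; other files may not follow it.
_CHECKPOINT_PATTERN = re.compile(
    r"^(?P<directory>vit_[a-z]+)/"
    r"(?P<finetuning>[a-z0-9]+)_finetuned_"
    r"(?P<backbone>vit_[a-z]+)_with_"
    r"(?P<pretraining>[a-z0-9]+)_pretraining_trackmae\.pth$"
)


def parse_checkpoint_name(filename: str) -> dict:
    """Split a checkpoint filename into backbone, pretraining, and fine-tuning tags."""
    match = _CHECKPOINT_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unrecognized checkpoint name: {filename}")
    return {
        "backbone": match["backbone"],
        "pretraining": match["pretraining"],
        "finetuning": match["finetuning"],
    }
```

For example, `parse_checkpoint_name("vit_b/ssv2_finetuned_vit_b_with_k400_pretraining_trackmae.pth")` yields the tags `vit_b`, `k400`, and `ssv2`, which can then be mapped to evaluation settings.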
Download a checkpoint:

    from huggingface_hub import hf_hub_download

    checkpoint = hf_hub_download(
        repo_id="YOUR_USERNAME/TrackMAE",
        filename="vit_b/ssv2_finetuned_vit_b_with_k400_pretraining_trackmae.pth",
    )
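
Since the weights are PyTorch state dictionaries, a downloaded file can be inspected before running the full evaluation. The sketch below is a minimal example under stated assumptions: this card does not say whether the parameters sit at the top level or under a wrapper key such as `model` or `module`, so the helper tries both. `unwrap_state_dict` and `inspect_checkpoint` are hypothetical names.

```python
from collections.abc import Mapping


def unwrap_state_dict(checkpoint, wrapper_keys=("model", "module")):
    """Return a flat name-to-value dict, unwrapping one optional nesting level."""
    state = checkpoint
    for key in wrapper_keys:
        # Assumption: the parameters may be nested under one of these keys.
        if key in state and isinstance(state[key], Mapping):
            state = state[key]
            break
    prefix = "module."  # prefix added by DistributedDataParallel wrappers
    return {
        (name[len(prefix):] if name.startswith(prefix) else name): value
        for name, value in state.items()
    }


def inspect_checkpoint(path, limit=5):
    """Print the first few parameter names and shapes stored in a checkpoint file."""
    import torch  # imported lazily so unwrap_state_dict works without PyTorch

    # Depending on the PyTorch version and the file contents, the
    # `weights_only` argument of torch.load may need to be set explicitly.
    checkpoint = torch.load(path, map_location="cpu")
    state_dict = unwrap_state_dict(checkpoint)
    for name, value in list(state_dict.items())[:limit]:
        print(name, tuple(getattr(value, "shape", ())))
    return state_dict
```

Calling `inspect_checkpoint(checkpoint)` with the path returned by `hf_hub_download` gives a quick check that the file loaded and that the parameter names look like those of a ViT backbone.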
Evaluate it with the TrackMAE repository:

    torchrun --standalone --nproc_per_node=4 run_class_finetuning.py \
        --model vit_base_patch16_224 \
        --data_set SSV2 \
        --nb_classes 174 \
        --data_path /path/to/ssv2/annotations \
        --finetune /path/to/checkpoint.pth \
        --output_dir outputs/eval_ssv2 \
        --batch_size 8 \
        --num_sample 1 \
        --num_frames 16 \
        --test_num_segment 2 \
        --test_num_crop 3 \
        --dist_eval \
        --eval \
        --no_auto_resume
The converted SSV2 ViT-B checkpoint reproduces the original evaluation result: 72.77% Top-1 and 93.76% Top-5 using 2 temporal views and 3 spatial crops.
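
With 2 temporal views and 3 spatial crops, each video is scored 2 × 3 = 6 times. This card does not state how the TrackMAE codebase merges those scores, so the sketch below assumes a plain average over views and then computes Top-k accuracy. All function names are hypothetical, and the inputs are ordinary Python lists rather than tensors.

```python
def average_views(view_scores):
    """Average per-view class scores (equal-length lists) into one score vector."""
    num_views = len(view_scores)
    num_classes = len(view_scores[0])
    return [
        sum(view[c] for view in view_scores) / num_views
        for c in range(num_classes)
    ]


def topk_correct(scores, label, k):
    """Return True if `label` is among the k highest-scoring classes."""
    ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    return label in ranked[:k]


def topk_accuracy(videos, k):
    """Compute Top-k accuracy in percent.

    `videos` is a list of (view_scores, label) pairs, where view_scores holds
    one score vector per view (6 vectors for 2 temporal views x 3 crops).
    """
    hits = sum(topk_correct(average_views(views), label, k) for views, label in videos)
    return 100.0 * hits / len(videos)
```

Averaging is only one possible merge rule; summing scores gives the same ranking, while taking a per-class maximum generally does not.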

    @inproceedings{vandeghen2026trackmae,
      title     = {TrackMAE: Video Representation Learning via Track Mask and Predict},
      author    = {Vandeghen, Renaud and Thoker, Fida Mohammad and Van Droogenbroeck, Marc and Ghanem, Bernard},
      booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
      year      = {2026}
    }