TrackMAE Fine-Tuned Checkpoints

Fine-tuned video classification checkpoints from the paper "TrackMAE: Video Representation Learning via Track Mask and Predict". The weights are PyTorch state dictionaries compatible with the official TrackMAE codebase.

Checkpoints

| File | Backbone | Pretraining | Fine-tuning |
| --- | --- | --- | --- |
| vit_b/k400_finetuned_vit_b_with_k400_pretraining_trackmae.pth | ViT-B | Kinetics-400 | Kinetics-400 |
| vit_b/ssv2_finetuned_vit_b_with_k400_pretraining_trackmae.pth | ViT-B | Kinetics-400 | Something-Something V2 |
| vit_l/ssv2_finetuned_vit_l_with_k700_pretraining_trackmae.pth | ViT-L | Kinetics-700 | Something-Something V2 |

Usage

Download a checkpoint:

from huggingface_hub import hf_hub_download

checkpoint = hf_hub_download(
    repo_id="YOUR_USERNAME/TrackMAE",
    filename="vit_b/ssv2_finetuned_vit_b_with_k400_pretraining_trackmae.pth",
)
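
Before passing the downloaded file to the evaluation script, it can be opened with plain PyTorch to check what it contains. The sketch below is an illustration, not part of the TrackMAE codebase. The helper names `extract_state_dict` and `load_trackmae_weights` are hypothetical, and the wrapper keys it looks for (`model`, `module`, `state_dict`) are common conventions in MAE-style repositories rather than something this model card confirms.

```python
from collections.abc import Mapping

# Wrapper keys commonly used by MAE-style training scripts (an assumption,
# not confirmed by this model card).
CANDIDATE_KEYS = ("model", "module", "state_dict")


def extract_state_dict(checkpoint):
    """Return the parameter mapping, unwrapping one level if needed."""
    if not isinstance(checkpoint, Mapping):
        raise TypeError("expected a dict-like checkpoint")
    for key in CANDIDATE_KEYS:
        inner = checkpoint.get(key)
        if isinstance(inner, Mapping):
            return inner
    # No known wrapper key: assume the file is already a bare state dict.
    return checkpoint


def load_trackmae_weights(path):
    """Load a checkpoint file on CPU and return its state dict."""
    import torch  # imported lazily so extract_state_dict works without PyTorch

    checkpoint = torch.load(path, map_location="cpu")
    return extract_state_dict(checkpoint)
```

Calling `load_trackmae_weights(checkpoint)` with the path returned by `hf_hub_download` and printing a few keys with their tensor shapes is a quick way to confirm that the file matches the backbone you intend to evaluate.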

Evaluate it with the TrackMAE repository:

torchrun --standalone --nproc_per_node=4 run_class_finetuning.py \
  --model vit_base_patch16_224 \
  --data_set SSV2 \
  --nb_classes 174 \
  --data_path /path/to/ssv2/annotations \
  --finetune /path/to/checkpoint.pth \
  --output_dir outputs/eval_ssv2 \
  --batch_size 8 \
  --num_sample 1 \
  --num_frames 16 \
  --test_num_segment 2 \
  --test_num_crop 3 \
  --dist_eval \
  --eval \
  --no_auto_resume

The converted SSV2 ViT-B checkpoint reproduces the original evaluation result: 72.77% Top-1 and 93.76% Top-5 using 2 temporal views and 3 spatial crops.
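
With `--test_num_segment 2` and `--test_num_crop 3`, each test video is scored from 2 × 3 = 6 views. As a rough illustration of how multi-view results are commonly combined, the sketch below averages per-view class scores and checks top-k membership. The function names are hypothetical, and the averaging rule is an assumption about typical video evaluation protocols; the TrackMAE evaluation script may aggregate differently.

```python
def aggregate_views(view_scores):
    """Average per-class scores over all views of one video."""
    num_views = len(view_scores)
    num_classes = len(view_scores[0])
    return [
        sum(view[c] for view in view_scores) / num_views
        for c in range(num_classes)
    ]


def in_top_k(scores, label, k):
    """Return True if `label` is among the k highest-scoring classes."""
    ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    return label in ranked[:k]


# Toy input: 6 views (2 temporal segments x 3 spatial crops), 3 classes.
toy_views = [
    [0.2, 0.5, 0.3],
    [0.1, 0.6, 0.3],
    [0.3, 0.4, 0.3],
    [0.2, 0.7, 0.1],
    [0.4, 0.3, 0.3],
    [0.1, 0.8, 0.1],
]
video_scores = aggregate_views(toy_views)
top1_correct = in_top_k(video_scores, label=1, k=1)
```

Top-1 and Top-5 accuracy are then the fraction of videos for which `in_top_k` holds with `k=1` and `k=5`, respectively.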

Citation

@inproceedings{vandeghen2026trackmae,
  title     = {TrackMAE: Video Representation Learning via Track Mask and Predict},
  author    = {Vandeghen, Renaud and Thoker, Fida Mohammad and Van Droogenbroeck, Marc and Ghanem, Bernard},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2026}
}