---
license: mit
tags:
  - video-classification
  - timesformer
  - action-recognition
  - ucf101
  - pytorch
datasets:
  - ucf101
---

# 🎬 TimeSformer Fine-Tuned for Video Action Recognition

This model is a fine-tuned version of TimeSformer (Time-Space Transformer) for video action recognition, trained on the UCF101 dataset.


## 📌 Model Overview

- **Base Model:** facebook/timesformer-base-finetuned-k400
- **Task:** Video Classification / Action Recognition
- **Dataset:** UCF101 (101 action classes)
- **Framework:** PyTorch + Hugging Face Transformers
- **Training Environment:** Kaggle (GPU)

## 🧠 Training Strategy

Due to Kaggle's 12-hour session limit, training was performed in multiple stages:

1. Initial training run
2. Checkpoint saving (best model)
3. Resume training from best checkpoint
4. Further fine-tuning across sessions

This approach allows a long training run to continue across sessions without losing progress.
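The save/resume pattern described above can be sketched as follows. This is a minimal illustration, not the training script from this repo: the model here is a stand-in, and names such as `CKPT_PATH` and the checkpoint keys are assumptions.

```python
import torch
from torch import nn, optim

CKPT_PATH = "best_model.pt"  # illustrative path

# Stand-in for the TimeSformer model and its optimizer
model = nn.Linear(10, 101)
optimizer = optim.AdamW(model.parameters(), lr=1e-4)

# --- end of one Kaggle session: save the best checkpoint ---
torch.save({
    "model_state": model.state_dict(),
    "optimizer_state": optimizer.state_dict(),
    "epoch": 3,
    "best_val_acc": 0.8993,
}, CKPT_PATH)

# --- start of the next session: restore and continue ---
ckpt = torch.load(CKPT_PATH, map_location="cpu")
model.load_state_dict(ckpt["model_state"])
optimizer.load_state_dict(ckpt["optimizer_state"])
start_epoch = ckpt["epoch"] + 1  # training resumes at epoch 4
```

Saving the optimizer state alongside the weights matters here: AdamW keeps per-parameter moment estimates, and dropping them between sessions would effectively restart the optimizer.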


## 📊 Training Results

### 🔹 Initial Training

| Epoch | Train Loss | Train Acc | Val Loss | Val Acc |
|-------|------------|-----------|----------|---------|
| 1     | 4.5066     | 0.0622    | 4.1089   | 0.4245  |
| 2     | 3.5721     | 0.4711    | 2.5276   | 0.8007  |
| 3     | 2.3239     | 0.7323    | 1.4321   | 0.8993  |

### 🔹 Continued Training (Checkpoint Resume)

| Epoch | Train Loss | Train Acc | Val Loss | Val Acc |
|-------|------------|-----------|----------|---------|
| 4     | 1.8289     | 0.7991    | 1.1802   | 0.9199  |
| 5     | 1.7119     | 0.8094    | 1.1372   | 0.9128  |
| 6     | 1.6365     | 0.8153    | 1.1085   | 0.9191  |
| 7     | 1.5982     | 0.8139    | 1.0868   | 0.9218  |
| 8     | 1.5053     | 0.8194    | 1.0763   | 0.9262  |
| 9     | 1.4673     | 0.8201    | 1.0824   | 0.9225  |

πŸ† Best Performance

  • Best Validation Accuracy: 92.62%
  • F1 Score: 0.9244
  • Precision: 0.9315
  • Recall: 0.9262
  • Achieved at Epoch 8

## 📈 Additional Metrics

| Metric    | Value  |
|-----------|--------|
| Precision | 0.9315 |
| Recall    | 0.9262 |
| F1 Score  | 0.9244 |

βš™οΈ Training Details

  • Mixed Precision Training (torch.cuda.amp)
  • GPU Memory Usage: ~9.3–9.8 GB
  • Training Time per Epoch: ~2.5 hours
  • Evaluation Time per Epoch: ~20 minutes
  • Best model checkpoint saved automatically
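The mixed-precision setup can be sketched as below. This is a generic `torch.cuda.amp` training step with a stand-in model and random data, not the actual training loop from this repo; it falls back to full precision when no GPU is available.

```python
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"  # AMP sketch; disabled on CPU

# Stand-in model and batch (the real run trains TimeSformer on video clips)
model = nn.Linear(512, 101).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(4, 512, device=device)
y = torch.randint(0, 101, (4,), device=device)

optimizer.zero_grad()
with torch.autocast(device_type=device, enabled=use_amp):
    loss = loss_fn(model(x), y)        # forward pass in fp16 under AMP

scaler.scale(loss).backward()          # scaled backward to avoid fp16 underflow
scaler.step(optimizer)                 # unscales grads, then optimizer step
scaler.update()
```

Autocast keeps the forward pass in half precision where safe, which is what brings the ~9.3–9.8 GB memory footprint within a single Kaggle GPU.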

## 🚀 Usage

### Install Dependencies

```bash
pip install torch torchvision transformers
```
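For inference, TimeSformer expects a clip of 8 uniformly sampled frames. The sketch below shows the frame-sampling step with random data standing in for a decoded video; the model call is left in comments because it downloads weights, and the repo id there is a placeholder, not confirmed by this card.

```python
import numpy as np

NUM_FRAMES = 8  # clip length the base TimeSformer checkpoint expects

def sample_frame_indices(num_video_frames: int, num_samples: int = NUM_FRAMES):
    """Uniformly spaced frame indices covering the whole video."""
    return np.linspace(0, num_video_frames - 1, num_samples).astype(int)

# Random data standing in for decoded frames (e.g. via decord or torchvision.io)
video = np.random.randint(0, 255, (120, 224, 224, 3), dtype=np.uint8)  # T,H,W,C
indices = sample_frame_indices(len(video))
clip = [video[i] for i in indices]  # list of 8 frames

# With transformers installed (repo id below is a placeholder):
# from transformers import AutoImageProcessor, TimesformerForVideoClassification
# processor = AutoImageProcessor.from_pretrained("facebook/timesformer-base-finetuned-k400")
# model = TimesformerForVideoClassification.from_pretrained("<this-repo-id>")
# inputs = processor(clip, return_tensors="pt")
# predicted_class = model(**inputs).logits.argmax(-1).item()
```

Uniform sampling keeps the clip's temporal coverage of the whole video, which matches how UCF101 clips are typically evaluated.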