ViT Human Action Recognition Model

This is a fine-tuned Vision Transformer (ViT) model for multi-class human action recognition from static images. It predicts one of 15 common activities, such as "running", "eating", "texting", or "using a laptop".


Model Details

  • Base Model: google/vit-base-patch16-224-in21k
  • Developed by: Harsha Vardhan Mannem
  • Task: Image Classification
  • Architecture: Vision Transformer (ViT)
  • Total Classes: 15
  • Language: N/A (Vision Model)
  • License: MIT
  • Parameters: ~85.8M (F32, Safetensors)

Supported Classes

The model predicts the following 15 activities:

  • calling
  • clapping
  • cycling
  • dancing
  • drinking
  • eating
  • fighting
  • hugging
  • laughing
  • listening_to_music
  • running
  • sitting
  • sleeping
  • texting
  • using_laptop
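For downstream code that needs the label set, the list above can be mirrored as a constant. A minimal sketch, assuming the alphabetical ordering shown here matches the model's class indices (the authoritative mapping is the `id2label` field in the model's `config.json`):

```python
# Label list copied from the model card, assuming alphabetical class
# ordering; config.json's id2label is the authoritative source.
CLASSES = [
    "calling", "clapping", "cycling", "dancing", "drinking",
    "eating", "fighting", "hugging", "laughing", "listening_to_music",
    "running", "sitting", "sleeping", "texting", "using_laptop",
]

# Map class name -> integer id and back
label2id = {name: i for i, name in enumerate(CLASSES)}
id2label = {i: name for i, name in enumerate(CLASSES)}

print(len(CLASSES))          # 15
print(label2id["running"])   # 10
```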

Intended Use

Direct Use

  • Predicts human activity from a single image
  • Useful for research, prototyping, and computer vision pipelines that involve human action detection

Limitations & Risks

  • Not designed for multi-label scenarios (only one action per image)
  • Not intended for video recognition tasks
  • Accuracy may degrade with poor lighting, extreme angles, or occlusions
  • Model may reflect biases present in the training data
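One practical way to handle the degraded-accuracy cases listed above is to treat low-confidence predictions as "uncertain" instead of committing to a label. A minimal sketch over the pipeline's output format; the helper name and the 0.5 threshold are illustrative choices, not part of the model:

```python
# Hypothetical confidence gate over pipeline-style output; the 0.5
# threshold is an illustrative, uncalibrated choice.
def gate_prediction(predictions, threshold=0.5):
    """Return the top label, or None if no class clears the threshold."""
    best = max(predictions, key=lambda p: p["score"])
    return best["label"] if best["score"] >= threshold else None

# Example pipeline-style output (scores are made up)
sample = [{"label": "running", "score": 0.92},
          {"label": "cycling", "score": 0.04}]
print(gate_prediction(sample))                  # running
print(gate_prediction(sample, threshold=0.95))  # None
```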

Quickstart Example

from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub
pipe = pipeline("image-classification", model="harsha90145/vit-human-pose-classification-model")

# The pipeline accepts an image URL, a local file path, or a PIL image
url = "https://images.pexels.com/photos/1755385/pexels-photo-1755385.jpeg"
output = pipe(url)
print(output)

Example Output:

[{'label': 'running', 'score': 0.92}]
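The `score` values come from a softmax over the model's 15 per-class logits, which is why they lie in [0, 1] and sum to 1 across all classes. A minimal sketch of that post-processing step with made-up logits (not real model output):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for the 15 classes (not real model output)
logits = [0.1] * 15
logits[10] = 5.0  # pretend the "running" class dominates
probs = softmax(logits)

print(abs(sum(probs) - 1.0) < 1e-9)  # True: probabilities sum to 1
print(probs[10] > 0.9)               # True: dominant class takes most mass
```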


Evaluation Results

Test set size: 2520 samples

| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| calling | 0.66 | 0.69 | 0.67 | 159 |
| clapping | 0.82 | 0.81 | 0.82 | 191 |
| cycling | 0.94 | 0.92 | 0.93 | 167 |
| dancing | 0.91 | 0.83 | 0.87 | 155 |
| drinking | 0.79 | 0.82 | 0.81 | 170 |
| eating | 0.83 | 0.86 | 0.85 | 169 |
| fighting | 0.85 | 0.89 | 0.87 | 154 |
| hugging | 0.77 | 0.81 | 0.79 | 149 |
| laughing | 0.80 | 0.77 | 0.78 | 176 |
| listening_to_music | 0.75 | 0.65 | 0.70 | 179 |
| running | 0.80 | 0.88 | 0.84 | 159 |
| sitting | 0.65 | 0.67 | 0.66 | 166 |
| sleeping | 0.80 | 0.81 | 0.81 | 167 |
| texting | 0.67 | 0.66 | 0.66 | 179 |
| using_laptop | 0.72 | 0.69 | 0.71 | 180 |

Overall Accuracy: 78%
Macro F1 Score: 78%
Weighted F1 Score: 78%
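The aggregate scores follow directly from the per-class table: macro F1 is the unweighted mean of the per-class F1 scores, and weighted F1 weights each class by its support. A quick recomputation from the rows above:

```python
# Per-class (f1, support) pairs copied from the evaluation table
rows = [
    ("calling", 0.67, 159), ("clapping", 0.82, 191), ("cycling", 0.93, 167),
    ("dancing", 0.87, 155), ("drinking", 0.81, 170), ("eating", 0.85, 169),
    ("fighting", 0.87, 154), ("hugging", 0.79, 149), ("laughing", 0.78, 176),
    ("listening_to_music", 0.70, 179), ("running", 0.84, 159),
    ("sitting", 0.66, 166), ("sleeping", 0.81, 167), ("texting", 0.66, 179),
    ("using_laptop", 0.71, 180),
]

total = sum(n for _, _, n in rows)                      # test-set size
macro_f1 = sum(f1 for _, f1, _ in rows) / len(rows)     # unweighted mean
weighted_f1 = sum(f1 * n for _, f1, n in rows) / total  # support-weighted mean

print(total)                  # 2520
print(round(macro_f1, 2))     # 0.78
print(round(weighted_f1, 2))  # 0.78
```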

Citation

@misc{harsha2024humanactionvit,
  title={ViT-based Human Action Recognition Model},
  author={Harsha Vardhan Mannem},
  howpublished={\url{https://huggingface.co/harsha90145/vit-human-pose-classification-model}},
  year={2024}
}