ViT Human Action Recognition Model

This is a fine-tuned Vision Transformer (ViT) model for multi-class human action recognition from static images. It predicts one of 15 common activities, such as "running", "eating", "texting", or "using a laptop".


Model Details

  • Base Model: google/vit-base-patch16-224-in21k
  • Developed by: Harsha Vardhan Mannem
  • Task: Image Classification
  • Architecture: Vision Transformer (ViT)
  • Total Classes: 15
  • Language: N/A (Vision Model)
  • License: MIT
  • Parameters: ~85.8M (F32, Safetensors)

Supported Classes

The model predicts the following 15 activities:

  • calling
  • clapping
  • cycling
  • dancing
  • drinking
  • eating
  • fighting
  • hugging
  • laughing
  • listening_to_music
  • running
  • sitting
  • sleeping
  • texting
  • using_laptop
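For downstream code that needs the label set, the list above can be mirrored as a constant. A minimal sketch, assuming the alphabetical ordering shown here matches the model's class indices (the authoritative mapping is the `id2label` field in the model's `config.json`):

```python
# Label list copied from the model card, assuming alphabetical class
# ordering; config.json's id2label is the authoritative source.
CLASSES = [
    "calling", "clapping", "cycling", "dancing", "drinking",
    "eating", "fighting", "hugging", "laughing", "listening_to_music",
    "running", "sitting", "sleeping", "texting", "using_laptop",
]

# Map class name -> integer id and back
label2id = {name: i for i, name in enumerate(CLASSES)}
id2label = {i: name for i, name in enumerate(CLASSES)}

print(len(CLASSES))          # 15
print(label2id["running"])   # 10
```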

Intended Use

Direct Use

  • Predicts human activity from a single image
  • Useful for research, prototyping, and computer vision pipelines that involve human action detection

Limitations & Risks

  • Not designed for multi-label scenarios (only one action per image)
  • Not intended for video recognition tasks
  • Accuracy may degrade with poor lighting, extreme angles, or occlusions
  • Model may reflect biases present in the training data
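One practical way to handle the degraded-accuracy cases listed above is to treat low-confidence predictions as "uncertain" instead of committing to a label. A minimal sketch over the pipeline's output format; the helper name and the 0.5 threshold are illustrative choices, not part of the model:

```python
# Hypothetical confidence gate over pipeline-style output; the 0.5
# threshold is an illustrative, uncalibrated choice.
def gate_prediction(predictions, threshold=0.5):
    """Return the top label, or None if no class clears the threshold."""
    best = max(predictions, key=lambda p: p["score"])
    return best["label"] if best["score"] >= threshold else None

# Example pipeline-style output (scores are made up)
sample = [{"label": "running", "score": 0.92},
          {"label": "cycling", "score": 0.04}]
print(gate_prediction(sample))                  # running
print(gate_prediction(sample, threshold=0.95))  # None
```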

Quickstart Example

from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub
pipe = pipeline("image-classification", model="harsha90145/vit-human-pose-classification-model")

# The pipeline accepts an image URL, a local file path, or a PIL image
url = "https://images.pexels.com/photos/1755385/pexels-photo-1755385.jpeg"
output = pipe(url)
print(output)

Example Output:

[{'label': 'running', 'score': 0.92}]
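The `score` values come from a softmax over the model's 15 per-class logits, which is why they lie in [0, 1] and sum to 1 across all classes. A minimal sketch of that post-processing step with made-up logits (not real model output):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for the 15 classes (not real model output)
logits = [0.1] * 15
logits[10] = 5.0  # pretend the "running" class dominates
probs = softmax(logits)

print(abs(sum(probs) - 1.0) < 1e-9)  # True: probabilities sum to 1
print(probs[10] > 0.9)               # True: dominant class takes most mass
```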


Evaluation Results

Test set size: 2520 samples

| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| calling | 0.66 | 0.69 | 0.67 | 159 |
| clapping | 0.82 | 0.81 | 0.82 | 191 |
| cycling | 0.94 | 0.92 | 0.93 | 167 |
| dancing | 0.91 | 0.83 | 0.87 | 155 |
| drinking | 0.79 | 0.82 | 0.81 | 170 |
| eating | 0.83 | 0.86 | 0.85 | 169 |
| fighting | 0.85 | 0.89 | 0.87 | 154 |
| hugging | 0.77 | 0.81 | 0.79 | 149 |
| laughing | 0.80 | 0.77 | 0.78 | 176 |
| listening_to_music | 0.75 | 0.65 | 0.70 | 179 |
| running | 0.80 | 0.88 | 0.84 | 159 |
| sitting | 0.65 | 0.67 | 0.66 | 166 |
| sleeping | 0.80 | 0.81 | 0.81 | 167 |
| texting | 0.67 | 0.66 | 0.66 | 179 |
| using_laptop | 0.72 | 0.69 | 0.71 | 180 |

Overall Accuracy: 78%
Macro F1 Score: 78%
Weighted F1 Score: 78%
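The aggregate scores follow directly from the per-class table: macro F1 is the unweighted mean of the per-class F1 scores, and weighted F1 weights each class by its support. A quick recomputation from the rows above:

```python
# Per-class (f1, support) pairs copied from the evaluation table
rows = [
    ("calling", 0.67, 159), ("clapping", 0.82, 191), ("cycling", 0.93, 167),
    ("dancing", 0.87, 155), ("drinking", 0.81, 170), ("eating", 0.85, 169),
    ("fighting", 0.87, 154), ("hugging", 0.79, 149), ("laughing", 0.78, 176),
    ("listening_to_music", 0.70, 179), ("running", 0.84, 159),
    ("sitting", 0.66, 166), ("sleeping", 0.81, 167), ("texting", 0.66, 179),
    ("using_laptop", 0.71, 180),
]

total = sum(n for _, _, n in rows)                      # test-set size
macro_f1 = sum(f1 for _, f1, _ in rows) / len(rows)     # unweighted mean
weighted_f1 = sum(f1 * n for _, f1, n in rows) / total  # support-weighted mean

print(total)                  # 2520
print(round(macro_f1, 2))     # 0.78
print(round(weighted_f1, 2))  # 0.78
```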

Citation

@misc{harsha2024humanactionvit,
  title={ViT-based Human Action Recognition Model},
  author={Harsha Vardhan Mannem},
  howpublished={\url{https://huggingface.co/harsha90145/vit-human-pose-classification-model}},
  year={2024}
}