---
library_name: transformers
tags:
- image-classification
- vision-transformer
- computer-vision
- human-action-recognition
license: mit
language:
- en
metrics:
- accuracy
base_model:
- google/vit-base-patch16-224-in21k
pipeline_tag: image-classification
---

# ViT Human Action Recognition Model

This is a fine-tuned Vision Transformer (ViT) model for image-based multi-class human action recognition. The model predicts one of 15 common human activities, such as "running", "eating", "texting", or "using a laptop", from a single static image.

---

## Model Details

- **Base Model:** [google/vit-base-patch16-224-in21k](https://huggingface.co/google/vit-base-patch16-224-in21k)
- **Developed by:** Harsha Vardhan Mannem
- **Task:** Image Classification
- **Architecture:** Vision Transformer (ViT)
- **Total Classes:** 15
- **Language:** N/A (Vision Model)
- **License:** MIT

---

## Supported Classes

The model predicts the following 15 activities:

- `calling`
- `clapping`
- `cycling`
- `dancing`
- `drinking`
- `eating`
- `fighting`
- `hugging`
- `laughing`
- `listening_to_music`
- `running`
- `sitting`
- `sleeping`
- `texting`
- `using_laptop`

---

## Intended Use

### Direct Use

- Predicts a human activity from a single image
- Useful for research, prototyping, and computer vision pipelines that involve human action detection

### Limitations & Risks

- Not designed for multi-label scenarios (only one action is predicted per image)
- Not intended for video recognition tasks
- Accuracy may degrade with poor lighting, extreme camera angles, or occlusions
- The model may reflect biases present in the training data

---

## Quickstart Example

```python
from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub
pipe = pipeline("image-classification", model="harsha90145/vit-human-pose-classification-model")

# The pipeline accepts an image URL, a local file path, or a PIL image
url = "https://images.pexels.com/photos/1755385/pexels-photo-1755385.jpeg"
output = pipe(url)
print(output)
```

Example output: `[{'label': 'running', 'score': 0.92}]`

---

## Evaluation Results

Test set size: 2,520 samples

| Class              | Precision | Recall | F1-Score | Support |
|--------------------|-----------|--------|----------|---------|
| calling            | 0.66      | 0.69   | 0.67     | 159     |
| clapping           | 0.82      | 0.81   | 0.82     | 191     |
| cycling            | 0.94      | 0.92   | 0.93     | 167     |
| dancing            | 0.91      | 0.83   | 0.87     | 155     |
| drinking           | 0.79      | 0.82   | 0.81     | 170     |
| eating             | 0.83      | 0.86   | 0.85     | 169     |
| fighting           | 0.85      | 0.89   | 0.87     | 154     |
| hugging            | 0.77      | 0.81   | 0.79     | 149     |
| laughing           | 0.80      | 0.77   | 0.78     | 176     |
| listening_to_music | 0.75      | 0.65   | 0.70     | 179     |
| running            | 0.80      | 0.88   | 0.84     | 159     |
| sitting            | 0.65      | 0.67   | 0.66     | 166     |
| sleeping           | 0.80      | 0.81   | 0.81     | 167     |
| texting            | 0.67      | 0.66   | 0.66     | 179     |
| using_laptop       | 0.72      | 0.69   | 0.71     | 180     |

- Overall accuracy: 78%
- Macro F1 score: 78%
- Weighted F1 score: 78%

---

## Citation

```bibtex
@misc{harsha2024humanactionvit,
  title={ViT-based Human Action Recognition Model},
  author={Harsha Vardhan Mannem},
  howpublished={\url{https://huggingface.co/harsha90145/vit-human-pose-classification-model}},
  year={2024}
}
```
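
---

## Direct Inference Without the Pipeline

For finer-grained control over preprocessing and access to the raw class scores, the model can also be loaded directly with `AutoImageProcessor` and `AutoModelForImageClassification`. The snippet below is a minimal sketch, not part of the original card; it reuses the illustrative image URL from the Quickstart, and any RGB image of a person performing an action can be substituted.

```python
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

model_id = "harsha90145/vit-human-pose-classification-model"

# Load the image processor and the fine-tuned classification model
processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModelForImageClassification.from_pretrained(model_id)
model.eval()

# Illustrative example image (same URL as in the Quickstart above)
url = "https://images.pexels.com/photos/1755385/pexels-photo-1755385.jpeg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Resize and normalize the image to the 224x224 input the ViT backbone expects
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Convert logits to probabilities and map the top score to one of the 15 action labels
probs = logits.softmax(dim=-1)[0]
top = probs.argmax().item()
print(model.config.id2label[top], round(probs[top].item(), 3))
```

Because the full probability vector is available here, this route also makes it easy to inspect scores for all 15 classes rather than only the top prediction returned by the pipeline.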