---
library_name: transformers
tags:
- image-classification
- vision-transformer
- computer-vision
- human-action-recognition
license: mit
language:
- en
metrics:
- accuracy
base_model:
- google/vit-base-patch16-224-in21k
pipeline_tag: image-classification
---

# ViT Human Action Recognition Model

This is a fine-tuned Vision Transformer (ViT) model for multi-class human action recognition from static images. Given a single image, it predicts one of 15 common human activities, such as "running", "eating", "texting", or "using a laptop".

---

## Model Details

- **Base Model:** [google/vit-base-patch16-224-in21k](https://huggingface.co/google/vit-base-patch16-224-in21k)
- **Developed by:** Harsha Vardhan Mannem
- **Task:** Image Classification
- **Architecture:** Vision Transformer (ViT)
- **Total Classes:** 15
- **Language:** N/A (vision model)
- **License:** MIT

---

## Supported Classes

The model predicts the following 15 activities:

- `calling`
- `clapping`
- `cycling`
- `dancing`
- `drinking`
- `eating`
- `fighting`
- `hugging`
- `laughing`
- `listening_to_music`
- `running`
- `sitting`
- `sleeping`
- `texting`
- `using_laptop`
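For downstream code that needs class ids rather than strings, the names above can be turned into the usual `id2label`/`label2id` mappings. A minimal sketch, assuming ids follow the alphabetical order listed here (the authoritative mapping is always `model.config.id2label`):

```python
# The 15 supported classes, in the alphabetical order listed above.
LABELS = [
    "calling", "clapping", "cycling", "dancing", "drinking",
    "eating", "fighting", "hugging", "laughing", "listening_to_music",
    "running", "sitting", "sleeping", "texting", "using_laptop",
]

# Hugging Face-style mappings between integer class ids and label names.
id2label = {i: name for i, name in enumerate(LABELS)}
label2id = {name: i for i, name in id2label.items()}

print(id2label[10])  # 'running' under the alphabetical-order assumption
```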

---

## Intended Use

### Direct Use

- Predicts a human activity from a single image
- Useful for research, prototyping, and computer vision pipelines involving human action detection

### Limitations & Risks

- Not designed for multi-label scenarios (only one action per image)
- Not intended for video recognition tasks
- Accuracy may degrade with poor lighting, extreme angles, or occlusions
- May reflect biases present in the training data

---

## Quickstart Example

```python
from transformers import pipeline

pipe = pipeline("image-classification", model="harsha90145/vit-human-pose-classification-model")

url = "https://images.pexels.com/photos/1755385/pexels-photo-1755385.jpeg"
output = pipe(url)
print(output)
```

Example output:

```python
[{'label': 'running', 'score': 0.92}]
```
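If you need the raw logits rather than the pipeline's formatted output, the checkpoint can also be used through the lower-level `AutoImageProcessor`/`AutoModelForImageClassification` API. A minimal sketch; the `top_prediction` and `classify` helpers are illustrative, not part of the repository:

```python
import torch
from transformers import AutoImageProcessor, AutoModelForImageClassification

REPO = "harsha90145/vit-human-pose-classification-model"

def top_prediction(logits: torch.Tensor, id2label: dict) -> tuple:
    """Map a 1-D logits tensor to the highest-scoring label and its softmax probability."""
    probs = torch.softmax(logits, dim=-1)
    score, idx = probs.max(dim=-1)
    return id2label[int(idx)], float(score)

def classify(image) -> tuple:
    """Classify a single PIL image with the fine-tuned ViT checkpoint."""
    processor = AutoImageProcessor.from_pretrained(REPO)
    model = AutoModelForImageClassification.from_pretrained(REPO)
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return top_prediction(logits[0], model.config.id2label)
```

`classify` expects a PIL image, e.g. `Image.open(requests.get(url, stream=True).raw)` for a remote file, and returns a `(label, score)` pair.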

---

## Evaluation Results

Test set size: 2520 samples

| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| calling | 0.66 | 0.69 | 0.67 | 159 |
| clapping | 0.82 | 0.81 | 0.82 | 191 |
| cycling | 0.94 | 0.92 | 0.93 | 167 |
| dancing | 0.91 | 0.83 | 0.87 | 155 |
| drinking | 0.79 | 0.82 | 0.81 | 170 |
| eating | 0.83 | 0.86 | 0.85 | 169 |
| fighting | 0.85 | 0.89 | 0.87 | 154 |
| hugging | 0.77 | 0.81 | 0.79 | 149 |
| laughing | 0.80 | 0.77 | 0.78 | 176 |
| listening_to_music | 0.75 | 0.65 | 0.70 | 179 |
| running | 0.80 | 0.88 | 0.84 | 159 |
| sitting | 0.65 | 0.67 | 0.66 | 166 |
| sleeping | 0.80 | 0.81 | 0.81 | 167 |
| texting | 0.67 | 0.66 | 0.66 | 179 |
| using_laptop | 0.72 | 0.69 | 0.71 | 180 |

- **Overall Accuracy:** 78%
- **Macro F1 Score:** 78%
- **Weighted F1 Score:** 78%
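The table above has the shape of scikit-learn's `classification_report`, so the summary metrics can be recomputed from saved predictions in a few lines. A small sketch with dummy labels standing in for the actual 2520-sample test split:

```python
from sklearn.metrics import accuracy_score, classification_report, f1_score

# Dummy true/predicted label lists; in practice these come from running the
# model over the held-out test split.
y_true = ["running", "running", "sitting", "texting"]
y_pred = ["running", "sitting", "sitting", "texting"]

print(accuracy_score(y_true, y_pred))                      # 0.75
print(round(f1_score(y_true, y_pred, average="macro"), 2)) # 0.78
print(classification_report(y_true, y_pred))               # per-class precision/recall/F1, as in the table
```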

---

## Citation

```bibtex
@misc{harsha2024humanactionvit,
  title={ViT-based Human Action Recognition Model},
  author={Harsha Vardhan Mannem},
  howpublished={\url{https://huggingface.co/harsha90145/vit-human-pose-classification-model}},
  year={2024}
}
```