---
library_name: transformers
tags:
- image-classification
- vision-transformer
- computer-vision
- human-action-recognition
license: mit
language:
- en
metrics:
- accuracy
base_model:
- google/vit-base-patch16-224-in21k
pipeline_tag: image-classification
---

# ViT Human Action Recognition Model

This is a fine-tuned Vision Transformer (ViT) model for image-based multi-class human action recognition. The model predicts one of 15 common human activities, such as "running", "eating", "texting", or "using a laptop", from a single static image.

---

## Model Details

- **Base Model:** [google/vit-base-patch16-224-in21k](https://huggingface.co/google/vit-base-patch16-224-in21k)
- **Developed by:** Harsha Vardhan Mannem
- **Task:** Image Classification
- **Architecture:** Vision Transformer (ViT)
- **Total Classes:** 15
- **Language:** N/A (Vision Model)
- **License:** MIT

---

## Supported Classes

The model predicts the following 15 activities:

- `calling`
- `clapping`
- `cycling`
- `dancing`
- `drinking`
- `eating`
- `fighting`
- `hugging`
- `laughing`
- `listening_to_music`
- `running`
- `sitting`
- `sleeping`
- `texting`
- `using_laptop`

---

## Intended Use

### Direct Use

- Predicts a human activity from a single image
- Useful for research, prototyping, and computer vision pipelines that involve human action detection

### Limitations & Risks

- Not designed for multi-label scenarios (only one action is predicted per image)
- Not intended for video recognition tasks
- Accuracy may degrade with poor lighting, extreme camera angles, or occlusions
- The model may reflect biases present in the training data

---

## Quickstart Example

```python
from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub
pipe = pipeline("image-classification", model="harsha90145/vit-human-pose-classification-model")

# The pipeline accepts an image URL, a local file path, or a PIL image
url = "https://images.pexels.com/photos/1755385/pexels-photo-1755385.jpeg"
output = pipe(url)
print(output)
```

Example output: `[{'label': 'running', 'score': 0.92}]`

---

## Evaluation Results

Test set size: 2,520 samples

| Class              | Precision | Recall | F1-Score | Support |
|--------------------|-----------|--------|----------|---------|
| calling            | 0.66      | 0.69   | 0.67     | 159     |
| clapping           | 0.82      | 0.81   | 0.82     | 191     |
| cycling            | 0.94      | 0.92   | 0.93     | 167     |
| dancing            | 0.91      | 0.83   | 0.87     | 155     |
| drinking           | 0.79      | 0.82   | 0.81     | 170     |
| eating             | 0.83      | 0.86   | 0.85     | 169     |
| fighting           | 0.85      | 0.89   | 0.87     | 154     |
| hugging            | 0.77      | 0.81   | 0.79     | 149     |
| laughing           | 0.80      | 0.77   | 0.78     | 176     |
| listening_to_music | 0.75      | 0.65   | 0.70     | 179     |
| running            | 0.80      | 0.88   | 0.84     | 159     |
| sitting            | 0.65      | 0.67   | 0.66     | 166     |
| sleeping           | 0.80      | 0.81   | 0.81     | 167     |
| texting            | 0.67      | 0.66   | 0.66     | 179     |
| using_laptop       | 0.72      | 0.69   | 0.71     | 180     |

- Overall accuracy: 78%
- Macro F1 score: 78%
- Weighted F1 score: 78%

---

## Citation

```bibtex
@misc{harsha2024humanactionvit,
  title={ViT-based Human Action Recognition Model},
  author={Harsha Vardhan Mannem},
  howpublished={\url{https://huggingface.co/harsha90145/vit-human-pose-classification-model}},
  year={2024}
}
```
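
---

## Direct Inference Without the Pipeline

For finer-grained control over preprocessing and access to the raw class scores, the model can also be loaded directly with `AutoImageProcessor` and `AutoModelForImageClassification`. The snippet below is a minimal sketch, not part of the original card; it reuses the illustrative image URL from the Quickstart, and any RGB image of a person performing an action can be substituted.

```python
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

model_id = "harsha90145/vit-human-pose-classification-model"

# Load the image processor and the fine-tuned classification model
processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModelForImageClassification.from_pretrained(model_id)
model.eval()

# Illustrative example image (same URL as in the Quickstart above)
url = "https://images.pexels.com/photos/1755385/pexels-photo-1755385.jpeg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Resize and normalize the image to the 224x224 input the ViT backbone expects
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Convert logits to probabilities and map the top score to one of the 15 action labels
probs = logits.softmax(dim=-1)[0]
top = probs.argmax().item()
print(model.config.id2label[top], round(probs[top].item(), 3))
```

Because the full probability vector is available here, this route also makes it easy to inspect scores for all 15 classes rather than only the top prediction returned by the pipeline.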