---
library_name: transformers
tags:
- image-classification
- vision-transformer
- computer-vision
- human-action-recognition
license: mit
language:
- en
metrics:
- accuracy
base_model:
- google/vit-base-patch16-224-in21k
pipeline_tag: image-classification
---

# ViT Human Action Recognition Model

This is a fine-tuned Vision Transformer (ViT) model for multi-class human action recognition from static images. Given a single image, it predicts one of 15 common human activities, such as "running", "eating", "texting", or "using a laptop".

---

## Model Details

- **Base Model:** [google/vit-base-patch16-224-in21k](https://huggingface.co/google/vit-base-patch16-224-in21k)
- **Developed by:** Harsha Vardhan Mannem
- **Task:** Image Classification
- **Architecture:** Vision Transformer (ViT)
- **Total Classes:** 15
- **Language:** N/A (vision model)
- **License:** MIT

---

## Supported Classes

The model predicts the following 15 activities:

- `calling`
- `clapping`
- `cycling`
- `dancing`
- `drinking`
- `eating`
- `fighting`
- `hugging`
- `laughing`
- `listening_to_music`
- `running`
- `sitting`
- `sleeping`
- `texting`
- `using_laptop`
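For downstream code that needs class ids rather than strings, the names above can be turned into the usual `id2label`/`label2id` mappings. A minimal sketch, assuming ids follow the alphabetical order listed here (the authoritative mapping is always `model.config.id2label`):

```python
# The 15 supported classes, in the alphabetical order listed above.
LABELS = [
    "calling", "clapping", "cycling", "dancing", "drinking",
    "eating", "fighting", "hugging", "laughing", "listening_to_music",
    "running", "sitting", "sleeping", "texting", "using_laptop",
]

# Hugging Face-style mappings between integer class ids and label names.
id2label = {i: name for i, name in enumerate(LABELS)}
label2id = {name: i for i, name in id2label.items()}

print(id2label[10])  # 'running' under the alphabetical-order assumption
```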

---

## Intended Use

### Direct Use

- Predicts a human activity from a single image
- Useful for research, prototyping, and computer vision pipelines involving human action detection

### Limitations & Risks

- Not designed for multi-label scenarios (only one action per image)
- Not intended for video recognition tasks
- Accuracy may degrade with poor lighting, extreme angles, or occlusions
- May reflect biases present in the training data

---

## Quickstart Example

```python
from transformers import pipeline

pipe = pipeline("image-classification", model="harsha90145/vit-human-pose-classification-model")

url = "https://images.pexels.com/photos/1755385/pexels-photo-1755385.jpeg"
output = pipe(url)
print(output)
```

Example output:

```python
[{'label': 'running', 'score': 0.92}]
```
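If you need the raw logits rather than the pipeline's formatted output, the checkpoint can also be used through the lower-level `AutoImageProcessor`/`AutoModelForImageClassification` API. A minimal sketch; the `top_prediction` and `classify` helpers are illustrative, not part of the repository:

```python
import torch
from transformers import AutoImageProcessor, AutoModelForImageClassification

REPO = "harsha90145/vit-human-pose-classification-model"

def top_prediction(logits: torch.Tensor, id2label: dict) -> tuple:
    """Map a 1-D logits tensor to the highest-scoring label and its softmax probability."""
    probs = torch.softmax(logits, dim=-1)
    score, idx = probs.max(dim=-1)
    return id2label[int(idx)], float(score)

def classify(image) -> tuple:
    """Classify a single PIL image with the fine-tuned ViT checkpoint."""
    processor = AutoImageProcessor.from_pretrained(REPO)
    model = AutoModelForImageClassification.from_pretrained(REPO)
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return top_prediction(logits[0], model.config.id2label)
```

`classify` expects a PIL image, e.g. `Image.open(requests.get(url, stream=True).raw)` for a remote file, and returns a `(label, score)` pair.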

---

## Evaluation Results

Test set size: 2520 samples

| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| calling | 0.66 | 0.69 | 0.67 | 159 |
| clapping | 0.82 | 0.81 | 0.82 | 191 |
| cycling | 0.94 | 0.92 | 0.93 | 167 |
| dancing | 0.91 | 0.83 | 0.87 | 155 |
| drinking | 0.79 | 0.82 | 0.81 | 170 |
| eating | 0.83 | 0.86 | 0.85 | 169 |
| fighting | 0.85 | 0.89 | 0.87 | 154 |
| hugging | 0.77 | 0.81 | 0.79 | 149 |
| laughing | 0.80 | 0.77 | 0.78 | 176 |
| listening_to_music | 0.75 | 0.65 | 0.70 | 179 |
| running | 0.80 | 0.88 | 0.84 | 159 |
| sitting | 0.65 | 0.67 | 0.66 | 166 |
| sleeping | 0.80 | 0.81 | 0.81 | 167 |
| texting | 0.67 | 0.66 | 0.66 | 179 |
| using_laptop | 0.72 | 0.69 | 0.71 | 180 |

- **Overall Accuracy:** 78%
- **Macro F1 Score:** 78%
- **Weighted F1 Score:** 78%
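The table above has the shape of scikit-learn's `classification_report`, so the summary metrics can be recomputed from saved predictions in a few lines. A small sketch with dummy labels standing in for the actual 2520-sample test split:

```python
from sklearn.metrics import accuracy_score, classification_report, f1_score

# Dummy true/predicted label lists; in practice these come from running the
# model over the held-out test split.
y_true = ["running", "running", "sitting", "texting"]
y_pred = ["running", "sitting", "sitting", "texting"]

print(accuracy_score(y_true, y_pred))                      # 0.75
print(round(f1_score(y_true, y_pred, average="macro"), 2)) # 0.78
print(classification_report(y_true, y_pred))               # per-class precision/recall/F1, as in the table
```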

---

## Citation

```bibtex
@misc{harsha2024humanactionvit,
  title={ViT-based Human Action Recognition Model},
  author={Harsha Vardhan Mannem},
  howpublished={\url{https://huggingface.co/harsha90145/vit-human-pose-classification-model}},
  year={2024}
}
```