maceythm/vit-90-animals

	---
	library_name: transformers
	license: apache-2.0
	base_model: google/vit-base-patch16-224
	tags:
	- image-classification
	- animals
	- vision-transformer
	- vit
	- transfer-learning
	- generated_from_trainer
	datasets:
	- imagefolder
	metrics:
	- accuracy
	model-index:
	- name: vit-90-animals
	  results:
	  - task:
	      name: Image Classification
	      type: image-classification
	    dataset:
	      name: iamsouravbanerjee/animal-image-dataset-90-different-animals
	      type: imagefolder
	      config: default
	      split: train
	      args: default
	    metrics:
	    - name: Accuracy
	      type: accuracy
	      value: 0.9796296296296296
	---
	___
	# vit-90-animals
	___

	## Model description
	This model is a fine-tuned version of the Vision Transformer [google/vit-base-patch16-224](https://huggingface.co/google/vit-base-patch16-224), trained on the [animal image dataset](https://www.kaggle.com/datasets/iamsouravbanerjee/animal-image-dataset-90-different-animals) from Kaggle to classify images into 90 different animal species. It was trained with supervised learning and reaches high accuracy on held-out data. The model can be used for general-purpose image classification in the animal domain and serves as a comparison baseline for zero-shot classification models such as CLIP.

	The model achieves the following results on the evaluation set:
	- Loss: 0.0840
	- Accuracy: 0.9796

	## Intended uses & limitations
	### Intended uses
	- Animal image classification (educational, demo, prototyping)
	- Benchmarking against zero-shot classification models
	- Use in Gradio interfaces or image analysis tools
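
For the demo and prototyping uses listed above, a minimal inference sketch could look like the following. It assumes the model is published under the repository name `maceythm/vit-90-animals` and uses the standard `transformers` image-classification pipeline. The helper names `classify` and `best_prediction` are hypothetical and were chosen for illustration only.

```python
from typing import Dict, List

# Repository name as shown on this model card (assumed to be publicly downloadable).
MODEL_ID = "maceythm/vit-90-animals"


def classify(image_path: str, top_k: int = 3) -> List[Dict]:
    """Return the top_k predictions for one image as a list of
    {'label': ..., 'score': ...} dictionaries."""
    # Imported lazily so that best_prediction() works without transformers installed.
    from transformers import pipeline

    classifier = pipeline("image-classification", model=MODEL_ID)
    return classifier(image_path, top_k=top_k)


def best_prediction(predictions: List[Dict]) -> Dict:
    """Pick the highest-scoring entry from a pipeline result."""
    return max(predictions, key=lambda p: p["score"])
```

Calling `classify("cat.jpg")` with a local image path (the file name here is a placeholder) downloads the weights on first use and returns the three most likely classes. The same `classify` function could be passed to a Gradio interface after converting its output to a label-to-score mapping.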

	### Limitations
	- The model is limited to the 90 animal classes it was trained on
	- It may not generalize well to image domains outside of its training distribution
	- Performance can degrade with poor image quality or occlusions

	## Training and evaluation data
	The model was trained on a dataset of 5,400 animal images in 90 distinct classes. The dataset was obtained from Kaggle and, according to its creator, was originally sourced from Google Images. The training/validation/test split was 80/10/10, and the label distribution is relatively balanced across classes.
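
The split arithmetic can be checked with a short sketch. It assumes a perfectly balanced dataset (the card only says "relatively balanced") and uses the training batch size of 16 reported in the hyperparameters. The resulting 270 optimizer steps per epoch match the Step column of the training results.

```python
import math

TOTAL_IMAGES = 5400
NUM_CLASSES = 90
SPLIT = {"train": 0.8, "validation": 0.1, "test": 0.1}
TRAIN_BATCH_SIZE = 16  # train_batch_size from the hyperparameter list

# Images per class, assuming a perfectly balanced dataset.
images_per_class = TOTAL_IMAGES // NUM_CLASSES

# Number of images in each split.
split_sizes = {name: round(TOTAL_IMAGES * frac) for name, frac in SPLIT.items()}

# Optimizer steps needed to see the training split once.
steps_per_epoch = math.ceil(split_sizes["train"] / TRAIN_BATCH_SIZE)

print(images_per_class)  # 60
print(split_sizes)       # {'train': 4320, 'validation': 540, 'test': 540}
print(steps_per_epoch)   # 270
```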

	Evaluation was conducted on the test split and compared to results from a zero-shot model (openai/clip-vit-large-patch14) using the same label set.

	## Training procedure
	- Base model: google/vit-base-patch16-224
	- Fine-tuning method: Supervised training using the Hugging Face Trainer class
	- Data augmentation: Applied during training (e.g., RandomHorizontalFlip, ColorJitter)
	- Training length: 5 epochs, run both with and without augmentation
	- Optimizer: AdamW (default settings)
	- Evaluation metrics: Accuracy, precision, and recall
	- Best performance (no augmentation): 98.3% test accuracy
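
The evaluation metrics named above can be illustrated with a small, dependency-free sketch. The card does not say how precision and recall were averaged across the 90 classes, so macro averaging is an assumption here, and `classification_metrics` is a hypothetical helper rather than the code used for this model.

```python
from collections import Counter
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Accuracy plus macro-averaged precision and recall (averaging mode is an assumption)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def safe_div(num: int, den: int) -> float:
        return num / den if den else 0.0

    labels = sorted(set(y_true) | set(y_pred))
    precision = sum(safe_div(tp[c], tp[c] + fp[c]) for c in labels) / len(labels)
    recall = sum(safe_div(tp[c], tp[c] + fn[c]) for c in labels) / len(labels)
    accuracy = sum(tp.values()) / len(y_true)
    return {"accuracy": accuracy, "precision": precision, "recall": recall}
```

For example, with true labels `["cat", "cat", "dog", "dog"]` and predictions `["cat", "dog", "dog", "dog"]`, accuracy and macro recall are both 0.75 and macro precision is about 0.83.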

	### Training hyperparameters
	The following hyperparameters were used during training:
	- learning_rate: 0.0003
	- train_batch_size: 16
	- eval_batch_size: 8
	- seed: 42
	- optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
	- lr_scheduler_type: linear
	- num_epochs: 5
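
To make the `linear` scheduler setting concrete, the sketch below computes the learning rate at a given optimizer step. It assumes no warmup (the list above does not mention any) and 1,350 total steps, the final value in the Step column of the training results. `linear_lr` is a hypothetical helper that mirrors the usual linear-decay rule, not the scheduler code itself.

```python
def linear_lr(step: int,
              base_lr: float = 3e-4,     # learning_rate from the list above
              total_steps: int = 1350,   # 5 epochs x 270 steps
              warmup_steps: int = 0) -> float:  # assumption: no warmup
    """Learning rate at a given step under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Linear decay from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 270, 675, 1350):
    print(step, linear_lr(step))
```

Under these assumptions the rate starts at 0.0003, falls to half of that at step 675, and reaches zero at the end of epoch 5.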

	### Training results
	| Training Loss | Epoch | Step | Validation Loss | Accuracy |
	|:-------------:|:-----:|:----:|:---------------:|:--------:|
	| 1.2021 | 1.0 | 270 | 0.3500 | 0.9611 |
	| 0.2978 | 2.0 | 540 | 0.1766 | 0.9685 |
	| 0.1886 | 3.0 | 810 | 0.1500 | 0.9685 |
	| 0.1706 | 4.0 | 1080 | 0.1409 | 0.9685 |
	| 0.1678 | 5.0 | 1350 | 0.1373 | 0.9667 |
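
The accuracy column can be translated back into image counts. This sketch assumes the evaluation split holds 540 images (10% of 5,400), which the card implies but does not state directly. Under that assumption, each reported accuracy corresponds to a whole number of correctly classified images.

```python
EVAL_IMAGES = 540  # assumption: 10% of the 5,400 images

# (epoch, validation loss, accuracy) copied from the table above.
RESULTS = [
    (1, 0.3500, 0.9611),
    (2, 0.1766, 0.9685),
    (3, 0.1500, 0.9685),
    (4, 0.1409, 0.9685),
    (5, 0.1373, 0.9667),
]

# Number of correctly classified evaluation images per epoch.
correct_counts = [round(acc * EVAL_IMAGES) for _, _, acc in RESULTS]

# Epoch with the lowest validation loss.
best_loss_epoch = min(RESULTS, key=lambda row: row[1])[0]

print(correct_counts)   # [519, 523, 523, 523, 522]
print(best_loss_epoch)  # 5
```

Validation loss keeps falling through epoch 5, while accuracy plateaus at 523 of 540 images from epoch 2 to 4 and drops by one image in the final epoch.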

	### Framework versions
	- Transformers 4.50.0
	- PyTorch 2.6.0+cu124
	- Datasets 3.4.1
	- Tokenizers 0.21.1