---
language: en
license: apache-2.0
tags:
- image-classification
- vision
- vit
- house-condition
datasets:
- custom
metrics:
- accuracy
library_name: transformers
pipeline_tag: image-classification
---

# Fine-tuned ViT for House Condition Classification

This model is a fine-tuned version of [google/vit-base-patch16-224-in21k](https://huggingface.co/google/vit-base-patch16-224-in21k) for classifying house conditions into four categories.

## Model Description

This Vision Transformer (ViT) model has been fine-tuned to classify house images into four condition categories:

- **good** (dobre)
- **unknown** (nepoznato)
- **ruined** (oronule)
- **medium** (srednje)

## Training Details

### Training Data

- **Total dataset**: 935 images
- **Training set**: 776 images
- **Validation set**: 80 images
- **Test set**: 79 images
- **Classes**: 4 (dobre, nepoznato, oronule, srednje)
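
The 776/80/79 split of the 935 images can be reproduced with a simple seeded shuffle. A minimal sketch, assuming seed 42 (listed under the hyperparameters below) and a hypothetical file list — the exact split procedure and file names are not published:

```python
import random

# Hypothetical file names; the real dataset paths are not published.
paths = [f"house_{i:04d}.jpg" for i in range(935)]

random.Random(42).shuffle(paths)  # seed 42 for reproducibility
train, val, test = paths[:776], paths[776:856], paths[856:]

print(len(train), len(val), len(test))  # 776 80 79
```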

### Training Hyperparameters

- **Epochs**: 10
- **Batch size**: 16 per device
- **Learning rate**: 2e-5
- **Optimizer**: AdamW
- **Seed**: 42 (for reproducibility)
- **Training time**: 5m 45s
- **Throughput**: 22.43 samples/second
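
The reported wall time and throughput are mutually consistent, as a quick check shows:

```python
samples_seen = 776 * 10          # training images × epochs
seconds = samples_seen / 22.43   # reported throughput in samples/second
print(round(seconds))            # 346 — about 5m 46s, matching the reported 5m 45s
```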

## Evaluation Results

### Validation Set Performance

- **Accuracy**: 81.2%
- **Loss**: 0.5629

### Training Set Performance

- **Final Training Loss**: 0.5295

### Per-Class Metrics (Validation)

| Class   | Precision | Recall | F1-Score | Support |
|---------|-----------|--------|----------|---------|
| good    | 0.78      | 0.70   | 0.74     | 10      |
| unknown | 1.00      | 0.83   | 0.91     | 24      |
| ruined  | 0.62      | 1.00   | 0.77     | 15      |
| medium  | 0.85      | 0.74   | 0.79     | 31      |

**Overall Metrics:**

- Accuracy: 81.2% (65/80 correct)
- Macro Average: Precision=0.81, Recall=0.82, F1=0.80
- Weighted Average: Precision=0.84, Recall=0.81, F1=0.82

### Confusion Matrix (Validation)

```
Predicted →
           good  unknown  ruined  medium
good    [    7        0       0       3 ]
unknown [    1       20       2       1 ]
ruined  [    0        0      15       0 ]
medium  [    1        0       7      23 ]
```

**Key Insights:**

- 'unknown' has perfect precision (1.00): no false positives
- 'ruined' has perfect recall (1.00): the model catches every ruined house
- Main confusion: 'medium' houses are sometimes predicted as 'ruined' (7 cases)
- 'good' houses are occasionally misclassified as 'medium' (3 cases)
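
The per-class numbers in the table above follow directly from this matrix; a short plain-Python sketch that recomputes them:

```python
labels = ["good", "unknown", "ruined", "medium"]
# Rows = true class, columns = predicted class (validation confusion matrix above).
cm = [
    [7,  0,  0,  3],
    [1, 20,  2,  1],
    [0,  0, 15,  0],
    [1,  0,  7, 23],
]

for i, name in enumerate(labels):
    tp = cm[i][i]
    precision = tp / sum(row[i] for row in cm)   # column sum = all predicted as class i
    recall = tp / sum(cm[i])                     # row sum = all truly class i
    f1 = 2 * precision * recall / (precision + recall)
    print(f"{name}: P={precision:.2f} R={recall:.2f} F1={f1:.2f}")

accuracy = sum(cm[i][i] for i in range(4)) / sum(map(sum, cm))
# 65/80 = 81.25%, the 81.2% reported above
print(f"accuracy={accuracy:.4f}")
```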

## Usage

```python
from transformers import ViTForImageClassification, ViTImageProcessor
from PIL import Image
import torch

# Load model and processor
model = ViTForImageClassification.from_pretrained("YOUR_USERNAME/YOUR_MODEL_NAME")
processor = ViTImageProcessor.from_pretrained("YOUR_USERNAME/YOUR_MODEL_NAME")

# Load and preprocess image
image = Image.open("path_to_image.jpg").convert("RGB")
inputs = processor(image, return_tensors="pt")

# Make prediction
with torch.no_grad():
    outputs = model(**inputs)

predicted_class_idx = outputs.logits.argmax(-1).item()
# id2label keys are ints once the config is loaded, so no str() conversion
predicted_label = model.config.id2label[predicted_class_idx]

print(f"Predicted class: {predicted_label}")

# Get probabilities
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)[0]
for idx, prob in enumerate(probs):
    label = model.config.id2label[idx]
    print(f"{label}: {prob.item():.2%}")
```

## Limitations and Bias

- The model was trained on a specific dataset of house images and may not generalize well to different architectural styles or regions
- Performance varies by class; see the validation metrics above
- The model may have difficulty distinguishing between adjacent condition categories (e.g. 'medium' vs. 'ruined')
- Dataset size: 935 images, which is small for deep learning
- Images come from a specific geographical/architectural context

## Training Procedure

The model was fine-tuned using the Hugging Face Transformers library with the following approach:

1. **Pre-trained weights**: Initialized from google/vit-base-patch16-224-in21k
2. **Classification head**: Replaced with a new 4-class classifier
3. **Fine-tuning**: All model parameters were updated on the custom dataset
4. **Data preprocessing**: Images converted to RGB to ensure consistent 3-channel input
5. **Evaluation strategy**: Evaluated every 50 steps with checkpoint saving
6. **Best model selection**: Best checkpoint automatically loaded based on validation performance
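
The steps above correspond to a standard `Trainer` setup. A minimal configuration sketch, assuming the hyperparameters listed earlier; the exact training script is not published, and `train_ds`, `val_ds`, and `compute_metrics` are placeholders:

```python
from transformers import (ViTForImageClassification, Trainer,
                          TrainingArguments)

id2label = {0: "dobre", 1: "nepoznato", 2: "oronule", 3: "srednje"}

# Steps 1-2: pre-trained backbone with a fresh 4-class head.
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",
    num_labels=4,
    id2label=id2label,
    label2id={v: k for k, v in id2label.items()},
)

# Steps 5-6: evaluate/save every 50 steps, reload the best checkpoint at the end.
args = TrainingArguments(
    output_dir="vit-house-condition",
    num_train_epochs=10,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    seed=42,
    eval_strategy="steps",
    eval_steps=50,
    save_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)

# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds, eval_dataset=val_ds,
#                   compute_metrics=compute_metrics)
# trainer.train()
```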

## Base Model

[google/vit-base-patch16-224-in21k](https://huggingface.co/google/vit-base-patch16-224-in21k)

Vision Transformer (ViT) model pre-trained on ImageNet-21k at resolution 224x224.
## Framework Versions |
|
|
|
|
|
- Transformers: 4.57.1 |
|
|
- PyTorch: 2.x |
|
|
- Datasets: 3.x |
|
|
- Python: 3.13 |
|
|
|
|

## Citation

If you use this model, please cite:

```bibtex
@misc{house-condition-vit,
  author       = {Your Name},
  title        = {Fine-tuned ViT for House Condition Classification},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/YOUR_USERNAME/YOUR_MODEL_NAME}}
}
```

## Model Card Authors

This model card was created by the model author.

## Additional Information

- Repository: [GitHub Repository URL]
- Contact: [Your Email or Contact]