---
language: en
license: apache-2.0
tags:
- image-classification
- vision
- vit
- house-condition
datasets:
- custom
metrics:
- accuracy
library_name: transformers
pipeline_tag: image-classification
---
# Fine-tuned ViT for House Condition Classification
This model is a fine-tuned version of [google/vit-base-patch16-224-in21k](https://huggingface.co/google/vit-base-patch16-224-in21k) for classifying house conditions into 4 categories.
## Model Description
This Vision Transformer (ViT) model has been fine-tuned to classify house images into four condition categories:
- **good** (dobre)
- **unknown** (nepoznato)
- **ruined** (oronule)
- **medium** (srednje)
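Assuming the labels were indexed alphabetically by their original (Serbian/Croatian) names — a common default, but an assumption here; verify against `model.config.id2label` — the mapping would look like this:

```python
# Hypothetical label mapping, assuming alphabetical ordering of the four
# class names. Check model.config.id2label for the actual order.
id2label = {0: "dobre", 1: "nepoznato", 2: "oronule", 3: "srednje"}
label2id = {name: idx for idx, name in id2label.items()}

print(label2id["oronule"])  # index of the "ruined" class
```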
## Training Details
### Training Data
- **Total dataset**: 935 images
- **Training set**: 776 images
- **Validation set**: 80 images
- **Test set**: 79 images
- **Classes**: 4 (dobre, nepoznato, oronule, srednje)
### Training Hyperparameters
- **Epochs**: 10
- **Batch size**: 16 per device
- **Learning rate**: 2e-5
- **Optimizer**: AdamW
- **Seed**: 42 (for reproducibility)
- **Training time**: 5m 45s
- **Samples per second**: 22.43
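The reported training time and throughput are mutually consistent; a quick back-of-the-envelope check using the figures above:

```python
# Consistency check of the reported training figures.
train_images = 776
epochs = 10
samples_per_second = 22.43

total_samples = train_images * epochs          # 7760 samples processed
seconds = total_samples / samples_per_second   # ~346 s
minutes, rem = divmod(round(seconds), 60)
print(f"{minutes}m {rem}s")  # → 5m 46s, close to the reported 5m 45s
```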
## Evaluation Results
### Validation Set Performance
- **Accuracy**: 81.2%
- **Loss**: 0.5629
### Training Set Performance
- **Final Training Loss**: 0.5295
### Per-Class Metrics (Validation)
| Class | Precision | Recall | F1-Score | Support |
|------------|-----------|--------|----------|---------|
| good | 0.78 | 0.70 | 0.74 | 10 |
| unknown | 1.00 | 0.83 | 0.91 | 24 |
| ruined | 0.62 | 1.00 | 0.77 | 15 |
| medium | 0.85 | 0.74 | 0.79 | 31 |
**Overall Metrics:**
- Accuracy: 81.2% (65/80 correct)
- Macro Average: Precision=0.81, Recall=0.82, F1=0.80
- Weighted Average: Precision=0.84, Recall=0.81, F1=0.82
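The macro and weighted averages follow directly from the per-class table; a small sanity check computing them from the listed precision/recall and support values:

```python
# Per-class (precision, recall, support) from the validation table above.
metrics = {
    "good":    (0.78, 0.70, 10),
    "unknown": (1.00, 0.83, 24),
    "ruined":  (0.62, 1.00, 15),
    "medium":  (0.85, 0.74, 31),
}
total = sum(s for _, _, s in metrics.values())  # 80 validation images

# Macro average: unweighted mean over classes.
macro_p = sum(p for p, _, _ in metrics.values()) / len(metrics)
macro_r = sum(r for _, r, _ in metrics.values()) / len(metrics)
# Weighted average: mean weighted by class support.
weighted_p = sum(p * s for p, _, s in metrics.values()) / total

print(f"macro P={macro_p:.2f} R={macro_r:.2f}, weighted P={weighted_p:.2f}")
```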
### Confusion Matrix (Validation)
```
              Predicted →
            good  unknown  ruined  medium
Actual ↓
good     [    7      0       0       3  ]
unknown  [    1     20       2       1  ]
ruined   [    0      0      15       0  ]
medium   [    1      0       7      23  ]
```
**Key Insights:**
- 'unknown' class has perfect precision (1.00) - no false positives
- 'ruined' class has perfect recall (1.00) - catches all ruined houses
- Main confusion: 'medium' condition sometimes mistaken for 'ruined' (7 cases)
- 'good' houses occasionally misclassified as 'medium' (3 cases)
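The per-class numbers in the table can be recovered from the confusion matrix directly (rows are actual classes, columns are predictions):

```python
labels = ["good", "unknown", "ruined", "medium"]
cm = [  # rows: actual, columns: predicted
    [7, 0, 0, 3],
    [1, 20, 2, 1],
    [0, 0, 15, 0],
    [1, 0, 7, 23],
]

for i, name in enumerate(labels):
    tp = cm[i][i]
    precision = tp / sum(row[i] for row in cm)  # true positives / column sum
    recall = tp / sum(cm[i])                    # true positives / row sum
    print(f"{name}: precision={precision:.2f}, recall={recall:.2f}")
```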
## Usage
```python
from transformers import ViTForImageClassification, ViTImageProcessor
from PIL import Image
import torch

# Load model and processor
model = ViTForImageClassification.from_pretrained("YOUR_USERNAME/YOUR_MODEL_NAME")
processor = ViTImageProcessor.from_pretrained("YOUR_USERNAME/YOUR_MODEL_NAME")

# Load and preprocess image
image = Image.open("path_to_image.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

# Make prediction
with torch.no_grad():
    outputs = model(**inputs)

predicted_class_idx = outputs.logits.argmax(-1).item()
predicted_label = model.config.id2label[predicted_class_idx]  # id2label keys are ints
print(f"Predicted class: {predicted_label}")

# Get per-class probabilities
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)[0]
for idx, prob in enumerate(probs):
    label = model.config.id2label[idx]
    print(f"{label}: {prob.item():.2%}")
```
## Limitations and Bias
- The model was trained on a specific dataset of house images and may not generalize well to different architectural styles or regions
- Performance varies by class — see the per-class validation metrics above for details
- The model may have difficulty distinguishing between similar condition categories
- Dataset size: 935 images (relatively small for deep learning)
- Images are from a specific geographical/architectural context
## Training Procedure
The model was fine-tuned using the Hugging Face Transformers library with the following approach:
1. **Pre-trained weights**: Initialized from google/vit-base-patch16-224-in21k
2. **Classification head**: Replaced with a new 4-class classifier
3. **Fine-tuning**: All model parameters were fine-tuned on the custom dataset
4. **Data preprocessing**: Images converted to RGB to ensure consistent 3-channel input
5. **Evaluation strategy**: Evaluated every 50 steps with checkpoint saving
6. **Best model selection**: Best model automatically loaded based on validation performance
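The procedure above corresponds to a fairly standard `Trainer` setup. A minimal sketch, assuming the hyperparameters listed earlier; the output path and the `train_ds`/`eval_ds` dataset objects are placeholders, not the author's actual script:

```python
from transformers import Trainer, TrainingArguments, ViTForImageClassification

# Replace the classification head with a new 4-class classifier.
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",
    num_labels=4,
)

args = TrainingArguments(
    output_dir="vit-house-condition",  # placeholder path
    num_train_epochs=10,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    seed=42,
    eval_strategy="steps",        # evaluate every 50 steps
    eval_steps=50,
    save_strategy="steps",        # checkpoint on the same schedule
    save_steps=50,
    load_best_model_at_end=True,  # reload the best checkpoint after training
)

# train_ds / eval_ds are assumed to yield pixel_values and labels.
trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()
```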
## Base Model
[google/vit-base-patch16-224-in21k](https://huggingface.co/google/vit-base-patch16-224-in21k)
Vision Transformer (ViT) model pre-trained on ImageNet-21k at resolution 224x224.
## Framework Versions
- Transformers: 4.57.1
- PyTorch: 2.x
- Datasets: 3.x
- Python: 3.13
## Citation
If you use this model, please cite:
```bibtex
@misc{house-condition-vit,
author = {Your Name},
title = {Fine-tuned ViT for House Condition Classification},
year = {2025},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/YOUR_USERNAME/YOUR_MODEL_NAME}}
}
```
## Model Card Authors
This model card was created by the model author.
## Additional Information
- Repository: [GitHub Repository URL]
- Contact: [Your Email or Contact]