---
language: eng
license: apache-2.0
tags:
- image-classification
- vision
- vit
- house-condition
datasets:
- custom
metrics:
- accuracy
library_name: transformers
pipeline_tag: image-classification
---
# Fine-tuned ViT for House Condition Classification
This model is a fine-tuned version of [google/vit-base-patch16-224-in21k](https://huggingface.co/google/vit-base-patch16-224-in21k) for classifying house conditions into 4 categories.
## Model Description
This Vision Transformer (ViT) model has been fine-tuned to classify house images into four condition categories:
- **good** (dobre)
- **unknown** (nepoznato)
- **ruined** (oronule)
- **medium** (srednje)
## Training Details
### Training Data
- **Total dataset**: 935 images
- **Training set**: 776 images
- **Validation set**: 80 images
- **Test set**: 79 images
- **Classes**: 4 (dobre, nepoznato, oronule, srednje)
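The exact splitting code is not part of this card, but a split with these sizes can be reproduced with the `datasets` library. A minimal sketch, assuming an `imagefolder`-style layout (one subdirectory per class) and a placeholder `data/houses` path:
```python
from datasets import load_dataset

# Load images from a class-per-subdirectory layout (path is a placeholder)
dataset = load_dataset("imagefolder", data_dir="data/houses", split="train")

# 935 images -> 776 train / 80 validation / 79 test, seeded for reproducibility
split = dataset.train_test_split(test_size=159, seed=42)
holdout = split["test"].train_test_split(test_size=79, seed=42)

train_ds, val_ds, test_ds = split["train"], holdout["train"], holdout["test"]
print(len(train_ds), len(val_ds), len(test_ds))  # 776 80 79
```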
### Training Hyperparameters
- **Epochs**: 10.0
- **Batch size**: 16 per device
- **Learning rate**: 2e-5
- **Optimizer**: AdamW
- **Seed**: 42 (for reproducibility)
- **Training time**: 5m 45s
- **Samples per second**: 22.43
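In `TrainingArguments` form, these settings look roughly as follows (a sketch, not the exact training script; `output_dir` is a placeholder, and AdamW needs no explicit flag because it is the `Trainer` default):
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="vit-house-condition",  # placeholder name
    num_train_epochs=10,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    seed=42,
)
```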
## Evaluation Results
### Validation Set Performance
- **Accuracy**: 81.2%
- **Loss**: 0.5629
### Training Set Performance
- **Final Training Loss**: 0.5295
### Per-Class Metrics (Validation)
| Class | Precision | Recall | F1-Score | Support |
|------------|-----------|--------|----------|---------|
| good | 0.78 | 0.70 | 0.74 | 10 |
| unknown | 1.00 | 0.83 | 0.91 | 24 |
| ruined | 0.62 | 1.00 | 0.77 | 15 |
| medium | 0.85 | 0.74 | 0.79 | 31 |
**Overall Metrics:**
- Accuracy: 81.2% (65/80 correct)
- Macro Average: Precision=0.81, Recall=0.82, F1=0.80
- Weighted Average: Precision=0.84, Recall=0.81, F1=0.82
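These numbers can be reproduced from the model's validation predictions with scikit-learn. A minimal sketch, where `y_true` and `y_pred` would be the real integer class ids (dummy values shown here only so the snippet runs):
```python
from sklearn.metrics import classification_report, confusion_matrix

labels = ["good", "unknown", "ruined", "medium"]

# Replace with the actual validation labels and model predictions
y_true = [0, 1, 2, 3, 1, 2]
y_pred = [0, 1, 2, 2, 1, 2]

print(classification_report(y_true, y_pred, target_names=labels))
print(confusion_matrix(y_true, y_pred))
```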
### Confusion Matrix (Validation)
```
              Predicted →
            good  unknown  ruined  medium
Actual ↓
good     [    7       0       0       3  ]
unknown  [    1      20       2       1  ]
ruined   [    0       0      15       0  ]
medium   [    1       0       7      23  ]
```
**Key Insights:**
- The 'unknown' class has perfect precision (1.00): no false positives
- The 'ruined' class has perfect recall (1.00): it catches every ruined house
- Main confusion: 'medium' houses are sometimes predicted as 'ruined' (7 cases)
- 'good' houses are occasionally misclassified as 'medium' (3 cases)
## Usage
```python
from transformers import ViTForImageClassification, ViTImageProcessor
from PIL import Image
import torch

# Load model and processor
model = ViTForImageClassification.from_pretrained("YOUR_USERNAME/YOUR_MODEL_NAME")
processor = ViTImageProcessor.from_pretrained("YOUR_USERNAME/YOUR_MODEL_NAME")
model.eval()

# Load and preprocess image (RGB conversion matches the training preprocessing)
image = Image.open("path_to_image.jpg").convert("RGB")
inputs = processor(image, return_tensors="pt")

# Make prediction
with torch.no_grad():
    outputs = model(**inputs)

predicted_class_idx = outputs.logits.argmax(-1).item()
predicted_label = model.config.id2label[predicted_class_idx]  # id2label keys are ints
print(f"Predicted class: {predicted_label}")

# Get probabilities
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)[0]
for idx, prob in enumerate(probs):
    label = model.config.id2label[idx]
    print(f"{label}: {prob.item():.2%}")
```
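Alternatively, the high-level `pipeline` API handles preprocessing and label mapping in one call:
```python
from transformers import pipeline

classifier = pipeline("image-classification", model="YOUR_USERNAME/YOUR_MODEL_NAME")
print(classifier("path_to_image.jpg"))  # list of {label, score} dicts
```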
## Limitations and Bias
- The model was trained on a specific dataset of house images and may not generalize well to different architectural styles or regions
- Performance varies by class - see validation metrics for details
- The model may have difficulty distinguishing between similar condition categories
- Dataset size: 935 images (relatively small for deep learning)
- Images are from a specific geographical/architectural context
## Training Procedure
The model was fine-tuned using the Hugging Face Transformers library with the following approach:
1. **Pre-trained weights**: Initialized from google/vit-base-patch16-224-in21k
2. **Classification head**: Replaced with a new 4-class classifier
3. **Fine-tuning**: All model parameters were fine-tuned on the custom dataset
4. **Data preprocessing**: Images converted to RGB to ensure consistent 3-channel input
5. **Evaluation strategy**: Evaluated every 50 steps with checkpoint saving
6. **Best model selection**: Best model automatically loaded based on validation performance
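Wired together, steps 1, 2, 5, and 6 look roughly like the sketch below. It assumes `train_ds`/`val_ds` from the data-split sketch above (already mapped to `pixel_values`/`label` pairs) and a `compute_metrics` function returning an `{"accuracy": ...}` dict; neither is part of this card.
```python
from transformers import ViTForImageClassification, TrainingArguments, Trainer

labels = ["dobre", "nepoznato", "oronule", "srednje"]

# Steps 1-2: pre-trained backbone plus a freshly initialized 4-class head
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# Steps 5-6: evaluate/save every 50 steps, then reload the best checkpoint
training_args = TrainingArguments(
    output_dir="vit-house-condition",  # placeholder name
    eval_strategy="steps",             # "evaluation_strategy" in older releases
    eval_steps=50,
    save_strategy="steps",
    save_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,           # assumed from the data-split sketch
    eval_dataset=val_ds,
    compute_metrics=compute_metrics,  # assumed accuracy function
)
trainer.train()
```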
## Base Model
[google/vit-base-patch16-224-in21k](https://huggingface.co/google/vit-base-patch16-224-in21k)
Vision Transformer (ViT) model pre-trained on ImageNet-21k at resolution 224x224.
## Framework Versions
- Transformers: 4.57.1
- PyTorch: 2.x
- Datasets: 3.x
- Python: 3.13
## Citation
If you use this model, please cite:
```bibtex
@misc{house-condition-vit,
author = {Your Name},
title = {Fine-tuned ViT for House Condition Classification},
year = {2025},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/YOUR_USERNAME/YOUR_MODEL_NAME}}
}
```
## Model Card Authors
This model card was created by the model author.
## Additional Information
- Repository: [GitHub Repository URL]
- Contact: [Your Email or Contact]