---
language: en
license: apache-2.0
tags:
- image-classification
- vision
- vit
- house-condition
datasets:
- custom
metrics:
- accuracy
library_name: transformers
pipeline_tag: image-classification
---

# Fine-tuned ViT for House Condition Classification

This model is a fine-tuned version of [google/vit-base-patch16-224-in21k](https://huggingface.co/google/vit-base-patch16-224-in21k) for classifying house conditions into four categories.

## Model Description

This Vision Transformer (ViT) model has been fine-tuned to classify house images into four condition categories:

- **good** (dobre)
- **unknown** (nepoznato)
- **ruined** (oronule)
- **medium** (srednje)

## Training Details

### Training Data

- **Total dataset**: 935 images
- **Training set**: 776 images
- **Validation set**: 80 images
- **Test set**: 79 images
- **Classes**: 4 (dobre, nepoznato, oronule, srednje)
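
The 776/80/79 split of the 935 images can be reproduced with a simple seeded shuffle. A minimal sketch, assuming seed 42 (listed under the hyperparameters below) and a hypothetical file list — the exact split procedure and file names are not published:

```python
import random

# Hypothetical file names; the real dataset paths are not published.
paths = [f"house_{i:04d}.jpg" for i in range(935)]

random.Random(42).shuffle(paths)  # seed 42 for reproducibility
train, val, test = paths[:776], paths[776:856], paths[856:]

print(len(train), len(val), len(test))  # 776 80 79
```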

### Training Hyperparameters

- **Epochs**: 10
- **Batch size**: 16 per device
- **Learning rate**: 2e-5
- **Optimizer**: AdamW
- **Seed**: 42 (for reproducibility)
- **Training time**: 5m 45s
- **Throughput**: 22.43 samples/second
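
The reported wall time and throughput are mutually consistent, as a quick check shows:

```python
samples_seen = 776 * 10          # training images × epochs
seconds = samples_seen / 22.43   # reported throughput in samples/second
print(round(seconds))            # 346 — about 5m 46s, matching the reported 5m 45s
```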

## Evaluation Results

### Validation Set Performance

- **Accuracy**: 81.2%
- **Loss**: 0.5629

### Training Set Performance

- **Final Training Loss**: 0.5295

### Per-Class Metrics (Validation)

| Class   | Precision | Recall | F1-Score | Support |
|---------|-----------|--------|----------|---------|
| good    | 0.78      | 0.70   | 0.74     | 10      |
| unknown | 1.00      | 0.83   | 0.91     | 24      |
| ruined  | 0.62      | 1.00   | 0.77     | 15      |
| medium  | 0.85      | 0.74   | 0.79     | 31      |

**Overall Metrics:**

- Accuracy: 81.2% (65/80 correct)
- Macro Average: Precision=0.81, Recall=0.82, F1=0.80
- Weighted Average: Precision=0.84, Recall=0.81, F1=0.82

### Confusion Matrix (Validation)

```
Predicted →
           good  unknown  ruined  medium
good    [    7        0       0       3 ]
unknown [    1       20       2       1 ]
ruined  [    0        0      15       0 ]
medium  [    1        0       7      23 ]
```

**Key Insights:**

- 'unknown' has perfect precision (1.00): no false positives
- 'ruined' has perfect recall (1.00): the model catches every ruined house
- Main confusion: 'medium' houses are sometimes predicted as 'ruined' (7 cases)
- 'good' houses are occasionally misclassified as 'medium' (3 cases)
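
The per-class numbers in the table above follow directly from this matrix; a short plain-Python sketch that recomputes them:

```python
labels = ["good", "unknown", "ruined", "medium"]
# Rows = true class, columns = predicted class (validation confusion matrix above).
cm = [
    [7,  0,  0,  3],
    [1, 20,  2,  1],
    [0,  0, 15,  0],
    [1,  0,  7, 23],
]

for i, name in enumerate(labels):
    tp = cm[i][i]
    precision = tp / sum(row[i] for row in cm)   # column sum = all predicted as class i
    recall = tp / sum(cm[i])                     # row sum = all truly class i
    f1 = 2 * precision * recall / (precision + recall)
    print(f"{name}: P={precision:.2f} R={recall:.2f} F1={f1:.2f}")

accuracy = sum(cm[i][i] for i in range(4)) / sum(map(sum, cm))
# 65/80 = 81.25%, the 81.2% reported above
print(f"accuracy={accuracy:.4f}")
```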

## Usage

```python
from transformers import ViTForImageClassification, ViTImageProcessor
from PIL import Image
import torch

# Load model and processor
model = ViTForImageClassification.from_pretrained("YOUR_USERNAME/YOUR_MODEL_NAME")
processor = ViTImageProcessor.from_pretrained("YOUR_USERNAME/YOUR_MODEL_NAME")

# Load and preprocess image
image = Image.open("path_to_image.jpg").convert("RGB")
inputs = processor(image, return_tensors="pt")

# Make prediction
with torch.no_grad():
    outputs = model(**inputs)

predicted_class_idx = outputs.logits.argmax(-1).item()
# id2label keys are ints once the config is loaded, so no str() conversion
predicted_label = model.config.id2label[predicted_class_idx]

print(f"Predicted class: {predicted_label}")

# Get probabilities
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)[0]
for idx, prob in enumerate(probs):
    label = model.config.id2label[idx]
    print(f"{label}: {prob.item():.2%}")
```

## Limitations and Bias

- The model was trained on a specific dataset of house images and may not generalize well to different architectural styles or regions
- Performance varies by class; see the validation metrics above
- The model may have difficulty distinguishing between adjacent condition categories (e.g. 'medium' vs. 'ruined')
- Dataset size: 935 images, which is small for deep learning
- Images come from a specific geographical/architectural context

## Training Procedure

The model was fine-tuned using the Hugging Face Transformers library with the following approach:

1. **Pre-trained weights**: Initialized from google/vit-base-patch16-224-in21k
2. **Classification head**: Replaced with a new 4-class classifier
3. **Fine-tuning**: All model parameters were updated on the custom dataset
4. **Data preprocessing**: Images converted to RGB to ensure consistent 3-channel input
5. **Evaluation strategy**: Evaluated every 50 steps with checkpoint saving
6. **Best model selection**: Best checkpoint automatically loaded based on validation performance
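
The steps above correspond to a standard `Trainer` setup. A minimal configuration sketch, assuming the hyperparameters listed earlier; the exact training script is not published, and `train_ds`, `val_ds`, and `compute_metrics` are placeholders:

```python
from transformers import (ViTForImageClassification, Trainer,
                          TrainingArguments)

id2label = {0: "dobre", 1: "nepoznato", 2: "oronule", 3: "srednje"}

# Steps 1-2: pre-trained backbone with a fresh 4-class head.
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",
    num_labels=4,
    id2label=id2label,
    label2id={v: k for k, v in id2label.items()},
)

# Steps 5-6: evaluate/save every 50 steps, reload the best checkpoint at the end.
args = TrainingArguments(
    output_dir="vit-house-condition",
    num_train_epochs=10,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    seed=42,
    eval_strategy="steps",
    eval_steps=50,
    save_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)

# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds, eval_dataset=val_ds,
#                   compute_metrics=compute_metrics)
# trainer.train()
```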

## Base Model

[google/vit-base-patch16-224-in21k](https://huggingface.co/google/vit-base-patch16-224-in21k)

Vision Transformer (ViT) model pre-trained on ImageNet-21k at resolution 224x224.
## Framework Versions |
|
|
|
|
|
- Transformers: 4.57.1 |
|
|
- PyTorch: 2.x |
|
|
- Datasets: 3.x |
|
|
- Python: 3.13 |
|
|
|
|

## Citation

If you use this model, please cite:

```bibtex
@misc{house-condition-vit,
  author       = {Your Name},
  title        = {Fine-tuned ViT for House Condition Classification},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/YOUR_USERNAME/YOUR_MODEL_NAME}}
}
```

## Model Card Authors

This model card was created by the model author.

## Additional Information

- Repository: [GitHub Repository URL]
- Contact: [Your Email or Contact]