---
language: eng
license: apache-2.0
tags:
- image-classification
- vision
- vit
- house-condition
datasets:
- custom
metrics:
- accuracy
library_name: transformers
pipeline_tag: image-classification
---

# Fine-tuned ViT for House Condition Classification

This model is a fine-tuned version of [google/vit-base-patch16-224-in21k](https://huggingface.co/google/vit-base-patch16-224-in21k) for classifying house conditions into 4 categories.

## Model Description

This Vision Transformer (ViT) model has been fine-tuned to classify house images into four condition categories:
- **good** (dobre)
- **unknown** (nepoznato)
- **ruined** (oronule)
- **medium** (srednje)

## Training Details

### Training Data
- **Total dataset**: 935 images
- **Training set**: 776 images
- **Validation set**: 80 images
- **Test set**: 79 images
- **Classes**: 4 (dobre, nepoznato, oronule, srednje)

### Training Hyperparameters
- **Epochs**: 10
- **Batch size**: 16 per device
- **Learning rate**: 2e-5
- **Optimizer**: AdamW
- **Seed**: 42 (for reproducibility)
- **Training time**: 5m 45s
- **Samples per second**: 22.43
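
The hyperparameters above could map onto `TrainingArguments` roughly as shown below. This is a sketch under assumptions, not the author's training script: `output_dir` and the `build_training_args` and `steps_per_epoch` helpers are hypothetical, the 50-step evaluation and save cadence is taken from the Training Procedure section, and a single device without gradient accumulation is assumed for the step arithmetic. No optimizer is set because the `Trainer` defaults to an AdamW variant.

```python
import math

# Values stated in this model card.
TRAIN_SIZE = 776
TRAINING_KWARGS = {
    "num_train_epochs": 10,
    "per_device_train_batch_size": 16,
    "learning_rate": 2e-5,
    "seed": 42,
    # Cadence described under Training Procedure (evaluate and save every 50 steps).
    "eval_strategy": "steps",
    "eval_steps": 50,
    "save_strategy": "steps",
    "save_steps": 50,
    "load_best_model_at_end": True,
}


def steps_per_epoch(n_samples, batch_size, n_devices=1):
    """Optimizer steps per epoch when the last partial batch is kept."""
    return math.ceil(n_samples / (batch_size * n_devices))


def build_training_args(output_dir="vit-house-condition"):
    """Build TrainingArguments; output_dir is a hypothetical placeholder."""
    from transformers import TrainingArguments  # imported lazily: heavy dependency

    return TrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)


per_epoch = steps_per_epoch(TRAIN_SIZE, TRAINING_KWARGS["per_device_train_batch_size"])
total_steps = per_epoch * TRAINING_KWARGS["num_train_epochs"]
print(f"{per_epoch} steps per epoch, {total_steps} steps in total")
```

Under these assumptions, evaluating every 50 steps corresponds to roughly one evaluation per epoch.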

## Evaluation Results

### Validation Set Performance
- **Accuracy**: 81.2%
- **Loss**: 0.5629

### Training Set Performance
- **Final Training Loss**: 0.5295

### Per-Class Metrics (Validation)

| Class      | Precision | Recall | F1-Score | Support |
|------------|-----------|--------|----------|---------|
| good       | 0.78      | 0.70   | 0.74     | 10      |
| unknown    | 1.00      | 0.83   | 0.91     | 24      |
| ruined     | 0.62      | 1.00   | 0.77     | 15      |
| medium     | 0.85      | 0.74   | 0.79     | 31      |

**Overall Metrics:**
- Accuracy: 81.2% (65/80 correct)
- Macro Average: Precision=0.81, Recall=0.82, F1=0.80
- Weighted Average: Precision=0.84, Recall=0.81, F1=0.82

### Confusion Matrix (Validation)

```
              Predicted →
           good  unknown  ruined  medium
good       [  7      0      0      3 ]
unknown    [  1     20      2      1 ]
ruined     [  0      0     15      0 ]
medium     [  1      0      7     23 ]
```

**Key Insights:**
- 'unknown' has perfect precision (1.00): no image from another class was predicted as 'unknown'
- 'ruined' has perfect recall (1.00): all 15 ruined houses were caught, but its precision is only 0.62 because 9 other images were also predicted as 'ruined'
- Main confusion: 7 of 31 'medium' houses were classified as 'ruined'
- 3 of 10 'good' houses were classified as 'medium'
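
The per-class figures can be re-derived from the confusion matrix above. The snippet below is a minimal, dependency-free sketch for checking them; `per_class_metrics` and `accuracy` are hypothetical helper names, and the matrix values are copied from this card.

```python
LABELS = ["good", "unknown", "ruined", "medium"]

# Rows are true classes, columns are predicted classes.
CONFUSION = [
    [7, 0, 0, 3],
    [1, 20, 2, 1],
    [0, 0, 15, 0],
    [1, 0, 7, 23],
]


def per_class_metrics(matrix):
    """Return (precision, recall, f1, support) for each class index."""
    n = len(matrix)
    results = []
    for k in range(n):
        true_positives = matrix[k][k]
        support = sum(matrix[k])                         # row sum: true members
        predicted = sum(matrix[r][k] for r in range(n))  # column sum: predictions
        precision = true_positives / predicted if predicted else 0.0
        recall = true_positives / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1, support))
    return results


def accuracy(matrix):
    """Fraction of samples on the diagonal."""
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


metrics = per_class_metrics(CONFUSION)
for label, (p, r, f1, support) in zip(LABELS, metrics):
    print(f"{label:<8} precision={p:.2f} recall={r:.2f} f1={f1:.2f} support={support}")
print(f"accuracy={accuracy(CONFUSION):.4f}")
```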

## Usage

```python
from transformers import ViTForImageClassification, ViTImageProcessor
from PIL import Image
import torch

# Load model and processor
model = ViTForImageClassification.from_pretrained("YOUR_USERNAME/YOUR_MODEL_NAME")
processor = ViTImageProcessor.from_pretrained("YOUR_USERNAME/YOUR_MODEL_NAME")

# Load and preprocess image
image = Image.open("path_to_image.jpg").convert("RGB")
inputs = processor(image, return_tensors="pt")

# Make prediction
with torch.no_grad():
    outputs = model(**inputs)

predicted_class_idx = outputs.logits.argmax(-1).item()
# id2label is keyed by int once the config is loaded, so index it directly
predicted_label = model.config.id2label[predicted_class_idx]

print(f"Predicted class: {predicted_label}")

# Get probabilities
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)[0]
for idx, prob in enumerate(probs):
    label = model.config.id2label[idx]
    print(f"{label}: {prob.item():.2%}")
```
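
The class list under Training Data suggests that the checkpoint's `id2label` holds the Croatian names (`dobre`, `nepoznato`, `oronule`, `srednje`). That is an assumption about the saved config, not something this card confirms. If it holds, a small lookup built from the translations in Model Description can present English labels. `to_english` and `top_predictions` are hypothetical helper names, and the probabilities below are illustrative values, not model output.

```python
# Croatian -> English, taken from the Model Description list.
LABEL_TRANSLATIONS = {
    "dobre": "good",
    "nepoznato": "unknown",
    "oronule": "ruined",
    "srednje": "medium",
}


def to_english(label):
    """Return the English name for a label, or the label unchanged if unmapped."""
    return LABEL_TRANSLATIONS.get(label, label)


def top_predictions(probabilities, id2label, k=2):
    """Return the k most probable (english_label, probability) pairs.

    `probabilities` is a plain list of floats, one per class index.
    """
    ranked = sorted(enumerate(probabilities), key=lambda pair: pair[1], reverse=True)
    return [(to_english(id2label[idx]), prob) for idx, prob in ranked[:k]]


# Illustrative values only.
example_id2label = {0: "dobre", 1: "nepoznato", 2: "oronule", 3: "srednje"}
example_probs = [0.10, 0.05, 0.25, 0.60]
print(top_predictions(example_probs, example_id2label))
```

With the usage snippet above, the same helper could be called as `top_predictions(probs.tolist(), model.config.id2label)`.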

## Limitations and Bias

- The model was trained on house images from a specific geographical and architectural context and may not generalize well to other architectural styles or regions
- Performance varies by class: validation F1 ranges from 0.74 ('good') to 0.91 ('unknown')
- The model has difficulty distinguishing between similar condition categories, most notably 'medium' and 'ruined' (7 of 31 'medium' validation images were predicted as 'ruined')
- The dataset is small for deep learning (935 images), and the validation set has only 80 images (10 of them 'good'), so the reported metrics carry considerable uncertainty
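
To give a sense of how much uncertainty an 80-image validation set carries, the sketch below computes a 95% Wilson score interval for the reported 65/80 validation accuracy. The choice of the Wilson interval and of the 95% level are assumptions made for illustration; this card does not report confidence intervals, and `wilson_interval` is a hypothetical helper name.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


low, high = wilson_interval(65, 80)
print(f"validation accuracy, 95% interval: {low:.1%} to {high:.1%}")
```

The resulting interval is more than fifteen percentage points wide, so small accuracy differences measured on this validation set should be read cautiously.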

## Training Procedure

The model was fine-tuned using the Hugging Face Transformers library with the following approach:

1. **Pre-trained weights**: Initialized from google/vit-base-patch16-224-in21k
2. **Classification head**: Replaced with a new 4-class classifier
3. **Fine-tuning**: All model parameters were fine-tuned on the custom dataset
4. **Data preprocessing**: Images converted to RGB to ensure consistent 3-channel input
5. **Evaluation strategy**: Evaluated every 50 steps with checkpoint saving
6. **Best model selection**: Best model automatically loaded based on validation performance
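
Steps 1, 2 and 4 above can be sketched as follows. This is an illustration under assumptions, not the author's script: the label indices are assumed to follow the order in which the classes are listed in this card, and `load_model` and `load_rgb` are hypothetical helper names.

```python
BASE_CHECKPOINT = "google/vit-base-patch16-224-in21k"

# Assumed index order: the order in which the classes are listed in this card.
CLASS_NAMES = ["dobre", "nepoznato", "oronule", "srednje"]
ID2LABEL = dict(enumerate(CLASS_NAMES))
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}


def load_model():
    """Load the pre-trained backbone with a newly initialized 4-class head."""
    from transformers import ViTForImageClassification  # imported lazily: heavy dependency

    return ViTForImageClassification.from_pretrained(
        BASE_CHECKPOINT,
        num_labels=len(CLASS_NAMES),
        id2label=ID2LABEL,
        label2id=LABEL2ID,
    )


def load_rgb(path):
    """Open an image and convert it to 3-channel RGB."""
    from PIL import Image

    with Image.open(path) as img:
        return img.convert("RGB")


print(ID2LABEL)
```

Passing `id2label` and `label2id` at load time stores readable class names in the saved config, which is what the usage example relies on when it looks up `model.config.id2label`.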

## Base Model

[google/vit-base-patch16-224-in21k](https://huggingface.co/google/vit-base-patch16-224-in21k)

Vision Transformer (ViT) model pre-trained on ImageNet-21k at resolution 224x224.

## Framework Versions

- Transformers: 4.57.1
- PyTorch: 2.x
- Datasets: 3.x
- Python: 3.13

## Citation

If you use this model, please cite:

```bibtex
@misc{house-condition-vit,
  author = {Your Name},
  title = {Fine-tuned ViT for House Condition Classification},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/YOUR_USERNAME/YOUR_MODEL_NAME}}
}
```

## Model Card Authors

This model card was created by the model author.

## Additional Information

- Repository: [GitHub Repository URL]
- Contact: [Your Email or Contact]