---
language: eng
license: apache-2.0
tags:
- image-classification
- vision
- vit
- house-condition
datasets:
- custom
metrics:
- accuracy
library_name: transformers
pipeline_tag: image-classification
---

# Fine-tuned ViT for House Condition Classification

This model is a fine-tuned version of [google/vit-base-patch16-224-in21k](https://huggingface.co/google/vit-base-patch16-224-in21k) for classifying house conditions into 4 categories.

## Model Description

This Vision Transformer (ViT) model has been fine-tuned to classify house images into four condition categories:

- **good** (dobre)
- **unknown** (nepoznato)
- **ruined** (oronule)
- **medium** (srednje)

## Training Details

### Training Data

- **Total dataset**: 935 images
- **Training set**: 776 images
- **Validation set**: 80 images
- **Test set**: 79 images
- **Classes**: 4 (dobre, nepoznato, oronule, srednje)

### Training Hyperparameters

- **Epochs**: 10
- **Batch size**: 16 per device
- **Learning rate**: 2e-5
- **Optimizer**: AdamW
- **Seed**: 42 (for reproducibility)
- **Training time**: 5m 45s
- **Samples per second**: 22.43

## Evaluation Results

### Validation Set Performance

- **Accuracy**: 81.2%
- **Loss**: 0.5629

### Training Set Performance

- **Final Training Loss**: 0.5295

### Per-Class Metrics (Validation)

| Class   | Precision | Recall | F1-Score | Support |
|---------|-----------|--------|----------|---------|
| good    | 0.78      | 0.70   | 0.74     | 10      |
| unknown | 1.00      | 0.83   | 0.91     | 24      |
| ruined  | 0.62      | 1.00   | 0.77     | 15      |
| medium  | 0.85      | 0.74   | 0.79     | 31      |

**Overall Metrics:**

- Accuracy: 81.2% (65/80 correct)
- Macro Average: Precision=0.81, Recall=0.82, F1=0.80
- Weighted Average: Precision=0.84, Recall=0.81, F1=0.82

### Confusion Matrix (Validation)

Rows are the actual classes; columns are the predicted classes.

```
Predicted →   good  unknown  ruined  medium
good       [     7        0       0       3 ]
unknown    [     1       20       2       1 ]
ruined     [     0        0      15       0 ]
medium     [     1        0       7      23 ]
```

**Key Insights:**

- The 'unknown' class has perfect precision (1.00): no image from another class was predicted as 'unknown'
- The 'ruined' class has perfect recall (1.00): all 15 ruined houses were identified
- Main confusion: 'medium' houses are sometimes mistaken for 'ruined' (7 cases), which is why 'ruined' precision is only 0.62
- 'good' houses are occasionally misclassified as 'medium' (3 cases)

## Usage

```python
from transformers import ViTForImageClassification, ViTImageProcessor
from PIL import Image
import torch

# Load model and processor
model = ViTForImageClassification.from_pretrained("YOUR_USERNAME/YOUR_MODEL_NAME")
processor = ViTImageProcessor.from_pretrained("YOUR_USERNAME/YOUR_MODEL_NAME")

# Load and preprocess image
image = Image.open("path_to_image.jpg").convert("RGB")
inputs = processor(image, return_tensors="pt")

# Make prediction
with torch.no_grad():
    outputs = model(**inputs)

# id2label is keyed by integer class index once the config is loaded
predicted_class_idx = outputs.logits.argmax(-1).item()
predicted_label = model.config.id2label[predicted_class_idx]
print(f"Predicted class: {predicted_label}")

# Get probabilities
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)[0]
for idx, prob in enumerate(probs):
    label = model.config.id2label[idx]
    print(f"{label}: {prob.item():.2%}")
```

## Limitations and Bias

- The model was trained on a specific dataset of house images and may not generalize well to different architectural styles or regions
- Performance varies by class; see the validation metrics above for details
- The model has difficulty separating adjacent condition categories: on the validation set, 7 of 31 'medium' images were predicted as 'ruined' and 3 of 10 'good' images as 'medium'
- Dataset size: 935 images (relatively small for deep learning), so the per-class validation figures rest on as few as 10 images
- Images are from a specific geographical/architectural context

## Training Procedure

The model was fine-tuned using the Hugging Face Transformers library with the following approach:

1. **Pre-trained weights**: Initialized from google/vit-base-patch16-224-in21k
2. **Classification head**: Replaced with a new 4-class classifier
3. **Fine-tuning**: All model parameters were fine-tuned on the custom dataset
4. **Data preprocessing**: Images converted to RGB to ensure consistent 3-channel input
5. **Evaluation strategy**: Evaluated every 50 steps with checkpoint saving
6. **Best model selection**: The best checkpoint, as measured by validation performance, was loaded automatically at the end of training

## Base Model

[google/vit-base-patch16-224-in21k](https://huggingface.co/google/vit-base-patch16-224-in21k)

Vision Transformer (ViT) model pre-trained on ImageNet-21k at resolution 224x224.

## Framework Versions

- Transformers: 4.57.1
- PyTorch: 2.x
- Datasets: 3.x
- Python: 3.13

## Citation

If you use this model, please cite:

```bibtex
@misc{house-condition-vit,
  author = {Your Name},
  title = {Fine-tuned ViT for House Condition Classification},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/YOUR_USERNAME/YOUR_MODEL_NAME}}
}
```

## Model Card Authors

This model card was created by the model author.

## Additional Information

- Repository: [GitHub Repository URL]
- Contact: [Your Email or Contact]
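
## Reproducing the Validation Metrics

The per-class precision, recall, and F1 values in the Evaluation Results section follow directly from the validation confusion matrix. The sketch below recomputes them in plain Python so the figures can be checked without the model or the dataset. The names `LABELS`, `CONFUSION`, `per_class_metrics`, and `accuracy` are illustrative and are not taken from the original training code; the matrix values are copied from this card.

```python
# Validation confusion matrix from this card (rows = actual, columns = predicted).
LABELS = ["good", "unknown", "ruined", "medium"]
CONFUSION = [
    [7, 0, 0, 3],
    [1, 20, 2, 1],
    [0, 0, 15, 0],
    [1, 0, 7, 23],
]


def per_class_metrics(matrix):
    """Return {class_index: (precision, recall, f1, support)}."""
    n = len(matrix)
    metrics = {}
    for k in range(n):
        true_pos = matrix[k][k]
        predicted_k = sum(matrix[row][k] for row in range(n))  # column sum
        actual_k = sum(matrix[k])  # row sum, i.e. the class support
        precision = true_pos / predicted_k if predicted_k else 0.0
        recall = true_pos / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[k] = (precision, recall, f1, actual_k)
    return metrics


def accuracy(matrix):
    """Fraction of samples on the diagonal of the confusion matrix."""
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


if __name__ == "__main__":
    for k, (p, r, f1, support) in per_class_metrics(CONFUSION).items():
        print(f"{LABELS[k]:<8} P={p:.2f} R={r:.2f} F1={f1:.2f} n={support}")
    print(f"accuracy = {accuracy(CONFUSION):.4f}")
```

Precision is the diagonal entry divided by its column sum and recall is the diagonal entry divided by its row sum, which is why the 7 'medium' images predicted as 'ruined' lower 'ruined' precision (15/24) without affecting its recall (15/15).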