---
license: apache-2.0
pipeline_tag: image-classification
tags:
- efficientnetv2
- fgic
- safetensors
- transfer-learning
- gem-pooling
- focal-loss
- swa
- grad-cam
- calibration
- temperature-scaling
- computer-vision
- tensorflow.js
library_name: keras
language: en
datasets:
- 0xgr3y/arch-building-dataset
model-index:
- name: Architectural Building Image Classifier
  results:
  - task:
      type: image-classification
      name: Fine-Grained Image Classification
    dataset:
      type: imagefolder
      name: arch-building-dataset
      split: test
    metrics:
    - type: accuracy
      value: 0.9777
      name: Test Accuracy
    - type: accuracy
      value: 0.9836
      name: Validation Accuracy (SWA)
    - type: accuracy
      value: 0.9799
      name: TTA Accuracy
    - type: f1
      value: 0.9777
      name: Macro F1
    - type: precision
      value: 0.9777
      name: Macro Precision
    - type: recall
      value: 0.9777
      name: Macro Recall
    - type: roc_auc
      value: 0.9985
      name: Macro ROC-AUC (OvR)
---
![Arch-Building-Image-Classification](results/greyscope-labs-architecture-classification-efficientnetv2.jpg)
# Fine-Grained Image Classification of World Architecture: An EfficientNetV2-S Transfer Learning Approach with Layered Regularization
### Architectural Building Image Classifier
Fine-Grained Image Classification (FGIC) of world architectural buildings using CNN transfer learning with EfficientNetV2-S, enhanced with GeM Pooling, Focal Loss, Discriminative AdamW (LR), Stochastic Weight Averaging (SWA), Grad-CAM explainability, and calibration analysis.
<table>
<tr><td><strong>Architecture</strong></td><td>EfficientNetV2-S + GeM Pooling + Focal Loss + SWA</td></tr>
<tr><td><strong>Task</strong></td><td>Fine-Grained Image Classification (FGIC)</td></tr>
<tr><td><strong>Test Accuracy</strong></td><td>97.77%</td></tr>
<tr><td><strong>Classes</strong></td><td>8 (barn, bridge, castle, mosque, skyscraper, stadium, temple, windmill)</td></tr>
<tr><td><strong>Input Size</strong></td><td>320 × 320 pixels</td></tr>
<tr><td><strong>Parameters</strong></td><td>23,350,633</td></tr>
<tr><td><strong>Framework</strong></td><td>TensorFlow / Keras 3</td></tr>
<tr><td><strong>License</strong></td><td><a href="https://www.apache.org/licenses/LICENSE-2.0">Apache-2.0</a></td></tr>
</table>
## Model Description
A fine-grained image classification model for world architectural buildings. Built on EfficientNetV2-S pretrained on ImageNet, enhanced with GeM Pooling (learnable generalized mean pooling), Focal Loss, Discriminative AdamW and Stochastic Weight Averaging (SWA). Extended with Grad-CAM explainability visualization, ROC-AUC evaluation, ECE calibration analysis, and t-SNE embedding visualization.
**Key architectural contributions:**
- **GeM Pooling** (Radenovic et al., TPAMI 2018): replaces global average pooling with generalized mean pooling whose power parameter is learnable (p=3.0), emphasizing high-activation features and yielding stronger discriminative representations for FGIC tasks
- **Focal Loss** (Lin et al., ICCV 2017, gamma=2.0): down-weights well-classified examples to focus gradient updates on hard-to-classify building pairs
- **DiscriminativeAdamW LR**: extends AdamW with per-variable LR scaling on block6 (×0.1) via an `update_step` override, combined with selective fine-tuning (block6+top_conv unfrozen, BN frozen). Block6 variables receive a 10× smaller learning rate than head variables (117 variables total: 105 block6 + 12 head)
- **Mixup + CutMix** (Zhang et al., ICLR 2018; Yun et al., ICCV 2019): alternating per batch (50/50) between Mixup (alpha=0.2, linear interpolation) and CutMix (alpha=1.0, spatial patch). Applied only in Phase 1 training to regularize head learning
- **Selective Unfreeze** (Yosinski et al., 2014): Phase 2 unfreezes the block6+top_conv layers (180/513 EfficientNetV2-S layers) while keeping BatchNormalization frozen to preserve pretrained statistics
- **SWA with BN re-estimation** (Izmailov et al., UAI 2018): 10-epoch post-training weight averaging with constant LR 1e-4, followed by 100-step batch normalization statistics re-estimation (3,200 images)
- **Test-Time Augmentation**: 6 variations averaged at inference: original, horizontal flip, center crop 85%, center crop 70%, corner crop top-left 80%, corner crop bottom-right 80%. Yields a +0.22 percentage-point accuracy improvement (97.77% → 97.99%)
- **Grad-CAM** (Selvaraju et al., ICCV 2017): gradient-weighted class activation mapping for explainability, targeting *top_conv* (last Conv2D layer of EfficientNetV2-S)
- **ECE Calibration** (Guo et al., ICML 2017): Expected Calibration Error with a 15-bin reliability diagram to assess how reliable the prediction confidence is
- **Temperature Scaling** (Guo et al., ICML 2017): post-hoc calibration via a scalar temperature parameter T optimized on the validation set (NLL minimization). T=0.54 reduces ECE from 12.04% (underconfident due to label smoothing) to 0.53%. Applied at inference as `softmax(log(probs) / T)`
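
The GeM pooling step in the list above can be illustrated with a minimal NumPy sketch. This is not the `GeMPooling` layer from `build_model.py`: the function name `gem_pool` is hypothetical, `p` is held fixed here whereas the model learns it, and the random feature map only mimics the (5, 5, 256) shape that reaches the pooling layer.

```python
import numpy as np

def gem_pool(features: np.ndarray, p: float = 3.0, eps: float = 1e-6) -> np.ndarray:
    """Generalized-mean pool a (batch, H, W, C) feature map down to (batch, C)."""
    clipped = np.maximum(features, eps)  # keep values positive before raising to p
    return np.mean(clipped ** p, axis=(1, 2)) ** (1.0 / p)

rng = np.random.default_rng(0)
fmap = rng.random((2, 5, 5, 256)).astype(np.float32)  # illustrative activations only

gap = fmap.mean(axis=(1, 2))   # global average pooling (GeM with p = 1)
gem = gem_pool(fmap, p=3.0)    # p = 3 weights large activations more heavily
gmax = fmap.max(axis=(1, 2))   # GeM approaches max pooling as p grows

print(gem.shape)
```

For non-negative activations the result lies between average pooling and max pooling, which is the sense in which a larger `p` emphasizes high-activation features.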
## Architecture
```
Input (320, 320, 3)
│
EfficientNetV2-S (ImageNet pretrained, 513 layers, 20.33M params)
│
Conv2D(256, 3×3, ReLU, padding=same) → 2,949,376 params
BatchNormalization → 1,024 params
MaxPooling2D(2×2) → 0 params
│
GeM Pooling(p=3.0, eps=1e-6, learnable) → 1 param
│
Dense(256, ReLU) → 65,792 params
BatchNormalization → 1,024 params
Dropout(0.4) → 0 params
│
Dense(8, Softmax) → 2,056 params
│
Output (8 classes)
```
| Component | Output Shape | Parameters |
|-----------|-------------|------------|
| EfficientNetV2-S (Functional) | (None, 10, 10, 1280) | 20,331,360 |
| Conv2D 256 3×3 | (None, 10, 10, 256) | 2,949,376 |
| BatchNormalization | (None, 10, 10, 256) | 1,024 |
| MaxPooling2D 2×2 | (None, 5, 5, 256) | 0 |
| GeM Pooling p=3.0 | (None, 256) | 1 |
| Dense 256 ReLU | (None, 256) | 65,792 |
| BatchNormalization | (None, 256) | 1,024 |
| Dropout 0.4 | (None, 256) | 0 |
| Dense 8 Softmax | (None, 8) | 2,056 |
| **Total** | | **23,350,633** |
| Trainable (Phase 1) | | **3,018,249** (11.51 MB) |
| Trainable (Phase 2) | | **17,810,225** (67.94 MB) |
| Non-trainable (Phase 1) | | **20,332,384** (77.56 MB) |
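
As a cross-check, the head parameter counts in the table can be reproduced from the layer shapes. The sketch below uses the standard formulas for Conv2D, Dense, and BatchNormalization parameters; the helper names are hypothetical, and the backbone figure is taken from the table rather than recomputed.

```python
def conv2d_params(kernel: int, c_in: int, c_out: int) -> int:
    return kernel * kernel * c_in * c_out + c_out  # weights + biases

def dense_params(n_in: int, n_out: int) -> int:
    return n_in * n_out + n_out                    # weights + biases

def batchnorm_params(channels: int) -> int:
    # gamma and beta are trainable; moving mean and variance are not
    return 4 * channels

backbone = 20_331_360  # EfficientNetV2-S, from the table above

head = {
    "conv2d": conv2d_params(3, 1280, 256),
    "bn_conv": batchnorm_params(256),
    "gem_p": 1,
    "dense_256": dense_params(256, 256),
    "bn_dense": batchnorm_params(256),
    "dense_8": dense_params(256, 8),
}

total = backbone + sum(head.values())
bn_moving_stats = 2 * (2 * 256)                    # two BN layers, mean + variance each
trainable_phase1 = sum(head.values()) - bn_moving_stats
non_trainable_phase1 = backbone + bn_moving_stats  # frozen backbone + BN statistics

print(total, trainable_phase1, non_trainable_phase1)
```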
## Performance
### Overall Metrics
| Metric | Value |
|--------|-------|
| Test Accuracy | 97.77% |
| Validation Accuracy (SWA) | 98.36% |
| Test-Time Augmentation | 97.99% |
| Test Loss | 0.4262 |
| Overfitting Gap (Train − Test) | 2.11% |
| Macro Avg Precision | 0.9777 |
| Macro Avg Recall | 0.9777 |
| Macro Avg F1-Score | 0.9777 |
| Top-2 Accuracy | 99.26% |
| Top-3 Accuracy | 99.70% |
| Macro ROC-AUC (OvR) | 0.9985 |
| ECE (15 bins) | 0.1204 before temperature scaling; 0.0053 after (T=0.54) |
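
The two calibration quantities in the table can be sketched in NumPy as follows. This is an illustration under assumptions, not the evaluation code behind the reported numbers: the function names are hypothetical, the bins are equal-width over (0, 1], and the example probabilities are made up.

```python
import numpy as np

def apply_temperature(probs: np.ndarray, T: float, eps: float = 1e-12) -> np.ndarray:
    """Rescale softmax outputs as softmax(log(probs) / T)."""
    logits = np.log(np.clip(probs, eps, 1.0)) / T
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)

def expected_calibration_error(probs: np.ndarray, labels: np.ndarray, n_bins: int = 15) -> float:
    """Weighted mean of |accuracy - confidence| over equal-width confidence bins."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return float(ece)

probs = np.array([[0.60, 0.25, 0.15],
                  [0.10, 0.70, 0.20]])      # made-up predictions
sharpened = apply_temperature(probs, T=0.54)  # T < 1 raises the top-class confidence
print(sharpened.round(3))
```

Because T is a single scalar applied to every class, the predicted class never changes; only the confidence does, which is why accuracy is unaffected by this calibration step.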
### Per-Class Results
| Class | Precision | Recall | F1-Score | AUC (OvR) | Support |
|-------|-----------|--------|----------|-----------|---------|
| barn | 0.9760 | 0.9702 | 0.9731 | 0.9950 | 168 |
| bridge | 0.9591 | 0.9762 | 0.9676 | 0.9983 | 168 |
| castle | 0.9763 | 0.9821 | 0.9792 | 0.9996 | 168 |
| mosque | 0.9763 | 0.9821 | 0.9792 | 0.9987 | 168 |
| skyscraper | 0.9940 | 0.9940 | 0.9940 | 0.9999 | 168 |
| stadium | 0.9820 | 0.9762 | 0.9791 | 0.9999 | 168 |
| temple | 0.9816 | 0.9524 | 0.9668 | 0.9976 | 168 |
| windmill | 0.9765 | 0.9881 | 0.9822 | 0.9987 | 168 |
| **Macro Avg** | **0.9777** | **0.9777** | **0.9777** | **0.9985** | **1,344** |
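
The macro averages are unweighted means of the per-class columns, and because every class has the same support (168), macro recall coincides with overall accuracy. A small sketch that recomputes both from the table values (the correct-prediction counts are inferred by rounding recall × support, which is an assumption):

```python
# (precision, recall, f1) per class, copied from the table above
per_class = {
    "barn":       (0.9760, 0.9702, 0.9731),
    "bridge":     (0.9591, 0.9762, 0.9676),
    "castle":     (0.9763, 0.9821, 0.9792),
    "mosque":     (0.9763, 0.9821, 0.9792),
    "skyscraper": (0.9940, 0.9940, 0.9940),
    "stadium":    (0.9820, 0.9762, 0.9791),
    "temple":     (0.9816, 0.9524, 0.9668),
    "windmill":   (0.9765, 0.9881, 0.9822),
}
support = 168

# Unweighted mean of each column
macro_precision, macro_recall, macro_f1 = (
    sum(column) / len(column) for column in zip(*per_class.values())
)

# Correct predictions per class, inferred from recall and support
correct = sum(round(recall * support) for _, recall, _ in per_class.values())
accuracy = correct / (support * len(per_class))

print(f"macro P/R/F1: {macro_precision:.4f} {macro_recall:.4f} {macro_f1:.4f}")
print(f"accuracy: {correct}/{support * len(per_class)} = {accuracy:.4f}")
```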
### Model Selection
Four candidate models were evaluated on the validation set:
| Checkpoint | Val Accuracy | Val Loss | Description |
|------------|-------------|----------|-------------|
| `head_training.keras` | 92.34% | 1.0109 | Phase 1 checkpoint (backbone frozen) |
| `fine_tuning.keras` | 96.28% | 0.5655 | Phase 2 checkpoint (block6+top_conv unfrozen) |
| `fine_tuning_ema.keras` | 93.53% | 0.6007 | Phase 2 EMA (per-step Polyak averaging) |
| **`fine_tuning_swa.keras`** | **98.36%** | **0.4109** | **SWA averaged weights ← SELECTED** |
### Training Progression
| Phase | Epoch | Train Acc | Val Accuracy | Val Loss |
|-------|-------|-----------|-------------|----------|
| Phase 1 (Head Training) | 1 | 56.96% | 92.19% | 1.0079 |
| Phase 2 (Selective Fine-Tuning) | 1 | 84.96% | 96.21% | 0.5656 |
| SWA | 1 | 90.83% | 95.76% | 0.5831 |
| SWA | 2 | 94.07% | 97.62% | 0.5116 |
| SWA | 3 | 95.36% | 97.69% | 0.4748 |
| SWA | 4 | 96.56% | 96.95% | 0.4390 |
| SWA | 5 | 97.18% | 97.47% | 0.4490 |
| SWA | 6 | 97.76% | 97.84% | 0.4416 |
| SWA | 7 | 97.91% | 98.14% | 0.4055 |
| SWA | 8 | 98.19% | 97.32% | 0.4359 |
| SWA | 9 | 98.14% | 97.02% | 0.4519 |
| SWA | 10 | 98.59% | 97.54% | 0.4226 |
| **SWA + BN (final)** | - | - | **98.36%** | **0.4109** |
> Phase 1 and Phase 2 each stopped after 1 epoch via `myCallback` (custom early stopping at target accuracy: 85% Phase 1, 92% Phase 2). SWA ran 10 epochs with constant LR 1e-4, followed by BN re-estimation (100 steps, 3,200 images). Values shown are training-time metrics from the progress bar; checkpoint evaluation values may differ slightly (see the Model Selection table above).
![Training Curves](results/training_curves.png)
![Confusion Matrix](results/confusion_matrix.png)
![Per-Class Accuracy](results/per_class_accuracy.png)
![Confidence Per Class](results/confidence_per_class.png)
![t-SNE Embedding](results/tsne_embedding.png)
![Grad-CAM Heatmaps](results/gradcam_heatmaps.png)
## Training Details
### Training Strategy
Two-phase progressive training with SWA post-processing:
| Phase | Description | Backbone | Optimizer | LR | Max Epochs | Actual Epochs | CutMix+Mixup | FocalLoss LS |
|-------|-------------|----------|-----------|-----|-----------|---------------|---------------|-------------|
| **Phase 1**: Feature Extraction | Train custom head only | Frozen (all) | AdamW (wd=2e-5) | 0.001 + CosineDecay + Warmup 3ep | 25 | 1¹ | Yes (50/50 alternation) | 0.1 |
| **Phase 2**: Selective Fine-Tuning | Load head_training → fine-tune | block6 + top_conv unfrozen (BN frozen) | DiscriminativeAdamW (block6=0.1×) | 3e-4 + CosineDecay + Warmup 5ep | 50 | 1 + 10 SWA² | No | 0.05 |
> ¹ Phase 1 stops once `val_accuracy ≥ 85%` (myCallback).
> ² Phase 2 stops once `val_accuracy ≥ 92%` (myCallback), followed by 10 SWA epochs (constant LR 1e-4).
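
To make the difference between the two weight-averaging schemes concrete, here is a minimal pure-Python sketch. It assumes SWA keeps an equal-weight running mean of per-epoch snapshots and EMA keeps a per-step exponentially decayed shadow copy; the function names and the toy two-element "weights" are illustrative, and BN re-estimation is not covered.

```python
def swa_update(average, new_weights, n_averaged):
    """Equal-weight running mean after seeing one more snapshot."""
    return [(a * n_averaged + w) / (n_averaged + 1)
            for a, w in zip(average, new_weights)]

def ema_update(shadow, new_weights, decay=0.999):
    """Polyak / exponential moving average, applied once per training step."""
    return [decay * s + (1.0 - decay) * w
            for s, w in zip(shadow, new_weights)]

# Three toy "epoch snapshots" of a two-parameter model
snapshots = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]

swa = snapshots[0]
for n, weights in enumerate(snapshots[1:], start=1):
    swa = swa_update(swa, weights, n)   # ends at the plain mean of all snapshots

ema = snapshots[0]
for weights in snapshots[1:]:
    ema = ema_update(ema, weights)      # moves only 0.1% toward each new value

print(swa, ema)
```

With decay 0.999 the shadow weights need on the order of a thousand steps to track a change, whereas the SWA mean gives every snapshot the same weight regardless of when it was taken.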
### Hyperparameters
| Parameter | Phase 1 | Phase 2 |
|-----------|---------|---------|
| Optimizer | AdamW | DiscriminativeAdamW |
| Learning Rate | 0.001 | 3×10⁻⁴ |
| LR Schedule | WarmupCosineDecay (warmup=3) | WarmupCosineDecay (warmup=5) |
| Weight Decay | 2×10⁻⁵ | 2×10⁻⁵ |
| LR Multiplier (block6) | - | 0.1× (LR scaling via update_step) |
| LR Multiplier (top_conv+head) | - | 1.0× |
| Loss | FocalLoss (gamma=2.0, LS=0.1) | FocalLoss (gamma=2.0, LS=0.05) |
| Batch Size | 32 | 32 |
| Early Stopping Patience | 7 | 12 |
| myCallback Threshold | val_acc ≥ 0.85 | val_acc ≥ 0.92 |
| EMA Decay (per-step) | 0.999 | 0.999 |
| SWA Epochs | - | 10 (post-training) |
| SWA LR | - | 1×10⁻⁴ (constant) |
| BN Re-estimation Steps | - | 100 |
| CutMix (alpha=1.0) | Yes (50% batches) | No |
| Mixup (alpha=0.2) | Yes (50% batches) | No |
| Hardware | 2× Tesla T4 (MirroredStrategy) | 2× Tesla T4 (MirroredStrategy) |
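
The WarmupCosineDecay schedule in the table can be sketched as a plain function of the step index. The actual schedule class is not shown in this README, so the details below are assumptions: linear warmup from near zero, a per-step (not per-epoch) update, decay to zero, and a global batch size of 32 giving 10,752 / 32 = 336 steps per epoch.

```python
import math

def warmup_cosine_lr(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to base_lr, then cosine annealing down to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

steps_per_epoch = 10_752 // 32          # assumed: 336 steps per epoch
warmup_steps = 3 * steps_per_epoch      # Phase 1: 3 warmup epochs
total_steps = 25 * steps_per_epoch      # Phase 1: 25 max epochs

lr_after_epoch_1 = warmup_cosine_lr(steps_per_epoch - 1, 1e-3, warmup_steps, total_steps)
lr_at_peak = warmup_cosine_lr(warmup_steps, 1e-3, warmup_steps, total_steps)
lr_at_end = warmup_cosine_lr(total_steps, 1e-3, warmup_steps, total_steps)

print(f"{lr_after_epoch_1:.2e} {lr_at_peak:.2e} {lr_at_end:.2e}")
```

Under these assumptions, a phase that stops after a single epoch is still inside the warmup ramp, so the learning rate it actually trained with is about one third of the nominal peak.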
### Regularization Strategy
| Technique | Implementation | Reference |
|-----------|---------------|-----------|
| Transfer Learning | EfficientNetV2-S backbone frozen in Phase 1 | Yosinski et al., NeurIPS 2014 |
| Selective Fine-Tuning | Unfreeze block6+top_conv only, BN stays frozen | Howard & Ruder, ACL 2018 |
| Discriminative LR Scaling | block6 LR×0.1 via update_step (10× smaller updates for pretrained features) | Howard & Ruder, ACL 2018 |
| CutMix + Mixup | Alternation per batch (50/50), Phase 1 only | Yun et al., ICCV 2019; Zhang et al., ICLR 2018 |
| Focal Loss | gamma=2.0, down-weights easy examples | Lin et al., ICCV 2017 |
| Label Smoothing | 0.1 (Phase 1) → 0.05 (Phase 2) | Szegedy et al., CVPR 2016 |
| GeM Pooling | p=3.0 learnable, replaces GAP | Radenovic et al., TPAMI 2018 |
| Dropout | 0.4 after Dense(256)+BN | Srivastava et al., JMLR 2014 |
| Batch Normalization | After Conv2D and Dense; frozen during fine-tuning | Ioffe & Szegedy, arXiv 2015 |
| EMA (per-step) | Shadow weights, decay=0.999, Polyak averaging | Tarvainen & Valpola, NeurIPS 2017 |
| SWA | 10-epoch post-training, constant LR 1e-4 | Izmailov et al., UAI 2018 |
| Data Augmentation | Rotation ±15°, shift ±10%, shear ±0.1 rad, zoom ±20%, brightness 0.75–1.15, channel shift ±10.0, horizontal flip | Perez & Wang, arXiv 2017 |
| Random Erasing | p=0.5, area [0.02–0.15], aspect [0.3–3.3], applied pre-normalization | Zhong et al., AAAI 2020 |
| Test-Time Augmentation | 6 augmentation variants, averaged | Shanmugam et al., ICML 2020 |
| WarmupCosineDecay | Linear warmup + cosine annealing | Loshchilov & Hutter, ICLR 2017 (SGDR) |
| Early Stopping | Patience 7 (Phase 1) / 12 (Phase 2) | Prechelt, Neural Networks 1998 |
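
The interaction of focal loss and label smoothing from the table can be illustrated for a single example. This is a sketch of one common multi-class formulation (per-class cross-entropy scaled by `(1 - p) ** gamma`, with smoothed targets and no class weighting); the `FocalLoss` class in `build_model.py` may differ in detail, and the function name and example probabilities here are hypothetical.

```python
import math

def focal_loss(probs, true_idx, gamma=2.0, label_smoothing=0.1, eps=1e-7):
    """Focal cross-entropy for one example with label-smoothed targets."""
    num_classes = len(probs)
    loss = 0.0
    for i, p in enumerate(probs):
        p = min(max(p, eps), 1.0 - eps)                 # avoid log(0)
        one_hot = 1.0 if i == true_idx else 0.0
        target = (1.0 - label_smoothing) * one_hot + label_smoothing / num_classes
        loss += -target * (1.0 - p) ** gamma * math.log(p)
    return loss

easy = [0.90, 0.05, 0.05]   # confident, correct prediction
hard = [0.40, 0.30, 0.30]   # uncertain prediction

# gamma=0 and no smoothing reduces to plain cross-entropy
cross_entropy_easy = focal_loss(easy, 0, gamma=0.0, label_smoothing=0.0)
focal_easy = focal_loss(easy, 0, gamma=2.0, label_smoothing=0.0)
focal_hard = focal_loss(hard, 0, gamma=2.0, label_smoothing=0.0)

print(f"CE easy={cross_entropy_easy:.4f} focal easy={focal_easy:.4f} focal hard={focal_hard:.4f}")
```

With gamma=2.0 the well-classified example contributes far less than it would under plain cross-entropy, so most of the gradient signal comes from the harder example.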
### Dataset
See the dataset curation page for [World Architectural Buildings Dataset for Multi‑Class Image Classification](https://huggingface.co/datasets/0xgr3y/arch-building-dataset): 13,440 images (8 classes × 1,680, balanced) sourced from Pexels with perceptual (pHash) and exact (SHA256) deduplication.
| Split | Images | Percentage |
|-------|--------|------------|
| Train | 10,752 | 80% |
| Validation | 1,344 | 10% |
| Test | 1,344 | 10% |
### Data Preprocessing
- **Normalization:** `preprocess_input` from `tf.keras.applications.efficientnet_v2` (ImageNet distribution)
- **Input resolution:** 320×320 (higher than the ImageNet default of 224×224, to capture fine-grained architectural details such as textures, ornaments, and facade patterns)
- **Augmentation:** Applied to the training set only; validation and test sets use clean preprocessing
- **Split method:** `splitfolders.ratio` from `dataset/`, seed=42
## Files
| Category | Files |
|----------|-------|
| **Model (best)** | `fine_tuning_swa.keras` (227 MB) · `.weights.h5` (158 MB) · `.safetensors` (157 MB) |
| **Code** | `build_model.py` (21 KB): architecture + CLI inference |
| **Config** | `config.json` · `label_mapping.json` · `preprocessor_config.json` |
| **Evaluation** | `calibration_data.json` · `model_benchmark.json` · `confusion_pairs.json` · `class_confidence_stats.json` · `temperature_config.json` |
| **Deployment** | `saved_model/` (183 MB) · `tflite/` (88 MB) · `tfjs_model/` (90 MB, 23 shards) |
| **Results** | `results/`: 12 PNG (augmentation, reliability diagram, training curves, confusion matrix, ROC, t-SNE, Grad-CAM, etc.) |
| **Archive** | `models_keras/`: 3 checkpoints (head_training, fine_tuning, fine_tuning_ema) |
## Usage
### Gradio Space
Try the live building classifier: [Architecture Building Image Classifier Space](https://huggingface.co/spaces/0xgr3y/arch-building-classifier)
### Python: build_model.py (recommended)
`build_model.py` is a standalone module that provides:
- **Custom class definitions** (`GeMPooling`, `FocalLoss`, `DiscriminativeAdamW`) with `@register_keras_serializable`: importing the module registers all custom classes globally, so `load_model()` works without explicit `custom_objects`.
- **`ArchBuildingClassifier`**: high-level wrapper class with `build()`, `from_weights()`, `from_keras()`, `predict()`, and `predict_batch()` methods.
- **`CUSTOM_OBJECTS`** dict: fallback for explicit `custom_objects=` in `load_model()`.
- **`build_model()`**: backward-compatible function that returns a raw `tf.keras.Model`.
Place `build_model.py` in the same directory as your script or add it to `PYTHONPATH`.
> **Note:** Filenames below use `fine_tuning_swa` as an example. The actual best checkpoint filename depends on training results, so check the repo for the actual `.keras`, `.weights.h5`, and `.safetensors` filenames.
```python
from build_model import ArchBuildingClassifier
from huggingface_hub import hf_hub_download
from PIL import Image

# Download weights (clean format)
weights_path = hf_hub_download("0xgr3y/Arch-Building-Image-Classification", "fine_tuning_swa.weights.h5")

# Load model: architecture + weights
clf = ArchBuildingClassifier.from_weights(weights_path)

# Inference: top label, its confidence, and the top-3 (class, probability) pairs
label, confidence, top3 = clf.predict(Image.open("skyscraper_00000.jpg"))
print(f"Predicted: {label} ({confidence:.1%})")
for cls, prob in top3:
    print(f" {cls}: {prob:.1%}")
```
### Python: TF-Lite (fastest inference)
```python
import json

import numpy as np
import tensorflow as tf
from huggingface_hub import hf_hub_download
from PIL import Image

try:
    from tensorflow.keras.applications.efficientnet_v2 import preprocess_input
except (ImportError, ModuleNotFoundError):
    from tensorflow.keras.applications.efficientnet import preprocess_input

# Download the TFLite model and the class labels
model_path = hf_hub_download("0xgr3y/Arch-Building-Image-Classification", "tflite/model.tflite")
labels_path = hf_hub_download("0xgr3y/Arch-Building-Image-Classification", "label_mapping.json")
with open(labels_path) as f:
    LABELS = json.load(f)["labels"]

# Set up the interpreter
interpreter = tf.lite.Interpreter(model_path=model_path)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Preprocess one image to a (1, 320, 320, 3) float32 batch
img = Image.open("skyscraper_00000.jpg").convert("RGB").resize((320, 320))
arr = np.expand_dims(preprocess_input(np.array(img, dtype=np.float32)), axis=0)

# Run inference and print the top-3 classes
interpreter.set_tensor(input_details[0]["index"], arr)
interpreter.invoke()
preds = interpreter.get_tensor(output_details[0]["index"])[0]
top3_idx = np.argsort(preds)[::-1][:3]
for i in top3_idx:
    print(f" {LABELS[i]}: {preds[i]*100:.1f}%")
```
### Python: Keras (convenient)
```python
import json

import numpy as np
import tensorflow as tf
from huggingface_hub import hf_hub_download
from PIL import Image

import build_model  # registers custom classes via @register_keras_serializable

try:
    from tensorflow.keras.applications.efficientnet_v2 import preprocess_input
except (ImportError, ModuleNotFoundError):
    from tensorflow.keras.applications.efficientnet import preprocess_input

model_path = hf_hub_download("0xgr3y/Arch-Building-Image-Classification", "fine_tuning_swa.keras")
labels_path = hf_hub_download("0xgr3y/Arch-Building-Image-Classification", "label_mapping.json")
model = tf.keras.models.load_model(model_path, compile=False)  # custom_objects not needed
with open(labels_path) as f:
    LABELS = json.load(f)["labels"]

# Preprocess one image to a (1, 320, 320, 3) float32 batch
img = Image.open("skyscraper_00000.jpg").convert("RGB").resize((320, 320))
arr = np.expand_dims(preprocess_input(np.array(img, dtype=np.float32)), axis=0)

preds = model.predict(arr, verbose=0)[0]
print(f"Predicted: {LABELS[np.argmax(preds)]} ({np.max(preds)*100:.1f}%)")
```
### Python: SavedModel (TF Serving)
```python
import numpy as np
import tensorflow as tf
from huggingface_hub import snapshot_download
from PIL import Image

try:
    from tensorflow.keras.applications.efficientnet_v2 import preprocess_input
except (ImportError, ModuleNotFoundError):
    from tensorflow.keras.applications.efficientnet import preprocess_input

snapshot_download("0xgr3y/Arch-Building-Image-Classification", allow_patterns=["saved_model/*"], local_dir=".")

# Load SavedModel (created via model.export(): inference-only, no custom_objects needed)
loaded = tf.saved_model.load("saved_model")

img = Image.open("skyscraper_00000.jpg").convert("RGB").resize((320, 320))
arr = tf.constant(np.expand_dims(preprocess_input(np.array(img, dtype=np.float32)), axis=0))

# Artifacts written by model.export() expose the forward pass as the `serve` endpoint
preds = loaded.serve(arr).numpy()[0]
top3_idx = np.argsort(preds)[::-1][:3]
for i in top3_idx:
    print(f" Class {i}: {preds[i]*100:.1f}%")
```
### Python: safetensors (HF standard, cross-framework)
> **Note:** safetensors stores raw weight tensors without architecture metadata. To load, reconstruct the architecture with `build_model.py` first, then map tensors manually. For most use cases, `.weights.h5` (via `ArchBuildingClassifier.from_weights()`) is simpler and equally clean.
```python
from safetensors.numpy import load_file
from build_model import ArchBuildingClassifier
from PIL import Image

# Reconstruct architecture
clf = ArchBuildingClassifier.build()

# Load safetensors tensors (dict of tensor name -> NumPy array)
tensors = load_file("fine_tuning_swa.safetensors")

# Map tensors to model weights (iterate layers, not .variables, for Keras 3 compatibility)
for layer in clf.keras_model.layers:
    for w in layer.weights:
        name = w.name.replace(':', '_').replace('/', '_')
        if name in tensors:
            w.assign(tensors[name])

# Inference
label, confidence, top3 = clf.predict(Image.open("skyscraper_00000.jpg"))
```
## Inference Verification
Keras vs TFLite consistency was verified on 8 random test samples (1 per class):
| Metric | Result |
|--------|--------|
| Keras correct | 7/8 (88%) |
| TFLite correct | 7/8 (88%) |
| Keras vs TFLite match | **8/8 (100%)**, identical predictions |
| Keras inference speed | 358.0 ms |
| TFLite inference speed | 170.0 ms |
> The single misclassification (castle→barn, 65% confidence) is in line with a 97.77% test accuracy on a sample this small. The 8/8 match indicates that TFLite conversion preserved the predicted class on every sample checked.
![TFLite Inference](results/inference_tflite.png)
## Security Notice (PAIT-KERAS-301)
The `.keras` files in this repository are flagged **"Unsafe"** by [Protect AI Guardian](https://protectai.com/insights/models/0xgr3y/Arch-Building-Image-Classification) (threat: PAIT-KERAS-301). This is a **structural false positive**, not a malware detection:
- **What the scanner checks:** String-matching of `class_name` fields in the Keras v3 config against a whitelist of built-in Keras layers.
- **Why flagged:** The model contains a custom layer (`GeMPooling`), and a non-standard class name triggers the flag.
- **What it does NOT check:** The scanner does not analyze the Python code of the custom class, does not look for `eval()`/`exec()`/`os.system()`, and does not detect actual malware.
- **Other scanners:** VirusTotal, JFrog, and HF Picklescan all report the file as clean. Only Protect AI flags it.
**The custom classes are safe and open source:**
- `GeMPooling`: Generalized Mean Pooling (Radenovic et al., TPAMI 2018). Pure tensor ops: `tf.pow`, `tf.reduce_mean`, `tf.maximum`.
- `FocalLoss`: Focal Loss (Lin et al., ICCV 2017). Pure tensor ops.
- `DiscriminativeAdamW`: AdamW subclass with per-variable learning-rate scaling. No file I/O, no network calls, no arbitrary code.
Full source code for all custom classes is available in [`build_model.py`](https://huggingface.co/0xgr3y/Arch-Building-Image-Classification/blob/main/build_model.py) and the training notebook for public audit.
## Multi-Format Deployment Guide
The model is provided in multiple formats to suit different deployment scenarios. Formats marked ✓ are **not flagged** by Protect AI (no custom class serialization).
| Format | File | Size | Protect AI | Inference Speed | Best For |
|--------|------|------|------------|-----------------|----------|
| **TF-Lite** ✓ | `tflite/model.tflite` | ~88 MB | ✓ Safe | **170.0 ms** (fastest) | Mobile, edge, embedded, HF Space |
| **SavedModel** ✓ | `saved_model/` | ~183 MB | ✓ Safe | - | TensorFlow Serving, cloud backend |
| **TFJS** ✓ | `tfjs_model/` | ~90 MB | ✓ Safe | - | Browser, Node.js (no backend) |
| **Weights H5** ✓ | `fine_tuning_swa.weights.h5` | ~158 MB | ✓ Safe | - | Programmatic load via `build_model.py` |
| **safetensors** ✓ | `fine_tuning_swa.safetensors` | ~157 MB | ✓ Safe | - | HF standard, cross-framework |
| **Build Script** ✓ | `build_model.py` | ~21 KB | ✓ Safe | - | Architecture reconstruction + `load_weights()` |
| **Keras** ℹ | `fine_tuning_swa.keras` | ~227 MB | ℹ Flagged | 358.0 ms | Developer reference, fine-tuning |
### Load Examples
See **Usage** section above for complete load + inference examples for each format.
## Intended Use
- Architectural style classification from building photographs
- Educational tool for architecture recognition
- Research baseline for fine-grained image classification (FGIC)
- Transfer learning experiments on architectural imagery
## Limitations
- Trained on Pexels stock photography, so performance may differ on user-generated or field photographs
- Limited to 8 architectural classes (barn, bridge, castle, mosque, skyscraper, stadium, temple, windmill)
- Confusion pair analysis found **0 significant pairs** (threshold >5%), meaning all 8 classes are well distinguished by the model; see `confusion_pairs.json` for details
- Barn and windmill share 3 cross-class duplicates (0.02% of the dataset), which were left as-is due to negligible impact
- Inference confidence can be low on atypical examples
![Misclassification Examples](results/misclassification_examples.png)
## Ethical Considerations
- All training images sourced from [Pexels.com](https://www.pexels.com) under the Pexels License (free for commercial use, no attribution required). No copyrighted or personally identifiable images were used.
- The dataset contains only photographs of buildings and structures; no people, faces, or private property are the subject of classification.
- The model reflects the visual distribution of Pexels stock photography, which may over-represent Western and iconic architectural styles and under-represent vernacular or regional architecture.
- The 8 class categories are broad and do not capture the full diversity of world architecture. Results should not be used to make definitive claims about architectural categorization.
- URL pattern filtering during dataset collection explicitly excluded AI-generated art, illustrations, and non-photographic content to ensure authenticity.
## Links
- **Gradio Space (Live):** [arch-building-classifier Space](https://huggingface.co/spaces/0xgr3y/arch-building-classifier)
- **Dataset Studio:** [0xgr3y/arch-building-dataset](https://huggingface.co/datasets/0xgr3y/arch-building-dataset)
- **GitHub Repository:** [arcxteam/building-architectural-image-classifier](https://github.com/arcxteam/building-architectural-image-classifier)
## References
1. Tan, M., & Le, Q. V. (2021). EfficientNetV2: Smaller Models and Faster Training. *ICML 2021*. [arXiv:2104.00298](https://arxiv.org/abs/2104.00298)
2. Radenovic, F., Tolias, G., & Chum, O. (2018). Fine-Tuning CNN Image Retrieval with No Human Annotation. *IEEE TPAMI*. [arXiv:1711.02512](https://arxiv.org/abs/1711.02512)
3. Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollar, P. (2017). Focal Loss for Dense Object Detection. *ICCV 2017*. [arXiv:1708.02002](https://arxiv.org/abs/1708.02002)
4. Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., & Wilson, A. G. (2018). Averaging Weights Leads to Wider Optima and Better Generalization. *UAI 2018*. [arXiv:1803.05407](https://arxiv.org/abs/1803.05407)
5. Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D. (2018). mixup: Beyond Empirical Risk Minimization. *ICLR 2018*. [arXiv:1710.09412](https://arxiv.org/abs/1710.09412)
6. Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., & Yoo, Y. (2019). CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features. *ICCV 2019*. [arXiv:1905.04899](https://arxiv.org/abs/1905.04899)
7. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the Inception Architecture for Computer Vision. *CVPR 2016*. [arXiv:1512.00567](https://arxiv.org/abs/1512.00567)
8. Yosinski, J., Clune, J., Bengio, Y., & Lipson, H. (2014). How Transferable Are Features in Deep Neural Networks? *NeurIPS 2014*. [arXiv:1411.1792](https://arxiv.org/abs/1411.1792)
9. Howard, J., & Ruder, S. (2018). Universal Language Model Fine-tuning for Text Classification. *ACL 2018*. [arXiv:1801.06146](https://arxiv.org/abs/1801.06146)
10. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. *JMLR*, 15(56), 1929–1958. [http://jmlr.org/papers/v15/srivastava14a.html](http://jmlr.org/papers/v15/srivastava14a.html)
11. Ioffe, S., & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. *arXiv preprint*. [arXiv:1502.03167](https://arxiv.org/abs/1502.03167)
12. Tarvainen, A., & Valpola, H. (2017). Mean Teachers are Better Role Models: Weight-averaged Consistency Targets Improve Semi-supervised Deep Learning Results. *NeurIPS 2017*. [arXiv:1703.01780](https://arxiv.org/abs/1703.01780)
13. Perez, L., & Wang, J. (2017). The Effectiveness of Data Augmentation in Image Classification using Deep Learning. *arXiv preprint*. [arXiv:1712.04621](https://arxiv.org/abs/1712.04621)
14. Shanmugam, D., Blalock, D., Balakrishnan, G., Guttag, J., & Sarma, A. (2020). Towards Principled Test-Time Augmentation. *ICML 2020*. [PDF](https://dmshanmugam.github.io/pdfs/icml_2020_testaug.pdf)
15. Loshchilov, I., & Hutter, F. (2017). SGDR: Stochastic Gradient Descent with Warm Restarts. *ICLR 2017*. [arXiv:1608.03983](https://arxiv.org/abs/1608.03983)
16. Prechelt, L. (1998). Automatic Early Stopping Using Cross Validation: Quantifying the Criteria. *Neural Networks*, 11(4), 761–767. [https://doi.org/10.1016/S0893-6080(98)00010-0](https://doi.org/10.1016/S0893-6080(98)00010-0)
17. Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On Calibration of Modern Neural Networks. *ICML 2017*. [arXiv:1706.04599](https://arxiv.org/abs/1706.04599)
18. Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2017). Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization. *ICCV 2017*. [arXiv:1610.02391](https://arxiv.org/abs/1610.02391)
19. van der Maaten, L., & Hinton, G. (2008). Visualizing Data using t-SNE. *JMLR*, 9(Nov), 2579–2605. [http://jmlr.org/papers/v9/vandermaaten08a.html](http://jmlr.org/papers/v9/vandermaaten08a.html)
20. Hand, D. J., & Till, R. J. (2001). A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems. *Machine Learning*, 45(2), 171–186. [https://doi.org/10.1023/A:1010920819831](https://doi.org/10.1023/A:1010920819831)
21. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., ... & Fei-Fei, L. (2015). ImageNet Large Scale Visual Recognition Challenge. *IJCV*, 115(3), 211–252. [arXiv:1409.0575](https://arxiv.org/abs/1409.0575)
22. Lakshminarayanan, B., Pritzel, A., & Blundell, C. (2017). Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. *NeurIPS 2017*. [arXiv:1612.01474](https://arxiv.org/abs/1612.01474)
## Citation
```bibtex
@misc{saugani2026_arch_building,
title={Fine-Grained Image Classification of World Architecture:
An EfficientNetV2-S Transfer Learning Approach with Layered Regularization},
author={Saugani},
year={2026},
publisher={Hugging Face},
url={https://huggingface.co/0xgr3y/Arch-Building-Image-Classification}
}
```