---
license: apache-2.0
pipeline_tag: image-classification
tags:
  - efficientnetv2
  - fgic
  - safetensors
  - transfer-learning
  - gem-pooling
  - focal-loss
  - swa
  - grad-cam
  - calibration
  - temperature-scaling
  - computer-vision
  - tensorflow.js
library_name: keras
language: en
datasets:
  - 0xgr3y/arch-building-dataset
model-index:
  - name: Architectural Building Image Classifier
    results:
      - task:
          type: image-classification
          name: Fine-Grained Image Classification
        dataset:
          type: imagefolder
          name: arch-building-dataset
          split: test
        metrics:
          - type: accuracy
            value: 0.9777
            name: Test Accuracy
          - type: accuracy
            value: 0.9836
            name: Validation Accuracy (SWA)
          - type: accuracy
            value: 0.9799
            name: TTA Accuracy
          - type: f1
            value: 0.9777
            name: Macro F1
          - type: precision
            value: 0.9777
            name: Macro Precision
          - type: recall
            value: 0.9777
            name: Macro Recall
          - type: roc_auc
            value: 0.9985
            name: Macro ROC-AUC (OvR)
---

![Arch-Building-Image-Classification](results/greyscope-labs-architecture-classification-efficientnetv2.jpg)

# Fine-Grained Image Classification of World Architecture: An EfficientNetV2-S Transfer Learning Approach with Layered Regularization

### Architectural Building Image Classifier

Fine-Grained Image Classification (FGIC) of world architectural buildings using CNN transfer learning with EfficientNetV2-S, enhanced with GeM Pooling, Focal Loss, an AdamW optimizer with discriminative learning rates (DiscriminativeAdamW), Stochastic Weight Averaging (SWA), Grad-CAM explainability, and calibration analysis.

<table>
  <tr><td><strong>Architecture</strong></td><td>EfficientNetV2-S + GeM Pooling + Focal Loss + SWA</td></tr>
  <tr><td><strong>Task</strong></td><td>Fine-Grained Image Classification (FGIC)</td></tr>
  <tr><td><strong>Test Accuracy</strong></td><td>97.77%</td></tr>
  <tr><td><strong>Classes</strong></td><td>8 (barn, bridge, castle, mosque, skyscraper, stadium, temple, windmill)</td></tr>
  <tr><td><strong>Input Size</strong></td><td>320 × 320 pixels</td></tr>
  <tr><td><strong>Parameters</strong></td><td>23,350,633</td></tr>
  <tr><td><strong>Framework</strong></td><td>TensorFlow / Keras 3</td></tr>
  <tr><td><strong>License</strong></td><td><a href="https://www.apache.org/licenses/LICENSE-2.0">Apache-2.0</a></td></tr>
</table>

## Model Description

A fine-grained image classification model for world architectural buildings. It is built on EfficientNetV2-S pretrained on ImageNet and enhanced with GeM Pooling (learnable generalized mean pooling), Focal Loss, DiscriminativeAdamW, and Stochastic Weight Averaging (SWA). Evaluation adds Grad-CAM explainability visualization, ROC-AUC, ECE calibration analysis, and t-SNE embedding visualization.

**Key architectural contributions:**

- **GeM Pooling** (Radenovic et al., IEEE TPAMI 2018): replaces global average pooling with a learnable power parameter (p=3.0) that emphasizes high-activation features, yielding stronger discriminative representations for FGIC tasks
- **Focal Loss** (Lin et al., ICCV 2017, gamma=2.0): down-weights well-classified examples to focus gradient updates on hard-to-classify building pairs
- **DiscriminativeAdamW**: extends AdamW with per-variable LR scaling on block6 (×0.1) via an `update_step` override, combined with selective fine-tuning (block6+top_conv unfrozen, BN frozen). The block6 variables therefore receive a 10× smaller learning rate than the head variables (117 variables in total: 105 in block6, 12 in the head)
- **Mixup + CutMix** (Zhang et al., ICLR 2018; Yun et al., ICCV 2019): alternated per batch (50/50) between Mixup (alpha=0.2, linear interpolation) and CutMix (alpha=1.0, spatial patch). Applied only in Phase 1 training to regularize head learning
- **Selective Unfreeze** (Yosinski et al., 2014): Phase 2 unfreezes the block6+top_conv layers (180 of 513 EfficientNetV2-S layers) while keeping BatchNormalization frozen to preserve pretrained statistics
- **SWA with BN re-estimation** (Izmailov et al., UAI 2018): 10-epoch post-training weight averaging with constant LR 1e-4, followed by 100-step batch normalization statistics re-estimation (3,200 images)
- **Test-Time Augmentation**: 6 variations averaged at inference: original, horizontal flip, center crop 85%, center crop 70%, corner crop top-left 80%, corner crop bottom-right 80%. Yields a +0.22 percentage-point accuracy gain (97.77% → 97.99%)
- **Grad-CAM** (Selvaraju et al., ICCV 2017): gradient-weighted class activation mapping for explainability, targeting `top_conv` (last Conv2D layer of EfficientNetV2-S)
- **ECE Calibration** (Guo et al., ICML 2017): Expected Calibration Error with a 15-bin reliability diagram to assess how reliable the prediction confidence is
- **Temperature Scaling** (Guo et al., ICML 2017): post-hoc calibration via a scalar temperature T optimized on the validation set by NLL minimization. T=0.54 reduces ECE from 12.04% (underconfident due to label smoothing) to 0.53% and is applied at inference as `softmax(log(probs) / T)`
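
The temperature-scaling step in the last bullet can be sketched as follows. This is a minimal illustration, assuming the model emits softmax probabilities so that `log(probs)` recovers the logits up to an additive constant. The function name `apply_temperature` and the three-class example vector are hypothetical and are not taken from `build_model.py`.

```python
import math

def apply_temperature(probs, temperature=0.54, eps=1e-12):
    """Rescale softmax probabilities as softmax(log(probs) / T)."""
    # log(probs) equals the logits up to a constant, which softmax ignores.
    scaled = [math.log(max(p, eps)) / temperature for p in probs]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

probs = [0.70, 0.20, 0.10]             # hypothetical 3-class output
calibrated = apply_temperature(probs)  # T < 1 sharpens the distribution
```

With T below 1 the top class gains probability mass, which is the direction needed to correct an underconfident model. T=1 leaves the probabilities unchanged.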

## Architecture

```
Input (320, 320, 3)
  │
  EfficientNetV2-S (ImageNet pretrained, 513 layers, 20.33M params)
  │
  Conv2D(256, 3×3, ReLU, padding=same)     →  2,949,376 params
  BatchNormalization                        →  1,024 params
  MaxPooling2D(2×2)                         →  0 params
  │
  GeM Pooling(p=3.0, eps=1e-6, learnable)  →  1 param
  │
  Dense(256, ReLU)                          →  65,792 params
  BatchNormalization                        →  1,024 params
  Dropout(0.4)                              →  0 params
  │
  Dense(8, Softmax)                         →  2,056 params
  │
Output (8 classes)
```

| Component | Output Shape | Parameters |
|-----------|-------------|------------|
| EfficientNetV2-S (Functional) | (None, 10, 10, 1280) | 20,331,360 |
| Conv2D 256 3×3 | (None, 10, 10, 256) | 2,949,376 |
| BatchNormalization | (None, 10, 10, 256) | 1,024 |
| MaxPooling2D 2×2 | (None, 5, 5, 256) | 0 |
| GeM Pooling p=3.0 | (None, 256) | 1 |
| Dense 256 ReLU | (None, 256) | 65,792 |
| BatchNormalization | (None, 256) | 1,024 |
| Dropout 0.4 | (None, 256) | 0 |
| Dense 8 Softmax | (None, 8) | 2,056 |
| **Total** | | **23,350,633** |
| Trainable (Phase 1) | | **3,018,249** (11.51 MB) |
| Trainable (Phase 2) | | **17,810,225** (67.94 MB) |
| Non-trainable (Phase 1) | | **20,332,384** (77.56 MB) |
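
The GeM Pooling row reduces a `(5, 5, 256)` feature map to a 256-dimensional vector. The NumPy sketch below shows the generalized-mean formula with a fixed `p`. In the model `p` is a learnable weight, so this is an illustration of the operation and not the `GeMPooling` layer itself; `gem_pool` and the random feature map are hypothetical.

```python
import numpy as np

def gem_pool(features, p=3.0, eps=1e-6):
    """Generalized mean pooling over the spatial axes of an (H, W, C) map."""
    clipped = np.maximum(features, eps)  # keep values positive before the power
    return np.mean(clipped ** p, axis=(0, 1)) ** (1.0 / p)

rng = np.random.default_rng(0)
fmap = rng.random((5, 5, 256)).astype(np.float32)  # shape of the MaxPooling2D output
pooled = gem_pool(fmap)                            # shape (256,)
```

Setting `p=1` gives ordinary average pooling, and larger `p` moves the result toward max pooling, which is why GeM emphasizes high-activation features.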

## Performance

### Overall Metrics

| Metric | Value |
|--------|-------|
| Test Accuracy | 97.77% |
| Validation Accuracy (SWA) | 98.36% |
| Test Accuracy (with TTA) | 97.99% |
| Test Loss | 0.4262 |
| Overfitting Gap (Train − Test) | 2.11% |
| Macro Avg Precision | 0.9777 |
| Macro Avg Recall | 0.9777 |
| Macro Avg F1-Score | 0.9777 |
| Top-2 Accuracy | 99.26% |
| Top-3 Accuracy | 99.70% |
| Macro ROC-AUC (OvR) | 0.9985 |
| ECE (15 bins) | 0.1204 before temperature scaling; 0.0053 after (T=0.54) |
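
For readers unfamiliar with the ECE row, the sketch below computes Expected Calibration Error with equal-width confidence bins. It is a generic implementation of the standard definition, assuming top-class confidences and 0/1 correctness flags as inputs. The function name and the five example predictions are hypothetical, and the repository's evaluation code may bin differently.

```python
def expected_calibration_error(confidences, correct, n_bins=15):
    """Bin-size-weighted mean of |accuracy - confidence| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / total * abs(accuracy - avg_conf)
    return ece

# Hypothetical predictions: top-class confidence and whether each was correct.
confidences = [0.95, 0.90, 0.85, 0.65, 0.55]
correct = [1, 1, 1, 1, 0]
ece = expected_calibration_error(confidences, correct)
```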

### Per-Class Results

| Class | Precision | Recall | F1-Score | AUC (OvR) | Support |
|-------|-----------|--------|----------|-----------|---------|
| barn | 0.9760 | 0.9702 | 0.9731 | 0.9950 | 168 |
| bridge | 0.9591 | 0.9762 | 0.9676 | 0.9983 | 168 |
| castle | 0.9763 | 0.9821 | 0.9792 | 0.9996 | 168 |
| mosque | 0.9763 | 0.9821 | 0.9792 | 0.9987 | 168 |
| skyscraper | 0.9940 | 0.9940 | 0.9940 | 0.9999 | 168 |
| stadium | 0.9820 | 0.9762 | 0.9791 | 0.9999 | 168 |
| temple | 0.9816 | 0.9524 | 0.9668 | 0.9976 | 168 |
| windmill | 0.9765 | 0.9881 | 0.9822 | 0.9987 | 168 |
| **Macro Avg** | **0.9777** | **0.9777** | **0.9777** | **0.9985** | **1,344** |
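
The macro averages in the last row are unweighted means of the per-class values. The short check below reproduces them from the table; only the helper name `macro_average` is introduced here.

```python
# Per-class values copied from the table above, in row order (barn ... windmill).
precision = [0.9760, 0.9591, 0.9763, 0.9763, 0.9940, 0.9820, 0.9816, 0.9765]
recall    = [0.9702, 0.9762, 0.9821, 0.9821, 0.9940, 0.9762, 0.9524, 0.9881]
f1        = [0.9731, 0.9676, 0.9792, 0.9792, 0.9940, 0.9791, 0.9668, 0.9822]

def macro_average(values):
    """Unweighted mean over classes, so every class counts equally."""
    return sum(values) / len(values)

macro_precision = macro_average(precision)
macro_recall = macro_average(recall)
macro_f1 = macro_average(f1)
```

Because every class has the same support (168 images), macro recall coincides with overall accuracy, which is why it matches the 97.77% test accuracy.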

### Model Selection

Four candidate models were evaluated on the validation set:

| Checkpoint | Val Accuracy | Val Loss | Description |
|------------|-------------|----------|-------------|
| `head_training.keras` | 92.34% | 1.0109 | Phase 1 checkpoint (backbone frozen) |
| `fine_tuning.keras` | 96.28% | 0.5655 | Phase 2 checkpoint (block6+top_conv unfrozen) |
| `fine_tuning_ema.keras` | 93.53% | 0.6007 | Phase 2 EMA (per-step Polyak averaging) |
| **`fine_tuning_swa.keras`** | **98.36%** | **0.4109** | **SWA averaged weights ← SELECTED** |

### Training Progression

| Phase | Epoch | Train Acc | Val Accuracy | Val Loss |
|-------|-------|-----------|-------------|----------|
| Phase 1 (Head Training) | 1 | 56.96% | 92.19% | 1.0079 |
| Phase 2 (Selective Fine-Tuning) | 1 | 84.96% | 96.21% | 0.5656 |
| SWA | 1 | 90.83% | 95.76% | 0.5831 |
| SWA | 2 | 94.07% | 97.62% | 0.5116 |
| SWA | 3 | 95.36% | 97.69% | 0.4748 |
| SWA | 4 | 96.56% | 96.95% | 0.4390 |
| SWA | 5 | 97.18% | 97.47% | 0.4490 |
| SWA | 6 | 97.76% | 97.84% | 0.4416 |
| SWA | 7 | 97.91% | 98.14% | 0.4055 |
| SWA | 8 | 98.19% | 97.32% | 0.4359 |
| SWA | 9 | 98.14% | 97.02% | 0.4519 |
| SWA | 10 | 98.59% | 97.54% | 0.4226 |
| **SWA + BN (final)** | — | — | **98.36%** | **0.4109** |

> Phase 1 and Phase 2 each stopped after 1 epoch via `myCallback` (custom early stopping at target accuracy: 85% Phase 1, 92% Phase 2). SWA ran 10 epochs with constant LR 1e-4, followed by BN re-estimation (100 steps, 3,200 images). Values shown are training-time metrics from the progress bar; checkpoint evaluation values may differ slightly (see the Model Selection table above).
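
The weight averaging behind the SWA rows can be sketched as an equal-weight running mean over weight snapshots. This is a minimal illustration that assumes one snapshot per SWA epoch, which the card does not state explicitly. `update_swa` and the two-parameter snapshots are hypothetical.

```python
def update_swa(swa_weights, new_weights, n_averaged):
    """Fold one more weight snapshot into an equal-weight running average."""
    return [
        (avg * n_averaged + new) / (n_averaged + 1)
        for avg, new in zip(swa_weights, new_weights)
    ]

# Hypothetical end-of-epoch snapshots of a two-parameter model.
snapshots = [[1.0, 4.0], [2.0, 2.0], [3.0, 0.0]]
swa = snapshots[0]
for count, weights in enumerate(snapshots[1:], start=1):
    swa = update_swa(swa, weights, count)
```

The averaged weights were never used in a forward pass during training, so the BatchNormalization running statistics no longer match them. That is the reason for the BN re-estimation step after averaging.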

![Training Curves](results/training_curves.png)

![Confusion Matrix](results/confusion_matrix.png)

![Per-Class Accuracy](results/per_class_accuracy.png)

![Confidence Per Class](results/confidence_per_class.png)

![t-SNE Embedding](results/tsne_embedding.png)

![Grad-CAM Heatmaps](results/gradcam_heatmaps.png)

## Training Details

### Training Strategy

Two-phase progressive training with SWA post-processing:

| Phase | Description | Backbone | Optimizer | LR | Max Epochs | Actual Epochs | CutMix+Mixup | FocalLoss LS |
|-------|-------------|----------|-----------|-----|-----------|---------------|---------------|-------------|
| **Phase 1**: Feature Extraction | Train custom head only | Frozen (all) | AdamW (wd=2e-5) | 0.001 + CosineDecay + Warmup 3ep | 25 | 1 ¹ | Yes (50/50 alternation) | 0.1 |
| **Phase 2**: Selective Fine-Tuning | Load head_training → fine-tune | block6 + top_conv unfrozen (BN frozen) | DiscriminativeAdamW (block6=0.1×) | 3e-4 + CosineDecay + Warmup 5ep | 50 | 1 + 10 SWA ² | No | 0.05 |

> ¹ Phase 1 stops when `val_accuracy ≥ 85%` threshold (myCallback).

> ² Phase 2 stops when `val_accuracy ≥ 92%` threshold (myCallback), followed by 10 SWA epochs (constant LR 1e-4).
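
The "CosineDecay + Warmup" entries in the LR column can be sketched as a linear warmup followed by cosine annealing. The code below is an illustration under stated assumptions, not the repository's `WarmupCosineDecay` class: it assumes a per-step schedule, decay to zero, and 336 steps per epoch (10,752 training images at batch size 32). The function name and the `min_lr` parameter are hypothetical.

```python
import math

def warmup_cosine_lr(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Phase 1 settings from the table; steps per epoch is an assumption.
steps_per_epoch = 10752 // 32
warmup_steps = 3 * steps_per_epoch
total_steps = 25 * steps_per_epoch
lr_start = warmup_cosine_lr(0, 1e-3, warmup_steps, total_steps)
lr_peak = warmup_cosine_lr(warmup_steps, 1e-3, warmup_steps, total_steps)
lr_end = warmup_cosine_lr(total_steps, 1e-3, warmup_steps, total_steps)
```

Under these assumptions, a phase that stops after a single epoch ends while the schedule is still in its warmup segment, so the nominal peak learning rate is never reached.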

### Hyperparameters

| Parameter | Phase 1 | Phase 2 |
|-----------|---------|---------|
| Optimizer | AdamW | DiscriminativeAdamW |
| Learning Rate | 0.001 | 3×10⁻⁴ |
| LR Schedule | WarmupCosineDecay (warmup=3) | WarmupCosineDecay (warmup=5) |
| Weight Decay | 2×10⁻⁵ | 2×10⁻⁵ |
| LR Multiplier (block6) | — | 0.1× (LR scaling via update_step, truly discriminative) |
| LR Multiplier (top_conv+head) | — | 1.0× |
| Loss | FocalLoss (gamma=2.0, LS=0.1) | FocalLoss (gamma=2.0, LS=0.05) |
| Batch Size | 32 | 32 |
| Early Stopping Patience | 7 | 12 |
| myCallback Threshold | val_acc ≥ 0.85 | val_acc ≥ 0.92 |
| EMA Decay (per-step) | 0.999 | 0.999 |
| SWA Epochs | — | 10 (post-training) |
| SWA LR | — | 1×10⁻⁴ (constant) |
| BN Re-estimation Steps | — | 100 |
| CutMix (alpha=1.0) | Yes (50% batches) | No |
| Mixup (alpha=0.2) | Yes (50% batches) | No |
| Hardware | 2× Tesla T4 (MirroredStrategy) | 2× Tesla T4 (MirroredStrategy) |
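
The Loss row combines focal loss with label smoothing. The sketch below shows one common formulation for a single sample: smoothed one-hot targets, and a `(1 - p)^gamma` factor on each cross-entropy term. It is an assumption that the repository's `FocalLoss` follows this form, and the function name and example probabilities are hypothetical.

```python
import math

def focal_loss(probs, true_index, gamma=2.0, label_smoothing=0.1, eps=1e-7):
    """Categorical focal loss for one sample with smoothed one-hot targets."""
    n_classes = len(probs)
    loss = 0.0
    for k, p in enumerate(probs):
        p = min(max(p, eps), 1.0 - eps)
        target = (1.0 - label_smoothing) * (1.0 if k == true_index else 0.0)
        target += label_smoothing / n_classes
        # (1 - p)^gamma shrinks the term of confidently predicted classes.
        loss += -target * (1.0 - p) ** gamma * math.log(p)
    return loss

easy = focal_loss([0.9, 0.05, 0.05], true_index=0)  # confident and correct
hard = focal_loss([0.3, 0.6, 0.1], true_index=0)    # wrong top class
```

With `gamma=0` and `label_smoothing=0` the expression reduces to ordinary cross-entropy, so the two hyperparameters can be read as independent modifications of that baseline.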

### Regularization Strategy

| Technique | Implementation | Reference |
|-----------|---------------|-----------|
| Transfer Learning | EfficientNetV2-S backbone frozen in Phase 1 | Yosinski et al., NeurIPS 2014 |
| Selective Fine-Tuning | Unfreeze block6+top_conv only, BN stays frozen | Howard & Ruder, ACL 2018 |
| Discriminative LR Scaling | block6 LR×0.1 via update_step (truly discriminative — 10× smaller updates for pretrained features) | Howard & Ruder, ACL 2018 |
| CutMix + Mixup | Alternation per batch (50/50), Phase 1 only | Yun et al., ICCV 2019; Zhang et al., ICLR 2018 |
| Focal Loss | gamma=2.0, down-weights easy examples | Lin et al., ICCV 2017 |
| Label Smoothing | 0.1 (Phase 1) → 0.05 (Phase 2) | Szegedy et al., CVPR 2016 |
| GeM Pooling | p=3.0 learnable, replaces GAP | Radenovic et al., IEEE TPAMI 2018 |
| Dropout | 0.4 after Dense(256)+BN | Srivastava et al., JMLR 2014 |
| Batch Normalization | After Conv2D and Dense; frozen during fine-tuning | Ioffe & Szegedy, arXiv 2015 |
| EMA (per-step) | Shadow weights, decay=0.999, Polyak averaging | Tarvainen & Valpola, NeurIPS 2017 |
| SWA | 10-epoch post-training, constant LR 1e-4 | Izmailov et al., UAI 2018 |
| Data Augmentation | Rotation ±15°, shift ±10%, shear ±0.1 rad, zoom ±20%, brightness 0.75–1.15, channel shift ±10.0, horizontal flip | Perez & Wang, arXiv 2017 |
| Random Erasing | p=0.5, area [0.02–0.15], aspect [0.3–3.3], applied pre-normalization | Zhong et al., AAAI 2020 |
| Test-Time Augmentation | 6 augmentation variants, averaged | Shanmugam et al., ICML 2020 |
| WarmupCosineDecay | Linear warmup + cosine annealing | Loshchilov & Hutter, ICLR 2017 (SGDR) |
| Early Stopping | Patience 7 (Phase 1) / 12 (Phase 2) | Prechelt, Neural Networks 1998 |

### Dataset

See the dataset curation page for [World Architectural Buildings Dataset for Multi‑Class Image Classification](https://huggingface.co/datasets/0xgr3y/arch-building-dataset) — 13,440 images (8 classes × 1,680, balanced) sourced from Pexels with perceptual (pHash) and exact (SHA256) deduplication.

| Split | Images | Percentage |
|-------|--------|------------|
| Train | 10,752 | 80% |
| Validation | 1,344 | 10% |
| Test | 1,344 | 10% |

### Data Preprocessing

- **Normalization:** `preprocess_input` from `tf.keras.applications.efficientnet_v2` (ImageNet distribution)
- **Input resolution:** 320×320 (higher than ImageNet default 224×224 to capture fine-grained architectural details — textures, ornaments, facade patterns)
- **Augmentation:** Applied to the training set only; validation and test sets use clean preprocessing
- **Split method:** `splitfolders.ratio` from `dataset/`, seed=42

## Files

| Category | Files |
|----------|-------|
| **Model (best)** | `fine_tuning_swa.keras` (227 MB) · `.weights.h5` (158 MB) · `.safetensors` (157 MB) |
| **Code** | `build_model.py` (21 KB) — architecture + CLI inference |
| **Config** | `config.json` · `label_mapping.json` · `preprocessor_config.json` |
| **Evaluation** | `calibration_data.json` · `model_benchmark.json` · `confusion_pairs.json` · `class_confidence_stats.json` · `temperature_config.json` |
| **Deployment** | `saved_model/` (183 MB) · `tflite/` (88 MB) · `tfjs_model/` (90 MB, 23 shards) |
| **Results** | `results/` — 12 PNG (augmentation, reliability-diagram, training curves, confusion matrix, ROC, t-SNE, Grad-CAM, etc.) |
| **Archive** | `models_keras/` — 3 checkpoints (head_training, fine_tuning, fine_tuning_ema) |

## Usage

### Gradio Space

Try the live building classifier: [Architecture Building Image Classifier with Space](https://huggingface.co/spaces/0xgr3y/arch-building-classifier)

### Python — build_model.py (recommended)

`build_model.py` is a standalone module that provides:
- **Custom class definitions** (`GeMPooling`, `FocalLoss`, `DiscriminativeAdamW`) with `@register_keras_serializable` — importing the module registers all custom classes globally, so `load_model()` works without explicit `custom_objects`.
- **`ArchBuildingClassifier`** — high-level wrapper class with `build()`, `from_weights()`, `from_keras()`, `predict()`, `predict_batch()` methods.
- **`CUSTOM_OBJECTS`** dict — fallback for explicit `custom_objects=` in `load_model()`.
- **`build_model()`** — backward-compatible function that returns a raw `tf.keras.Model`.

Place `build_model.py` in the same directory as your script or add it to `PYTHONPATH`.

> **Note:** Filenames below use `fine_tuning_swa` as an example. The actual best checkpoint filename depends on training results — check the repo for the actual `.keras`, `.weights.h5`, and `.safetensors` filenames.

```python
from build_model import ArchBuildingClassifier
from huggingface_hub import hf_hub_download
from PIL import Image

# Download weights (clean format)
weights_path = hf_hub_download("0xgr3y/Arch-Building-Image-Classification", "fine_tuning_swa.weights.h5")

# Load model: architecture + weights
clf = ArchBuildingClassifier.from_weights(weights_path)

# Inference
label, confidence, top3 = clf.predict(Image.open("skyscraper_00000.jpg"))
print(f"Predicted: {label} ({confidence:.1%})")
for cls, prob in top3:
    print(f"  {cls}: {prob:.1%}")
```

### Python — TF-Lite (fastest inference)

```python
import numpy as np
import tensorflow as tf
from huggingface_hub import hf_hub_download
from PIL import Image
import json

try:
    from tensorflow.keras.applications.efficientnet_v2 import preprocess_input
except (ImportError, ModuleNotFoundError):
    from tensorflow.keras.applications.efficientnet import preprocess_input

# Download
model_path = hf_hub_download("0xgr3y/Arch-Building-Image-Classification", "tflite/model.tflite")
labels_path = hf_hub_download("0xgr3y/Arch-Building-Image-Classification", "label_mapping.json")

with open(labels_path) as f:
    LABELS = json.load(f)["labels"]

interpreter = tf.lite.Interpreter(model_path=model_path)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

img = Image.open("skyscraper_00000.jpg").convert("RGB").resize((320, 320))
arr = np.expand_dims(preprocess_input(
    np.array(img, dtype=np.float32)), axis=0)

interpreter.set_tensor(input_details[0]["index"], arr)
interpreter.invoke()
preds = interpreter.get_tensor(output_details[0]["index"])[0]

top3_idx = np.argsort(preds)[::-1][:3]
for i in top3_idx:
    print(f"  {LABELS[i]}: {preds[i]*100:.1f}%")
```

### Python — Keras (convenient)

```python
import build_model  # registers custom classes via @register_keras_serializable
import tensorflow as tf
from huggingface_hub import hf_hub_download
try:
    from tensorflow.keras.applications.efficientnet_v2 import preprocess_input
except (ImportError, ModuleNotFoundError):
    from tensorflow.keras.applications.efficientnet import preprocess_input
from PIL import Image
import numpy as np
import json

model_path = hf_hub_download("0xgr3y/Arch-Building-Image-Classification", "fine_tuning_swa.keras")
labels_path = hf_hub_download("0xgr3y/Arch-Building-Image-Classification", "label_mapping.json")

model = tf.keras.models.load_model(model_path, compile=False)  # custom_objects not needed

with open(labels_path) as f:
    LABELS = json.load(f)["labels"]

img = Image.open("skyscraper_00000.jpg").convert("RGB").resize((320, 320))
arr = np.expand_dims(preprocess_input(np.array(img, dtype=np.float32)), axis=0)
preds = model.predict(arr, verbose=0)[0]
print(f"Predicted: {LABELS[np.argmax(preds)]} ({np.max(preds)*100:.1f}%)")
```

### Python — SavedModel (TF Serving)

```python
from huggingface_hub import snapshot_download
import tensorflow as tf
import numpy as np
from PIL import Image

try:
    from tensorflow.keras.applications.efficientnet_v2 import preprocess_input
except (ImportError, ModuleNotFoundError):
    from tensorflow.keras.applications.efficientnet import preprocess_input

snapshot_download("0xgr3y/Arch-Building-Image-Classification", allow_patterns=["saved_model/*"], local_dir=".")

# Load SavedModel (created via model.export() — inference-only, no custom_objects needed)
loaded = tf.saved_model.load("saved_model")

img = Image.open("skyscraper_00000.jpg").convert("RGB").resize((320, 320))
arr = tf.constant(np.expand_dims(preprocess_input(np.array(img, dtype=np.float32)), axis=0))
# model.export() exposes inference under the `serve` endpoint
preds = loaded.serve(arr).numpy()[0]

top3_idx = np.argsort(preds)[::-1][:3]
for i in top3_idx:
    print(f"  Class {i}: {preds[i]*100:.1f}%")
```

### Python — safetensors (HF standard, cross-framework)

> **Note:** safetensors stores raw weight tensors without architecture metadata. To load, reconstruct the architecture with `build_model.py` first, then map tensors manually. For most use cases, `.weights.h5` (via `ArchBuildingClassifier.from_weights()`) is simpler and equally clean.

```python
from safetensors.numpy import load_file
from build_model import ArchBuildingClassifier
from PIL import Image

# Reconstruct architecture
clf = ArchBuildingClassifier.build()

# Load safetensors tensors
tensors = load_file("fine_tuning_swa.safetensors")

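# Map tensors to model weights (iterate layers, not .variables: Keras 3 compatible)
loaded = 0
for layer in clf.keras_model.layers:
    for w in layer.weights:
        name = w.name.replace(':', '_').replace('/', '_')
        if name in tensors:
            w.assign(tensors[name])
            loaded += 1
# Unmatched names are skipped silently, so confirm that weights were actually assigned
print(f"Assigned {loaded} of {len(tensors)} tensors")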

# Inference
label, confidence, top3 = clf.predict(Image.open("skyscraper_00000.jpg"))
```

## Inference Verification

Keras vs TFLite consistency was verified on 8 random test samples (1 per class):

| Metric | Result |
|--------|--------|
| Keras correct | 7/8 (88%) |
| TFLite correct | 7/8 (88%) |
| Keras vs TFLite match | **8/8 (100%)** — identical predictions |
| Keras inference speed | 358.0 ms |
| TFLite inference speed | 170.0 ms |

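> The single misclassification (castle→barn, 65% confidence) occurs in both runtimes. With only 8 samples, this check says little about accuracy (see the 97.77% test accuracy above); its purpose is to show that Keras and TFLite return the same predicted class on every sample, which indicates that the TFLite conversion preserves model behavior on these inputs.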

![TFLite Inference](results/inference_tflite.png)

## Security Notice (PAIT-KERAS-301)

The `.keras` files in this repository are flagged **"Unsafe"** by [Protect AI Guardian](https://protectai.com/insights/models/0xgr3y/Arch-Building-Image-Classification) (threat: PAIT-KERAS-301). This is a **structural false positive**, not a malware detection:

- **What the scanner checks:** String-matching of `class_name` fields in the Keras v3 config against a whitelist of built-in Keras layers.
- **Why flagged:** The model contains a custom layer (`GeMPooling`) — a non-standard class name triggers the flag.
- **What it does NOT check:** The scanner does not analyze the Python code of the custom class, does not look for `eval()`/`exec()`/`os.system()`, and does not detect actual malware.
- **Other scanners:** VirusTotal, JFrog, HF Picklescan — all clean. Only Protect AI flags this file.

**The custom classes are safe and open source:**
- `GeMPooling`: Generalized Mean Pooling (Radenovic et al., IEEE TPAMI 2018). Pure tensor ops: `tf.pow`, `tf.reduce_mean`, `tf.maximum`.
- `FocalLoss`: Focal Loss (Lin et al., ICCV 2017). Pure tensor ops.
- `DiscriminativeAdamW`: AdamW subclass with per-variable learning-rate scaling. No file I/O, no network calls, no arbitrary code.

Full source code for all custom classes is available in [`build_model.py`](https://huggingface.co/0xgr3y/Arch-Building-Image-Classification/blob/main/build_model.py) and the training notebook for public audit.

## Multi-Format Deployment Guide

The model is provided in multiple formats to suit different deployment scenarios. Formats marked ✓ are **not flagged** by Protect AI (no custom class serialization).

| Format | File | Size | Protect AI | Inference Speed | Best For |
|--------|------|------|------------|-----------------|----------|
| **TF-Lite** ✓ | `tflite/model.tflite` | ~88 MB | ✓ Safe | **170.0 ms** (fastest) | Mobile, edge, embedded, HF Space |
| **SavedModel** ✓ | `saved_model/` | ~183 MB | ✓ Safe | — | TensorFlow Serving, cloud backend |
| **TFJS** ✓ | `tfjs_model/` | ~90 MB | ✓ Safe | — | Browser, Node.js (no backend) |
| **Weights H5** ✓ | `fine_tuning_swa.weights.h5` | ~158 MB | ✓ Safe | — | Programmatic load via `build_model.py` |
| **safetensors** ✓ | `fine_tuning_swa.safetensors` | ~157 MB | ✓ Safe | — | HF standard, cross-framework |
| **Build Script** ✓ | `build_model.py` | ~21 KB | ✓ Safe | — | Architecture reconstruction + `load_weights()` |
| **Keras** ℹ | `fine_tuning_swa.keras` | ~227 MB | ℹ Flagged | 358.0 ms | Developer reference, fine-tuning |

### Load Examples

See **Usage** section above for complete load + inference examples for each format.

## Intended Use

- Architectural style classification from building photographs
- Educational tool for architecture recognition
- Research baseline for fine-grained image classification (FGIC)
- Transfer learning experiments on architectural imagery

## Limitations

- Trained on Pexels stock photography — performance may differ on user-generated or field photographs
- Limited to 8 architectural classes (barn, bridge, castle, mosque, skyscraper, stadium, temple, windmill)
- Confusion pair analysis found **0 significant pairs** (threshold >5%), so no pair of classes is systematically confused by the model; see `confusion_pairs.json` for details
- Barn and windmill share 3 cross-class duplicates (0.02% of dataset) — left as-is due to negligible impact
- Inference confidence can be low on atypical examples

![Misclassification Examples](results/misclassification_examples.png)

## Ethical Considerations

- All training images sourced from [Pexels.com](https://www.pexels.com) under the Pexels License (free for commercial use, no attribution required). No copyrighted or personally identifiable images were used.
- The dataset contains only photographs of buildings and structures — no people, faces, or private property are the subject of classification.
- The model reflects the visual distribution of Pexels stock photography, which may over-represent Western and iconic architectural styles and under-represent vernacular or regional architecture.
- The 8 class categories are broad and do not capture the full diversity of world architecture. Results should not be used to make definitive claims about architectural categorization.
- URL pattern filtering during dataset collection explicitly excluded AI-generated art, illustrations, and non-photographic content to ensure authenticity.

## Links

- **Gradio Space (Live):** [arch-building-classifier Space](https://huggingface.co/spaces/0xgr3y/arch-building-classifier)
- **Dataset Studio:** [0xgr3y/arch-building-dataset](https://huggingface.co/datasets/0xgr3y/arch-building-dataset)
- **GitHub Repository:** [arcxteam/building-architectural-image-classifier](https://github.com/arcxteam/building-architectural-image-classifier)

## References

1. Tan, M., & Le, Q. V. (2021). EfficientNetV2: Smaller Models and Faster Training. *ICML 2021*. [arXiv:2104.00298](https://arxiv.org/abs/2104.00298)
2. Radenovic, F., Tolias, G., & Chum, O. (2018). Fine-Tuning CNN Image Retrieval with No Human Annotation. *IEEE TPAMI*. [arXiv:1711.02512](https://arxiv.org/abs/1711.02512)
3. Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollar, P. (2017). Focal Loss for Dense Object Detection. *ICCV 2017*. [arXiv:1708.02002](https://arxiv.org/abs/1708.02002)
4. Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., & Wilson, A. G. (2018). Averaging Weights Leads to Wider Optima and Better Generalization. *UAI 2018*. [arXiv:1803.05407](https://arxiv.org/abs/1803.05407)
5. Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D. (2018). mixup: Beyond Empirical Risk Minimization. *ICLR 2018*. [arXiv:1710.09412](https://arxiv.org/abs/1710.09412)
6. Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., & Yoo, Y. (2019). CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features. *ICCV 2019*. [arXiv:1905.04899](https://arxiv.org/abs/1905.04899)
7. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the Inception Architecture for Computer Vision. *CVPR 2016*. [arXiv:1512.00567](https://arxiv.org/abs/1512.00567)
8. Yosinski, J., Clune, J., Bengio, Y., & Lipson, H. (2014). How Transferable Are Features in Deep Neural Networks? *NeurIPS 2014*. [arXiv:1411.1792](https://arxiv.org/abs/1411.1792)
9. Howard, J., & Ruder, S. (2018). Universal Language Model Fine-tuning for Text Classification. *ACL 2018*. [arXiv:1801.06146](https://arxiv.org/abs/1801.06146)
10. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. *JMLR*, 15(56), 1929–1958. [http://jmlr.org/papers/v15/srivastava14a.html](http://jmlr.org/papers/v15/srivastava14a.html)
11. Ioffe, S., & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. *arXiv preprint*. [arXiv:1502.03167](https://arxiv.org/abs/1502.03167)
12. Tarvainen, A., & Valpola, H. (2017). Mean Teachers are Better Role Models: Weight-averaged Consistency Targets Improve Semi-supervised Deep Learning Results. *NeurIPS 2017*. [arXiv:1703.01780](https://arxiv.org/abs/1703.01780)
13. Perez, L., & Wang, J. (2017). The Effectiveness of Data Augmentation in Image Classification using Deep Learning. *arXiv preprint*. [arXiv:1712.04621](https://arxiv.org/abs/1712.04621)
14. Shanmugam, D., Blalock, D., Balakrishnan, G., Guttag, J., & Sarma, A. (2020). Towards Principled Test-Time Augmentation. *ICML 2020*. [PDF](https://dmshanmugam.github.io/pdfs/icml_2020_testaug.pdf)
15. Loshchilov, I., & Hutter, F. (2017). SGDR: Stochastic Gradient Descent with Warm Restarts. *ICLR 2017*. [arXiv:1608.03983](https://arxiv.org/abs/1608.03983)
16. Prechelt, L. (1998). Automatic Early Stopping Using Cross Validation: Quantifying the Criteria. *Neural Networks*, 11(4), 761–767. [https://doi.org/10.1016/S0893-6080(98)00010-0](https://doi.org/10.1016/S0893-6080(98)00010-0)
17. Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On Calibration of Modern Neural Networks. *ICML 2017*. [arXiv:1706.04599](https://arxiv.org/abs/1706.04599)
18. Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2017). Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization. *ICCV 2017*. [arXiv:1610.02391](https://arxiv.org/abs/1610.02391)
19. van der Maaten, L., & Hinton, G. (2008). Visualizing Data using t-SNE. *JMLR*, 9(Nov), 2579–2605. [http://jmlr.org/papers/v9/vandermaaten08a.html](http://jmlr.org/papers/v9/vandermaaten08a.html)
20. Hand, D. J., & Till, R. J. (2001). A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems. *Machine Learning*, 45(2), 171–186. [https://doi.org/10.1023/A:1010920819831](https://doi.org/10.1023/A:1010920819831)
21. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., ... & Fei-Fei, L. (2015). ImageNet Large Scale Visual Recognition Challenge. *IJCV*, 115(3), 211–252. [arXiv:1409.0575](https://arxiv.org/abs/1409.0575)
22. Lakshminarayanan, B., Pritzel, A., & Blundell, C. (2017). Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. *NeurIPS 2017*. [arXiv:1612.01474](https://arxiv.org/abs/1612.01474)

## Citation

```bibtex
@misc{saugani2026_arch_building,
  title={Fine-Grained Image Classification of World Architecture:
         An EfficientNetV2-S Transfer Learning Approach with Layered Regularization},
  author={Saugani},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/0xgr3y/Arch-Building-Image-Classification}
}
```