	---
	license: apache-2.0
	pipeline_tag: image-classification
	tags:
	- efficientnetv2
	- fgic
	- safetensors
	- transfer-learning
	- gem-pooling
	- focal-loss
	- swa
	- grad-cam
	- calibration
	- temperature-scaling
	- computer-vision
	- tensorflow.js
	library_name: keras
	language: en
	datasets:
	- 0xgr3y/arch-building-dataset
	model-index:
	- name: Architectural Building Image Classifier
	  results:
	  - task:
	      type: image-classification
	      name: Fine-Grained Image Classification
	    dataset:
	      type: imagefolder
	      name: arch-building-dataset
	      split: test
	    metrics:
	    - type: accuracy
	      value: 0.9777
	      name: Test Accuracy
	    - type: accuracy
	      value: 0.9836
	      name: Validation Accuracy (SWA)
	    - type: accuracy
	      value: 0.9799
	      name: TTA Accuracy
	    - type: f1
	      value: 0.9777
	      name: Macro F1
	    - type: precision
	      value: 0.9777
	      name: Macro Precision
	    - type: recall
	      value: 0.9777
	      name: Macro Recall
	    - type: roc_auc
	      value: 0.9985
	      name: Macro ROC-AUC (OvR)
	---

	![Arch-Building-Image-Classification](results/greyscope-labs-architecture-classification-efficientnetv2.jpg)

	# Fine-Grained Image Classification of World Architecture: An EfficientNetV2-S Transfer Learning Approach with Layered Regularization

	### Architectural Building Image Classifier

	Fine-Grained Image Classification (FGIC) of world architectural buildings using CNN transfer learning with EfficientNetV2-S, enhanced with GeM Pooling, Focal Loss, discriminative learning rates (DiscriminativeAdamW), Stochastic Weight Averaging (SWA), Grad-CAM explainability, and calibration analysis.

	<table>
	<tr><td><strong>Architecture</strong></td><td>EfficientNetV2-S + GeM Pooling + Focal Loss + SWA</td></tr>
	<tr><td><strong>Task</strong></td><td>Fine-Grained Image Classification (FGIC)</td></tr>
	<tr><td><strong>Test Accuracy</strong></td><td>97.77%</td></tr>
	<tr><td><strong>Classes</strong></td><td>8 (barn, bridge, castle, mosque, skyscraper, stadium, temple, windmill)</td></tr>
	<tr><td><strong>Input Size</strong></td><td>320 × 320 pixels</td></tr>
	<tr><td><strong>Parameters</strong></td><td>23,350,633</td></tr>
	<tr><td><strong>Framework</strong></td><td>TensorFlow / Keras 3</td></tr>
	<tr><td><strong>License</strong></td><td><a href="https://www.apache.org/licenses/LICENSE-2.0">Apache-2.0</a></td></tr>
	</table>

	## Model Description

	A fine-grained image classification model for world architectural buildings. It is built on EfficientNetV2-S pretrained on ImageNet and enhanced with GeM Pooling (learnable generalized mean pooling), Focal Loss, DiscriminativeAdamW, and Stochastic Weight Averaging (SWA). Evaluation is extended with Grad-CAM explainability visualization, ROC-AUC, ECE calibration analysis, and t-SNE embedding visualization.

	Key architectural contributions:

	- GeM Pooling (Radenovic et al., TPAMI 2018): replaces global average pooling with a learnable power parameter (p=3.0) that emphasizes high-activation features, yielding stronger discriminative representations for FGIC tasks
	- Focal Loss (Lin et al., ICCV 2017, gamma=2.0): down-weights well-classified examples to focus gradient updates on hard-to-classify building pairs
	- DiscriminativeAdamW: extends AdamW with per-variable LR scaling on block6 (×0.1) via an `update_step` override, combined with selective fine-tuning (block6+top_conv unfrozen, BN frozen). Because the learning rate itself is scaled, block6 variables receive a 10× smaller learning rate than head variables (117 total: 105 block6 + 12 head)
	- Mixup + CutMix (Zhang et al., ICLR 2018; Yun et al., ICCV 2019): alternating per batch (50/50) between Mixup (alpha=0.2, linear interpolation) and CutMix (alpha=1.0, spatial patch). Applied only in Phase 1 training to regularize head learning
	- Selective Unfreeze (Yosinski et al., 2014): Phase 2 unfreezes block6+top_conv layers (180/513 EfficientNetV2-S layers) while keeping BatchNormalization frozen to preserve pretrained statistics
	- SWA with BN re-estimation (Izmailov et al., UAI 2018): 10-epoch post-training weight averaging with constant LR 1e-4, followed by 100-step batch normalization statistics re-estimation (3,200 images)
	- Test-Time Augmentation: 6 variations averaged at inference (original, horizontal flip, center crop 85%, center crop 70%, corner crop top-left 80%, corner crop bottom-right 80%). Yields a +0.22 percentage-point accuracy improvement (97.77% → 97.99%)
	- Grad-CAM (Selvaraju et al., ICCV 2017): gradient-weighted class activation mapping for explainability, targeting top_conv (last Conv2D layer of EfficientNetV2-S)
	- ECE Calibration (Guo et al., ICML 2017): Expected Calibration Error with a 15-bin reliability diagram to assess prediction confidence reliability
	- Temperature Scaling (Guo et al., ICML 2017): post-hoc calibration via a scalar temperature parameter T optimized on the validation set (NLL minimization). T=0.54 reduces ECE from 12.04% (underconfident due to Label Smoothing) to 0.53%. Applied at inference via the `softmax(log(probs) / T)` trick
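The temperature-scaling step in the last bullet can be illustrated in plain Python. This is a minimal sketch, not the repository's implementation: `apply_temperature` is a hypothetical helper name and the probability vector is made up; only T=0.54 and the `softmax(log(probs) / T)` form come from the text above.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale softmax probabilities: softmax(log(probs) / T)."""
    eps = 1e-12  # guards against log(0)
    scaled = [math.log(max(p, eps)) / temperature for p in probs]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical underconfident prediction over the 8 classes.
probs = [0.65, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05]
calibrated = apply_temperature(probs, temperature=0.54)
print(f"top-1 before: {max(probs):.3f}, after: {max(calibrated):.3f}")
```

With T below 1 the distribution is sharpened, which is the direction needed to correct underconfidence. T=1 leaves the probabilities unchanged.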

	## Architecture

	```
	Input (320, 320, 3)
	│
	EfficientNetV2-S (ImageNet pretrained, 513 layers, 20.33M params)
	│
	Conv2D(256, 3×3, ReLU, padding=same) → 2,949,376 params
	BatchNormalization → 1,024 params
	MaxPooling2D(2×2) → 0 params
	│
	GeM Pooling(p=3.0, eps=1e-6, learnable) → 1 param
	│
	Dense(256, ReLU) → 65,792 params
	BatchNormalization → 1,024 params
	Dropout(0.4) → 0 params
	│
	Dense(8, Softmax) → 2,056 params
	│
	Output (8 classes)
	```
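The GeM Pooling step in the diagram reduces each channel of the spatial map to one value by taking a generalized mean with exponent p. The sketch below shows that arithmetic on a toy input in plain Python. It is an illustration under assumptions, not the `GeMPooling` layer from `build_model.py`: `gem_pool` is a hypothetical name, and p is held fixed here although the layer learns it.

```python
def gem_pool(feature_map, p=3.0, eps=1e-6):
    """Generalized-mean pool an H x W x C nested list down to C values."""
    height, width = len(feature_map), len(feature_map[0])
    channels = len(feature_map[0][0])
    pooled = []
    for c in range(channels):
        total = 0.0
        for row in feature_map:
            for pixel in row:
                total += max(pixel[c], eps) ** p  # clamp, then raise to p
        pooled.append((total / (height * width)) ** (1.0 / p))
    return pooled

# Toy 2 x 2 map with 1 channel: one strong activation among weak ones.
fmap = [[[0.1], [0.1]], [[0.1], [4.0]]]
average = gem_pool(fmap, p=1.0)[0]  # p=1 reduces to average pooling
gem = gem_pool(fmap, p=3.0)[0]      # p=3 leans toward the peak
print(f"average={average:.3f}, gem(p=3)={gem:.3f}, max=4.000")
```

With p=1 the result equals average pooling, and as p grows it approaches max pooling, which is why p=3.0 emphasizes high-activation features.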

	| Component | Output Shape | Parameters |
	|-----------|--------------|------------|
	| EfficientNetV2-S (Functional) | (None, 10, 10, 1280) | 20,331,360 |
	| Conv2D 256 3×3 | (None, 10, 10, 256) | 2,949,376 |
	| BatchNormalization | (None, 10, 10, 256) | 1,024 |
	| MaxPooling2D 2×2 | (None, 5, 5, 256) | 0 |
	| GeM Pooling p=3.0 | (None, 256) | 1 |
	| Dense 256 ReLU | (None, 256) | 65,792 |
	| BatchNormalization | (None, 256) | 1,024 |
	| Dropout 0.4 | (None, 256) | 0 |
	| Dense 8 Softmax | (None, 8) | 2,056 |
	| Total | | 23,350,633 |
	| Trainable (Phase 1) | | 3,018,249 (11.51 MB) |
	| Trainable (Phase 2) | | 17,810,225 (67.94 MB) |
	| Non-trainable (Phase 1) | | 20,332,384 (77.56 MB) |
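The head parameter counts in the table follow from standard layer formulas. The sketch below recomputes them in plain Python as a sanity check. The helper names are hypothetical, the backbone count is copied from the table rather than recomputed, and the Phase 1 figure assumes that only the BatchNormalization moving statistics of the two head BN layers are non-trainable within the head.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter."""
    return kernel_h * kernel_w * in_channels * filters + filters

def dense_params(in_features, units):
    """Weight matrix plus one bias per unit."""
    return in_features * units + units

def batchnorm_params(channels):
    """gamma, beta, moving mean, moving variance."""
    return 4 * channels

backbone = 20_331_360  # taken from the table, not recomputed here
head = {
    "conv2d": conv2d_params(3, 3, 1280, 256),
    "bn_conv": batchnorm_params(256),
    "gem_p": 1,
    "dense_256": dense_params(256, 256),
    "bn_dense": batchnorm_params(256),
    "dense_8": dense_params(256, 8),
}
total = backbone + sum(head.values())
# Moving mean and variance (2 per channel) are never trained.
bn_non_trainable = 2 * 256 + 2 * 256
trainable_phase1 = sum(head.values()) - bn_non_trainable
print(head)
print(f"total={total:,} trainable_phase1={trainable_phase1:,}")
```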

	## Performance

	### Overall Metrics

	| Metric | Value |
	|--------|-------|
	| Test Accuracy | 97.77% |
	| Validation Accuracy (SWA) | 98.36% |
	| Test-Time Augmentation | 97.99% |
	| Test Loss | 0.4262 |
	| Overfitting Gap (Train − Test) | 2.11% |
	| Macro Avg Precision | 0.9777 |
	| Macro Avg Recall | 0.9777 |
	| Macro Avg F1-Score | 0.9777 |
	| Top-2 Accuracy | 99.26% |
	| Top-3 Accuracy | 99.70% |
	| Macro ROC-AUC (OvR) | 0.9985 |
	| ECE (15 bins) | 0.1204 pre-T-scaling; 0.0053 post-T-scaling (T=0.54) |
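For readers unfamiliar with the ECE row, the metric groups predictions into equal-width confidence bins and averages the gap between accuracy and mean confidence, weighted by bin size. The following plain-Python sketch shows one common way to compute it with 15 bins. The function name and the toy inputs are hypothetical, and the repository's own binning details may differ.

```python
def expected_calibration_error(confidences, correct, n_bins=15):
    """ECE: bin-size-weighted mean of |accuracy - mean confidence|."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / len(confidences) * abs(accuracy - mean_conf)
    return ece

# Hypothetical predictions: all correct, but reported at only 0.85 confidence.
underconfident = expected_calibration_error([0.85] * 10, [1] * 10)
# Hypothetical predictions: confidence 0.8 with 8 of 10 correct.
calibrated = expected_calibration_error([0.8] * 10, [1] * 8 + [0] * 2)
print(f"underconfident ECE={underconfident:.3f}, calibrated ECE={calibrated:.3f}")
```

The first toy case mirrors the pre-scaling situation described here: the model is right more often than its confidence suggests, so ECE is large even though accuracy is high.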

	### Per-Class Results

	| Class | Precision | Recall | F1-Score | AUC (OvR) | Support |
	|-------|-----------|--------|----------|-----------|---------|
	| barn | 0.9760 | 0.9702 | 0.9731 | 0.9950 | 168 |
	| bridge | 0.9591 | 0.9762 | 0.9676 | 0.9983 | 168 |
	| castle | 0.9763 | 0.9821 | 0.9792 | 0.9996 | 168 |
	| mosque | 0.9763 | 0.9821 | 0.9792 | 0.9987 | 168 |
	| skyscraper | 0.9940 | 0.9940 | 0.9940 | 0.9999 | 168 |
	| stadium | 0.9820 | 0.9762 | 0.9791 | 0.9999 | 168 |
	| temple | 0.9816 | 0.9524 | 0.9668 | 0.9976 | 168 |
	| windmill | 0.9765 | 0.9881 | 0.9822 | 0.9987 | 168 |
	| Macro Avg | 0.9777 | 0.9777 | 0.9777 | 0.9985 | 1,344 |

	### Model Selection

	Four candidate models were evaluated on the validation set:

	| Checkpoint | Val Accuracy | Val Loss | Description |
	|------------|--------------|----------|-------------|
	| `head_training.keras` | 92.34% | 1.0109 | Phase 1 checkpoint (backbone frozen) |
	| `fine_tuning.keras` | 96.28% | 0.5655 | Phase 2 checkpoint (block6+top_conv unfrozen) |
	| `fine_tuning_ema.keras` | 93.53% | 0.6007 | Phase 2 EMA (per-step Polyak averaging) |
	| `fine_tuning_swa.keras` | 98.36% | 0.4109 | SWA averaged weights ← SELECTED |

	### Training Progression

	| Phase | Epoch | Train Acc | Val Accuracy | Val Loss |
	|-------|-------|-----------|--------------|----------|
	| Phase 1 (Head Training) | 1 | 56.96% | 92.19% | 1.0079 |
	| Phase 2 (Selective Fine-Tuning) | 1 | 84.96% | 96.21% | 0.5656 |
	| SWA | 1 | 90.83% | 95.76% | 0.5831 |
	| SWA | 2 | 94.07% | 97.62% | 0.5116 |
	| SWA | 3 | 95.36% | 97.69% | 0.4748 |
	| SWA | 4 | 96.56% | 96.95% | 0.4390 |
	| SWA | 5 | 97.18% | 97.47% | 0.4490 |
	| SWA | 6 | 97.76% | 97.84% | 0.4416 |
	| SWA | 7 | 97.91% | 98.14% | 0.4055 |
	| SWA | 8 | 98.19% | 97.32% | 0.4359 |
	| SWA | 9 | 98.14% | 97.02% | 0.4519 |
	| SWA | 10 | 98.59% | 97.54% | 0.4226 |
	| SWA + BN (final) | n/a | n/a | 98.36% | 0.4109 |

	> Phase 1 and Phase 2 each stopped after 1 epoch via `myCallback` (custom early stopping at a target validation accuracy: 85% in Phase 1, 92% in Phase 2). SWA ran 10 epochs with constant LR 1e-4, followed by BN re-estimation (100 steps, 3,200 images). Values shown are training-time metrics from the progress bar; checkpoint evaluation values may differ slightly (see the Model Selection table above).
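The weight-averaging part of SWA amounts to a running mean over the weights collected at the end of each SWA epoch. The plain-Python sketch below shows that update on a toy 3-parameter model. `swa_update` and the snapshot values are hypothetical, and real weights would be tensors rather than lists of floats.

```python
def swa_update(swa_weights, new_weights, n_averaged):
    """Fold one more snapshot into the running mean of weights."""
    return [
        (avg * n_averaged + new) / (n_averaged + 1)
        for avg, new in zip(swa_weights, new_weights)
    ]

# Hypothetical end-of-epoch snapshots of a 3-parameter model.
snapshots = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [2.0, 5.0, 2.0]]
swa = snapshots[0]
for count, weights in enumerate(snapshots[1:], start=1):
    swa = swa_update(swa, weights, n_averaged=count)
print(swa)
```

Because the averaged weights were never used in a forward pass during training, the BatchNormalization moving statistics no longer match them, which is why the BN re-estimation pass follows the averaging.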

	![Training Curves](results/training_curves.png)

	![Confusion Matrix](results/confusion_matrix.png)

	![Per-Class Accuracy](results/per_class_accuracy.png)

	![Confidence Per Class](results/confidence_per_class.png)

	![t-SNE Embedding](results/tsne_embedding.png)

	![Grad-CAM Heatmaps](results/gradcam_heatmaps.png)

	## Training Details

	### Training Strategy

	Two-phase progressive training with SWA post-processing:

	| Phase | Description | Backbone | Optimizer | LR | Max Epochs | Actual Epochs | CutMix+Mixup | FocalLoss LS |
	|-------|-------------|----------|-----------|----|------------|---------------|--------------|--------------|
	| Phase 1: Feature Extraction | Train custom head only | Frozen (all) | AdamW (wd=2e-5) | 0.001 + CosineDecay + Warmup 3ep | 25 | 1 ¹ | Yes (50/50 alternation) | 0.1 |
	| Phase 2: Selective Fine-Tuning | Load head_training → fine-tune | block6 + top_conv unfrozen (BN frozen) | DiscriminativeAdamW (block6=0.1×) | 3e-4 + CosineDecay + Warmup 5ep | 50 | 1 + 10 SWA ² | No | 0.05 |

	> ¹ Phase 1 stops once the `val_accuracy ≥ 85%` threshold is reached (`myCallback`).

	> ² Phase 2 stops once the `val_accuracy ≥ 92%` threshold is reached (`myCallback`), followed by 10 SWA epochs (constant LR 1e-4).
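The reason the Phase 2 optimizer scales the learning rate, rather than the gradient, can be seen from a single AdamW update: Adam normalizes the gradient by its running magnitude, so shrinking the gradient barely changes the step. The sketch below is a scalar illustration in plain Python, not the repository's `DiscriminativeAdamW`. `adamw_step` is a hypothetical helper, and the beta and epsilon values are assumed defaults.

```python
import math

def adamw_step(param, grad, m, v, step, lr, lr_multiplier=1.0,
               weight_decay=2e-5, beta1=0.9, beta2=0.999, eps=1e-7):
    """One AdamW update for a scalar parameter with a per-variable LR multiplier."""
    effective_lr = lr * lr_multiplier
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** step)  # bias correction
    v_hat = v / (1 - beta2 ** step)
    param -= effective_lr * weight_decay * param  # decoupled weight decay
    param -= effective_lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

head, _, _ = adamw_step(1.0, grad=0.5, m=0.0, v=0.0, step=1, lr=3e-4)
block6, _, _ = adamw_step(1.0, grad=0.5, m=0.0, v=0.0, step=1, lr=3e-4,
                          lr_multiplier=0.1)
# Scaling the gradient instead barely changes the step: Adam normalizes it away.
scaled_grad, _, _ = adamw_step(1.0, grad=0.05, m=0.0, v=0.0, step=1, lr=3e-4)
print(f"head step:        {1.0 - head:.3e}")
print(f"block6 step:      {1.0 - block6:.3e}")
print(f"scaled-grad step: {1.0 - scaled_grad:.3e}")
```

In this toy case the `lr_multiplier=0.1` variable moves ten times less than the head variable, while the variable with a ten-times smaller gradient moves almost as far as the head.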

	### Hyperparameters

	| Parameter | Phase 1 | Phase 2 |
	|-----------|---------|---------|
	| Optimizer | AdamW | DiscriminativeAdamW |
	| Learning Rate | 0.001 | 3×10⁻⁴ |
	| LR Schedule | WarmupCosineDecay (warmup=3) | WarmupCosineDecay (warmup=5) |
	| Weight Decay | 2×10⁻⁵ | 2×10⁻⁵ |
	| LR Multiplier (block6) | n/a | 0.1× (scales the LR itself via `update_step`) |
	| LR Multiplier (top_conv+head) | n/a | 1.0× |
	| Loss | FocalLoss (gamma=2.0, LS=0.1) | FocalLoss (gamma=2.0, LS=0.05) |
	| Batch Size | 32 | 32 |
	| Early Stopping Patience | 7 | 12 |
	| myCallback Threshold | val_acc ≥ 0.85 | val_acc ≥ 0.92 |
	| EMA Decay (per-step) | 0.999 | 0.999 |
	| SWA Epochs | n/a | 10 (post-training) |
	| SWA LR | n/a | 1×10⁻⁴ (constant) |
	| BN Re-estimation Steps | n/a | 100 |
	| CutMix (alpha=1.0) | Yes (50% batches) | No |
	| Mixup (alpha=0.2) | Yes (50% batches) | No |
	| Hardware | 2× Tesla T4 (MirroredStrategy) | 2× Tesla T4 (MirroredStrategy) |
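The WarmupCosineDecay schedule named in the table combines a linear ramp with cosine annealing. The plain-Python sketch below shows the shape for the Phase 1 settings. It rests on assumptions that the table does not confirm: the schedule is evaluated once per epoch rather than per step, the warmup reaches the base LR on the last warmup epoch, and the decay ends at zero. `warmup_cosine_lr` is a hypothetical name.

```python
import math

def warmup_cosine_lr(epoch, base_lr, warmup_epochs, total_epochs, min_lr=0.0):
    """Learning rate for a 0-based epoch: linear warmup, then cosine annealing."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return min_lr + (base_lr - min_lr) * cosine

# Phase 1 settings from the table: base LR 0.001, 3 warmup epochs, 25 max epochs.
schedule = [warmup_cosine_lr(e, 0.001, 3, 25) for e in range(25)]
for epoch in (0, 1, 2, 3, 14, 24):
    print(f"epoch {epoch:2d}: lr={schedule[epoch]:.6f}")
```

Since both phases stopped after a single epoch, only the first warmup value of such a schedule would have been used in the reported run.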

	### Regularization Strategy

	| Technique | Implementation | Reference |
	|-----------|----------------|-----------|
	| Transfer Learning | EfficientNetV2-S backbone frozen in Phase 1 | Yosinski et al., NeurIPS 2014 |
	| Selective Fine-Tuning | Unfreeze block6+top_conv only, BN stays frozen | Howard & Ruder, ACL 2018 |
	| Discriminative LR Scaling | block6 LR×0.1 via `update_step` (10× smaller updates for pretrained features) | Howard & Ruder, ACL 2018 |
	| CutMix + Mixup | Alternation per batch (50/50), Phase 1 only | Yun et al., ICCV 2019; Zhang et al., ICLR 2018 |
	| Focal Loss | gamma=2.0, down-weights easy examples | Lin et al., ICCV 2017 |
	| Label Smoothing | 0.1 (Phase 1) → 0.05 (Phase 2) | Szegedy et al., CVPR 2016 |
	| GeM Pooling | p=3.0 learnable, replaces GAP | Radenovic et al., TPAMI 2018 |
	| Dropout | 0.4 after Dense(256)+BN | Srivastava et al., JMLR 2014 |
	| Batch Normalization | After Conv2D and Dense; frozen during fine-tuning | Ioffe & Szegedy, arXiv 2015 |
	| EMA (per-step) | Shadow weights, decay=0.999, Polyak averaging | Tarvainen & Valpola, NeurIPS 2017 |
	| SWA | 10-epoch post-training, constant LR 1e-4 | Izmailov et al., UAI 2018 |
	| Data Augmentation | Rotation ±15°, shift ±10%, shear ±0.1 rad, zoom ±20%, brightness 0.75–1.15, channel shift ±10.0, horizontal flip | Perez & Wang, arXiv 2017 |
	| Random Erasing | p=0.5, area [0.02–0.15], aspect [0.3–3.3], applied pre-normalization | Zhong et al., AAAI 2020 |
	| Test-Time Augmentation | 6 augmentation variants, averaged | Shanmugam et al., ICML 2020 |
	| WarmupCosineDecay | Linear warmup + cosine annealing | Loshchilov & Hutter, ICLR 2017 (SGDR) |
	| Early Stopping | Patience 7 (Phase 1) / 12 (Phase 2) | Prechelt, Neural Networks 1998 |
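To make the Focal Loss and Label Smoothing rows concrete, the sketch below computes one common multi-class formulation for a single example in plain Python: the smoothed target weights each class term, and the factor (1 - p)^gamma shrinks the contribution of confidently predicted classes. It is an assumption that the repository's `FocalLoss` combines the two in this way. `focal_loss` and the probability vectors are hypothetical.

```python
import math

def focal_loss(probs, true_index, gamma=2.0, label_smoothing=0.1):
    """Focal loss for one example from softmax probabilities and a class index."""
    n_classes = len(probs)
    loss = 0.0
    for k, p in enumerate(probs):
        one_hot = 1.0 if k == true_index else 0.0
        target = one_hot * (1.0 - label_smoothing) + label_smoothing / n_classes
        p = min(max(p, 1e-7), 1.0 - 1e-7)  # avoid log(0)
        loss += -target * (1.0 - p) ** gamma * math.log(p)
    return loss

easy = [0.93] + [0.01] * 7  # confident, correct prediction
hard = [0.30] + [0.10] * 7  # uncertain prediction for the same class
for name, probs in (("easy", easy), ("hard", hard)):
    ce = focal_loss(probs, 0, gamma=0.0, label_smoothing=0.0)  # plain cross-entropy
    fl = focal_loss(probs, 0, gamma=2.0, label_smoothing=0.0)
    print(f"{name}: cross-entropy={ce:.4f}, focal={fl:.4f}, ratio={fl / ce:.4f}")
```

With smoothing disabled, the ratio of focal loss to cross-entropy is (1 - p)^2 for the true class, so the easy example is down-weighted far more strongly than the hard one.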

	### Dataset

	See the dataset curation page for [World Architectural Buildings Dataset for Multi‑Class Image Classification](https://huggingface.co/datasets/0xgr3y/arch-building-dataset) — 13,440 images (8 classes × 1,680, balanced) sourced from Pexels with perceptual (pHash) and exact (SHA256) deduplication.

	| Split | Images | Percentage |
	|-------|--------|------------|
	| Train | 10,752 | 80% |
	| Validation | 1,344 | 10% |
	| Test | 1,344 | 10% |

	### Data Preprocessing

	- Normalization: `preprocess_input` from `tf.keras.applications.efficientnet_v2` (ImageNet distribution)
	- Input resolution: 320×320 (higher than ImageNet default 224×224 to capture fine-grained architectural details — textures, ornaments, facade patterns)
	- Augmentation: applied to the training set only; validation and test sets use clean preprocessing
	- Split method: `splitfolders.ratio` from `dataset/`, seed=42

	## Files

	| Category | Files |
	|----------|-------|
	| Model (best) | `fine_tuning_swa.keras` (227 MB) · `.weights.h5` (158 MB) · `.safetensors` (157 MB) |
	| Code | `build_model.py` (21 KB): architecture + CLI inference |
	| Config | `config.json` · `label_mapping.json` · `preprocessor_config.json` |
	| Evaluation | `calibration_data.json` · `model_benchmark.json` · `confusion_pairs.json` · `class_confidence_stats.json` · `temperature_config.json` |
	| Deployment | `saved_model/` (183 MB) · `tflite/` (88 MB) · `tfjs_model/` (90 MB, 23 shards) |
	| Results | `results/`: 12 PNG (augmentation, reliability-diagram, training curves, confusion matrix, ROC, t-SNE, Grad-CAM, etc.) |
	| Archive | `models_keras/`: 3 checkpoints (head_training, fine_tuning, fine_tuning_ema) |

	## Usage

	### Gradio Space

	Try the live building classifier: [Architecture Building Image Classifier Space](https://huggingface.co/spaces/0xgr3y/arch-building-classifier)

	### Python — build_model.py (recommended)

	`build_model.py` is a standalone module that provides:
	- Custom class definitions (`GeMPooling`, `FocalLoss`, `DiscriminativeAdamW`) with `@register_keras_serializable` — importing the module registers all custom classes globally, so `load_model()` works without explicit `custom_objects`.
	- `ArchBuildingClassifier` — high-level wrapper class with `build()`, `from_weights()`, `from_keras()`, `predict()`, `predict_batch()` methods.
	- `CUSTOM_OBJECTS` dict — fallback for explicit `custom_objects=` in `load_model()`.
	- `build_model()` — backward-compatible function that returns a raw `tf.keras.Model`.

	Place `build_model.py` in the same directory as your script or add its location to `PYTHONPATH`.

	> Note: Filenames below use `fine_tuning_swa` as an example. The actual best checkpoint filename depends on training results — check the repo for the actual `.keras`, `.weights.h5`, and `.safetensors` filenames.

	```python
	from build_model import ArchBuildingClassifier
	from huggingface_hub import hf_hub_download
	from PIL import Image

	# Download weights (clean format)
	weights_path = hf_hub_download("0xgr3y/Arch-Building-Image-Classification", "fine_tuning_swa.weights.h5")

	# Load model: architecture + weights
	clf = ArchBuildingClassifier.from_weights(weights_path)

	# Inference: returns the top label, its confidence, and the top-3 (class, probability) pairs
	label, confidence, top3 = clf.predict(Image.open("skyscraper_00000.jpg"))
	print(f"Predicted: {label} ({confidence:.1%})")
	for cls, prob in top3:
	    print(f" {cls}: {prob:.1%}")
	```

	### Python — TF-Lite (fastest inference)

	```python
	import json

	import numpy as np
	import tensorflow as tf
	from huggingface_hub import hf_hub_download
	from PIL import Image

	try:
	    from tensorflow.keras.applications.efficientnet_v2 import preprocess_input
	except ImportError:
	    from tensorflow.keras.applications.efficientnet import preprocess_input

	# Download the TFLite model and the class labels
	model_path = hf_hub_download("0xgr3y/Arch-Building-Image-Classification", "tflite/model.tflite")
	labels_path = hf_hub_download("0xgr3y/Arch-Building-Image-Classification", "label_mapping.json")

	with open(labels_path) as f:
	    LABELS = json.load(f)["labels"]

	interpreter = tf.lite.Interpreter(model_path=model_path)
	interpreter.allocate_tensors()
	input_details = interpreter.get_input_details()
	output_details = interpreter.get_output_details()

	# Resize to the model input size and add a batch dimension
	img = Image.open("skyscraper_00000.jpg").convert("RGB").resize((320, 320))
	arr = np.expand_dims(preprocess_input(np.array(img, dtype=np.float32)), axis=0)

	interpreter.set_tensor(input_details[0]["index"], arr)
	interpreter.invoke()
	preds = interpreter.get_tensor(output_details[0]["index"])[0]

	# Print the three most probable classes
	top3_idx = np.argsort(preds)[::-1][:3]
	for i in top3_idx:
	    print(f" {LABELS[i]}: {preds[i]*100:.1f}%")
	```

	### Python — Keras (convenient)

	```python
	import json

	import numpy as np
	import tensorflow as tf
	from huggingface_hub import hf_hub_download
	from PIL import Image

	import build_model  # registers custom classes via @register_keras_serializable

	try:
	    from tensorflow.keras.applications.efficientnet_v2 import preprocess_input
	except ImportError:
	    from tensorflow.keras.applications.efficientnet import preprocess_input

	model_path = hf_hub_download("0xgr3y/Arch-Building-Image-Classification", "fine_tuning_swa.keras")
	labels_path = hf_hub_download("0xgr3y/Arch-Building-Image-Classification", "label_mapping.json")

	model = tf.keras.models.load_model(model_path, compile=False)  # custom_objects not needed

	with open(labels_path) as f:
	    LABELS = json.load(f)["labels"]

	# Resize to the model input size and add a batch dimension
	img = Image.open("skyscraper_00000.jpg").convert("RGB").resize((320, 320))
	arr = np.expand_dims(preprocess_input(np.array(img, dtype=np.float32)), axis=0)
	preds = model.predict(arr, verbose=0)[0]
	print(f"Predicted: {LABELS[np.argmax(preds)]} ({np.max(preds)*100:.1f}%)")
	```

	### Python — SavedModel (TF Serving)

	```python
	import numpy as np
	import tensorflow as tf
	from huggingface_hub import snapshot_download
	from PIL import Image

	try:
	    from tensorflow.keras.applications.efficientnet_v2 import preprocess_input
	except ImportError:
	    from tensorflow.keras.applications.efficientnet import preprocess_input

	snapshot_download("0xgr3y/Arch-Building-Image-Classification", allow_patterns=["saved_model/*"], local_dir=".")

	# Load SavedModel (created via model.export(): inference-only, no custom_objects needed)
	loaded = tf.saved_model.load("saved_model")

	img = Image.open("skyscraper_00000.jpg").convert("RGB").resize((320, 320))
	arr = tf.constant(np.expand_dims(preprocess_input(np.array(img, dtype=np.float32)), axis=0))
	# Artifacts written by Keras model.export() expose the forward pass as `serve`
	preds = loaded.serve(arr).numpy()[0]

	# Print the three most probable class indices (see label_mapping.json for names)
	top3_idx = np.argsort(preds)[::-1][:3]
	for i in top3_idx:
	    print(f" Class {i}: {preds[i]*100:.1f}%")
	```

	### Python — safetensors (HF standard, cross-framework)

	> Note: safetensors stores raw weight tensors without architecture metadata. To load, reconstruct the architecture with `build_model.py` first, then map tensors manually. For most use cases, `.weights.h5` (via `ArchBuildingClassifier.from_weights()`) is simpler and equally clean.

	```python
	from safetensors.numpy import load_file
	from build_model import ArchBuildingClassifier
	from PIL import Image

	# Reconstruct architecture
	clf = ArchBuildingClassifier.build()

	# Load safetensors tensors
	tensors = load_file("fine_tuning_swa.safetensors")

	# Map tensors to model weights (iterate layers, not .variables, for Keras 3 compatibility).
	# This assumes the tensor keys follow the sanitized weight-name scheme below;
	# inspect tensors.keys() if no weights are matched.
	for layer in clf.keras_model.layers:
	    for w in layer.weights:
	        name = w.name.replace(':', '_').replace('/', '_')
	        if name in tensors:
	            w.assign(tensors[name])

	# Inference
	label, confidence, top3 = clf.predict(Image.open("skyscraper_00000.jpg"))
	```

	## Inference Verification

	Keras vs TFLite consistency was verified on 8 random test samples (1 per class):

	| Metric | Result |
	|--------|--------|
	| Keras correct | 7/8 (88%) |
	| TFLite correct | 7/8 (88%) |
	| Keras vs TFLite match | 8/8 (100%), identical predictions |
	| Keras inference speed | 358.0 ms |
	| TFLite inference speed | 170.0 ms |

	> The single misclassification (castle→barn, 65% confidence) is plausible for a sample this small and does not contradict the 97.77% test accuracy. The 8/8 match indicates that the TFLite conversion preserved the predicted labels on these samples.

	![TFLite Inference](results/inference_tflite.png)

	## Security Notice (PAIT-KERAS-301)

	The `.keras` files in this repository are flagged "Unsafe" by [Protect AI Guardian](https://protectai.com/insights/models/0xgr3y/Arch-Building-Image-Classification) (threat: PAIT-KERAS-301). This is a structural false positive, not a malware detection:

	- What the scanner checks: String-matching of `class_name` fields in the Keras v3 config against a whitelist of built-in Keras layers.
	- Why flagged: The model contains a custom layer (`GeMPooling`) — a non-standard class name triggers the flag.
	- What it does NOT check: The scanner does not analyze the Python code of the custom class, does not look for `eval()`/`exec()`/`os.system()`, and does not detect actual malware.
	- Other scanners: VirusTotal, JFrog, HF Picklescan — all clean. Only Protect AI flags this file.

	The custom classes are safe and open source:
	- `GeMPooling`: Generalized Mean Pooling (Radenovic et al., TPAMI 2018). Pure tensor ops: `tf.pow`, `tf.reduce_mean`, `tf.maximum`.
	- `FocalLoss`: Focal Loss (Lin et al., ICCV 2017). Pure tensor ops.
	- `DiscriminativeAdamW`: AdamW subclass with per-variable learning-rate scaling. No file I/O, no network calls, no arbitrary code.

	Full source code for all custom classes is available in [`build_model.py`](https://huggingface.co/0xgr3y/Arch-Building-Image-Classification/blob/main/build_model.py) and the training notebook for public audit.

	## Multi-Format Deployment Guide

	The model is provided in multiple formats to suit different deployment scenarios. Formats marked ✓ are not flagged by Protect AI (no custom class serialization).

	| Format | File | Size | Protect AI | Inference Speed | Best For |
	|--------|------|------|------------|-----------------|----------|
	| TF-Lite ✓ | `tflite/model.tflite` | ~88 MB | ✓ Safe | 170.0 ms (fastest) | Mobile, edge, embedded, HF Space |
	| SavedModel ✓ | `saved_model/` | ~183 MB | ✓ Safe | n/a | TensorFlow Serving, cloud backend |
	| TFJS ✓ | `tfjs_model/` | ~90 MB | ✓ Safe | n/a | Browser, Node.js (no backend) |
	| Weights H5 ✓ | `fine_tuning_swa.weights.h5` | ~158 MB | ✓ Safe | n/a | Programmatic load via `build_model.py` |
	| safetensors ✓ | `fine_tuning_swa.safetensors` | ~157 MB | ✓ Safe | n/a | HF standard, cross-framework |
	| Build Script ✓ | `build_model.py` | ~21 KB | ✓ Safe | n/a | Architecture reconstruction + `load_weights()` |
	| Keras ℹ | `fine_tuning_swa.keras` | ~227 MB | ℹ Flagged | 358.0 ms | Developer reference, fine-tuning |

	### Load Examples

	See Usage section above for complete load + inference examples for each format.

	## Intended Use

	- Building-type classification from photographs (the 8 classes are building types, not architectural styles)
	- Educational tool for architecture recognition
	- Research baseline for fine-grained image classification (FGIC)
	- Transfer learning experiments on architectural imagery

	## Limitations

	- Trained on Pexels stock photography — performance may differ on user-generated or field photographs
	- Limited to 8 architectural classes (barn, bridge, castle, mosque, skyscraper, stadium, temple, windmill)
	- Confusion pair analysis found 0 significant pairs (threshold >5%), so all 8 classes are well distinguished by the model; see `confusion_pairs.json` for details
	- Barn and windmill share 3 cross-class duplicates (0.02% of dataset) — left as-is due to negligible impact
	- Inference confidence can be low on atypical examples

	![Misclassification Examples](results/misclassification_examples.png)

	## Ethical Considerations

	- All training images sourced from [Pexels.com](https://www.pexels.com) under the Pexels License (free for commercial use, no attribution required). No copyrighted or personally identifiable images were used.
	- The dataset contains only photographs of buildings and structures — no people, faces, or private property are the subject of classification.
	- The model reflects the visual distribution of Pexels stock photography, which may over-represent Western and iconic architectural styles and under-represent vernacular or regional architecture.
	- The 8 class categories are broad and do not capture the full diversity of world architecture. Results should not be used to make definitive claims about architectural categorization.
	- URL pattern filtering during dataset collection explicitly excluded AI-generated art, illustrations, and non-photographic content to ensure authenticity.

	## Links

	- Gradio Space (Live): [arch-building-classifier Space](https://huggingface.co/spaces/0xgr3y/arch-building-classifier)
	- Dataset Studio: [0xgr3y/arch-building-dataset](https://huggingface.co/datasets/0xgr3y/arch-building-dataset)
	- GitHub Repository: [arcxteam/building-architectural-image-classifier](https://github.com/arcxteam/building-architectural-image-classifier)

	## References

	1. Tan, M., & Le, Q. V. (2021). EfficientNetV2: Smaller Models and Faster Training. ICML 2021. [arXiv:2104.00298](https://arxiv.org/abs/2104.00298)
	2. Radenovic, F., Tolias, G., & Chum, O. (2018). Fine-Tuning CNN Image Retrieval with No Human Annotation. IEEE TPAMI. [arXiv:1711.02512](https://arxiv.org/abs/1711.02512)
	3. Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollar, P. (2017). Focal Loss for Dense Object Detection. ICCV 2017. [arXiv:1708.02002](https://arxiv.org/abs/1708.02002)
	4. Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., & Wilson, A. G. (2018). Averaging Weights Leads to Wider Optima and Better Generalization. UAI 2018. [arXiv:1803.05407](https://arxiv.org/abs/1803.05407)
	5. Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D. (2018). mixup: Beyond Empirical Risk Minimization. ICLR 2018. [arXiv:1710.09412](https://arxiv.org/abs/1710.09412)
	6. Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., & Yoo, Y. (2019). CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features. ICCV 2019. [arXiv:1905.04899](https://arxiv.org/abs/1905.04899)
	7. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the Inception Architecture for Computer Vision. CVPR 2016. [arXiv:1512.00567](https://arxiv.org/abs/1512.00567)
	8. Yosinski, J., Clune, J., Bengio, Y., & Lipson, H. (2014). How Transferable Are Features in Deep Neural Networks? NeurIPS 2014. [arXiv:1411.1792](https://arxiv.org/abs/1411.1792)
	9. Howard, J., & Ruder, S. (2018). Universal Language Model Fine-tuning for Text Classification. ACL 2018. [arXiv:1801.06146](https://arxiv.org/abs/1801.06146)
	10. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR, 15(56), 1929–1958. [http://jmlr.org/papers/v15/srivastava14a.html](http://jmlr.org/papers/v15/srivastava14a.html)
	11. Ioffe, S., & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv preprint. [arXiv:1502.03167](https://arxiv.org/abs/1502.03167)
	12. Tarvainen, A., & Valpola, H. (2017). Mean Teachers are Better Role Models: Weight-averaged Consistency Targets Improve Semi-supervised Deep Learning Results. NeurIPS 2017. [arXiv:1703.01780](https://arxiv.org/abs/1703.01780)
	13. Perez, L., & Wang, J. (2017). The Effectiveness of Data Augmentation in Image Classification using Deep Learning. arXiv preprint. [arXiv:1712.04621](https://arxiv.org/abs/1712.04621)
	14. Shanmugam, D., Blalock, D., Balakrishnan, G., Guttag, J., & Sarma, A. (2020). Towards Principled Test-Time Augmentation. ICML 2020. [PDF](https://dmshanmugam.github.io/pdfs/icml_2020_testaug.pdf)
	15. Loshchilov, I., & Hutter, F. (2017). SGDR: Stochastic Gradient Descent with Warm Restarts. ICLR 2017. [arXiv:1608.03983](https://arxiv.org/abs/1608.03983)
	16. Prechelt, L. (1998). Automatic Early Stopping Using Cross Validation: Quantifying the Criteria. Neural Networks, 11(4), 761–767. [https://doi.org/10.1016/S0893-6080(98)00010-0](https://doi.org/10.1016/S0893-6080(98)00010-0)
	17. Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On Calibration of Modern Neural Networks. ICML 2017. [arXiv:1706.04599](https://arxiv.org/abs/1706.04599)
	18. Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2017). Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization. ICCV 2017. [arXiv:1610.02391](https://arxiv.org/abs/1610.02391)
	19. van der Maaten, L., & Hinton, G. (2008). Visualizing Data using t-SNE. JMLR, 9(Nov), 2579–2605. [http://jmlr.org/papers/v9/vandermaaten08a.html](http://jmlr.org/papers/v9/vandermaaten08a.html)
	20. Hand, D. J., & Till, R. J. (2001). A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems. Machine Learning, 45(2), 171–186. [https://doi.org/10.1023/A:1010920819831](https://doi.org/10.1023/A:1010920819831)
	21. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., ... & Fei-Fei, L. (2015). ImageNet Large Scale Visual Recognition Challenge. IJCV, 115(3), 211–252. [arXiv:1409.0575](https://arxiv.org/abs/1409.0575)
	22. Lakshminarayanan, B., Pritzel, A., & Blundell, C. (2017). Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. NeurIPS 2017. [arXiv:1612.01474](https://arxiv.org/abs/1612.01474)

	## Citation

	```bibtex
	@misc{saugani2026_arch_building,
	title={Fine-Grained Image Classification of World Architecture:
	An EfficientNetV2-S Transfer Learning Approach with Layered Regularization},
	author={Saugani},
	year={2026},
	publisher={Hugging Face},
	url={https://huggingface.co/0xgr3y/Arch-Building-Image-Classification}
	}
	```