0xgr3y committed on
Commit
1fedd85
·
verified ·
1 Parent(s): a270794

Upload V16 README.md

  1. README.md +143 -85
README.md CHANGED
@@ -8,13 +8,15 @@ tags:
8
  - densenet121
9
  - architecture
10
  - building
11
- - cnn
12
  - fgvc
13
  - transfer-learning
14
  - gem-pooling
15
  - focal-loss
16
- - swa
17
  - discriminative-learning-rate
 
 
 
 
18
  library_name: keras
19
  language: en
20
  datasets:
@@ -34,41 +36,45 @@ model-index:
34
  split: test
35
  metrics:
36
  - type: accuracy
37
- value: 96.23
38
  name: Test Accuracy
39
  - type: accuracy
40
- value: 95.93
41
  name: Validation Accuracy (SWA)
42
  - type: accuracy
43
- value: 96.33
44
  name: TTA Accuracy
45
  ---
46
 
47
- # Architectural Building Image Classifier
48
 
49
- Fine-Grained Visual Categorization (FGVC) of world architectural buildings using CNN transfer learning with DenseNet121, enhanced with GeM Pooling, Focal Loss, and Stochastic Weight Averaging (SWA).
 
 
50
 
51
  <table>
52
- <tr><td><strong>Architecture</strong></td><td>DenseNet121 + GeM Pooling (p=3.0) + SWA</td></tr>
53
- <tr><td><strong>Task</strong></td><td>Fine-Grained Visual Categorization (FGVC)</td></tr>
54
- <tr><td><strong>Test Accuracy</strong></td><td>96.23% (970/1,008)</td></tr>
55
- <tr><td><strong>Classes</strong></td><td>6 (Bridge, Castle, Mosque, Skyscraper, Stadium, Temple)</td></tr>
56
  <tr><td><strong>Input Size</strong></td><td>320 × 320 pixels</td></tr>
57
- <tr><td><strong>Parameters</strong></td><td>9,466,439 (9.27M trainable in Phase 2)</td></tr>
58
  <tr><td><strong>Framework</strong></td><td>TensorFlow / Keras 3</td></tr>
59
  <tr><td><strong>License</strong></td><td><a href="https://www.apache.org/licenses/LICENSE-2.0">Apache-2.0</a></td></tr>
60
  </table>
61
 
62
  ## Model Description
63
 
64
- A fine-grained image classification model for world architectural buildings. Built on DenseNet121 pretrained on ImageNet, enhanced with GeM Pooling (learnable generalized mean pooling), Focal Loss, and Stochastic Weight Averaging (SWA).
65
 
66
  **Key architectural contributions:**
67
 
68
  - **GeM Pooling** (Radenovic et al., CVPR 2018): replaces global average pooling with a learnable power parameter (p=3.0) that emphasizes high-activation features, yielding stronger discriminative representations for FGVC tasks
69
  - **Focal Loss** (Lin et al., ICCV 2017, gamma=2.0): down-weights well-classified examples to focus gradient updates on hard-to-classify building pairs
70
  - **DiscriminativeAdamW**: extends AdamW with per-layer learning rate multipliers, where conv4_block receives LR × 0.1 (pretrained features require smaller updates), while conv5_block and the custom head receive LR × 1.0
71
- - **SWA with BN re-estimation** (Izmailov et al., UAI 2018): 5-epoch post-training weight averaging with constant LR 1e-4, followed by 100-step batch normalization statistics re-estimation, improving validation accuracy from 93.35% to 95.93%
 
 
72
 
73
  ## Architecture
74
 
@@ -87,9 +93,9 @@ Input (320, 320, 3)
87
  BatchNormalization → 1,024 params
88
  Dropout(0.4) → 0 params
89
  │
90
- Dense(6, Softmax) → 1,542 params
91
  │
92
- Output (6 classes)
93
  ```
94
 
95
  | Component | Output Shape | Parameters |
@@ -102,10 +108,10 @@ Output (6 classes)
102
  | Dense 256 ReLU | (None, 256) | 65,792 |
103
  | BatchNormalization | (None, 256) | 1,024 |
104
  | Dropout 0.4 | (None, 256) | 0 |
105
- | Dense 6 Softmax | (None, 6) | 1,542 |
106
- | **Total** | | **9,466,439** |
107
- | Trainable (Phase 1) | | **2,427,911** (9.26 MB) |
108
- | Trainable (Phase 2) | | **7,883,783** (30.07 MB) |
109
  | Non-trainable (Phase 1) | | **7,038,528** (26.85 MB) |
110
 
111
  ## Performance
@@ -114,29 +120,32 @@ Output (6 classes)
114
 
115
  | Metric | Value |
116
  |--------|-------|
117
- | Test Accuracy | 96.23% (970/1,008) |
118
- | Validation Accuracy (SWA) | 95.93% |
119
- | Test-Time Augmentation | 96.33% (+0.10%) |
120
- | Test Loss | 0.3974 |
121
- | Overfitting Gap (Train − Test) | 3.22% |
122
- | Macro Avg Precision | 96.29% |
123
- | Macro Avg Recall | 96.23% |
124
- | Macro Avg F1-Score | 96.21% |
 
 
 
 
125
 
126
  ### Per-Class Results
127
 
128
  | Class | Precision | Recall | F1-Score | Support |
129
  |-------|-----------|--------|----------|---------|
130
- | Bridge | 94.19% | 96.43% | 95.29% | 168 |
131
- | Castle | 97.63% | 98.21% | 97.92% | 168 |
132
- | Mosque | 93.75% | 98.21% | 95.93% | 168 |
133
- | Skyscraper | 96.53% | 99.40% | 97.95% | 168 |
134
- | Stadium | 98.06% | 90.48% | 94.12% | 168 |
135
- | Temple | 97.55% | 94.64% | 96.07% | 168 |
136
- | **Macro Avg** | **96.29%** | **96.23%** | **96.21%** | **1,008** |
137
-
138
- **Highest performing classes:** Skyscraper (recall=99.40%), Castle (F1=97.92%)
139
- **Most challenging classes:** Stadium (recall=90.48%), Bridge (precision=94.19%)
140
 
141
  ### Model Selection
142
 
@@ -144,27 +153,42 @@ Four candidate models were evaluated on the validation set:
144
 
145
  | Checkpoint | Val Accuracy | Val Loss | Description |
146
  |------------|-------------|----------|-------------|
147
- | `best_phase1.keras` | 86.71% | 1.0042 | Phase 1 checkpoint (backbone frozen) |
148
- | `best_phase2.keras` | 93.35% | 0.4877 | Phase 2 checkpoint (conv4+conv5 unfrozen) |
149
- | `best_phase2_ema.keras` | 87.20% | 0.6621 | Phase 2 EMA shadow weights |
150
- | **`best_phase2_swa.keras`** | **95.93%** | **0.3981** | **SWA averaged weights ← SELECTED** |
151
 
152
  ### SWA Progression
153
 
154
  | SWA Epoch | Val Accuracy | Val Loss |
155
  |-----------|-------------|----------|
156
- | 1 | 92.56% | 0.5182 |
157
- | 2 | 94.74% | 0.4691 |
158
- | 3 | 93.75% | 0.4694 |
159
- | 4 | 95.93% | 0.4297 |
160
- | 5 | 95.83% | 0.4130 |
161
- | **SWA Average (final)** | **95.93%** | **0.3981** |
 
 
 
 
 
 
 
 
 
 
 
162
 
163
- ![Training Curves](training_curves.png)
164
 
165
- ![Confusion Matrix](confusion_matrix.png)
166
 
167
- ![Per-Class Accuracy](per_class_accuracy.png)
 
 
 
 
168
 
169
  ## Training Details
170
 
@@ -174,12 +198,12 @@ Two-phase progressive training with SWA post-processing:
174
 
175
  | Phase | Description | Backbone | Optimizer | LR | Max Epochs | Actual Epochs | CutMix+Mixup | FocalLoss LS |
176
  |-------|-------------|----------|-----------|-----|-----------|---------------|---------------|-------------|
177
- | **Phase 1**: Feature Extraction | Train custom head only | Frozen (all) | AdamW (wd=2e-5) | 0.001 + CosineDecay + Warmup 3ep | 25 | 1 ¹ | Yes (50/50 alternation) | 0.1 |
178
- | **Phase 2**: Selective Fine-Tuning | Load best_phase1 → fine-tune | conv4_block + conv5_block unfrozen (BN frozen) | DiscriminativeAdamW (conv4=0.1×) | 3e-4 + CosineDecay + Warmup 5ep | 50 | 6 + 5 SWA ² | No | 0.05 |
179
 
180
- > ¹ Phase 1 stopped at epoch 1 because `val_accuracy = 86.71% ≥ 85%` threshold (myCallback). This demonstrates the effectiveness of ImageNet transfer learning: a single epoch of head training exceeds the target.
181
 
182
- > ² Phase 2 stopped at epoch 6 with `val_accuracy = 93.35%` (≥ 92% threshold), followed by 5 SWA epochs (constant LR 1e-4) improving to 95.93%.
183
 
184
  ### Hyperparameters
185
 
@@ -196,7 +220,7 @@ Two-phase progressive training with SWA post-processing:
196
  | Early Stopping Patience | 7 | 12 |
197
  | myCallback Threshold | val_acc ≥ 0.85 | val_acc ≥ 0.92 |
198
  | EMA Decay | 0.999 | 0.999 |
199
- | SWA Epochs | n/a | 5 (post-training) |
200
  | SWA LR | n/a | 1×10⁻⁴ (constant) |
201
  | BN Re-estimation Steps | n/a | 100 |
202
  | CutMix (alpha=1.0) | Yes (50% batches) | No |
@@ -217,7 +241,7 @@ Two-phase progressive training with SWA post-processing:
217
  | Dropout | 0.4 after Dense(256)+BN | Srivastava et al., JMLR 2014 |
218
  | Batch Normalization | After Conv2D and Dense; frozen during fine-tuning | Ioffe & Szegedy, arXiv 2015 |
219
  | EMA | Shadow weights, decay=0.999 | Tarvainen & Valpola, NeurIPS 2017 |
220
- | SWA | 5-epoch post-training, constant LR 1e-4 | Izmailov et al., UAI 2018 |
221
  | Data Augmentation | Rotation ±15°, shift ±10%, zoom ±20%, brightness 0.75–1.15, horizontal flip | Perez & Wang, arXiv 2017 |
222
  | Test-Time Augmentation | 6 augmentation variants, averaged | Shanmugam et al., ICML 2020 |
223
  | WarmupCosineDecay | Linear warmup + cosine annealing | Loshchilov & Hutter, ICLR 2017 (SGDR) |
@@ -225,13 +249,13 @@ Two-phase progressive training with SWA post-processing:
225
 
226
  ### Dataset
227
 
228
- [0xgr3y/arch-building-dataset](https://huggingface.co/datasets/0xgr3y/arch-building-dataset): 10,080 images (6 classes × 1,680, balanced) sourced from Pexels with perceptual (pHash) and exact (SHA256) deduplication.
229
 
230
  | Split | Images | Percentage |
231
  |-------|--------|------------|
232
- | Train | 8,064 | 80% |
233
- | Validation | 1,008 | 10% |
234
- | Test | 1,008 | 10% |
235
 
236
  ### Data Preprocessing
237
 
@@ -244,21 +268,37 @@ Two-phase progressive training with SWA post-processing:
244
 
245
  | File | Description |
246
  |------|-------------|
247
- | `best_phase2_swa.keras` | Best model: SWA averaged weights (val_acc=95.93%) |
248
- | `best_phase2.keras` | Phase 2 checkpoint (val_acc=93.35%) |
249
  | `saved_model/` | TensorFlow SavedModel format (portable, for TF Serving) |
250
  | `tflite/model.tflite` | TensorFlow Lite model (mobile/embedded) |
251
  | `tflite/label.txt` | Class label names for TF-Lite |
252
  | `tfjs_model/` | TensorFlow.js model (browser, 10 weight shards + model.json) |
253
  | `config.json` | Model configuration and evaluation metrics |
254
- | `label_mapping.json` | Class name ↔ ID mapping (integer values) |
255
  | `preprocessor_config.json` | Input preprocessing specification (320×320) |
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
256
 
257
  ## Usage
258
 
259
- ### Gradio Demo
260
 
261
- Try the live demo: [arch-building-classifier Space](https://huggingface.co/spaces/0xgr3y/arch-building-classifier)
262
 
263
  ### Python: Keras
264
 
@@ -282,7 +322,7 @@ class GeMPooling(Layer):
282
  initializer=tf.keras.initializers.Constant(self.p_init), trainable=True)
283
  super().build(input_shape)
284
  def call(self, x):
285
- x = tf.clip_by_value(x, self.eps, tf.reduce_max(x))
286
  x = tf.pow(x, self.p)
287
  x = tf.reduce_mean(x, axis=[1, 2], keepdims=False)
288
  return tf.pow(x, 1.0 / self.p)
@@ -347,7 +387,12 @@ class DiscriminativeAdamW(tf.keras.optimizers.AdamW):
347
 
348
  # =====================---Load Model---==========================
349
 
350
- LABELS = ["bridge", "castle", "mosque", "skyscraper", "stadium", "temple"]
 
 
 
 
 
351
  custom_objects = {
352
  "GeMPooling": GeMPooling,
353
  "FocalLoss": FocalLoss,
@@ -359,7 +404,7 @@ model = tf.keras.models.load_model(model_path, custom_objects=custom_objects, co
359
 
360
  # =======================---Inference---==========================
361
 
362
- img = Image.open("building.jpg").convert("RGB").resize((320, 320))
363
  arr = np.expand_dims(preprocess_input(np.array(img, dtype=np.float32)), axis=0)
364
  preds = model.predict(arr, verbose=0)[0]
365
  print(f"Predicted: {LABELS[np.argmax(preds)]} ({np.max(preds)*100:.1f}%)")
@@ -377,9 +422,13 @@ hf_hub_download("0xgr3y/Arch-Building-Image-Classification", "tflite/label.txt",
377
  ```python
378
  import numpy as np
379
  import tensorflow as tf
 
380
  from PIL import Image
 
381
 
382
- LABELS = ["bridge", "castle", "mosque", "skyscraper", "stadium", "temple"]
 
 
383
 
384
  interpreter = tf.lite.Interpreter(model_path="tflite/model.tflite")
385
  interpreter.allocate_tensors()
@@ -387,7 +436,7 @@ interpreter.allocate_tensors()
387
  input_details = interpreter.get_input_details()
388
  output_details = interpreter.get_output_details()
389
 
390
- img = Image.open("building.jpg").convert("RGB").resize((320, 320))
391
  arr = np.expand_dims(np.array(img, dtype=np.float32), axis=0)
392
  arr = tf.keras.applications.densenet.preprocess_input(arr)
393
 
@@ -413,15 +462,20 @@ Requires the custom layer definitions from the Keras section above.
413
  ```python
414
  import tensorflow as tf
415
  import numpy as np
 
416
  from PIL import Image
 
 
 
 
 
417
 
418
- LABELS = ["bridge", "castle", "mosque", "skyscraper", "stadium", "temple"]
419
  custom_objects = {"GeMPooling": GeMPooling, "FocalLoss": FocalLoss,
420
  "DiscriminativeAdamW": DiscriminativeAdamW}
421
 
422
  model = tf.keras.models.load_model("saved_model", custom_objects=custom_objects, compile=False)
423
 
424
- img = Image.open("building.jpg").convert("RGB").resize((320, 320))
425
  arr = np.expand_dims(tf.keras.applications.densenet.preprocess_input(
426
  np.array(img, dtype=np.float32)), axis=0)
427
  preds = model.predict(arr, verbose=0)[0]
@@ -432,31 +486,30 @@ print(f"Predicted: {LABELS[np.argmax(preds)]} ({np.max(preds)*100:.1f}%)")
432
 
433
  - Architectural style classification from building photographs
434
  - Educational tool for architecture recognition
435
- - Research baseline for fine-grained visual categorization (FGVC)
436
  - Transfer learning experiments on architectural imagery
437
 
438
  ## Limitations
439
 
440
  - Trained on Pexels stock photography; performance may differ on user-generated or field photographs
441
- - Limited to 6 architectural classes (bridge, castle, mosque, skyscraper, stadium, temple)
442
- - **Stadium** has the lowest recall (90.48%) and is often confused with temple due to similar structural outlines
443
- - **Temple** is the most frequently misclassified class, typically confused with mosque (shared domes, minarets, and ornamental features)
444
- - **Skyscraper↔Bridge** confusion occurs when bridge structural elements resemble tall buildings against a skyline
445
- - Inference confidence can be low on atypical examples (e.g., skyscraper predicted at 57.0% confidence with bridge at 36.2%)
446
 
447
- ![Misclassification Examples](misclassification_examples.png)
448
 
449
  ## Ethical Considerations
450
 
451
  - All training images sourced from [Pexels.com](https://www.pexels.com) under the Pexels License (free for commercial use, no attribution required). No copyrighted or personally identifiable images were used.
452
  - The dataset contains only photographs of buildings and structures; no people, faces, or private property are the subject of classification.
453
  - The model reflects the visual distribution of Pexels stock photography, which may over-represent Western and iconic architectural styles and under-represent vernacular or regional architecture.
454
- - The 6 class categories are broad and do not capture the full diversity of world architecture. Results should not be used to make definitive claims about architectural categorization.
455
  - URL pattern filtering during dataset collection explicitly excluded AI-generated art, illustrations, and non-photographic content to ensure authenticity.
456
 
457
  ## Links
458
 
459
- - **Gradio Demo:** [arch-building-classifier Space](https://huggingface.co/spaces/0xgr3y/arch-building-classifier)
460
  - **Dataset:** [0xgr3y/arch-building-dataset](https://huggingface.co/datasets/0xgr3y/arch-building-dataset)
461
  - **GitHub:** [arcxteam/arch-building-classifier](https://github.com/arcxteam/arch-building-classifier)
462
 
@@ -478,14 +531,19 @@ print(f"Predicted: {LABELS[np.argmax(preds)]} ({np.max(preds)*100:.1f}%)")
478
  14. Shanmugam, D., Blalock, D., Balakrishnan, G., Guttag, J., & Sarma, A. (2020). Towards Principled Test-Time Augmentation. *ICML 2020*. [PDF](https://dmshanmugam.github.io/pdfs/icml_2020_testaug.pdf)
479
  15. Loshchilov, I., & Hutter, F. (2017). SGDR: Stochastic Gradient Descent with Warm Restarts. *ICLR 2017*. [arXiv:1608.03983](https://arxiv.org/abs/1608.03983)
480
  16. Prechelt, L. (1998). Automatic Early Stopping Using Cross Validation: Quantifying the Criteria. *Neural Networks*, 11(4), 761–767. [https://doi.org/10.1016/S0893-6080(98)00010-0](https://doi.org/10.1016/S0893-6080(98)00010-0)
 
 
 
 
 
 
481
 
482
  ## Citation
483
 
484
  ```bibtex
485
  @misc{saugani2026_arch_building,
486
- title={Fine-Grained Visual Categorization (FGVC) of World Architectural Buildings
487
- Using CNN Transfer Learning DenseNet121 with Fine-Tuning and
488
- Multi-Layer Regularization Strategy},
489
  author={Saugani},
490
  year={2026},
491
  publisher={Hugging Face},
 
8
  - densenet121
9
  - architecture
10
  - building
 
11
  - fgvc
12
  - transfer-learning
13
  - gem-pooling
14
  - focal-loss
 
15
  - discriminative-learning-rate
16
+ - swa
17
+ - grad-cam
18
+ - calibration
19
+ - roc-auc
20
  library_name: keras
21
  language: en
22
  datasets:
 
36
  split: test
37
  metrics:
38
  - type: accuracy
39
+ value: 0.9688
40
  name: Test Accuracy
41
  - type: accuracy
42
+ value: 0.9658
43
  name: Validation Accuracy (SWA)
44
  - type: accuracy
45
+ value: 0.968
46
  name: TTA Accuracy
47
  ---
48
 
49
+ # Fine-Grained World Architecture Image Classification: A DenseNet121 Transfer Learning Approach with Layered Regularization
50
 
51
+ ### Architectural Building Image Classifier
52
+
53
+ Fine-Grained Visual Classification (FGVC) of world architectural buildings using CNN transfer learning with DenseNet121, enhanced with GeM Pooling, Focal Loss, Discriminative AdamW (LR), Stochastic Weight Averaging (SWA), Grad-CAM explainability, and calibration analysis.
54
 
55
  <table>
56
+ <tr><td><strong>Architecture</strong></td><td>DenseNet121 + GeM Pooling + Focal Loss + SWA</td></tr>
57
+ <tr><td><strong>Task</strong></td><td>Fine-Grained Visual Classification (FGVC)</td></tr>
58
+ <tr><td><strong>Test Accuracy</strong></td><td>96.88%</td></tr>
59
+ <tr><td><strong>Classes</strong></td><td>8 (Barn, Bridge, Castle, Mosque, Skyscraper, Stadium, Temple, Windmill)</td></tr>
60
  <tr><td><strong>Input Size</strong></td><td>320 × 320 pixels</td></tr>
61
+ <tr><td><strong>Parameters</strong></td><td>9,466,953</td></tr>
62
  <tr><td><strong>Framework</strong></td><td>TensorFlow / Keras 3</td></tr>
63
  <tr><td><strong>License</strong></td><td><a href="https://www.apache.org/licenses/LICENSE-2.0">Apache-2.0</a></td></tr>
64
  </table>
65
 
66
  ## Model Description
67
 
68
+ A fine-grained image classification model for world architectural buildings. Built on DenseNet121 pretrained on ImageNet, it is enhanced with GeM Pooling (learnable generalized mean pooling), Focal Loss, Discriminative AdamW, and Stochastic Weight Averaging (SWA). The evaluation is extended with Grad-CAM explainability visualization, ROC-AUC evaluation, ECE calibration analysis, and t-SNE embedding visualization.
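Of the components named above, focal loss is the simplest to state in a few lines. The sketch below is a plain-Python illustration of the per-sample formula from Lin et al. with gamma=2.0. It is not the repository's `FocalLoss` class, and it leaves out label smoothing and any class weighting that class may apply. The function names and the toy probability vectors are made up for illustration.

```python
import math

def cross_entropy(probs, target):
    """Standard cross-entropy for one sample: -log(p_t)."""
    p_t = min(max(probs[target], 1e-7), 1.0)  # clamp to avoid log(0)
    return -math.log(p_t)

def focal_loss(probs, target, gamma=2.0):
    """Focal loss for one sample: -(1 - p_t)^gamma * log(p_t)."""
    p_t = min(max(probs[target], 1e-7), 1.0)
    return ((1.0 - p_t) ** gamma) * cross_entropy(probs, target)

# Hypothetical softmax outputs for a 3-class toy problem, true class = index 1.
easy = [0.05, 0.90, 0.05]  # confident and correct
hard = [0.50, 0.30, 0.20]  # true class receives only 0.30

for name, probs in (("easy", easy), ("hard", hard)):
    ce = cross_entropy(probs, 1)
    fl = focal_loss(probs, 1)
    print(f"{name}: CE={ce:.4f} focal={fl:.4f} scale={fl / ce:.2f}")
```

The modulating factor `(1 - p_t)^gamma` shrinks the loss of the confident sample far more than that of the uncertain one, which is the down-weighting effect the description refers to.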
69
 
70
  **Key architectural contributions:**
71
 
72
  - **GeM Pooling** (Radenovic et al., CVPR 2018): replaces global average pooling with a learnable power parameter (p=3.0) that emphasizes high-activation features, yielding stronger discriminative representations for FGVC tasks
73
  - **Focal Loss** (Lin et al., ICCV 2017, gamma=2.0): down-weights well-classified examples to focus gradient updates on hard-to-classify building pairs
74
  - **DiscriminativeAdamW**: extends AdamW with per-layer learning rate multipliers, where conv4_block receives LR × 0.1 (pretrained features require smaller updates), while conv5_block and the custom head receive LR × 1.0
75
+ - **SWA with BN re-estimation** (Izmailov et al., UAI 2018): 10-epoch post-training weight averaging with constant LR 1e-4, followed by 100-step batch normalization statistics re-estimation
76
+ - **Grad-CAM** (Selvaraju et al., ICCV 2017): gradient-weighted class activation mapping for explainability, targeting *conv5_block16_concat*
77
+ - **ECE Calibration** (Guo et al., ICML 2017): Expected Calibration Error with 15-bin reliability diagram to assess prediction confidence reliability
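To make the GeM Pooling bullet concrete, here is a minimal NumPy sketch of generalized-mean pooling with a fixed exponent. It mirrors the clamp, power, mean, root sequence of the `GeMPooling.call` method shown later in this README, but `gem_pool`, the `eps` value, and the random feature map are assumptions for illustration, and the exponent is not trained here.

```python
import numpy as np

def gem_pool(features, p=3.0, eps=1e-6):
    """Generalized-mean pooling over the spatial axes of an (N, H, W, C) array."""
    x = np.maximum(features, eps)  # keep the base of the power strictly positive
    return np.mean(x ** p, axis=(1, 2)) ** (1.0 / p)

rng = np.random.default_rng(0)
fmap = rng.random((2, 10, 10, 4)).astype(np.float32)  # toy feature map

avg = fmap.mean(axis=(1, 2))  # what global average pooling would return
mx = fmap.max(axis=(1, 2))    # what global max pooling would return
gem = gem_pool(fmap, p=3.0)

print(gem.shape)
print(bool(np.all(avg <= gem + 1e-6)), bool(np.all(gem <= mx + 1e-6)))
```

With p=1 the function reduces to average pooling, and as p grows it approaches max pooling, so p=3.0 sits between the two and weights strong activations more heavily.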
78
 
79
  ## Architecture
80
 
 
93
  BatchNormalization → 1,024 params
94
  Dropout(0.4) → 0 params
95
  │
96
+ Dense(8, Softmax) → 2,056 params
97
  │
98
+ Output (8 classes)
99
  ```
100
 
101
  | Component | Output Shape | Parameters |
 
108
  | Dense 256 ReLU | (None, 256) | 65,792 |
109
  | BatchNormalization | (None, 256) | 1,024 |
110
  | Dropout 0.4 | (None, 256) | 0 |
111
+ | Dense 8 Softmax | (None, 8) | 2,056 |
112
+ | **Total** | | **9,466,953** |
113
+ | Trainable (Phase 1) | | **2,428,425** (9.27 MB) |
114
+ | Trainable (Phase 2) | | **7,884,297** (30.07 MB) |
115
  | Non-trainable (Phase 1) | | **7,038,528** (26.85 MB) |
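The parameter counts in this table can be cross-checked with simple arithmetic. The sketch below assumes the final Dense layer receives 256 inputs (the width of the preceding Dense/BatchNormalization/Dropout stack) and that BatchNormalization stores four values per feature. The helper names are illustrative and not part of the repository.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def batchnorm_params(n_features):
    """gamma, beta, moving mean, and moving variance per feature."""
    return 4 * n_features

head_8 = dense_params(256, 8)   # classifier head for 8 classes
head_6 = dense_params(256, 6)   # classifier head before this commit (6 classes)
bn = batchnorm_params(256)

# Totals as reported in the README tables.
total_8_class = 9_466_953
total_6_class = 9_466_439
trainable_phase1 = 2_428_425
non_trainable_phase1 = 7_038_528

print(head_8, head_6, bn)
print(trainable_phase1 + non_trainable_phase1 == total_8_class)
print(total_8_class - total_6_class == head_8 - head_6)
```

The growth of the total from the 6-class to the 8-class model is fully explained by the two extra output units of the softmax layer.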
116
 
117
  ## Performance
 
120
 
121
  | Metric | Value |
122
  |--------|-------|
123
+ | Test Accuracy | 96.88% |
124
+ | Validation Accuracy (SWA) | 96.58% |
125
+ | Test-Time Augmentation | 96.80% |
126
+ | Test Loss | 0.4485 |
127
+ | Overfitting Gap (Train − Test) | 3.00% |
128
+ | Macro Avg Precision | 0.9691 |
129
+ | Macro Avg Recall | 0.9688 |
130
+ | Macro Avg F1-Score | 0.9687 |
131
+ | Top-2 Accuracy | 98.59% |
132
+ | Top-3 Accuracy | 99.33% |
133
+ | Macro ROC-AUC (OvR) | 0.9986 |
134
+ | ECE (15 bins) | 0.1438 |
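For readers unfamiliar with the ECE row, the following is a minimal sketch of Expected Calibration Error with 15 equal-width confidence bins, following the definition in Guo et al. The README does not state the binning scheme beyond the bin count, so equal-width bins are an assumption, and the function name and toy inputs are illustrative only.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Weighted mean of |accuracy - confidence| over equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # in_bin.mean() is the bin's share of samples
    return float(ece)

# Toy example: top-class confidence and whether the prediction was right.
conf = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
hit = [1, 1, 1, 1, 1, 0]
ece = expected_calibration_error(conf, hit)
print(round(ece, 4))
```

A value of 0 means confidence matches accuracy in every bin; larger values indicate over- or under-confidence, which the reliability diagram shows bin by bin.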
135
 
136
  ### Per-Class Results
137
 
138
  | Class | Precision | Recall | F1-Score | Support |
139
  |-------|-----------|--------|----------|---------|
140
+ | Barn | 0.9645 | 0.9702 | 0.9674 | 168 |
141
+ | Bridge | 0.9588 | 0.9702 | 0.9645 | 168 |
142
+ | Castle | 0.9649 | 0.9821 | 0.9735 | 168 |
143
+ | Mosque | 0.9649 | 0.9821 | 0.9735 | 168 |
144
+ | Skyscraper | 0.9708 | 0.9881 | 0.9794 | 168 |
145
+ | Stadium | 0.9936 | 0.9286 | 0.9600 | 168 |
146
+ | Temple | 0.9816 | 0.9524 | 0.9668 | 168 |
147
+ | Windmill | 0.9535 | 0.9762 | 0.9647 | 168 |
148
+ | **Macro Avg** | **0.9691** | **0.9688** | **0.9687** | **1,344** |
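The macro averages and the headline accuracy can be re-derived from the per-class rows. The sketch below copies the table values, averages them, and recovers integer hit counts from the rounded recalls using the support of 168 images per class. The variable names are illustrative.

```python
SUPPORT = 168  # test images per class, from the Support column

# (precision, recall, f1) copied from the per-class table
rows = {
    "barn":       (0.9645, 0.9702, 0.9674),
    "bridge":     (0.9588, 0.9702, 0.9645),
    "castle":     (0.9649, 0.9821, 0.9735),
    "mosque":     (0.9649, 0.9821, 0.9735),
    "skyscraper": (0.9708, 0.9881, 0.9794),
    "stadium":    (0.9936, 0.9286, 0.9600),
    "temple":     (0.9816, 0.9524, 0.9668),
    "windmill":   (0.9535, 0.9762, 0.9647),
}

# Unweighted mean of each column.
macro = [round(sum(r[i] for r in rows.values()) / len(rows), 4) for i in range(3)]

# Recover correct-prediction counts from the rounded recalls, then recompute accuracy.
hits = {name: round(r[1] * SUPPORT) for name, r in rows.items()}
accuracy = sum(hits.values()) / (SUPPORT * len(rows))

print(macro)
print(sum(hits.values()), accuracy)
```

Because every class has the same support, macro recall equals overall accuracy. Averaging the already-rounded recalls lands one unit in the last digit below the reported 0.9688, while the count-based value of 1,302 correct out of 1,344 reproduces the reported 96.88%.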
 
149
 
150
  ### Model Selection
151
 
 
153
 
154
  | Checkpoint | Val Accuracy | Val Loss | Description |
155
  |------------|-------------|----------|-------------|
156
+ | `best_phase1.keras` | 89.21% | 1.2231 | Phase 1 checkpoint (backbone frozen) |
157
+ | `best_phase2.keras` | 92.04% | 0.6171 | Phase 2 checkpoint (conv4+conv5 unfrozen) |
158
+ | `best_phase2_ema.keras` | 89.36% | 0.8183 | Phase 2 EMA shadow weights |
159
+ | **`best_phase2_swa.keras`** | **96.58%** | **0.4256** | **SWA averaged weights ← SELECTED** |
160
 
161
  ### SWA Progression
162
 
163
  | SWA Epoch | Val Accuracy | Val Loss |
164
  |-----------|-------------|----------|
165
+ | 1 | 93.38% | 0.5580 |
166
+ | 2 | 93.60% | 0.5738 |
167
+ | 3 | 92.86% | 0.5725 |
168
+ | 4 | 95.24% | 0.4806 |
169
+ | 5 | 95.68% | 0.4529 |
170
+ | 6 | 96.35% | 0.4548 |
171
+ | 7 | 94.27% | 0.5141 |
172
+ | 8 | 94.12% | 0.5147 |
173
+ | 9 | 94.49% | 0.5243 |
174
+ | 10 | 96.50% | 0.4424 |
175
+ | **SWA + BN (final)** | **96.58%** | **0.4256** |
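The final row comes from averaging the weights collected across the SWA epochs, not from picking the best single epoch. A minimal sketch of an equal-weight running average over weight snapshots is given below. It assumes one snapshot per SWA epoch and uses tiny made-up arrays in place of real model weights; `swa_update` is a hypothetical helper, not the repository's implementation.

```python
import numpy as np

def swa_update(avg_weights, new_weights, n_averaged):
    """Fold one more snapshot into a running equal-weight average."""
    return [a + (w - a) / (n_averaged + 1) for a, w in zip(avg_weights, new_weights)]

# Three hypothetical snapshots, each a list of weight tensors.
snapshots = [
    [np.full((2, 2), float(k)), np.array([k, -k], dtype=float)]
    for k in (1, 2, 3)
]

swa = [w.copy() for w in snapshots[0]]
for n, snap in enumerate(snapshots[1:], start=1):
    swa = swa_update(swa, snap, n)

print(swa[0])
print(swa[1])
# After averaging, batch-normalization statistics no longer match the new
# weights, which is why a re-estimation pass over training batches follows.
```

The running form avoids keeping every snapshot in memory, which matters once the tensors are full-size model weights.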
176
+
177
+ ![Training Curves](results/training_curves.png)
178
+
179
+ ![Confusion Matrix](results/confusion_matrix.png)
180
+
181
+ ![Per-Class Accuracy](results/per_class_accuracy.png)
182
 
183
+ ![Reliability Diagram](results/reliability_diagram.png)
184
 
185
+ ![ROC Curves](results/roc_curves.png)
186
 
187
+ ![t-SNE Embedding](results/tsne_embedding.png)
188
+
189
+ ![Grad-CAM Heatmaps](results/gradcam_heatmaps.png)
190
+
191
+ ![Confidence Per Class](results/confidence_per_class.png)
192
 
193
  ## Training Details
194
 
 
198
 
199
  | Phase | Description | Backbone | Optimizer | LR | Max Epochs | Actual Epochs | CutMix+Mixup | FocalLoss LS |
200
  |-------|-------------|----------|-----------|-----|-----------|---------------|---------------|-------------|
201
+ | **Phase 1**: Feature Extraction | Train custom head only | Frozen (all) | AdamW (wd=2e-5) | 0.001 + CosineDecay + Warmup 3ep | 25 | 1 ¹ | Yes (50/50 alternation) | 0.1 |
202
+ | **Phase 2**: Selective Fine-Tuning | Load best_phase1 → fine-tune | conv4_block + conv5_block unfrozen (BN frozen) | DiscriminativeAdamW (conv4=0.1×) | 3e-4 + CosineDecay + Warmup 5ep | 50 | 6 + 10 SWA ² | No | 0.05 |
203
 
204
+ > ¹ Phase 1 stops once `val_accuracy ≥ 85%` (myCallback threshold).
205
 
206
+ > ² Phase 2 stops once `val_accuracy ≥ 92%` (myCallback threshold), followed by 10 SWA epochs (constant LR 1e-4).
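The "CosineDecay + Warmup" entries in the LR column describe a schedule that ramps up linearly and then anneals along a cosine curve. The sketch below shows one common way to write such a schedule in plain Python. The steps-per-epoch value, the floor of zero, and the function name are assumptions, and the repository's `WarmupCosineDecay` class may differ in detail.

```python
import math

def warmup_cosine_lr(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine annealing down to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

STEPS_PER_EPOCH = 100  # hypothetical; depends on batch size
warmup_steps = 3 * STEPS_PER_EPOCH   # Phase 1: 3 warmup epochs
total_steps = 25 * STEPS_PER_EPOCH   # Phase 1: 25 max epochs
peak = 1e-3                          # Phase 1 base LR

for epoch in (0, 3, 14, 25):
    lr = warmup_cosine_lr(epoch * STEPS_PER_EPOCH, peak, warmup_steps, total_steps)
    print(epoch, f"{lr:.6f}")
```

Because both phases stop early on the accuracy threshold, only the first part of such a schedule is actually traversed in the reported runs.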
207
 
208
  ### Hyperparameters
209
 
 
220
  | Early Stopping Patience | 7 | 12 |
221
  | myCallback Threshold | val_acc ≥ 0.85 | val_acc ≥ 0.92 |
222
  | EMA Decay | 0.999 | 0.999 |
223
+ | SWA Epochs | n/a | 10 (post-training) |
224
  | SWA LR | n/a | 1×10⁻⁴ (constant) |
225
  | BN Re-estimation Steps | n/a | 100 |
226
  | CutMix (alpha=1.0) | Yes (50% batches) | No |
 
241
  | Dropout | 0.4 after Dense(256)+BN | Srivastava et al., JMLR 2014 |
242
  | Batch Normalization | After Conv2D and Dense; frozen during fine-tuning | Ioffe & Szegedy, arXiv 2015 |
243
  | EMA | Shadow weights, decay=0.999 | Tarvainen & Valpola, NeurIPS 2017 |
244
+ | SWA | 10-epoch post-training, constant LR 1e-4 | Izmailov et al., UAI 2018 |
245
  | Data Augmentation | Rotation ±15°, shift ±10%, zoom ±20%, brightness 0.75–1.15, horizontal flip | Perez & Wang, arXiv 2017 |
246
  | Test-Time Augmentation | 6 augmentation variants, averaged | Shanmugam et al., ICML 2020 |
247
  | WarmupCosineDecay | Linear warmup + cosine annealing | Loshchilov & Hutter, ICLR 2017 (SGDR) |
 
249
 
250
  ### Dataset
251
 
252
+ [0xgr3y/arch-building-dataset](https://huggingface.co/datasets/0xgr3y/arch-building-dataset): 13,440 images (8 classes × 1,680, balanced) sourced from Pexels with perceptual (pHash) and exact (SHA256) deduplication.
253
 
254
  | Split | Images | Percentage |
255
  |-------|--------|------------|
256
+ | Train | 10,752 | 80% |
257
+ | Validation | 1,344 | 10% |
258
+ | Test | 1,344 | 10% |
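The split sizes follow directly from the class count and the 80/10/10 ratios. The short check below reproduces them and also derives the per-class test support, assuming the split is stratified by class (which the uniform support of 168 in the per-class table suggests).

```python
classes = 8
images_per_class = 1_680
total = classes * images_per_class

fractions = {"train": 0.80, "validation": 0.10, "test": 0.10}
split = {name: round(total * frac) for name, frac in fractions.items()}

# With a stratified split, each class contributes equally to every subset.
per_class_test = split["test"] // classes

print(total, split, per_class_test)
```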
259
 
260
  ### Data Preprocessing
261
 
 
268
 
269
  | File | Description |
270
  |------|-------------|
271
+ | `best_phase2_swa.keras` | Best model: SWA averaged weights |
272
+ | `best_phase2.keras` | Phase 2 checkpoint |
273
  | `saved_model/` | TensorFlow SavedModel format (portable, for TF Serving) |
274
  | `tflite/model.tflite` | TensorFlow Lite model (mobile/embedded) |
275
  | `tflite/label.txt` | Class label names for TF-Lite |
276
  | `tfjs_model/` | TensorFlow.js model (browser, 10 weight shards + model.json) |
277
  | `config.json` | Model configuration and evaluation metrics |
278
+ | `label_mapping.json` | Class name ↔ ID mapping with training config and architecture info |
279
  | `preprocessor_config.json` | Input preprocessing specification (320×320) |
280
+ | `confusion_pairs.json` | Auto-detected confusion pairs from confusion matrix (threshold >5%) |
281
+ | `class_confidence_stats.json` | Per-class mean/std/p5/p95 confidence distribution |
282
+ | `model_benchmark.json` | Model parameters, sizes, speed, Top-K, AUC, ECE, TTA metrics |
283
+ | `calibration_data.json` | ECE, bin accuracies/confidences, per-class AUC for calibration analysis |
284
+ | `results/training_curves.png` | Training/validation accuracy and loss curves |
285
+ | `results/confusion_matrix.png` | Confusion matrix on test set |
286
+ | `results/per_class_accuracy.png` | Per-class accuracy bar chart |
287
+ | `results/reliability_diagram.png` | ECE calibration reliability diagram (15 bins) |
288
+ | `results/roc_curves.png` | Per-class ROC curves (One-vs-Rest) |
289
+ | `results/tsne_embedding.png` | t-SNE 2D scatter plot from GeM Pooling features |
290
+ | `results/gradcam_heatmaps.png` | Grad-CAM heatmap visualization per class |
291
+ | `results/confidence_per_class.png` | Per-class confidence distribution bar chart |
292
+ | `results/misclassification_examples.png` | Misclassified test samples |
293
+ | `results/augmentation_examples.png` | Example images after augmentation (training set) |
294
+ | `results/inference_keras.png` | Keras inference grid, 1 sample per class |
295
+ | `results/inference_tflite.png` | TF-Lite inference grid, 1 sample per class |
296
 
297
  ## Usage
298
 
299
+ ### Gradio Space
300
 
301
+ Try the live inference: [arch-building-classifier Space](https://huggingface.co/spaces/0xgr3y/arch-building-classifier)
302
 
303
  ### Python: Keras
304
 
 
322
  initializer=tf.keras.initializers.Constant(self.p_init), trainable=True)
323
  super().build(input_shape)
324
  def call(self, x):
325
+ x = tf.maximum(x, self.eps)
326
  x = tf.pow(x, self.p)
327
  x = tf.reduce_mean(x, axis=[1, 2], keepdims=False)
328
  return tf.pow(x, 1.0 / self.p)
 
387
 
388
  # =====================---Load Model---==========================
389
 
390
+ LABELS_PATH = hf_hub_download("0xgr3y/Arch-Building-Image-Classification", "label_mapping.json")
391
+ import json
392
+ with open(LABELS_PATH) as f:
393
+ LABELS = json.load(f)["labels"]
394
+ # ["barn","bridge","castle","mosque","skyscraper","stadium","temple","windmill"]
395
+
396
  custom_objects = {
397
  "GeMPooling": GeMPooling,
398
  "FocalLoss": FocalLoss,
 
404
 
405
  # =======================---Inference---==========================
406
 
407
+ img = Image.open("skyscraper_00000.jpg").convert("RGB").resize((320, 320))
408
  arr = np.expand_dims(preprocess_input(np.array(img, dtype=np.float32)), axis=0)
409
  preds = model.predict(arr, verbose=0)[0]
410
  print(f"Predicted: {LABELS[np.argmax(preds)]} ({np.max(preds)*100:.1f}%)")
 
422
  ```python
423
  import numpy as np
424
  import tensorflow as tf
425
+ from huggingface_hub import hf_hub_download
426
  from PIL import Image
427
+ import json
428
 
429
+ labels_path = hf_hub_download("0xgr3y/Arch-Building-Image-Classification", "label_mapping.json")
430
+ with open(labels_path) as f:
431
+ LABELS = json.load(f)["labels"]
432
 
433
  interpreter = tf.lite.Interpreter(model_path="tflite/model.tflite")
434
  interpreter.allocate_tensors()
 
436
  input_details = interpreter.get_input_details()
437
  output_details = interpreter.get_output_details()
438
 
439
+ img = Image.open("skyscraper_00000.jpg").convert("RGB").resize((320, 320))
440
  arr = np.expand_dims(np.array(img, dtype=np.float32), axis=0)
441
  arr = tf.keras.applications.densenet.preprocess_input(arr)
442
 
 
462
  ```python
463
  import tensorflow as tf
464
  import numpy as np
465
+ from huggingface_hub import hf_hub_download
466
  from PIL import Image
467
+ import json
468
+
469
+ labels_path = hf_hub_download("0xgr3y/Arch-Building-Image-Classification", "label_mapping.json")
470
+ with open(labels_path) as f:
471
+ LABELS = json.load(f)["labels"]
472
 
 
473
  custom_objects = {"GeMPooling": GeMPooling, "FocalLoss": FocalLoss,
474
  "DiscriminativeAdamW": DiscriminativeAdamW}
475
 
476
  model = tf.keras.models.load_model("saved_model", custom_objects=custom_objects, compile=False)
477
 
478
+ img = Image.open("skyscraper_00000.jpg").convert("RGB").resize((320, 320))
479
  arr = np.expand_dims(tf.keras.applications.densenet.preprocess_input(
480
  np.array(img, dtype=np.float32)), axis=0)
481
  preds = model.predict(arr, verbose=0)[0]
 
486
 
487
  - Architectural style classification from building photographs
488
  - Educational tool for architecture recognition
489
+ - Research baseline for fine-grained visual classification (FGVC)
490
  - Transfer learning experiments on architectural imagery
491
 
492
  ## Limitations
493
 
494
  - Trained on Pexels stock photography; performance may differ on user-generated or field photographs
495
+ - Limited to 8 architectural classes (barn, bridge, castle, mosque, skyscraper, stadium, temple, windmill)
496
+ - Confusion pair analysis found **0 significant pairs** (threshold >5%), so all 8 classes are well-distinguished by the model; see `confusion_pairs.json` for details
497
+ - Barn and windmill share 3 cross-class duplicates (0.02% of dataset), left as-is due to negligible impact
498
+ - Inference confidence can be low on atypical examples
 
499
 
500
+ ![Misclassification Examples](results/misclassification_examples.png)
501
 
502
  ## Ethical Considerations
503
 
504
  - All training images sourced from [Pexels.com](https://www.pexels.com) under the Pexels License (free for commercial use, no attribution required). No copyrighted or personally identifiable images were used.
505
  - The dataset contains only photographs of buildings and structures; no people, faces, or private property are the subject of classification.
506
  - The model reflects the visual distribution of Pexels stock photography, which may over-represent Western and iconic architectural styles and under-represent vernacular or regional architecture.
507
+ - The 8 class categories are broad and do not capture the full diversity of world architecture. Results should not be used to make definitive claims about architectural categorization.
508
  - URL pattern filtering during dataset collection explicitly excluded AI-generated art, illustrations, and non-photographic content to ensure authenticity.
509
 
510
  ## Links
511
 
512
+ - **Gradio Space (Live):** [arch-building-classifier Space](https://huggingface.co/spaces/0xgr3y/arch-building-classifier)
513
  - **Dataset:** [0xgr3y/arch-building-dataset](https://huggingface.co/datasets/0xgr3y/arch-building-dataset)
514
  - **GitHub:** [arcxteam/arch-building-classifier](https://github.com/arcxteam/arch-building-classifier)
515
 
 
531
  14. Shanmugam, D., Blalock, D., Balakrishnan, G., Guttag, J., & Sarma, A. (2020). Towards Principled Test-Time Augmentation. *ICML 2020*. [PDF](https://dmshanmugam.github.io/pdfs/icml_2020_testaug.pdf)
532
  15. Loshchilov, I., & Hutter, F. (2017). SGDR: Stochastic Gradient Descent with Warm Restarts. *ICLR 2017*. [arXiv:1608.03983](https://arxiv.org/abs/1608.03983)
533
  16. Prechelt, L. (1998). Automatic Early Stopping Using Cross Validation: Quantifying the Criteria. *Neural Networks*, 11(4), 761–767. [https://doi.org/10.1016/S0893-6080(98)00010-0](https://doi.org/10.1016/S0893-6080(98)00010-0)
534
+ 17. Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On Calibration of Modern Neural Networks. *ICML 2017*. [arXiv:1706.04599](https://arxiv.org/abs/1706.04599)
535
+ 18. Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2017). Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization. *ICCV 2017*. [arXiv:1610.02391](https://arxiv.org/abs/1610.02391)
536
+ 19. van der Maaten, L., & Hinton, G. (2008). Visualizing Data using t-SNE. *JMLR*, 9(Nov), 2579–2605. [http://jmlr.org/papers/v9/vandermaaten08a.html](http://jmlr.org/papers/v9/vandermaaten08a.html)
537
+ 20. Hand, D. J., & Till, R. J. (2001). A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems. *Machine Learning*, 45(2), 171–186. [https://doi.org/10.1023/A:1010920819831](https://doi.org/10.1023/A:1010920819831)
538
+ 21. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., ... & Fei-Fei, L. (2015). ImageNet Large Scale Visual Recognition Challenge. *IJCV*, 115(3), 211–252. [arXiv:1409.0575](https://arxiv.org/abs/1409.0575)
539
+ 22. Lakshminarayanan, B., Pritzel, A., & Blundell, C. (2017). Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. *NeurIPS 2017*. [arXiv:1612.01474](https://arxiv.org/abs/1612.01474)
540
 
541
  ## Citation
542
 
543
  ```bibtex
544
  @misc{saugani2026_arch_building,
545
+ title={Fine-Grained World Architecture Image Classification:
546
+ A DenseNet121 Transfer Learning Approach with Layered Regularization},
 
547
  author={Saugani},
548
  year={2026},
549
  publisher={Hugging Face},