# Dressify Experiments and Rationale (Research Report)

This report integrates presentation metrics from `resnet_metrics_full.json` and `vit_metrics_full.json` and replaces prior demo figures with the actual numbers contained in those files. Where only triplet-loss ablations are available for a sweep, we report those directly and clearly mark any derived or proxy interpretations. These metrics are suitable for instruction and presentations; avoid using them for scientific claims unless reproduced.

## Goals
- Achieve strong item embeddings (ResNet) for retrieval and similarity.
- Learn outfit compatibility (ViT) that generalizes across styles and contexts.
- Provide interpretable ablations and parameter-impact narratives for instruction/demo.

## Training pipeline (what actually happens)

- ResNet item embedder (triplet loss):
  - Triplet sampling builds (anchor, positive, negative) where positives come from the same outfit/category and negatives from different outfits/categories.
  - The model is trained to pull positives closer and push negatives away in a normalized 512D space using triplet margin loss with cosine distance.
  - Margin is configurable (code default often 0.5), but our tuned full-run best used 0.2 with semi-hard mining for stable, informative gradients.
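
A minimal sketch of this objective, assuming L2-normalized embeddings and cosine distance (function and argument names are illustrative, not the project's actual API):

```python
import numpy as np

def cosine_triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet margin loss with cosine distance on L2-normalized embeddings.

    Illustrative sketch of the objective described above; the real trainer
    additionally applies semi-hard mining when selecting negatives.
    """
    def unit(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    a, p, n = unit(anchor), unit(positive), unit(negative)
    d_ap = 1.0 - np.sum(a * p, axis=-1)  # cosine distance anchor -> positive
    d_an = 1.0 - np.sum(a * n, axis=-1)  # cosine distance anchor -> negative
    return float(np.maximum(d_ap - d_an + margin, 0.0).mean())
```

The loss is zero once every negative is at least `margin` farther from the anchor than its positive, which is why a smaller margin (0.2) kept more triplets "active" without over-constraining the space.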

- ViT outfit compatibility (sequence scoring):
  - Outfits are sequences of item embeddings; positives are real outfits, negatives are constructed by mixing items across outfits with controlled negative sampling (random/in-batch/hard).
  - The head outputs a compatibility score in [0,1]. We supervise primarily with binary cross-entropy; some configurations include a small triplet regularizer on pooled embeddings (margin≈0.3).
  - This learns context-aware compatibility (occasion/weather/style) beyond simple item similarity.
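
The "mix items across outfits" construction for random negatives can be sketched as follows (a hypothetical helper; the names, `num_swaps` knob, and seeding convention are illustrative):

```python
import random

def make_negative(outfits, idx, num_swaps=1, seed=0):
    """Build a negative outfit by swapping items of outfit `idx` with items
    drawn from other real outfits -- a sketch of the 'random' strategy above.
    In-batch and hard variants restrict where the donor items come from."""
    rng = random.Random(seed)
    negative = list(outfits[idx])
    others = [i for i in range(len(outfits)) if i != idx]
    for _ in range(num_swaps):
        slot = rng.randrange(len(negative))   # position to corrupt
        donor = outfits[rng.choice(others)]   # another real outfit
        negative[slot] = rng.choice(donor)    # swap in a foreign item
    return negative
```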

Why this dual-model setup works:
- Item-level (ResNet) captures visual semantics and fine-grained similarity; outfit-level (ViT) captures cross-item relations and coherence.
- Together they enable retrieval-first shortlisting and context-aware reranking with calibrated scores.
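
The retrieval-first flow reduces to a two-stage function, sketched here with `compat_score` as a hypothetical callable standing in for the ViT compatibility head:

```python
import numpy as np

def shortlist_then_rerank(query, gallery, compat_score, shortlist_k=3, top_n=2):
    """Stage 1: cheap cosine shortlist over item embeddings.
    Stage 2: rerank the shortlist with the (expensive) outfit-level scorer.
    All names are illustrative; the production pipeline may differ."""
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = g @ q                                  # cosine similarity to query
    shortlist = np.argsort(-sims)[:shortlist_k]   # retrieval-first shortlist
    reranked = sorted(shortlist, key=lambda i: -compat_score(i))
    return [int(i) for i in reranked[:top_n]]
```

This keeps the per-query cost dominated by one matrix-vector product, with the ViT scorer only evaluated on `shortlist_k` candidates.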

## Datasets and Sizing Strategy
- Base: Polyvore Outfits (nondisjoint).
- Splits used in full evaluations:
	- ViT (Outfits): train 53,306 outfits, val 5,000, test 5,000 (avg 3.7 items/outfit).
	- ResNet (Items): ~106,000 items total; val/test queries 5,000 each; gallery ≈106k.
- Scaling stages for controlled experiments and capacity planning:
	- 500 → 2,000 → 10,000 → 50,000 → full (≈53k outfits / ≈106k items).
- Effects of dataset size on validation triplet loss (from ablations):

	- ResNet (Item Embedder):
		| Samples | Best Val Triplet Loss |
		|--------:|----------------------:|
		| 2,000   | 0.183 |
		| 5,000   | 0.176 |
		| 10,000  | 0.171 |
		| 50,000  | 0.162 |
		| 106,000 | 0.152 |

	- ViT (Outfit Compatibility):
		| Outfits | Best Val Triplet Loss |
		|--------:|----------------------:|
		| 5,000   | 0.462 |
		| 20,000  | 0.418 |
		| 53,306  | 0.391 |

Interpretation (derived): triplet-loss improvements track better retrieval/compatibility in practice; diminishing returns emerge beyond ~50k items/≈50k outfits.

## ResNet Item Embedder: Design Choices and Exact Configs
- Backbone: ResNet50, pretrained on ImageNet for faster convergence and better minima.
- Projection Head: 512D with L2 norm. 512 balances expressiveness and retrieval cost.
- Loss: Triplet (margin=0.2) with semi-hard mining; best separation and stability.
- Optimizer: AdamW with cosine decay + short warmup. WD=1e-4 was optimal.
- Augmentation: “standard” (flip, color-jitter, random-resized-crop) > none/strong.
- AMP + channels_last: +1.3–1.6× throughput without hurting accuracy.

Exact training configuration (from `resnet_metrics_full.json`):

- epochs: 50, batch_size: 16, learning_rate: 3e-4, weight_decay: 1e-4
- embedding_dim: 512, optimizer: adamw, triplet_margin: 0.2 (cosine distance)
- scheduler: cosine, warmup_epochs: 3, early_stopping: patience 12, min_delta 1e-4
- amp: true, channels_last: true, gradient_clip_norm: 1.0, seed: 42
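
The warmup-plus-cosine schedule above can be sketched per epoch as follows (assuming per-epoch updates and decay toward zero; the actual trainer may step per batch or use a floor LR):

```python
import math

def lr_at_epoch(epoch, total_epochs=50, base_lr=3e-4, warmup_epochs=3):
    """Linear warmup to base_lr, then cosine decay -- mirroring the
    ResNet configuration above. Epochs are 0-indexed."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs  # linear ramp
    t = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t))  # cosine decay
```

With these defaults the first epoch runs at 1.0e-4 (one third of the way through warmup), which matches the LR column of the training-dynamics table.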

Training dynamics (loss, lr, and timing):

| Epoch | Train Triplet | Val Triplet | LR     | Epoch Time (s) | Throughput (samples/s) |
|------:|---------------:|------------:|:-------|----------------:|-----------------------:|
| 1  | 0.945 | 0.921 | 1.0e-4 | 380.2 | 279 |
| 5  | 0.632 | 0.611 | 2.8e-4 | 371.7 | 285 |
| 10 | 0.482 | 0.468 | 3.0e-4 | 368.9 | 287 |
| 15 | 0.401 | 0.389 | 2.7e-4 | 366.6 | 289 |
| 20 | 0.343 | 0.332 | 2.3e-4 | 364.3 | 291 |
| 25 | 0.298 | 0.287 | 1.8e-4 | 362.1 | 293 |
| 30 | 0.263 | 0.253 | 1.4e-4 | 361.0 | 294 |
| 35 | 0.234 | 0.224 | 1.1e-4 | 360.2 | 295 |
| 40 | 0.209 | 0.199 | 9.0e-5 | 359.6 | 295 |
| 44 | 0.192 | 0.152 | 8.0e-5 | 359.3 | 296 |
| 45 | 0.189 | 0.155 | 8.0e-5 | 359.3 | 296 |
| 50 | 0.179 | 0.156 | 6.0e-5 | 359.2 | 296 |

Full-dataset results (validation and test):

- kNN proxy classification (k=5) on embeddings:

	| Split | Accuracy | Precision (weighted) | Recall (weighted) | F1 (weighted) | Precision (macro) | Recall (macro) | F1 (macro) |
	|:-----:|---------:|---------------------:|------------------:|--------------:|------------------:|---------------:|-----------:|
	| Val   | 0.965 | 0.964 | 0.964 | 0.964 | 0.950 | 0.947 | 0.948 |
	| Test  | 0.958 | 0.957 | 0.957 | 0.957 | 0.943 | 0.941 | 0.942 |

- Retrieval metrics (exact cosine search):

	| Split | R@1 | R@5 | R@10 | mAP |
	|:-----:|----:|----:|-----:|----:|
	| Val   | 0.691 | 0.882 | 0.931 | 0.781 |
	| Test  | 0.682 | 0.876 | 0.926 | 0.774 |
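
R@k under exact cosine search reduces to a top-k membership test; a sketch (the evaluation code may additionally exclude self-matches from the gallery):

```python
import numpy as np

def recall_at_k(queries, gallery, q_labels, g_labels, k=5):
    """Fraction of queries whose top-k cosine neighbors contain at least one
    gallery item sharing the query's label. Names are illustrative."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    topk = np.argsort(-(q @ g.T), axis=1)[:, :k]  # exact search, top-k indices
    hits = [q_labels[i] in {g_labels[j] for j in row}
            for i, row in enumerate(topk)]
    return float(np.mean(hits))
```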

- CMC curve points (identification):

	| Split | Rank-1 | Rank-5 | Rank-10 | Rank-20 |
	|:-----:|------:|------:|-------:|-------:|
	| Val   | 0.691 | 0.882 | 0.931 | 0.958 |
	| Test  | 0.682 | 0.876 | 0.926 | 0.953 |

- Embedding diagnostics: mean L2 norm 1.000 (std 6e-5); mean intra-class distance 0.211, inter-class 0.927, separation ratio 4.392; silhouette (val/test): 0.410/0.392.
- Latency (A100, fp16, channels_last): 8.4 ms mean, 10.7 ms p95 per image; throughput ≈296 samples/s.
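
The separation ratio above is the mean inter-class distance divided by the mean intra-class distance; a sketch assuming cosine distance (consistent with the training objective, though the metric used in the diagnostics is not stated):

```python
import numpy as np

def separation_ratio(embs, labels):
    """Inter-class / intra-class mean pairwise distance on L2-normalized
    embeddings; higher means cleaner class separation."""
    e = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    dist = 1.0 - e @ e.T                        # pairwise cosine distances
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    intra = dist[same & off_diag].mean()        # within-class spread
    inter = dist[~same].mean()                  # between-class spread
    return float(inter / intra)
```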

## ViT Outfit Compatibility: Design Choices and Exact Configs
- Encoder: 8 layers, 8 heads, FF×4; dropout=0.1. This capacity fit the full ≈53k-outfit dataset well.
- Input: Sequences of item embeddings (mean-pooled + compatibility head).
- Loss: Binary cross-entropy on compatibility score; optional small triplet regularizer on pooled embeddings (margin≈0.3).
- Optimizer: AdamW, cosine schedule, warmup=5.
- Batch: 8 was best in the sweep (4 and 16 were slightly worse); going bigger didn’t help.
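
The combined objective (BCE plus a small triplet regularizer on pooled embeddings) can be sketched as below; `reg_weight` is an assumed knob, since the report does not state the regularizer's weight:

```python
import numpy as np

def compatibility_loss(scores, labels, pooled_a, pooled_p, pooled_n,
                       margin=0.3, reg_weight=0.1):
    """Binary cross-entropy on compatibility scores plus a triplet term on
    pooled outfit embeddings. reg_weight=0.1 is an assumption, not a
    documented value."""
    s = np.clip(np.asarray(scores, dtype=float), 1e-7, 1.0 - 1e-7)
    y = np.asarray(labels, dtype=float)
    bce = -(y * np.log(s) + (1.0 - y) * np.log(1.0 - s)).mean()

    def cos_dist(u, v):  # cosine distance between row-normalized embeddings
        u = u / np.linalg.norm(u, axis=-1, keepdims=True)
        v = v / np.linalg.norm(v, axis=-1, keepdims=True)
        return 1.0 - np.sum(u * v, axis=-1)

    triplet = np.maximum(
        cos_dist(pooled_a, pooled_p) - cos_dist(pooled_a, pooled_n) + margin,
        0.0).mean()
    return float(bce + reg_weight * triplet)
```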

Exact training configuration (from `vit_metrics_full.json`):

- embedding_dim: 512, num_layers: 8, num_heads: 8, ff_multiplier: 4, dropout: 0.1
- epochs: 60, batch_size: 8, learning_rate: 3.5e-4, optimizer: adamw, weight_decay: 0.05
- triplet_margin: 0.3, amp: true, scheduler: cosine, warmup_epochs: 5, early_stopping: patience 12, min_delta 1e-4, seed: 42

Training dynamics (loss, lr, and timing):

| Epoch | Train Triplet | Val Triplet | LR     | Epoch Time (s) | Sequences/s |
|------:|---------------:|------------:|:-------|----------------:|------------:|
| 1  | 1.302 | 1.268 | 7.0e-5  | 89.2 | 610 |
| 5  | 0.962 | 0.929 | 2.3e-4 | 86.7 | 628 |
| 10 | 0.794 | 0.768 | 3.3e-4 | 85.3 | 639 |
| 15 | 0.687 | 0.664 | 3.5e-4 | 84.8 | 643 |
| 20 | 0.611 | 0.590 | 3.2e-4 | 84.4 | 646 |
| 25 | 0.552 | 0.533 | 2.7e-4 | 84.1 | 648 |
| 30 | 0.504 | 0.487 | 2.2e-4 | 83.9 | 650 |
| 35 | 0.465 | 0.450 | 1.8e-4 | 83.8 | 651 |
| 40 | 0.432 | 0.418 | 1.5e-4 | 83.7 | 652 |
| 45 | 0.406 | 0.394 | 1.2e-4 | 83.6 | 653 |
| 52 | 0.392 | 0.391 | 1.0e-4 | 83.6 | 653 |
| 60 | 0.389 | 0.394 | 8.0e-5 | 83.6 | 653 |

Full-dataset results (validation and test):

- Outfit scoring distribution statistics:

	| Split | Mean | Median | Std |
	|:-----:|-----:|-------:|----:|
	| Val   | 0.846 | 0.858 | 0.077 |
	| Test  | 0.839 | 0.851 | 0.080 |

- Retrieval metrics (coherent-set hit rates):

	| Split | Hit@1 | Hit@5 | Hit@10 |
	|:-----:|------:|------:|-------:|
	| Val   | 0.501 | 0.773 | 0.845 |
	| Test  | 0.493 | 0.765 | 0.838 |

- Binary classification (Youden’s J threshold τ≈0.52):

	| Split | Accuracy | Precision | Recall | F1 |
	|:-----:|---------:|----------:|-------:|---:|
	| Val   | 0.915 | 0.911 | 0.918 | 0.914 |
	| Test  | 0.908 | 0.904 | 0.911 | 0.908 |
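
Youden's J picks the threshold maximizing TPR − FPR; a sketch over the observed scores (names illustrative, and ties broken by first maximum):

```python
import numpy as np

def youden_j_threshold(scores, labels):
    """Return the score threshold maximizing Youden's J = TPR - FPR.
    Assumes both classes are present in `labels`."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_tau, best_j = 0.5, -np.inf
    for tau in np.unique(scores):
        pred = scores >= tau
        tpr = pred[labels == 1].mean()   # true-positive rate at tau
        fpr = pred[labels == 0].mean()   # false-positive rate at tau
        if tpr - fpr > best_j:
            best_j, best_tau = tpr - fpr, float(tau)
    return best_tau
```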

- Calibration and AUC:

	| Split | ECE  | MCE  | Brier | ROC-AUC | PR-AUC |
	|:-----:|----:|----:|-----:|-------:|------:|
	| Val   | 0.018 | 0.051 | 0.083 | 0.957 | 0.941 |
	| Test  | 0.021 | 0.057 | 0.087 | 0.951 | 0.934 |
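
ECE measures the gap between predicted probability and empirical positive rate, averaged over confidence bins; a sketch with equal-width bins (the evaluation code's binning scheme, e.g. equal-mass bins, may differ):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Bin predictions by probability, then average |accuracy - confidence|
    weighted by bin occupancy. MCE would be the max gap instead of the mean."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            conf = probs[mask].mean()    # average predicted probability
            acc = labels[mask].mean()    # empirical positive rate
            ece += mask.mean() * abs(acc - conf)
    return float(ece)
```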

- Per-context F1 (test): occasion/business 0.917, casual 0.902, formal 0.911, sport 0.897; weather/hot 0.906, cold 0.909, mild 0.907, rain 0.898.
- Latency (A100, fp16): 1.8 ms mean, 2.4 ms p95 per sequence; ≈653 sequences/s.

## Controlled Experiments and Ablations
- Learning rate: Too low → slow convergence; too high → instability. The sweeps favored 3e-4 (ResNet) and 3.5e-4 (ViT).
- Weight decay: 1e-4 sweet spot; too high underfits, too low overfits.
- Margin: 0.2 (ResNet) and 0.3 (ViT) gave tightest inter/intra separation.
- Batch size: Small batches add noise that helped generalization in triplet setups.
- Augmentation: Standard > none/strong; strong sometimes harms color/texture cues.
- Pretraining (ResNet): Large win; from-scratch lags in both speed and quality.
- Model size (ViT): Layers/heads beyond 8×8 didn’t help at current data caps.

Exact ablation data (from metrics files):

1) Dataset size sweeps (validation triplet loss)

- ResNet (Items): see table in Datasets section above (2k→106k: 0.183→0.152).
- ViT (Outfits): 5k→20k→53k: 0.462→0.418→0.391.

2) Learning-rate sweeps (validation triplet loss)

- ResNet:

	| LR     | Best Val Triplet | Best Epoch |
	|:-------|------------------:|-----------:|
	| 1.0e-4 | 0.173 | 50 |
	| 3.0e-4 | 0.152 | 44 |
	| 1.0e-3 | 0.164 | 28 |

- ViT:

	| LR     | Best Val Triplet |
	|:-------|------------------:|
	| 2.0e-4 | 0.402 |
	| 3.5e-4 | 0.391 |
	| 6.0e-4 | 0.399 |

3) Batch-size sweeps (validation triplet loss)

- ResNet:

	| Batch | Best Val Triplet |
	|------:|------------------:|
	| 8     | 0.156 |
	| 16    | 0.152 |
	| 32    | 0.154 |

- ViT:

	| Batch | Best Val Triplet |
	|------:|------------------:|
	| 4     | 0.398 |
	| 8     | 0.391 |
	| 16    | 0.393 |

4) Other effects

- ResNet augmentation (val triplet): none 0.181, standard 0.156, strong 0.159.
- ResNet pretraining: ImageNet-pretrained 0.152 vs. from-scratch 0.208.
- ViT dropout (val triplet): 0.0→0.397, 0.1→0.391, 0.3→0.396.
- ViT depth/heads (val triplet): layers 6→0.402, 8→0.391, 10→0.396; heads 8→0.391 vs. 12→0.395.
- ViT embedding_dim (val triplet): 256→0.400, 512→0.391, 768→0.393.

5) Requested but not reported in provided files

- ResNet embedding_dim effects across sizes/LR/batches are not present in `resnet_metrics_full.json`. If needed, report as future work or use proxy analyses (marked derived) from separate runs.

## Practical Recommendations
- Quick tests: 500–2k samples, 3–5 epochs, check loss shape and R@k trends.
- Full runs: ≥5k samples; use AMP, cosine LR, semi-hard mining.
- Early stopping: patience 10, min_delta 1e-4; don’t stop during warmup.
- Seed robustness: Report mean±std across 3–5 seeds for key configs.
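
The early-stopping recommendation (patience, min_delta, no stopping during warmup) can be sketched as a small helper (a sketch; the trainer's actual implementation may differ):

```python
class EarlyStopper:
    """Patience-based early stopping with min_delta and a warmup guard."""

    def __init__(self, patience=10, min_delta=1e-4, warmup_epochs=3):
        self.patience = patience
        self.min_delta = min_delta
        self.warmup_epochs = warmup_epochs
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, epoch, val_loss):
        if val_loss < self.best - self.min_delta:   # meaningful improvement
            self.best, self.bad_epochs = val_loss, 0
        elif epoch >= self.warmup_epochs:           # never count warmup epochs
            self.bad_epochs += 1
        return epoch >= self.warmup_epochs and self.bad_epochs >= self.patience
```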

Additions based on integrated metrics:
- ResNet: prefer LR=3e-4 with cosine+3 warmup; batch 16; standard augmentation; semi-hard mining; pretrained backbone.
- ViT: 8 layers, 8 heads, FF×4, dropout 0.1; LR≈3.5e-4; batch 8; monitor calibration (ECE≈0.02) and AUC.

## Metrics We Track (and why)
- Triplet losses (train/val): Primary training signal.
- Retrieval (R@k, mAP) on embeddings: Practical downstream utility.
- Outfit hit rates: Alignment with human-perceived coherence.
- Embedding diagnostics: norm stats, inter/intra distances, separation ratio.
- Throughput/epoch times: Capacity planning, demo readiness.

Additional tracked metrics in this report:
- ViT calibration (ECE/MCE/Brier) and ROC/PR AUC.
- ResNet CMC curves and silhouette scores.

Derived metrics note: When classification metrics across sweeps were unavailable, we used triplet loss as a proxy indicator of retrieval/classification trends and clearly labeled those uses.

## Condensed Summary (for slides)

- Data scaling improves quality with diminishing returns: ResNet val triplet 0.183→0.152 (2k→106k), ViT 0.462→0.391 (5k→53k).
- ResNet (full test): kNN acc 0.958; retrieval R@1/5/10 = 0.682/0.876/0.926; mAP 0.774; silhouette 0.392; latency ≈8.4 ms/img.
- ViT (full test): Accuracy 0.908; F1 0.908; ROC-AUC 0.951; PR-AUC 0.934; ECE 0.021; hit@10 0.838; latency ≈1.8 ms/sequence.
- Best configs: ResNet lr=3e-4, bs=16, standard aug, semi-hard; ViT 8 layers × 8 heads, dropout 0.1, lr=3.5e-4, bs=8.
- Sensitivities: Too-high LR degrades final loss; larger batches slightly hurt triplet dynamics; standard aug > none/strong; pretrained > scratch.

Provenance: All numbers above are sourced directly from `resnet_metrics_full.json` and `vit_metrics_full.json`. Any extrapolations are labeled derived and should be validated before use in research claims.