# Dressify Experiments and Rationale (Research Report)
This report integrates presentation metrics from `resnet_metrics_full.json` and `vit_metrics_full.json` and replaces prior demo figures with the actual numbers contained in those files. Where only triplet-loss ablations are available for a sweep, we report those directly and clearly mark any derived or proxy interpretations. These metrics are suitable for instruction and presentations; avoid using them for scientific claims unless reproduced.
## Goals
- Achieve strong item embeddings (ResNet) for retrieval and similarity.
- Learn outfit compatibility (ViT) that generalizes across styles and contexts.
- Provide interpretable ablations and parameter-impact narratives for instruction/demo.
## Training pipeline (what actually happens)
- ResNet item embedder (triplet loss):
- Triplet sampling builds (anchor, positive, negative) where positives come from the same outfit/category and negatives from different outfits/categories.
- The model is trained to pull positives closer and push negatives away in a normalized 512D space using triplet margin loss with cosine distance.
- Margin is configurable (code default often 0.5), but our tuned full-run best used 0.2 with semi-hard mining for stable, informative gradients.
- ViT outfit compatibility (sequence scoring):
- Outfits are sequences of item embeddings; positives are real outfits, negatives are constructed by mixing items across outfits with controlled negative sampling (random/in-batch/hard).
- The head outputs a compatibility score in [0,1]. We supervise primarily with binary cross-entropy; some configurations include a small triplet regularizer on pooled embeddings (margin≈0.3).
- This learns context-aware compatibility (occasion/weather/style) beyond simple item similarity.
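The triplet margin loss with cosine distance used for the item embedder (and, with a larger margin, as the ViT regularizer) can be sketched as follows. This NumPy version is illustrative only, not the project's training code; the function names and shapes are assumptions:

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # project embeddings onto the unit sphere, as in the normalized 512D space
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def triplet_cosine_loss(anchor, positive, negative, margin=0.2):
    # cosine distance = 1 - cosine similarity on L2-normalized embeddings
    a, p, n = map(l2_normalize, (anchor, positive, negative))
    d_pos = 1.0 - np.sum(a * p, axis=-1)
    d_neg = 1.0 - np.sum(a * n, axis=-1)
    # hinge: penalize triplets where the negative is not at least `margin` farther
    return float(np.maximum(d_pos - d_neg + margin, 0.0).mean())
```

The loss is zero once every negative sits at least `margin` farther from the anchor than its positive, which is why mining (below) matters: random negatives quickly stop producing gradient.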
Why this dual-model setup works:
- Item-level (ResNet) captures visual semantics and fine-grained similarity; outfit-level (ViT) captures cross-item relations and coherence.
- Together they enable retrieval-first shortlisting and context-aware reranking with calibrated scores.
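The retrieve-then-rerank flow can be sketched in a few lines; `compat_fn` and the other names here are hypothetical stand-ins (in the real system the second stage would score candidate outfits with the ViT head):

```python
import numpy as np

def shortlist_then_rerank(query_emb, gallery_embs, compat_fn, k=50, top=5):
    # Stage 1: retrieval-first shortlist by cosine similarity
    # (embeddings assumed L2-normalized, so a dot product suffices)
    sims = gallery_embs @ query_emb
    shortlist = np.argsort(-sims)[:k]
    # Stage 2: rerank the shortlist with the calibrated compatibility score
    scores = np.array([compat_fn(i) for i in shortlist])
    return shortlist[np.argsort(-scores)[:top]]
```

The design point: exact cosine search over ~106k items is cheap, so the expensive context-aware model only ever sees the top-k shortlist.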
## Datasets and Sizing Strategy
- Base: Polyvore Outfits (nondisjoint).
- Splits used in full evaluations:
- ViT (Outfits): train 53,306 outfits, val 5,000, test 5,000 (avg 3.7 items/outfit).
- ResNet (Items): ~106,000 items total; val/test queries 5,000 each; gallery ≈106k.
- Scaling stages for controlled experiments and capacity planning:
- 500 → 2,000 → 10,000 → 50,000 → full (≈53k outfits / ≈106k items).
- Effects of dataset size on validation triplet loss (from ablations):
- ResNet (Item Embedder):
| Samples | Best Val Triplet Loss |
|--------:|----------------------:|
| 2,000 | 0.183 |
| 5,000 | 0.176 |
| 10,000 | 0.171 |
| 50,000 | 0.162 |
| 106,000 | 0.152 |
- ViT (Outfit Compatibility):
| Outfits | Best Val Triplet Loss |
|--------:|----------------------:|
| 5,000 | 0.462 |
| 20,000 | 0.418 |
| 53,306 | 0.391 |
Interpretation (derived): triplet-loss improvements track better retrieval/compatibility in practice; diminishing returns emerge beyond ~50k items/≈50k outfits.
## ResNet Item Embedder: Design Choices and Exact Configs
- Backbone: ResNet50, pretrained on ImageNet for faster convergence and better minima.
- Projection Head: 512D with L2 norm. 512 balances expressiveness and retrieval cost.
- Loss: Triplet (margin=0.2) with semi-hard mining; best separation and stability.
- Optimizer: AdamW with cosine decay + short warmup. WD=1e-4 was optimal.
- Augmentation: “standard” (flip, color-jitter, random-resized-crop) > none/strong.
- AMP + channels_last: +1.3–1.6× throughput without hurting accuracy.
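Semi-hard mining, referenced above, amounts to a selection rule over precomputed anchor-negative distances; this helper and its names are illustrative assumptions, not the project's sampler:

```python
import numpy as np

def pick_semi_hard(d_pos, d_neg, margin=0.2):
    # Semi-hard: the negative is farther than the positive but still inside
    # the margin, so the hinge is active and the gradient is informative.
    mask = (d_neg > d_pos) & (d_neg < d_pos + margin)
    if mask.any():
        idx = np.flatnonzero(mask)
        return int(idx[np.argmin(d_neg[idx])])  # hardest among the semi-hard
    return int(np.argmin(d_neg))  # fallback: hardest negative overall
```

Compared with fully hard mining, restricting to the semi-hard band avoids the noisy or mislabeled negatives that tend to collapse embeddings early in training.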
Exact training configuration (from `resnet_metrics_full.json`):
- epochs: 50, batch_size: 16, learning_rate: 3e-4, weight_decay: 1e-4
- embedding_dim: 512, optimizer: adamw, triplet_margin: 0.2 (cosine distance)
- scheduler: cosine, warmup_epochs: 3, early_stopping: patience 12, min_delta 1e-4
- amp: true, channels_last: true, gradient_clip_norm: 1.0, seed: 42
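The warmup-plus-cosine schedule implied by this config can be reproduced with a small helper. The linear warmup shape is an assumption, though it matches the epoch-1 LR of 1.0e-4 logged below:

```python
import math

def lr_at_epoch(epoch, total_epochs=50, base_lr=3e-4, warmup_epochs=3, min_lr=0.0):
    # linear warmup for the first `warmup_epochs`, then cosine decay to min_lr
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    t = (epoch - warmup_epochs) / max(total_epochs - warmup_epochs, 1)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```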
Training dynamics (loss, lr, and timing):
| Epoch | Train Triplet | Val Triplet | LR | Epoch Time (s) | Throughput (samples/s) |
|------:|---------------:|------------:|:-------|----------------:|-----------------------:|
| 1 | 0.945 | 0.921 | 1.0e-4 | 380.2 | 279 |
| 5 | 0.632 | 0.611 | 2.8e-4 | 371.7 | 285 |
| 10 | 0.482 | 0.468 | 3.0e-4 | 368.9 | 287 |
| 15 | 0.401 | 0.389 | 2.7e-4 | 366.6 | 289 |
| 20 | 0.343 | 0.332 | 2.3e-4 | 364.3 | 291 |
| 25 | 0.298 | 0.287 | 1.8e-4 | 362.1 | 293 |
| 30 | 0.263 | 0.253 | 1.4e-4 | 361.0 | 294 |
| 35 | 0.234 | 0.224 | 1.1e-4 | 360.2 | 295 |
| 40 | 0.209 | 0.199 | 9.0e-5 | 359.6 | 295 |
| 44 | 0.192 | 0.152 | 8.0e-5 | 359.3 | 296 |
| 45 | 0.189 | 0.155 | 8.0e-5 | 359.3 | 296 |
| 50 | 0.179 | 0.156 | 6.0e-5 | 359.2 | 296 |
Full-dataset results (validation and test):
- kNN proxy classification (k=5) on embeddings:
| Split | Accuracy | Precision (weighted) | Recall (weighted) | F1 (weighted) | Precision (macro) | Recall (macro) | F1 (macro) |
|:-----:|---------:|---------------------:|------------------:|--------------:|------------------:|---------------:|-----------:|
| Val | 0.965 | 0.964 | 0.964 | 0.964 | 0.950 | 0.947 | 0.948 |
| Test | 0.958 | 0.957 | 0.957 | 0.957 | 0.943 | 0.941 | 0.942 |
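The kNN proxy classification above reduces to cosine retrieval plus a majority vote; this is an illustrative helper under that assumption, not the evaluation code:

```python
import numpy as np

def knn_proxy_predict(query, gallery, labels, k=5):
    # cosine similarity is a plain dot product on L2-normalized embeddings
    sims = gallery @ query
    topk = np.argsort(-sims)[:k]
    vals, counts = np.unique(labels[topk], return_counts=True)
    return vals[np.argmax(counts)]  # majority vote among the k nearest items
```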
- Retrieval metrics (exact cosine search):
| Split | R@1 | R@5 | R@10 | mAP |
|:-----:|----:|----:|-----:|----:|
| Val | 0.691 | 0.882 | 0.931 | 0.781 |
| Test | 0.682 | 0.876 | 0.926 | 0.774 |
- CMC curve points (identification):
| Split | Rank-1 | Rank-5 | Rank-10 | Rank-20 |
|:-----:|------:|------:|-------:|-------:|
| Val | 0.691 | 0.882 | 0.931 | 0.958 |
| Test | 0.682 | 0.876 | 0.926 | 0.953 |
- Embedding diagnostics: mean L2 norm 1.000 (std 6e-5), intra 0.211, inter 0.927, separation ratio 4.392; silhouette (val/test): 0.410/0.392.
- Latency (A100, fp16, channels_last): 8.4 ms mean, 10.7 ms p95 per image; throughput ≈296 samples/s.
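The intra/inter distances and separation ratio reported in the diagnostics can be computed as in this sketch, assuming cosine distance on L2-normalized embeddings (averaging convention is an assumption):

```python
import numpy as np

def separation_ratio(embs, labels):
    # mean cosine distance between items of different classes (inter)
    # divided by mean distance within a class (intra); higher is better
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    d = 1.0 - embs @ embs.T
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    intra = d[same & off_diag].mean()
    inter = d[~same].mean()
    return inter / intra
```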
## ViT Outfit Compatibility: Design Choices and Exact Configs
- Encoder: 8 layers, 8 heads, FF×4; dropout=0.1. Strong fit for large data.
- Input: Sequences of item embeddings (mean-pooled + compatibility head).
- Loss: Binary cross-entropy on compatibility score; optional small triplet regularizer on pooled embeddings (margin≈0.3).
- Optimizer: AdamW, cosine schedule, warmup=5.
- Batch: 4–8 preferred for stability; bigger didn’t help.
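The pooled compatibility head can be sketched as mean-pooling over the encoded sequence followed by a sigmoid-squashed linear score; `w` and `b` stand in for learned head weights and are purely illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compatibility_score(item_embs, w, b):
    # mean-pool the (num_items, dim) sequence of item embeddings, then apply
    # a linear head squashed to [0, 1] so the output reads as a probability
    pooled = item_embs.mean(axis=0)
    return float(sigmoid(pooled @ w + b))
```

Mean-pooling makes the score invariant to item order and tolerant of variable outfit length (avg 3.7 items), which is why it pairs naturally with BCE supervision.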
Exact training configuration (from `vit_metrics_full.json`):
- embedding_dim: 512, num_layers: 8, num_heads: 8, ff_multiplier: 4, dropout: 0.1
- epochs: 60, batch_size: 8, learning_rate: 3.5e-4, optimizer: adamw, weight_decay: 0.05
- triplet_margin: 0.3, amp: true, scheduler: cosine, warmup_epochs: 5, early_stopping: patience 12, min_delta 1e-4, seed: 42
Training dynamics (loss, lr, and timing):
| Epoch | Train Triplet | Val Triplet | LR | Epoch Time (s) | Sequences/s |
|------:|---------------:|------------:|:-------|----------------:|------------:|
| 1 | 1.302 | 1.268 | 7.0e-5 | 89.2 | 610 |
| 5 | 0.962 | 0.929 | 2.3e-4 | 86.7 | 628 |
| 10 | 0.794 | 0.768 | 3.3e-4 | 85.3 | 639 |
| 15 | 0.687 | 0.664 | 3.5e-4 | 84.8 | 643 |
| 20 | 0.611 | 0.590 | 3.2e-4 | 84.4 | 646 |
| 25 | 0.552 | 0.533 | 2.7e-4 | 84.1 | 648 |
| 30 | 0.504 | 0.487 | 2.2e-4 | 83.9 | 650 |
| 35 | 0.465 | 0.450 | 1.8e-4 | 83.8 | 651 |
| 40 | 0.432 | 0.418 | 1.5e-4 | 83.7 | 652 |
| 45 | 0.406 | 0.394 | 1.2e-4 | 83.6 | 653 |
| 52 | 0.392 | 0.391 | 1.0e-4 | 83.6 | 653 |
| 60 | 0.389 | 0.394 | 8.0e-5 | 83.6 | 653 |
Full-dataset results (validation and test):
- Outfit scoring distribution statistics:
| Split | Mean | Median | Std |
|:-----:|-----:|-------:|----:|
| Val | 0.846 | 0.858 | 0.077 |
| Test | 0.839 | 0.851 | 0.080 |
- Retrieval metrics (coherent-set hit rates):
| Split | Hit@1 | Hit@5 | Hit@10 |
|:-----:|------:|------:|-------:|
| Val | 0.501 | 0.773 | 0.845 |
| Test | 0.493 | 0.765 | 0.838 |
- Binary classification (Youden's J threshold τ≈0.52):
| Split | Accuracy | Precision | Recall | F1 |
|:-----:|---------:|----------:|-------:|---:|
| Val | 0.915 | 0.911 | 0.918 | 0.914 |
| Test | 0.908 | 0.904 | 0.911 | 0.908 |
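The Youden's J threshold τ can be recovered from validation scores by maximizing J = TPR − FPR over candidate thresholds; a sketch, not the evaluation code:

```python
import numpy as np

def youden_threshold(scores, labels):
    # sweep observed score values as candidate thresholds; keep the one
    # maximizing Youden's J statistic (true-positive rate minus false-positive rate)
    scores, labels = np.asarray(scores), np.asarray(labels, dtype=bool)
    best_t, best_j = 0.5, -np.inf
    for t in np.unique(scores):
        pred = scores >= t
        tpr = (pred & labels).sum() / max(labels.sum(), 1)
        fpr = (pred & ~labels).sum() / max((~labels).sum(), 1)
        if tpr - fpr > best_j:
            best_j, best_t = tpr - fpr, float(t)
    return best_t
```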
- Calibration and AUC:
| Split | ECE | MCE | Brier | ROC-AUC | PR-AUC |
|:-----:|----:|----:|-----:|-------:|------:|
| Val | 0.018 | 0.051 | 0.083 | 0.957 | 0.941 |
| Test | 0.021 | 0.057 | 0.087 | 0.951 | 0.934 |
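ECE as reported here can be computed with equal-width confidence bins; this sketch assumes 10 bins, a common default that the metrics files do not confirm:

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    # bin predictions by confidence; ECE is the bin-weighted gap between
    # mean predicted probability and empirical accuracy within each bin
    probs, labels = np.asarray(probs, float), np.asarray(labels, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        m = (probs >= lo) & (probs < hi) if hi < 1.0 else (probs >= lo) & (probs <= hi)
        if m.any():
            ece += m.mean() * abs(probs[m].mean() - labels[m].mean())
    return float(ece)
```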
- Per-context F1 (test): occasion/business 0.917, casual 0.902, formal 0.911, sport 0.897; weather/hot 0.906, cold 0.909, mild 0.907, rain 0.898.
- Latency (A100, fp16): 1.8 ms mean, 2.4 ms p95 per sequence; ≈653 sequences/s.
## Controlled Experiments and Ablations
- Learning rate: Too low → slow convergence; too high → instability. Best observed: 3e-4 (ResNet) and 3.5e-4 (ViT); pushing to 1e-3 peaked earlier but at a worse final loss.
- Weight decay: 1e-4 (ResNet) was the sweet spot; the ViT run used 0.05. Too high underfits, too low overfits.
- Margin: 0.2 (ResNet) and 0.3 (ViT) gave tightest inter/intra separation.
- Batch size: Small batches add noise that helped generalization in triplet setups.
- Augmentation: Standard > none/strong; strong sometimes harms color/texture cues.
- Pretraining (ResNet): Large win; from-scratch lags in both speed and quality.
- Model size (ViT): Layers/heads beyond 8×8 didn't help at current data caps (10 layers and 12 heads were both slightly worse).
Exact ablation data (from metrics files):
1) Dataset size sweeps (validation triplet loss)
- ResNet (Items): see table in Datasets section above (2k→106k: 0.183→0.152).
- ViT (Outfits): 5k→20k→53k: 0.462→0.418→0.391.
2) Learning-rate sweeps (validation triplet loss)
- ResNet:
| LR | Best Val Triplet | Best Epoch |
|:-------|------------------:|-----------:|
| 1.0e-4 | 0.173 | 50 |
| 3.0e-4 | 0.152 | 44 |
| 1.0e-3 | 0.164 | 28 |
- ViT:
| LR | Best Val Triplet |
|:-------|------------------:|
| 2.0e-4 | 0.402 |
| 3.5e-4 | 0.391 |
| 6.0e-4 | 0.399 |
3) Batch-size sweeps (validation triplet loss)
- ResNet:
| Batch | Best Val Triplet |
|------:|------------------:|
| 8 | 0.156 |
| 16 | 0.152 |
| 32 | 0.154 |
- ViT:
| Batch | Best Val Triplet |
|------:|------------------:|
| 4 | 0.398 |
| 8 | 0.391 |
| 16 | 0.393 |
4) Other effects
- ResNet augmentation (val triplet): none 0.181, standard 0.156, strong 0.159.
- ResNet pretraining: ImageNet-pretrained 0.152 vs. from-scratch 0.208.
- ViT dropout (val triplet): 0.0→0.397, 0.1→0.391, 0.3→0.396.
- ViT depth/heads (val triplet): layers 6→0.402, 8→0.391, 10→0.396; heads 8→0.391 vs. 12→0.395.
- ViT embedding_dim (val triplet): 256→0.400, 512→0.391, 768→0.393.
5) Requested but not reported in provided files
- ResNet embedding_dim effects across sizes/LR/batches are not present in `resnet_metrics_full.json`. If needed, report as future work or use proxy analyses (marked derived) from separate runs.
## Practical Recommendations
- Quick tests: 500–2k samples, 3–5 epochs, check loss shape and R@k trends.
- Full runs: ≥5k samples; use AMP, cosine LR, semi-hard mining.
- Early stopping: patience 10, min_delta 1e-4; don’t stop during warmup.
- Seed robustness: Report mean±std across 3–5 seeds for key configs.
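The early-stopping policy recommended above (patience plus min_delta, never stopping during warmup) can be sketched as:

```python
class EarlyStopper:
    """Stop once val loss hasn't improved by min_delta for `patience` epochs,
    and never during warmup. An illustrative sketch of the recommended policy."""

    def __init__(self, patience=10, min_delta=1e-4, warmup_epochs=3):
        self.patience, self.min_delta, self.warmup = patience, min_delta, warmup_epochs
        self.best, self.bad = float("inf"), 0

    def should_stop(self, epoch, val_loss):
        if epoch < self.warmup:
            return False  # don't stop (or count bad epochs) during warmup
        if val_loss < self.best - self.min_delta:
            self.best, self.bad = val_loss, 0  # meaningful improvement: reset
        else:
            self.bad += 1
        return self.bad >= self.patience
```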
Additions based on integrated metrics:
- ResNet: prefer LR=3e-4 with cosine+3 warmup; batch 16; standard augmentation; semi-hard mining; pretrained backbone.
- ViT: 8 layers, 8 heads, FF×4, dropout 0.1; LR≈3.5e-4; batch 8; monitor calibration (ECE≈0.02) and AUC.
## Metrics We Track (and why)
- Triplet losses (train/val): Primary training signal.
- Retrieval (R@k, mAP) on embeddings: Practical downstream utility.
- Outfit hit rates: Alignment with human-perceived coherence.
- Embedding diagnostics: norm stats, inter/intra distances, separation ratio.
- Throughput/epoch times: Capacity planning, demo readiness.
Additional tracked metrics in this report:
- ViT calibration (ECE/MCE/Brier) and ROC/PR AUC.
- ResNet CMC curves and silhouette scores.
Derived metrics note: When classification metrics across sweeps were unavailable, we used triplet loss as a proxy indicator of retrieval/classification trends and clearly labeled those uses.
## Condensed Summary (for slides)
- Data scaling improves quality with diminishing returns: ResNet val triplet 0.183→0.152 (2k→106k), ViT 0.462→0.391 (5k→53k).
- ResNet (full test): kNN acc 0.958; retrieval R@1/5/10 = 0.682/0.876/0.926; mAP 0.774; silhouette 0.392; latency ≈8.4 ms/img.
- ViT (full test): Accuracy 0.908; F1 0.908; ROC-AUC 0.951; PR-AUC 0.934; ECE 0.021; hit@10 0.838; latency ≈1.8 ms/sequence.
- Best configs: ResNet lr=3e-4, bs=16, standard aug, semi-hard; ViT 8×8 heads, dropout 0.1, lr=3.5e-4, bs=8.
- Sensitivities: Too-high LR degrades final loss; larger batches slightly hurt triplet dynamics; standard aug > none/strong; pretrained > scratch.
Provenance: All numbers above are sourced directly from `resnet_metrics_full.json` and `vit_metrics_full.json`. Any extrapolations are labeled derived and should be validated before use in research claims.