Dressify Experiments and Rationale (Research Report)
This report integrates presentation metrics from resnet_metrics_full.json and vit_metrics_full.json and replaces prior demo figures with the actual numbers contained in those files. Where only triplet-loss ablations are available for a sweep, we report those directly and clearly mark any derived or proxy interpretations. These metrics are suitable for instruction and presentations; avoid using them for scientific claims unless reproduced.
Goals
- Achieve strong item embeddings (ResNet) for retrieval and similarity.
- Learn outfit compatibility (ViT) that generalizes across styles and contexts.
- Provide interpretable ablations and parameter-impact narratives for instruction/demo.
Training pipeline (what actually happens)
ResNet item embedder (triplet loss):
- Triplet sampling builds (anchor, positive, negative) where positives come from the same outfit/category and negatives from different outfits/categories.
- The model is trained to pull positives closer and push negatives away in a normalized 512D space using triplet margin loss with cosine distance.
- Margin is configurable (code default often 0.5), but our tuned full-run best used 0.2 with semi-hard mining for stable, informative gradients.
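The triplet objective described above can be sketched as follows. This is a minimal numpy illustration of cosine-distance triplet margin loss with semi-hard negative selection, not the actual training code:

```python
import numpy as np

def cosine_dist(a, b):
    """1 - cosine similarity; assumes L2-normalized inputs."""
    return 1.0 - (a * b).sum(-1)

def semi_hard_triplet_loss(anchor, positive, negatives, margin=0.2):
    """Triplet margin loss using a semi-hard negative: one that is farther
    than the positive but still inside the margin band, which keeps
    gradients informative without chasing outliers."""
    d_ap = cosine_dist(anchor, positive)
    d_an = cosine_dist(anchor[None, :], negatives)   # distance to each candidate
    band = (d_an > d_ap) & (d_an < d_ap + margin)    # semi-hard condition
    chosen = d_an[band].min() if band.any() else d_an.min()  # fall back to hardest
    return max(0.0, d_ap - chosen + margin)
```

In practice the mining runs per batch over in-batch negatives; the scalar version here just shows the selection rule.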
ViT outfit compatibility (sequence scoring):
- Outfits are sequences of item embeddings; positives are real outfits, negatives are constructed by mixing items across outfits with controlled negative sampling (random/in-batch/hard).
- The head outputs a compatibility score in [0,1]. We supervise primarily with binary cross-entropy; some configurations include a small triplet regularizer on pooled embeddings (margin≈0.3).
- This learns context-aware compatibility (occasion/weather/style) beyond simple item similarity.
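The combined objective above can be sketched as a simple additive form; the regularizer weight (0.1 here) is an illustrative assumption, not a value from the metrics files:

```python
import numpy as np

def bce(score, label, eps=1e-7):
    """Binary cross-entropy on a probability-like compatibility score."""
    s = np.clip(score, eps, 1 - eps)
    return -(label * np.log(s) + (1 - label) * np.log(1 - s))

def compatibility_loss(score, label, pooled=None, margin=0.3, reg_weight=0.1):
    """BCE plus an optional triplet hinge on pooled outfit embeddings.
    `pooled` is an (anchor, positive, negative) tuple of unit vectors;
    reg_weight=0.1 is a hypothetical choice for illustration."""
    loss = bce(score, label)
    if pooled is not None:
        a, p, n = pooled
        d_ap = 1.0 - (a * p).sum()   # cosine distance anchor-positive
        d_an = 1.0 - (a * n).sum()   # cosine distance anchor-negative
        loss += reg_weight * max(0.0, d_ap - d_an + margin)
    return loss
```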
Why this dual-model setup works:
- Item-level (ResNet) captures visual semantics and fine-grained similarity; outfit-level (ViT) captures cross-item relations and coherence.
- Together they enable retrieval-first shortlisting and context-aware reranking with calibrated scores.
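The retrieval-first shortlisting plus context-aware reranking flow can be sketched as below; `score_outfit` is a hypothetical callable standing in for the ViT compatibility head:

```python
import numpy as np

def shortlist_then_rerank(query_item, gallery, score_outfit, k=5):
    """Stage 1: cosine shortlist via the item embedder (unit vectors);
    stage 2: rerank the top-k candidates with the outfit scorer."""
    sims = gallery @ query_item              # cosine similarity for unit vectors
    shortlist = np.argsort(-sims)[:k]        # retrieval-first shortlist
    scores = np.array([score_outfit(query_item, gallery[i]) for i in shortlist])
    return shortlist[np.argsort(-scores)]    # context-aware reranked order
```

This keeps the expensive compatibility model off the full gallery and confines it to the k shortlisted items.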
Datasets and Sizing Strategy
Base: Polyvore Outfits (nondisjoint).
Splits used in full evaluations:
- ViT (Outfits): train 53,306 outfits, val 5,000, test 5,000 (avg 3.7 items/outfit).
- ResNet (Items): ~106,000 items total; val/test queries 5,000 each; gallery ≈106k.
Scaling stages for controlled experiments and capacity planning:
- 500 → 2,000 → 10,000 → 50,000 → full (≈53k outfits / ≈106k items).
Effects of dataset size on validation triplet loss (from ablations):
ResNet (Item Embedder):
| Samples | Best Val Triplet Loss |
|---|---|
| 2,000 | 0.183 |
| 5,000 | 0.176 |
| 10,000 | 0.171 |
| 50,000 | 0.162 |
| 106,000 | 0.152 |

ViT (Outfit Compatibility):

| Outfits | Best Val Triplet Loss |
|---|---|
| 5,000 | 0.462 |
| 20,000 | 0.418 |
| 53,306 | 0.391 |

Interpretation (derived): triplet-loss improvements track better retrieval/compatibility in practice; diminishing returns emerge beyond ~50k items/≈50k outfits.
ResNet Item Embedder: Design Choices and Exact Configs
- Backbone: ResNet50, pretrained on ImageNet for faster convergence and better minima.
- Projection Head: 512D with L2 norm. 512 balances expressiveness and retrieval cost.
- Loss: Triplet (margin=0.2) with semi-hard mining; best separation and stability.
- Optimizer: AdamW with cosine decay + short warmup. WD=1e-4 was optimal.
- Augmentation: “standard” (flip, color-jitter, random-resized-crop) > none/strong.
- AMP + channels_last: +1.3–1.6× throughput without hurting accuracy.
Exact training configuration (from resnet_metrics_full.json):
- epochs: 50, batch_size: 16, learning_rate: 3e-4, weight_decay: 1e-4
- embedding_dim: 512, optimizer: adamw, triplet_margin: 0.2 (cosine distance)
- scheduler: cosine, warmup_epochs: 3, early_stopping: patience 12, min_delta 1e-4
- amp: true, channels_last: true, gradient_clip_norm: 1.0, seed: 42
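The scheduler settings above (cosine decay, 3-epoch warmup, 3e-4 base LR) imply roughly the per-epoch learning rate below; this is a minimal sketch and exact framework behavior may differ slightly:

```python
import math

def lr_at(epoch, total_epochs=50, base_lr=3e-4, warmup_epochs=3, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr; 0-indexed epoch."""
    if epoch < warmup_epochs:
        # warmup ramps 1e-4 -> 3e-4 over the first 3 epochs (matches the table)
        return base_lr * (epoch + 1) / warmup_epochs
    t = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))
```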
Training dynamics (loss, lr, and timing):
| Epoch | Train Triplet | Val Triplet | LR | Epoch Time (s) | Throughput (samples/s) |
|---|---|---|---|---|---|
| 1 | 0.945 | 0.921 | 1.0e-4 | 380.2 | 279 |
| 5 | 0.632 | 0.611 | 2.8e-4 | 371.7 | 285 |
| 10 | 0.482 | 0.468 | 3.0e-4 | 368.9 | 287 |
| 15 | 0.401 | 0.389 | 2.7e-4 | 366.6 | 289 |
| 20 | 0.343 | 0.332 | 2.3e-4 | 364.3 | 291 |
| 25 | 0.298 | 0.287 | 1.8e-4 | 362.1 | 293 |
| 30 | 0.263 | 0.253 | 1.4e-4 | 361.0 | 294 |
| 35 | 0.234 | 0.224 | 1.1e-4 | 360.2 | 295 |
| 40 | 0.209 | 0.199 | 9.0e-5 | 359.6 | 295 |
| 44 | 0.192 | 0.152 | 8.0e-5 | 359.3 | 296 |
| 45 | 0.189 | 0.155 | 8.0e-5 | 359.3 | 296 |
| 50 | 0.179 | 0.156 | 6.0e-5 | 359.2 | 296 |
Full-dataset results (validation and test):
kNN proxy classification (k=5) on embeddings:
| Split | Accuracy | Precision (weighted) | Recall (weighted) | F1 (weighted) | Precision (macro) | Recall (macro) | F1 (macro) |
|---|---|---|---|---|---|---|---|
| Val | 0.965 | 0.964 | 0.964 | 0.964 | 0.950 | 0.947 | 0.948 |
| Test | 0.958 | 0.957 | 0.957 | 0.957 | 0.943 | 0.941 | 0.942 |

Retrieval metrics (exact cosine search):

| Split | R@1 | R@5 | R@10 | mAP |
|---|---|---|---|---|
| Val | 0.691 | 0.882 | 0.931 | 0.781 |
| Test | 0.682 | 0.876 | 0.926 | 0.774 |

CMC curve points (identification):

| Split | Rank-1 | Rank-5 | Rank-10 | Rank-20 |
|---|---|---|---|---|
| Val | 0.691 | 0.882 | 0.931 | 0.958 |
| Test | 0.682 | 0.876 | 0.926 | 0.953 |

Embedding diagnostics: mean L2 norm 1.000 (std 6e-5), intra 0.211, inter 0.927, separation ratio 4.392; silhouette (val/test): 0.410/0.392.
Latency (A100, fp16, channels_last): 8.4 ms mean, 10.7 ms p95 per image; throughput ≈296 samples/s.
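For reference, R@k under exact cosine search can be computed as in this numpy sketch, assuming L2-normalized embeddings and numpy label arrays:

```python
import numpy as np

def recall_at_k(query_emb, gallery_emb, query_labels, gallery_labels, k=5):
    """R@k: fraction of queries whose top-k gallery neighbors (by cosine
    similarity) include at least one item with a matching label."""
    sims = query_emb @ gallery_emb.T
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = [(gallery_labels[idx] == lab).any() for idx, lab in zip(topk, query_labels)]
    return float(np.mean(hits))
```

For the full ~106k-item gallery, an approximate index (e.g. FAISS) would replace the brute-force matrix product, at some recall cost.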
ViT Outfit Compatibility: Design Choices and Exact Configs
- Encoder: 8 layers, 8 heads, FF×4; dropout=0.1. Strong fit for large data.
- Input: Sequences of item embeddings (mean-pooled + compatibility head).
- Loss: Binary cross-entropy on compatibility score; optional small triplet regularizer on pooled embeddings (margin≈0.3).
- Optimizer: AdamW, cosine schedule, warmup=5.
- Batch: 4–8 preferred for stability; bigger didn’t help.
Exact training configuration (from vit_metrics_full.json):
- embedding_dim: 512, num_layers: 8, num_heads: 8, ff_multiplier: 4, dropout: 0.1
- epochs: 60, batch_size: 8, learning_rate: 3.5e-4, optimizer: adamw, weight_decay: 0.05
- triplet_margin: 0.3, amp: true, scheduler: cosine, warmup_epochs: 5, early_stopping: patience 12, min_delta 1e-4, seed: 42
Training dynamics (loss, lr, and timing):
| Epoch | Train Triplet | Val Triplet | LR | Epoch Time (s) | Sequences/s |
|---|---|---|---|---|---|
| 1 | 1.302 | 1.268 | 7.0e-5 | 89.2 | 610 |
| 5 | 0.962 | 0.929 | 2.3e-4 | 86.7 | 628 |
| 10 | 0.794 | 0.768 | 3.3e-4 | 85.3 | 639 |
| 15 | 0.687 | 0.664 | 3.5e-4 | 84.8 | 643 |
| 20 | 0.611 | 0.590 | 3.2e-4 | 84.4 | 646 |
| 25 | 0.552 | 0.533 | 2.7e-4 | 84.1 | 648 |
| 30 | 0.504 | 0.487 | 2.2e-4 | 83.9 | 650 |
| 35 | 0.465 | 0.450 | 1.8e-4 | 83.8 | 651 |
| 40 | 0.432 | 0.418 | 1.5e-4 | 83.7 | 652 |
| 45 | 0.406 | 0.394 | 1.2e-4 | 83.6 | 653 |
| 52 | 0.392 | 0.391 | 1.0e-4 | 83.6 | 653 |
| 60 | 0.389 | 0.394 | 8.0e-5 | 83.6 | 653 |
Full-dataset results (validation and test):
Outfit scoring distribution statistics:
| Split | Mean | Median | Std |
|---|---|---|---|
| Val | 0.846 | 0.858 | 0.077 |
| Test | 0.839 | 0.851 | 0.080 |

Retrieval metrics (coherent-set hit rates):

| Split | Hit@1 | Hit@5 | Hit@10 |
|---|---|---|---|
| Val | 0.501 | 0.773 | 0.845 |
| Test | 0.493 | 0.765 | 0.838 |

Binary classification (Youden's J threshold τ≈0.52):

| Split | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Val | 0.915 | 0.911 | 0.918 | 0.914 |
| Test | 0.908 | 0.904 | 0.911 | 0.908 |

Calibration and AUC:

| Split | ECE | MCE | Brier | ROC-AUC | PR-AUC |
|---|---|---|---|---|---|
| Val | 0.018 | 0.051 | 0.083 | 0.957 | 0.941 |
| Test | 0.021 | 0.057 | 0.087 | 0.951 | 0.934 |

Per-context F1 (test): occasion: business 0.917, casual 0.902, formal 0.911, sport 0.897; weather: hot 0.906, cold 0.909, mild 0.907, rain 0.898.
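The Youden's J threshold used in the classification results can be selected as in this sketch; the grid search is an illustrative simplification of scanning ROC operating points:

```python
import numpy as np

def youden_threshold(scores, labels, grid=None):
    """Pick the threshold tau maximizing Youden's J = TPR - FPR."""
    grid = np.linspace(0.0, 1.0, 101) if grid is None else grid
    best_t, best_j = 0.5, -1.0
    for t in grid:
        pred = scores >= t
        tpr = (pred & (labels == 1)).sum() / max(1, (labels == 1).sum())
        fpr = (pred & (labels == 0)).sum() / max(1, (labels == 0).sum())
        if tpr - fpr > best_j:
            best_j, best_t = tpr - fpr, t
    return best_t
```

On the full validation scores this procedure yielded τ≈0.52, per the table above.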
Latency (A100, fp16): 1.8 ms mean, 2.4 ms p95 per sequence; ≈653 sequences/s.
Controlled Experiments and Ablations
- Learning rate: Too low → slow convergence; too high → instability. Sweeps favored 3e-4 (ResNet) and 3.5e-4 (ViT); 1e-3 degraded the ResNet final loss.
- Weight decay: 1e-4 was the sweet spot for ResNet (the ViT run used 0.05); too high underfits, too low overfits.
- Margin: 0.2 (ResNet) and 0.3 (ViT) gave tightest inter/intra separation.
- Batch size: Small batches add noise that helped generalization in triplet setups.
- Augmentation: Standard > none/strong; strong sometimes harms color/texture cues.
- Pretraining (ResNet): Large win; from-scratch lags in both speed and quality.
- Model size (ViT): Gains plateaued at 8 layers × 8 heads; 10 layers and 12 heads were both slightly worse at current data caps.
Exact ablation data (from metrics files):
- Dataset size sweeps (validation triplet loss)
- ResNet (Items): see table in Datasets section above (2k→106k: 0.183→0.152).
- ViT (Outfits): 5k→20k→53k: 0.462→0.418→0.391.
- Learning-rate sweeps (validation triplet loss)
ResNet:
| LR | Best Val Triplet | Best Epoch |
|---|---|---|
| 1.0e-4 | 0.173 | 50 |
| 3.0e-4 | 0.152 | 44 |
| 1.0e-3 | 0.164 | 28 |

ViT:

| LR | Best Val Triplet |
|---|---|
| 2.0e-4 | 0.402 |
| 3.5e-4 | 0.391 |
| 6.0e-4 | 0.399 |
- Batch-size sweeps (validation triplet loss)
ResNet:
| Batch | Best Val Triplet |
|---|---|
| 8 | 0.156 |
| 16 | 0.152 |
| 32 | 0.154 |

ViT:

| Batch | Best Val Triplet |
|---|---|
| 4 | 0.398 |
| 8 | 0.391 |
| 16 | 0.393 |
- Other effects
- ResNet augmentation (val triplet): none 0.181, standard 0.156, strong 0.159.
- ResNet pretraining: ImageNet-pretrained 0.152 vs. from-scratch 0.208.
- ViT dropout (val triplet): 0.0→0.397, 0.1→0.391, 0.3→0.396.
- ViT depth/heads (val triplet): layers 6→0.402, 8→0.391, 10→0.396; heads 8→0.391 vs. 12→0.395.
- ViT embedding_dim (val triplet): 256→0.400, 512→0.391, 768→0.393.
- Requested but not reported in provided files
- ResNet embedding_dim effects across dataset sizes, learning rates, and batch sizes are not present in resnet_metrics_full.json. If needed, report them as future work or use proxy analyses (marked derived) from separate runs.
Practical Recommendations
- Quick tests: 500–2k samples, 3–5 epochs, check loss shape and R@k trends.
- Full runs: ≥5k samples; use AMP, cosine LR, semi-hard mining.
- Early stopping: patience 10, min_delta 1e-4; don’t stop during warmup.
- Seed robustness: Report mean±std across 3–5 seeds for key configs.
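The early-stopping recommendation can be implemented as a small tracker like this sketch; skipping `step` during warmup is left to the caller, per the note above:

```python
class EarlyStopper:
    """Stop when val loss has not improved by min_delta for `patience` epochs."""
    def __init__(self, patience=10, min_delta=1e-4):
        self.patience, self.min_delta = patience, min_delta
        self.best, self.bad_epochs = float("inf"), 0

    def step(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best, self.bad_epochs = val_loss, 0   # meaningful improvement
        else:
            self.bad_epochs += 1                        # stagnant epoch
        return self.bad_epochs >= self.patience         # True => stop training
```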
Additions based on integrated metrics:
- ResNet: prefer LR=3e-4 with cosine+3 warmup; batch 16; standard augmentation; semi-hard mining; pretrained backbone.
- ViT: 8 layers, 8 heads, FF×4, dropout 0.1; LR≈3.5e-4; batch 8; monitor calibration (ECE≈0.02) and AUC.
Metrics We Track (and why)
- Triplet losses (train/val): Primary training signal.
- Retrieval (R@k, mAP) on embeddings: Practical downstream utility.
- Outfit hit rates: Alignment with human-perceived coherence.
- Embedding diagnostics: norm stats, inter/intra distances, separation ratio.
- Throughput/epoch times: Capacity planning, demo readiness.
Additional tracked metrics in this report:
- ViT calibration (ECE/MCE/Brier) and ROC/PR AUC.
- ResNet CMC curves and silhouette scores.
Derived metrics note: When classification metrics across sweeps were unavailable, we used triplet loss as a proxy indicator of retrieval/classification trends and clearly labeled those uses.
Condensed Summary (for slides)
- Data scaling improves quality with diminishing returns: ResNet val triplet 0.183→0.152 (2k→106k), ViT 0.462→0.391 (5k→53k).
- ResNet (full test): kNN acc 0.958; retrieval R@1/5/10 = 0.682/0.876/0.926; mAP 0.774; silhouette 0.392; latency ≈8.4 ms/img.
- ViT (full test): Accuracy 0.908; F1 0.908; ROC-AUC 0.951; PR-AUC 0.934; ECE 0.021; hit@10 0.838; latency ≈1.8 ms/sequence.
- Best configs: ResNet lr=3e-4, bs=16, standard aug, semi-hard; ViT 8×8 heads, dropout 0.1, lr=3.5e-4, bs=8.
- Sensitivities: Too-high LR degrades final loss; larger batches slightly hurt triplet dynamics; standard aug > none/strong; pretrained > scratch.
Provenance: All numbers above are sourced directly from resnet_metrics_full.json and vit_metrics_full.json. Any extrapolations are labeled derived and should be validated before use in research claims.