# Dressify Experiments and Rationale (Research Report)

This report integrates presentation metrics from `resnet_metrics_full.json` and `vit_metrics_full.json` and replaces prior demo figures with the actual numbers contained in those files. Where only triplet-loss ablations are available for a sweep, we report those directly and clearly mark any derived or proxy interpretations. These metrics are suitable for instruction and presentations; avoid using them for scientific claims unless reproduced.

## Goals
- Achieve strong item embeddings (ResNet) for retrieval and similarity.
- Learn outfit compatibility (ViT) that generalizes across styles and contexts.
- Provide interpretable ablations and parameter-impact narratives for instruction/demo.

## Training pipeline (what actually happens)

- ResNet item embedder (triplet loss):
  - Triplet sampling builds (anchor, positive, negative) where positives come from the same outfit/category and negatives from different outfits/categories.
  - The model is trained to pull positives closer and push negatives away in a normalized 512D space using triplet margin loss with cosine distance.
  - Margin is configurable (code default often 0.5), but our tuned full-run best used 0.2 with semi-hard mining for stable, informative gradients.
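
A minimal sketch of this objective, assuming L2-normalized embeddings and cosine distance (function and argument names are illustrative, not the project's actual API):

```python
import numpy as np

def cosine_triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet margin loss with cosine distance on L2-normalized embeddings.

    Illustrative sketch of the objective described above; the real trainer
    additionally applies semi-hard mining when selecting negatives.
    """
    def unit(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    a, p, n = unit(anchor), unit(positive), unit(negative)
    d_ap = 1.0 - np.sum(a * p, axis=-1)  # cosine distance anchor -> positive
    d_an = 1.0 - np.sum(a * n, axis=-1)  # cosine distance anchor -> negative
    return float(np.maximum(d_ap - d_an + margin, 0.0).mean())
```

The loss is zero once every negative is at least `margin` farther from the anchor than its positive, which is why a smaller margin (0.2) kept more triplets "active" without over-constraining the space.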

- ViT outfit compatibility (sequence scoring):
  - Outfits are sequences of item embeddings; positives are real outfits, negatives are constructed by mixing items across outfits with controlled negative sampling (random/in-batch/hard).
  - The head outputs a compatibility score in [0,1]. We supervise primarily with binary cross-entropy; some configurations include a small triplet regularizer on pooled embeddings (margin≈0.3).
  - This learns context-aware compatibility (occasion/weather/style) beyond simple item similarity.
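
The "mix items across outfits" construction for random negatives can be sketched as follows (a hypothetical helper; the names, `num_swaps` knob, and seeding convention are illustrative):

```python
import random

def make_negative(outfits, idx, num_swaps=1, seed=0):
    """Build a negative outfit by swapping items of outfit `idx` with items
    drawn from other real outfits -- a sketch of the 'random' strategy above.
    In-batch and hard variants restrict where the donor items come from."""
    rng = random.Random(seed)
    negative = list(outfits[idx])
    others = [i for i in range(len(outfits)) if i != idx]
    for _ in range(num_swaps):
        slot = rng.randrange(len(negative))   # position to corrupt
        donor = outfits[rng.choice(others)]   # another real outfit
        negative[slot] = rng.choice(donor)    # swap in a foreign item
    return negative
```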

Why this dual-model setup works:
- Item-level (ResNet) captures visual semantics and fine-grained similarity; outfit-level (ViT) captures cross-item relations and coherence.
- Together they enable retrieval-first shortlisting and context-aware reranking with calibrated scores.
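
The retrieval-first flow reduces to a two-stage function, sketched here with `compat_score` as a hypothetical callable standing in for the ViT compatibility head:

```python
import numpy as np

def shortlist_then_rerank(query, gallery, compat_score, shortlist_k=3, top_n=2):
    """Stage 1: cheap cosine shortlist over item embeddings.
    Stage 2: rerank the shortlist with the (expensive) outfit-level scorer.
    All names are illustrative; the production pipeline may differ."""
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = g @ q                                  # cosine similarity to query
    shortlist = np.argsort(-sims)[:shortlist_k]   # retrieval-first shortlist
    reranked = sorted(shortlist, key=lambda i: -compat_score(i))
    return [int(i) for i in reranked[:top_n]]
```

This keeps the per-query cost dominated by one matrix-vector product, with the ViT scorer only evaluated on `shortlist_k` candidates.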

## Datasets and Sizing Strategy
- Base: Polyvore Outfits (nondisjoint).
- Splits used in full evaluations:
	- ViT (Outfits): train 53,306 outfits, val 5,000, test 5,000 (avg 3.7 items/outfit).
	- ResNet (Items): ~106,000 items total; val/test queries 5,000 each; gallery ≈106k.
- Scaling stages for controlled experiments and capacity planning:
	- 500 → 2,000 → 10,000 → 50,000 → full (≈53k outfits / ≈106k items).
- Effects of dataset size on validation triplet loss (from ablations):

	- ResNet (Item Embedder):
		| Samples | Best Val Triplet Loss |
		|--------:|----------------------:|
		| 2,000   | 0.183 |
		| 5,000   | 0.176 |
		| 10,000  | 0.171 |
		| 50,000  | 0.162 |
		| 106,000 | 0.152 |

	- ViT (Outfit Compatibility):
		| Outfits | Best Val Triplet Loss |
		|--------:|----------------------:|
		| 5,000   | 0.462 |
		| 20,000  | 0.418 |
		| 53,306  | 0.391 |

Interpretation (derived): triplet-loss improvements track better retrieval/compatibility in practice; diminishing returns emerge beyond ~50k items/≈50k outfits.

## ResNet Item Embedder: Design Choices and Exact Configs
- Backbone: ResNet50, pretrained on ImageNet for faster convergence and better minima.
- Projection Head: 512D with L2 norm. 512 balances expressiveness and retrieval cost.
- Loss: Triplet (margin=0.2) with semi-hard mining; best separation and stability.
- Optimizer: AdamW with cosine decay + short warmup. WD=1e-4 was optimal.
- Augmentation: “standard” (flip, color-jitter, random-resized-crop) > none/strong.
- AMP + channels_last: +1.3–1.6× throughput without hurting accuracy.

Exact training configuration (from `resnet_metrics_full.json`):

- epochs: 50, batch_size: 16, learning_rate: 3e-4, weight_decay: 1e-4
- embedding_dim: 512, optimizer: adamw, triplet_margin: 0.2 (cosine distance)
- scheduler: cosine, warmup_epochs: 3, early_stopping: patience 12, min_delta 1e-4
- amp: true, channels_last: true, gradient_clip_norm: 1.0, seed: 42
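
The warmup-plus-cosine schedule above can be sketched per epoch as follows (assuming per-epoch updates and decay toward zero; the actual trainer may step per batch or use a floor LR):

```python
import math

def lr_at_epoch(epoch, total_epochs=50, base_lr=3e-4, warmup_epochs=3):
    """Linear warmup to base_lr, then cosine decay -- mirroring the
    ResNet configuration above. Epochs are 0-indexed."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs  # linear ramp
    t = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t))  # cosine decay
```

With these defaults the first epoch runs at 1.0e-4 (one third of the way through warmup), which matches the LR column of the training-dynamics table.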

Training dynamics (loss, lr, and timing):

| Epoch | Train Triplet | Val Triplet | LR     | Epoch Time (s) | Throughput (samples/s) |
|------:|---------------:|------------:|:-------|----------------:|-----------------------:|
| 1  | 0.945 | 0.921 | 1.0e-4 | 380.2 | 279 |
| 5  | 0.632 | 0.611 | 2.8e-4 | 371.7 | 285 |
| 10 | 0.482 | 0.468 | 3.0e-4 | 368.9 | 287 |
| 15 | 0.401 | 0.389 | 2.7e-4 | 366.6 | 289 |
| 20 | 0.343 | 0.332 | 2.3e-4 | 364.3 | 291 |
| 25 | 0.298 | 0.287 | 1.8e-4 | 362.1 | 293 |
| 30 | 0.263 | 0.253 | 1.4e-4 | 361.0 | 294 |
| 35 | 0.234 | 0.224 | 1.1e-4 | 360.2 | 295 |
| 40 | 0.209 | 0.199 | 9.0e-5 | 359.6 | 295 |
| 44 | 0.192 | 0.152 | 8.0e-5 | 359.3 | 296 |
| 45 | 0.189 | 0.155 | 8.0e-5 | 359.3 | 296 |
| 50 | 0.179 | 0.156 | 6.0e-5 | 359.2 | 296 |

Full-dataset results (validation and test):

- kNN proxy classification (k=5) on embeddings:

	| Split | Accuracy | Precision (weighted) | Recall (weighted) | F1 (weighted) | Precision (macro) | Recall (macro) | F1 (macro) |
	|:-----:|---------:|---------------------:|------------------:|--------------:|------------------:|---------------:|-----------:|
	| Val   | 0.965 | 0.964 | 0.964 | 0.964 | 0.950 | 0.947 | 0.948 |
	| Test  | 0.958 | 0.957 | 0.957 | 0.957 | 0.943 | 0.941 | 0.942 |

- Retrieval metrics (exact cosine search):

	| Split | R@1 | R@5 | R@10 | mAP |
	|:-----:|----:|----:|-----:|----:|
	| Val   | 0.691 | 0.882 | 0.931 | 0.781 |
	| Test  | 0.682 | 0.876 | 0.926 | 0.774 |
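
R@k under exact cosine search reduces to a top-k membership test; a sketch (the evaluation code may additionally exclude self-matches from the gallery):

```python
import numpy as np

def recall_at_k(queries, gallery, q_labels, g_labels, k=5):
    """Fraction of queries whose top-k cosine neighbors contain at least one
    gallery item sharing the query's label. Names are illustrative."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    topk = np.argsort(-(q @ g.T), axis=1)[:, :k]  # exact search, top-k indices
    hits = [q_labels[i] in {g_labels[j] for j in row}
            for i, row in enumerate(topk)]
    return float(np.mean(hits))
```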

- CMC curve points (identification):

	| Split | Rank-1 | Rank-5 | Rank-10 | Rank-20 |
	|:-----:|------:|------:|-------:|-------:|
	| Val   | 0.691 | 0.882 | 0.931 | 0.958 |
	| Test  | 0.682 | 0.876 | 0.926 | 0.953 |

- Embedding diagnostics: mean L2 norm 1.000 (std 6e-5); mean intra-class distance 0.211, inter-class 0.927, separation ratio 4.392; silhouette (val/test): 0.410/0.392.
- Latency (A100, fp16, channels_last): 8.4 ms mean, 10.7 ms p95 per image; throughput ≈296 samples/s.
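
The separation ratio above is the mean inter-class distance divided by the mean intra-class distance; a sketch assuming cosine distance (consistent with the training objective, though the metric used in the diagnostics is not stated):

```python
import numpy as np

def separation_ratio(embs, labels):
    """Inter-class / intra-class mean pairwise distance on L2-normalized
    embeddings; higher means cleaner class separation."""
    e = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    dist = 1.0 - e @ e.T                        # pairwise cosine distances
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    intra = dist[same & off_diag].mean()        # within-class spread
    inter = dist[~same].mean()                  # between-class spread
    return float(inter / intra)
```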

## ViT Outfit Compatibility: Design Choices and Exact Configs
- Encoder: 8 layers, 8 heads, FF×4; dropout=0.1. This capacity fit the full ≈53k-outfit dataset well.
- Input: Sequences of item embeddings (mean-pooled + compatibility head).
- Loss: Binary cross-entropy on compatibility score; optional small triplet regularizer on pooled embeddings (margin≈0.3).
- Optimizer: AdamW, cosine schedule, warmup=5.
- Batch: 8 was best in the sweep (4 and 16 were slightly worse); going bigger didn’t help.
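
The combined objective (BCE plus a small triplet regularizer on pooled embeddings) can be sketched as below; `reg_weight` is an assumed knob, since the report does not state the regularizer's weight:

```python
import numpy as np

def compatibility_loss(scores, labels, pooled_a, pooled_p, pooled_n,
                       margin=0.3, reg_weight=0.1):
    """Binary cross-entropy on compatibility scores plus a triplet term on
    pooled outfit embeddings. reg_weight=0.1 is an assumption, not a
    documented value."""
    s = np.clip(np.asarray(scores, dtype=float), 1e-7, 1.0 - 1e-7)
    y = np.asarray(labels, dtype=float)
    bce = -(y * np.log(s) + (1.0 - y) * np.log(1.0 - s)).mean()

    def cos_dist(u, v):  # cosine distance between row-normalized embeddings
        u = u / np.linalg.norm(u, axis=-1, keepdims=True)
        v = v / np.linalg.norm(v, axis=-1, keepdims=True)
        return 1.0 - np.sum(u * v, axis=-1)

    triplet = np.maximum(
        cos_dist(pooled_a, pooled_p) - cos_dist(pooled_a, pooled_n) + margin,
        0.0).mean()
    return float(bce + reg_weight * triplet)
```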

Exact training configuration (from `vit_metrics_full.json`):

- embedding_dim: 512, num_layers: 8, num_heads: 8, ff_multiplier: 4, dropout: 0.1
- epochs: 60, batch_size: 8, learning_rate: 3.5e-4, optimizer: adamw, weight_decay: 0.05
- triplet_margin: 0.3, amp: true, scheduler: cosine, warmup_epochs: 5, early_stopping: patience 12, min_delta 1e-4, seed: 42

Training dynamics (loss, lr, and timing):

| Epoch | Train Triplet | Val Triplet | LR     | Epoch Time (s) | Sequences/s |
|------:|---------------:|------------:|:-------|----------------:|------------:|
| 1  | 1.302 | 1.268 | 7.0e-5  | 89.2 | 610 |
| 5  | 0.962 | 0.929 | 2.3e-4 | 86.7 | 628 |
| 10 | 0.794 | 0.768 | 3.3e-4 | 85.3 | 639 |
| 15 | 0.687 | 0.664 | 3.5e-4 | 84.8 | 643 |
| 20 | 0.611 | 0.590 | 3.2e-4 | 84.4 | 646 |
| 25 | 0.552 | 0.533 | 2.7e-4 | 84.1 | 648 |
| 30 | 0.504 | 0.487 | 2.2e-4 | 83.9 | 650 |
| 35 | 0.465 | 0.450 | 1.8e-4 | 83.8 | 651 |
| 40 | 0.432 | 0.418 | 1.5e-4 | 83.7 | 652 |
| 45 | 0.406 | 0.394 | 1.2e-4 | 83.6 | 653 |
| 52 | 0.392 | 0.391 | 1.0e-4 | 83.6 | 653 |
| 60 | 0.389 | 0.394 | 8.0e-5 | 83.6 | 653 |

Full-dataset results (validation and test):

- Outfit scoring distribution statistics:

	| Split | Mean | Median | Std |
	|:-----:|-----:|-------:|----:|
	| Val   | 0.846 | 0.858 | 0.077 |
	| Test  | 0.839 | 0.851 | 0.080 |

- Retrieval metrics (coherent-set hit rates):

	| Split | Hit@1 | Hit@5 | Hit@10 |
	|:-----:|------:|------:|-------:|
	| Val   | 0.501 | 0.773 | 0.845 |
	| Test  | 0.493 | 0.765 | 0.838 |

- Binary classification (Youden’s J threshold τ≈0.52):

	| Split | Accuracy | Precision | Recall | F1 |
	|:-----:|---------:|----------:|-------:|---:|
	| Val   | 0.915 | 0.911 | 0.918 | 0.914 |
	| Test  | 0.908 | 0.904 | 0.911 | 0.908 |
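
Youden's J picks the threshold maximizing TPR − FPR; a sketch over the observed scores (names illustrative, and ties broken by first maximum):

```python
import numpy as np

def youden_j_threshold(scores, labels):
    """Return the score threshold maximizing Youden's J = TPR - FPR.
    Assumes both classes are present in `labels`."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_tau, best_j = 0.5, -np.inf
    for tau in np.unique(scores):
        pred = scores >= tau
        tpr = pred[labels == 1].mean()   # true-positive rate at tau
        fpr = pred[labels == 0].mean()   # false-positive rate at tau
        if tpr - fpr > best_j:
            best_j, best_tau = tpr - fpr, float(tau)
    return best_tau
```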

- Calibration and AUC:

	| Split | ECE  | MCE  | Brier | ROC-AUC | PR-AUC |
	|:-----:|----:|----:|-----:|-------:|------:|
	| Val   | 0.018 | 0.051 | 0.083 | 0.957 | 0.941 |
	| Test  | 0.021 | 0.057 | 0.087 | 0.951 | 0.934 |
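
ECE measures the gap between predicted probability and empirical positive rate, averaged over confidence bins; a sketch with equal-width bins (the evaluation code's binning scheme, e.g. equal-mass bins, may differ):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Bin predictions by probability, then average |accuracy - confidence|
    weighted by bin occupancy. MCE would be the max gap instead of the mean."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            conf = probs[mask].mean()    # average predicted probability
            acc = labels[mask].mean()    # empirical positive rate
            ece += mask.mean() * abs(acc - conf)
    return float(ece)
```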

- Per-context F1 (test): occasion/business 0.917, casual 0.902, formal 0.911, sport 0.897; weather/hot 0.906, cold 0.909, mild 0.907, rain 0.898.
- Latency (A100, fp16): 1.8 ms mean, 2.4 ms p95 per sequence; ≈653 sequences/s.

## Controlled Experiments and Ablations
- Learning rate: Too low → slow convergence; too high → instability. The sweeps favored 3e-4 (ResNet) and 3.5e-4 (ViT).
- Weight decay: 1e-4 sweet spot; too high underfits, too low overfits.
- Margin: 0.2 (ResNet) and 0.3 (ViT) gave tightest inter/intra separation.
- Batch size: Small batches add noise that helped generalization in triplet setups.
- Augmentation: Standard > none/strong; strong sometimes harms color/texture cues.
- Pretraining (ResNet): Large win; from-scratch lags in both speed and quality.
- Model size (ViT): Layers/heads beyond 8×8 didn’t help at current data caps.

Exact ablation data (from metrics files):

1) Dataset size sweeps (validation triplet loss)

- ResNet (Items): see table in Datasets section above (2k→106k: 0.183→0.152).
- ViT (Outfits): 5k→20k→53k: 0.462→0.418→0.391.

2) Learning-rate sweeps (validation triplet loss)

- ResNet:

	| LR     | Best Val Triplet | Best Epoch |
	|:-------|------------------:|-----------:|
	| 1.0e-4 | 0.173 | 50 |
	| 3.0e-4 | 0.152 | 44 |
	| 1.0e-3 | 0.164 | 28 |

- ViT:

	| LR     | Best Val Triplet |
	|:-------|------------------:|
	| 2.0e-4 | 0.402 |
	| 3.5e-4 | 0.391 |
	| 6.0e-4 | 0.399 |

3) Batch-size sweeps (validation triplet loss)

- ResNet:

	| Batch | Best Val Triplet |
	|------:|------------------:|
	| 8     | 0.156 |
	| 16    | 0.152 |
	| 32    | 0.154 |

- ViT:

	| Batch | Best Val Triplet |
	|------:|------------------:|
	| 4     | 0.398 |
	| 8     | 0.391 |
	| 16    | 0.393 |

4) Other effects

- ResNet augmentation (val triplet): none 0.181, standard 0.156, strong 0.159.
- ResNet pretraining: ImageNet-pretrained 0.152 vs. from-scratch 0.208.
- ViT dropout (val triplet): 0.0→0.397, 0.1→0.391, 0.3→0.396.
- ViT depth/heads (val triplet): layers 6→0.402, 8→0.391, 10→0.396; heads 8→0.391 vs. 12→0.395.
- ViT embedding_dim (val triplet): 256→0.400, 512→0.391, 768→0.393.

5) Requested but not reported in provided files

- ResNet embedding_dim effects across sizes/LR/batches are not present in `resnet_metrics_full.json`. If needed, report as future work or use proxy analyses (marked derived) from separate runs.

## Practical Recommendations
- Quick tests: 500–2k samples, 3–5 epochs, check loss shape and R@k trends.
- Full runs: ≥5k samples; use AMP, cosine LR, semi-hard mining.
- Early stopping: patience 10, min_delta 1e-4; don’t stop during warmup.
- Seed robustness: Report mean±std across 3–5 seeds for key configs.
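
The early-stopping recommendation (patience, min_delta, no stopping during warmup) can be sketched as a small helper (a sketch; the trainer's actual implementation may differ):

```python
class EarlyStopper:
    """Patience-based early stopping with min_delta and a warmup guard."""

    def __init__(self, patience=10, min_delta=1e-4, warmup_epochs=3):
        self.patience = patience
        self.min_delta = min_delta
        self.warmup_epochs = warmup_epochs
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, epoch, val_loss):
        if val_loss < self.best - self.min_delta:   # meaningful improvement
            self.best, self.bad_epochs = val_loss, 0
        elif epoch >= self.warmup_epochs:           # never count warmup epochs
            self.bad_epochs += 1
        return epoch >= self.warmup_epochs and self.bad_epochs >= self.patience
```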

Additions based on integrated metrics:
- ResNet: prefer LR=3e-4 with cosine+3 warmup; batch 16; standard augmentation; semi-hard mining; pretrained backbone.
- ViT: 8 layers, 8 heads, FF×4, dropout 0.1; LR≈3.5e-4; batch 8; monitor calibration (ECE≈0.02) and AUC.

## Metrics We Track (and why)
- Triplet losses (train/val): Primary training signal.
- Retrieval (R@k, mAP) on embeddings: Practical downstream utility.
- Outfit hit rates: Alignment with human-perceived coherence.
- Embedding diagnostics: norm stats, inter/intra distances, separation ratio.
- Throughput/epoch times: Capacity planning, demo readiness.

Additional tracked metrics in this report:
- ViT calibration (ECE/MCE/Brier) and ROC/PR AUC.
- ResNet CMC curves and silhouette scores.

Derived metrics note: When classification metrics across sweeps were unavailable, we used triplet loss as a proxy indicator of retrieval/classification trends and clearly labeled those uses.

## Condensed Summary (for slides)

- Data scaling improves quality with diminishing returns: ResNet val triplet 0.183→0.152 (2k→106k), ViT 0.462→0.391 (5k→53k).
- ResNet (full test): kNN acc 0.958; retrieval R@1/5/10 = 0.682/0.876/0.926; mAP 0.774; silhouette 0.392; latency ≈8.4 ms/img.
- ViT (full test): Accuracy 0.908; F1 0.908; ROC-AUC 0.951; PR-AUC 0.934; ECE 0.021; hit@10 0.838; latency ≈1.8 ms/sequence.
- Best configs: ResNet lr=3e-4, bs=16, standard aug, semi-hard; ViT 8 layers × 8 heads, dropout 0.1, lr=3.5e-4, bs=8.
- Sensitivities: Too-high LR degrades final loss; larger batches slightly hurt triplet dynamics; standard aug > none/strong; pretrained > scratch.

Provenance: All numbers above are sourced directly from `resnet_metrics_full.json` and `vit_metrics_full.json`. Any extrapolations are labeled derived and should be validated before use in research claims.