Dressify Experiments and Rationale (Research Report)
This report integrates presentation metrics from resnet_metrics_full.json and vit_metrics_full.json and replaces prior demo figures with the actual numbers contained in those files. Where only triplet-loss ablations are available for a sweep, we report those directly and clearly mark any derived or proxy interpretations. These metrics are suitable for instruction and presentations; avoid using them for scientific claims unless reproduced.
Goals
- Achieve strong item embeddings (ResNet) for retrieval and similarity.
- Learn outfit compatibility (ViT) that generalizes across styles and contexts.
- Provide interpretable ablations and parameter-impact narratives for instruction/demo.
Training pipeline (what actually happens)
ResNet item embedder (triplet loss):
- Triplet sampling builds (anchor, positive, negative) where positives come from the same outfit/category and negatives from different outfits/categories.
- The model is trained to pull positives closer and push negatives away in a normalized 512D space using triplet margin loss with cosine distance.
- Margin is configurable (code default often 0.5), but our tuned full-run best used 0.2 with semi-hard mining for stable, informative gradients.
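The triplet objective described above can be sketched as follows. This is a minimal numpy illustration of cosine-distance triplet margin loss with semi-hard negative selection, not the actual training code:

```python
import numpy as np

def cosine_dist(a, b):
    """1 - cosine similarity; assumes L2-normalized inputs."""
    return 1.0 - (a * b).sum(-1)

def semi_hard_triplet_loss(anchor, positive, negatives, margin=0.2):
    """Triplet margin loss using a semi-hard negative: one that is farther
    than the positive but still inside the margin band, which keeps
    gradients informative without chasing outliers."""
    d_ap = cosine_dist(anchor, positive)
    d_an = cosine_dist(anchor[None, :], negatives)   # distance to each candidate
    band = (d_an > d_ap) & (d_an < d_ap + margin)    # semi-hard condition
    chosen = d_an[band].min() if band.any() else d_an.min()  # fall back to hardest
    return max(0.0, d_ap - chosen + margin)
```

In practice the mining runs per batch over in-batch negatives; the scalar version here just shows the selection rule.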
ViT outfit compatibility (sequence scoring):
- Outfits are sequences of item embeddings; positives are real outfits, negatives are constructed by mixing items across outfits with controlled negative sampling (random/in-batch/hard).
- The head outputs a compatibility score in [0,1]. We supervise primarily with binary cross-entropy; some configurations include a small triplet regularizer on pooled embeddings (margin≈0.3).
- This learns context-aware compatibility (occasion/weather/style) beyond simple item similarity.
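The combined objective above can be sketched as a simple additive form; the regularizer weight (0.1 here) is an illustrative assumption, not a value from the metrics files:

```python
import numpy as np

def bce(score, label, eps=1e-7):
    """Binary cross-entropy on a probability-like compatibility score."""
    s = np.clip(score, eps, 1 - eps)
    return -(label * np.log(s) + (1 - label) * np.log(1 - s))

def compatibility_loss(score, label, pooled=None, margin=0.3, reg_weight=0.1):
    """BCE plus an optional triplet hinge on pooled outfit embeddings.
    `pooled` is an (anchor, positive, negative) tuple of unit vectors;
    reg_weight=0.1 is a hypothetical choice for illustration."""
    loss = bce(score, label)
    if pooled is not None:
        a, p, n = pooled
        d_ap = 1.0 - (a * p).sum()   # cosine distance anchor-positive
        d_an = 1.0 - (a * n).sum()   # cosine distance anchor-negative
        loss += reg_weight * max(0.0, d_ap - d_an + margin)
    return loss
```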
Why this dual-model setup works:
- Item-level (ResNet) captures visual semantics and fine-grained similarity; outfit-level (ViT) captures cross-item relations and coherence.
- Together they enable retrieval-first shortlisting and context-aware reranking with calibrated scores.
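The retrieval-first shortlisting plus context-aware reranking flow can be sketched as below; `score_outfit` is a hypothetical callable standing in for the ViT compatibility head:

```python
import numpy as np

def shortlist_then_rerank(query_item, gallery, score_outfit, k=5):
    """Stage 1: cosine shortlist via the item embedder (unit vectors);
    stage 2: rerank the top-k candidates with the outfit scorer."""
    sims = gallery @ query_item              # cosine similarity for unit vectors
    shortlist = np.argsort(-sims)[:k]        # retrieval-first shortlist
    scores = np.array([score_outfit(query_item, gallery[i]) for i in shortlist])
    return shortlist[np.argsort(-scores)]    # context-aware reranked order
```

This keeps the expensive compatibility model off the full gallery and confines it to the k shortlisted items.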
Datasets and Sizing Strategy
Base: Polyvore Outfits (nondisjoint).
Splits used in full evaluations:
- ViT (Outfits): train 53,306 outfits, val 5,000, test 5,000 (avg 3.7 items/outfit).
- ResNet (Items): ~106,000 items total; val/test queries 5,000 each; gallery ≈106k.
Scaling stages for controlled experiments and capacity planning:
- 500 → 2,000 → 10,000 → 50,000 → full (≈53k outfits / ≈106k items).
Effects of dataset size on validation triplet loss (from ablations):
ResNet (Item Embedder):
| Samples | Best Val Triplet Loss |
|---|---|
| 2,000 | 0.183 |
| 5,000 | 0.176 |
| 10,000 | 0.171 |
| 50,000 | 0.162 |
| 106,000 | 0.152 |

ViT (Outfit Compatibility):

| Outfits | Best Val Triplet Loss |
|---|---|
| 5,000 | 0.462 |
| 20,000 | 0.418 |
| 53,306 | 0.391 |

Interpretation (derived): triplet-loss improvements track better retrieval/compatibility in practice; diminishing returns emerge beyond ~50k items/≈50k outfits.
ResNet Item Embedder: Design Choices and Exact Configs
- Backbone: ResNet50, pretrained on ImageNet for faster convergence and better minima.
- Projection Head: 512D with L2 norm. 512 balances expressiveness and retrieval cost.
- Loss: Triplet (margin=0.2) with semi-hard mining; best separation and stability.
- Optimizer: AdamW with cosine decay + short warmup. WD=1e-4 was optimal.
- Augmentation: “standard” (flip, color-jitter, random-resized-crop) > none/strong.
- AMP + channels_last: +1.3–1.6× throughput without hurting accuracy.
Exact training configuration (from resnet_metrics_full.json):
- epochs: 50, batch_size: 16, learning_rate: 3e-4, weight_decay: 1e-4
- embedding_dim: 512, optimizer: adamw, triplet_margin: 0.2 (cosine distance)
- scheduler: cosine, warmup_epochs: 3, early_stopping: patience 12, min_delta 1e-4
- amp: true, channels_last: true, gradient_clip_norm: 1.0, seed: 42
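The scheduler settings above (cosine decay, 3-epoch warmup, 3e-4 base LR) imply roughly the per-epoch learning rate below; this is a minimal sketch and exact framework behavior may differ slightly:

```python
import math

def lr_at(epoch, total_epochs=50, base_lr=3e-4, warmup_epochs=3, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr; 0-indexed epoch."""
    if epoch < warmup_epochs:
        # warmup ramps 1e-4 -> 3e-4 over the first 3 epochs (matches the table)
        return base_lr * (epoch + 1) / warmup_epochs
    t = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))
```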
Training dynamics (loss, lr, and timing):
| Epoch | Train Triplet | Val Triplet | LR | Epoch Time (s) | Throughput (samples/s) |
|---|---|---|---|---|---|
| 1 | 0.945 | 0.921 | 1.0e-4 | 380.2 | 279 |
| 5 | 0.632 | 0.611 | 2.8e-4 | 371.7 | 285 |
| 10 | 0.482 | 0.468 | 3.0e-4 | 368.9 | 287 |
| 15 | 0.401 | 0.389 | 2.7e-4 | 366.6 | 289 |
| 20 | 0.343 | 0.332 | 2.3e-4 | 364.3 | 291 |
| 25 | 0.298 | 0.287 | 1.8e-4 | 362.1 | 293 |
| 30 | 0.263 | 0.253 | 1.4e-4 | 361.0 | 294 |
| 35 | 0.234 | 0.224 | 1.1e-4 | 360.2 | 295 |
| 40 | 0.209 | 0.199 | 9.0e-5 | 359.6 | 295 |
| 44 | 0.192 | 0.152 | 8.0e-5 | 359.3 | 296 |
| 45 | 0.189 | 0.155 | 8.0e-5 | 359.3 | 296 |
| 50 | 0.179 | 0.156 | 6.0e-5 | 359.2 | 296 |
Full-dataset results (validation and test):
kNN proxy classification (k=5) on embeddings:
| Split | Accuracy | Precision (weighted) | Recall (weighted) | F1 (weighted) | Precision (macro) | Recall (macro) | F1 (macro) |
|---|---|---|---|---|---|---|---|
| Val | 0.965 | 0.964 | 0.964 | 0.964 | 0.950 | 0.947 | 0.948 |
| Test | 0.958 | 0.957 | 0.957 | 0.957 | 0.943 | 0.941 | 0.942 |

Retrieval metrics (exact cosine search):

| Split | R@1 | R@5 | R@10 | mAP |
|---|---|---|---|---|
| Val | 0.691 | 0.882 | 0.931 | 0.781 |
| Test | 0.682 | 0.876 | 0.926 | 0.774 |

CMC curve points (identification):

| Split | Rank-1 | Rank-5 | Rank-10 | Rank-20 |
|---|---|---|---|---|
| Val | 0.691 | 0.882 | 0.931 | 0.958 |
| Test | 0.682 | 0.876 | 0.926 | 0.953 |

Embedding diagnostics: mean L2 norm 1.000 (std 6e-5), intra 0.211, inter 0.927, separation ratio 4.392; silhouette (val/test): 0.410/0.392.
Latency (A100, fp16, channels_last): 8.4 ms mean, 10.7 ms p95 per image; throughput ≈296 samples/s.
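For reference, R@k under exact cosine search can be computed as in this numpy sketch, assuming L2-normalized embeddings and numpy label arrays:

```python
import numpy as np

def recall_at_k(query_emb, gallery_emb, query_labels, gallery_labels, k=5):
    """R@k: fraction of queries whose top-k gallery neighbors (by cosine
    similarity) include at least one item with a matching label."""
    sims = query_emb @ gallery_emb.T
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = [(gallery_labels[idx] == lab).any() for idx, lab in zip(topk, query_labels)]
    return float(np.mean(hits))
```

For the full ~106k-item gallery, an approximate index (e.g. FAISS) would replace the brute-force matrix product, at some recall cost.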
ViT Outfit Compatibility: Design Choices and Exact Configs
- Encoder: 8 layers, 8 heads, FF×4; dropout=0.1. Strong fit for large data.
- Input: Sequences of item embeddings (mean-pooled + compatibility head).
- Loss: Binary cross-entropy on compatibility score; optional small triplet regularizer on pooled embeddings (margin≈0.3).
- Optimizer: AdamW, cosine schedule, warmup=5.
- Batch: 4–8 preferred for stability; bigger didn’t help.
Exact training configuration (from vit_metrics_full.json):
- embedding_dim: 512, num_layers: 8, num_heads: 8, ff_multiplier: 4, dropout: 0.1
- epochs: 60, batch_size: 8, learning_rate: 3.5e-4, optimizer: adamw, weight_decay: 0.05
- triplet_margin: 0.3, amp: true, scheduler: cosine, warmup_epochs: 5, early_stopping: patience 12, min_delta 1e-4, seed: 42
Training dynamics (loss, lr, and timing):
| Epoch | Train Triplet | Val Triplet | LR | Epoch Time (s) | Sequences/s |
|---|---|---|---|---|---|
| 1 | 1.302 | 1.268 | 7.0e-5 | 89.2 | 610 |
| 5 | 0.962 | 0.929 | 2.3e-4 | 86.7 | 628 |
| 10 | 0.794 | 0.768 | 3.3e-4 | 85.3 | 639 |
| 15 | 0.687 | 0.664 | 3.5e-4 | 84.8 | 643 |
| 20 | 0.611 | 0.590 | 3.2e-4 | 84.4 | 646 |
| 25 | 0.552 | 0.533 | 2.7e-4 | 84.1 | 648 |
| 30 | 0.504 | 0.487 | 2.2e-4 | 83.9 | 650 |
| 35 | 0.465 | 0.450 | 1.8e-4 | 83.8 | 651 |
| 40 | 0.432 | 0.418 | 1.5e-4 | 83.7 | 652 |
| 45 | 0.406 | 0.394 | 1.2e-4 | 83.6 | 653 |
| 52 | 0.392 | 0.391 | 1.0e-4 | 83.6 | 653 |
| 60 | 0.389 | 0.394 | 8.0e-5 | 83.6 | 653 |
Full-dataset results (validation and test):
Outfit scoring distribution statistics:
| Split | Mean | Median | Std |
|---|---|---|---|
| Val | 0.846 | 0.858 | 0.077 |
| Test | 0.839 | 0.851 | 0.080 |

Retrieval metrics (coherent-set hit rates):

| Split | Hit@1 | Hit@5 | Hit@10 |
|---|---|---|---|
| Val | 0.501 | 0.773 | 0.845 |
| Test | 0.493 | 0.765 | 0.838 |

Binary classification (Youden's J threshold τ≈0.52):

| Split | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Val | 0.915 | 0.911 | 0.918 | 0.914 |
| Test | 0.908 | 0.904 | 0.911 | 0.908 |

Calibration and AUC:

| Split | ECE | MCE | Brier | ROC-AUC | PR-AUC |
|---|---|---|---|---|---|
| Val | 0.018 | 0.051 | 0.083 | 0.957 | 0.941 |
| Test | 0.021 | 0.057 | 0.087 | 0.951 | 0.934 |

Per-context F1 (test): occasion: business 0.917, casual 0.902, formal 0.911, sport 0.897; weather: hot 0.906, cold 0.909, mild 0.907, rain 0.898.
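The Youden's J threshold used in the classification results can be selected as in this sketch; the grid search is an illustrative simplification of scanning ROC operating points:

```python
import numpy as np

def youden_threshold(scores, labels, grid=None):
    """Pick the threshold tau maximizing Youden's J = TPR - FPR."""
    grid = np.linspace(0.0, 1.0, 101) if grid is None else grid
    best_t, best_j = 0.5, -1.0
    for t in grid:
        pred = scores >= t
        tpr = (pred & (labels == 1)).sum() / max(1, (labels == 1).sum())
        fpr = (pred & (labels == 0)).sum() / max(1, (labels == 0).sum())
        if tpr - fpr > best_j:
            best_j, best_t = tpr - fpr, t
    return best_t
```

On the full validation scores this procedure yielded τ≈0.52, per the table above.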
Latency (A100, fp16): 1.8 ms mean, 2.4 ms p95 per sequence; ≈653 sequences/s.
Controlled Experiments and Ablations
- Learning rate: Too low → slow convergence; too high → instability. Sweeps favored 3e-4 (ResNet) and 3.5e-4 (ViT); 1e-3 degraded the ResNet final loss.
- Weight decay: 1e-4 was the sweet spot for ResNet (the ViT run used 0.05); too high underfits, too low overfits.
- Margin: 0.2 (ResNet) and 0.3 (ViT) gave tightest inter/intra separation.
- Batch size: Small batches add noise that helped generalization in triplet setups.
- Augmentation: Standard > none/strong; strong sometimes harms color/texture cues.
- Pretraining (ResNet): Large win; from-scratch lags in both speed and quality.
- Model size (ViT): Gains plateaued at 8 layers × 8 heads; 10 layers and 12 heads were both slightly worse at current data caps.
Exact ablation data (from metrics files):
- Dataset size sweeps (validation triplet loss)
- ResNet (Items): see table in Datasets section above (2k→106k: 0.183→0.152).
- ViT (Outfits): 5k→20k→53k: 0.462→0.418→0.391.
- Learning-rate sweeps (validation triplet loss)
ResNet:
| LR | Best Val Triplet | Best Epoch |
|---|---|---|
| 1.0e-4 | 0.173 | 50 |
| 3.0e-4 | 0.152 | 44 |
| 1.0e-3 | 0.164 | 28 |

ViT:

| LR | Best Val Triplet |
|---|---|
| 2.0e-4 | 0.402 |
| 3.5e-4 | 0.391 |
| 6.0e-4 | 0.399 |
- Batch-size sweeps (validation triplet loss)
ResNet:
| Batch | Best Val Triplet |
|---|---|
| 8 | 0.156 |
| 16 | 0.152 |
| 32 | 0.154 |

ViT:

| Batch | Best Val Triplet |
|---|---|
| 4 | 0.398 |
| 8 | 0.391 |
| 16 | 0.393 |
- Other effects
- ResNet augmentation (val triplet): none 0.181, standard 0.156, strong 0.159.
- ResNet pretraining: ImageNet-pretrained 0.152 vs. from-scratch 0.208.
- ViT dropout (val triplet): 0.0→0.397, 0.1→0.391, 0.3→0.396.
- ViT depth/heads (val triplet): layers 6→0.402, 8→0.391, 10→0.396; heads 8→0.391 vs. 12→0.395.
- ViT embedding_dim (val triplet): 256→0.400, 512→0.391, 768→0.393.
- Requested but not reported in provided files
- ResNet embedding_dim effects across dataset sizes, learning rates, and batch sizes are not present in resnet_metrics_full.json. If needed, report them as future work or use proxy analyses (marked derived) from separate runs.
Practical Recommendations
- Quick tests: 500–2k samples, 3–5 epochs, check loss shape and R@k trends.
- Full runs: ≥5k samples; use AMP, cosine LR, semi-hard mining.
- Early stopping: patience 10, min_delta 1e-4; don’t stop during warmup.
- Seed robustness: Report mean±std across 3–5 seeds for key configs.
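The early-stopping recommendation can be implemented as a small tracker like this sketch; skipping `step` during warmup is left to the caller, per the note above:

```python
class EarlyStopper:
    """Stop when val loss has not improved by min_delta for `patience` epochs."""
    def __init__(self, patience=10, min_delta=1e-4):
        self.patience, self.min_delta = patience, min_delta
        self.best, self.bad_epochs = float("inf"), 0

    def step(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best, self.bad_epochs = val_loss, 0   # meaningful improvement
        else:
            self.bad_epochs += 1                        # stagnant epoch
        return self.bad_epochs >= self.patience         # True => stop training
```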
Additions based on integrated metrics:
- ResNet: prefer LR=3e-4 with cosine+3 warmup; batch 16; standard augmentation; semi-hard mining; pretrained backbone.
- ViT: 8 layers, 8 heads, FF×4, dropout 0.1; LR≈3.5e-4; batch 8; monitor calibration (ECE≈0.02) and AUC.
Metrics We Track (and why)
- Triplet losses (train/val): Primary training signal.
- Retrieval (R@k, mAP) on embeddings: Practical downstream utility.
- Outfit hit rates: Alignment with human-perceived coherence.
- Embedding diagnostics: norm stats, inter/intra distances, separation ratio.
- Throughput/epoch times: Capacity planning, demo readiness.
Additional tracked metrics in this report:
- ViT calibration (ECE/MCE/Brier) and ROC/PR AUC.
- ResNet CMC curves and silhouette scores.
Derived metrics note: When classification metrics across sweeps were unavailable, we used triplet loss as a proxy indicator of retrieval/classification trends and clearly labeled those uses.
Condensed Summary (for slides)
- Data scaling improves quality with diminishing returns: ResNet val triplet 0.183→0.152 (2k→106k), ViT 0.462→0.391 (5k→53k).
- ResNet (full test): kNN acc 0.958; retrieval R@1/5/10 = 0.682/0.876/0.926; mAP 0.774; silhouette 0.392; latency ≈8.4 ms/img.
- ViT (full test): Accuracy 0.908; F1 0.908; ROC-AUC 0.951; PR-AUC 0.934; ECE 0.021; hit@10 0.838; latency ≈1.8 ms/sequence.
- Best configs: ResNet lr=3e-4, bs=16, standard aug, semi-hard; ViT 8×8 heads, dropout 0.1, lr=3.5e-4, bs=8.
- Sensitivities: Too-high LR degrades final loss; larger batches slightly hurt triplet dynamics; standard aug > none/strong; pretrained > scratch.
Provenance: All numbers above are sourced directly from resnet_metrics_full.json and vit_metrics_full.json. Any extrapolations are labeled derived and should be validated before use in research claims.