dressify-models / EXPERIMENTS_README.md — Ali Mohsin (commit 8d1e2f4, "Detailed results for everything")

Dressify Experiments and Rationale (Research Report)

This report integrates presentation metrics from resnet_metrics_full.json and vit_metrics_full.json and replaces prior demo figures with the actual numbers contained in those files. Where only triplet-loss ablations are available for a sweep, we report those directly and clearly mark any derived or proxy interpretations. These metrics are suitable for instruction and presentations; avoid using them for scientific claims unless reproduced.

Goals

  • Achieve strong item embeddings (ResNet) for retrieval and similarity.
  • Learn outfit compatibility (ViT) that generalizes across styles and contexts.
  • Provide interpretable ablations and parameter-impact narratives for instruction/demo.

Training pipeline (what actually happens)

  • ResNet item embedder (triplet loss):

    • Triplet sampling builds (anchor, positive, negative) where positives come from the same outfit/category and negatives from different outfits/categories.
    • The model is trained to pull positives closer and push negatives away in a normalized 512D space using triplet margin loss with cosine distance.
    • Margin is configurable (code default often 0.5), but our tuned full-run best used 0.2 with semi-hard mining for stable, informative gradients.
  • ViT outfit compatibility (sequence scoring):

    • Outfits are sequences of item embeddings; positives are real outfits, negatives are constructed by mixing items across outfits with controlled negative sampling (random/in-batch/hard).
    • The head outputs a compatibility score in [0,1]. We supervise primarily with binary cross-entropy; some configurations include a small triplet regularizer on pooled embeddings (margin≈0.3).
    • This learns context-aware compatibility (occasion/weather/style) beyond simple item similarity.
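The triplet objective used for the item embedder can be sketched numerically as below. This is a minimal NumPy illustration of triplet margin loss with cosine distance plus the semi-hard condition, not the repo's training code; function names are ours.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # project embeddings onto the unit sphere, as in the 512D normalized space
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def cosine_dist(a, b):
    # for unit vectors, cosine distance is 1 - dot product
    return 1.0 - np.sum(a * b, axis=-1)

def triplet_cosine_loss(anchor, positive, negative, margin=0.2):
    """Triplet margin loss: mean of max(d(a,p) - d(a,n) + margin, 0)."""
    a, p, n = map(l2_normalize, (anchor, positive, negative))
    return np.maximum(cosine_dist(a, p) - cosine_dist(a, n) + margin, 0.0).mean()

def is_semi_hard(anchor, positive, negative, margin=0.2):
    """Semi-hard negatives: farther than the positive but inside the margin,
    i.e. d(a,p) < d(a,n) < d(a,p) + margin -- nonzero yet stable gradients."""
    a, p, n = map(l2_normalize, (anchor, positive, negative))
    d_ap, d_an = cosine_dist(a, p), cosine_dist(a, n)
    return (d_ap < d_an) & (d_an < d_ap + margin)
```

With margin 0.2, a triplet whose negative already sits more than 0.2 beyond the positive contributes zero loss, which is why semi-hard mining focuses on the informative middle band.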

Why this dual-model setup works:

  • Item-level (ResNet) captures visual semantics and fine-grained similarity; outfit-level (ViT) captures cross-item relations and coherence.
  • Together they enable retrieval-first shortlisting and context-aware reranking with calibrated scores.
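The retrieval-first pattern can be sketched as a two-stage function. This is a generic illustration under our own assumptions: `score_outfit` is a hypothetical placeholder standing in for the ViT compatibility head.

```python
import numpy as np

def shortlist_then_rerank(query_emb, gallery_embs, score_outfit, k=10):
    """Stage 1: exact cosine shortlist over item embeddings (ResNet space).
    Stage 2: rerank the shortlist with an outfit-level compatibility scorer
    (the ViT head in the real pipeline; here, any callable item -> score)."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    top = np.argsort(-(g @ q))[:k]                         # cosine shortlist
    scores = np.array([score_outfit(gallery_embs[i]) for i in top])
    return top[np.argsort(-scores)]                        # context-aware rerank
```

The shortlist keeps the expensive compatibility model off the full ~106k-item gallery; only the top-k candidates are rescored.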

Datasets and Sizing Strategy

  • Base: Polyvore Outfits (nondisjoint).

  • Splits used in full evaluations:

    • ViT (Outfits): train 53,306 outfits, val 5,000, test 5,000 (avg 3.7 items/outfit).
    • ResNet (Items): ~106,000 items total; val/test queries 5,000 each; gallery ≈106k.
  • Scaling stages for controlled experiments and capacity planning:

    • 500 → 2,000 → 10,000 → 50,000 → full (≈53k outfits / ≈106k items).
  • Effects of dataset size on validation triplet loss (from ablations):

    • ResNet (Item Embedder):

      | Samples  | Best Val Triplet Loss |
      |----------|-----------------------|
      | 2,000    | 0.183 |
      | 5,000    | 0.176 |
      | 10,000   | 0.171 |
      | 50,000   | 0.162 |
      | 106,000  | 0.152 |
    • ViT (Outfit Compatibility):

      | Outfits | Best Val Triplet Loss |
      |---------|-----------------------|
      | 5,000   | 0.462 |
      | 20,000  | 0.418 |
      | 53,306  | 0.391 |

Interpretation (derived): triplet-loss improvements track better retrieval/compatibility in practice; diminishing returns emerge beyond ~50k items/≈50k outfits.

ResNet Item Embedder: Design Choices and Exact Configs

  • Backbone: ResNet50, pretrained on ImageNet for faster convergence and better minima.
  • Projection Head: 512D with L2 norm. 512 balances expressiveness and retrieval cost.
  • Loss: Triplet (margin=0.2) with semi-hard mining; best separation and stability.
  • Optimizer: AdamW with cosine decay + short warmup. WD=1e-4 was optimal.
  • Augmentation: “standard” (flip, color-jitter, random-resized-crop) > none/strong.
  • AMP + channels_last: +1.3–1.6× throughput without hurting accuracy.

Exact training configuration (from resnet_metrics_full.json):

  • epochs: 50, batch_size: 16, learning_rate: 3e-4, weight_decay: 1e-4
  • embedding_dim: 512, optimizer: adamw, triplet_margin: 0.2 (cosine distance)
  • scheduler: cosine, warmup_epochs: 3, early_stopping: patience 12, min_delta 1e-4
  • amp: true, channels_last: true, gradient_clip_norm: 1.0, seed: 42
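The cosine schedule with warmup can be sketched as below. This is our illustrative reconstruction under the config values above (peak 3e-4, 3 warmup epochs, 50 epochs); the decay floor and linear warmup shape are assumptions, not read from the metrics file.

```python
import math

def lr_at_epoch(epoch, total_epochs=50, base_lr=3e-4, warmup_epochs=3):
    """Linear warmup to base_lr over warmup_epochs, then cosine decay toward 0."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs   # epoch 0 -> base_lr / 3
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

With these values the first epoch trains at 1e-4, consistent with the epoch-1 learning rate in the dynamics table.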

Training dynamics (loss, lr, and timing):

| Epoch | Train Triplet | Val Triplet | LR     | Epoch Time (s) | Throughput (samples/s) |
|-------|---------------|-------------|--------|----------------|------------------------|
| 1     | 0.945         | 0.921       | 1.0e-4 | 380.2          | 279 |
| 5     | 0.632         | 0.611       | 2.8e-4 | 371.7          | 285 |
| 10    | 0.482         | 0.468       | 3.0e-4 | 368.9          | 287 |
| 15    | 0.401         | 0.389       | 2.7e-4 | 366.6          | 289 |
| 20    | 0.343         | 0.332       | 2.3e-4 | 364.3          | 291 |
| 25    | 0.298         | 0.287       | 1.8e-4 | 362.1          | 293 |
| 30    | 0.263         | 0.253       | 1.4e-4 | 361.0          | 294 |
| 35    | 0.234         | 0.224       | 1.1e-4 | 360.2          | 295 |
| 40    | 0.209         | 0.199       | 9.0e-5 | 359.6          | 295 |
| 44    | 0.192         | 0.152       | 8.0e-5 | 359.3          | 296 |
| 45    | 0.189         | 0.155       | 8.0e-5 | 359.3          | 296 |
| 50    | 0.179         | 0.156       | 6.0e-5 | 359.2          | 296 |

Full-dataset results (validation and test):

  • kNN proxy classification (k=5) on embeddings:

    | Split | Accuracy | Precision (weighted) | Recall (weighted) | F1 (weighted) | Precision (macro) | Recall (macro) | F1 (macro) |
    |-------|----------|----------------------|-------------------|---------------|-------------------|----------------|------------|
    | Val   | 0.965    | 0.964                | 0.964             | 0.964         | 0.950             | 0.947          | 0.948 |
    | Test  | 0.958    | 0.957                | 0.957             | 0.957         | 0.943             | 0.941          | 0.942 |
  • Retrieval metrics (exact cosine search):

    | Split | R@1   | R@5   | R@10  | mAP   |
    |-------|-------|-------|-------|-------|
    | Val   | 0.691 | 0.882 | 0.931 | 0.781 |
    | Test  | 0.682 | 0.876 | 0.926 | 0.774 |
  • CMC curve points (identification):

    | Split | Rank-1 | Rank-5 | Rank-10 | Rank-20 |
    |-------|--------|--------|---------|---------|
    | Val   | 0.691  | 0.882  | 0.931   | 0.958 |
    | Test  | 0.682  | 0.876  | 0.926   | 0.953 |
  • Embedding diagnostics: mean L2 norm 1.000 (std 6e-5), intra 0.211, inter 0.927, separation ratio 4.392; silhouette (val/test): 0.410/0.392.

  • Latency (A100, fp16, channels_last): 8.4 ms mean, 10.7 ms p95 per image; throughput ≈296 samples/s.
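The R@k numbers above can be computed from a query-gallery cosine-similarity matrix as sketched here; this is a generic recipe, not the repo's evaluation script.

```python
import numpy as np

def recall_at_k(sim, query_labels, gallery_labels, ks=(1, 5, 10)):
    """sim: (Q, G) cosine similarities. A query counts as a hit at k when any
    of its top-k gallery items shares its label."""
    order = np.argsort(-sim, axis=1)          # gallery indices, best first
    ranked = gallery_labels[order]            # (Q, G) labels in rank order
    hits = ranked == query_labels[:, None]
    return {k: float(hits[:, :k].any(axis=1).mean()) for k in ks}
```

The same ranked-hit matrix also yields the CMC points: Rank-k in the CMC table is exactly R@k under this hit definition, which is why the two tables share their first three columns.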

ViT Outfit Compatibility: Design Choices and Exact Configs

  • Encoder: 8 layers, 8 heads, FF×4; dropout=0.1. Strong fit for large data.
  • Input: Sequences of item embeddings (mean-pooled + compatibility head).
  • Loss: Binary cross-entropy on compatibility score; optional small triplet regularizer on pooled embeddings (margin≈0.3).
  • Optimizer: AdamW, cosine schedule, warmup=5.
  • Batch: 4–8 preferred for stability; bigger didn’t help.
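The supervision described above can be sketched as BCE plus an optional triplet term on pooled outfit embeddings. This NumPy illustration is ours; in particular `reg_weight` is an assumed illustrative value, not a number from the metrics files.

```python
import numpy as np

def _cos_dist(a, b):
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return 1.0 - np.sum(a * b, axis=-1)

def compatibility_loss(scores, labels, pooled=None, margin=0.3, reg_weight=0.1):
    """BCE on compatibility scores in [0,1]; `pooled`, if given, is an
    (anchor, positive, negative) triple of pooled outfit embeddings."""
    eps = 1e-7
    s = np.clip(scores, eps, 1.0 - eps)
    bce = -np.mean(labels * np.log(s) + (1 - labels) * np.log(1 - s))
    if pooled is None:
        return bce
    a, p, n = pooled
    trip = np.maximum(_cos_dist(a, p) - _cos_dist(a, n) + margin, 0.0).mean()
    return bce + reg_weight * trip
```

BCE carries the main supervision signal; the triplet term only nudges pooled embeddings of real outfits apart from mixed negatives, which is why a small weight suffices.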

Exact training configuration (from vit_metrics_full.json):

  • embedding_dim: 512, num_layers: 8, num_heads: 8, ff_multiplier: 4, dropout: 0.1
  • epochs: 60, batch_size: 8, learning_rate: 3.5e-4, optimizer: adamw, weight_decay: 0.05
  • triplet_margin: 0.3, amp: true, scheduler: cosine, warmup_epochs: 5, early_stopping: patience 12, min_delta 1e-4, seed: 42

Training dynamics (loss, lr, and timing):

| Epoch | Train Triplet | Val Triplet | LR     | Epoch Time (s) | Sequences/s |
|-------|---------------|-------------|--------|----------------|-------------|
| 1     | 1.302         | 1.268       | 7.0e-5 | 89.2           | 610 |
| 5     | 0.962         | 0.929       | 2.3e-4 | 86.7           | 628 |
| 10    | 0.794         | 0.768       | 3.3e-4 | 85.3           | 639 |
| 15    | 0.687         | 0.664       | 3.5e-4 | 84.8           | 643 |
| 20    | 0.611         | 0.590       | 3.2e-4 | 84.4           | 646 |
| 25    | 0.552         | 0.533       | 2.7e-4 | 84.1           | 648 |
| 30    | 0.504         | 0.487       | 2.2e-4 | 83.9           | 650 |
| 35    | 0.465         | 0.450       | 1.8e-4 | 83.8           | 651 |
| 40    | 0.432         | 0.418       | 1.5e-4 | 83.7           | 652 |
| 45    | 0.406         | 0.394       | 1.2e-4 | 83.6           | 653 |
| 52    | 0.392         | 0.391       | 1.0e-4 | 83.6           | 653 |
| 60    | 0.389         | 0.394       | 8.0e-5 | 83.6           | 653 |

Full-dataset results (validation and test):

  • Outfit scoring distribution statistics:

    | Split | Mean  | Median | Std   |
    |-------|-------|--------|-------|
    | Val   | 0.846 | 0.858  | 0.077 |
    | Test  | 0.839 | 0.851  | 0.080 |
  • Retrieval metrics (coherent-set hit rates):

    | Split | Hit@1 | Hit@5 | Hit@10 |
    |-------|-------|-------|--------|
    | Val   | 0.501 | 0.773 | 0.845 |
    | Test  | 0.493 | 0.765 | 0.838 |
  • Binary classification (Youden's J threshold τ≈0.52):

    | Split | Accuracy | Precision | Recall | F1    |
    |-------|----------|-----------|--------|-------|
    | Val   | 0.915    | 0.911     | 0.918  | 0.914 |
    | Test  | 0.908    | 0.904     | 0.911  | 0.908 |
  • Calibration and AUC:

    | Split | ECE   | MCE   | Brier | ROC-AUC | PR-AUC |
    |-------|-------|-------|-------|---------|--------|
    | Val   | 0.018 | 0.051 | 0.083 | 0.957   | 0.941 |
    | Test  | 0.021 | 0.057 | 0.087 | 0.951   | 0.934 |
  • Per-context F1 (test): occasion/business 0.917, casual 0.902, formal 0.911, sport 0.897; weather/hot 0.906, cold 0.909, mild 0.907, rain 0.898.

  • Latency (A100, fp16): 1.8 ms mean, 2.4 ms p95 per sequence; ≈653 sequences/s.
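The ECE values above follow the standard equal-width binning recipe, sketched here for the positive-class score; the bin count (10) is our assumption, as the metrics files do not state it.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted mean |accuracy - confidence| over equal-width probability bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (probs > lo) & (probs <= hi)
        if mask.any():
            # bin weight * gap between empirical accuracy and mean confidence
            ece += mask.mean() * abs(labels[mask].mean() - probs[mask].mean())
    return ece
```

MCE is the same computation with `max` over bins instead of the weighted sum; an ECE near 0.02 means the predicted compatibility scores can be read almost directly as probabilities.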

Controlled Experiments and Ablations

  • Learning rate: Too low → slow convergence; too high → instability. Best runs landed near 3e-4 (ResNet) and 3.5e-4 (ViT); 1e-3 converged faster early but degraded the final loss.
  • Weight decay: 1e-4 sweet spot; too high underfits, too low overfits.
  • Margin: 0.2 (ResNet) and 0.3 (ViT) gave tightest inter/intra separation.
  • Batch size: Small batches add noise that helped generalization in triplet setups.
  • Augmentation: Standard > none/strong; strong sometimes harms color/texture cues.
  • Pretraining (ResNet): Large win; from-scratch lags in both speed and quality.
  • Model size (ViT): Going beyond 8 layers × 8 heads didn’t help at current data caps (10 layers and 12 heads were both slightly worse).

Exact ablation data (from metrics files):

  1. Dataset size sweeps (validation triplet loss)
  • ResNet (Items): see table in Datasets section above (2k→106k: 0.183→0.152).
  • ViT (Outfits): 5k→20k→53k: 0.462→0.418→0.391.
  2. Learning-rate sweeps (validation triplet loss)
  • ResNet:

    | LR     | Best Val Triplet | Best Epoch |
    |--------|------------------|------------|
    | 1.0e-4 | 0.173            | 50 |
    | 3.0e-4 | 0.152            | 44 |
    | 1.0e-3 | 0.164            | 28 |
  • ViT:

    | LR     | Best Val Triplet |
    |--------|------------------|
    | 2.0e-4 | 0.402 |
    | 3.5e-4 | 0.391 |
    | 6.0e-4 | 0.399 |
  3. Batch-size sweeps (validation triplet loss)
  • ResNet:

    | Batch | Best Val Triplet |
    |-------|------------------|
    | 8     | 0.156 |
    | 16    | 0.152 |
    | 32    | 0.154 |
  • ViT:

    | Batch | Best Val Triplet |
    |-------|------------------|
    | 4     | 0.398 |
    | 8     | 0.391 |
    | 16    | 0.393 |
  4. Other effects
  • ResNet augmentation (val triplet): none 0.181, standard 0.156, strong 0.159.
  • ResNet pretraining: ImageNet-pretrained 0.152 vs. from-scratch 0.208.
  • ViT dropout (val triplet): 0.0→0.397, 0.1→0.391, 0.3→0.396.
  • ViT depth/heads (val triplet): layers 6→0.402, 8→0.391, 10→0.396; heads 8→0.391 vs. 12→0.395.
  • ViT embedding_dim (val triplet): 256→0.400, 512→0.391, 768→0.393.
  5. Requested but not reported in provided files
  • ResNet embedding_dim effects across sizes/LR/batches are not present in resnet_metrics_full.json. If needed, report as future work or use proxy analyses (marked derived) from separate runs.

Practical Recommendations

  • Quick tests: 500–2k samples, 3–5 epochs, check loss shape and R@k trends.
  • Full runs: ≥5k samples; use AMP, cosine LR, semi-hard mining.
  • Early stopping: patience 10, min_delta 1e-4; don’t stop during warmup.
  • Seed robustness: Report mean±std across 3–5 seeds for key configs.
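The early-stopping rule above can be sketched as a small stateful helper (class name is ours; the warmup exemption follows the note above):

```python
class EarlyStopper:
    """Stop when val loss hasn't improved by at least min_delta for
    `patience` consecutive epochs; warmup epochs never count against patience."""
    def __init__(self, patience=10, min_delta=1e-4, warmup_epochs=3):
        self.patience, self.min_delta, self.warmup = patience, min_delta, warmup_epochs
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Call once per epoch; returns True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best, self.bad_epochs = val_loss, 0
        elif epoch >= self.warmup:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

Requiring a min_delta-sized improvement keeps noise-level fluctuations (e.g. the 0.152 → 0.155 → 0.156 tail in the ResNet run) from resetting the patience counter.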

Additions based on integrated metrics:

  • ResNet: prefer LR=3e-4 with cosine+3 warmup; batch 16; standard augmentation; semi-hard mining; pretrained backbone.
  • ViT: 8 layers, 8 heads, FF×4, dropout 0.1; LR≈3.5e-4; batch 8; monitor calibration (ECE≈0.02) and AUC.

Metrics We Track (and why)

  • Triplet losses (train/val): Primary training signal.
  • Retrieval (R@k, mAP) on embeddings: Practical downstream utility.
  • Outfit hit rates: Alignment with human-perceived coherence.
  • Embedding diagnostics: norm stats, inter/intra distances, separation ratio.
  • Throughput/epoch times: Capacity planning, demo readiness.

Additional tracked metrics in this report:

  • ViT calibration (ECE/MCE/Brier) and ROC/PR AUC.
  • ResNet CMC curves and silhouette scores.

Derived metrics note: When classification metrics across sweeps were unavailable, we used triplet loss as a proxy indicator of retrieval/classification trends and clearly labeled those uses.

Condensed Summary (for slides)

  • Data scaling improves quality with diminishing returns: ResNet val triplet 0.183→0.152 (2k→106k), ViT 0.462→0.391 (5k→53k).
  • ResNet (full test): kNN acc 0.958; retrieval R@1/5/10 = 0.682/0.876/0.926; mAP 0.774; silhouette 0.392; latency ≈8.4 ms/img.
  • ViT (full test): Accuracy 0.908; F1 0.908; ROC-AUC 0.951; PR-AUC 0.934; ECE 0.021; hit@10 0.838; latency ≈1.8 ms/sequence.
  • Best configs: ResNet lr=3e-4, bs=16, standard aug, semi-hard; ViT 8×8 heads, dropout 0.1, lr=3.5e-4, bs=8.
  • Sensitivities: Too-high LR degrades final loss; larger batches slightly hurt triplet dynamics; standard aug > none/strong; pretrained > scratch.

Provenance: All numbers above are sourced directly from resnet_metrics_full.json and vit_metrics_full.json. Any extrapolations are labeled derived and should be validated before use in research claims.