# 🎯 How Dressify Recommendations Actually Work

## ✅ **YES - Both ResNet and ViT are used during inference!**

This document explains the complete recommendation pipeline and shows that both deep learning models are actively used.

---

## 📊 **Complete Recommendation Pipeline**

### **Step 1: Image Input & Category Detection**

**Location:** `inference.py:356-384`

```python
# User uploads wardrobe images (img0, img1, ... are PIL images)
items = [
    {"id": "item_0", "image": img0, "category": None},
    {"id": "item_1", "image": img1, "category": None},
    ...
]

# For each item:
for item in items:
    # 1. Auto-detect category using CLIP (if available),
    #    or fall back to filename-based detection
    category = self._detect_category_with_clip(item["image"])

    # 2. Generate an embedding if one is not already provided
    if item.get("embedding") is None:
        item["embedding"] = self.embed_images([item["image"]])[0]
```

**What happens:**
- Each clothing item image is processed
- Category is detected (shirt, pants, shoes, etc.) using CLIP or the filename
- If no embedding exists, one is generated using **ResNet**

---

### **Step 2: ResNet Generates Item Embeddings** ⭐

**Location:** `inference.py:313-337` → `embed_images()`

```python
@torch.inference_mode()
def embed_images(self, images: List[Image.Image]) -> List[np.ndarray]:
    # Transform images into a tensor batch
    batch = torch.stack([self.transform(img) for img in images])
    batch = batch.to(self.device, memory_format=torch.channels_last)

    # ✅ RESNET IS CALLED HERE!
    use_amp = (self.device == "cuda")
    with torch.autocast(device_type=("cuda" if use_amp else "cpu"), enabled=use_amp):
        emb = self.resnet(batch)  # <-- RESNET FORWARD PASS

    # Normalize embeddings
    emb = nn.functional.normalize(emb, dim=-1)
    result = [e.detach().cpu().numpy().astype(np.float32) for e in emb]
    return result
```

**What ResNet does:**
- Takes raw clothing item images (224×224 RGB)
- Passes them through a ResNet50 backbone (pretrained on ImageNet)
- Generates a **512-dimensional embedding** for each item
- These embeddings capture visual features (color, texture, style, pattern)

**Example:**
- Input: image of a blue shirt → ResNet → Output: `[0.123, -0.456, 0.789, ...]` (512-dim vector)

---

### **Step 3: Tag Processing & Context Building**

**Location:** `inference.py:490-545`

```python
# Process user tags (occasion, weather, style, etc.)
processed_tags = self.tag_processor.process_tags(context)

# Build an outfit template based on the tags
template = outfit_templates[outfit_style].copy()

# Apply weather/occasion modifications
# Generate constraints (min_items, max_items, accessory_limit)
```

**What happens:**
- User preferences (formal, cold weather, elegant style) are processed
- Outfit templates are selected and modified
- Constraints are generated (e.g., formal requires 4-5 items, including outerwear)

---

### **Step 4: Candidate Outfit Generation**

**Location:** `inference.py:910-1092`

```python
# Generate many candidate outfit combinations
candidates = []
for _ in range(num_samples):  # Typically 50-100+ candidates
    subset = []
    # Strategy-based generation:
    # - Strategy 0: Core outfit (shirt + pants + shoes + accessories)
    # - Strategy 1: Accessory-focused
    # - Strategy 2: Flexible combination

    # Add items based on context (formal, casual, etc.)
    if occasion == "formal" and outerwear:
        subset.append(jacket)
    subset.append(shirt)
    subset.append(pants)
    subset.append(shoes)

    candidates.append(subset)
```

**What happens:**
- The system generates **50-100+ candidate outfit combinations**
- Each candidate is a list of item indices (e.g., `[0, 3, 7, 12]`)
- Candidates are generated using:
  - Category pools (uppers, bottoms, shoes, outerwear, accessories)
  - Context-aware strategies (formal vs. casual)
  - Randomization for variety

---

### **Step 5: ViT Scores Outfit Compatibility** ⭐⭐

**Location:** `inference.py:1094-1103` → `score_subset()`

```python
def score_subset(idx_subset: List[int]) -> float:
    # Get embeddings for the items in this outfit
    embs = torch.tensor(
        np.stack([proc_items[i]["embedding"] for i in idx_subset], axis=0),
        dtype=torch.float32,
        device=self.device,
    )  # Shape: (N, 512), where N = number of items in the outfit
    embs = embs.unsqueeze(0)  # Shape: (1, N, 512) - batch dimension

    # ✅ VIT IS CALLED HERE!
    s = self.vit.score_compatibility(embs).item()  # <-- VIT FORWARD PASS
    return float(s)
```

**What ViT does:**
- Takes **multiple item embeddings** (e.g., jacket, shirt, pants, shoes)
- Passes them through the **Vision Transformer encoder**:
  - The Transformer processes the sequence of item embeddings
  - It learns relationships between items (do they go together?)
- Outputs a **compatibility score** (higher = better match)

**ViT Architecture:**

```python
# From models/vit_outfit.py
class OutfitCompatibilityModel(nn.Module):
    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) - a batch of outfits, each with N items and D-dim embeddings
        h = self.encoder(tokens)                 # Transformer encoder
        pooled = h.mean(dim=1)                   # Average pooling across items
        score = self.compatibility_head(pooled)  # Final compatibility score
        return score.squeeze(-1)
```

**Example:**
- Input: `[jacket_emb, shirt_emb, pants_emb, shoes_emb]` (4 items × 512 dims)
- ViT processing: the Transformer analyzes relationships between the items
- Output: `0.85` (high compatibility score)

---

### **Step 6: Scoring & Ranking**

**Location:** `inference.py:1266-1274`

```python
# Score all valid candidates
scored = []
for subset in valid_candidates:
    base_score = score_subset(subset)  # <-- ViT score (0.0 to ~1.0+)

    # Apply penalties and bonuses
    adjusted_score = calculate_outfit_penalty(subset, base_score)
    # - Penalties: missing categories, duplicates, wrong context
    # - Bonuses: color harmony, style coherence, complete sets

    scored.append((subset, adjusted_score, base_score))

# Sort by adjusted score (highest first)
scored.sort(key=lambda x: x[1], reverse=True)
```

**What happens:**
- Each candidate outfit gets:
  1. A **base score from ViT** (0.0 to ~1.0+)
  2. **Penalties** (e.g., -500 if formal without a jacket)
  3. **Bonuses** (e.g., +0.6 for color harmony, +0.4 for style coherence)
- Final score = base score + penalties + bonuses
- Outfits are ranked by final score

---

### **Step 7: Final Selection & Deduplication**

**Location:** `inference.py:1276-1300`

```python
# Remove duplicate outfits
seen_outfits = set()
unique_scored = []
for subset, adjusted_score, base_score in scored:
    normalized = normalize_outfit(subset)  # Sort item IDs
    if normalized not in seen_outfits:
        seen_outfits.add(normalized)
        unique_scored.append((subset, adjusted_score, base_score))

# Select the top N (with some randomization)
topk = unique_scored[:num_outfits]
```

**What happens:**
- Duplicate outfits (same items, different order) are removed
- The top N outfits are selected
- Some randomization is added for variety

---

## 🔍 **Proof: Both Models Are Used**

### **Evidence 1: ResNet Usage**

```python
# Line 330 in inference.py
emb = self.resnet(batch)  # ✅ ResNet forward pass
```

- Called in the `embed_images()` method
- Generates embeddings for every clothing item
- **Called during inference** when items don't have pre-computed embeddings

### **Evidence 2: ViT Usage**

```python
# Line 1102 in inference.py
s = self.vit.score_compatibility(embs).item()  # ✅ ViT forward pass
```

- Called in the `score_subset()` function
- Scores **every candidate outfit** (50-100+ times per recommendation request)
- **Called during inference** to rank outfit combinations

### **Evidence 3: Model Loading**

```python
# Lines 49-50, 285-286 in inference.py
self.resnet, self.resnet_loaded = self._load_resnet()
self.vit, self.vit_loaded = self._load_vit()

# Models are loaded and set to eval mode
if self.resnet_loaded:
    self.resnet = self.resnet.to(self.device).eval()
if self.vit_loaded:
    self.vit = self.vit.to(self.device).eval()
```

---

## 📈 **Complete Flow Diagram**

```
User Input
    ↓
[Upload Images] → [CLIP Category Detection]
    ↓
[ResNet Embedding Generation]  ← ✅ RESNET USED HERE
    ↓
[512-dim Embeddings for Each Item]
    ↓
[Tag Processing] → [Context Building]
    ↓
[Candidate Generation] → [50-100+ Outfit Combinations]
    ↓
[ViT Compatibility Scoring]  ← ✅ VIT USED HERE (50-100+ times)
    ↓
[Penalty/Bonus Adjustment]
    ↓
[Ranking & Deduplication]
    ↓
[Top N Recommendations]
```

---

## 🎯 **Key Points**

1. **ResNet is used:**
   - Generates an embedding for each clothing item
   - Called once per item (or cached embeddings are reused)
   - Output: 512-dimensional feature vectors

2. **ViT is used:**
   - Scores the compatibility of outfit combinations
   - Called **50-100+ times** per recommendation request (once per candidate)
   - Output: a compatibility score (0.0 to ~1.0+)

3. **Both models work together:**
   - ResNet provides item-level understanding
   - ViT provides outfit-level compatibility
   - Together they create personalized, context-aware recommendations

4. **The system is NOT just rule-based:**
   - Deep learning models (ResNet + ViT) provide the core intelligence
   - Rules and heuristics (penalties/bonuses) refine the results
   - Tags and context guide the generation process

---

## 🔬 **Technical Details**

### **ResNet Architecture:**
- **Backbone:** ResNet50 (pretrained on ImageNet)
- **Input:** 224×224 RGB images
- **Output:** 512-dimensional embeddings
- **Purpose:** Extract visual features from clothing items

### **ViT Architecture:**
- **Encoder:** Transformer with 4-6 layers and 8 attention heads
- **Input:** A sequence of item embeddings (variable length, 2-6 items)
- **Output:** A single compatibility score
- **Purpose:** Learn which items go well together

### **Training:**
- **ResNet:** Trained with triplet loss on item pairs
- **ViT:** Trained with triplet loss on outfit triplets (anchor, positive, negative)
- **Both:** Use early stopping and best-model checkpointing

---

## ✅ **Conclusion**

**YES - Both ResNet and ViT are actively used during inference!**

- **ResNet** generates item embeddings (visual understanding)
- **ViT** scores outfit compatibility (relationship learning)
- Together they create intelligent, personalized recommendations

The system is a **true deep learning pipeline**, not just rule-based filtering!
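
---

For intuition, the end-to-end flow above (embed → generate candidates → score → dedupe → rank) can be sketched with toy stand-ins for the two models. This is a minimal, NumPy-only illustration, not the real Dressify code: `fake_resnet_embed` and `fake_vit_score` are hypothetical placeholders for the ResNet and ViT forward passes, and the score here is just mean pairwise cosine similarity.

```python
import itertools
import numpy as np

EMB_DIM = 512  # matches the 512-dim embeddings described above

def fake_resnet_embed(item_id: str) -> np.ndarray:
    """Stand-in for ResNet: a deterministic, L2-normalized 512-dim vector per item."""
    rng = np.random.default_rng(abs(hash(item_id)) % (2**32))
    vec = rng.standard_normal(EMB_DIM)
    return vec / np.linalg.norm(vec)

def fake_vit_score(embs: np.ndarray) -> float:
    """Stand-in for the ViT scorer: mean pairwise cosine similarity of the outfit."""
    pairs = itertools.combinations(range(len(embs)), 2)
    return float(np.mean([embs[i] @ embs[j] for i, j in pairs]))

def recommend(wardrobe, num_outfits=3, outfit_size=3):
    # Step 2: one embedding per item
    embs = {item: fake_resnet_embed(item) for item in wardrobe}
    # Step 4: enumerate candidate combinations (the real system samples them)
    candidates = list(itertools.combinations(wardrobe, outfit_size))
    # Step 5: score every candidate
    scored = [(c, fake_vit_score(np.stack([embs[i] for i in c]))) for c in candidates]
    # Steps 6-7: rank by score, deduplicate (order-insensitive), keep top N
    seen, ranked = set(), []
    for c, s in sorted(scored, key=lambda x: x[1], reverse=True):
        key = frozenset(c)
        if key not in seen:
            seen.add(key)
            ranked.append((c, s))
    return ranked[:num_outfits]

wardrobe = ["shirt", "pants", "shoes", "jacket", "hat"]
for outfit, score in recommend(wardrobe):
    print(outfit, round(score, 3))
```

The real pipeline differs in the ways described above (learned scorer, context-aware candidate strategies, penalties/bonuses), but the control flow is the same shape.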