# How Dressify Recommendations Actually Work

## **YES: Both ResNet and ViT are used during inference**

This document explains the complete recommendation pipeline and shows that both deep learning models are actively used.

---

## **Complete Recommendation Pipeline**

### **Step 1: Image Input & Category Detection**

**Location:** `inference.py:356-384`
```python
# User uploads wardrobe images; categories start out unknown
items = [
    {"id": "item_0", "image": <PIL.Image>, "category": None},
    {"id": "item_1", "image": <PIL.Image>, "category": None},
    ...
]

for item in items:
    # 1. Auto-detect category using CLIP (if available),
    #    or fall back to filename-based detection
    category = self._detect_category_with_clip(item["image"])

    # 2. Generate an embedding if one was not provided
    if embedding is None:
        embedding = self.embed_images([item["image"]])[0]
```
**What happens:**
- Each clothing item image is processed
- The category (shirt, pants, shoes, etc.) is detected using CLIP or the filename
- If no embedding exists, one is generated using **ResNet**
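The filename fallback mentioned above can be sketched as a simple keyword lookup. This is a hypothetical illustration — the keyword table and function name are ours, not the actual implementation:

```python
from typing import Optional

# Illustrative keyword -> category table; not the real detection logic.
CATEGORY_KEYWORDS = {
    "shirt": "upper", "tee": "upper", "blouse": "upper",
    "pants": "bottom", "jeans": "bottom", "skirt": "bottom",
    "sneaker": "shoes", "boot": "shoes", "heel": "shoes",
    "jacket": "outerwear", "coat": "outerwear",
}

def detect_category_from_filename(filename: str) -> Optional[str]:
    name = filename.lower()
    for keyword, category in CATEGORY_KEYWORDS.items():
        if keyword in name:
            return category
    return None  # caller can fall back to a default or ask the user

print(detect_category_from_filename("blue_denim_jeans.jpg"))  # bottom
```

A CLIP-based detector would replace this lookup with zero-shot image-text matching; the filename path only matters when CLIP is unavailable.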
---

### **Step 2: ResNet Generates Item Embeddings**

**Location:** `inference.py:313-337`, `embed_images()`

```python
@torch.inference_mode()
def embed_images(self, images: List[Image.Image]) -> List[np.ndarray]:
    # Transform images into a single batched tensor
    batch = torch.stack([self.transform(img) for img in images])
    batch = batch.to(self.device, memory_format=torch.channels_last)

    # ResNet is called here
    use_amp = (self.device == "cuda")
    with torch.autocast(device_type=("cuda" if use_amp else "cpu"), enabled=use_amp):
        emb = self.resnet(batch)  # <-- ResNet forward pass

    # Normalize embeddings to unit length
    emb = nn.functional.normalize(emb, dim=-1)
    return [e.detach().cpu().numpy().astype(np.float32) for e in emb]
```
**What ResNet does:**
- Takes raw clothing item images (224×224 RGB)
- Passes them through a ResNet50 backbone (pretrained on ImageNet)
- Generates a **512-dimensional embedding** for each item
- These embeddings capture visual features (color, texture, style, pattern)

**Example:**
- Input: image of a blue shirt → ResNet → Output: `[0.123, -0.456, 0.789, ...]` (512-dim vector)
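The unit normalization in `embed_images()` matters: once every embedding lies on the unit sphere, a plain dot product between two items is their cosine similarity. A minimal NumPy sketch of that step (random vectors stand in for real ResNet outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(512,)).astype(np.float32)  # stand-in for a ResNet output

# L2-normalize, as nn.functional.normalize(emb, dim=-1) does
unit = emb / np.linalg.norm(emb)
print(float(np.linalg.norm(unit)))  # ~1.0

# Dot product between two unit vectors == cosine similarity, bounded in [-1, 1]
other = rng.normal(size=(512,)).astype(np.float32)
other = other / np.linalg.norm(other)
cos = float(unit @ other)
print(-1.0 <= cos <= 1.0)  # True
```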
---

### **Step 3: Tag Processing & Context Building**

**Location:** `inference.py:490-545`

```python
# Process user tags (occasion, weather, style, etc.)
processed_tags = self.tag_processor.process_tags(context)

# Build an outfit template based on the tags
template = outfit_templates[outfit_style].copy()

# Apply weather/occasion modifications and generate constraints
# (min_items, max_items, accessory_limit)
```
**What happens:**
- User preferences (e.g., formal occasion, cold weather, elegant style) are processed
- An outfit template is selected and modified
- Constraints are generated (e.g., formal requires 4-5 items and needs outerwear)
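The template-plus-modifier idea can be sketched as follows. The dictionary keys and values here are assumptions for illustration, not the real `outfit_templates` structure:

```python
# Hypothetical template/constraint logic; keys and values are illustrative.
outfit_templates = {
    "formal": {"min_items": 4, "max_items": 5, "required": ["upper", "bottom", "shoes"]},
    "casual": {"min_items": 3, "max_items": 4, "required": ["upper", "bottom", "shoes"]},
}

def build_constraints(style: str, weather: str) -> dict:
    template = outfit_templates[style].copy()
    template["required"] = list(template["required"])  # avoid mutating the shared list
    if weather == "cold":
        template["required"].append("outerwear")  # cold weather forces a jacket/coat
        template["min_items"] += 1
    return template

constraints = build_constraints("formal", "cold")
print(constraints["required"])   # ['upper', 'bottom', 'shoes', 'outerwear']
print(constraints["min_items"])  # 5
```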
---

### **Step 4: Candidate Outfit Generation**

**Location:** `inference.py:910-1092`

```python
# Generate many candidate outfit combinations
candidates = []
for _ in range(num_samples):  # typically 50-100+ candidates
    subset = []
    # Strategy-based generation:
    #   - Strategy 0: core outfit (shirt + pants + shoes + accessories)
    #   - Strategy 1: accessory-focused
    #   - Strategy 2: flexible combination
    # Add items based on context (formal, casual, etc.)
    if occasion == "formal" and outerwear:
        subset.append(jacket)
    subset.append(shirt)
    subset.append(pants)
    subset.append(shoes)
    candidates.append(subset)
```
**What happens:**
- The system generates **50-100+ candidate outfit combinations**
- Each candidate is a list of item indices (e.g., `[0, 3, 7, 12]`)
- Candidates are generated using:
  - Category pools (uppers, bottoms, shoes, outerwear, accessories)
  - Context-aware strategies (formal vs. casual)
  - Randomization for variety
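The pool-based sampling above can be sketched in a few lines. The pool contents and strategy are hypothetical, but the shape of the output (lists of item indices) matches the description:

```python
import random

# Illustrative category pools mapping to item indices; not the real data.
pools = {
    "upper": [0, 1, 2],
    "bottom": [3, 4],
    "shoes": [5, 6],
    "outerwear": [7],
    "accessory": [8, 9, 10],
}

def sample_candidate(rng, formal):
    # Core outfit: one item per required category
    subset = [rng.choice(pools["upper"]),
              rng.choice(pools["bottom"]),
              rng.choice(pools["shoes"])]
    if formal and pools["outerwear"]:
        subset.append(rng.choice(pools["outerwear"]))
    if rng.random() < 0.5:  # randomization for variety
        subset.append(rng.choice(pools["accessory"]))
    return subset

rng = random.Random(42)
candidates = [sample_candidate(rng, formal=True) for _ in range(50)]
print(len(candidates))                  # 50
print(all(7 in c for c in candidates))  # True: formal always adds the jacket
```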
---

### **Step 5: ViT Scores Outfit Compatibility**

**Location:** `inference.py:1094-1103`, `score_subset()`

```python
def score_subset(idx_subset: List[int]) -> float:
    # Gather the embeddings for the items in this outfit
    embs = torch.tensor(
        np.stack([proc_items[i]["embedding"] for i in idx_subset], axis=0),
        dtype=torch.float32,
        device=self.device,
    )  # shape: (N, 512), where N = number of items in the outfit
    embs = embs.unsqueeze(0)  # shape: (1, N, 512) - add a batch dimension

    # ViT is called here
    s = self.vit.score_compatibility(embs).item()  # <-- ViT forward pass
    return float(s)
```
**What ViT does:**
- Takes **multiple item embeddings** (e.g., jacket, shirt, pants, shoes)
- Passes them through the **Vision Transformer encoder**:
  - The Transformer processes the sequence of item embeddings
  - It learns relationships between items (do they go together?)
- Outputs a **compatibility score** (higher = better match)

**ViT Architecture:**

```python
# From models/vit_outfit.py
class OutfitCompatibilityModel(nn.Module):
    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) - a batch of outfits, each with N items and D-dim embeddings
        h = self.encoder(tokens)                 # Transformer encoder
        pooled = h.mean(dim=1)                   # average pooling across items
        score = self.compatibility_head(pooled)  # final compatibility score
        return score.squeeze(-1)
```
**Example:**
- Input: `[jacket_emb, shirt_emb, pants_emb, shoes_emb]` (4 items × 512 dims)
- ViT processing: the Transformer analyzes relationships between items
- Output: `0.85` (a high compatibility score)
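A minimal runnable sketch of this forward pass, mirroring the `OutfitCompatibilityModel` structure shown above. The class name, layer count, and head size here are illustrative assumptions, not the project's exact configuration:

```python
import torch
import torch.nn as nn

class TinyOutfitScorer(nn.Module):
    """Sketch of a Transformer-based outfit scorer (assumed hyperparameters)."""

    def __init__(self, dim=512, heads=8, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.compatibility_head = nn.Linear(dim, 1)

    def forward(self, tokens):
        h = self.encoder(tokens)            # (B, N, D): contextualized item tokens
        pooled = h.mean(dim=1)              # (B, D): average over the N items
        return self.compatibility_head(pooled).squeeze(-1)  # (B,) scores

model = TinyOutfitScorer().eval()
with torch.inference_mode():
    outfit = torch.randn(1, 4, 512)         # e.g., jacket, shirt, pants, shoes
    score = model(outfit)
print(tuple(score.shape))  # (1,)
```

Because the encoder is permutation-aware only through attention (no positional encodings here), mean pooling makes the score largely independent of item order, which fits scoring unordered outfits.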
---

### **Step 6: Scoring & Ranking**

**Location:** `inference.py:1266-1274`

```python
# Score all valid candidates
scored = []
for subset in valid_candidates:
    base_score = score_subset(subset)  # <-- ViT score (0.0 to ~1.0+)
    # Apply penalties and bonuses:
    #   - penalties: missing categories, duplicates, wrong context
    #   - bonuses: color harmony, style coherence, complete sets
    adjusted_score = calculate_outfit_penalty(subset, base_score)
    scored.append((subset, adjusted_score, base_score))

# Sort by adjusted score (highest first)
scored.sort(key=lambda x: x[1], reverse=True)
```
**What happens:**
- Each candidate outfit gets:
  1. A **base score from ViT** (0.0 to ~1.0+)
  2. **Penalties** (e.g., -500 if formal without a jacket)
  3. **Bonuses** (e.g., +0.6 for color harmony, +0.4 for style coherence)
- Final score = base score + penalties + bonuses
- Outfits are ranked by final score
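The adjustment step can be sketched as plain arithmetic on the base score. The weights below mirror the examples above, but the function itself is an illustration, not the real `calculate_outfit_penalty`:

```python
# Illustrative penalty/bonus adjustment; weights match the examples above.
def adjust_score(base_score, categories, occasion, color_harmony):
    score = base_score
    if occasion == "formal" and "outerwear" not in categories:
        score -= 500.0  # hard penalty: a formal outfit without a jacket loses
    if color_harmony:
        score += 0.6    # bonus for a harmonious color palette
    return score

good = adjust_score(0.85, {"upper", "bottom", "shoes", "outerwear"},
                    "formal", color_harmony=True)
bad = adjust_score(0.90, {"upper", "bottom", "shoes"},
                   "formal", color_harmony=False)
print(good > bad)  # True: the complete outfit wins despite a lower base score
```

This is why the system is not purely model-driven: a candidate with the highest ViT score can still lose to one that satisfies the contextual constraints.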
---

### **Step 7: Final Selection & Deduplication**

**Location:** `inference.py:1276-1300`

```python
# Remove duplicate outfits
seen_outfits = set()
unique_scored = []
for subset, adjusted_score, base_score in scored:
    normalized = normalize_outfit(subset)  # sort item IDs
    if normalized not in seen_outfits:
        seen_outfits.add(normalized)
        unique_scored.append((subset, adjusted_score, base_score))

# Select the top N (with some randomization)
topk = unique_scored[:num_outfits]
```
**What happens:**
- Duplicate outfits (same items in a different order) are removed
- The top N outfits are selected
- Some randomization is added for variety
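Order-insensitive deduplication hinges on a canonical key per outfit. Assuming `normalize_outfit()` simply sorts the item indices into a hashable tuple (a common approach; the helper name below is from the excerpt, the body is our guess):

```python
def normalize_outfit(subset):
    # Canonical, hashable key: same items in any order map to the same tuple
    return tuple(sorted(subset))

scored = [([3, 0, 7], 1.2), ([0, 3, 7], 1.2), ([1, 4, 6], 0.9)]
seen, unique = set(), []
for subset, score in scored:
    key = normalize_outfit(subset)
    if key not in seen:
        seen.add(key)
        unique.append((subset, score))

print(len(unique))  # 2: [3, 0, 7] and [0, 3, 7] collapse into one outfit
```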
---

## **Proof: Both Models Are Used**

### **Evidence 1: ResNet Usage**

```python
# Line 330 in inference.py
emb = self.resnet(batch)  # <-- ResNet forward pass
```

- Called in the `embed_images()` method
- Generates embeddings for every clothing item
- **Called during inference** whenever items don't have pre-computed embeddings

### **Evidence 2: ViT Usage**

```python
# Line 1102 in inference.py
s = self.vit.score_compatibility(embs).item()  # <-- ViT forward pass
```

- Called in the `score_subset()` function
- Scores **every candidate outfit** (50-100+ times per recommendation request)
- **Called during inference** to rank outfit combinations

### **Evidence 3: Model Loading**

```python
# Lines 49-50 and 285-286 in inference.py
self.resnet, self.resnet_loaded = self._load_resnet()
self.vit, self.vit_loaded = self._load_vit()

# Models are moved to the device and set to eval mode
if self.resnet_loaded:
    self.resnet = self.resnet.to(self.device).eval()
if self.vit_loaded:
    self.vit = self.vit.to(self.device).eval()
```
---

## **Complete Flow Diagram**

```
User Input
    ↓
[Upload Images] → [CLIP Category Detection]
    ↓
[ResNet Embedding Generation]        <-- RESNET USED HERE
    ↓
[512-dim Embeddings for Each Item]
    ↓
[Tag Processing] → [Context Building]
    ↓
[Candidate Generation] → [50-100+ Outfit Combinations]
    ↓
[ViT Compatibility Scoring]          <-- VIT USED HERE (50-100+ times)
    ↓
[Penalty/Bonus Adjustment]
    ↓
[Ranking & Deduplication]
    ↓
[Top N Recommendations]
```
| --- | |
| ## π― **Key Points** | |
| 1. **ResNet is used:** | |
| - Generates embeddings for each clothing item | |
| - Called once per item (or uses cached embeddings) | |
| - Output: 512-dimensional feature vectors | |
| 2. **ViT is used:** | |
| - Scores compatibility of outfit combinations | |
| - Called **50-100+ times** per recommendation request (once per candidate) | |
| - Output: Compatibility score (0.0 to ~1.0+) | |
| 3. **Both models work together:** | |
| - ResNet provides item-level understanding | |
| - ViT provides outfit-level compatibility | |
| - Together they create personalized, context-aware recommendations | |
| 4. **The system is NOT just rule-based:** | |
| - Deep learning models (ResNet + ViT) provide the core intelligence | |
| - Rules and heuristics (penalties/bonuses) refine the results | |
| - Tags and context guide the generation process | |
| --- | |
---

## **Technical Details**

### **ResNet Architecture**
- **Backbone:** ResNet50 (pretrained on ImageNet)
- **Input:** 224×224 RGB images
- **Output:** 512-dimensional embeddings
- **Purpose:** extract visual features from clothing items

### **ViT Architecture**
- **Encoder:** Transformer with 4-6 layers and 8 attention heads
- **Input:** a sequence of item embeddings (variable length, 2-6 items)
- **Output:** a single compatibility score
- **Purpose:** learn which items go well together

### **Training**
- **ResNet:** trained with triplet loss on item pairs
- **ViT:** trained with triplet loss on outfit triplets (anchor, positive, negative)
- **Both:** use early stopping and best-model checkpointing
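The triplet objective can be illustrated with PyTorch's built-in loss. This is a generic sketch of the training signal, not the project's actual training loop; the margin value and batch shapes are arbitrary examples:

```python
import torch
import torch.nn as nn

# Generic triplet-loss illustration (margin chosen arbitrarily).
triplet = nn.TripletMarginLoss(margin=0.2)

anchor   = torch.randn(8, 512, requires_grad=True)  # anchor embeddings
positive = torch.randn(8, 512)                      # compatible examples
negative = torch.randn(8, 512)                      # incompatible examples

loss = triplet(anchor, positive, negative)
loss.backward()                  # gradients flow back toward the embedding model
print(loss.item() >= 0.0)        # True: the hinge clamps the loss at zero
```

The loss pulls anchors toward positives and pushes them away from negatives until the margin is satisfied, which is what shapes both the item embeddings and the outfit scores.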
---

## **Conclusion**

**YES: both ResNet and ViT are actively used during inference.**

- **ResNet** generates item embeddings (visual understanding)
- **ViT** scores outfit compatibility (relationship learning)
- Together they create intelligent, personalized recommendations

The system is a **true deep learning pipeline**, not just rule-based filtering!