# How Dressify Recommendations Actually Work
## YES: Both ResNet and ViT Are Used During Inference
This document explains the complete recommendation pipeline and shows that both deep learning models are actively used.
---
## **Complete Recommendation Pipeline**
### **Step 1: Image Input & Category Detection**
**Location:** `inference.py:356-384`
```python
# User uploads wardrobe images
items = [
    {"id": "item_0", "image": <PIL.Image>, "category": None},
    {"id": "item_1", "image": <PIL.Image>, "category": None},
    ...
]

# For each item:
for item in items:
    # 1. Auto-detect category using CLIP (if available),
    #    or fall back to filename-based detection
    category = self._detect_category_with_clip(item["image"])

    # 2. Generate an embedding if one was not provided
    if item.get("embedding") is None:
        item["embedding"] = self.embed_images([item["image"]])[0]
```
**What happens:**
- Each clothing item image is processed
- Category is detected (shirt, pants, shoes, etc.) using CLIP or filename
- If no embedding exists, it's generated using **ResNet**
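The filename fallback mentioned above can be sketched as follows. Both `CATEGORY_KEYWORDS` and the function name are illustrative assumptions; the real keyword lists and CLIP path live in `inference.py`.

```python
from typing import Optional

# Hypothetical keyword table; the actual lists live in inference.py.
CATEGORY_KEYWORDS = {
    "upper": ["shirt", "tee", "blouse", "sweater"],
    "bottom": ["pants", "jeans", "skirt", "shorts"],
    "shoes": ["shoe", "sneaker", "boot", "heel"],
    "outerwear": ["jacket", "coat", "blazer"],
}

def detect_category_from_filename(filename: str) -> Optional[str]:
    """Return the first category whose keyword appears in the filename."""
    name = filename.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in name for keyword in keywords):
            return category
    return None  # caller falls back to CLIP, or leaves the category unset
```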
---
### **Step 2: ResNet Generates Item Embeddings**
**Location:** `inference.py:313-337` → `embed_images()`
```python
@torch.inference_mode()
def embed_images(self, images: List[Image.Image]) -> List[np.ndarray]:
    # Transform images into a batched tensor
    batch = torch.stack([self.transform(img) for img in images])
    batch = batch.to(self.device, memory_format=torch.channels_last)

    # RESNET IS CALLED HERE
    use_amp = (self.device == "cuda")
    with torch.autocast(device_type=("cuda" if use_amp else "cpu"), enabled=use_amp):
        emb = self.resnet(batch)  # <-- ResNet forward pass

    # L2-normalize the embeddings
    emb = nn.functional.normalize(emb, dim=-1)
    result = [e.detach().cpu().numpy().astype(np.float32) for e in emb]
    return result
```
**What ResNet does:**
- Takes raw clothing item images (224×224 RGB)
- Passes through ResNet50 backbone (pretrained on ImageNet)
- Generates **512-dimensional embeddings** for each item
- These embeddings capture visual features (color, texture, style, pattern)
**Example:**
- Input: image of a blue shirt → ResNet → Output: `[0.123, -0.456, 0.789, ...]` (512-dim vector)
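Because `embed_images()` L2-normalizes its output, similarity between two items reduces to a plain dot product. A minimal NumPy sketch of that property, with random vectors standing in for real ResNet embeddings:

```python
import numpy as np

# Random 512-dim vectors stand in for real ResNet embeddings.
rng = np.random.default_rng(0)
emb_a = rng.standard_normal(512).astype(np.float32)
emb_b = rng.standard_normal(512).astype(np.float32)

# Same effect as nn.functional.normalize(emb, dim=-1) in embed_images()
emb_a /= np.linalg.norm(emb_a)
emb_b /= np.linalg.norm(emb_b)

# For unit vectors, the dot product IS the cosine similarity, in [-1, 1]
cosine = float(emb_a @ emb_b)
```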
---
### **Step 3: Tag Processing & Context Building**
**Location:** `inference.py:490-545`
```python
# Process user tags (occasion, weather, style, etc.)
processed_tags = self.tag_processor.process_tags(context)
# Build outfit template based on tags
template = outfit_templates[outfit_style].copy()
# Apply weather/occasion modifications
# Generate constraints (min_items, max_items, accessory_limit)
```
**What happens:**
- User preferences (formal, cold weather, elegant style) are processed
- Outfit templates are selected and modified
- Constraints are generated (e.g., formal requires 4-5 items, needs outerwear)
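A minimal sketch of the template-plus-constraints step. The template contents, the `build_constraints` name, and the cold-weather rule are illustrative assumptions; the real templates in `inference.py` are richer.

```python
# Hypothetical templates; the real ones in inference.py are richer.
OUTFIT_TEMPLATES = {
    "formal": {"required": ["upper", "bottom", "shoes"], "min_items": 4, "max_items": 5},
    "casual": {"required": ["upper", "bottom", "shoes"], "min_items": 3, "max_items": 5},
}

def build_constraints(style: str, weather: str) -> dict:
    """Copy the base template, then apply weather modifications."""
    template = dict(OUTFIT_TEMPLATES[style])
    if weather == "cold":
        # Cold weather forces an outerwear slot into the outfit
        template["required"] = template["required"] + ["outerwear"]
    return template
```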
---
### **Step 4: Candidate Outfit Generation**
**Location:** `inference.py:910-1092`
```python
# Generate many candidate outfit combinations
candidates = []
for _ in range(num_samples):  # typically 50-100+ candidates
    subset = []
    # Strategy-based generation:
    # - Strategy 0: core outfit (shirt + pants + shoes + accessories)
    # - Strategy 1: accessory-focused
    # - Strategy 2: flexible combination

    # Add items based on context (formal, casual, etc.)
    if occasion == "formal" and outerwear:
        subset.append(jacket)
    subset.append(shirt)
    subset.append(pants)
    subset.append(shoes)
    candidates.append(subset)
```
**What happens:**
- System generates **50-100+ candidate outfit combinations**
- Each candidate is a list of item indices (e.g., `[0, 3, 7, 12]`)
- Candidates are generated using:
- Category pools (uppers, bottoms, shoes, outerwear, accessories)
- Context-aware strategies (formal vs casual)
- Randomization for variety
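The pool-based sampling above can be sketched as follows. This is an illustrative, purely random sampler; the real strategy mix in `inference.py` is context-aware, and the function name is an assumption.

```python
import random

def generate_candidates(pools: dict, num_samples: int, seed: int = 0) -> list:
    """Sample candidate outfits (lists of item indices) from category pools."""
    rng = random.Random(seed)
    candidates = []
    for _ in range(num_samples):
        # Core outfit: one upper, one bottom, one pair of shoes
        subset = [
            rng.choice(pools["upper"]),
            rng.choice(pools["bottom"]),
            rng.choice(pools["shoes"]),
        ]
        # Randomly add outerwear for variety when the pool has any
        if pools.get("outerwear") and rng.random() < 0.5:
            subset.append(rng.choice(pools["outerwear"]))
        candidates.append(subset)
    return candidates
```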
---
### **Step 5: ViT Scores Outfit Compatibility**
**Location:** `inference.py:1094-1103` → `score_subset()`
```python
def score_subset(idx_subset: List[int]) -> float:
    # Get embeddings for the items in this outfit
    embs = torch.tensor(
        np.stack([proc_items[i]["embedding"] for i in idx_subset], axis=0),
        dtype=torch.float32,
        device=self.device,
    )  # shape: (N, 512), where N = number of items in the outfit
    embs = embs.unsqueeze(0)  # shape: (1, N, 512) - add batch dimension

    # VIT IS CALLED HERE
    s = self.vit.score_compatibility(embs).item()  # <-- ViT forward pass
    return float(s)
```
**What ViT does:**
- Takes **multiple item embeddings** (e.g., jacket, shirt, pants, shoes)
- Passes through **Vision Transformer encoder**:
- Transformer processes the sequence of item embeddings
- Learns relationships between items (do they go together?)
- Outputs a **compatibility score** (higher = better match)
**ViT Architecture:**
```python
# From models/vit_outfit.py
class OutfitCompatibilityModel(nn.Module):
    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) - batch of outfits, each with N items of D-dim embeddings
        h = self.encoder(tokens)                 # transformer encoder
        pooled = h.mean(dim=1)                   # average pooling across items
        score = self.compatibility_head(pooled)  # final compatibility score
        return score.squeeze(-1)
```
**Example:**
- Input: `[jacket_emb, shirt_emb, pants_emb, shoes_emb]` (4 items × 512 dims)
- ViT Processing: Transformer analyzes relationships
- Output: `0.85` (high compatibility score)
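To make the interface concrete, here is a runnable stand-in with the same contract as `score_compatibility()`: a stack of N item embeddings in, one scalar out. It scores outfits by mean pairwise cosine similarity; this is a heuristic for illustration only, whereas the real model replaces it with the learned transformer shown above.

```python
import numpy as np

def score_compatibility_stub(embs: np.ndarray) -> float:
    """Score an outfit given embs of shape (N, D); higher = more coherent.

    Heuristic stand-in: mean pairwise cosine similarity instead of the
    learned transformer used by the actual model.
    """
    normed = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sims = normed @ normed.T  # (N, N) cosine-similarity matrix
    n = len(embs)
    off_diagonal = sims.sum() - np.trace(sims)  # drop self-similarity
    return float(off_diagonal / (n * (n - 1)))
```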
---
### **Step 6: Scoring & Ranking**
**Location:** `inference.py:1266-1274`
```python
# Score all valid candidates
scored = []
for subset in valid_candidates:
    base_score = score_subset(subset)  # <-- ViT score (0.0 to ~1.0+)

    # Apply penalties and bonuses:
    # - penalties: missing categories, duplicates, wrong context
    # - bonuses: color harmony, style coherence, complete sets
    adjusted_score = calculate_outfit_penalty(subset, base_score)

    scored.append((subset, adjusted_score, base_score))

# Sort by adjusted score (highest first)
scored.sort(key=lambda x: x[1], reverse=True)
```
**What happens:**
- Each candidate outfit gets:
1. **Base score from ViT** (0.0 to ~1.0+)
2. **Penalties** (e.g., -500 if formal without jacket)
3. **Bonuses** (e.g., +0.6 for color harmony, +0.4 for style coherence)
- Final score = base_score + penalties + bonuses
- Outfits are ranked by final score
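A minimal sketch of the penalty step. The -500 value comes from the text above; the function name, signature, and category check are simplified assumptions about the real adjustment logic.

```python
def adjust_score(categories: list, base_score: float, occasion: str) -> float:
    """Adjust a ViT base score with rule-based penalties (simplified)."""
    score = base_score
    if occasion == "formal" and "outerwear" not in categories:
        # A huge penalty effectively rejects formal outfits without a jacket
        score -= 500.0
    return score
```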
---
### **Step 7: Final Selection & Deduplication**
**Location:** `inference.py:1276-1300`
```python
# Remove duplicate outfits
seen_outfits = set()
unique_scored = []
for subset, adjusted_score, base_score in scored:
    normalized = normalize_outfit(subset)  # sort item IDs
    if normalized not in seen_outfits:
        seen_outfits.add(normalized)
        unique_scored.append((subset, adjusted_score, base_score))

# Select the top N (with some randomization)
topk = unique_scored[:num_outfits]
```
**What happens:**
- Duplicate outfits (same items, different order) are removed
- Top N outfits are selected
- Some randomization is added for variety
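The deduplication hinges on an order-insensitive key. A `frozenset` of item IDs is one way to sketch what `normalize_outfit` must achieve (illustrative only, assuming IDs are hashable):

```python
def dedupe_outfits(scored: list) -> list:
    """Keep the first (highest-scored) copy of each outfit.

    A frozenset key treats [0, 3, 7] and [7, 0, 3] as the same outfit.
    """
    seen, unique = set(), []
    for subset, adjusted_score, base_score in scored:
        key = frozenset(subset)
        if key not in seen:
            seen.add(key)
            unique.append((subset, adjusted_score, base_score))
    return unique
```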
---
## **Proof: Both Models Are Used**
### **Evidence 1: ResNet Usage**
```python
# Line 330 in inference.py
emb = self.resnet(batch)  # ResNet forward pass
```
- Called in `embed_images()` method
- Generates embeddings for every clothing item
- **Called during inference** when items don't have pre-computed embeddings
### **Evidence 2: ViT Usage**
```python
# Line 1102 in inference.py
s = self.vit.score_compatibility(embs).item()  # ViT forward pass
```
- Called in `score_subset()` function
- Scores **every candidate outfit** (50-100+ times per recommendation request)
- **Called during inference** to rank outfit combinations
### **Evidence 3: Model Loading**
```python
# Lines 49-50 and 285-286 in inference.py
self.resnet, self.resnet_loaded = self._load_resnet()
self.vit, self.vit_loaded = self._load_vit()

# Models are loaded and set to eval mode
if self.resnet_loaded:
    self.resnet = self.resnet.to(self.device).eval()
if self.vit_loaded:
    self.vit = self.vit.to(self.device).eval()
```
---
## **Complete Flow Diagram**
```
User Input
    ↓
[Upload Images] → [CLIP Category Detection]
    ↓
[ResNet Embedding Generation]  ← RESNET USED HERE
    ↓
[512-dim Embeddings for Each Item]
    ↓
[Tag Processing] → [Context Building]
    ↓
[Candidate Generation] → [50-100+ Outfit Combinations]
    ↓
[ViT Compatibility Scoring]  ← VIT USED HERE (50-100+ times)
    ↓
[Penalty/Bonus Adjustment]
    ↓
[Ranking & Deduplication]
    ↓
[Top N Recommendations]
```
---
## **Key Points**
1. **ResNet is used:**
- Generates embeddings for each clothing item
- Called once per item (or uses cached embeddings)
- Output: 512-dimensional feature vectors
2. **ViT is used:**
- Scores compatibility of outfit combinations
- Called **50-100+ times** per recommendation request (once per candidate)
- Output: Compatibility score (0.0 to ~1.0+)
3. **Both models work together:**
- ResNet provides item-level understanding
- ViT provides outfit-level compatibility
- Together they create personalized, context-aware recommendations
4. **The system is NOT just rule-based:**
- Deep learning models (ResNet + ViT) provide the core intelligence
- Rules and heuristics (penalties/bonuses) refine the results
- Tags and context guide the generation process
---
## **Technical Details**
### **ResNet Architecture:**
- **Backbone:** ResNet50 (pretrained on ImageNet)
- **Input:** 224×224 RGB images
- **Output:** 512-dimensional embeddings
- **Purpose:** Extract visual features from clothing items
### **ViT Architecture:**
- **Encoder:** Transformer with 4-6 layers, 8 attention heads
- **Input:** Sequence of item embeddings (variable length, 2-6 items)
- **Output:** Single compatibility score
- **Purpose:** Learn which items go well together
### **Training:**
- **ResNet:** Trained with triplet loss on item pairs
- **ViT:** Trained with triplet loss on outfit triplets (anchor, positive, negative)
- **Both:** Use early stopping, best model checkpointing
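The triplet objective both models train with can be sketched in a few lines of NumPy. This is the standard Euclidean-distance variant with a hypothetical margin of 0.2; the actual margin and distance metric are training-config details not shown in this document.

```python
import numpy as np

def triplet_loss(anchor: np.ndarray, positive: np.ndarray,
                 negative: np.ndarray, margin: float = 0.2) -> float:
    """Hinge loss: pull anchor toward positive, push it past negative by `margin`."""
    d_pos = np.linalg.norm(anchor - positive)  # anchor-positive distance
    d_neg = np.linalg.norm(anchor - negative)  # anchor-negative distance
    return float(max(0.0, d_pos - d_neg + margin))
```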
---
## **Conclusion**
**YES: both ResNet and ViT are actively used during inference.**
- **ResNet** generates item embeddings (visual understanding)
- **ViT** scores outfit compatibility (relationship learning)
- Together they create intelligent, personalized recommendations
The system is a **true deep learning pipeline**, not just rule-based filtering!