# 🎯 How Dressify Recommendations Actually Work

## βœ… **YES - Both ResNet and ViT are used during inference!**

This document explains the complete recommendation pipeline and proves that both deep learning models are actively used.

---

## πŸ“Š **Complete Recommendation Pipeline**

### **Step 1: Image Input & Category Detection** 
**Location:** `inference.py:356-384`

```python
# User uploads wardrobe images
items = [
    {"id": "item_0", "image": <PIL.Image>, "category": None},
    {"id": "item_1", "image": <PIL.Image>, "category": None},
    ...
]

# For each item:
for item in items:
    # 1. Auto-detect category using CLIP (if available)
    category = self._detect_category_with_clip(item["image"])
    # OR fallback to filename-based detection
    
    # 2. Generate embedding if not provided
    if embedding is None:
        embedding = self.embed_images([item["image"]])[0]
```

**What happens:**
- Each clothing item image is processed
- Category (shirt, pants, shoes, etc.) is detected using CLIP or, as a fallback, the filename
- If no embedding exists, it's generated using **ResNet**
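
The filename fallback mentioned above can be sketched as a simple keyword match. This is a hypothetical helper — the keyword lists and category names here are illustrative, not the project's actual mapping:

```python
from typing import Optional

# Hypothetical keyword lists; the real mapping lives in inference.py
CATEGORY_KEYWORDS = {
    "upper": ["shirt", "tshirt", "blouse", "sweater", "top"],
    "bottom": ["pants", "jeans", "skirt", "shorts", "trousers"],
    "shoes": ["shoes", "sneakers", "boots", "heels"],
    "outerwear": ["jacket", "coat", "blazer"],
}

def detect_category_from_filename(filename: str) -> Optional[str]:
    name = filename.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(kw in name for kw in keywords):
            return category
    return None  # unknown: left for CLIP or manual tagging

print(detect_category_from_filename("blue_shirt_01.jpg"))  # upper
```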

---

### **Step 2: ResNet Generates Item Embeddings** ⭐
**Location:** `inference.py:313-337` β†’ `embed_images()`

```python
@torch.inference_mode()
def embed_images(self, images: List[Image.Image]) -> List[np.ndarray]:
    # Transform images to tensor
    batch = torch.stack([self.transform(img) for img in images])
    batch = batch.to(self.device, memory_format=torch.channels_last)
    
    # βœ… RESNET IS CALLED HERE!
    use_amp = (self.device == "cuda")
    with torch.autocast(device_type=("cuda" if use_amp else "cpu"), enabled=use_amp):
        emb = self.resnet(batch)  # <-- RESNET FORWARD PASS
    
    # Normalize embeddings
    emb = nn.functional.normalize(emb, dim=-1)
    result = [e.detach().cpu().numpy().astype(np.float32) for e in emb]
    return result
```

**What ResNet does:**
- Takes raw clothing item images (224x224 RGB)
- Passes through ResNet50 backbone (pretrained on ImageNet)
- Generates **512-dimensional embeddings** for each item
- These embeddings capture visual features (color, texture, style, pattern)

**Example:**
- Input: Image of a blue shirt β†’ ResNet β†’ Output: `[0.123, -0.456, 0.789, ...]` (512-dim vector)
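
The normalization step in `embed_images()` gives every embedding unit length, which makes a plain dot product equal to cosine similarity. A quick numpy check of that property:

```python
import numpy as np

# L2-normalize a batch of raw embeddings, mirroring what
# nn.functional.normalize(emb, dim=-1) does in embed_images()
raw = np.random.randn(3, 512).astype(np.float32)
emb = raw / np.linalg.norm(raw, axis=-1, keepdims=True)

# Every row now has unit norm ...
print(np.allclose(np.linalg.norm(emb, axis=-1), 1.0))  # True
# ... so a dot product between two embeddings IS their cosine similarity
cos_sim = float(emb[0] @ emb[1])
```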

---

### **Step 3: Tag Processing & Context Building**
**Location:** `inference.py:490-545`

```python
# Process user tags (occasion, weather, style, etc.)
processed_tags = self.tag_processor.process_tags(context)

# Build outfit template based on tags
template = outfit_templates[outfit_style].copy()
# Apply weather/occasion modifications
# Generate constraints (min_items, max_items, accessory_limit)
```

**What happens:**
- User preferences (formal, cold weather, elegant style) are processed
- Outfit templates are selected and modified
- Constraints are generated (e.g., formal requires 4-5 items, needs outerwear)
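
A minimal sketch of how tags might translate into constraints. The template names and dictionary keys here are assumptions for illustration, not the actual `outfit_templates` schema:

```python
# Hypothetical templates; key names are illustrative
OUTFIT_TEMPLATES = {
    "formal": {"min_items": 4, "max_items": 5, "require": ["outerwear"]},
    "casual": {"min_items": 3, "max_items": 4, "require": []},
}

def build_constraints(occasion: str, weather: str) -> dict:
    template = OUTFIT_TEMPLATES.get(occasion, OUTFIT_TEMPLATES["casual"]).copy()
    template["require"] = list(template["require"])
    # Weather modification: cold weather forces outerwear into the outfit
    if weather == "cold" and "outerwear" not in template["require"]:
        template["require"].append("outerwear")
    return template

build_constraints("casual", "cold")
# {'min_items': 3, 'max_items': 4, 'require': ['outerwear']}
```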

---

### **Step 4: Candidate Outfit Generation**
**Location:** `inference.py:910-1092`

```python
# Generate many candidate outfit combinations
candidates = []
for _ in range(num_samples):  # Typically 50-100+ candidates
    subset = []
    
    # Strategy-based generation:
    # - Strategy 0: Core outfit (shirt + pants + shoes + accessories)
    # - Strategy 1: Accessory-focused
    # - Strategy 2: Flexible combination
    
    # Add items based on context (formal, casual, etc.)
    if occasion == "formal" and outerwear:
        subset.append(jacket)
        subset.append(shirt)
        subset.append(pants)
        subset.append(shoes)
    
    candidates.append(subset)
```

**What happens:**
- System generates **50-100+ candidate outfit combinations**
- Each candidate is a list of item indices (e.g., `[0, 3, 7, 12]`)
- Candidates are generated using:
  - Category pools (uppers, bottoms, shoes, outerwear, accessories)
  - Context-aware strategies (formal vs casual)
  - Randomization for variety
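
The pool-based sampling described above can be sketched roughly like this — a deliberate simplification of the real strategy logic, with hypothetical pool contents:

```python
import random

# Items grouped into category pools (indices into the wardrobe list)
pools = {
    "upper": [0, 1, 2],
    "bottom": [3, 4],
    "shoes": [5, 6],
    "outerwear": [7],
    "accessory": [8, 9, 10],
}

def sample_candidate(occasion: str) -> list:
    """Core outfit (upper + bottom + shoes) plus context-dependent extras."""
    subset = [
        random.choice(pools["upper"]),
        random.choice(pools["bottom"]),
        random.choice(pools["shoes"]),
    ]
    if occasion == "formal" and pools["outerwear"]:
        subset.append(random.choice(pools["outerwear"]))
    if random.random() < 0.5:  # randomization for variety
        subset.append(random.choice(pools["accessory"]))
    return subset

candidates = [sample_candidate("formal") for _ in range(50)]
```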

---

### **Step 5: ViT Scores Outfit Compatibility** ⭐⭐
**Location:** `inference.py:1094-1103` β†’ `score_subset()`

```python
def score_subset(idx_subset: List[int]) -> float:
    # Get embeddings for items in this outfit
    embs = torch.tensor(
        np.stack([proc_items[i]["embedding"] for i in idx_subset], axis=0),
        dtype=torch.float32,
        device=self.device,
    )  # Shape: (N, 512) where N = number of items in outfit
    
    embs = embs.unsqueeze(0)  # Shape: (1, N, 512) - batch dimension
    
    # βœ… VIT IS CALLED HERE!
    s = self.vit.score_compatibility(embs).item()  # <-- VIT FORWARD PASS
    return float(s)
```

**What ViT does:**
- Takes **multiple item embeddings** (e.g., jacket, shirt, pants, shoes)
- Passes them through the **Vision Transformer encoder**:
  - Transformer processes the sequence of item embeddings
  - Learns relationships between items (do they go together?)
  - Outputs a **compatibility score** (higher = better match)

**ViT Architecture:**
```python
# From models/vit_outfit.py
class OutfitCompatibilityModel(nn.Module):
    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) - batch of outfits, each with N items, D-dim embeddings
        h = self.encoder(tokens)      # Transformer encoder
        pooled = h.mean(dim=1)        # Average pooling across items
        score = self.compatibility_head(pooled)  # Final compatibility score
        return score.squeeze(-1)
```

**Example:**
- Input: `[jacket_emb, shirt_emb, pants_emb, shoes_emb]` (4 items Γ— 512 dims)
- ViT Processing: Transformer analyzes relationships
- Output: `0.85` (high compatibility score)
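
A self-contained toy version of this scorer, runnable as-is. The layer count and sizes below are illustrative, not the project's actual configuration:

```python
import torch
import torch.nn as nn

class ToyOutfitScorer(nn.Module):
    """Transformer over item embeddings -> one compatibility score."""
    def __init__(self, dim: int = 512, heads: int = 8, layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.compatibility_head = nn.Linear(dim, 1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h = self.encoder(tokens)       # (B, N, D): items attend to each other
        pooled = h.mean(dim=1)         # (B, D): average over items
        return self.compatibility_head(pooled).squeeze(-1)  # (B,)

model = ToyOutfitScorer().eval()
with torch.inference_mode():
    outfit = torch.randn(1, 4, 512)   # one outfit of 4 item embeddings
    score = model(outfit)
print(score.shape)  # torch.Size([1])
```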

---

### **Step 6: Scoring & Ranking**
**Location:** `inference.py:1266-1274`

```python
# Score all valid candidates
scored = []
for subset in valid_candidates:
    base_score = score_subset(subset)  # <-- ViT score (0.0 to 1.0+)
    
    # Apply penalties and bonuses
    adjusted_score = calculate_outfit_penalty(subset, base_score)
    # - Penalties: missing categories, duplicates, wrong context
    # - Bonuses: color harmony, style coherence, complete sets
    
    scored.append((subset, adjusted_score, base_score))

# Sort by adjusted score (highest first)
scored.sort(key=lambda x: x[1], reverse=True)
```

**What happens:**
- Each candidate outfit gets:
  1. **Base score from ViT** (0.0 to ~1.0+)
  2. **Penalties** (e.g., -500 if formal without jacket)
  3. **Bonuses** (e.g., +0.6 for color harmony, +0.4 for style coherence)
- Final score = base_score + penalties + bonuses
- Outfits are ranked by final score
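
Putting the numbers from this section together (the penalty/bonus magnitudes are the examples quoted above):

```python
def adjust(base_score: float, penalties: float = 0.0, bonuses: float = 0.0) -> float:
    # final score = base ViT score + penalties (negative) + bonuses (positive)
    return base_score + penalties + bonuses

# Formal candidate with a jacket, color harmony (+0.6), style coherence (+0.4):
good = adjust(0.85, bonuses=0.6 + 0.4)        # 1.85
# Formal candidate WITHOUT a jacket is effectively eliminated by the penalty:
bad = adjust(0.85, penalties=-500.0)          # -499.15
```

The huge penalty magnitude relative to the ViT score range means hard constraint violations always sink a candidate below every valid one, regardless of how compatible its items look.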

---

### **Step 7: Final Selection & Deduplication**
**Location:** `inference.py:1276-1300`

```python
# Remove duplicate outfits
seen_outfits = set()
unique_scored = []
for subset, adjusted_score, base_score in scored:
    normalized = normalize_outfit(subset)  # Sort item IDs
    if normalized not in seen_outfits:
        seen_outfits.add(normalized)
        unique_scored.append((subset, adjusted_score, base_score))

# Select top N with randomization
topk = unique_scored[:num_outfits]
```

**What happens:**
- Duplicate outfits (same items, different order) are removed
- Top N outfits are selected
- Some randomization is added for variety
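
The normalization trick that makes order-insensitive deduplication work is just sorting into a hashable key — a minimal sketch consistent with the `normalize_outfit` comment above (the actual implementation may differ in details):

```python
def normalize_outfit(subset: list) -> tuple:
    # Same items in any order map to the same hashable key
    return tuple(sorted(subset))

seen, unique = set(), []
for subset in [[3, 0, 7], [0, 3, 7], [1, 2, 5]]:
    key = normalize_outfit(subset)
    if key not in seen:
        seen.add(key)
        unique.append(subset)

print(unique)  # [[3, 0, 7], [1, 2, 5]]
```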

---

## πŸ” **Proof: Both Models Are Used**

### **Evidence 1: ResNet Usage**
```python
# Line 330 in inference.py
emb = self.resnet(batch)  # βœ… ResNet forward pass
```
- Called in `embed_images()` method
- Generates embeddings for every clothing item
- **Called during inference** when items don't have pre-computed embeddings

### **Evidence 2: ViT Usage**
```python
# Line 1102 in inference.py
s = self.vit.score_compatibility(embs).item()  # βœ… ViT forward pass
```
- Called in `score_subset()` function
- Scores **every candidate outfit** (50-100+ times per recommendation request)
- **Called during inference** to rank outfit combinations

### **Evidence 3: Model Loading**
```python
# Lines 49-50, 285-286 in inference.py
self.resnet, self.resnet_loaded = self._load_resnet()
self.vit, self.vit_loaded = self._load_vit()

# Models are loaded and set to eval mode
if self.resnet_loaded:
    self.resnet = self.resnet.to(self.device).eval()
if self.vit_loaded:
    self.vit = self.vit.to(self.device).eval()
```

---

## πŸ“ˆ **Complete Flow Diagram**

```
User Input
    ↓
[Upload Images] β†’ [CLIP Category Detection]
    ↓
[ResNet Embedding Generation] ← βœ… RESNET USED HERE
    ↓
[512-dim Embeddings for Each Item]
    ↓
[Tag Processing] β†’ [Context Building]
    ↓
[Candidate Generation] β†’ [50-100+ Outfit Combinations]
    ↓
[ViT Compatibility Scoring] ← βœ… VIT USED HERE (50-100+ times)
    ↓
[Penalty/Bonus Adjustment]
    ↓
[Ranking & Deduplication]
    ↓
[Top N Recommendations]
```

---

## 🎯 **Key Points**

1. **ResNet is used:**
   - Generates embeddings for each clothing item
   - Called once per item (or uses cached embeddings)
   - Output: 512-dimensional feature vectors

2. **ViT is used:**
   - Scores compatibility of outfit combinations
   - Called **50-100+ times** per recommendation request (once per candidate)
   - Output: Compatibility score (0.0 to ~1.0+)

3. **Both models work together:**
   - ResNet provides item-level understanding
   - ViT provides outfit-level compatibility
   - Together they create personalized, context-aware recommendations

4. **The system is NOT just rule-based:**
   - Deep learning models (ResNet + ViT) provide the core intelligence
   - Rules and heuristics (penalties/bonuses) refine the results
   - Tags and context guide the generation process

---

## πŸ”¬ **Technical Details**

### **ResNet Architecture:**
- **Backbone:** ResNet50 (pretrained on ImageNet)
- **Input:** 224Γ—224 RGB images
- **Output:** 512-dimensional embeddings
- **Purpose:** Extract visual features from clothing items

### **ViT Architecture:**
- **Encoder:** Transformer with 4-6 layers, 8 attention heads
- **Input:** Sequence of item embeddings (variable length, 2-6 items)
- **Output:** Single compatibility score
- **Purpose:** Learn which items go well together

### **Training:**
- **ResNet:** Trained with triplet loss on item pairs
- **ViT:** Trained with triplet loss on outfit triplets (anchor, positive, negative)
- **Both:** Use early stopping, best model checkpointing
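
The triplet objective mentioned above, in PyTorch form. The margin value is an assumption, and the random tensors stand in for real anchor/positive/negative embeddings:

```python
import torch
import torch.nn as nn

triplet_loss = nn.TripletMarginLoss(margin=0.2)  # margin is illustrative

# anchor/positive would come from compatible items or outfits,
# negative from an incompatible one (batch of 8 x 512-dim embeddings)
anchor   = torch.randn(8, 512)
positive = anchor + 0.01 * torch.randn(8, 512)   # close to anchor
negative = torch.randn(8, 512)                   # unrelated

loss = triplet_loss(anchor, positive, negative)
# loss approaches zero once positives sit much closer to the anchor
# than negatives do, by more than the margin
```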

---

## βœ… **Conclusion**

**YES - Both ResNet and ViT are actively used during inference!**

- **ResNet** generates item embeddings (visual understanding)
- **ViT** scores outfit compatibility (relationship learning)
- Together they create intelligent, personalized recommendations

The system is a **true deep learning pipeline**, not just rule-based filtering!