# 🎯 How Dressify Recommendations Actually Work
## ✅ **YES - Both ResNet and ViT are used during inference!**
This document explains the complete recommendation pipeline and proves that both deep learning models are actively used.
---
## 📊 **Complete Recommendation Pipeline**
### **Step 1: Image Input & Category Detection**
**Location:** `inference.py:356-384`
```python
# User uploads wardrobe images
items = [
    {"id": "item_0", "image": <PIL.Image>, "category": None},
    {"id": "item_1", "image": <PIL.Image>, "category": None},
    ...
]

# For each item:
for item in items:
    # 1. Auto-detect category using CLIP (if available)
    category = self._detect_category_with_clip(item["image"])
    # OR fallback to filename-based detection

    # 2. Generate embedding if not provided
    if embedding is None:
        embedding = self.embed_images([item["image"]])[0]
```
**What happens:**
- Each clothing item image is processed
- Category is detected (shirt, pants, shoes, etc.) using CLIP or filename
- If no embedding exists, it's generated using **ResNet**
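The filename fallback mentioned above can be sketched as a simple keyword lookup. The keyword lists and function name here are hypothetical illustrations, not the literal detector in `inference.py`:

```python
# Hypothetical sketch of a filename-based category fallback;
# the real detector in inference.py may use different keywords.
CATEGORY_KEYWORDS = {
    "upper": ["shirt", "tee", "blouse", "sweater"],
    "bottom": ["pants", "jeans", "skirt", "shorts"],
    "shoes": ["shoe", "sneaker", "boot", "heel"],
    "outerwear": ["jacket", "coat", "blazer"],
}

def detect_category_from_filename(filename: str) -> str:
    name = filename.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(kw in name for kw in keywords):
            return category
    return "accessory"  # default bucket when nothing matches
```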
---
### **Step 2: ResNet Generates Item Embeddings** ⭐
**Location:** `inference.py:313-337` → `embed_images()`
```python
@torch.inference_mode()
def embed_images(self, images: List[Image.Image]) -> List[np.ndarray]:
    # Transform images to tensors
    batch = torch.stack([self.transform(img) for img in images])
    batch = batch.to(self.device, memory_format=torch.channels_last)

    # ✅ RESNET IS CALLED HERE!
    use_amp = (self.device == "cuda")
    with torch.autocast(device_type=("cuda" if use_amp else "cpu"), enabled=use_amp):
        emb = self.resnet(batch)  # <-- RESNET FORWARD PASS

    # Normalize embeddings
    emb = nn.functional.normalize(emb, dim=-1)
    result = [e.detach().cpu().numpy().astype(np.float32) for e in emb]
    return result
```
**What ResNet does:**
- Takes raw clothing item images (224x224 RGB)
- Passes through ResNet50 backbone (pretrained on ImageNet)
- Generates **512-dimensional embeddings** for each item
- These embeddings capture visual features (color, texture, style, pattern)
**Example:**
- Input: Image of a blue shirt → ResNet → Output: `[0.123, -0.456, 0.789, ...]` (512-dim vector)
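The normalize step in `embed_images()` scales every embedding to unit length, so dot products between items behave like cosine similarities. A minimal NumPy sketch of just that step (the 2-dim vector is a toy stand-in for a 512-dim embedding):

```python
import numpy as np

def l2_normalize(emb: np.ndarray) -> np.ndarray:
    """Scale an embedding to unit length (same effect as
    nn.functional.normalize(emb, dim=-1) in the pipeline)."""
    norm = np.linalg.norm(emb)
    return emb / norm if norm > 0 else emb

raw = np.array([3.0, 4.0], dtype=np.float32)  # toy 2-dim "embedding"
unit = l2_normalize(raw)                       # [0.6, 0.8]
```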
---
### **Step 3: Tag Processing & Context Building**
**Location:** `inference.py:490-545`
```python
# Process user tags (occasion, weather, style, etc.)
processed_tags = self.tag_processor.process_tags(context)
# Build outfit template based on tags
template = outfit_templates[outfit_style].copy()
# Apply weather/occasion modifications
# Generate constraints (min_items, max_items, accessory_limit)
```
**What happens:**
- User preferences (formal, cold weather, elegant style) are processed
- Outfit templates are selected and modified
- Constraints are generated (e.g., formal requires 4-5 items, needs outerwear)
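The template-plus-modification idea can be sketched as dict manipulation. The template contents and field names below are illustrative assumptions, not the literal structures in `inference.py`:

```python
# Hypothetical outfit templates; field names and values are
# illustrative, not copied from inference.py.
outfit_templates = {
    "formal": {"min_items": 4, "max_items": 5, "requires": ["outerwear"]},
    "casual": {"min_items": 3, "max_items": 4, "requires": []},
}

def build_constraints(style: str, weather: str) -> dict:
    template = outfit_templates[style].copy()
    template["requires"] = list(template["requires"])  # don't mutate the shared list
    # Weather modification: cold weather forces outerwear into the outfit
    if weather == "cold" and "outerwear" not in template["requires"]:
        template["requires"].append("outerwear")
    return template

constraints = build_constraints("casual", "cold")
```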
---
### **Step 4: Candidate Outfit Generation**
**Location:** `inference.py:910-1092`
```python
# Generate many candidate outfit combinations
candidates = []
for _ in range(num_samples):  # Typically 50-100+ candidates
    subset = []
    # Strategy-based generation:
    # - Strategy 0: Core outfit (shirt + pants + shoes + accessories)
    # - Strategy 1: Accessory-focused
    # - Strategy 2: Flexible combination

    # Add items based on context (formal, casual, etc.)
    if occasion == "formal" and outerwear:
        subset.append(jacket)
    subset.append(shirt)
    subset.append(pants)
    subset.append(shoes)
    candidates.append(subset)
```
**What happens:**
- System generates **50-100+ candidate outfit combinations**
- Each candidate is a list of item indices (e.g., `[0, 3, 7, 12]`)
- Candidates are generated using:
- Category pools (uppers, bottoms, shoes, outerwear, accessories)
- Context-aware strategies (formal vs casual)
- Randomization for variety
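The pool-based sampling above can be sketched in a few lines. The pools, indices, and seeding are made up for illustration; the real strategies in `inference.py` are richer and context-aware:

```python
import random

# Hypothetical category pools mapping to item indices
pools = {"upper": [0, 1], "bottom": [2, 3], "shoes": [4], "outerwear": [5]}

def generate_candidates(num_samples: int, formal: bool, seed: int = 0) -> list:
    rng = random.Random(seed)  # seeded only to make this sketch reproducible
    candidates = []
    for _ in range(num_samples):
        # Core outfit: one upper, one bottom, one pair of shoes
        subset = [rng.choice(pools["upper"]),
                  rng.choice(pools["bottom"]),
                  rng.choice(pools["shoes"])]
        if formal:  # context-aware: formal outfits add outerwear
            subset.append(rng.choice(pools["outerwear"]))
        candidates.append(subset)
    return candidates

cands = generate_candidates(50, formal=True)
```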
---
### **Step 5: ViT Scores Outfit Compatibility** ⭐⭐
**Location:** `inference.py:1094-1103` → `score_subset()`
```python
def score_subset(idx_subset: List[int]) -> float:
    # Get embeddings for the items in this outfit
    embs = torch.tensor(
        np.stack([proc_items[i]["embedding"] for i in idx_subset], axis=0),
        dtype=torch.float32,
        device=self.device,
    )  # Shape: (N, 512) where N = number of items in the outfit
    embs = embs.unsqueeze(0)  # Shape: (1, N, 512) - batch dimension

    # ✅ VIT IS CALLED HERE!
    s = self.vit.score_compatibility(embs).item()  # <-- VIT FORWARD PASS
    return float(s)
```
**What ViT does:**
- Takes **multiple item embeddings** (e.g., jacket, shirt, pants, shoes)
- Passes through **Vision Transformer encoder**:
- Transformer processes the sequence of item embeddings
- Learns relationships between items (do they go together?)
- Outputs a **compatibility score** (higher = better match)
**ViT Architecture:**
```python
# From models/vit_outfit.py
class OutfitCompatibilityModel(nn.Module):
    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) - batch of outfits, each with N items, D-dim embeddings
        h = self.encoder(tokens)  # Transformer encoder
        pooled = h.mean(dim=1)  # Average pooling across items
        score = self.compatibility_head(pooled)  # Final compatibility score
        return score.squeeze(-1)
```
**Example:**
- Input: `[jacket_emb, shirt_emb, pants_emb, shoes_emb]` (4 items × 512 dims)
- ViT Processing: Transformer analyzes relationships
- Output: `0.85` (high compatibility score)
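The pooling-plus-head pattern can be sketched in NumPy. This omits the transformer encoder and uses random weights, so it only shows the shape of the computation, not the learned behavior; one nice property it does share is that mean pooling makes the score invariant to item order:

```python
import numpy as np

def compatibility_score(item_embs: np.ndarray, w: np.ndarray, b: float) -> float:
    """Mean-pool item embeddings, then apply a linear head - the same
    shape of computation as the model's pooling + compatibility head
    (minus the transformer encoder, and with made-up weights)."""
    pooled = item_embs.mean(axis=0)  # (D,) - average across the N items
    return float(pooled @ w + b)     # scalar compatibility score

rng = np.random.default_rng(0)
outfit = rng.standard_normal((4, 512))       # 4 items x 512-dim embeddings
w = rng.standard_normal(512) / np.sqrt(512)  # random stand-in for learned weights
score = compatibility_score(outfit, w, b=0.0)
```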
---
### **Step 6: Scoring & Ranking**
**Location:** `inference.py:1266-1274`
```python
# Score all valid candidates
scored = []
for subset in valid_candidates:
    base_score = score_subset(subset)  # <-- ViT score (0.0 to 1.0+)

    # Apply penalties and bonuses
    adjusted_score = calculate_outfit_penalty(subset, base_score)
    # - Penalties: missing categories, duplicates, wrong context
    # - Bonuses: color harmony, style coherence, complete sets
    scored.append((subset, adjusted_score, base_score))

# Sort by adjusted score (highest first)
scored.sort(key=lambda x: x[1], reverse=True)
```
**What happens:**
- Each candidate outfit gets:
1. **Base score from ViT** (0.0 to ~1.0+)
2. **Penalties** (e.g., -500 if formal without jacket)
3. **Bonuses** (e.g., +0.6 for color harmony, +0.4 for style coherence)
- Final score = base_score + penalties + bonuses
- Outfits are ranked by final score
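A toy version of the adjust-then-rank logic, using the example penalty and bonus values from the list above (the function and candidate names are illustrative):

```python
# Toy penalty/bonus adjustment; the -500 and +0.6 values mirror the
# examples in the text but are illustrative, not the full rule set.
def adjust_score(base: float, has_jacket: bool, color_harmony: bool) -> float:
    score = base
    if not has_jacket:
        score -= 500.0  # e.g. a formal outfit missing outerwear
    if color_harmony:
        score += 0.6    # color-harmony bonus
    return score

candidates = [
    ("A", 0.90, True, False),   # (name, base ViT score, jacket?, harmony?)
    ("B", 0.70, True, True),
    ("C", 0.95, False, True),
]
ranked = sorted(
    ((name, adjust_score(base, jacket, harmony))
     for name, base, jacket, harmony in candidates),
    key=lambda x: x[1], reverse=True,
)
# A huge penalty (C missing its jacket) outweighs a high base score,
# so B's bonus-boosted 1.3 ranks first.
```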
---
### **Step 7: Final Selection & Deduplication**
**Location:** `inference.py:1276-1300`
```python
# Remove duplicate outfits
seen_outfits = set()
unique_scored = []
for subset, adjusted_score, base_score in scored:
    normalized = normalize_outfit(subset)  # Sort item IDs
    if normalized not in seen_outfits:
        seen_outfits.add(normalized)
        unique_scored.append((subset, adjusted_score, base_score))

# Select top N with randomization
topk = unique_scored[:num_outfits]
```
**What happens:**
- Duplicate outfits (same items, different order) are removed
- Top N outfits are selected
- Some randomization is added for variety
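The order-insensitive deduplication can be sketched as follows, assuming `normalize_outfit` simply sorts the item ids into a hashable tuple (a plausible reading of the "Sort item IDs" comment, not the verified implementation):

```python
# Sketch of order-insensitive outfit deduplication.
def normalize_outfit(subset: list) -> tuple:
    return tuple(sorted(subset))  # same items, any order -> same key

scored = [([3, 0, 7], 1.2), ([0, 3, 7], 1.1), ([1, 2, 4], 0.9)]

seen, unique = set(), []
for subset, score in scored:
    key = normalize_outfit(subset)
    if key not in seen:  # list is pre-sorted, so the first copy is the best-scored
        seen.add(key)
        unique.append((subset, score))
```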
---
## 🔍 **Proof: Both Models Are Used**
### **Evidence 1: ResNet Usage**
```python
# Line 330 in inference.py
emb = self.resnet(batch)  # ✅ ResNet forward pass
```
- Called in `embed_images()` method
- Generates embeddings for every clothing item
- **Called during inference** when items don't have pre-computed embeddings
### **Evidence 2: ViT Usage**
```python
# Line 1102 in inference.py
s = self.vit.score_compatibility(embs).item()  # ✅ ViT forward pass
```
- Called in `score_subset()` function
- Scores **every candidate outfit** (50-100+ times per recommendation request)
- **Called during inference** to rank outfit combinations
### **Evidence 3: Model Loading**
```python
# Lines 49-50, 285-286 in inference.py
self.resnet, self.resnet_loaded = self._load_resnet()
self.vit, self.vit_loaded = self._load_vit()
# Models are loaded and set to eval mode
if self.resnet_loaded:
self.resnet = self.resnet.to(self.device).eval()
if self.vit_loaded:
self.vit = self.vit.to(self.device).eval()
```
---
## 📈 **Complete Flow Diagram**
```
User Input
↓
[Upload Images] → [CLIP Category Detection]
↓
[ResNet Embedding Generation] ← ✅ RESNET USED HERE
↓
[512-dim Embeddings for Each Item]
↓
[Tag Processing] → [Context Building]
↓
[Candidate Generation] → [50-100+ Outfit Combinations]
↓
[ViT Compatibility Scoring] ← ✅ VIT USED HERE (50-100+ times)
↓
[Penalty/Bonus Adjustment]
↓
[Ranking & Deduplication]
↓
[Top N Recommendations]
```
---
## 🎯 **Key Points**
1. **ResNet is used:**
- Generates embeddings for each clothing item
- Called once per item (or uses cached embeddings)
- Output: 512-dimensional feature vectors
2. **ViT is used:**
- Scores compatibility of outfit combinations
- Called **50-100+ times** per recommendation request (once per candidate)
- Output: Compatibility score (0.0 to ~1.0+)
3. **Both models work together:**
- ResNet provides item-level understanding
- ViT provides outfit-level compatibility
- Together they create personalized, context-aware recommendations
4. **The system is NOT just rule-based:**
- Deep learning models (ResNet + ViT) provide the core intelligence
- Rules and heuristics (penalties/bonuses) refine the results
- Tags and context guide the generation process
---
## 🔬 **Technical Details**
### **ResNet Architecture:**
- **Backbone:** ResNet50 (pretrained on ImageNet)
- **Input:** 224×224 RGB images
- **Output:** 512-dimensional embeddings
- **Purpose:** Extract visual features from clothing items
### **ViT Architecture:**
- **Encoder:** Transformer with 4-6 layers, 8 attention heads
- **Input:** Sequence of item embeddings (variable length, 2-6 items)
- **Output:** Single compatibility score
- **Purpose:** Learn which items go well together
### **Training:**
- **ResNet:** Trained with triplet loss on item pairs
- **ViT:** Trained with triplet loss on outfit triplets (anchor, positive, negative)
- **Both:** Use early stopping, best model checkpointing
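Triplet loss pulls the anchor toward the positive example and pushes it away from the negative by at least a margin. A NumPy sketch of the standard formulation (the margin value and squared-Euclidean distance are common defaults, assumed here rather than taken from the training code):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin),
    with squared Euclidean distance. margin=0.2 is illustrative."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return float(max(0.0, d_pos - d_neg + margin))

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])  # close to the anchor (compatible)
n = np.array([1.0, 0.0])  # far from the anchor (incompatible)
loss = triplet_loss(a, p, n)  # d_pos=0.01, d_neg=1.0 -> loss is 0
```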
---
## ✅ **Conclusion**
**YES - Both ResNet and ViT are actively used during inference!**
- **ResNet** generates item embeddings (visual understanding)
- **ViT** scores outfit compatibility (relationship learning)
- Together they create intelligent, personalized recommendations
The system is a **true deep learning pipeline**, not just rule-based filtering!