# How Dressify Recommendations Actually Work

## **YES: Both ResNet and ViT are used during inference**

This document explains the complete recommendation pipeline and shows that both deep learning models are actively used.

---

## **Complete Recommendation Pipeline**

### **Step 1: Image Input & Category Detection**

**Location:** `inference.py:356-384`
```python
# User uploads wardrobe images; categories start out unknown
items = [
    {"id": "item_0", "image": <PIL.Image>, "category": None},
    {"id": "item_1", "image": <PIL.Image>, "category": None},
    ...
]

for item in items:
    # 1. Auto-detect category using CLIP (if available),
    #    or fall back to filename-based detection
    category = self._detect_category_with_clip(item["image"])

    # 2. Generate an embedding if one was not provided
    if embedding is None:
        embedding = self.embed_images([item["image"]])[0]
```
**What happens:**
- Each clothing item image is processed
- The category (shirt, pants, shoes, etc.) is detected using CLIP or the filename
- If no embedding exists, one is generated using **ResNet**
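The filename fallback mentioned above can be sketched as a simple keyword lookup. This is a hypothetical illustration — the keyword table and function name are ours, not the actual implementation:

```python
from typing import Optional

# Illustrative keyword -> category table; not the real detection logic.
CATEGORY_KEYWORDS = {
    "shirt": "upper", "tee": "upper", "blouse": "upper",
    "pants": "bottom", "jeans": "bottom", "skirt": "bottom",
    "sneaker": "shoes", "boot": "shoes", "heel": "shoes",
    "jacket": "outerwear", "coat": "outerwear",
}

def detect_category_from_filename(filename: str) -> Optional[str]:
    name = filename.lower()
    for keyword, category in CATEGORY_KEYWORDS.items():
        if keyword in name:
            return category
    return None  # caller can fall back to a default or ask the user

print(detect_category_from_filename("blue_denim_jeans.jpg"))  # bottom
```

A CLIP-based detector would replace this lookup with zero-shot image-text matching; the filename path only matters when CLIP is unavailable.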
---

### **Step 2: ResNet Generates Item Embeddings**

**Location:** `inference.py:313-337`, `embed_images()`

```python
@torch.inference_mode()
def embed_images(self, images: List[Image.Image]) -> List[np.ndarray]:
    # Transform images into a single batched tensor
    batch = torch.stack([self.transform(img) for img in images])
    batch = batch.to(self.device, memory_format=torch.channels_last)

    # ResNet is called here
    use_amp = (self.device == "cuda")
    with torch.autocast(device_type=("cuda" if use_amp else "cpu"), enabled=use_amp):
        emb = self.resnet(batch)  # <-- ResNet forward pass

    # Normalize embeddings to unit length
    emb = nn.functional.normalize(emb, dim=-1)
    return [e.detach().cpu().numpy().astype(np.float32) for e in emb]
```
**What ResNet does:**
- Takes raw clothing item images (224×224 RGB)
- Passes them through a ResNet50 backbone (pretrained on ImageNet)
- Generates a **512-dimensional embedding** for each item
- These embeddings capture visual features (color, texture, style, pattern)

**Example:**
- Input: image of a blue shirt → ResNet → Output: `[0.123, -0.456, 0.789, ...]` (512-dim vector)
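The unit normalization in `embed_images()` matters: once every embedding lies on the unit sphere, a plain dot product between two items is their cosine similarity. A minimal NumPy sketch of that step (random vectors stand in for real ResNet outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(512,)).astype(np.float32)  # stand-in for a ResNet output

# L2-normalize, as nn.functional.normalize(emb, dim=-1) does
unit = emb / np.linalg.norm(emb)
print(float(np.linalg.norm(unit)))  # ~1.0

# Dot product between two unit vectors == cosine similarity, bounded in [-1, 1]
other = rng.normal(size=(512,)).astype(np.float32)
other = other / np.linalg.norm(other)
cos = float(unit @ other)
print(-1.0 <= cos <= 1.0)  # True
```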
---

### **Step 3: Tag Processing & Context Building**

**Location:** `inference.py:490-545`

```python
# Process user tags (occasion, weather, style, etc.)
processed_tags = self.tag_processor.process_tags(context)

# Build an outfit template based on the tags
template = outfit_templates[outfit_style].copy()

# Apply weather/occasion modifications and generate constraints
# (min_items, max_items, accessory_limit)
```
**What happens:**
- User preferences (e.g., formal occasion, cold weather, elegant style) are processed
- An outfit template is selected and modified
- Constraints are generated (e.g., formal requires 4-5 items and needs outerwear)
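The template-plus-modifier idea can be sketched as follows. The dictionary keys and values here are assumptions for illustration, not the real `outfit_templates` structure:

```python
# Hypothetical template/constraint logic; keys and values are illustrative.
outfit_templates = {
    "formal": {"min_items": 4, "max_items": 5, "required": ["upper", "bottom", "shoes"]},
    "casual": {"min_items": 3, "max_items": 4, "required": ["upper", "bottom", "shoes"]},
}

def build_constraints(style: str, weather: str) -> dict:
    template = outfit_templates[style].copy()
    template["required"] = list(template["required"])  # avoid mutating the shared list
    if weather == "cold":
        template["required"].append("outerwear")  # cold weather forces a jacket/coat
        template["min_items"] += 1
    return template

constraints = build_constraints("formal", "cold")
print(constraints["required"])   # ['upper', 'bottom', 'shoes', 'outerwear']
print(constraints["min_items"])  # 5
```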
---

### **Step 4: Candidate Outfit Generation**

**Location:** `inference.py:910-1092`

```python
# Generate many candidate outfit combinations
candidates = []
for _ in range(num_samples):  # typically 50-100+ candidates
    subset = []
    # Strategy-based generation:
    #   - Strategy 0: core outfit (shirt + pants + shoes + accessories)
    #   - Strategy 1: accessory-focused
    #   - Strategy 2: flexible combination
    # Add items based on context (formal, casual, etc.)
    if occasion == "formal" and outerwear:
        subset.append(jacket)
    subset.append(shirt)
    subset.append(pants)
    subset.append(shoes)
    candidates.append(subset)
```
**What happens:**
- The system generates **50-100+ candidate outfit combinations**
- Each candidate is a list of item indices (e.g., `[0, 3, 7, 12]`)
- Candidates are generated using:
  - Category pools (uppers, bottoms, shoes, outerwear, accessories)
  - Context-aware strategies (formal vs. casual)
  - Randomization for variety
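The pool-based sampling above can be sketched in a few lines. The pool contents and strategy are hypothetical, but the shape of the output (lists of item indices) matches the description:

```python
import random

# Illustrative category pools mapping to item indices; not the real data.
pools = {
    "upper": [0, 1, 2],
    "bottom": [3, 4],
    "shoes": [5, 6],
    "outerwear": [7],
    "accessory": [8, 9, 10],
}

def sample_candidate(rng, formal):
    # Core outfit: one item per required category
    subset = [rng.choice(pools["upper"]),
              rng.choice(pools["bottom"]),
              rng.choice(pools["shoes"])]
    if formal and pools["outerwear"]:
        subset.append(rng.choice(pools["outerwear"]))
    if rng.random() < 0.5:  # randomization for variety
        subset.append(rng.choice(pools["accessory"]))
    return subset

rng = random.Random(42)
candidates = [sample_candidate(rng, formal=True) for _ in range(50)]
print(len(candidates))                  # 50
print(all(7 in c for c in candidates))  # True: formal always adds the jacket
```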
---

### **Step 5: ViT Scores Outfit Compatibility**

**Location:** `inference.py:1094-1103`, `score_subset()`

```python
def score_subset(idx_subset: List[int]) -> float:
    # Gather the embeddings for the items in this outfit
    embs = torch.tensor(
        np.stack([proc_items[i]["embedding"] for i in idx_subset], axis=0),
        dtype=torch.float32,
        device=self.device,
    )  # shape: (N, 512), where N = number of items in the outfit
    embs = embs.unsqueeze(0)  # shape: (1, N, 512) - add a batch dimension

    # ViT is called here
    s = self.vit.score_compatibility(embs).item()  # <-- ViT forward pass
    return float(s)
```
**What ViT does:**
- Takes **multiple item embeddings** (e.g., jacket, shirt, pants, shoes)
- Passes them through the **Vision Transformer encoder**:
  - The Transformer processes the sequence of item embeddings
  - It learns relationships between items (do they go together?)
- Outputs a **compatibility score** (higher = better match)

**ViT Architecture:**

```python
# From models/vit_outfit.py
class OutfitCompatibilityModel(nn.Module):
    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) - a batch of outfits, each with N items and D-dim embeddings
        h = self.encoder(tokens)                 # Transformer encoder
        pooled = h.mean(dim=1)                   # average pooling across items
        score = self.compatibility_head(pooled)  # final compatibility score
        return score.squeeze(-1)
```
**Example:**
- Input: `[jacket_emb, shirt_emb, pants_emb, shoes_emb]` (4 items × 512 dims)
- ViT processing: the Transformer analyzes relationships between items
- Output: `0.85` (a high compatibility score)
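A minimal runnable sketch of this forward pass, mirroring the `OutfitCompatibilityModel` structure shown above. The class name, layer count, and head size here are illustrative assumptions, not the project's exact configuration:

```python
import torch
import torch.nn as nn

class TinyOutfitScorer(nn.Module):
    """Sketch of a Transformer-based outfit scorer (assumed hyperparameters)."""

    def __init__(self, dim=512, heads=8, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.compatibility_head = nn.Linear(dim, 1)

    def forward(self, tokens):
        h = self.encoder(tokens)            # (B, N, D): contextualized item tokens
        pooled = h.mean(dim=1)              # (B, D): average over the N items
        return self.compatibility_head(pooled).squeeze(-1)  # (B,) scores

model = TinyOutfitScorer().eval()
with torch.inference_mode():
    outfit = torch.randn(1, 4, 512)         # e.g., jacket, shirt, pants, shoes
    score = model(outfit)
print(tuple(score.shape))  # (1,)
```

Because the encoder is permutation-aware only through attention (no positional encodings here), mean pooling makes the score largely independent of item order, which fits scoring unordered outfits.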
---

### **Step 6: Scoring & Ranking**

**Location:** `inference.py:1266-1274`

```python
# Score all valid candidates
scored = []
for subset in valid_candidates:
    base_score = score_subset(subset)  # <-- ViT score (0.0 to ~1.0+)
    # Apply penalties and bonuses:
    #   - penalties: missing categories, duplicates, wrong context
    #   - bonuses: color harmony, style coherence, complete sets
    adjusted_score = calculate_outfit_penalty(subset, base_score)
    scored.append((subset, adjusted_score, base_score))

# Sort by adjusted score (highest first)
scored.sort(key=lambda x: x[1], reverse=True)
```
**What happens:**
- Each candidate outfit gets:
  1. A **base score from ViT** (0.0 to ~1.0+)
  2. **Penalties** (e.g., -500 if formal without a jacket)
  3. **Bonuses** (e.g., +0.6 for color harmony, +0.4 for style coherence)
- Final score = base score + penalties + bonuses
- Outfits are ranked by final score
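The adjustment step can be sketched as plain arithmetic on the base score. The weights below mirror the examples above, but the function itself is an illustration, not the real `calculate_outfit_penalty`:

```python
# Illustrative penalty/bonus adjustment; weights match the examples above.
def adjust_score(base_score, categories, occasion, color_harmony):
    score = base_score
    if occasion == "formal" and "outerwear" not in categories:
        score -= 500.0  # hard penalty: a formal outfit without a jacket loses
    if color_harmony:
        score += 0.6    # bonus for a harmonious color palette
    return score

good = adjust_score(0.85, {"upper", "bottom", "shoes", "outerwear"},
                    "formal", color_harmony=True)
bad = adjust_score(0.90, {"upper", "bottom", "shoes"},
                   "formal", color_harmony=False)
print(good > bad)  # True: the complete outfit wins despite a lower base score
```

This is why the system is not purely model-driven: a candidate with the highest ViT score can still lose to one that satisfies the contextual constraints.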
---

### **Step 7: Final Selection & Deduplication**

**Location:** `inference.py:1276-1300`

```python
# Remove duplicate outfits
seen_outfits = set()
unique_scored = []
for subset, adjusted_score, base_score in scored:
    normalized = normalize_outfit(subset)  # sort item IDs
    if normalized not in seen_outfits:
        seen_outfits.add(normalized)
        unique_scored.append((subset, adjusted_score, base_score))

# Select the top N (with some randomization)
topk = unique_scored[:num_outfits]
```
**What happens:**
- Duplicate outfits (same items in a different order) are removed
- The top N outfits are selected
- Some randomization is added for variety
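Order-insensitive deduplication hinges on a canonical key per outfit. Assuming `normalize_outfit()` simply sorts the item indices into a hashable tuple (a common approach; the helper name below is from the excerpt, the body is our guess):

```python
def normalize_outfit(subset):
    # Canonical, hashable key: same items in any order map to the same tuple
    return tuple(sorted(subset))

scored = [([3, 0, 7], 1.2), ([0, 3, 7], 1.2), ([1, 4, 6], 0.9)]
seen, unique = set(), []
for subset, score in scored:
    key = normalize_outfit(subset)
    if key not in seen:
        seen.add(key)
        unique.append((subset, score))

print(len(unique))  # 2: [3, 0, 7] and [0, 3, 7] collapse into one outfit
```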
---

## **Proof: Both Models Are Used**

### **Evidence 1: ResNet Usage**

```python
# Line 330 in inference.py
emb = self.resnet(batch)  # <-- ResNet forward pass
```

- Called in the `embed_images()` method
- Generates embeddings for every clothing item
- **Called during inference** whenever items don't have pre-computed embeddings

### **Evidence 2: ViT Usage**

```python
# Line 1102 in inference.py
s = self.vit.score_compatibility(embs).item()  # <-- ViT forward pass
```

- Called in the `score_subset()` function
- Scores **every candidate outfit** (50-100+ times per recommendation request)
- **Called during inference** to rank outfit combinations

### **Evidence 3: Model Loading**

```python
# Lines 49-50 and 285-286 in inference.py
self.resnet, self.resnet_loaded = self._load_resnet()
self.vit, self.vit_loaded = self._load_vit()

# Models are moved to the device and set to eval mode
if self.resnet_loaded:
    self.resnet = self.resnet.to(self.device).eval()
if self.vit_loaded:
    self.vit = self.vit.to(self.device).eval()
```
---

## **Complete Flow Diagram**

```
User Input
    ↓
[Upload Images] → [CLIP Category Detection]
    ↓
[ResNet Embedding Generation]        <-- RESNET USED HERE
    ↓
[512-dim Embeddings for Each Item]
    ↓
[Tag Processing] → [Context Building]
    ↓
[Candidate Generation] → [50-100+ Outfit Combinations]
    ↓
[ViT Compatibility Scoring]          <-- VIT USED HERE (50-100+ times)
    ↓
[Penalty/Bonus Adjustment]
    ↓
[Ranking & Deduplication]
    ↓
[Top N Recommendations]
```
| --- | |
| ## π― **Key Points** | |
| 1. **ResNet is used:** | |
| - Generates embeddings for each clothing item | |
| - Called once per item (or uses cached embeddings) | |
| - Output: 512-dimensional feature vectors | |
| 2. **ViT is used:** | |
| - Scores compatibility of outfit combinations | |
| - Called **50-100+ times** per recommendation request (once per candidate) | |
| - Output: Compatibility score (0.0 to ~1.0+) | |
| 3. **Both models work together:** | |
| - ResNet provides item-level understanding | |
| - ViT provides outfit-level compatibility | |
| - Together they create personalized, context-aware recommendations | |
| 4. **The system is NOT just rule-based:** | |
| - Deep learning models (ResNet + ViT) provide the core intelligence | |
| - Rules and heuristics (penalties/bonuses) refine the results | |
| - Tags and context guide the generation process | |
| --- | |
---

## **Technical Details**

### **ResNet Architecture**
- **Backbone:** ResNet50 (pretrained on ImageNet)
- **Input:** 224×224 RGB images
- **Output:** 512-dimensional embeddings
- **Purpose:** extract visual features from clothing items

### **ViT Architecture**
- **Encoder:** Transformer with 4-6 layers and 8 attention heads
- **Input:** a sequence of item embeddings (variable length, 2-6 items)
- **Output:** a single compatibility score
- **Purpose:** learn which items go well together

### **Training**
- **ResNet:** trained with triplet loss on item pairs
- **ViT:** trained with triplet loss on outfit triplets (anchor, positive, negative)
- **Both:** use early stopping and best-model checkpointing
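The triplet objective can be illustrated with PyTorch's built-in loss. This is a generic sketch of the training signal, not the project's actual training loop; the margin value and batch shapes are arbitrary examples:

```python
import torch
import torch.nn as nn

# Generic triplet-loss illustration (margin chosen arbitrarily).
triplet = nn.TripletMarginLoss(margin=0.2)

anchor   = torch.randn(8, 512, requires_grad=True)  # anchor embeddings
positive = torch.randn(8, 512)                      # compatible examples
negative = torch.randn(8, 512)                      # incompatible examples

loss = triplet(anchor, positive, negative)
loss.backward()                  # gradients flow back toward the embedding model
print(loss.item() >= 0.0)        # True: the hinge clamps the loss at zero
```

The loss pulls anchors toward positives and pushes them away from negatives until the margin is satisfied, which is what shapes both the item embeddings and the outfit scores.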
---

## **Conclusion**

**YES: both ResNet and ViT are actively used during inference.**

- **ResNet** generates item embeddings (visual understanding)
- **ViT** scores outfit compatibility (relationship learning)
- Together they create intelligent, personalized recommendations

The system is a **true deep learning pipeline**, not just rule-based filtering!