

🎯 How Dressify Recommendations Actually Work

✅ YES - Both ResNet and ViT are used during inference!

This document explains the complete recommendation pipeline and proves that both deep learning models are actively used.


📊 Complete Recommendation Pipeline

Step 1: Image Input & Category Detection

Location: inference.py:356-384

```python
# User uploads wardrobe images
items = [
    {"id": "item_0", "image": <PIL.Image>, "category": None},
    {"id": "item_1", "image": <PIL.Image>, "category": None},
    ...
]

# For each item:
for item in items:
    # 1. Auto-detect category using CLIP (if available)
    category = self._detect_category_with_clip(item["image"])
    # OR fallback to filename-based detection

    # 2. Generate embedding if not provided
    if embedding is None:
        embedding = self.embed_images([item["image"]])[0]
```

What happens:

  • Each clothing item image is processed
  • Category is detected (shirt, pants, shoes, etc.) using CLIP or filename
  • If no embedding exists, it's generated using ResNet

Step 2: ResNet Generates Item Embeddings ⭐

Location: inference.py:313-337 → embed_images()

```python
@torch.inference_mode()
def embed_images(self, images: List[Image.Image]) -> List[np.ndarray]:
    # Transform images to tensor
    batch = torch.stack([self.transform(img) for img in images])
    batch = batch.to(self.device, memory_format=torch.channels_last)

    # ✅ RESNET IS CALLED HERE!
    use_amp = (self.device == "cuda")
    with torch.autocast(device_type=("cuda" if use_amp else "cpu"), enabled=use_amp):
        emb = self.resnet(batch)  # <-- RESNET FORWARD PASS

    # Normalize embeddings
    emb = nn.functional.normalize(emb, dim=-1)
    result = [e.detach().cpu().numpy().astype(np.float32) for e in emb]
    return result
```

What ResNet does:

  • Takes raw clothing item images (224x224 RGB)
  • Passes through ResNet50 backbone (pretrained on ImageNet)
  • Generates 512-dimensional embeddings for each item
  • These embeddings capture visual features (color, texture, style, pattern)

Example:

  • Input: Image of a blue shirt β†’ ResNet β†’ Output: [0.123, -0.456, 0.789, ...] (512-dim vector)

Step 3: Tag Processing & Context Building

Location: inference.py:490-545

# Process user tags (occasion, weather, style, etc.)
processed_tags = self.tag_processor.process_tags(context)

# Build outfit template based on tags
template = outfit_templates[outfit_style].copy()
# Apply weather/occasion modifications
# Generate constraints (min_items, max_items, accessory_limit)

What happens:

  • User preferences (formal, cold weather, elegant style) are processed
  • Outfit templates are selected and modified
  • Constraints are generated (e.g., formal requires 4-5 items, needs outerwear)

Step 4: Candidate Outfit Generation

Location: inference.py:910-1092

```python
# Generate many candidate outfit combinations
candidates = []
for _ in range(num_samples):  # Typically 50-100+ candidates
    subset = []

    # Strategy-based generation:
    # - Strategy 0: Core outfit (shirt + pants + shoes + accessories)
    # - Strategy 1: Accessory-focused
    # - Strategy 2: Flexible combination

    # Add items based on context (formal, casual, etc.)
    if occasion == "formal" and outerwear:
        subset.append(jacket)
        subset.append(shirt)
        subset.append(pants)
        subset.append(shoes)

    candidates.append(subset)
```

What happens:

  • System generates 50-100+ candidate outfit combinations
  • Each candidate is a list of item indices (e.g., [0, 3, 7, 12])
  • Candidates are generated using:
    • Category pools (uppers, bottoms, shoes, outerwear, accessories)
    • Context-aware strategies (formal vs casual)
    • Randomization for variety

Step 5: ViT Scores Outfit Compatibility ⭐⭐

Location: inference.py:1094-1103 → score_subset()

```python
def score_subset(idx_subset: List[int]) -> float:
    # Get embeddings for items in this outfit
    embs = torch.tensor(
        np.stack([proc_items[i]["embedding"] for i in idx_subset], axis=0),
        dtype=torch.float32,
        device=self.device,
    )  # Shape: (N, 512) where N = number of items in outfit

    embs = embs.unsqueeze(0)  # Shape: (1, N, 512) - batch dimension

    # ✅ VIT IS CALLED HERE!
    s = self.vit.score_compatibility(embs).item()  # <-- VIT FORWARD PASS
    return float(s)
```

What ViT does:

  • Takes multiple item embeddings (e.g., jacket, shirt, pants, shoes)
  • Passes through Vision Transformer encoder:
    • Transformer processes the sequence of item embeddings
    • Learns relationships between items (do they go together?)
    • Outputs a compatibility score (higher = better match)

ViT Architecture:

```python
# From models/vit_outfit.py
class OutfitCompatibilityModel(nn.Module):
    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) - batch of outfits, each with N items, D-dim embeddings
        h = self.encoder(tokens)      # Transformer encoder
        pooled = h.mean(dim=1)        # Average pooling across items
        score = self.compatibility_head(pooled)  # Final compatibility score
        return score.squeeze(-1)
```

Example:

  • Input: [jacket_emb, shirt_emb, pants_emb, shoes_emb] (4 items Γ— 512 dims)
  • ViT Processing: Transformer analyzes relationships
  • Output: 0.85 (high compatibility score)

Step 6: Scoring & Ranking

Location: inference.py:1266-1274

```python
# Score all valid candidates
scored = []
for subset in valid_candidates:
    base_score = score_subset(subset)  # <-- ViT score (0.0 to 1.0+)

    # Apply penalties and bonuses
    adjusted_score = calculate_outfit_penalty(subset, base_score)
    # - Penalties: missing categories, duplicates, wrong context
    # - Bonuses: color harmony, style coherence, complete sets

    scored.append((subset, adjusted_score, base_score))

# Sort by adjusted score (highest first)
scored.sort(key=lambda x: x[1], reverse=True)
```

What happens:

  • Each candidate outfit gets:
    1. Base score from ViT (0.0 to ~1.0+)
    2. Penalties (e.g., -500 if formal without jacket)
    3. Bonuses (e.g., +0.6 for color harmony, +0.4 for style coherence)
  • Final score = base_score + penalties + bonuses
  • Outfits are ranked by final score

Step 7: Final Selection & Deduplication

Location: inference.py:1276-1300

```python
# Remove duplicate outfits
seen_outfits = set()
unique_scored = []
for subset, adjusted_score, base_score in scored:
    normalized = normalize_outfit(subset)  # Sort item IDs
    if normalized not in seen_outfits:
        seen_outfits.add(normalized)
        unique_scored.append((subset, adjusted_score, base_score))

# Select top N with randomization
topk = unique_scored[:num_outfits]
```

What happens:

  • Duplicate outfits (same items, different order) are removed
  • Top N outfits are selected
  • Some randomization is added for variety

πŸ” Proof: Both Models Are Used

Evidence 1: ResNet Usage

```python
# Line 330 in inference.py
emb = self.resnet(batch)  # ✅ ResNet forward pass
```
  • Called in embed_images() method
  • Generates embeddings for every clothing item
  • Called during inference when items don't have pre-computed embeddings

Evidence 2: ViT Usage

```python
# Line 1102 in inference.py
s = self.vit.score_compatibility(embs).item()  # ✅ ViT forward pass
```
  • Called in score_subset() function
  • Scores every candidate outfit (50-100+ times per recommendation request)
  • Called during inference to rank outfit combinations

Evidence 3: Model Loading

```python
# Lines 49-50, 285-286 in inference.py
self.resnet, self.resnet_loaded = self._load_resnet()
self.vit, self.vit_loaded = self._load_vit()

# Models are loaded and set to eval mode
if self.resnet_loaded:
    self.resnet = self.resnet.to(self.device).eval()
if self.vit_loaded:
    self.vit = self.vit.to(self.device).eval()
```

📈 Complete Flow Diagram

```
User Input
    ↓
[Upload Images] → [CLIP Category Detection]
    ↓
[ResNet Embedding Generation] ← ✅ RESNET USED HERE
    ↓
[512-dim Embeddings for Each Item]
    ↓
[Tag Processing] → [Context Building]
    ↓
[Candidate Generation] → [50-100+ Outfit Combinations]
    ↓
[ViT Compatibility Scoring] ← ✅ VIT USED HERE (50-100+ times)
    ↓
[Penalty/Bonus Adjustment]
    ↓
[Ranking & Deduplication]
    ↓
[Top N Recommendations]
```

🎯 Key Points

  1. ResNet is used:

    • Generates embeddings for each clothing item
    • Called once per item (or uses cached embeddings)
    • Output: 512-dimensional feature vectors
  2. ViT is used:

    • Scores compatibility of outfit combinations
    • Called 50-100+ times per recommendation request (once per candidate)
    • Output: Compatibility score (0.0 to ~1.0+)
  3. Both models work together:

    • ResNet provides item-level understanding
    • ViT provides outfit-level compatibility
    • Together they create personalized, context-aware recommendations
  4. The system is NOT just rule-based:

    • Deep learning models (ResNet + ViT) provide the core intelligence
    • Rules and heuristics (penalties/bonuses) refine the results
    • Tags and context guide the generation process

🔬 Technical Details

ResNet Architecture:

  • Backbone: ResNet50 (pretrained on ImageNet)
  • Input: 224Γ—224 RGB images
  • Output: 512-dimensional embeddings
  • Purpose: Extract visual features from clothing items

ViT Architecture:

  • Encoder: Transformer with 4-6 layers, 8 attention heads
  • Input: Sequence of item embeddings (variable length, 2-6 items)
  • Output: Single compatibility score
  • Purpose: Learn which items go well together

Training:

  • ResNet: Trained with triplet loss on item pairs
  • ViT: Trained with triplet loss on outfit triplets (positive, anchor, negative)
  • Both: Use early stopping, best model checkpointing

✅ Conclusion

YES - Both ResNet and ViT are actively used during inference!

  • ResNet generates item embeddings (visual understanding)
  • ViT scores outfit compatibility (relationship learning)
  • Together they create intelligent, personalized recommendations

The system is a true deep learning pipeline, not just rule-based filtering!