fkuyumcu committed
Commit ed99b9c · verified · 1 Parent(s): 0dd5ab2

Upload 4 files

Files changed (4)
  1. README.md +278 -3
  2. config.json +71 -0
  3. model.py +193 -0
  4. pytorch_model.bin +3 -0
README.md CHANGED
@@ -1,3 +1,278 @@
- ---
- license: mit
- ---

---
license: mit
language:
- en
- tr
tags:
- fashion
- outfit-recommendation
- multimodal
- transformer
- image-text
- complementary-item-retrieval
- pytorch
datasets:
- polyvore
pipeline_tag: feature-extraction
---

# Outfit Transformer CIR (Complementary Item Retrieval)

A multimodal Transformer model for **fashion outfit completion** and **complementary item retrieval**. Given a partial outfit (e.g., a t-shirt and jeans), the model predicts the ideal embedding for a missing item (e.g., shoes) that would complete the outfit harmoniously.

## Model Description

This model is based on the OutfitTransformer architecture proposed by **Sarkar et al.**, with several key modifications:

### Differences from Original Paper

| Aspect | Original (Sarkar et al.) | This Implementation |
|--------|--------------------------|---------------------|
| **Text Encoder** | BERT (768-dim) | **LaBSE** (768-dim) |
| **Text Language** | English only | Multilingual (109 languages) |
| **Loss Function** | InfoNCE | **Set-wise Outfit Ranking Loss** |
| **Negative Sampling** | Random | **Hard Negative Mining** (same category) |

### Why LaBSE instead of BERT?

[LaBSE (Language-agnostic BERT Sentence Embedding)](https://huggingface.co/sentence-transformers/LaBSE) was chosen because:

1. **Multilingual Support**: Covers 109 languages, enabling Turkish/English fashion descriptions
2. **Cross-lingual Alignment**: "Mavi tişört" and "blue t-shirt" produce similar embeddings
3. **Same Dimensionality**: Outputs 768-dim vectors, so it drops into the original architecture unchanged
4. **Production Ready**: Better suited for real-world, multilingual e-commerce catalogs

### Loss Function: Set-wise Outfit Ranking Loss

Instead of the standard InfoNCE loss, we use the **Set-wise Outfit Ranking Loss** from the paper (Section 3.2.2):

```
L_set = L_all + L_hard
```

Where:
- **L_all**: Margin-based ranking over all negatives
- **L_hard**: Extra penalty on the hardest negative (the closest wrong answer)

```python
# pos_dist: (B,)   distance to the positive item
# neg_dist: (B, K) distances to the K negative items

# L_ALL: margin ranking over all negatives
diff_all = pos_dist.unsqueeze(1) - neg_dist + margin   # margin = 2.0
loss_all = F.relu(diff_all).mean()

# L_HARD: extra focus on the hardest (closest) negative
min_neg_dist = neg_dist.min(dim=1).values
diff_hard = pos_dist - min_neg_dist + margin
loss_hard = F.relu(diff_hard).mean()

total_loss = loss_all + loss_hard
```

**Why this helps:**
- InfoNCE treats all negatives equally via the softmax
- The set-wise loss explicitly penalizes the hardest negative
- This reduces the **hubness problem**, where a few popular items dominate retrieval

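For completeness, here is a self-contained PyTorch sketch of the same loss over a batch of embeddings. The squared-Euclidean distance and the function name `setwise_outfit_ranking_loss` are illustrative assumptions, not taken from the released training code:

```python
import torch
import torch.nn.functional as F

def setwise_outfit_ranking_loss(pred, pos, negs, margin=2.0):
    """Set-wise outfit ranking loss (L_all + L_hard).

    pred: (B, D)    predicted embedding for the missing item (L2-normalized)
    pos:  (B, D)    embedding of the ground-truth item
    negs: (B, K, D) embeddings of K negative items
    """
    # Squared Euclidean distances (assumed metric; embeddings are L2-normalized)
    pos_dist = ((pred - pos) ** 2).sum(dim=-1)                # (B,)
    neg_dist = ((pred.unsqueeze(1) - negs) ** 2).sum(dim=-1)  # (B, K)

    # L_all: every negative should be at least `margin` farther than the positive
    loss_all = F.relu(pos_dist.unsqueeze(1) - neg_dist + margin).mean()

    # L_hard: extra penalty on the single closest (hardest) negative
    loss_hard = F.relu(pos_dist - neg_dist.min(dim=1).values + margin).mean()

    return loss_all + loss_hard
```
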
## Architecture

```
             OutfitTransformerCIR

  ┌──────────────┐            ┌──────────────┐
  │  ResNet-18   │            │    LaBSE     │
  │  (Frozen)    │            │  (Frozen)    │
  │   512-dim    │            │   768-dim    │
  └──────┬───────┘            └──────┬───────┘
         │                           │
  ┌──────▼───────┐            ┌──────▼───────┐
  │ Visual Proj  │            │  Text Proj   │   ← Trained
  │   512 → 64   │            │   768 → 64   │
  └──────┬───────┘            └──────┬───────┘
         │                           │
         └─────────────┬─────────────┘
                       │
               ┌───────▼───────┐
               │    Concat     │
               │ 64 + 64 = 128 │
               └───────┬───────┘
                       │
        ┌──────────────▼──────────────┐
        │ [QUERY] + Item Embeddings   │
        │     (Learnable Token)       │
        └──────────────┬──────────────┘
                       │
        ┌──────────────▼──────────────┐
        │    Transformer Encoder      │
        │     6 layers, 16 heads      │
        │    d_model=128, ff=512      │
        └──────────────┬──────────────┘
                       │
        ┌──────────────▼──────────────┐
        │     Output Projection       │
        │    + LayerNorm + L2 Norm    │
        └──────────────┬──────────────┘
                       │
               ┌───────▼───────┐
               │    128-dim    │
               │   Predicted   │
               │   Embedding   │
               └───────────────┘
```

## Benchmark Results

Evaluated on the **Polyvore Outfits** dataset (disjoint split):

| Metric | Score |
|--------|-------|
| **FITB Accuracy** | 56.39% |
| **MRR** | 0.7447 |
| **Recall@1** | 56.39% |
| **Recall@2** | 80.86% |
| **Recall@3** | 93.56% |
| **NDCG@3** | 0.7818 |
| **NDCG@5** | 0.8095 |

### Comparison with Baselines

| Model | FITB Accuracy | Notes |
|-------|---------------|-------|
| Random | 25.00% | 4-choice task |
| Type-Aware (Vasileva 2018) | ~53% | Category-specific spaces |
| **Ours (LaBSE + SetWise)** | **56.39%** | Multilingual, margin-based |
| Sarkar et al. (reported) | ~57% | English BERT, InfoNCE |

## Usage

### Installation

```bash
pip install torch torchvision transformers
```

### Loading the Model

```python
import torch
from model import OutfitTransformerCIR

# Load the model
model = OutfitTransformerCIR(embedding_dim=128, nhead=16, num_layers=6)
model.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu"))
model.eval()
```

### Inference Example

```python
# Assume you have pre-extracted features:
# context_images: (1, num_items, 512) - ResNet-18 features
# context_texts:  (1, num_items, 768) - LaBSE embeddings

with torch.no_grad():
    # Predict the missing item's embedding
    predicted_embedding = model(context_images, context_texts)  # (1, 128)

# Use cosine similarity to find the closest items in your database
similarities = torch.cosine_similarity(predicted_embedding, item_database)
top_matches = similarities.argsort(descending=True)[:10]
```

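The `item_database` above is a tensor of 128-dim embeddings for your candidate items. It can be built with the model's `encode_single_item` method (defined in `model.py`), continuing the running example; `candidate_images` and `candidate_texts` are illustrative names for per-item ResNet-18 and LaBSE features you have already extracted:

```python
# candidate_images: (N, 512) ResNet-18 features for N candidate items
# candidate_texts:  (N, 768) LaBSE embeddings for the same items
with torch.no_grad():
    item_database = model.encode_single_item(candidate_images, candidate_texts)  # (N, 128)
```
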
### Feature Extraction (for your own items)

```python
from torchvision import models, transforms
from transformers import AutoTokenizer, AutoModel
from PIL import Image
import torch
import torch.nn as nn

# Image encoder (ResNet-18, classification head removed)
resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
resnet = nn.Sequential(*list(resnet.children())[:-1])
resnet.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Text encoder (LaBSE)
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/LaBSE")
labse = AutoModel.from_pretrained("sentence-transformers/LaBSE")
labse.eval()

def extract_features(image_path, text_description):
    # Image: 512-dim
    image = Image.open(image_path).convert('RGB')
    img_tensor = preprocess(image).unsqueeze(0)
    with torch.no_grad():
        img_features = resnet(img_tensor).flatten(1)  # (1, 512)

    # Text: 768-dim
    inputs = tokenizer(text_description, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        txt_features = labse(**inputs).pooler_output  # (1, 768)

    return img_features, txt_features
```

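To feed a partial outfit into the model, the per-item features from `extract_features` just need to be stacked along a sequence dimension. A minimal end-to-end sketch (the file paths and descriptions are placeholders):

```python
# Hypothetical partial outfit: t-shirt + jeans; the model suggests what completes it.
items = [
    ("tshirt.jpg", "blue cotton t-shirt"),
    ("jeans.jpg",  "dark slim-fit jeans"),
]

img_list, txt_list = [], []
for path, text in items:
    img_feat, txt_feat = extract_features(path, text)  # (1, 512), (1, 768)
    img_list.append(img_feat)
    txt_list.append(txt_feat)

context_images = torch.stack(img_list, dim=1)  # (1, num_items, 512)
context_texts = torch.stack(txt_list, dim=1)   # (1, num_items, 768)

with torch.no_grad():
    predicted_embedding = model(context_images, context_texts)  # (1, 128)
```
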
## Training Details

| Hyperparameter | Value |
|----------------|-------|
| Optimizer | AdamW |
| Learning Rate | 1e-4 |
| Weight Decay | 0.01 |
| Batch Size | 64 |
| Epochs | 30 |
| Margin (loss) | 2.0 |
| Num Negatives | 5 |
| Hard Negative Ratio | 50% (same category) |

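The *Num Negatives* and *Hard Negative Ratio* rows mean each training query is contrasted against 5 negatives, roughly half of which share the ground-truth item's category. A hedged sketch of such a sampler; the `items_by_category` index, item dictionaries, and helper name are illustrative, not part of the released code:

```python
import random

def sample_negatives(target_item, all_items, items_by_category,
                     num_negatives=5, hard_ratio=0.5):
    """Sample negatives for one query; ~hard_ratio of them share the target's category."""
    num_hard = int(num_negatives * hard_ratio)

    # Hard negatives: same category as the missing item, excluding the item itself
    same_category = [i for i in items_by_category[target_item["category"]]
                     if i["id"] != target_item["id"]]
    hard = random.sample(same_category, min(num_hard, len(same_category)))

    # Easy negatives: random items from the rest of the catalog
    hard_ids = {i["id"] for i in hard}
    easy_pool = [i for i in all_items
                 if i["id"] != target_item["id"] and i["id"] not in hard_ids]
    easy = random.sample(easy_pool, num_negatives - len(hard))

    return hard + easy
```
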
### Training Data

- **Dataset**: Polyvore Outfits (Maryland split, disjoint)
- **Train**: ~17K outfits, ~250K items
- **Validation**: ~2K outfits
- **Test**: ~3K outfits

## Limitations

1. **Fixed Item Length**: The model expects at most 8 items per outfit (shorter outfits are padded)
2. **Frozen Encoders**: ResNet-18 and LaBSE are frozen during training; only the projections and the Transformer are trained
3. **Hubness**: Some popular items may dominate retrieval (mitigated with CSLS re-scoring, sketched below)
4. **Fashion Domain**: Trained on Polyvore data; it may not generalize to other domains

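CSLS (Cross-domain Similarity Local Scaling) is not part of the model itself; it is a retrieval-time re-scoring step. A minimal sketch of one common formulation, which penalizes gallery items that are close to many queries (the value of `k` is an assumption, and the query-side term of full CSLS is omitted because it does not change per-query rankings):

```python
import torch
import torch.nn.functional as F

def csls_scores(query_emb, gallery_emb, k=10):
    """Re-score retrieval to reduce hubness.

    query_emb:   (Q, D) predicted outfit-completion embeddings
    gallery_emb: (N, D) candidate item embeddings
    Returns a (Q, N) matrix: 2*cos(q, x) minus each gallery item's mean
    similarity to its k nearest queries (its local "hubness").
    """
    q = F.normalize(query_emb, dim=-1)
    g = F.normalize(gallery_emb, dim=-1)
    sims = q @ g.t()                                    # (Q, N) cosine similarities
    # Average similarity of each gallery item to its k closest queries
    r_gallery = sims.topk(k, dim=0).values.mean(dim=0)  # (N,)
    return 2 * sims - r_gallery.unsqueeze(0)
```
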
## Citation

If you use this model, please cite:

```bibtex
@misc{outfit-cir-transformer,
  author = {Kuyumcu, Furkan},
  title = {Outfit Transformer CIR: Multilingual Complementary Item Retrieval},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/fkuyumcu/outfit-cir-transformer}
}
```

### Original Paper Reference

```bibtex
@inproceedings{sarkar2022outfitbert,
  title = {OutfitTransformer: Learning Outfit Representations for Fashion Recommendation},
  author = {Sarkar, Rohan and others},
  booktitle = {CVPR Workshop on Computer Vision for Fashion, Art, and Design},
  year = {2022}
}
```

## License

MIT License

config.json ADDED
@@ -0,0 +1,71 @@
{
  "model_type": "outfit-cir-transformer",
  "architectures": ["OutfitTransformerCIR"],

  "embedding_dim": 128,
  "nhead": 16,
  "num_layers": 6,
  "dim_feedforward": 512,
  "dropout": 0.1,
  "max_items": 8,

  "image_encoder": {
    "name": "resnet18",
    "pretrained": "torchvision",
    "output_dim": 512,
    "frozen": true
  },

  "text_encoder": {
    "name": "sentence-transformers/LaBSE",
    "output_dim": 768,
    "frozen": true
  },

  "projection": {
    "visual": {"in_features": 512, "out_features": 64},
    "text": {"in_features": 768, "out_features": 64}
  },

  "training": {
    "loss": "SetWiseOutfitRankingLoss",
    "margin": 2.0,
    "num_negatives": 5,
    "hard_negative_ratio": 0.5,
    "optimizer": "AdamW",
    "learning_rate": 1e-4,
    "weight_decay": 0.01,
    "batch_size": 64,
    "epochs": 30
  },

  "dataset": {
    "name": "polyvore_outfits",
    "split": "disjoint",
    "train_outfits": 17316,
    "valid_outfits": 1497,
    "test_outfits": 3076,
    "total_items": 251008
  },

  "benchmark": {
    "fitb_accuracy": 0.5639,
    "mrr": 0.7447,
    "recall_at_1": 0.5639,
    "recall_at_2": 0.8086,
    "recall_at_3": 0.9356,
    "ndcg_at_3": 0.7818,
    "ndcg_at_5": 0.8095
  },

  "base_paper": {
    "title": "OutfitTransformer: Learning Outfit Representations for Fashion Recommendation",
    "authors": "Sarkar et al.",
    "venue": "CVPR Workshop 2022",
    "modifications": [
      "Replaced BERT with LaBSE for multilingual support",
      "Replaced InfoNCE with Set-wise Outfit Ranking Loss",
      "Added hard negative mining from same category"
    ]
  }
}
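
This config is informational rather than wired into `transformers` auto classes, but its architecture fields map directly onto the `OutfitTransformerCIR` constructor in `model.py`. A small sketch of building the model from it, assuming `config.json` and `pytorch_model.bin` sit in the working directory:

```python
import json
import torch
from model import OutfitTransformerCIR

with open("config.json") as f:
    cfg = json.load(f)

model = OutfitTransformerCIR(
    embedding_dim=cfg["embedding_dim"],  # 128
    nhead=cfg["nhead"],                  # 16
    num_layers=cfg["num_layers"],        # 6
)
model.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu"))
model.eval()
```
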
model.py ADDED
@@ -0,0 +1,193 @@
"""
OutfitTransformerCIR - Complementary Item Retrieval Model
==========================================================

Architecture based on Sarkar et al. with modifications:
- LaBSE instead of BERT for multilingual text encoding
- Set-wise Outfit Ranking Loss instead of InfoNCE

Usage:
    from model import OutfitTransformerCIR

    model = OutfitTransformerCIR()
    model.load_state_dict(torch.load("pytorch_model.bin"))
    model.eval()

    # context_images: (B, S, 512) - ResNet-18 features
    # context_texts:  (B, S, 768) - LaBSE embeddings
    predicted = model(context_images, context_texts)
    # predicted: (B, 128) - Missing item embedding
"""

import torch
import torch.nn as nn
import torch.nn.functional as F


class OutfitTransformerCIR(nn.Module):
    """
    Complementary Item Retrieval Transformer

    Given context items (a partial outfit), predicts the embedding of a missing item
    that would complete the outfit harmoniously.

    Architecture:
        - Visual projection: 512 (ResNet-18) → 64
        - Text projection: 768 (LaBSE) → 64
        - Combined: 64 + 64 = 128-dim item embedding
        - Transformer encoder: 6 layers, 16 heads
        - Learnable [QUERY] token for missing-item prediction

    Args:
        embedding_dim (int): Final embedding dimension (default: 128)
        nhead (int): Number of attention heads (default: 16)
        num_layers (int): Number of transformer layers (default: 6)
        use_projection (bool): Whether to apply projection layers.
            - True: Input is raw features (512 + 768)
            - False: Input is pre-projected features (64 + 64)
    """

    def __init__(self, embedding_dim=128, nhead=16, num_layers=6, use_projection=True):
        super(OutfitTransformerCIR, self).__init__()

        self.use_projection = use_projection
        self.embedding_dim = embedding_dim

        # Projection layers (trained, not frozen)
        self.visual_proj = nn.Linear(512, 64)
        self.text_proj = nn.Linear(768, 64)

        # Transformer encoder
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embedding_dim,
            nhead=nhead,
            dim_feedforward=512,
            batch_first=True,
            dropout=0.1
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

        # Learnable query token (represents the missing item)
        self.query_token = nn.Parameter(torch.randn(1, 1, embedding_dim))

        # Output projection with normalization
        self.output_proj = nn.Sequential(
            nn.Linear(embedding_dim, embedding_dim),
            nn.LayerNorm(embedding_dim)
        )

    def encode_items(self, images, texts):
        """
        Encode multiple items (for context).

        Args:
            images: (B, S, D_img) where D_img=512 (raw) or 64 (projected)
            texts: (B, S, D_txt) where D_txt=768 (raw) or 64 (projected)

        Returns:
            (B, S, 128) - Unified item embeddings
        """
        if self.use_projection:
            img_emb = self.visual_proj(images)
            txt_emb = self.text_proj(texts)
        else:
            img_emb = images
            txt_emb = texts

        return torch.cat((img_emb, txt_emb), dim=-1)

    def encode_single_item(self, image, text):
        """
        Encode a single item (for candidate scoring).

        Args:
            image: (B, D_img)
            text: (B, D_txt)

        Returns:
            (B, 128) - Item embedding
        """
        if self.use_projection:
            img_emb = self.visual_proj(image)
            txt_emb = self.text_proj(text)
        else:
            img_emb = image
            txt_emb = text

        return torch.cat((img_emb, txt_emb), dim=-1)

    def forward(self, context_images, context_texts, padding_mask=None):
        """
        Predict the embedding of a missing item.

        Args:
            context_images: (B, S, 512) - ResNet-18 features of context items
            context_texts: (B, S, 768) - LaBSE embeddings of context items
            padding_mask: (B, S) - True indicates padding positions

        Returns:
            (B, 128) - Predicted embedding for the missing item

        Example:
            >>> model = OutfitTransformerCIR()
            >>> # Outfit with 3 items: t-shirt, jeans, watch
            >>> img_features = torch.randn(1, 3, 512)  # ResNet-18 outputs
            >>> txt_features = torch.randn(1, 3, 768)  # LaBSE outputs
            >>> predicted = model(img_features, txt_features)
            >>> # predicted: (1, 128) - embedding for ideal 4th item (e.g., shoes)
        """
        batch_size = context_images.size(0)
        device = context_images.device

        # 1. Encode context items
        item_embeddings = self.encode_items(context_images, context_texts)

        # 2. Prepend learnable query token
        query = self.query_token.expand(batch_size, -1, -1)
        x = torch.cat([query, item_embeddings], dim=1)

        # 3. Build attention mask (query always attends, padding positions masked)
        if padding_mask is not None:
            query_mask = torch.zeros(batch_size, 1, dtype=torch.bool, device=device)
            full_mask = torch.cat([query_mask, padding_mask], dim=1)
        else:
            full_mask = None

        # 4. Transformer forward
        out = self.transformer(x, src_key_padding_mask=full_mask)

        # 5. Extract query output (first position)
        query_out = out[:, 0, :]

        # 6. Project and L2 normalize
        predicted = self.output_proj(query_out)
        predicted = F.normalize(predicted, p=2, dim=-1)

        return predicted


# Convenience function for loading
def load_model(checkpoint_path, device="cpu"):
    """
    Load a trained OutfitTransformerCIR model.

    Args:
        checkpoint_path: Path to pytorch_model.bin
        device: "cpu" or "cuda"

    Returns:
        Loaded model in eval mode
    """
    model = OutfitTransformerCIR(
        embedding_dim=128,
        nhead=16,
        num_layers=6,
        use_projection=True
    )

    state_dict = torch.load(checkpoint_path, map_location=device)
    model.load_state_dict(state_dict)
    model.to(device)
    model.eval()

    return model
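
The `padding_mask` argument of `forward` is what allows variable-length outfits to be batched together. A small usage sketch of the file above; the feature tensors are random placeholders, whereas in practice they come from ResNet-18/LaBSE as described in the README:

```python
import torch
from model import load_model

model = load_model("pytorch_model.bin")

# Batch of 2 outfits padded to 4 items; the second outfit has only 2 real items.
images = torch.randn(2, 4, 512)
texts = torch.randn(2, 4, 768)
padding_mask = torch.tensor([
    [False, False, False, False],  # all 4 positions are real items
    [False, False, True,  True],   # last 2 positions are padding
])

with torch.no_grad():
    predicted = model(images, texts, padding_mask=padding_mask)  # (2, 128)
```
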
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:bb5aad51daa663bc45c489d9711b69661ed75009df267c79be380df4b8614823
size 5184977