MODA-Fashion-Distilled-512d

Same quality as the 768-d model in a compact 512-d embedding.

MODA-Fashion-Distilled-512d combines the FashionSigLIP backbone with a learned 768→512 linear projection, trained via ensemble distillation. It achieves 67.63% Fine Recall@1 on LookBench, identical to the 768-d variant, while producing 33% smaller embeddings.
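The 33% figure follows directly from the dimension ratio (512 / 768). As a rough sketch, assuming a flat, uncompressed float32 index (4 bytes per value, no index overhead) and a hypothetical catalog of one million items:

```python
BYTES_PER_FLOAT32 = 4


def index_bytes(num_vectors: int, dim: int) -> int:
    """Raw storage for a flat float32 index, ignoring any index overhead."""
    return num_vectors * dim * BYTES_PER_FLOAT32


num_items = 1_000_000  # hypothetical catalog size, for illustration only
size_768 = index_bytes(num_items, 768)
size_512 = index_bytes(num_items, 512)
saving = 1 - size_512 / size_768

print(f"768-d: {size_768 / 1e9:.2f} GB")  # 768-d: 3.07 GB
print(f"512-d: {size_512 / 1e9:.2f} GB")  # 512-d: 2.05 GB
print(f"saving: {saving:.1%}")            # saving: 33.3%
```

Quantized or graph-based indexes add their own overhead, so the absolute numbers will differ in practice, but the ratio between the two dimensions should carry over.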

Highlights

  • 67.63% Fine R@1 on LookBench, tied with the 768-d distilled model
  • 54.11 nDCG@5, the highest of all models (even beats the 768-d model's 53.85)
  • 512-d output, giving a 33% smaller index than 768-d models
  • PCA-initialized projection head, refined with RKD-Distance loss
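One plausible reading of "PCA-initialized" is sketched below: the top 512 principal directions of a sample of 768-d backbone embeddings become the rows of the projection weight. This is an illustration, not the training code. The random `embeddings` array stands in for real backbone features, and `pca_projection_init` is a hypothetical helper name. The card does not say whether features are centered; this sketch centers them to find the directions but returns a bias-free weight.

```python
import numpy as np


def pca_projection_init(embeddings: np.ndarray, out_dim: int) -> np.ndarray:
    """Return an [out_dim, in_dim] weight whose rows are the top principal directions."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are orthonormal principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:out_dim]


rng = np.random.default_rng(0)
# Stand-in for 4096 backbone embeddings of dimension 768 (assumption: random data).
embeddings = rng.standard_normal((4096, 768)).astype(np.float32)
weight = pca_projection_init(embeddings, out_dim=512)

print(weight.shape)  # (512, 768), the layout nn.Linear(768, 512).weight expects
```

Starting from an orthonormal, variance-preserving projection means the 512-d student begins close to the 768-d model's geometry, so fine-tuning only has to refine it.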

LookBench Results

Model                        Params  Dim  Fine R@1  Coarse R@1  nDCG@5
FashionSigLIP                203M    768  63.84     83.67       49.63
FashionCLIP                  151M    512  59.36     78.46       45.20
MODA-Fashion-Distilled       203M    768  67.63     86.74       53.85
MODA-Fashion-Distilled-512d  203M    512  67.63     86.87       54.11

Per-subset Fine Recall@1

Subset            Queries  FashionSigLIP  Ours   Delta
RealStudioFlat    1,011    66.96          70.23  +3.27
AIGen-Studio      193      76.68          78.24  +1.56
RealStreetLook    981      56.37          60.24  +3.87
AIGen-StreetLook  160      74.38          83.75  +9.37
Overall           2,345    63.84          67.63  +3.79
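The Overall row is consistent with a query-weighted mean of the four subsets. That is an inference from the numbers, not something the card states. A quick check using the values from the table above:

```python
# (subset, queries, FashionSigLIP R@1, this model's R@1), copied from the table
subsets = [
    ("RealStudioFlat", 1011, 66.96, 70.23),
    ("AIGen-Studio", 193, 76.68, 78.24),
    ("RealStreetLook", 981, 56.37, 60.24),
    ("AIGen-StreetLook", 160, 74.38, 83.75),
]

total_queries = sum(n for _, n, _, _ in subsets)
# Weight each subset's recall by its number of queries.
baseline = sum(n * base for _, n, base, _ in subsets) / total_queries
ours = sum(n * score for _, n, _, score in subsets) / total_queries

print(total_queries)      # 2345
print(f"{baseline:.2f}")  # 63.84
print(f"{ours:.2f}")      # 67.63
```

Because the two Real subsets account for roughly 85% of the queries, the overall gain (+3.79) sits close to their deltas even though AIGen-StreetLook improves by far more.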

Model Spec

Property                  Value
Architecture              ViT-B/16-SigLIP + Linear(768→512) projection
Parameters                203.2M (backbone) + 393K (projection) = 203.6M
Embedding Dimension       512
Output                    L2-normalized float32 vector
Model Size (safetensors)  ~777 MB
Input Resolution          224 × 224
Framework                 OpenCLIP + custom projection head
Precision                 float32
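The projection's parameter count follows from its shape: a bias-free 768→512 linear layer holds only a weight matrix. A short check of the arithmetic, taking the 203.2M backbone figure from the table as given:

```python
in_dim, out_dim = 768, 512

# A linear layer without bias has in_dim * out_dim weights and nothing else.
projection_params = in_dim * out_dim
backbone_params = 203.2e6  # backbone figure quoted in the spec table
total_params = backbone_params + projection_params

print(projection_params)             # 393216
print(f"{total_params / 1e6:.1f}M")  # 203.6M
```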

Inference: Quick Start

A standalone inference.py is included in this directory.

# Single image → 512-d embedding
python inference.py --image query.jpg

# Two images → embeddings + cosine similarity
python inference.py --image img1.jpg img2.jpg --similarity

# Run on GPU/MPS
python inference.py --image query.jpg --device cuda

Python API

import open_clip
import torch
import torch.nn as nn
import torch.nn.functional as F
from PIL import Image
from safetensors.torch import load_file

# Build the ViT-B-16-SigLIP architecture without downloading any pretrained weights.
# MODA's state_dict below contains all 162 visual.* keys, so no random-init values
# leak into the visual tower. Avoids the ~775 MB Marqo checkpoint download.
backbone, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16-SigLIP", pretrained=None
)

state = load_file("path/to/moda-fashion-distilled-512d/model.safetensors")
proj_weight = state.pop("proj.weight")
backbone.load_state_dict(state, strict=False)
backbone.eval()

proj = nn.Linear(768, 512, bias=False)
proj.weight.data.copy_(proj_weight)
proj.eval()

image = preprocess(Image.open("query.jpg")).unsqueeze(0)
with torch.no_grad():
    features = backbone.encode_image(image)
    features = F.normalize(proj(features), p=2, dim=-1)  # [1, 512]
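Because the output is L2-normalized, cosine similarity between two embeddings reduces to a dot product, so nearest-neighbor search over a gallery is a single matrix-vector product. A minimal sketch with NumPy, using random unit vectors as stand-ins for real embeddings; `l2_normalize` and `top_k` are hypothetical helpers, not part of inference.py:

```python
import numpy as np


def l2_normalize(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Scale vectors to unit length along the given axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def top_k(query: np.ndarray, gallery: np.ndarray, k: int = 5):
    """Indices and scores of the k gallery rows most similar to the query."""
    scores = gallery @ query          # equals cosine similarity for unit vectors
    order = np.argsort(-scores)[:k]   # highest similarity first
    return order, scores[order]


rng = np.random.default_rng(0)
# Stand-in gallery of 1000 512-d embeddings (assumption: random data).
gallery = l2_normalize(rng.standard_normal((1000, 512)).astype(np.float32))
query = gallery[42]  # query with an item that is already in the gallery

indices, scores = top_k(query, gallery, k=5)
print(indices[0])  # 42: the item's own embedding is its nearest neighbor
```

For large catalogs the brute-force product would typically be replaced by an approximate nearest-neighbor index, but the similarity function stays the same.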

Requirements

open_clip_torch>=2.20.0
torch>=2.0
Pillow
safetensors

Training Details

  • Base model: MODA-Fashion-Distilled (768-d, the distilled student)
  • Projection: Linear(768 → 512, no bias), PCA-initialized from 4096 training samples
  • Teacher: 2048-d ensemble (MODA-SigLIP-DF2 + FashionSigLIP + FashionCLIP)
  • Loss: RKD-Distance (weight 25) + similarity mimicry (weight 10) + L2 weight drift (weight 0.01)
  • Optimizer: AdamW, backbone LR=5e-6, projection head LR=5e-4
  • Training data: DeepFashion-InShop + DeepFashion-Multimodal + H&M (30K limit)
  • Epochs: 3 (best at step 800)
  • Hardware: Apple M-series (MPS)
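The RKD-Distance term can be sketched as follows. This follows the common formulation of relational knowledge distillation (pairwise distances, each normalized by their mean, compared with a Huber loss); whether the training code matches it exactly is an assumption. The function names and random inputs are hypothetical, and the similarity-mimicry and weight-drift terms are not shown because the card does not define them.

```python
import numpy as np


def pairwise_distances(x: np.ndarray) -> np.ndarray:
    """Euclidean distance between every pair of rows."""
    sq = np.sum(x * x, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    return np.sqrt(np.clip(d2, 0.0, None))  # clip guards against tiny negatives


def huber(diff: np.ndarray, delta: float = 1.0) -> np.ndarray:
    """Quadratic near zero, linear beyond delta."""
    a = np.abs(diff)
    return np.where(a <= delta, 0.5 * a**2, delta * (a - 0.5 * delta))


def rkd_distance_loss(student: np.ndarray, teacher: np.ndarray) -> float:
    """Huber loss between mean-normalized pairwise distance matrices."""
    mask = ~np.eye(len(student), dtype=bool)  # ignore self-distances
    ds = pairwise_distances(student)
    dt = pairwise_distances(teacher)
    ds = ds / ds[mask].mean()
    dt = dt / dt[mask].mean()
    return float(huber(ds - dt)[mask].mean())


rng = np.random.default_rng(0)
teacher = rng.standard_normal((32, 2048))  # stand-in for 2048-d ensemble embeddings
student = rng.standard_normal((32, 512))   # stand-in for 512-d student embeddings

loss = rkd_distance_loss(student, teacher)
print(loss > 0.0)  # True
```

The loss compares distance structure within a batch, not the embeddings themselves, which is why a 512-d student can be matched against a 2048-d teacher without any projection between the two spaces.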

Related Models

Model                                     Dim     Fine R@1      Best for
MODA-Fashion-Distilled                    768     67.63         Best overall quality
MODA-Fashion-Matryoshka                   64-768  67.42 (256d)  Flexible dim, 3x smaller index
MODA-Fashion-Vision-FP16                  768     67.42         Smallest (186 MB), edge/mobile
MODA-Fashion-Distilled-512d (this model)  512     67.63         Compact index, highest nDCG@5
MODA-Fashion-DeepFashion2                 768     66.52         Simplest recipe, no distillation

License

MIT

Citation

If you use this model, please cite:

@software{moda2026,
  title  = {MODA: Open-source benchmark and models for fashion search},
  author = {Hopit AI},
  year   = {2026},
  url    = {https://github.com/hopit-ai/Moda}
}