Same quality as the 768-d model in a compact 512-d embedding.
MODA-Fashion-Distilled-512d combines the FashionSigLIP backbone with a learned 768→512 linear projection, trained via ensemble distillation. It achieves 67.63% Fine Recall@1 on LookBench, identical to the 768-d variant, while producing 33% smaller embeddings.
| Model | Params | Dim | Fine R@1 | Coarse R@1 | nDCG@5 |
|---|---|---|---|---|---|
| FashionSigLIP | 203M | 768 | 63.84 | 83.67 | 49.63 |
| FashionCLIP | 151M | 512 | 59.36 | 78.46 | 45.20 |
| MODA-Fashion-Distilled | 203M | 768 | 67.63 | 86.74 | 53.85 |
| MODA-Fashion-Distilled-512d | 203M | 512 | 67.63 | 86.87 | 54.11 |
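
For readers less familiar with the metrics in the table above, here is a minimal sketch of the textbook definitions of Recall@1 and nDCG@k. It is not LookBench's evaluation code: the function names, the 0/1 relevance flags, and the shortcut of computing the ideal ordering from the retrieved list alone are assumptions made for illustration, and the benchmark's fine versus coarse relevance criteria are not reproduced here.

```python
import math


def recall_at_1(ranked_relevance):
    """Fraction of queries whose top-ranked result is relevant.

    ranked_relevance: one list per query, holding 0/1 relevance flags
    in rank order (hypothetical input format).
    """
    hits = sum(1 for flags in ranked_relevance if flags and flags[0] > 0)
    return hits / len(ranked_relevance)


def ndcg_at_k(gains, k=5):
    """Normalized discounted cumulative gain for a single query.

    gains: relevance grades of the retrieved items in rank order.
    Simplification: the ideal ranking is the same list sorted by gain.
    """
    def dcg(values):
        # Rank 1 is discounted by log2(2), rank 2 by log2(3), and so on.
        return sum(g / math.log2(rank + 2) for rank, g in enumerate(values[:k]))

    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Toy data: four queries, two of which have a relevant top-1 result.
toy_rankings = [[1, 0], [0, 1], [1, 1], [0, 0]]
toy_recall = recall_at_1(toy_rankings)
toy_ndcg = ndcg_at_k([0, 1])
```

A reported benchmark score would be the mean of such per-query values over all queries, expressed as a percentage.
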
| Subset | Queries | FashionSigLIP | Ours | Delta |
|---|---|---|---|---|
| RealStudioFlat | 1,011 | 66.96 | 70.23 | +3.27 |
| AIGen-Studio | 193 | 76.68 | 78.24 | +1.56 |
| RealStreetLook | 981 | 56.37 | 60.24 | +3.87 |
| AIGen-StreetLook | 160 | 74.38 | 83.75 | +9.37 |
| Overall | 2,345 | 63.84 | 67.63 | +3.79 |
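
The Overall row is consistent with a query-weighted mean of the four subset scores, which the short check below reproduces from the numbers in the table. That the overall figure was actually computed this way is an inference from the arithmetic, not something stated in the source, and the variable names are illustrative.

```python
# (queries, FashionSigLIP Fine R@1, this model's Fine R@1), copied from the table.
subsets = {
    "RealStudioFlat": (1011, 66.96, 70.23),
    "AIGen-Studio": (193, 76.68, 78.24),
    "RealStreetLook": (981, 56.37, 60.24),
    "AIGen-StreetLook": (160, 74.38, 83.75),
}

total_queries = sum(row[0] for row in subsets.values())


def weighted_mean(column):
    """Average a score column, weighting each subset by its query count."""
    return sum(row[0] * row[column] for row in subsets.values()) / total_queries


baseline_overall = weighted_mean(1)
ours_overall = weighted_mean(2)

print(total_queries)               # 2345
print(round(baseline_overall, 2))  # 63.84
print(round(ours_overall, 2))      # 67.63
```

Because the two real-photo subsets hold about 85% of the queries, the overall gain sits close to their deltas, not to the larger +9.37 on AIGen-StreetLook.
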
| Property | Value |
|---|---|
| Architecture | ViT-B/16-SigLIP + Linear(768→512) projection |
| Parameters | 203.2M (backbone) + 393K (projection) = 203.6M |
| Embedding Dimension | 512 |
| Output | L2-normalized float32 vector |
| Model Size (safetensors) | ~777 MB |
| Input Resolution | 224 × 224 |
| Framework | OpenCLIP + custom projection head |
| Precision | float32 |
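
The projection size and the embedding savings follow from simple arithmetic, sketched below. The calculation assumes a bias-free linear layer, which matches the 393K figure in the table. The one-million-vector index is a hypothetical example size, and the byte counts cover raw float32 vectors only, without any index overhead.

```python
IN_DIM, OUT_DIM = 768, 512
BYTES_PER_FLOAT32 = 4

# A bias-free Linear(768 -> 512) holds one weight per input/output pair.
proj_params = IN_DIM * OUT_DIM  # 393,216, i.e. the ~393K in the table


def raw_index_bytes(num_vectors, dim):
    """Bytes needed to store num_vectors float32 embeddings of size dim."""
    return num_vectors * dim * BYTES_PER_FLOAT32


num_vectors = 1_000_000  # hypothetical catalogue size
full_bytes = raw_index_bytes(num_vectors, IN_DIM)      # 3,072,000,000
compact_bytes = raw_index_bytes(num_vectors, OUT_DIM)  # 2,048,000,000
saving = 1 - compact_bytes / full_bytes                # one third smaller

print(proj_params, full_bytes, compact_bytes, round(saving * 100))
```
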
A standalone inference.py is included in this directory.
# Single image → 512-d embedding
python inference.py --image query.jpg
# Two images → embeddings + cosine similarity
python inference.py --image img1.jpg img2.jpg --similarity
# Run on GPU/MPS
python inference.py --image query.jpg --device cuda
import open_clip
import torch
import torch.nn as nn
import torch.nn.functional as F
from PIL import Image
from safetensors.torch import load_file
# Build the ViT-B-16-SigLIP architecture without downloading any pretrained weights.
# MODA's state_dict below contains all 162 visual.* keys, so no random-init values
# leak into the visual tower. Avoids the ~775 MB Marqo checkpoint download.
backbone, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16-SigLIP", pretrained=None
)
state = load_file("path/to/moda-fashion-distilled-512d/model.safetensors")
proj_weight = state.pop("proj.weight")
backbone.load_state_dict(state, strict=False)
backbone.eval()
proj = nn.Linear(768, 512, bias=False)
proj.weight.data.copy_(proj_weight)
proj.eval()
image = preprocess(Image.open("query.jpg")).unsqueeze(0)
with torch.no_grad():
    features = backbone.encode_image(image)
    features = F.normalize(proj(features), p=2, dim=-1)  # [1, 512]
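
Because the embeddings are L2-normalized, the cosine similarity between two of them reduces to a plain dot product. The sketch below illustrates this with short plain-Python lists standing in for the 512-d tensors. The helper names and the toy vectors are hypothetical and are not part of inference.py.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length, mirroring the normalization step above."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Dot product of two unit-length vectors, which equals their cosine."""
    return sum(x * y for x, y in zip(a, b))


def rank_gallery(query, gallery):
    """Return gallery indices ordered from most to least similar to the query."""
    scores = [(cosine_similarity(query, item), idx) for idx, item in enumerate(gallery)]
    return [idx for _, idx in sorted(scores, reverse=True)]


# Toy 3-d stand-ins for real embeddings.
query = l2_normalize([1.0, 0.0, 0.0])
gallery = [
    l2_normalize([0.0, 1.0, 0.0]),  # orthogonal to the query
    l2_normalize([1.0, 0.1, 0.0]),  # nearly identical direction
    l2_normalize([1.0, 1.0, 0.0]),  # 45 degrees away
]
ranking = rank_gallery(query, gallery)  # [1, 2, 0]
```

With the PyTorch tensors from the snippet above, the same score for two `[1, 512]` embeddings `a` and `b` could be computed as `(a * b).sum(dim=-1)`.
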
open_clip_torch>=2.20.0
torch>=2.0
Pillow
safetensors
| Model | Dim | Fine R@1 | Best for |
|---|---|---|---|
| MODA-Fashion-Distilled | 768 | 67.63 | Best overall quality |
| MODA-Fashion-Matryoshka | 64-768 | 67.42 (256d) | Flexible dim, 3x smaller index |
| MODA-Fashion-Vision-FP16 | 768 | 67.42 | Smallest (186 MB), edge/mobile |
| MODA-Fashion-Distilled-512d (this model) | 512 | 67.63 | Compact index, highest nDCG@5 |
| MODA-Fashion-DeepFashion2 | 768 | 66.52 | Simplest recipe, no distillation |
MIT
If you use this model, please cite:
@software{moda2026,
  title  = {MODA: Open-source benchmark and models for fashion search},
  author = {Hopit AI},
  year   = {2026},
  url    = {https://github.com/hopit-ai/Moda}
}