MODA-Fashion-Vision-FP16

Compressed vision-only encoder for fast image-to-image fashion retrieval: 4.2x smaller, with near-identical retrieval quality.

This is the vision tower extracted from MODA-Fashion-Distilled, converted to FP16 (half precision). It strips the unused text encoder (image-to-image tasks never invoke it) and halves the weight precision, shrinking the model from 775 MB to 186 MB at the cost of only 0.21 pp Fine R@1 on LookBench.

Key Numbers

| Property | Value |
|---|---|
| Architecture | ViT-B/16-SigLIP (vision tower only) |
| Parameters | 92.9M (vs 203M full CLIP) |
| Precision | float16 |
| Model Size | 186 MB (vs 775 MB full CLIP) |
| Embedding Dim | 768 |
| Input Resolution | 224 x 224 |
| LookBench Fine R@1 | 67.42% (full model: 67.63%) |
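
The parameter count, dtype, and size can be sanity-checked directly from the weights file. A minimal sketch; the local path is a placeholder:

from safetensors.torch import load_file

sd = load_file("path/to/moda-fashion-vision-fp16/vision_encoder.safetensors")
num_params = sum(t.numel() for t in sd.values())
dtypes = {t.dtype for t in sd.values()}
raw_mb = sum(t.numel() * t.element_size() for t in sd.values()) / 1e6

print(f"parameters: {num_params / 1e6:.1f}M")   # expected ~92.9M
print(f"dtypes: {dtypes}")                      # expected {torch.float16}
print(f"raw tensor size: {raw_mb:.0f} MB")      # expected ~186 MB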

LookBench Results (Fine Recall@1)

| Variant | Params | Size | RealStudio | AIGenStudio | RealStreet | AIGenStreet | Overall |
|---|---|---|---|---|---|---|---|
| MODA-Distilled (full CLIP) | 203M | 775 MB | 70.23 | 80.31 | 60.24 | 81.25 | 67.63 |
| MODA-Vision-FP16 (this) | 92.9M | 186 MB | 70.13 | 80.83 | 59.73 | 81.25 | 67.42 |
| FashionSigLIP baseline | 203M | 775 MB | 66.96 | 76.68 | 56.37 | 74.38 | 63.84 |

Inference: Quick Start

A standalone inference.py is included in this directory.

# Single image → 768-d embedding
python inference.py --image query.jpg

# Two images → embeddings + cosine similarity
python inference.py --image img1.jpg img2.jpg --similarity

# Run on GPU (keeps FP16 precision for speed)
python inference.py --image query.jpg --device cuda
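
If the weights are not already on disk, they can be pulled from the Hugging Face Hub first. A minimal sketch; the repo id HopitAI/moda-fashion-vision-fp16 and the filename vision_encoder.safetensors are taken from this card, so adjust them if your copy differs:

from huggingface_hub import hf_hub_download

# Download vision_encoder.safetensors into the local HF cache and return its path.
weights_path = hf_hub_download(
    repo_id="HopitAI/moda-fashion-vision-fp16",
    filename="vision_encoder.safetensors",
)
print(weights_path)  # pass this path to the Python API below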

Python API

import torch
import open_clip
import torch.nn.functional as F
from safetensors.torch import load_file
from PIL import Image

# Build the ViT-B-16-SigLIP architecture without downloading any pretrained weights.
# The text tower is randomly initialized (we never use it); only the visual tower
# is overwritten with MODA's fine-tuned weights below. Passing pretrained=None skips
# the ~775 MB pretrained-checkpoint download that would otherwise come with
# hf-hub:Marqo/marqo-fashionSigLIP.
base_model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16-SigLIP", pretrained=None
)

# Load MODA's vision-only fp16 weights (186 MB) and overlay onto the visual tower.
vision_sd = load_file("path/to/moda-fashion-vision-fp16/vision_encoder.safetensors")
# Upcast to float32 so the weights match the base model's dtype; on CPU this keeps
# inference in full precision (see the GPU note below for staying in fp16).
vision_sd_fp32 = {k: v.float() for k, v in vision_sd.items()}
full_sd = base_model.state_dict()
for k, v in vision_sd_fp32.items():
    full_sd[k] = v
base_model.load_state_dict(full_sd, strict=True)

encoder = base_model.visual
encoder.eval()

image = preprocess(Image.open("fashion_item.jpg")).unsqueeze(0)
with torch.no_grad():
    features = encoder(image)
    features = F.normalize(features, p=2, dim=-1)  # [1, 768]
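
On a GPU the float32 upcast is unnecessary; the encoder can stay in half precision for a faster forward pass, mirroring the --device cuda behavior noted in the Quick Start. A minimal sketch continuing from the snippet above (a CUDA device is assumed):

# Cast the vision tower back to float16 and move it to the GPU.
encoder = encoder.half().to("cuda").eval()

image = preprocess(Image.open("fashion_item.jpg")).unsqueeze(0).half().to("cuda")
with torch.no_grad():
    features = F.normalize(encoder(image), p=2, dim=-1)  # [1, 768], float16 on GPU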

Requirements

open_clip_torch>=2.20.0
torch>=2.0
Pillow
safetensors
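
All four dependencies are available on PyPI; the open_clip import comes from the open_clip_torch package:

pip install "open_clip_torch>=2.20.0" "torch>=2.0" Pillow safetensors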

Why Vision-Only?

For image-to-image retrieval (the primary use case), the text encoder is never used. Stripping it provides:

  • 54% fewer parameters (92.9M vs 203M), so a faster forward pass
  • 4.2x smaller on disk (186 MB vs 775 MB), so cheaper to deploy
  • Near-zero quality loss (-0.21 pp Fine R@1), within the noise margin
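
Because the workflow is purely image-to-image, a full retrieval pass needs nothing but the vision encoder. A minimal sketch reusing encoder, preprocess, and F from the Python API snippet above; the file names are placeholders and the brute-force cosine search is for illustration only, not a production index:

# Embed a small gallery of catalogue images (placeholder file names).
gallery_paths = ["look_001.jpg", "look_002.jpg", "look_003.jpg"]
with torch.no_grad():
    gallery = torch.cat([preprocess(Image.open(p)).unsqueeze(0) for p in gallery_paths])
    gallery_emb = F.normalize(encoder(gallery), p=2, dim=-1)              # [N, 768]
    query_emb = F.normalize(
        encoder(preprocess(Image.open("query.jpg")).unsqueeze(0)), p=2, dim=-1
    )                                                                     # [1, 768]

# On L2-normalized embeddings, cosine similarity is just a dot product.
scores = query_emb @ gallery_emb.T           # [1, N]
best = scores[0].argmax().item()
print(f"closest match: {gallery_paths[best]} (cosine {scores[0, best].item():.3f})")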

Training

This model inherits its weights from MODA-Fashion-Distilled, which was trained via:

  1. Knowledge distillation from a 10-model ensemble (SigLIP, CLIP, EVA-CLIP, MetaCLIP, DFN variants)
  2. Stratified contrastive learning on Google Shopping data (10K image pairs, category-balanced)
  3. Vision-only export: text tower removed
  4. FP16 conversion: weights cast from float32 to float16 (a sketch of steps 3 and 4 follows this list)
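
A rough sketch of what steps 3 and 4 amount to, starting from the full MODA-Fashion-Distilled checkpoint. The exact export script is not reproduced here, so the checkpoint path is a placeholder; the "visual." key prefix is how open_clip names the vision tower in its state dict:

import open_clip
from safetensors.torch import save_file

# Load the full fine-tuned CLIP (vision + text towers) in float32.
full_model, _, _ = open_clip.create_model_and_transforms(
    "ViT-B-16-SigLIP", pretrained="path/to/moda-fashion-distilled.pt"  # placeholder path
)

# Step 3: keep only the vision tower ("visual." prefix).
# Step 4: cast every kept tensor from float32 to float16.
vision_fp16 = {
    k: v.half()
    for k, v in full_model.state_dict().items()
    if k.startswith("visual.")
}

save_file(vision_fp16, "vision_encoder.safetensors")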

Related Models

| Model | Dim | Fine R@1 | Best for |
|---|---|---|---|
| MODA-Fashion-Distilled | 768 | 67.63 | Best overall quality |
| MODA-Fashion-Matryoshka | 64-768 | 67.42 (256d) | Flexible dim, 3x smaller index |
| MODA-Fashion-Vision-FP16 (this model) | 768 | 67.42 | Smallest (186 MB), edge/mobile |
| MODA-Fashion-Distilled-512d | 512 | 67.63 | Compact index, highest nDCG@5 |
| MODA-Fashion-DeepFashion2 | 768 | 66.52 | Simplest recipe, no distillation |

License

MIT

Citation

If you use this model, please cite:

@software{moda2026,
  title  = {MODA: Open-source benchmark and models for fashion search},
  author = {Hopit AI},
  year   = {2026},
  url    = {https://github.com/hopit-ai/Moda}
}