MODA-Fashion-Vision-FP16

Compressed vision-only encoder for fast image-to-image fashion retrieval: 4.2x smaller, with near-identical retrieval quality.

This is the vision tower extracted from MODA-Fashion-Distilled, converted to FP16 (half precision). It strips the unused text encoder (image-to-image tasks never invoke it) and halves the weight precision, shrinking the model from 775 MB to 186 MB at the cost of only 0.21 pp Fine R@1 on LookBench.

Key Numbers

| Property | Value |
|---|---|
| Architecture | ViT-B/16-SigLIP (vision tower only) |
| Parameters | 92.9M (vs 203M full CLIP) |
| Precision | float16 |
| Model Size | 186 MB (vs 775 MB full CLIP) |
| Embedding Dim | 768 |
| Input Resolution | 224 x 224 |
| LookBench Fine R@1 | 67.42% (full model: 67.63%) |
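
The parameter count, dtype, and size can be sanity-checked directly from the weights file. A minimal sketch; the local path is a placeholder:

from safetensors.torch import load_file

sd = load_file("path/to/moda-fashion-vision-fp16/vision_encoder.safetensors")
num_params = sum(t.numel() for t in sd.values())
dtypes = {t.dtype for t in sd.values()}
raw_mb = sum(t.numel() * t.element_size() for t in sd.values()) / 1e6

print(f"parameters: {num_params / 1e6:.1f}M")   # expected ~92.9M
print(f"dtypes: {dtypes}")                      # expected {torch.float16}
print(f"raw tensor size: {raw_mb:.0f} MB")      # expected ~186 MB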

LookBench Results (Fine Recall@1)

| Variant | Params | Size | RealStudio | AIGenStudio | RealStreet | AIGenStreet | Overall |
|---|---|---|---|---|---|---|---|
| MODA-Distilled (full CLIP) | 203M | 775 MB | 70.23 | 80.31 | 60.24 | 81.25 | 67.63 |
| MODA-Vision-FP16 (this) | 92.9M | 186 MB | 70.13 | 80.83 | 59.73 | 81.25 | 67.42 |
| FashionSigLIP baseline | 203M | 775 MB | 66.96 | 76.68 | 56.37 | 74.38 | 63.84 |

Inference: Quick Start

A standalone inference.py is included in this directory.

# Single image → 768-d embedding
python inference.py --image query.jpg

# Two images → embeddings + cosine similarity
python inference.py --image img1.jpg img2.jpg --similarity

# Run on GPU (keeps FP16 precision for speed)
python inference.py --image query.jpg --device cuda
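
If the weights are not already on disk, they can be pulled from the Hugging Face Hub first. A minimal sketch; the repo id HopitAI/moda-fashion-vision-fp16 and the filename vision_encoder.safetensors are taken from this card, so adjust them if your copy differs:

from huggingface_hub import hf_hub_download

# Download vision_encoder.safetensors into the local HF cache and return its path.
weights_path = hf_hub_download(
    repo_id="HopitAI/moda-fashion-vision-fp16",
    filename="vision_encoder.safetensors",
)
print(weights_path)  # pass this path to the Python API below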

Python API

import torch
import open_clip
import torch.nn.functional as F
from safetensors.torch import load_file
from PIL import Image

# Build the ViT-B-16-SigLIP architecture without downloading any pretrained weights.
# The text tower is randomly initialized (we never use it); only the visual tower
# is overwritten with MODA's fine-tuned weights below. Passing pretrained=None skips
# the ~775 MB pretrained-checkpoint download that would otherwise come with
# hf-hub:Marqo/marqo-fashionSigLIP.
base_model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16-SigLIP", pretrained=None
)

# Load MODA's vision-only fp16 weights (186 MB) and overlay onto the visual tower.
vision_sd = load_file("path/to/moda-fashion-vision-fp16/vision_encoder.safetensors")
# Upcast to float32 so the weights match the base model's dtype; on CPU this keeps
# inference in full precision (see the GPU note below for staying in fp16).
vision_sd_fp32 = {k: v.float() for k, v in vision_sd.items()}
full_sd = base_model.state_dict()
for k, v in vision_sd_fp32.items():
    full_sd[k] = v
base_model.load_state_dict(full_sd, strict=True)

encoder = base_model.visual
encoder.eval()

image = preprocess(Image.open("fashion_item.jpg")).unsqueeze(0)
with torch.no_grad():
    features = encoder(image)
    features = F.normalize(features, p=2, dim=-1)  # [1, 768]
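
On a GPU the float32 upcast is unnecessary; the encoder can stay in half precision for a faster forward pass, mirroring the --device cuda behavior noted in the Quick Start. A minimal sketch continuing from the snippet above (a CUDA device is assumed):

# Cast the vision tower back to float16 and move it to the GPU.
encoder = encoder.half().to("cuda").eval()

image = preprocess(Image.open("fashion_item.jpg")).unsqueeze(0).half().to("cuda")
with torch.no_grad():
    features = F.normalize(encoder(image), p=2, dim=-1)  # [1, 768], float16 on GPU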

Requirements

open_clip_torch>=2.20.0
torch>=2.0
Pillow
safetensors
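
All four dependencies are available on PyPI; the open_clip import comes from the open_clip_torch package:

pip install "open_clip_torch>=2.20.0" "torch>=2.0" Pillow safetensors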

Why Vision-Only?

For image-to-image retrieval (the primary use case), the text encoder is never used. Stripping it provides:

  • 54% fewer parameters (92.9M vs 203M), so a faster forward pass
  • 4.2x smaller on disk (186 MB vs 775 MB), so cheaper to deploy
  • Near-zero quality loss (-0.21 pp Fine R@1), within the noise margin
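
Because the workflow is purely image-to-image, a full retrieval pass needs nothing but the vision encoder. A minimal sketch reusing encoder, preprocess, and F from the Python API snippet above; the file names are placeholders and the brute-force cosine search is for illustration only, not a production index:

# Embed a small gallery of catalogue images (placeholder file names).
gallery_paths = ["look_001.jpg", "look_002.jpg", "look_003.jpg"]
with torch.no_grad():
    gallery = torch.cat([preprocess(Image.open(p)).unsqueeze(0) for p in gallery_paths])
    gallery_emb = F.normalize(encoder(gallery), p=2, dim=-1)              # [N, 768]
    query_emb = F.normalize(
        encoder(preprocess(Image.open("query.jpg")).unsqueeze(0)), p=2, dim=-1
    )                                                                     # [1, 768]

# On L2-normalized embeddings, cosine similarity is just a dot product.
scores = query_emb @ gallery_emb.T           # [1, N]
best = scores[0].argmax().item()
print(f"closest match: {gallery_paths[best]} (cosine {scores[0, best].item():.3f})")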

Training

This model inherits its weights from MODA-Fashion-Distilled, which was trained via:

  1. Knowledge distillation from a 10-model ensemble (SigLIP, CLIP, EVA-CLIP, MetaCLIP, DFN variants)
  2. Stratified contrastive learning on Google Shopping data (10K image pairs, category-balanced)
  3. Vision-only export: text tower removed
  4. FP16 conversion: weights cast from float32 to float16 (a sketch of steps 3 and 4 follows this list)
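
A rough sketch of what steps 3 and 4 amount to, starting from the full MODA-Fashion-Distilled checkpoint. The exact export script is not reproduced here, so the checkpoint path is a placeholder; the "visual." key prefix is how open_clip names the vision tower in its state dict:

import open_clip
from safetensors.torch import save_file

# Load the full fine-tuned CLIP (vision + text towers) in float32.
full_model, _, _ = open_clip.create_model_and_transforms(
    "ViT-B-16-SigLIP", pretrained="path/to/moda-fashion-distilled.pt"  # placeholder path
)

# Step 3: keep only the vision tower ("visual." prefix).
# Step 4: cast every kept tensor from float32 to float16.
vision_fp16 = {
    k: v.half()
    for k, v in full_model.state_dict().items()
    if k.startswith("visual.")
}

save_file(vision_fp16, "vision_encoder.safetensors")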

Related Models

| Model | Dim | Fine R@1 | Best for |
|---|---|---|---|
| MODA-Fashion-Distilled | 768 | 67.63 | Best overall quality |
| MODA-Fashion-Matryoshka | 64-768 | 67.42 (256d) | Flexible dim, 3x smaller index |
| MODA-Fashion-Vision-FP16 (this model) | 768 | 67.42 | Smallest (186 MB), edge/mobile |
| MODA-Fashion-Distilled-512d | 512 | 67.63 | Compact index, highest nDCG@5 |
| MODA-Fashion-DeepFashion2 | 768 | 66.52 | Simplest recipe, no distillation |

License

MIT

Citation

If you use this model, please cite:

@software{moda2026,
  title  = {MODA: Open-source benchmark and models for fashion search},
  author = {Hopit AI},
  year   = {2026},
  url    = {https://github.com/hopit-ai/Moda}
}