GIL-CLIP

GIL-CLIP wraps Fashion-CLIP with an oracle-guided image projector that produces a re-aligned image embedding while preserving Fashion-CLIP's original text tower.

Architecture

GIL-CLIP exposes two towers in one model:

Image tower — Fashion-CLIP image encoder followed by the GIL projector. Produces image_embeds.
Text tower — Unchanged Fashion-CLIP text encoder. Produces text_embeds.

For convenience, the model also returns the original Fashion-CLIP image embedding (pre-projection) as clip_image_embeds, so downstream users can compare GIL-aligned and Fashion-CLIP-native image representations side by side.

The text tower is unchanged from Fashion-CLIP. This is by design: GIL training adjusts the image side via the oracle-guided projector while keeping the text side as the alignment anchor.
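The side-by-side comparison mentioned above can be sketched as follows. This is a minimal illustration, assuming the three model outputs have already been converted to NumPy arrays (for example with `outputs.image_embeds.numpy()`) and are L2-normalized. `compare_towers` and `l2_normalize` are hypothetical helper names, not part of the model's API, and the arrays are random stand-ins rather than real model outputs. The embedding size of 512 is also an assumption for illustration.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def compare_towers(image_embeds, clip_image_embeds, text_embeds):
    """Score one image against each prompt using both image representations.

    Returns two 1-D arrays of cosine similarities, one entry per prompt:
    the first from the GIL-projected embedding, the second from the
    original Fashion-CLIP embedding.
    """
    gil_scores = (image_embeds @ text_embeds.T).squeeze(0)
    clip_scores = (clip_image_embeds @ text_embeds.T).squeeze(0)
    return gil_scores, clip_scores


# Random stand-ins for real model outputs (1 image, 3 prompts).
# The dimension 512 is an assumption, not a documented property of GIL-CLIP.
rng = np.random.default_rng(0)
image_embeds = l2_normalize(rng.normal(size=(1, 512)))
clip_image_embeds = l2_normalize(rng.normal(size=(1, 512)))
text_embeds = l2_normalize(rng.normal(size=(3, 512)))

gil_scores, clip_scores = compare_towers(image_embeds, clip_image_embeds, text_embeds)
for i, (g, c) in enumerate(zip(gil_scores, clip_scores)):
    print(f"prompt {i}: GIL={g:+.4f}  Fashion-CLIP={c:+.4f}  delta={g - c:+.4f}")
```

With real outputs, the per-prompt delta gives a quick view of how much the projector shifts each image-text score relative to the native Fashion-CLIP embedding.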

Example

For best results, run GIL-CLIP on the cropped garment region rather than the full scene. The usage snippet below feeds the model the cropped version of the example image (example_top.png in this repo).
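One way to prepare such a crop is sketched here. It assumes a garment bounding box is already available, for instance from a detector or a manual annotation; this model card does not say how the example crop was produced. `padded_crop_box`, the box coordinates, the margin value, and the file name in the comment are all hypothetical.

```python
def padded_crop_box(box, image_size, margin=0.1):
    """Expand a garment bounding box by a relative margin and clamp it to the image.

    box        -- (left, upper, right, lower) in pixels
    image_size -- (width, height) of the full image
    margin     -- fraction of the box size added on every side
    """
    left, upper, right, lower = box
    width, height = image_size
    pad_x = round((right - left) * margin)
    pad_y = round((lower - upper) * margin)
    return (
        max(0, left - pad_x),
        max(0, upper - pad_y),
        min(width, right + pad_x),
        min(height, lower + pad_y),
    )


# Hypothetical box around a top in a 640x960 photo.
crop_box = padded_crop_box((200, 150, 440, 500), (640, 960), margin=0.1)
print(crop_box)

# With Pillow, the crop itself is a single call on the loaded image:
#   garment = Image.open("full_scene.jpg").convert("RGB").crop(crop_box)
```

A small margin keeps a little context around the garment while still excluding most of the background. The right amount is a judgment call and is not specified by this model card.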

Usage

import torch
from PIL import Image
from huggingface_hub import hf_hub_download
from transformers import AutoModel, CLIPProcessor

model = AutoModel.from_pretrained("gilgmesh/gil-clip", trust_remote_code=True)
processor = CLIPProcessor.from_pretrained("gilgmesh/gil-clip")
model.eval()

# Load the cropped example image straight from this repo
example_path = hf_hub_download("gilgmesh/gil-clip", "example_top.png")
image = Image.open(example_path).convert("RGB")

texts = ["sleeveless navy top", "black dress", "graphic tee"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

print(outputs.image_embeds.shape)        # GIL image tower
print(outputs.text_embeds.shape)         # Fashion-CLIP text tower
print(outputs.clip_image_embeds.shape)   # original Fashion-CLIP image tower

The embeddings are L2-normalized by default, so cosine similarity is just a dot product:

similarity = outputs.image_embeds @ outputs.text_embeds.T
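The reason a dot product suffices can be checked with a small NumPy sketch. The vectors are toy values chosen for easy arithmetic, not model outputs, and `cosine_similarity` and `is_unit_norm` are hypothetical helpers. The second helper can serve as a sanity check on embeddings that were converted to NumPy arrays before relying on the dot-product shortcut.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity between every row of a and every row of b."""
    a_unit = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a_unit @ b_unit.T


def is_unit_norm(x: np.ndarray, tol: float = 1e-4) -> bool:
    """True if every row of x has an L2 norm within tol of 1."""
    return bool(np.all(np.abs(np.linalg.norm(x, axis=-1) - 1.0) < tol))


# Toy vectors: one "image" and three "texts", deliberately not normalized.
image = np.array([[3.0, 4.0]])
texts = np.array([[1.0, 0.0], [0.0, 2.0], [-3.0, -4.0]])

# After L2 normalization, the plain dot product equals cosine similarity.
image_unit = image / np.linalg.norm(image, axis=-1, keepdims=True)
texts_unit = texts / np.linalg.norm(texts, axis=-1, keepdims=True)

print(is_unit_norm(image), is_unit_norm(image_unit))
print(image_unit @ texts_unit.T)
print(cosine_similarity(image, texts))
```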

Try it

The snippet below is the same example end-to-end: load the model, encode the cropped example image, score it against three candidate descriptions, and report the best match.

import torch
from PIL import Image
from huggingface_hub import hf_hub_download
from transformers import AutoModel, CLIPProcessor

model = AutoModel.from_pretrained("gilgmesh/gil-clip", trust_remote_code=True)
processor = CLIPProcessor.from_pretrained("gilgmesh/gil-clip")
model.eval()

example_path = hf_hub_download("gilgmesh/gil-clip", "example_top.png")
image = Image.open(example_path).convert("RGB")

texts = ["sleeveless navy top", "black dress", "graphic tee"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

similarities = (outputs.image_embeds @ outputs.text_embeds.T).squeeze(0)

print("Similarities to each prompt:")
for text, sim in zip(texts, similarities.tolist()):
    print(f"  {text:30s} → {sim:.4f}")

best = texts[similarities.argmax().item()]
print(f"\nBest match: {best}")

Expected output:

Similarities to each prompt:
  sleeveless navy top            → 0.3282
  black dress                    → 0.0690
  graphic tee                    → 0.0192

Best match: sleeveless navy top
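The raw similarities printed above can be turned into zero-shot class probabilities with a scaled softmax, as is usual for CLIP-style models. The sketch below is an illustration under stated assumptions: `zero_shot_probs` is a hypothetical helper, and the scale of 100 is borrowed from the original CLIP, whose learned logit scale is close to that value. Whether GIL-CLIP exposes its own scale is not stated in this model card.

```python
import numpy as np


def zero_shot_probs(similarities, logit_scale=100.0):
    """Softmax over scaled cosine similarities for a single image."""
    logits = logit_scale * np.asarray(similarities, dtype=np.float64)
    logits = logits - logits.max()  # subtract the max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()


texts = ["sleeveless navy top", "black dress", "graphic tee"]
similarities = [0.3282, 0.0690, 0.0192]  # values from the expected output above

probs = zero_shot_probs(similarities)
for text, p in zip(texts, probs):
    print(f"  {text:30s} → {p:.4f}")
```

Because the scale is large, even moderate gaps in cosine similarity produce near one-hot probabilities. A smaller `logit_scale` gives a softer distribution.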

Intended use

GIL-CLIP is intended for fashion-domain image-text retrieval and zero-shot classification, particularly where the image-side representation benefits from oracle-guided realignment over Fashion-CLIP's native embedding.
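For the retrieval use case, a text-to-image search over precomputed image embeddings can be sketched as below. This assumes the gallery embeddings and the query embedding are L2-normalized NumPy arrays. `top_k_images` is a hypothetical helper, and the 2-D vectors are toy stand-ins for real embeddings.

```python
import numpy as np


def top_k_images(text_embed, gallery_embeds, k=3):
    """Return (indices, scores) of the k gallery images most similar to a query.

    text_embed     -- shape (dim,), L2-normalized text embedding
    gallery_embeds -- shape (num_images, dim), L2-normalized image embeddings
    """
    scores = gallery_embeds @ text_embed
    k = min(k, len(scores))
    order = np.argsort(-scores)[:k]  # highest similarity first
    return order, scores[order]


# Toy unit vectors standing in for real image and text embeddings.
gallery = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [-1.0, 0.0]])
query = np.array([0.8, 0.6])

indices, scores = top_k_images(query, gallery, k=2)
print(indices.tolist(), np.round(scores, 2).tolist())
```

For large galleries, the same scoring is typically delegated to an approximate nearest-neighbor index. The brute-force matrix product here is only meant to show the computation.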

Limitations

Domain. Trained on fashion data; behavior outside the fashion domain is unspecified.
Asymmetric towers. Only the image tower is GIL-projected. The text tower is the unmodified Fashion-CLIP text encoder. Cross-modal similarity may therefore behave differently from that of a CLIP model whose two towers were retrained together.
Inherited limitations. GIL-CLIP inherits any biases and limitations of Fashion-CLIP and the underlying CLIP architecture.

Attribution

Built on top of Fashion-CLIP by Patrick John Chia et al.
Fashion-CLIP is itself based on CLIP by Radford et al.

Model size: 0.2B params (Safetensors, tensor type F32)
Base model: patrickjohncyh/fashion-clip
Paper: Learning Transferable Visual Models From Natural Language Supervision (arXiv:2103.00020)