GIL-CLIP

GIL-CLIP wraps Fashion-CLIP with an oracle-guided image projector that produces a re-aligned image embedding while preserving Fashion-CLIP's original text tower.

Architecture

GIL-CLIP exposes two towers in one model:

  • Image tower: Fashion-CLIP image encoder followed by the GIL projector. Produces image_embeds.
  • Text tower: unchanged Fashion-CLIP text encoder. Produces text_embeds.

For convenience, the model also returns the original Fashion-CLIP image embedding (pre-projection) as clip_image_embeds, so downstream users can compare GIL-aligned and Fashion-CLIP-native image representations side by side.

The text tower is unchanged from Fashion-CLIP. This is by design: GIL training adjusts the image side via the oracle-guided projector while keeping the text side as the alignment anchor.
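
The data flow described above can be sketched in a few lines of dependency-free Python. This is a toy illustration, not the model's implementation: `toy_forward`, `project`, and `ToyOutput` are hypothetical names, the input features are made up, and the projector is shown as a plain linear map only because this card does not specify its form.

```python
import math
from dataclasses import dataclass


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def project(vec, weight):
    """Hypothetical stand-in for the GIL projector: a plain linear map (assumption)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


@dataclass
class ToyOutput:
    image_embeds: list        # image features after the projector
    text_embeds: list         # text features, never projected
    clip_image_embeds: list   # image features before the projector


def toy_forward(image_features, text_features, projector_weight):
    """Mimic the two-tower layout: only the image path passes through the projector."""
    clip_image = l2_normalize(image_features)
    gil_image = l2_normalize(project(image_features, projector_weight))
    text = l2_normalize(text_features)
    return ToyOutput(gil_image, text, clip_image)


# Made-up 2-d features and a made-up projector weight that swaps the two axes.
out = toy_forward([3.0, 4.0], [0.0, 2.0], [[0.0, 1.0], [1.0, 0.0]])
print(out.clip_image_embeds)  # pre-projection image embedding
print(out.image_embeds)       # post-projection image embedding
print(out.text_embeds)        # text embedding, unaffected by the projector
```

The point of the sketch is the shape of the output: one text embedding shared by two image embeddings, so the same text anchor can score both image representations.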

Example

[Image: example fashion image]

For best results, run GIL-CLIP on the cropped garment region rather than the full scene. The usage snippet below feeds the model the cropped version of the image above (example_top.png in this repo).
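
One way to get from a full scene to a garment crop is to pad a bounding box slightly and clamp it to the image bounds before cropping. The helper below is a sketch under assumptions: this card does not say how example_top.png was produced, and `padded_crop_box`, its `pad_frac` parameter, and the sample box are all hypothetical.

```python
def padded_crop_box(box, image_size, pad_frac=0.05):
    """Expand (left, upper, right, lower) by pad_frac per side, clamped to the image."""
    left, upper, right, lower = box
    width, height = image_size
    # Padding is proportional to the box size and rounded to whole pixels.
    pad_x = round((right - left) * pad_frac)
    pad_y = round((lower - upper) * pad_frac)
    return (
        max(0, left - pad_x),
        max(0, upper - pad_y),
        min(width, right + pad_x),
        min(height, lower + pad_y),
    )


# Hypothetical garment box inside a 640x480 scene.
crop_box = padded_crop_box((100, 50, 300, 450), (640, 480), pad_frac=0.1)
print(crop_box)
```

The resulting tuple has the `(left, upper, right, lower)` layout that Pillow's `Image.crop` expects, so it can be passed as `image.crop(crop_box)` before handing the image to the processor.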

Usage

import torch
from PIL import Image
from huggingface_hub import hf_hub_download
from transformers import AutoModel, CLIPProcessor

model = AutoModel.from_pretrained("gilgmesh/gil-clip", trust_remote_code=True)
processor = CLIPProcessor.from_pretrained("gilgmesh/gil-clip")
model.eval()

# Load the cropped example image straight from this repo
example_path = hf_hub_download("gilgmesh/gil-clip", "example_top.png")
image = Image.open(example_path).convert("RGB")

texts = ["sleeveless navy top", "black dress", "graphic tee"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

print(outputs.image_embeds.shape)        # GIL image tower
print(outputs.text_embeds.shape)         # Fashion-CLIP text tower
print(outputs.clip_image_embeds.shape)   # original Fashion-CLIP image tower

The embeddings are L2-normalized by default, so cosine similarity is just a dot product:

similarity = outputs.image_embeds @ outputs.text_embeds.T
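
To see why the dot product suffices, recall that cosine similarity divides the dot product by the two vector norms, and both norms equal 1 after L2 normalization. A small dependency-free check, using made-up vectors rather than real model outputs:

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity computed the long way, with explicit norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Made-up vectors standing in for an image and a text embedding.
image_vec = [1.0, 2.0, 2.0]
text_vec = [2.0, 0.0, 1.0]

full_cosine = cosine(image_vec, text_vec)
unit_dot = dot(l2_normalize(image_vec), l2_normalize(text_vec))

# The two values agree up to floating-point rounding.
print(math.isclose(full_cosine, unit_dot))
```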

Try it

The snippet below is the same example end-to-end: load the model, encode the cropped example image, score it against three candidate descriptions, and report the best match.

import torch
from PIL import Image
from huggingface_hub import hf_hub_download
from transformers import AutoModel, CLIPProcessor

model = AutoModel.from_pretrained("gilgmesh/gil-clip", trust_remote_code=True)
processor = CLIPProcessor.from_pretrained("gilgmesh/gil-clip")
model.eval()

example_path = hf_hub_download("gilgmesh/gil-clip", "example_top.png")
image = Image.open(example_path).convert("RGB")

texts = ["sleeveless navy top", "black dress", "graphic tee"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

similarities = (outputs.image_embeds @ outputs.text_embeds.T).squeeze(0)

print("Similarities to each prompt:")
for text, sim in zip(texts, similarities.tolist()):
    print(f"  {text:30s} → {sim:.4f}")

best = texts[similarities.argmax().item()]
print(f"\nBest match: {best}")

Expected output:

Similarities to each prompt:
  sleeveless navy top            → 0.3282
  black dress                    → 0.0690
  graphic tee                    → 0.0192

Best match: sleeveless navy top

Intended use

GIL-CLIP is intended for fashion-domain image-text retrieval and zero-shot classification, particularly where the image-side representation benefits from oracle-guided realignment over Fashion-CLIP's native embedding.
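
Both uses reduce to ranking similarity scores. The sketch below turns one image's similarities into zero-shot class probabilities with a temperature-scaled softmax. The scores are the three values from the expected output earlier in this card; the `logit_scale` of 100.0 is an assumption borrowed from common CLIP practice, not a value confirmed for GIL-CLIP, and `zero_shot_probs` is a hypothetical helper.

```python
import math


def zero_shot_probs(similarities, logit_scale=100.0):
    """Softmax over scaled similarities (logit_scale is an assumed value)."""
    logits = [logit_scale * s for s in similarities]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["sleeveless navy top", "black dress", "graphic tee"]
similarities = [0.3282, 0.0690, 0.0192]

probs = zero_shot_probs(similarities)

# Sort labels from most to least probable.
ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)

for label, prob in ranked:
    print(f"{label:30s} {prob:.4f}")
```

For retrieval the same sort applies across a gallery: score every image embedding against the query text embedding and keep the top results.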

Limitations

  • Domain. Trained on fashion data; behavior outside the fashion domain is unspecified.
  • Asymmetric towers. Only the image tower is GIL-projected; the text tower is the unmodified Fashion-CLIP text encoder. Cross-modal similarities may therefore behave differently from those of a CLIP model whose two towers were retrained together.
  • Inherited limitations. GIL-CLIP inherits any biases and limitations of Fashion-CLIP and the underlying CLIP architecture.

Attribution

  • Built on top of Fashion-CLIP by Patrick John Chia et al.
  • Fashion-CLIP is itself based on CLIP by Radford et al.