GIL-CLIP
GIL-CLIP wraps Fashion-CLIP with an oracle-guided image projector that produces a re-aligned image embedding while preserving Fashion-CLIP's original text tower.
Architecture
GIL-CLIP exposes two towers in one model:
- Image tower: Fashion-CLIP image encoder followed by the GIL projector. Produces image_embeds.
- Text tower: unchanged Fashion-CLIP text encoder. Produces text_embeds.
For convenience, the model also returns the original Fashion-CLIP image embedding (pre-projection) as clip_image_embeds, so downstream users can compare GIL-aligned and Fashion-CLIP-native image representations side by side.
The text tower is unchanged from Fashion-CLIP. This is by design: GIL training adjusts the image side via the oracle-guided projector while keeping the text side as the alignment anchor.
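One way to make the side-by-side comparison concrete is to score the same text prompts against both image embeddings and look at the difference. The sketch below is illustrative only: compare_image_towers is a hypothetical helper, and the vectors are invented toy values, not model output. It assumes unit-length embeddings passed as plain Python lists.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def compare_image_towers(gil_image, clip_image, text_embeds, labels):
    """Score each label against both image embeddings and report the difference.

    Hypothetical helper, not part of the gil-clip repository.
    """
    rows = []
    for label, text in zip(labels, text_embeds):
        gil_score = dot(gil_image, text)    # GIL-projected image vs. text
        clip_score = dot(clip_image, text)  # native Fashion-CLIP image vs. text
        rows.append((label, gil_score, clip_score, gil_score - clip_score))
    return rows

# Invented 2-D unit vectors standing in for real embeddings.
gil_image = [1.0, 0.0]
clip_image = [0.6, 0.8]
text_embeds = [[1.0, 0.0], [0.0, 1.0]]
labels = ["sleeveless navy top", "black dress"]

rows = compare_image_towers(gil_image, clip_image, text_embeds, labels)
for label, gil, clip, delta in rows:
    print(f"{label:22s} gil={gil:.2f} clip={clip:.2f} delta={delta:+.2f}")
```

With real model outputs, the inputs would presumably be something like outputs.image_embeds[0].tolist(), outputs.clip_image_embeds[0].tolist(), and outputs.text_embeds.tolist(), assuming a single image per batch.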
Example
For best results, run GIL-CLIP on the cropped garment region rather than the full scene. The cropped version of the image above (example_top.png in this repo) is what the usage snippet below feeds into the model.
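The model card does not say how the garment region is located, so the following is only a sketch of the cropping step, assuming a bounding box is already available from a detector or manual annotation. padded_crop_box is a hypothetical helper, and the small padding margin is an assumption rather than something the authors describe. It returns a box in (left, upper, right, lower) order, which is the order PIL's Image.crop accepts.

```python
def padded_crop_box(bbox, image_size, pad_ratio=0.05):
    """Expand a (left, upper, right, lower) box by pad_ratio and clamp it to the image.

    Hypothetical helper; pad_ratio is an illustrative choice, not a documented setting.
    """
    left, upper, right, lower = bbox
    width, height = image_size
    pad_x = (right - left) * pad_ratio
    pad_y = (lower - upper) * pad_ratio
    return (
        max(0, round(left - pad_x)),
        max(0, round(upper - pad_y)),
        min(width, round(right + pad_x)),
        min(height, round(lower + pad_y)),
    )

# Invented garment box inside a 320x480 image.
box = padded_crop_box((100, 50, 300, 450), (320, 480), pad_ratio=0.1)
print(box)
```

The resulting tuple could then be passed as image.crop(box) before handing the image to the processor.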
Usage
import torch
from PIL import Image
from huggingface_hub import hf_hub_download
from transformers import AutoModel, CLIPProcessor
model = AutoModel.from_pretrained("gilgmesh/gil-clip", trust_remote_code=True)
processor = CLIPProcessor.from_pretrained("gilgmesh/gil-clip")
model.eval()
# Load the cropped example image straight from this repo
example_path = hf_hub_download("gilgmesh/gil-clip", "example_top.png")
image = Image.open(example_path).convert("RGB")
texts = ["sleeveless navy top", "black dress", "graphic tee"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.image_embeds.shape) # GIL image tower
print(outputs.text_embeds.shape) # Fashion-CLIP text tower
print(outputs.clip_image_embeds.shape) # original Fashion-CLIP image tower
The embeddings are L2-normalized by default, so cosine similarity is just a dot product:
similarity = outputs.image_embeds @ outputs.text_embeds.T
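To see why the dot product is enough, the sketch below checks the identity on toy vectors in plain Python. The numbers are invented stand-ins for an image and a text embedding, not model output.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed from scratch, without assuming unit length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy stand-ins for an image embedding and a text embedding.
image_vec = l2_normalize([3.0, 4.0, 0.0])
text_vec = l2_normalize([1.0, 2.0, 2.0])

# For unit-length vectors the two quantities agree up to rounding.
print(math.isclose(dot(image_vec, text_vec), cosine(image_vec, text_vec)))
```

If the embeddings were not normalized, the dot product would also scale with vector length, so only the cosine form would stay comparable across inputs.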
Try it
The snippet below is the same example end-to-end: load the model, encode the cropped example image, score it against three candidate descriptions, and report the best match.
import torch
from PIL import Image
from huggingface_hub import hf_hub_download
from transformers import AutoModel, CLIPProcessor
model = AutoModel.from_pretrained("gilgmesh/gil-clip", trust_remote_code=True)
processor = CLIPProcessor.from_pretrained("gilgmesh/gil-clip")
model.eval()
example_path = hf_hub_download("gilgmesh/gil-clip", "example_top.png")
image = Image.open(example_path).convert("RGB")
texts = ["sleeveless navy top", "black dress", "graphic tee"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
similarities = (outputs.image_embeds @ outputs.text_embeds.T).squeeze(0)
print("Similarities to each prompt:")
for text, sim in zip(texts, similarities.tolist()):
    print(f" {text:30s} → {sim:.4f}")
best = texts[similarities.argmax().item()]
print(f"\nBest match: {best}")
Expected output:
Similarities to each prompt:
 sleeveless navy top            → 0.3282
 black dress                    → 0.0690
 graphic tee                    → 0.0192

Best match: sleeveless navy top
Intended use
GIL-CLIP is intended for fashion-domain image-text retrieval and zero-shot classification, particularly where the image-side representation benefits from oracle-guided realignment over Fashion-CLIP's native embedding.
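For zero-shot classification, similarities are commonly turned into label probabilities with a scaled softmax. The sketch below applies that to the three similarity values listed under Expected output. zero_shot_probs is a hypothetical helper, and scale=100.0 is an assumed temperature in the range CLIP-style models typically use; the model card does not state what scale, if any, GIL-CLIP applies.

```python
import math

def zero_shot_probs(similarities, scale=100.0):
    """Turn cosine similarities into label probabilities with a scaled softmax.

    Hypothetical helper; `scale` is an assumed temperature, not a documented value.
    """
    logits = [scale * s for s in similarities]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["sleeveless navy top", "black dress", "graphic tee"]
similarities = [0.3282, 0.0690, 0.0192]  # values from the Expected output section

probs = zero_shot_probs(similarities)
best = labels[probs.index(max(probs))]
for label, p in zip(labels, probs):
    print(f"{label:22s} {p:.4f}")
print("Predicted label:", best)
```

A larger scale sharpens the distribution toward the top label, while a smaller one flattens it; the ranking of labels is the same either way.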
Limitations
- Domain. Trained on fashion data; behavior outside the fashion domain is unspecified.
- Asymmetric towers. Only the image tower is GIL-projected. The text tower is the unmodified Fashion-CLIP text encoder. Cross-modal similarity therefore compares a re-projected image embedding against an unmodified text embedding, and may behave differently from a CLIP model in which both towers were retrained.
- Inherited limitations. GIL-CLIP inherits any biases and limitations of Fashion-CLIP and the underlying CLIP architecture.
Attribution
- Built on top of Fashion-CLIP by Patrick John Chia et al.
- Fashion-CLIP is itself based on CLIP by Radford et al.
Model tree for gilgmesh/gil-clip
Base model
patrickjohncyh/fashion-clip