@@ -67,6 +67,50 @@ The embeddings are L2-normalized by default, so cosine similarity is just a dot
 similarity = outputs.image_embeds @ outputs.text_embeds.T
 ```
+## Try it
+The snippet below runs the same example end to end: it loads the model, encodes the cropped example image, scores it against three candidate descriptions, and reports the best match.
+```python
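+import torch
+from PIL import Image
+from huggingface_hub import hf_hub_download
+from transformers import AutoModel, CLIPProcessor
+# Load the model (custom code from the repo) and its processor.
+model = AutoModel.from_pretrained("gilgmesh/gil-clip", trust_remote_code=True)
+processor = CLIPProcessor.from_pretrained("gilgmesh/gil-clip")
+model.eval()
+# Download the example image shipped with the repo and open it as RGB.
+example_path = hf_hub_download("gilgmesh/gil-clip", "example_top.png")
+image = Image.open(example_path).convert("RGB")
+# Candidate descriptions to score against the image.
+texts = ["sleeveless navy top", "black dress", "graphic tee"]
+inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
+# Inference only, so skip gradient tracking.
+with torch.no_grad():
+    outputs = model(**inputs)
+# Embeddings are L2-normalized, so the dot product is the cosine similarity.
+# The result has shape (1, 3); squeeze(0) drops the single-image batch dimension.
+similarities = (outputs.image_embeds @ outputs.text_embeds.T).squeeze(0)
+print("Similarities to each prompt:")
+for text, sim in zip(texts, similarities.tolist()):
+    print(f"  {text:30s} → {sim:.4f}")
+# Pick the prompt with the highest similarity.
+best = texts[similarities.argmax().item()]
+print(f"\nBest match: {best}")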
+```
+Expected output (exact scores may differ slightly across hardware and library versions):
+```
+Similarities to each prompt:
+  sleeveless navy top            → 0.3282
+  black dress                    → 0.0690
+  graphic tee                    → 0.0192
+
+Best match: sleeveless navy top
+```
 ## Intended use
 GIL-CLIP is intended for fashion-domain image-text retrieval and zero-shot classification, particularly where the image-side representation benefits from oracle-guided realignment over Fashion-CLIP's native embedding.
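
For the zero-shot classification use mentioned above, similarity scores are often turned into class probabilities with a temperature-scaled softmax. The following is a minimal sketch of that step, not part of the model card: the `softmax` helper and the scale of 100 (a value commonly associated with CLIP-style models) are assumptions for illustration, and the similarity values are copied from the expected output of the "Try it" example.

```python
import math


def softmax(scores, scale=100.0):
    """Turn similarity scores into probabilities.

    `scale` is an assumed temperature, not a value read from GIL-CLIP.
    """
    logits = [scale * s for s in scores]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


texts = ["sleeveless navy top", "black dress", "graphic tee"]
# Cosine similarities taken from the expected output of the "Try it" example.
similarities = [0.3282, 0.0690, 0.0192]

probs = softmax(similarities)
best = texts[probs.index(max(probs))]

for text, p in zip(texts, probs):
    print(f"  {text:30s} -> {p:.4f}")
print(f"Predicted label: {best}")
```

With a large scale, even modest gaps in cosine similarity produce a near one-hot distribution; a smaller scale gives softer probabilities if calibrated confidence matters more than the top-1 label.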