Commit 461d495 (verified) by gilgmesh · 1 Parent(s): 7b215f7

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +44 -0
README.md CHANGED
@@ -67,6 +67,50 @@ The embeddings are L2-normalized by default, so cosine similarity is just a dot
  similarity = outputs.image_embeds @ outputs.text_embeds.T
  ```
 
+ ## Try it
+
+ The snippet below is the same example end-to-end: load the model, encode the cropped example image, score it against three candidate descriptions, and report the best match.
+
+ ```python
+ import torch
+ from PIL import Image
+ from huggingface_hub import hf_hub_download
+ from transformers import AutoModel, CLIPProcessor
+
+ model = AutoModel.from_pretrained("gilgmesh/gil-clip", trust_remote_code=True)
+ processor = CLIPProcessor.from_pretrained("gilgmesh/gil-clip")
+ model.eval()
+
+ example_path = hf_hub_download("gilgmesh/gil-clip", "example_top.png")
+ image = Image.open(example_path).convert("RGB")
+
+ texts = ["sleeveless navy top", "black dress", "graphic tee"]
+ inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
+
+ with torch.no_grad():
+     outputs = model(**inputs)
+
+ similarities = (outputs.image_embeds @ outputs.text_embeds.T).squeeze(0)
+
+ print("Similarities to each prompt:")
+ for text, sim in zip(texts, similarities.tolist()):
+     print(f" {text:30s} → {sim:.4f}")
+
+ best = texts[similarities.argmax().item()]
+ print(f"\nBest match: {best}")
+ ```
+
+ Expected output:
+
+ ```
+ Similarities to each prompt:
+  sleeveless navy top            → 0.3282
+  black dress                    → 0.0690
+  graphic tee                    → 0.0192
+
+ Best match: sleeveless navy top
+ ```
+
  ## Intended use
 
  GIL-CLIP is intended for fashion-domain image-text retrieval and zero-shot classification, particularly where the image-side representation benefits from oracle-guided realignment over Fashion-CLIP's native embedding.
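
For the zero-shot classification use mentioned above, the raw similarities are usually turned into a probability distribution over the candidate labels with a scaled softmax. The sketch below is illustrative only: it reuses the three similarity values from the expected output in this diff, and `LOGIT_SCALE = 100.0` is an assumption borrowed from the usual CLIP convention, not a value stated for GIL-CLIP. The helper `softmax` is hypothetical and written in plain Python so the example runs without downloading the model.

```python
import math


def softmax(scores, scale=1.0):
    """Numerically stable softmax over scale * scores."""
    scaled = [scale * s for s in scores]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Candidate labels and similarities copied from the expected output in the README diff.
labels = ["sleeveless navy top", "black dress", "graphic tee"]
similarities = [0.3282, 0.0690, 0.0192]

# Assumption: a CLIP-style temperature; the model's actual logit scale may differ.
LOGIT_SCALE = 100.0

probs = softmax(similarities, scale=LOGIT_SCALE)
best_label = labels[probs.index(max(probs))]

for label, p in zip(labels, probs):
    print(f"{label:30s} {p:.4f}")
print("Predicted label:", best_label)
```

A larger scale sharpens the distribution toward the top-scoring label, and a scale of 1.0 leaves the three labels nearly indistinguishable, so the choice of temperature matters if the probabilities are used for thresholding rather than just for picking the argmax.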