Difference between 300 MB and 900 MB versions?
What are the differences between the versions (there are two different file sizes for both TEXT and smooth)? For example:
ViT-L-14-TEXT-detail-improved-hiT-GmP-HF.safetensors
ViT-L-14-TEXT-detail-improved-hiT-GmP-TE-only-HF.safetensors
One has "TE-only-HF" in its name, the other just "GmP-HF". The same goes for the smooth version.
PS: Yes, I saw "You'll generally want the "TE-only" .safetensors" in the readme, but I still wonder what the differences are :)
"TE only" stands for "Text Encoder only". The "full CLIP" (larger file) has a text encoder and an image encoder; you'll need that for e.g. zero-shot image classification or anything else where CLIP needs to know (encode) the image AND the text.
For a text-to-image AI system, CLIP is just the "translator" from natural language to "AI space": it encodes the text prompt and passes that to the generative AI. In this scenario, CLIP does not need its vision transformer, so a "Text Encoder only" file is enough.
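If you want to check which kind of file you have, a rough sketch like the one below groups tensor names by component. It assumes the checkpoint uses the Hugging Face `CLIPModel` naming (`text_model.`, `vision_model.`, `text_projection.`, `visual_projection.`, `logit_scale`); the key list here is a made-up, abbreviated example, not the actual contents of these files, and `split_clip_keys` / `is_text_encoder_only` are hypothetical helper names.

```python
def split_clip_keys(keys):
    """Group tensor names by the CLIP component they appear to belong to.

    Assumes Hugging Face CLIPModel-style prefixes; other layouts
    (e.g. original OpenAI CLIP checkpoints) use different names.
    """
    groups = {"text": [], "vision": [], "shared": []}
    for key in keys:
        if key.startswith(("text_model.", "text_projection.")):
            groups["text"].append(key)
        elif key.startswith(("vision_model.", "visual_projection.")):
            groups["vision"].append(key)
        else:
            # Anything else, e.g. logit_scale, is not tied to one encoder.
            groups["shared"].append(key)
    return groups


def is_text_encoder_only(keys):
    """True if the key list contains text tensors but no vision tensors."""
    groups = split_clip_keys(keys)
    return bool(groups["text"]) and not groups["vision"]


# Hypothetical, heavily abbreviated key list for a "full CLIP" file.
full_clip_keys = [
    "logit_scale",
    "text_model.embeddings.token_embedding.weight",
    "text_model.final_layer_norm.weight",
    "text_projection.weight",
    "vision_model.embeddings.patch_embedding.weight",
    "vision_model.post_layernorm.weight",
    "visual_projection.weight",
]

# Hypothetical "TE-only" file: same list with the vision side dropped.
te_only_keys = [k for k in full_clip_keys if k.startswith("text_")]

print(is_text_encoder_only(full_clip_keys))  # False
print(is_text_encoder_only(te_only_keys))    # True
```

To run this against a real file, you could get the key names with `safetensors.safe_open(path, framework="pt")` and its `.keys()` method, then pass them to `split_clip_keys`.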
Hope that helps! :)