CLIP ViT-B/16 Vision Encoder (GGUF)
GGUF conversion of openai/clip-vit-base-patch16 for use with CrispEmbed.
- Architecture: CLIP ViT-B/16 vision encoder
- Parameters: 86M
- Output: 512-dimensional, L2-normalized embeddings (the 768-dimensional internal representation is projected to 512 dimensions by visual_projection)
- Input: 224x224 RGB image with CLIP normalization
- Size: ~329 MB
- Source: openai/clip-vit-base-patch16
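For readers unfamiliar with the term, "CLIP normalization" conventionally means scaling pixels to [0, 1] and then standardizing each RGB channel with the mean and standard deviation published with the original OpenAI CLIP preprocessing. The sketch below illustrates that step and the token count implied by a 224x224 input with 16x16 patches. It assumes CrispEmbed applies the same constants internally, which this card does not state; the function names are hypothetical and are not part of CrispEmbed.

```python
# Per-channel constants from the original OpenAI CLIP preprocessing (RGB order).
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def clip_normalize_pixel(rgb):
    """Map one 8-bit RGB pixel to CLIP-normalized floats (hypothetical helper)."""
    return tuple(
        (channel / 255.0 - mean) / std
        for channel, mean, std in zip(rgb, CLIP_MEAN, CLIP_STD)
    )


def patch_token_count(image_size=224, patch_size=16):
    """Tokens seen by the ViT: one per image patch plus one class token."""
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + 1


if __name__ == "__main__":
    print(clip_normalize_pixel((255, 255, 255)))
    print(patch_token_count())
```

With the default arguments, the image is split into a 14x14 grid of patches, so the encoder processes 196 patch tokens plus one class token.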
Usage
# Embed a single image
crispembed -m clip-vit-base-patch16 --image photo.jpg
# Batch processing
crispembed -m clip-vit-base-patch16 --image-dir ./photos/ --output embeddings.bin
Cross-modal pairing
This model shares an embedding space with cstr/clip-text-base-GGUF. Use both to perform zero-shot image-text matching:
crispembed -m clip-vit-base-patch16 --image photo.jpg # vision embedding
crispembed -m clip-text-base "a photo of a cat" # text embedding
Cosine similarity between the two outputs measures image-text alignment.
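As a minimal sketch of that comparison, the Python below computes cosine similarity and ranks candidate captions against one image embedding. It assumes the embeddings are already loaded as plain lists of floats; how to parse CrispEmbed's output is not covered here, and the helper names are hypothetical. Because both models emit L2-normalized vectors, the cosine reduces to a dot product, but the function divides by the norms anyway so it also works on unnormalized input.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_captions(image_vec, caption_vecs):
    """Sort captions by similarity to the image, best match first.

    caption_vecs maps caption text to its text embedding.
    """
    scored = [
        (caption, cosine_similarity(image_vec, vec))
        for caption, vec in caption_vecs.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy 3-dimensional vectors standing in for real 512-dimensional embeddings.
    image = [0.6, 0.8, 0.0]
    captions = {
        "a photo of a cat": [0.5, 0.85, 0.1],
        "a photo of a dog": [0.0, 0.2, 0.95],
    }
    for caption, score in rank_captions(image, captions):
        print(f"{score:.3f}  {caption}")
```

For zero-shot classification, embed one caption per candidate label and take the top-ranked caption as the prediction.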
Notes
- All output embeddings are L2-normalized.
- This is a GGUF conversion; weights are numerically equivalent to the original HuggingFace model.