CLIP ViT-L/14@336 Vision Encoder (GGUF)
GGUF conversion of openai/clip-vit-large-patch14-336 for use with CrispEmbed.
- Architecture: CLIP ViT-L/14 vision encoder (336px variant)
- Parameters: 304M
- Output: 768-dimensional L2-normalized embeddings (1024d internal, projected to 768d)
- Input: 336x336 RGB image with CLIP normalization
- Size: ~1.2 GB
- Source: openai/clip-vit-large-patch14-336
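"CLIP normalization" in the input spec above usually means scaling 8-bit pixels to [0, 1] and standardizing each channel with the mean and standard deviation from the original OpenAI CLIP preprocessing. The sketch below illustrates only that arithmetic. The function names are hypothetical, and whether CrispEmbed applies this step internally when given `--image` is not stated here, so treat it as a reference for building your own pipeline rather than a description of the tool.

```python
# Per-channel statistics from the original OpenAI CLIP preprocessing (R, G, B).
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def clip_normalize_pixel(rgb):
    """Map one 8-bit (R, G, B) pixel to CLIP-normalized floats."""
    return tuple(
        (channel / 255.0 - mean) / std
        for channel, mean, std in zip(rgb, CLIP_MEAN, CLIP_STD)
    )


def clip_normalize_image(pixels):
    """Normalize a grid of pixels (rows of (R, G, B) tuples).

    For this model the grid is assumed to be already resized and
    cropped to 336 rows of 336 pixels.
    """
    return [[clip_normalize_pixel(p) for p in row] for row in pixels]


# Tiny 2x2 example image (hypothetical values, not 336x336).
example = [[(0, 0, 0), (255, 255, 255)], [(128, 64, 32), (10, 20, 30)]]
normalized = clip_normalize_image(example)
```

Resizing and center-cropping to 336x336 would come before this step and is omitted to keep the sketch dependency-free.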
Usage
```bash
# Embed a single image
crispembed -m clip-vit-large-patch14-336 --image photo.jpg

# Batch processing
crispembed -m clip-vit-large-patch14-336 --image-dir ./photos/ --output embeddings.bin
```
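The higher input resolution (336x336 vs. 224x224 for the standard ViT-L/14) gives the encoder a 24x24 grid of 14-pixel patches instead of 16x16, so it captures finer spatial detail at the cost of processing more tokens per image.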
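The effect of the resolution change on sequence length follows directly from the 14-pixel patch size. A small sketch of the arithmetic (the helper name is hypothetical):

```python
def vit_token_count(image_size, patch_size=14):
    """Return (patches per side, total patches, tokens including the class token)."""
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    return patches_per_side, num_patches, num_patches + 1


print(vit_token_count(336))  # (24, 576, 577)
print(vit_token_count(224))  # (16, 256, 257)
```

The 336px variant therefore processes 2.25x as many patches per image as the 224px model.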
Cross-modal pairing
Shares an embedding space with cstr/clip-text-large-GGUF for zero-shot image-text matching.
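Because both encoders emit L2-normalized vectors in the same space, cosine similarity reduces to a dot product, and zero-shot matching amounts to picking the text whose embedding scores highest against the image. The sketch below assumes the embeddings have already been loaded into Python lists. The function names and the toy 3-dimensional vectors are illustrative (real embeddings are 768-dimensional), and the `logit_scale` of 100 is an assumption approximating the temperature of the original CLIP models, not a value confirmed for this conversion.

```python
import math


def dot(a, b):
    """Dot product; equals cosine similarity for L2-normalized vectors."""
    return sum(x * y for x, y in zip(a, b))


def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled image-text similarities (logit_scale is assumed)."""
    logits = [logit_scale * dot(image_emb, t) for t in text_embs]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy, already-normalized vectors standing in for real embeddings.
image_emb = [0.6, 0.8, 0.0]
text_embs = [
    [1.0, 0.0, 0.0],  # e.g. "a photo of a cat" (hypothetical label)
    [0.0, 1.0, 0.0],  # e.g. "a photo of a dog" (hypothetical label)
    [0.0, 0.0, 1.0],  # e.g. "a photo of a car" (hypothetical label)
]
probs = zero_shot_probs(image_emb, text_embs)
best = max(range(len(probs)), key=probs.__getitem__)
```

With these toy vectors the second text wins, since its similarity (0.8) is the largest of the three.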
Notes
- All output embeddings are L2-normalized.
- This is a GGUF conversion; weights are numerically equivalent to the original HuggingFace model.