CLIP ViT-L/14@336 Vision Encoder (GGUF)

GGUF conversion of openai/clip-vit-large-patch14-336 for use with CrispEmbed.

  • Architecture: CLIP ViT-L/14 vision encoder (336px variant)
  • Parameters: 304M
  • Output: 768-dimensional L2-normalized embeddings (1024d internal, projected to 768d)
  • Input: 336x336 RGB image with CLIP normalization
  • Size: ~1.2 GB
  • Source: openai/clip-vit-large-patch14-336
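
The "CLIP normalization" named in the Input line can be sketched in plain Python. The mean and standard deviation below are the per-channel values published with OpenAI CLIP's preprocessing; the helper names (`normalize_pixel`, `normalize_image`) are hypothetical, and whether CrispEmbed applies this step internally is an assumption not confirmed here. Resizing and cropping to 336x336 are left out.

```python
# Per-channel statistics published with OpenAI CLIP's image preprocessing.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def normalize_pixel(rgb):
    """Map one 8-bit (R, G, B) pixel to CLIP-normalized floats."""
    return tuple(
        (channel / 255.0 - mean) / std
        for channel, mean, std in zip(rgb, CLIP_MEAN, CLIP_STD)
    )


def normalize_image(pixels):
    """Normalize a row-major grid (list of rows) of 8-bit RGB pixels."""
    return [[normalize_pixel(pixel) for pixel in row] for row in pixels]


# A tiny 2x2 stand-in image; a real input would already be 336x336.
demo = [
    [(0, 0, 0), (255, 255, 255)],
    [(128, 128, 128), (255, 0, 0)],
]
normalized = normalize_image(demo)
```

Each channel is first scaled to [0, 1], then shifted and scaled by its own mean and standard deviation, so black pixels map to negative values and white pixels to positive ones.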

Usage

# Embed a single image
crispembed -m clip-vit-large-patch14-336 --image photo.jpg

# Batch processing
crispembed -m clip-vit-large-patch14-336 --image-dir ./photos/ --output embeddings.bin

The higher input resolution (336x336 vs. 224x224 in the standard ViT-L/14) captures finer spatial detail.
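
To make the resolution difference concrete, the short sketch below counts the tokens a ViT with 14-pixel patches sees at each input size. The function name `token_count` is illustrative, and the single extra class token follows the standard CLIP ViT layout.

```python
PATCH_SIZE = 14  # the "/14" in ViT-L/14


def token_count(image_size, patch_size=PATCH_SIZE):
    """Number of transformer tokens for a square image: patches + 1 class token."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be a multiple of the patch size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + 1


tokens_224 = token_count(224)  # 16 x 16 patches + class token
tokens_336 = token_count(336)  # 24 x 24 patches + class token
patch_ratio = (tokens_336 - 1) / (tokens_224 - 1)
```

The 336px variant processes 576 patches instead of 256, i.e. 2.25x as many, which is where both the finer detail and the higher compute cost come from.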

Cross-modal pairing

This encoder shares an embedding space with cstr/clip-text-large-GGUF, so image and text embeddings can be compared directly for zero-shot image-text matching.
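
A minimal sketch of zero-shot matching, assuming you already have one image embedding and one text embedding per candidate label. Because the embeddings are L2-normalized, the dot product equals cosine similarity. The toy 3-dimensional vectors stand in for real 768-dimensional ones, the function names are hypothetical, and the scale of 100 is an assumption that approximates CLIP's learned logit scale.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(vector, vector))
    return [x / norm for x in vector]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot(image_emb, text_embs, labels, scale=100.0):
    """Return the best-matching label and the per-label probabilities."""
    similarities = [dot(image_emb, text_emb) for text_emb in text_embs]
    probs = softmax([scale * s for s in similarities])
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs


# Toy unit vectors standing in for real image and text embeddings.
image_emb = l2_normalize([0.9, 0.1, 0.2])
text_embs = [
    l2_normalize([1.0, 0.0, 0.1]),
    l2_normalize([0.0, 1.0, 0.0]),
]
labels = ["a photo of a cat", "a photo of a dog"]
label, probs = zero_shot(image_emb, text_embs, labels)
```

The softmax only turns similarities into a distribution over the supplied labels; ranking by raw dot product gives the same best match.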

Notes

  • All output embeddings are L2-normalized.
  • This is a GGUF conversion; weights are numerically equivalent to the original HuggingFace model.