SigLIP ViT-L/16 Vision Encoder 256px (GGUF)

GGUF conversion of google/siglip-large-patch16-256 for use with CrispEmbed.

  • Architecture: SigLIP ViT-L/16 vision encoder
  • Parameters: 304M
  • Output: 1024-dimensional L2-normalized embeddings
  • Input: 256x256 RGB image with SigLIP normalization (mean=0.5, std=0.5)
  • Size: ~1.2 GB
  • Source: google/siglip-large-patch16-256
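
The input specification above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not CrispEmbed's actual preprocessing code: the function name siglip_normalize is hypothetical, and it assumes that 8-bit pixel values are first rescaled to [0, 1] before the mean/std step.

```python
IMAGE_SIZE = 256   # input resolution listed above
PATCH_SIZE = 16    # the "/16" in ViT-L/16

def siglip_normalize(values, mean=0.5, std=0.5):
    """Rescale 8-bit channel values to [0, 1], then apply (x - mean) / std.

    Hypothetical helper for illustration; assumes a 1/255 rescale step.
    """
    return [((v / 255.0) - mean) / std for v in values]

# Number of non-overlapping 16x16 patches in one 256x256 image.
num_patches = (IMAGE_SIZE // PATCH_SIZE) ** 2

# Three sample channel values stand in for a decoded, resized image.
normalized = siglip_normalize([0, 128, 255])
print(num_patches, normalized)
```

With mean=0.5 and std=0.5, the darkest value (0) maps to -1.0 and the brightest (255) maps to 1.0, so every channel ends up in the range [-1, 1].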

Usage

# Embed a single image
crispembed -m siglip-large-256 --image photo.jpg

# Embed every image in a directory and write the embeddings to a file
crispembed -m siglip-large-256 --image-dir ./photos/ --output embeddings.bin

About SigLIP

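SigLIP replaces CLIP's softmax-based contrastive loss with a pairwise sigmoid loss. Each image-text pair is scored as an independent binary classification, so the loss needs no normalization over the whole batch; the SigLIP authors report that this scales better and gives stronger results on retrieval benchmarks. Preprocessing is also simpler: every channel is normalized with mean=0.5 and std=0.5, rather than with the per-channel dataset statistics that CLIP uses.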
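
To make the pairwise sigmoid loss concrete, here is a minimal sketch in plain Python. It is not taken from this repository or from the SigLIP training code: the function names are hypothetical, the toy embeddings are 2-dimensional rather than 1024-dimensional, and the temperature t=10 and bias b=-10 are illustrative starting values rather than trained ones.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def sigmoid_loss(image_embs, text_embs, t=10.0, b=-10.0):
    """Pairwise sigmoid loss over a batch of unit-length embeddings.

    Matching pairs (i == j) get label +1, all other pairs get label -1.
    t and b are illustrative values, not trained parameters.
    """
    n = len(image_embs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            dot = sum(a * c for a, c in zip(image_embs[i], text_embs[j]))
            label = 1.0 if i == j else -1.0
            total += log_sigmoid(label * (t * dot + b))
    return -total / n

images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]   # each text matches its image
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]   # each text matches the wrong image

loss_aligned = sigmoid_loss(images, aligned_texts)
loss_swapped = sigmoid_loss(images, swapped_texts)
print(loss_aligned, loss_swapped)
```

Every term in the double loop depends on a single image-text pair, which is what distinguishes this loss from a softmax that normalizes each row against the rest of the batch. The aligned batch yields a lower loss than the swapped one.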

Notes

  • All output embeddings are L2-normalized.
  • This is a GGUF conversion; weights are numerically equivalent to the original Hugging Face model.
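
Because the embeddings are L2-normalized, the cosine similarity between two of them reduces to a plain dot product. The sketch below shows this with toy 2-dimensional vectors in place of real 1024-dimensional outputs; l2_normalize and cosine_similarity are hypothetical helpers, not part of CrispEmbed.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def cosine_similarity(a, b):
    """Dot product; equals cosine similarity only for unit-length inputs."""
    return sum(x * y for x, y in zip(a, b))

# Toy vectors standing in for two image embeddings.
emb_a = l2_normalize([3.0, 4.0])
emb_b = l2_normalize([4.0, 3.0])

similarity = cosine_similarity(emb_a, emb_b)
print(similarity)
```

Skipping the division by the norms at query time is the practical benefit: ranking a large collection of stored embeddings against a query needs only dot products.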