SigLIP ViT-L/16 Vision Encoder 256px (GGUF)
GGUF conversion of google/siglip-large-patch16-256 for use with CrispEmbed.
- Architecture: SigLIP ViT-L/16 vision encoder
- Parameters: 304M
- Output: 1024-dimensional L2-normalized embeddings
- Input: 256x256 RGB image with SigLIP normalization (mean=0.5, std=0.5)
- Size: ~1.2 GB
- Source: google/siglip-large-patch16-256
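The input normalization listed above can be sketched in a few lines of NumPy. This is a minimal illustration rather than CrispEmbed's actual preprocessing code: the helper name `siglip_normalize` and the channels-first output layout are assumptions, and the image is assumed to be already decoded and resized to 256x256.

```python
import numpy as np

def siglip_normalize(pixels: np.ndarray) -> np.ndarray:
    """Map an HxWx3 uint8 RGB image to a 3xHxW float32 array in [-1, 1].

    Hypothetical helper: illustrates mean=0.5, std=0.5 normalization only.
    """
    if pixels.ndim != 3 or pixels.shape[2] != 3:
        raise ValueError("expected an HxWx3 RGB array")
    x = pixels.astype(np.float32) / 255.0  # scale 0..255 to 0..1
    x = (x - 0.5) / 0.5                    # mean=0.5, std=0.5 on every channel
    return np.transpose(x, (2, 0, 1))      # channels-first layout (an assumption)

# A synthetic pure-red image stands in for a decoded, resized photo.
image = np.zeros((256, 256, 3), dtype=np.uint8)
image[..., 0] = 255
tensor = siglip_normalize(image)
print(tensor.shape, tensor.min(), tensor.max())
```

With these constants the normalization reduces to `x / 127.5 - 1`, so pixel value 0 maps to -1 and 255 maps to 1.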
Usage
# Embed a single image
crispembed -m siglip-large-256 --image photo.jpg
# Batch processing
crispembed -m siglip-large-256 --image-dir ./photos/ --output embeddings.bin
About SigLIP
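SigLIP replaces CLIP's softmax contrastive loss with a pairwise sigmoid loss, which scores each image-text pair as an independent binary classification and needs no normalization across the batch. The SigLIP authors report that this scales better with batch size and gives stronger results on retrieval benchmarks. Preprocessing is also simpler: SigLIP normalizes every channel with mean=0.5 and std=0.5, whereas CLIP uses per-channel statistics computed from its training data.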
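The difference between the two losses can be sketched with NumPy. This is an illustration of the loss formulas only, not training code from either model: the function names are hypothetical, the temperature `t` and bias `b` are fixed illustrative values rather than learned parameters, and the embeddings are random stand-ins.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def sigmoid_loss(img: np.ndarray, txt: np.ndarray, t: float = 10.0, b: float = -10.0) -> float:
    """Pairwise sigmoid loss: every image-text pair is a binary decision."""
    logits = t * img @ txt.T + b
    labels = 2.0 * np.eye(len(img)) - 1.0  # +1 on the diagonal, -1 elsewhere
    # -log(sigmoid(z)) == log(1 + exp(-z)) == logaddexp(0, -z), numerically stable
    return float(np.logaddexp(0.0, -labels * logits).sum() / len(img))

def softmax_loss(img: np.ndarray, txt: np.ndarray, t: float = 10.0) -> float:
    """CLIP-style loss: softmax over the batch, averaged over both directions."""
    logits = t * img @ txt.T

    def cross_entropy(l: np.ndarray) -> float:
        log_probs = l - np.logaddexp.reduce(l, axis=1, keepdims=True)
        return float(-np.mean(np.diag(log_probs)))

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
img = l2_normalize(rng.normal(size=(4, 8)))  # stand-in image embeddings
matched = img.copy()                         # each text matches its own image
shuffled = img[[1, 2, 3, 0]]                 # each text paired with the wrong image

print("sigmoid:", sigmoid_loss(img, matched), sigmoid_loss(img, shuffled))
print("softmax:", softmax_loss(img, matched), softmax_loss(img, shuffled))
```

Both losses are lower for the matched pairing than for the shuffled one. The structural difference is that the sigmoid loss sums independent per-pair terms, while the softmax loss has to normalize each row and column over the whole batch.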
Notes
- All output embeddings are L2-normalized.
- This is a GGUF conversion; weights are numerically equivalent to the original HuggingFace model.
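
Because the embeddings are L2-normalized, cosine similarity between two images is just their dot product. The sketch below ranks a gallery against a query under that assumption. The `top_k` helper is hypothetical, and the vectors are random stand-ins with the model's 1024 dimensions, not real CrispEmbed output.

```python
import numpy as np

def top_k(query: np.ndarray, gallery: np.ndarray, k: int = 3):
    """Return indices and scores of the k gallery rows most similar to query.

    Assumes every vector is already L2-normalized, so dot product == cosine.
    """
    scores = gallery @ query
    order = np.argsort(-scores)[:k]  # highest similarity first
    return order, scores[order]

rng = np.random.default_rng(0)
gallery = rng.normal(size=(100, 1024))
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)  # mimic normalized output

query = gallery[5]  # reuse one gallery vector as the query
indices, scores = top_k(query, gallery)
print(indices, scores)
```

The best match is the query's own row with a score of approximately 1, and no extra normalization step is needed at search time.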