SigLIP Base GGUF

GGUF conversion of the google/siglip-base-patch16-384 vision encoder, for use with CrispEmbed.

768-dimensional image embeddings. Encodes 384x384 images via a 12-layer ViT with 16x16 patches.

Parity

Quant   Cosine vs HF         Size
F32     0.996 (mean pool)    355 MB
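
The "Cosine vs HF" column is a cosine similarity between the GGUF embedding and the embedding from the reference Hugging Face model for the same image. The sketch below shows how such a score can be computed. It is an illustration only, not the script used to produce the number above: the function name `cosine_similarity` and the toy 4-D vectors are assumptions, standing in for real 768-D embeddings.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 4-D vectors (hypothetical values) standing in for the 768-D
# GGUF embedding and the reference embedding of the same image.
gguf_embedding = [0.10, 0.20, 0.30, 0.40]
reference_embedding = [0.11, 0.19, 0.31, 0.39]

score = cosine_similarity(gguf_embedding, reference_embedding)
print(f"cosine: {score:.4f}")
```

A score of 1.0 means the two embeddings point in exactly the same direction, so values close to 1.0 indicate close agreement with the reference.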

Quick Start

# Download
huggingface-cli download cstr/siglip-base-GGUF siglip-base.gguf --local-dir .

# Encode image
./crispembed -m siglip-base.gguf --image photo.jpg

# Print dimension
./crispembed -m siglip-base.gguf --dim  # → 768

Architecture

  • Model: SigLIP ViT-B/16 (Google, Apache 2.0)
  • Vision: 12 layers, 768-D, 12 heads, 3072 intermediate
  • Image: 384×384, 16×16 patches → 576 tokens
  • Pooling: Mean pool over patch tokens
  • Normalization: L2 normalized output
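
The last three items can be sketched in a few lines: the token count follows from the image and patch sizes, and the final embedding is the mean of the patch tokens scaled to unit length. This is a minimal pure-Python sketch under stated assumptions, not CrispEmbed's actual implementation. The function names and the toy 2-token, 3-dimensional input are hypothetical; the real encoder produces 576 tokens of 768 dimensions each.

```python
import math

IMAGE_SIZE = 384  # input resolution in pixels, per side
PATCH_SIZE = 16   # patch edge length in pixels


def num_patch_tokens(image_size=IMAGE_SIZE, patch_size=PATCH_SIZE):
    """Number of non-overlapping square patches in a square image."""
    per_side = image_size // patch_size
    return per_side * per_side


def mean_pool(tokens):
    """Average a list of equal-length token vectors into one vector."""
    count = len(tokens)
    dim = len(tokens[0])
    return [sum(token[i] for token in tokens) / count for i in range(dim)]


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit L2 norm; eps guards against division by zero."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


# 384 / 16 = 24 patches per side, 24 * 24 patches in total.
print(num_patch_tokens())

# Toy input (hypothetical values): 2 tokens of 3 dimensions each.
tokens = [[1.0, 2.0, 2.0], [3.0, 2.0, 6.0]]
embedding = l2_normalize(mean_pool(tokens))
print(embedding)
```

Because the output is L2 normalized, the dot product of two embeddings equals their cosine similarity.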