SigLIP Base GGUF

GGUF conversion of the google/siglip-base-patch16-384 vision encoder, for use with CrispEmbed.

Produces 768-dimensional image embeddings. Encodes 384×384 images with a 12-layer ViT using 16×16 patches.

Parity

Quant	Cosine similarity vs. HF reference	Size
F32	0.996 (mean pool)	355 MB
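
The parity figure above is a cosine similarity between two embeddings of the same image. As a minimal sketch of how such a number can be computed, assuming both embeddings are already available as plain lists of floats (the vectors below are hypothetical toy values, not real model output, and this is not necessarily the exact procedure used to produce the table):

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy stand-ins for a GGUF embedding and a reference embedding
# (hypothetical values, for illustration only).
gguf_embedding = [0.10, -0.20, 0.30, 0.40]
reference_embedding = [0.11, -0.19, 0.29, 0.41]

print(f"{cosine_similarity(gguf_embedding, reference_embedding):.4f}")
```

A value of 1.0 means the two vectors point in exactly the same direction, so a result close to 1.0 indicates that the converted model closely tracks the reference.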

Quick Start

# Download
huggingface-cli download cstr/siglip-base-GGUF siglip-base.gguf --local-dir .

# Encode image
./crispembed -m siglip-base.gguf --image photo.jpg

# Print dimension
./crispembed -m siglip-base.gguf --dim  # → 768

Architecture

Model: SigLIP ViT-B/16 (Google, Apache 2.0)
Vision: 12 layers, 768-D hidden size, 12 heads, 3072-D MLP intermediate size
Image: 384×384, 16×16 patches → 576 tokens
Pooling: Mean pool over patch tokens
Normalization: L2 normalized output
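
The token count, pooling, and normalization steps listed above can be sketched in plain Python. This is an illustrative sketch only: the helper names are hypothetical, the token values are synthetic placeholders rather than real encoder output, and CrispEmbed's actual implementation may differ.

```python
import math

IMAGE_SIZE = 384   # pixels per side, from the model card
PATCH_SIZE = 16    # pixels per patch side, from the model card
HIDDEN_DIM = 768   # embedding width, from the model card


def num_patch_tokens(image_size, patch_size):
    """Number of non-overlapping square patches covering a square image."""
    per_side = image_size // patch_size
    return per_side * per_side


def mean_pool(tokens):
    """Average a list of token vectors into a single vector."""
    count = len(tokens)
    dim = len(tokens[0])
    return [sum(tok[i] for tok in tokens) / count for i in range(dim)]


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


n_tokens = num_patch_tokens(IMAGE_SIZE, PATCH_SIZE)

# Synthetic placeholder for the encoder's per-patch outputs (hypothetical values).
tokens = [[math.sin(t + i) for i in range(HIDDEN_DIM)] for t in range(n_tokens)]

embedding = l2_normalize(mean_pool(tokens))
print(n_tokens, len(embedding))
```

Because the output is L2 normalized, the dot product of two embeddings equals their cosine similarity, which keeps downstream similarity search simple.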

Model size

93.2M params
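
As a rough consistency check between the parameter count and the F32 file size, the expected weight storage can be estimated at 4 bytes per parameter. This sketch assumes every tensor is stored as F32 and ignores GGUF metadata overhead, so it is only an approximation:

```python
PARAM_COUNT = 93.2e6   # parameter count stated on the model card
BYTES_PER_F32 = 4      # one 32-bit float per parameter (assumption: all tensors are F32)

size_bytes = PARAM_COUNT * BYTES_PER_F32
size_mb = size_bytes / 1e6       # decimal megabytes (10^6 bytes)
size_mib = size_bytes / 2**20    # binary mebibytes (2^20 bytes)

print(f"{size_mb:.1f} MB, {size_mib:.1f} MiB")
```

The binary figure lands near the 355 MB listed in the parity table, which suggests, but does not confirm, that the listed size is measured in binary units.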
