metadata
license: mit
tags:
  - clip
  - vision
  - gguf
  - crispembed
  - image-embedding
pipeline_tag: image-feature-extraction
library_name: ggml

CLIP ViT-B/16 Vision Encoder (GGUF)

GGUF conversion of the vision encoder from openai/clip-vit-base-patch16, for use with CrispEmbed.

  • Architecture: CLIP ViT-B/16 vision encoder
  • Parameters: 86M
  • Output: 512-dimensional L2-normalized embeddings (768d internal, projected to 512d via visual_projection)
  • Input: 224x224 RGB image, normalized with the CLIP per-channel mean and standard deviation
  • Size: ~329 MB
  • Source: openai/clip-vit-base-patch16
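To make the input and output conventions listed above concrete, here is a minimal Python sketch of the two normalizations involved. It is illustrative only: CrispEmbed presumably performs these steps internally, and the helper names `normalize_pixel` and `l2_normalize` are hypothetical, not part of any CrispEmbed API. The mean and standard deviation values are the ones published in the OpenAI CLIP preprocessing configuration.

```python
import math

# Per-channel RGB statistics from the OpenAI CLIP preprocessing config.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def normalize_pixel(rgb):
    """Map one 8-bit RGB pixel to CLIP-normalized floats (hypothetical helper)."""
    return tuple(
        (channel / 255.0 - mean) / std
        for channel, mean, std in zip(rgb, CLIP_MEAN, CLIP_STD)
    )


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (hypothetical helper)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


if __name__ == "__main__":
    # A mid-gray pixel after CLIP normalization.
    print(normalize_pixel((128, 128, 128)))
    # A toy 2-d vector scaled to unit length.
    print(l2_normalize([3.0, 4.0]))
```

Resizing and cropping to 224x224 are left out of the sketch, since they depend on the image library in use.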

Usage

# Embed a single image
crispembed -m clip-vit-base-patch16 --image photo.jpg

# Batch processing
crispembed -m clip-vit-base-patch16 --image-dir ./photos/ --output embeddings.bin

Cross-modal pairing

This model shares an embedding space with cstr/clip-text-base-GGUF. Use both to perform zero-shot image-text matching:

crispembed -m clip-vit-base-patch16 --image photo.jpg        # vision embedding
crispembed -m clip-text-base "a photo of a cat"               # text embedding

Cosine similarity between the two outputs measures image-text alignment.
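As a rough sketch of that comparison, the Python snippet below computes cosine similarity and picks the best-matching caption for an image. It assumes the embeddings have already been loaded as plain lists of floats; how to read them from CrispEmbed's output is not covered here. The function names and the toy 3-dimensional vectors are hypothetical stand-ins for real 512-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def best_caption(image_vec, text_vecs):
    """Return the caption whose embedding is most similar to the image embedding.

    text_vecs maps each caption string to its embedding.
    """
    return max(text_vecs, key=lambda cap: cosine_similarity(image_vec, text_vecs[cap]))


# Toy vectors for illustration only, not real model outputs.
image_embedding = [0.6, 0.8, 0.0]
text_embeddings = {
    "a photo of a cat": [0.6, 0.8, 0.0],
    "a photo of a dog": [0.0, 0.0, 1.0],
}

if __name__ == "__main__":
    for caption, vec in text_embeddings.items():
        print(caption, cosine_similarity(image_embedding, vec))
    print("best match:", best_caption(image_embedding, text_embeddings))
```

When both vectors are already unit length, the division by the norms is a no-op and the cosine similarity reduces to a plain dot product; the explicit norms are kept so the function also works on unnormalized input.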

Notes

  • All output embeddings are L2-normalized.
  • This is a GGUF conversion; weights are numerically equivalent to the original Hugging Face model.