---
license: mit
tags:
- clip
- vision
- gguf
- crispembed
- image-embedding
pipeline_tag: image-feature-extraction
library_name: ggml
---

# CLIP ViT-B/16 Vision Encoder (GGUF)

GGUF conversion of [openai/clip-vit-base-patch16](https://huggingface.co/openai/clip-vit-base-patch16) for use with [CrispEmbed](https://github.com/CrispStrobe/CrispEmbed).

- **Architecture:** CLIP ViT-B/16 vision encoder
- **Parameters:** 86M
- **Output:** 512-dimensional L2-normalized embeddings (768d internal, projected to 512d via visual_projection)
- **Input:** 224x224 RGB image with CLIP normalization
- **Size:** ~329 MB
- **Source:** openai/clip-vit-base-patch16

## Usage

```bash
# Embed a single image
crispembed -m clip-vit-base-patch16 --image photo.jpg

# Batch processing
crispembed -m clip-vit-base-patch16 --image-dir ./photos/ --output embeddings.bin
```

## Cross-modal pairing

This model shares an embedding space with [cstr/clip-text-base-GGUF](https://huggingface.co/cstr/clip-text-base-GGUF). Use both to perform zero-shot image-text matching:

```bash
crispembed -m clip-vit-base-patch16 --image photo.jpg    # vision embedding
crispembed -m clip-text-base "a photo of a cat"          # text embedding
```

Cosine similarity between the two outputs measures image-text alignment.

## Notes

- All output embeddings are L2-normalized.
- This is a GGUF conversion; weights are numerically equivalent to the original HuggingFace model.
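
## Example: scoring image-text alignment

The cosine-similarity step from the cross-modal pairing section can be sketched in a few lines of Python. This is a minimal illustration, not part of CrispEmbed: it assumes you have already loaded the image and text embeddings into plain lists of floats (how you parse the CLI output is up to you and is not shown), and the helper names `l2_normalize`, `cosine_similarity`, and `rank_captions` are hypothetical. Because the embeddings are already L2-normalized, the normalization step is only a safeguard and the score reduces to a dot product.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a no-op for already-normalized input)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two embeddings of equal dimension."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def rank_captions(image_embedding, caption_embeddings):
    """Sort captions by similarity to the image, best match first.

    caption_embeddings maps caption text to its embedding (hypothetical layout).
    """
    scores = [
        (caption, cosine_similarity(image_embedding, emb))
        for caption, emb in caption_embeddings.items()
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)
```

For zero-shot classification, embed one text prompt per candidate label, pass them to `rank_captions` together with the image embedding, and take the top-ranked caption as the prediction.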