---
license: mit
tags:
- clip
- vision
- gguf
- crispembed
- image-embedding
pipeline_tag: image-feature-extraction
library_name: ggml
---
# CLIP ViT-B/16 Vision Encoder (GGUF)

GGUF conversion of [openai/clip-vit-base-patch16](https://huggingface.co/openai/clip-vit-base-patch16) for use with [CrispEmbed](https://github.com/CrispStrobe/CrispEmbed).

- **Architecture:** CLIP ViT-B/16 vision encoder
- **Parameters:** 86M
- **Output:** 512-dimensional L2-normalized embeddings (768d internal, projected to 512d via `visual_projection`)
- **Input:** 224x224 RGB image with CLIP normalization
- **Size:** ~329 MB
- **Source:** openai/clip-vit-base-patch16
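
The input spec above (224x224 RGB, CLIP normalization) together with the 16-pixel patch size fixes both the per-pixel scaling and the token count the encoder sees. The sketch below illustrates this, assuming the standard per-channel mean and standard deviation published with the original OpenAI CLIP preprocessing; the helper names are hypothetical, and CrispEmbed presumably applies this step internally, so this is for illustration rather than something you need to run yourself.

```python
# Standard CLIP normalization constants (RGB order). These are assumed from
# the original OpenAI CLIP preprocessing, not read from the GGUF file.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

IMAGE_SIZE = 224   # input resolution from the model card
PATCH_SIZE = 16    # the "/16" in ViT-B/16


def normalize_pixel(rgb):
    """Map one RGB pixel with 0-255 channel values to CLIP-normalized floats."""
    return tuple(
        (channel / 255.0 - mean) / std
        for channel, mean, std in zip(rgb, CLIP_MEAN, CLIP_STD)
    )


def sequence_length(image_size=IMAGE_SIZE, patch_size=PATCH_SIZE):
    """Number of tokens the encoder sees: one per patch plus the class token."""
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + 1


print(sequence_length())                  # 14 * 14 patches + 1 class token = 197
print(normalize_pixel((255, 255, 255)))   # a pure white pixel after normalization
```

Resizing and cropping to 224x224 are left out here; only the normalization arithmetic and the resulting sequence length are shown.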
## Usage

```bash
# Embed a single image
crispembed -m clip-vit-base-patch16 --image photo.jpg

# Batch processing
crispembed -m clip-vit-base-patch16 --image-dir ./photos/ --output embeddings.bin
```
## Cross-modal pairing

This model shares an embedding space with [cstr/clip-text-base-GGUF](https://huggingface.co/cstr/clip-text-base-GGUF). Use the two together for zero-shot image-text matching:

```bash
crispembed -m clip-vit-base-patch16 --image photo.jpg   # vision embedding
crispembed -m clip-text-base "a photo of a cat"         # text embedding
```

Cosine similarity between the two outputs measures image-text alignment. Because both embeddings are L2-normalized, it is simply their dot product.
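
Zero-shot matching with these two encoders can be sketched as follows: score one image embedding against several caption embeddings and rank the captions. The 3-dimensional vectors are made-up stand-ins for real 512-d outputs, the function names are hypothetical, and the logit scale of 100 is an assumption based on the usual CLIP convention, not a value read from these GGUF files.

```python
import math


def dot(a, b):
    """Dot product; equals cosine similarity when both inputs have unit norm."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rank_captions(image_emb, text_embs, logit_scale=100.0):
    """Return (caption, similarity, probability) tuples, best match first.

    text_embs maps caption -> unit-norm embedding. logit_scale is assumed.
    """
    captions = list(text_embs)
    sims = [dot(image_emb, text_embs[c]) for c in captions]
    probs = softmax([logit_scale * s for s in sims])
    return sorted(zip(captions, sims, probs), key=lambda row: row[1], reverse=True)


# Toy unit vectors standing in for real embeddings (hypothetical values).
image = [0.6, 0.8, 0.0]
texts = {
    "a photo of a cat": [0.8, 0.6, 0.0],
    "a photo of a dog": [0.0, 0.6, 0.8],
}

for caption, sim, prob in rank_captions(image, texts):
    print(f"{caption}: cosine={sim:.2f} p={prob:.3f}")
```

The softmax step is optional: the raw similarities already give the ranking, and the probabilities only express how strongly one caption wins relative to the others in the candidate set.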
## Notes

- All output embeddings are L2-normalized.
- This is a GGUF conversion; weights are numerically equivalent to the original Hugging Face model.
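
Unit length is easy to lose downstream, for example when several embeddings are averaged into one vector. A minimal sketch of checking and restoring the property from the first note, using hypothetical helper names and toy 2-d vectors in place of real 512-d outputs:

```python
import math


def l2_norm(vec):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vec))


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit length; eps guards against division by zero."""
    norm = max(l2_norm(vec), eps)
    return [x / norm for x in vec]


def is_unit_norm(vec, tol=1e-4):
    """True if the vector's length is within tol of 1.0."""
    return abs(l2_norm(vec) - 1.0) <= tol


# Averaging two unit vectors generally yields a vector shorter than 1,
# so the mean has to be normalized again before comparing by dot product.
a = [1.0, 0.0]
b = [0.0, 1.0]
mean = [(x + y) / 2 for x, y in zip(a, b)]

print(is_unit_norm(mean))                 # False: length is about 0.707
print(is_unit_norm(l2_normalize(mean)))   # True
```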