---
license: mit
tags:
- clip
- vision
- gguf
- crispembed
- image-embedding
pipeline_tag: image-feature-extraction
library_name: ggml
---
# CLIP ViT-B/16 Vision Encoder (GGUF)
GGUF conversion of [openai/clip-vit-base-patch16](https://huggingface.co/openai/clip-vit-base-patch16) for use with [CrispEmbed](https://github.com/CrispStrobe/CrispEmbed).
- **Architecture:** CLIP ViT-B/16 vision encoder
- **Parameters:** 86M
- **Output:** 512-dimensional L2-normalized embeddings (768d internal, projected to 512d via visual_projection)
- **Input:** 224x224 RGB image, normalized per channel with the standard CLIP mean and standard deviation
- **Size:** ~329 MB
- **Source:** openai/clip-vit-base-patch16
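The input normalization listed above can be sketched in a few lines of Python. The mean and standard deviation values are the ones published with the original OpenAI CLIP preprocessing. Whether CrispEmbed applies this step internally is not stated here, so treat this as an illustration of the arithmetic rather than a required step; `clip_normalize_pixel` and `clip_normalize_image` are hypothetical helper names.

```python
# Per-channel statistics from the original OpenAI CLIP preprocessing.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def clip_normalize_pixel(rgb):
    """Map one 8-bit (R, G, B) pixel to CLIP-normalized floats (hypothetical helper)."""
    # Scale to [0, 1], then subtract the channel mean and divide by the channel std.
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, CLIP_MEAN, CLIP_STD))


def clip_normalize_image(pixels):
    """Normalize an image given as rows of (R, G, B) tuples, e.g. 224 rows of 224 pixels."""
    return [[clip_normalize_pixel(p) for p in row] for row in pixels]


# Example: a tiny 1x2 "image" with one black and one white pixel.
normalized = clip_normalize_image([[(0, 0, 0), (255, 255, 255)]])
```

A real pipeline would first resize and crop the image to 224x224 before normalizing; that step is omitted here.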
## Usage
```bash
# Embed a single image
crispembed -m clip-vit-base-patch16 --image photo.jpg
# Batch processing
crispembed -m clip-vit-base-patch16 --image-dir ./photos/ --output embeddings.bin
```
## Cross-modal pairing
This model shares an embedding space with [cstr/clip-text-base-GGUF](https://huggingface.co/cstr/clip-text-base-GGUF). Use both to perform zero-shot image-text matching:
```bash
crispembed -m clip-vit-base-patch16 --image photo.jpg # vision embedding
crispembed -m clip-text-base "a photo of a cat" # text embedding
```
Cosine similarity between the two outputs measures image-text alignment.
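As a minimal sketch of that comparison, the Python below computes cosine similarity and ranks candidate captions against one image embedding. It assumes the embeddings have already been loaded into plain lists of floats; how to read them from CrispEmbed's output is not covered here, and the function names and toy vectors are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_captions(image_vec, caption_vecs):
    """Return (caption, score) pairs sorted from best to worst match."""
    scores = {text: cosine_similarity(image_vec, vec) for text, vec in caption_vecs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy 3-dimensional vectors standing in for real 512-dimensional embeddings.
image_vec = [0.6, 0.8, 0.0]
caption_vecs = {
    "a photo of a cat": [0.6, 0.8, 0.0],
    "a photo of a dog": [0.0, 0.6, 0.8],
}
ranking = rank_captions(image_vec, caption_vecs)
```

Because the embeddings are L2-normalized, the two norms are 1 and the cosine similarity reduces to a plain dot product; the general form is kept here so the helper also works on unnormalized vectors.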
## Notes
- All output embeddings are L2-normalized.
- This is a GGUF conversion; the weights are numerically equivalent to those of the original Hugging Face model.