	---
	license: mit
	tags:
	- clip
	- vision
	- gguf
	- crispembed
	- image-embedding
	pipeline_tag: image-feature-extraction
	library_name: ggml
	---

	# CLIP ViT-B/16 Vision Encoder (GGUF)

	GGUF conversion of [openai/clip-vit-base-patch16](https://huggingface.co/openai/clip-vit-base-patch16) for use with [CrispEmbed](https://github.com/CrispStrobe/CrispEmbed).

	- Architecture: CLIP ViT-B/16 vision encoder
	- Parameters: 86M
	- Output: 512-dimensional L2-normalized embeddings (768d internal, projected to 512d via visual_projection)
	- Input: 224x224 RGB image with CLIP normalization
	- Size: ~329 MB
	- Source: openai/clip-vit-base-patch16
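
The input spec above can be illustrated with a small sketch. It assumes "CLIP normalization" means the per-channel mean and standard deviation published with OpenAI's CLIP preprocessing; whether CrispEmbed applies exactly these constants internally is an assumption, and the helper names `normalize_pixel` and `token_count` are hypothetical.

```python
# Per-channel statistics from OpenAI's CLIP preprocessing, in RGB order.
# Assumption: CrispEmbed uses the same values; the model card does not list them.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def normalize_pixel(rgb):
    """Scale an 8-bit (r, g, b) pixel to [0, 1], then standardize each channel."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, CLIP_MEAN, CLIP_STD))


def token_count(image_size=224, patch_size=16):
    """Number of ViT input tokens: one per patch plus one class token."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side * per_side + 1


if __name__ == "__main__":
    print(normalize_pixel((255, 255, 255)))  # a white pixel after normalization
    print(token_count())                     # tokens for a 224x224 input
```

With a 224x224 input and 16x16 patches, the encoder sees 14 x 14 = 196 patch tokens plus one class token, 197 in total.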

	## Usage

	```bash
	# Embed a single image
	crispembed -m clip-vit-base-patch16 --image photo.jpg

	# Batch processing
	crispembed -m clip-vit-base-patch16 --image-dir ./photos/ --output embeddings.bin
	```

	## Cross-modal pairing

	This model shares an embedding space with [cstr/clip-text-base-GGUF](https://huggingface.co/cstr/clip-text-base-GGUF). Use both to perform zero-shot image-text matching:

	```bash
	crispembed -m clip-vit-base-patch16 --image photo.jpg # vision embedding
	crispembed -m clip-text-base "a photo of a cat" # text embedding
	```

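	Cosine similarity between the two outputs measures image-text alignment. Because both embeddings are L2-normalized, it equals their dot product, and higher values indicate a closer match.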
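
As a minimal sketch of the comparison step, the following computes cosine similarity and ranks candidate captions against one image embedding. It assumes the embeddings have already been loaded into Python lists of floats; how to parse CrispEmbed's output is not covered here, and the 3-dimensional vectors are toy stand-ins for the real 512-dimensional ones. The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_captions(image_vec, caption_vecs):
    """Return (caption, score) pairs sorted from best to worst match."""
    scored = [
        (caption, cosine_similarity(image_vec, vec))
        for caption, vec in caption_vecs.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy unit-length vectors standing in for real embeddings.
    image_vec = [0.6, 0.8, 0.0]
    caption_vecs = {
        "a photo of a cat": [0.6, 0.8, 0.0],
        "a photo of a dog": [0.0, 0.6, 0.8],
    }
    for caption, score in rank_captions(image_vec, caption_vecs):
        print(f"{score:.3f}  {caption}")
```

Dividing by the norms is redundant for unit-length inputs, but it keeps the function correct if a vector from another source is not normalized.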

	## Notes

	- All output embeddings are L2-normalized.
	- This is a GGUF conversion; the weights are numerically equivalent to those of the original Hugging Face model.
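
One way to sanity-check the normalization note on loaded embeddings is sketched below. It assumes the vectors are available as Python lists of floats; the tolerance `tol` is an arbitrary choice, not a value documented by CrispEmbed, and the helper names are hypothetical.

```python
import math


def l2_norm(vec):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vec))


def is_unit_norm(vec, tol=1e-3):
    """True if the vector's length is within tol of 1.0."""
    return abs(l2_norm(vec) - 1.0) <= tol


def l2_normalize(vec):
    """Rescale a non-zero vector to unit length."""
    norm = l2_norm(vec)
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


if __name__ == "__main__":
    raw = [3.0, 4.0]
    print(is_unit_norm(raw))                # checks the unscaled vector
    print(is_unit_norm(l2_normalize(raw)))  # checks it after rescaling
```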