license: mit
tags:
  - clip
  - vision
  - gguf
  - crispembed
  - image-embedding
pipeline_tag: image-feature-extraction
library_name: ggml

CLIP ViT-L/14 Vision Encoder (GGUF)

GGUF conversion of openai/clip-vit-large-patch14 for use with CrispEmbed.

Architecture: CLIP ViT-L/14 vision encoder
Parameters: 304M
Output: 768-dimensional L2-normalized embeddings (1024d internal, projected to 768d)
Input: 224x224 RGB image with CLIP normalization
Size: ~1.2 GB
Source: openai/clip-vit-large-patch14
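The "CLIP normalization" in the input line can be illustrated with a short sketch. It applies the per-channel mean and standard deviation from the original OpenAI CLIP preprocessing to a single RGB pixel. This card does not say whether CrispEmbed performs this step internally, so the helper name `clip_normalize_pixel` is hypothetical and the code is only an illustration, not CrispEmbed's implementation.

```python
# Per-channel statistics from the original OpenAI CLIP preprocessing.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def clip_normalize_pixel(rgb):
    """Map an 8-bit (R, G, B) pixel to CLIP-normalized floats.

    Hypothetical helper for illustration; not part of CrispEmbed.
    """
    out = []
    for value, mean, std in zip(rgb, CLIP_MEAN, CLIP_STD):
        scaled = value / 255.0            # 0..255 -> 0..1
        out.append((scaled - mean) / std)  # center and scale per channel
    return tuple(out)


if __name__ == "__main__":
    print(clip_normalize_pixel((255, 255, 255)))
```

A full pipeline would first resize and center-crop the image to 224x224, then apply this mapping to every pixel.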

Usage

# Embed a single image
crispembed -m clip-vit-large-patch14 --image photo.jpg

# Batch processing
crispembed -m clip-vit-large-patch14 --image-dir ./photos/ --output embeddings.bin

Cross-modal pairing

The image embeddings share an embedding space with the text embeddings from cstr/clip-text-large-GGUF, so the two can be compared directly for zero-shot image-text matching.
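As a rough sketch of zero-shot matching, the snippet below scores one image embedding against several text embeddings and turns the scores into probabilities with a softmax. It assumes the embeddings are already L2-normalized lists of floats. The function name `zero_shot_probs` and the default `logit_scale=100.0` are assumptions for illustration (CLIP learns a temperature that ends up near this value), and the 3-dimensional toy vectors stand in for real 768-dimensional outputs.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled image-text dot products.

    Assumes every embedding is L2-normalized, so the dot product
    is the cosine similarity. logit_scale is an assumed value.
    """
    logits = [logit_scale * dot(image_emb, t) for t in text_embs]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    image = [1.0, 0.0, 0.0]                       # toy image embedding
    texts = [[0.8, 0.6, 0.0], [0.0, 1.0, 0.0]]    # toy text embeddings
    probs = zero_shot_probs(image, texts)
    best = max(range(len(probs)), key=probs.__getitem__)
    print(best, probs)
```

The index of the largest probability identifies the text label that best matches the image.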

Notes

All output embeddings are L2-normalized, so the dot product of two embeddings equals their cosine similarity.
This is a format conversion to GGUF; the weights are numerically equivalent to those of the original Hugging Face model.
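The normalization note above can be checked numerically once embeddings are loaded as lists of floats. The sketch below is a minimal, library-free illustration; the helper names are hypothetical, and how the embeddings are read from CrispEmbed's output is deliberately left out because the output format is not described here.

```python
import math


def l2_norm(vec):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vec))


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = l2_norm(vec)
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def is_unit_length(vec, tol=1e-3):
    """True if the vector's L2 norm is within tol of 1."""
    return abs(l2_norm(vec) - 1.0) <= tol


if __name__ == "__main__":
    raw = [3.0, 4.0]                 # toy vector with norm 5
    unit = l2_normalize(raw)
    print(is_unit_length(raw), is_unit_length(unit))
```

If `is_unit_length` fails on embeddings read back from disk, the usual suspects are a wrong dtype or a wrong embedding dimension when parsing the file.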