---
license: mit
tags:
  - clip
  - vision
  - gguf
  - crispembed
  - image-embedding
pipeline_tag: image-feature-extraction
library_name: ggml
---

# CLIP ViT-L/14 Vision Encoder (GGUF)

GGUF conversion of [openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) for use with [CrispEmbed](https://github.com/CrispStrobe/CrispEmbed).

- **Architecture:** CLIP ViT-L/14 vision encoder
- **Parameters:** 304M
- **Output:** 768-dimensional L2-normalized embeddings (1024-dimensional hidden state, projected to 768 dimensions)
- **Input:** 224x224 RGB image, normalized with the CLIP per-channel mean and standard deviation
- **Size:** ~1.2 GB
- **Source:** openai/clip-vit-large-patch14
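
For reference, "CLIP normalization" conventionally means scaling each 8-bit channel to [0, 1] and then standardizing it with the per-channel statistics shipped in the original OpenAI preprocessor config. The sketch below illustrates only that arithmetic in plain Python. It assumes the image has already been resized to 224x224, and the helper names `normalize_pixel` and `normalize_image` are hypothetical; CrispEmbed's own preprocessing (resize and crop details in particular) may differ.

```python
# Per-channel statistics from the original CLIP preprocessor config (RGB order).
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def normalize_pixel(rgb):
    """Map one 8-bit (R, G, B) pixel to CLIP-normalized floats."""
    return tuple(
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb, CLIP_MEAN, CLIP_STD)
    )


def normalize_image(pixels):
    """Normalize a row-major list of (R, G, B) pixels.

    ASSUMPTION: the image was already resized to 224x224 by the caller.
    """
    return [normalize_pixel(p) for p in pixels]


# Hypothetical input: a uniform mid-gray 224x224 image.
image = [(128, 128, 128)] * (224 * 224)
normalized = normalize_image(image)
print(len(normalized), normalized[0])
```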

## Usage

```bash
# Embed a single image
crispembed -m clip-vit-large-patch14 --image photo.jpg

# Batch processing
crispembed -m clip-vit-large-patch14 --image-dir ./photos/ --output embeddings.bin
```
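
The layout of `embeddings.bin` is not documented in this card. If it turns out to be a headerless run of float32 values with 768 values per image, it could be read along the following lines. That layout is an assumption, as is the helper name `load_embeddings`; check the CrispEmbed documentation before relying on it. The demo writes a synthetic file so the sketch runs without CrispEmbed installed.

```python
import os
import tempfile
from array import array

DIM = 768  # output dimension stated in this card


def load_embeddings(path, dim=DIM):
    """Read a headerless float32 file into a list of `dim`-length vectors.

    ASSUMPTION: raw float32 values in native byte order, no header.
    """
    size = os.path.getsize(path)
    if size % (4 * dim) != 0:
        raise ValueError(
            f"{size} bytes is not a whole number of {dim}-d float32 vectors"
        )
    values = array("f")
    with open(path, "rb") as fh:
        values.fromfile(fh, size // 4)
    return [list(values[i:i + dim]) for i in range(0, len(values), dim)]


# Synthetic stand-in for a real output file: two all-zero vectors.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "embeddings.bin")
    with open(path, "wb") as fh:
        array("f", [0.0] * (2 * DIM)).tofile(fh)
    vectors = load_embeddings(path)
    print(len(vectors), "vectors of dimension", len(vectors[0]))
```

The size check fails early when the file does not match the assumed layout, which is the most likely symptom of a header or a different element type.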

## Cross-modal pairing

This encoder shares an embedding space with [cstr/clip-text-large-GGUF](https://huggingface.co/cstr/clip-text-large-GGUF), so image and text embeddings can be compared directly for zero-shot image-text matching.
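
A minimal sketch of zero-shot matching once both kinds of embeddings are available: score each candidate caption by its dot product with the image embedding, then turn the scaled scores into probabilities with a softmax. The toy 3-dimensional vectors stand in for real 768-dimensional embeddings, the function names are hypothetical, and the scale of 100 is an assumption based on the value commonly used with the original CLIP checkpoints.

```python
import math


def dot(a, b):
    """Dot product; equals cosine similarity for L2-normalized inputs."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot(image_emb, text_embs, logit_scale=100.0):
    """Return one probability per candidate text for a single image.

    ASSUMPTION: both inputs are already L2-normalized, and logit_scale=100
    matches the scale used by the original CLIP checkpoints.
    """
    sims = [dot(image_emb, t) for t in text_embs]
    return softmax([logit_scale * s for s in sims])


# Toy unit vectors standing in for real image and text embeddings.
image_emb = [1.0, 0.0, 0.0]
text_embs = [[0.6, 0.8, 0.0], [0.0, 1.0, 0.0]]
labels = ["a photo of a cat", "a photo of a dog"]

probs = zero_shot(image_emb, text_embs)
best = labels[probs.index(max(probs))]
print(best, probs)
```

For ranking alone the softmax is optional, since the caption with the highest dot product wins either way; the scale only matters when the scores are read as probabilities.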

## Notes

- All output embeddings are L2-normalized, so the cosine similarity between two embeddings equals their dot product.
- This is a GGUF conversion; weights are numerically equivalent to the original Hugging Face model.