CLIP Text Encoder Large (GGUF)

GGUF conversion of the CLIP text encoder (large) for use with CrispEmbed. Extracted from openai/clip-vit-large-patch14.

Architecture: CLIP text transformer with causal attention
Parameters: 124M
Output: 768-dimensional L2-normalized embeddings
Tokenizer: BPE tokenizer (embedded in GGUF), max 77 tokens
Size: ~474 MB

Usage

# Embed a single text
crispembed -m clip-text-large "a photo of a cat"

# Embed from file
crispembed -m clip-text-large --input queries.txt --output embeddings.bin

Cross-modal pairing

Output embeddings live in the same vector space as the embeddings produced by cstr/clip-vit-large-patch14-GGUF and cstr/clip-vit-large-patch14-336-GGUF. Pair this text encoder with either of those models for zero-shot image-text retrieval.
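
Because the embeddings are L2-normalized, the dot product of a text vector and an image vector equals their cosine similarity, so retrieval reduces to sorting by dot product. The sketch below illustrates that ranking step in plain Python. It assumes the embeddings have already been loaded into lists of floats. The helper names (l2_normalize, dot, rank_images), the file names, and the 3-dimensional toy vectors are hypothetical stand-ins for real 768-dimensional outputs and are not part of CrispEmbed.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (only needed for the toy data here)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def rank_images(text_emb, image_embs):
    """Return (name, score) pairs sorted from most to least similar."""
    scores = [(name, dot(text_emb, emb)) for name, emb in image_embs.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins for a text embedding and two image embeddings.
text = l2_normalize([1.0, 0.2, 0.0])
images = {
    "cat.jpg": l2_normalize([0.9, 0.1, 0.0]),
    "car.jpg": l2_normalize([0.0, 0.1, 1.0]),
}

ranking = rank_images(text, images)
for name, score in ranking:
    print(f"{name}\t{score:.4f}")
```

With real data, text would come from this model and the values in images from one of the paired image encoders. No extra normalization is needed in that case, since the outputs are already unit length.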

Notes

All output embeddings are L2-normalized.
BPE tokenizer is bundled inside the GGUF file; no external vocab files needed.
This is a GGUF conversion; weights are numerically equivalent to the original HuggingFace model.
