CLIP-ViT-L/14-HXQ

3.6x smaller than FP32. CIFAR-100 zero-shot Top-1 72.8%. The first vision model compressed with HXQ.

CLIP ViT-Large/14 (text + vision dual encoder) compressed from 1.6 GB to 447 MB. Zero-shot classification accuracy matches the dense baseline. No calibration data. Same codec that compresses Transformers, SSMs, Hybrids, and MoEs.

Install and Run

pip install "helix-substrate[hf]"

import helix_substrate  # registers the HXQ quantizer with HuggingFace
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

model = CLIPModel.from_pretrained("EchoLabs33/clip-vit-large-patch14-helix")
processor = CLIPProcessor.from_pretrained("EchoLabs33/clip-vit-large-patch14-helix")

image = Image.open("photo.jpg")
inputs = processor(
    text=["a photo of a cat", "a photo of a dog", "a photo of a car"],
    images=image,
    return_tensors="pt",
    padding=True,
)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)  # [cat_prob, dog_prob, car_prob]

Downstream Benchmarks

Zero-shot CIFAR-100 classification (10,000 test images, 100 classes, prompt: "a photo of a {class}"):

| Metric | Dense | HXQ (3.6x) | Delta |
|---|---|---|---|
| Top-1 Accuracy | 72.48% | 72.75% | +0.27% |
| Top-5 Accuracy | 91.41% | 91.64% | +0.23% |

All deltas within noise. Task performance preserved after 3.6x compression.
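Zero-shot classification here means scoring each class prompt's text embedding against the image embedding and taking the argmax. A minimal sketch of that scoring step, using tiny synthetic embeddings rather than real model outputs:

```python
import numpy as np

def zero_shot_predict(image_emb, text_embs):
    """Return the index of the class prompt most cosine-similar to the image."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return int(np.argmax(text_embs @ image_emb))  # cosine similarities

# Toy 4-d embeddings: three orthogonal "class prompts", an image near class 1
text_embs = np.eye(3, 4)
image_emb = np.array([0.1, 0.9, 0.0, 0.05])
print(zero_shot_predict(image_emb, text_embs))  # -> 1
```

The paired dense/HXQ evaluation runs exactly this ranking over the same 10,000 test images, so the Top-1 delta measures only the effect of compression.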

Compression Benchmark

| Metric | Dense (FP32) | HXQ |
|---|---|---|
| Size | 1.6 GB | 447 MB |
| Compression ratio | -- | 3.6x |
| VRAM (eval) | 3,412 MB | 2,266 MB |
| Compressed modules | -- | 218 HelixLinear layers |
| Architecture | CLIP (ViT-L/14 + Text Transformer) | unchanged |

Verification Status

  • Compression receipt: PASS -- 218 compressed, 374 exact, mean cosine 0.9997
  • Conversion receipt: PASS (Gate 1 + Gate 2)
  • Downstream eval: PASS -- paired dense/HXQ on CIFAR-100 zero-shot

Good to Know

  • GPU and CPU supported -- runs on any CUDA GPU or CPU.
  • Not fine-tunable -- compressed weights are read-only (is_trainable = False).
  • Requires helix-substrate -- you need pip install "helix-substrate[hf]".
  • Embeddings stored exact -- token, position, and patch embeddings are at full precision. Only the 218 attention + MLP linear layers are compressed.

What is HelixCode?

HelixCode is a universal weight compression codec based on vector quantization:

  • Each weight matrix is replaced by a 256-entry codebook (float32) + uint8 index matrix + optional sidecar corrections for outlier values
  • The compressed form is the executable -- no decompression step
  • Works on any nn.Linear regardless of architecture
  • No calibration data required -- codebooks are fit from the weights alone
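A toy sketch of the scheme the bullets describe: fit a 256-entry codebook to the weight values alone (no calibration data), store uint8 indices, and keep the worst-approximated values in an exact sidecar. The k-means fit, `outlier_frac`, and sidecar layout below are illustrative, not the published HXQ format:

```python
import numpy as np

def fit_codebook(w, n_codes=256, n_iter=10):
    """Toy 1-D k-means: 256 float32 centroids fit from the weights alone."""
    flat = w.ravel().astype(np.float32)
    codes = np.quantile(flat, np.linspace(0, 1, n_codes)).astype(np.float32)
    for _ in range(n_iter):
        idx = np.abs(flat[:, None] - codes[None, :]).argmin(axis=1)
        for k in range(n_codes):
            members = flat[idx == k]
            if members.size:
                codes[k] = members.mean()
    return codes

def compress(w, outlier_frac=0.004):
    codes = fit_codebook(w)
    idx = np.abs(w.ravel()[:, None] - codes[None, :]).argmin(axis=1).astype(np.uint8)
    err = np.abs(w.ravel() - codes[idx])
    n_out = max(1, int(outlier_frac * w.size))
    out_pos = np.argsort(err)[-n_out:]          # sidecar: worst outliers, exact
    sidecar = (out_pos, w.ravel()[out_pos].astype(np.float32))
    return codes, idx.reshape(w.shape), sidecar

def decompress(codes, idx, sidecar):
    w = codes[idx.ravel()].copy()
    pos, vals = sidecar
    w[pos] = vals
    return w.reshape(idx.shape)
```

In the real codec the compressed form is executed directly; decompression here is shown only to check reconstruction quality.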

Architecture Details

CLIP ViT-Large/14 is a dual-encoder multimodal model:

  • Vision encoder: 24-layer ViT-Large, hidden_size=1024, 16 attention heads, patch_size=14
  • Text encoder: 12-layer Transformer, hidden_size=768, 12 attention heads
  • Cross-modal projections: visual_projection (1024->768) + text_projection (768->768)

All 218 linear layers across both encoders are compressed. Embedding layers (token, position, patch), layer norms, and biases are stored at full precision.
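The headline ratio follows from storage arithmetic: an FP32 weight costs 4 bytes, while its uint8 index costs 1 byte plus a shared 256 x 4 B codebook, so each compressed linear layer approaches 4x; the exact-precision embeddings, norms, and biases pull the whole-model ratio down to about 3.6x. A back-of-envelope sketch (the 1024x1024 shape is illustrative, without the sidecar overhead):

```python
def layer_bytes_fp32(rows, cols):
    return 4 * rows * cols  # 4 bytes per FP32 weight

def layer_bytes_hxq(rows, cols, n_codes=256):
    return rows * cols + 4 * n_codes  # uint8 indices + float32 codebook

dense = layer_bytes_fp32(1024, 1024)
hxq = layer_bytes_hxq(1024, 1024)
print(f"{dense / hxq:.2f}x")  # -> 4.00x for this layer alone
```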

Why This Matters

CLIP is the first vision model compressed with HXQ. The same codec now covers:

| Family | Models | Eval |
|---|---|---|
| Transformer | TinyLlama, Qwen 1.5B-14B | PPL within noise |
| Pure SSM | Mamba 130m, Mamba2 1.3B | PPL receipted |
| Hybrid | Zamba2 1.2B, 2.7B | PPL receipted |
| MoE | OLMoE 1B/7B | HellaSwag -0.16% |
| Vision+Text | CLIP ViT-L/14 | Top-1 +0.27% |

Five architecture families. One codec. One pip install.

Companion Models

| Model | Architecture | Ratio | Eval Delta |
|---|---|---|---|
| clip-vit-large-patch14-helix | Vision+Text (CLIP) | 3.6x | +0.27% Top-1 |
| olmoe-1b-7b-instruct-helix | MoE (64 experts) | 1.9x | -0.16% HellaSwag |
| zamba2-2.7b-instruct-helix | Hybrid (Mamba2+Transformer) | 1.8x | +6.59% PPL |
| zamba2-1.2b-helix | Hybrid (Mamba2+Transformer) | 1.7x | +2.90% PPL |
| qwen2.5-14b-instruct-helix | Transformer | 3.4x | pending |
| qwen2.5-3b-instruct-helix | Transformer | 1.6x | +0.69% PPL |
| tinyllama-1.1b-helix | Transformer | 4.0x | +0.78% PPL |
| mamba2-1.3b-helix | Pure SSM (Mamba2) | 2.1x | +8.0% PPL |
| mamba-130m-helix | Pure SSM | 3.8x | +18.4% PPL |

Citation

@software{helix_substrate_2026,
  title={Helix Substrate: Universal Weight Compression via HelixCode},
  author={EchoLabs},
  year={2026},
  url={https://github.com/echo313unfolding/helix-substrate}
}

License

Apache 2.0 (inherited from openai/clip-vit-large-patch14).
