CLIP ViT-Large/14 — INT8 Quantized

OpenAI's CLIP vision encoder quantized to INT8 for real-time robotic perception. 4.5x smaller than the original — from 6.5 GB to 1.5 GB — while preserving zero-shot classification and visual grounding capabilities.

This model is part of the RobotFlowLabs model library, built for the ANIMA agentic robotics platform — a modular ROS2-native AI system designed to bring foundation model intelligence to real robots operating in the real world.

Why This Model Exists

Large vision-language models like CLIP are essential for robotic scene understanding — identifying objects, understanding spatial relationships, and grounding natural language instructions to visual observations. But at 6.5 GB, the original CLIP ViT-L/14 is too heavy for edge deployment on devices like NVIDIA Jetson, Raspberry Pi, or embedded industrial controllers.

We quantized CLIP to INT8 and exported to ONNX so robots can run it in real-time, on-device, without cloud dependencies.

Model Details

| Property | Value |
|---|---|
| Architecture | Vision Transformer (ViT-L/14) |
| Parameters | 304M (vision encoder) |
| Hidden Dimension | 1024 |
| Layers | 24 transformer blocks |
| Attention Heads | 16 |
| MLP Dimension | 4096 |
| Input Resolution | 224 × 224 |
| Patch Size | 14 × 14 |
| Tokens | 257 (256 patches + 1 CLS) |
| Original Model | openai/clip-vit-large-patch14 |
| License | MIT |
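The token count above follows directly from the patch geometry: a 224 × 224 image tiled into 14 × 14 patches gives a 16 × 16 grid, plus one CLS token. A quick arithmetic check:

```python
# Patch-grid arithmetic for ViT-L/14 at 224x224 input
image_size, patch_size = 224, 14
num_patches = (image_size // patch_size) ** 2   # 16 x 16 = 256 patches
num_tokens = num_patches + 1                    # + 1 CLS token
print(num_tokens)  # -> 257
```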

Compression Results

Quantized on an NVIDIA L4 (24 GB) GPU using INT8 dynamic quantization, with an ONNX Runtime-compatible export for deployment.

| Metric | Original | INT8 Quantized | Change |
|---|---|---|---|
| Total Size | 6,529 MB | 1,451 MB | 4.5x smaller |
| INT8 Weights | — | 293 MB | Vision encoder only |
| ONNX Graph | — | 1,158 MB | Full model with optimizations |
| Quantization | FP32 | INT8 Dynamic | Per-tensor symmetric |
| Format | PyTorch | PyTorch INT8 + ONNX | Dual format |
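As a sanity check, the two quantized artifacts account for the full 1,451 MB footprint, and the headline ratio follows from the totals (MB figures taken from this card, assumed exact):

```python
# Size accounting for the quantized release (figures from the table above)
int8_weights_mb = 293        # model_int8.pt
onnx_graph_mb = 1158         # model.onnx + model.onnx.data
total_mb = int8_weights_mb + onnx_graph_mb   # 1451 MB total

original_mb = 6529
ratio = original_mb / total_mb               # ~4.5x smaller
```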

Included Files

clip-vit-large-patch14-int8/
├── model_int8.pt              # 293 MB — INT8 quantized state dict
├── model.onnx                 # 2.0 MB — ONNX graph structure
├── model.onnx.data            # 1.2 GB — ONNX external weights
├── config.json                # Model configuration
├── preprocessor_config.json   # Image preprocessing config
├── tokenizer_config.json      # Text tokenizer config
└── README.md                  # This file

Quick Start

PyTorch (INT8 Weights)

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load original architecture
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")

# Load INT8 quantized vision encoder weights
int8_state = torch.load("model_int8.pt", map_location="cuda", weights_only=True)
model.vision_model.load_state_dict(int8_state, strict=False)

# Run inference (image is a PIL.Image, e.g. loaded with Image.open)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
inputs = processor(images=image, text=["a robot arm", "a table", "a cup"], return_tensors="pt")
outputs = model(**inputs)
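The `outputs.logits_per_image` returned above are image-text similarity scores scaled by CLIP's learned temperature; a softmax over the text prompts turns them into zero-shot class probabilities. A minimal NumPy sketch with hypothetical logits (the values below are illustrative, not model output):

```python
import numpy as np

def zero_shot_probs(logits: np.ndarray) -> np.ndarray:
    """Softmax over the text prompts (last axis), numerically stabilized."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical logits_per_image for one image vs. three prompts
logits = np.array([[24.1, 18.3, 15.7]], dtype=np.float32)
probs = zero_shot_probs(logits)  # highest probability -> first prompt
```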

ONNX Runtime (Recommended for Deployment)

import onnxruntime as ort
import numpy as np

# GPU inference
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)

# Preprocess image to (1, 3, 224, 224) float32
pixel_values = preprocess(image)  # Your preprocessing pipeline
outputs = session.run(None, {"pixel_values": pixel_values})
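The `preprocess` placeholder above must match CLIP's training-time pipeline: resize the shorter side to 224, center-crop to 224 × 224, scale to [0, 1], and normalize with CLIP's channel statistics. A minimal NumPy/PIL sketch (the function name mirrors the placeholder; the mean/std values are the standard CLIP normalization constants):

```python
import numpy as np
from PIL import Image

CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)

def preprocess(image: Image.Image) -> np.ndarray:
    """Resize, center-crop, and normalize a PIL image to (1, 3, 224, 224) float32."""
    # Resize so the shorter side is 224, then center-crop a 224x224 patch
    w, h = image.size
    scale = 224 / min(w, h)
    image = image.resize((round(w * scale), round(h * scale)), Image.BICUBIC)
    w, h = image.size
    left, top = (w - 224) // 2, (h - 224) // 2
    image = image.convert("RGB").crop((left, top, left + 224, top + 224))

    # HWC uint8 -> normalized CHW float32 with a leading batch dimension
    arr = np.asarray(image, dtype=np.float32) / 255.0
    arr = (arr - CLIP_MEAN) / CLIP_STD
    return arr.transpose(2, 0, 1)[None, :]
```

In production you may prefer `CLIPProcessor` from transformers, which applies the same pipeline.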

With FORGE (ANIMA Integration)

from forge.vision import VisionEncoderRegistry

# FORGE auto-detects INT8 weights and loads optimally
encoder = VisionEncoderRegistry.load("clip-vit-large-patch14-int8")
features = encoder(image_tensor)  # (B, 257, 1024)

Use Cases in ANIMA

CLIP serves as the visual grounding backbone across multiple ANIMA modules:

  • Object Recognition β€” Zero-shot identification of objects in the robot's workspace without task-specific training
  • Instruction Grounding β€” Matching natural language commands ("pick up the red cup") to visual observations
  • Scene Understanding β€” Encoding visual context for downstream VLA (Vision-Language-Action) models
  • Anomaly Detection β€” Comparing visual embeddings to detect unexpected objects or states
  • Multi-Modal Retrieval β€” Searching robot memory for visually similar past experiences

About ANIMA

ANIMA is a modular, ROS2-native agentic robotics platform developed by RobotFlowLabs. It combines 58 specialized AI modules — from perception and planning to manipulation and safety — into a unified system that enables robots to understand, reason, and act in unstructured real-world environments.

ANIMA modules run on edge hardware (Jetson Orin, industrial PCs) with real-time constraints. Every foundation model we deploy must be compressed without sacrificing the capabilities that make it useful. That's why we built FORGE — our distillation and compression pipeline — and why we're releasing optimized model variants publicly.

We believe the robotics community deserves production-ready models, not just research checkpoints.

Other Models in This Collection

Browse all optimized models at huggingface.co/robotflowlabs:

  • Vision: SAM2, DINOv2, SigLIP, Depth Anything β€” for segmentation, features, and depth
  • Language: Qwen2.5, SmolLM2 β€” for reasoning and instruction following
  • VLM: Qwen2.5-VL β€” for visual question answering and scene description
  • VLA: SmolVLA, RDT2-FM β€” for end-to-end robotic action generation
  • Embeddings: BGE β€” for semantic search and retrieval

Intended Use

Designed For

  • Real-time robotic perception pipelines on edge GPUs
  • Zero-shot visual classification in manipulation and navigation tasks
  • Visual grounding of natural language instructions
  • Feature extraction for downstream VLA models

Limitations

  • INT8 quantization may slightly reduce accuracy on fine-grained classification tasks
  • Vision encoder only β€” text encoder not quantized in this release
  • Requires GPU for ONNX Runtime inference (CPU fallback available but slower)
  • Inherits any biases present in the original CLIP training data (WebImageText)

Out of Scope

  • Medical or safety-critical diagnosis without additional validation
  • Facial recognition or biometric identification
  • Surveillance applications

Technical Details

Compression Pipeline

Original CLIP ViT-L/14 (FP32, 6.5 GB)
    │
    ├─→ torch.quantization.quantize_dynamic (INT8)
    │   └─→ model_int8.pt (293 MB)
    │
    └─→ torch.onnx.export (opset 18, GPU-traced)
        └─→ model.onnx + model.onnx.data (1.2 GB)

  • Quantization: Dynamic INT8 per-tensor symmetric quantization applied to all nn.Linear layers
  • ONNX Export: Traced on an NVIDIA L4 GPU using the PyTorch 2.10 dynamo-based ONNX exporter with opset 18
  • Graph Optimization: 100 pattern rewrite rules applied, unused nodes removed
  • Hardware: Compressed on an NVIDIA L4 (24 GB), CUDA 13.0, Python 3.14

Attribution

Citation

@inproceedings{radford2021learning,
  title={Learning Transferable Visual Models From Natural Language Supervision},
  author={Radford, Alec and Kim, Jong Wook and Hallacy, Chris and Ramesh, Aditya and Goh, Gabriel and Agarwal, Sandhini and Sastry, Girish and Askell, Amanda and Mishkin, Pamela and Clark, Jack and others},
  booktitle={International Conference on Machine Learning},
  year={2021}
}

@misc{robotflowlabs2026anima,
  title={ANIMA: Agentic Networked Intelligence for Modular Autonomy},
  author={RobotFlowLabs},
  year={2026},
  url={https://huggingface.co/robotflowlabs}
}

Built with FORGE by RobotFlowLabs
Optimizing foundation models for real robots.
