CLIP ViT-Large/14 – INT8 Quantized
OpenAI's CLIP vision encoder quantized to INT8 for real-time robotic perception. 4.5x smaller than the original – from 6.5 GB to 1.5 GB – while preserving zero-shot classification and visual grounding capabilities.
This model is part of the RobotFlowLabs model library, built for the ANIMA agentic robotics platform – a modular ROS2-native AI system designed to bring foundation model intelligence to real robots operating in the real world.
Why This Model Exists
Large vision-language models like CLIP are essential for robotic scene understanding – identifying objects, understanding spatial relationships, and grounding natural language instructions to visual observations. But at 6.5 GB, the original CLIP ViT-L/14 is too heavy for edge deployment on devices like NVIDIA Jetson, Raspberry Pi, or embedded industrial controllers.
We quantized CLIP to INT8 and exported to ONNX so robots can run it in real-time, on-device, without cloud dependencies.
Model Details
| Property | Value |
|---|---|
| Architecture | Vision Transformer (ViT-L/14) |
| Parameters | 304M (vision encoder) |
| Hidden Dimension | 1024 |
| Layers | 24 transformer blocks |
| Attention Heads | 16 |
| MLP Dimension | 4096 |
| Input Resolution | 224 × 224 |
| Patch Size | 14 × 14 |
| Tokens | 257 (256 patches + 1 CLS) |
| Original Model | openai/clip-vit-large-patch14 |
| License | MIT |
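The token count in the table follows directly from the input resolution and patch size; a quick arithmetic check in plain Python, using only the values from the table above:

```python
# Patch grid for ViT-L/14: 224 x 224 input, 14 x 14 patches
resolution, patch = 224, 14
patches_per_side = resolution // patch   # 16 patches per side
num_patches = patches_per_side ** 2      # 256 patches total
num_tokens = num_patches + 1             # +1 for the CLS token
print(patches_per_side, num_patches, num_tokens)  # 16 256 257
```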
Compression Results
Quantized on an NVIDIA L4 24GB GPU using INT8 dynamic quantization with ONNX Runtime export.
| Metric | Original | INT8 Quantized | Change |
|---|---|---|---|
| Total Size | 6,529 MB | 1,451 MB | 4.5x smaller |
| INT8 Weights | – | 293 MB | Vision encoder only |
| ONNX Graph | – | 1,158 MB | Full model with optimizations |
| Quantization | FP32 | INT8 Dynamic | Per-tensor symmetric |
| Format | PyTorch | PyTorch INT8 + ONNX | Dual format |
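The headline 4.5x figure is just the size ratio between the two rows of the table:

```python
# Sizes from the compression table above, in MB
original_mb, quantized_mb = 6529, 1451
ratio = original_mb / quantized_mb
print(round(ratio, 1))  # 4.5
```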
Included Files
```
clip-vit-large-patch14-int8/
├── model_int8.pt              # 293 MB – INT8 quantized state dict
├── model.onnx                 # 2.0 MB – ONNX graph structure
├── model.onnx.data            # 1.2 GB – ONNX external weights
├── config.json                # Model configuration
├── preprocessor_config.json   # Image preprocessing config
├── tokenizer_config.json      # Text tokenizer config
└── README.md                  # This file
```
Quick Start
PyTorch (INT8 Weights)
```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Load the original architecture
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")

# Load the INT8 quantized vision encoder weights.
# PyTorch dynamic INT8 modules execute on CPU, so load the state dict there.
int8_state = torch.load("model_int8.pt", map_location="cpu", weights_only=True)
model.vision_model.load_state_dict(int8_state, strict=False)

# Run inference (image is a PIL.Image, e.g. a camera frame)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
inputs = processor(images=image, text=["a robot arm", "a table", "a cup"], return_tensors="pt")
outputs = model(**inputs)
```
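The model's `logits_per_image` hold one score per text prompt; a softmax turns them into zero-shot class probabilities. A minimal numpy sketch with made-up logits (in practice the values come from `outputs.logits_per_image`):

```python
import numpy as np

def zero_shot_probs(logits):
    """Softmax image-text logits into class probabilities."""
    z = np.asarray(logits, dtype=np.float64)
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical logits for ["a robot arm", "a table", "a cup"]
probs = zero_shot_probs([24.1, 19.3, 17.8])
print(probs.argmax())  # 0 -> "a robot arm"
```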
ONNX Runtime (Recommended for Deployment)
```python
import onnxruntime as ort
import numpy as np

# GPU inference with CPU fallback
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Preprocess image to a (1, 3, 224, 224) float32 array
pixel_values = preprocess(image)  # your preprocessing pipeline
outputs = session.run(None, {"pixel_values": pixel_values})
```
With FORGE (ANIMA Integration)
```python
from forge.vision import VisionEncoderRegistry

# FORGE auto-detects INT8 weights and loads them optimally
encoder = VisionEncoderRegistry.load("clip-vit-large-patch14-int8")
features = encoder(image_tensor)  # (B, 257, 1024)
```
Use Cases in ANIMA
CLIP serves as the visual grounding backbone across multiple ANIMA modules:
- Object Recognition – Zero-shot identification of objects in the robot's workspace without task-specific training
- Instruction Grounding – Matching natural language commands ("pick up the red cup") to visual observations
- Scene Understanding – Encoding visual context for downstream VLA (Vision-Language-Action) models
- Anomaly Detection – Comparing visual embeddings to detect unexpected objects or states
- Multi-Modal Retrieval – Searching robot memory for visually similar past experiences
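The anomaly-detection and retrieval uses above both reduce to comparing embeddings. A minimal numpy sketch, assuming the 1024-d pooled CLIP features have already been extracted (function names and the 0.8 threshold are illustrative, not part of ANIMA's API):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, dtype=np.float64), np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_anomalous(embedding, reference_embeddings, threshold=0.8):
    """Flag an observation whose best match against known-good
    reference embeddings falls below the similarity threshold."""
    best = max(cosine_similarity(embedding, r) for r in reference_embeddings)
    return best < threshold
```

The same similarity score, sorted descending over stored embeddings, gives the multi-modal retrieval ranking.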
About ANIMA
ANIMA is a modular, ROS2-native agentic robotics platform developed by RobotFlowLabs. It combines 58 specialized AI modules – from perception and planning to manipulation and safety – into a unified system that enables robots to understand, reason, and act in unstructured real-world environments.
ANIMA modules run on edge hardware (Jetson Orin, industrial PCs) with real-time constraints. Every foundation model we deploy must be compressed without sacrificing the capabilities that make it useful. That's why we built FORGE – our distillation and compression pipeline – and why we're releasing optimized model variants publicly.
We believe the robotics community deserves production-ready models, not just research checkpoints.
Other Models in This Collection
Browse all optimized models at huggingface.co/robotflowlabs:
- Vision: SAM2, DINOv2, SigLIP, Depth Anything – for segmentation, features, and depth
- Language: Qwen2.5, SmolLM2 – for reasoning and instruction following
- VLM: Qwen2.5-VL – for visual question answering and scene description
- VLA: SmolVLA, RDT2-FM – for end-to-end robotic action generation
- Embeddings: BGE – for semantic search and retrieval
Intended Use
Designed For
- Real-time robotic perception pipelines on edge GPUs
- Zero-shot visual classification in manipulation and navigation tasks
- Visual grounding of natural language instructions
- Feature extraction for downstream VLA models
Limitations
- INT8 quantization may slightly reduce accuracy on fine-grained classification tasks
- Vision encoder only – text encoder not quantized in this release
- Requires GPU for ONNX Runtime inference (CPU fallback available but slower)
- Inherits any biases present in the original CLIP training data (WebImageText)
Out of Scope
- Medical or safety-critical diagnosis without additional validation
- Facial recognition or biometric identification
- Surveillance applications
Technical Details
Compression Pipeline
```
Original CLIP ViT-L/14 (FP32, 6.5 GB)
        │
        ├── torch.quantization.quantize_dynamic (INT8)
        │       └── model_int8.pt (293 MB)
        │
        └── torch.onnx.export (opset 18, GPU-traced)
                └── model.onnx + model.onnx.data (1.2 GB)
```
- Quantization: Dynamic INT8 per-tensor symmetric quantization applied to all `nn.Linear` layers
- ONNX Export: Traced on an NVIDIA L4 GPU using the PyTorch 2.10 dynamo-based ONNX exporter with opset 18
- Graph Optimization: 100 pattern rewrite rules applied, unused nodes removed
- Hardware: Compressed on NVIDIA L4 24GB, CUDA 13.0, Python 3.14
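The quantization stage can be reproduced on any module with `torch.quantization.quantize_dynamic`; a minimal sketch on a small stand-in network (the full CLIP vision encoder is quantized the same way, over its `nn.Linear` layers):

```python
import torch
import torch.nn as nn

# Stand-in for the linear-heavy transformer blocks of the vision encoder
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))

# Dynamic INT8 quantization of every nn.Linear: weights are stored as int8,
# activations are quantized on the fly at inference time (CPU execution)
qmodel = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

with torch.no_grad():
    out = qmodel(torch.randn(1, 1024))
print(out.shape)  # torch.Size([1, 1024])
```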
Attribution
- Original Model: `openai/clip-vit-large-patch14` by OpenAI
- License: MIT – free for commercial and research use
- Paper: Learning Transferable Visual Models From Natural Language Supervision – Radford et al., 2021
- Compressed by: RobotFlowLabs using FORGE
Citation
```bibtex
@article{radford2021learning,
  title={Learning Transferable Visual Models From Natural Language Supervision},
  author={Radford, Alec and Kim, Jong Wook and Hallacy, Chris and Ramesh, Aditya and Goh, Gabriel and Agarwal, Sandhini and Sastry, Girish and Askell, Amanda and Mishkin, Pamela and Clark, Jack and others},
  journal={International Conference on Machine Learning},
  year={2021}
}

@misc{robotflowlabs2026anima,
  title={ANIMA: Agentic Networked Intelligence for Modular Autonomy},
  author={RobotFlowLabs},
  year={2026},
  url={https://huggingface.co/robotflowlabs}
}
```
Built with FORGE by RobotFlowLabs
Optimizing foundation models for real robots.