CLIP ViT-Large/14 — INT8 Quantized

OpenAI's CLIP vision encoder quantized to INT8 for real-time robotic perception. 4.5x smaller than the original — from 6.5 GB to 1.5 GB — while preserving zero-shot classification and visual grounding capabilities.

This model is part of the RobotFlowLabs model library, built for the ANIMA agentic robotics platform — a modular ROS2-native AI system designed to bring foundation model intelligence to real robots operating in the real world.

Why This Model Exists

Large vision-language models like CLIP are essential for robotic scene understanding — identifying objects, understanding spatial relationships, and grounding natural language instructions to visual observations. But at 6.5 GB, the original CLIP ViT-L/14 is too heavy for edge deployment on devices like NVIDIA Jetson, Raspberry Pi, or embedded industrial controllers.

We quantized CLIP to INT8 and exported to ONNX so robots can run it in real-time, on-device, without cloud dependencies.

Model Details

| Property | Value |
|---|---|
| Architecture | Vision Transformer (ViT-L/14) |
| Parameters | 304M (vision encoder) |
| Hidden Dimension | 1024 |
| Layers | 24 transformer blocks |
| Attention Heads | 16 |
| MLP Dimension | 4096 |
| Input Resolution | 224 × 224 |
| Patch Size | 14 × 14 |
| Tokens | 257 (256 patches + 1 CLS) |
| Original Model | openai/clip-vit-large-patch14 |
| License | MIT |
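The token count above follows directly from the patch geometry: a 224 × 224 image tiled into 14 × 14 patches gives a 16 × 16 grid, plus one CLS token. A quick arithmetic check:

```python
# Patch-grid arithmetic for ViT-L/14 at 224x224 input
image_size, patch_size = 224, 14
num_patches = (image_size // patch_size) ** 2   # 16 x 16 = 256 patches
num_tokens = num_patches + 1                    # + 1 CLS token
print(num_tokens)  # -> 257
```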

Compression Results

Quantized on an NVIDIA L4 (24 GB) GPU using INT8 dynamic quantization, with an ONNX Runtime-compatible export for deployment.

| Metric | Original | INT8 Quantized | Change |
|---|---|---|---|
| Total Size | 6,529 MB | 1,451 MB | 4.5x smaller |
| INT8 Weights | — | 293 MB | Vision encoder only |
| ONNX Graph | — | 1,158 MB | Full model with optimizations |
| Quantization | FP32 | INT8 Dynamic | Per-tensor symmetric |
| Format | PyTorch | PyTorch INT8 + ONNX | Dual format |
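As a sanity check, the two quantized artifacts account for the full 1,451 MB footprint, and the headline ratio follows from the totals (MB figures taken from this card, assumed exact):

```python
# Size accounting for the quantized release (figures from the table above)
int8_weights_mb = 293        # model_int8.pt
onnx_graph_mb = 1158         # model.onnx + model.onnx.data
total_mb = int8_weights_mb + onnx_graph_mb   # 1451 MB total

original_mb = 6529
ratio = original_mb / total_mb               # ~4.5x smaller
```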

Included Files

clip-vit-large-patch14-int8/
├── model_int8.pt              # 293 MB — INT8 quantized state dict
├── model.onnx                 # 2.0 MB — ONNX graph structure
├── model.onnx.data            # 1.2 GB — ONNX external weights
├── config.json                # Model configuration
├── preprocessor_config.json   # Image preprocessing config
├── tokenizer_config.json      # Text tokenizer config
└── README.md                  # This file

Quick Start

PyTorch (INT8 Weights)

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load original architecture
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")

# Load INT8 quantized vision encoder weights
int8_state = torch.load("model_int8.pt", map_location="cuda", weights_only=True)
model.vision_model.load_state_dict(int8_state, strict=False)

# Run inference (image is a PIL.Image, e.g. loaded with Image.open)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
inputs = processor(images=image, text=["a robot arm", "a table", "a cup"], return_tensors="pt")
outputs = model(**inputs)
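The `outputs.logits_per_image` returned above are image-text similarity scores scaled by CLIP's learned temperature; a softmax over the text prompts turns them into zero-shot class probabilities. A minimal NumPy sketch with hypothetical logits (the values below are illustrative, not model output):

```python
import numpy as np

def zero_shot_probs(logits: np.ndarray) -> np.ndarray:
    """Softmax over the text prompts (last axis), numerically stabilized."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical logits_per_image for one image vs. three prompts
logits = np.array([[24.1, 18.3, 15.7]], dtype=np.float32)
probs = zero_shot_probs(logits)  # highest probability -> first prompt
```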

ONNX Runtime (Recommended for Deployment)

import onnxruntime as ort
import numpy as np

# GPU inference
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)

# Preprocess image to (1, 3, 224, 224) float32
pixel_values = preprocess(image)  # Your preprocessing pipeline
outputs = session.run(None, {"pixel_values": pixel_values})
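The `preprocess` placeholder above must match CLIP's training-time pipeline: resize the shorter side to 224, center-crop to 224 × 224, scale to [0, 1], and normalize with CLIP's channel statistics. A minimal NumPy/PIL sketch (the function name mirrors the placeholder; the mean/std values are the standard CLIP normalization constants):

```python
import numpy as np
from PIL import Image

CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)

def preprocess(image: Image.Image) -> np.ndarray:
    """Resize, center-crop, and normalize a PIL image to (1, 3, 224, 224) float32."""
    # Resize so the shorter side is 224, then center-crop a 224x224 patch
    w, h = image.size
    scale = 224 / min(w, h)
    image = image.resize((round(w * scale), round(h * scale)), Image.BICUBIC)
    w, h = image.size
    left, top = (w - 224) // 2, (h - 224) // 2
    image = image.convert("RGB").crop((left, top, left + 224, top + 224))

    # HWC uint8 -> normalized CHW float32 with a leading batch dimension
    arr = np.asarray(image, dtype=np.float32) / 255.0
    arr = (arr - CLIP_MEAN) / CLIP_STD
    return arr.transpose(2, 0, 1)[None, :]
```

In production you may prefer `CLIPProcessor` from transformers, which applies the same pipeline.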

With FORGE (ANIMA Integration)

from forge.vision import VisionEncoderRegistry

# FORGE auto-detects INT8 weights and loads optimally
encoder = VisionEncoderRegistry.load("clip-vit-large-patch14-int8")
features = encoder(image_tensor)  # (B, 257, 1024)

Use Cases in ANIMA

CLIP serves as the visual grounding backbone across multiple ANIMA modules:

  • Object Recognition β€” Zero-shot identification of objects in the robot's workspace without task-specific training
  • Instruction Grounding β€” Matching natural language commands ("pick up the red cup") to visual observations
  • Scene Understanding β€” Encoding visual context for downstream VLA (Vision-Language-Action) models
  • Anomaly Detection β€” Comparing visual embeddings to detect unexpected objects or states
  • Multi-Modal Retrieval β€” Searching robot memory for visually similar past experiences

About ANIMA

ANIMA is a modular, ROS2-native agentic robotics platform developed by RobotFlowLabs. It combines 58 specialized AI modules — from perception and planning to manipulation and safety — into a unified system that enables robots to understand, reason, and act in unstructured real-world environments.

ANIMA modules run on edge hardware (Jetson Orin, industrial PCs) with real-time constraints. Every foundation model we deploy must be compressed without sacrificing the capabilities that make it useful. That's why we built FORGE — our distillation and compression pipeline — and why we're releasing optimized model variants publicly.

We believe the robotics community deserves production-ready models, not just research checkpoints.

Other Models in This Collection

Browse all optimized models at huggingface.co/robotflowlabs:

  • Vision: SAM2, DINOv2, SigLIP, Depth Anything β€” for segmentation, features, and depth
  • Language: Qwen2.5, SmolLM2 β€” for reasoning and instruction following
  • VLM: Qwen2.5-VL β€” for visual question answering and scene description
  • VLA: SmolVLA, RDT2-FM β€” for end-to-end robotic action generation
  • Embeddings: BGE β€” for semantic search and retrieval

Intended Use

Designed For

  • Real-time robotic perception pipelines on edge GPUs
  • Zero-shot visual classification in manipulation and navigation tasks
  • Visual grounding of natural language instructions
  • Feature extraction for downstream VLA models

Limitations

  • INT8 quantization may slightly reduce accuracy on fine-grained classification tasks
  • Vision encoder only β€” text encoder not quantized in this release
  • Requires GPU for ONNX Runtime inference (CPU fallback available but slower)
  • Inherits any biases present in the original CLIP training data (WebImageText)

Out of Scope

  • Medical or safety-critical diagnosis without additional validation
  • Facial recognition or biometric identification
  • Surveillance applications

Technical Details

Compression Pipeline

Original CLIP ViT-L/14 (FP32, 6.5 GB)
    │
    ├─→ torch.quantization.quantize_dynamic (INT8)
    │   └─→ model_int8.pt (293 MB)
    │
    └─→ torch.onnx.export (opset 18, GPU-traced)
        └─→ model.onnx + model.onnx.data (1.2 GB)

  • Quantization: Dynamic INT8 per-tensor symmetric quantization applied to all nn.Linear layers
  • ONNX Export: Traced on an NVIDIA L4 GPU using the PyTorch 2.10 dynamo-based ONNX exporter with opset 18
  • Graph Optimization: 100 pattern rewrite rules applied, unused nodes removed
  • Hardware: Compressed on an NVIDIA L4 (24 GB), CUDA 13.0, Python 3.14

Attribution

Citation

@inproceedings{radford2021learning,
  title={Learning Transferable Visual Models From Natural Language Supervision},
  author={Radford, Alec and Kim, Jong Wook and Hallacy, Chris and Ramesh, Aditya and Goh, Gabriel and Agarwal, Sandhini and Sastry, Girish and Askell, Amanda and Mishkin, Pamela and Clark, Jack and others},
  booktitle={International Conference on Machine Learning},
  year={2021}
}

@misc{robotflowlabs2026anima,
  title={ANIMA: Agentic Networked Intelligence for Modular Autonomy},
  author={RobotFlowLabs},
  year={2026},
  url={https://huggingface.co/robotflowlabs}
}

Built with FORGE by RobotFlowLabs
Optimizing foundation models for real robots.
