# SigLIP SO400M-patch14-384 – INT8 Quantized
Google's SigLIP vision-language encoder quantized to INT8 for real-time robotic perception. 1.6x smaller (3.4 GB down to 2.1 GB) with both vision and text encoding capabilities preserved.

This model is part of the RobotFlowLabs model library, built for the ANIMA agentic robotics platform, a modular ROS2-native AI system that brings foundation model intelligence to real robots operating in the real world.
## Why This Model Exists
SigLIP is the frozen vision encoder used in all FORGE student models and in leading VLA architectures like OpenVLA and SmolVLA. Its sigmoid-based contrastive learning produces better calibrated vision-language alignments than CLIP, especially for robotic manipulation where precise visual grounding of instructions matters. At 3.4 GB, running the full SigLIP model alongside student backbones and action heads is tight on edge hardware.
We quantized SigLIP to INT8 and exported to ONNX for faster inference with smaller memory footprint.
## Model Details
| Property | Value |
|---|---|
| Architecture | SigLIP (Sigmoid Loss for Language-Image Pre-training) |
| Parameters | 400M (vision + text) |
| Vision Hidden Dim | 1152 |
| Vision Layers | 27 transformer blocks |
| Vision Attention Heads | 16 |
| Vision MLP Dim | 4304 |
| Input Resolution | 384 × 384 |
| Patch Size | 14 × 14 |
| Vision Tokens | 729 (27 × 27 patches) |
| Text Hidden Dim | 1152 |
| Text Layers | 27 transformer blocks |
| Original Model | google/siglip-so400m-patch14-384 |
| License | Apache-2.0 |
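The 729 vision tokens in the table follow directly from the patch grid: the patchify convolution strides past the 6-pixel remainder, giving floor(384 / 14) = 27 patches per side, and 27² = 729. A quick sanity check:

```python
image_size = 384
patch_size = 14

patches_per_side = image_size // patch_size  # 27 (integer division drops the remainder)
num_tokens = patches_per_side ** 2

print(num_tokens)  # 729
```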
## Compression Results
Quantized on an NVIDIA L4 24GB GPU using INT8 dynamic quantization with ONNX Runtime export.
| Metric | Original | INT8 Quantized | Change |
|---|---|---|---|
| Total Size | 3,352 MB | 2,067 MB | 1.6x smaller |
| INT8 Weights | – | 430 MB | Vision encoder quantized |
| ONNX Graph | – | 1,637 MB | Full model with optimizations |
| Quantization | FP32 | INT8 Dynamic | Per-tensor symmetric |
| Format | PyTorch | PyTorch INT8 + ONNX | Dual format |
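The headline ratio can be verified from the table's own numbers (sizes in MB, as reported on this card):

```python
original_mb = 3352
quantized_mb = 2067

ratio = original_mb / quantized_mb
saved_mb = original_mb - quantized_mb

print(f"{ratio:.2f}x smaller")  # 1.62x smaller, reported as 1.6x
print(f"{saved_mb} MB freed")   # 1285 MB freed for student backbones and action heads
```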
## Included Files

```
siglip-so400m-patch14-384-int8/
├── model_int8.pt             # 430 MB – INT8 quantized state dict
├── model.onnx                # 2.7 MB – ONNX graph structure
├── model.onnx.data           # 1.6 GB – ONNX external weights
├── config.json               # Model configuration
├── preprocessor_config.json  # Image preprocessing config
├── tokenizer_config.json     # Text tokenizer config
└── README.md                 # This file
```
## Quick Start

### PyTorch (INT8 Weights)

```python
import torch
from PIL import Image
from transformers import SiglipModel, AutoProcessor

# Load the original architecture, then swap in the INT8 vision-encoder weights
model = SiglipModel.from_pretrained("google/siglip-so400m-patch14-384")
int8_state = torch.load("model_int8.pt", map_location="cuda", weights_only=True)
model.vision_model.load_state_dict(int8_state, strict=False)
model.to("cuda").eval()

# Extract joint image/text features
processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")
image = Image.open("scene.jpg")  # any RGB workspace image
inputs = processor(images=image, text=["a robot arm", "a cup"], return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = model(**inputs)
```
### ONNX Runtime (Recommended for Deployment)

```python
import onnxruntime as ort

# GPU inference with CPU fallback
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Preprocess the image to a (1, 3, 384, 384) float32 array
pixel_values = preprocess(image)
outputs = session.run(None, {"pixel_values": pixel_values})
```
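The `preprocess` call above is left to the reader. Here is a minimal NumPy sketch consistent with SigLIP's standard preprocessing (rescale by 1/255, then normalize with mean 0.5 and std 0.5 per channel); it assumes the image is already resized to 384 × 384, and a real deployment should use the shipped `preprocessor_config.json` via `AutoProcessor` instead:

```python
import numpy as np

def preprocess(image: np.ndarray) -> np.ndarray:
    """Turn a 384x384x3 uint8 image into a (1, 3, 384, 384) float32 input.

    Mirrors SigLIP's preprocessor config: rescale to [0, 1], then
    normalize with mean 0.5 / std 0.5, mapping pixels to [-1, 1].
    Resizing is left to the caller (PIL, OpenCV, etc.).
    """
    x = image.astype(np.float32) / 255.0  # rescale to [0, 1]
    x = (x - 0.5) / 0.5                   # normalize to [-1, 1]
    return x.transpose(2, 0, 1)[None]     # HWC -> NCHW, add batch dim
```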
### With FORGE (ANIMA Integration)

```python
from forge.vision import VisionEncoderRegistry

# SigLIP is the default FORGE vision encoder
encoder = VisionEncoderRegistry.load("siglip-so400m-patch14-384-int8")
features = encoder(image_tensor)  # (B, 729, 1152)
```
## Role in FORGE Pipeline

SigLIP is the frozen vision encoder in every FORGE student model:

```
Image → [SigLIP Vision Encoder] → 729 tokens (1152-dim)
                 ↓
        [Bridge Attention] → N queries (d_model-dim)
                 ↓
        [Language Backbone] → contextualized features
                 ↓
        [Action Head] → robot actions
```
The vision encoder is never fine-tuned: it stays frozen across all distillation and fine-tuning stages. This preserves the rich visual representations that make SigLIP valuable while keeping training efficient.
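FORGE's internal training code is not shown here, but in plain PyTorch the frozen-encoder convention above typically looks like the sketch below (the `freeze` helper is illustrative, not a FORGE API):

```python
import torch
from torch import nn

def freeze(module: nn.Module) -> nn.Module:
    """Detach a module from training: no gradients, inference-mode layers."""
    for p in module.parameters():
        p.requires_grad = False
    module.eval()  # also fixes dropout / normalization behavior
    return module

# e.g. freeze(model.vision_model) before distilling a FORGE student;
# only the bridge, language backbone, and action head receive gradients.
```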
## Use Cases in ANIMA

- Visual Grounding – Match natural language instructions to visual observations via sigmoid similarity
- Vision-Language Alignment – Joint embedding space for images and text
- FORGE Student Input – Frozen vision backbone for all FORGE student variants
- Zero-Shot Classification – Identify objects without task-specific training
- Multi-Modal Retrieval – Search robot memory using text or image queries
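The sigmoid similarity used for visual grounding differs from CLIP in a concrete way: each image-text logit is mapped independently to a probability, so candidate labels are never forced to compete through a softmax. A minimal illustration:

```python
import math

def siglip_pair_probability(logit: float) -> float:
    """SigLIP scores each image-text pair with an independent sigmoid,
    unlike CLIP's softmax over all candidate captions."""
    return 1.0 / (1.0 + math.exp(-logit))

# Scores need not sum to 1 across labels: both can be high, or both low
print(round(siglip_pair_probability(2.0), 2))   # 0.88
print(round(siglip_pair_probability(-1.0), 2))  # 0.27
```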
## About ANIMA
ANIMA is a modular, ROS2-native agentic robotics platform developed by RobotFlowLabs. It combines 58 specialized AI modules into a unified system for real-world robotic autonomy.
## Other Collections

- ANIMA Vision – SAM2, DINOv2, CLIP, SigLIP, Depth Anything
- ANIMA Language – Qwen2.5, SmolLM2
- ANIMA VLM – Qwen2.5-VL
- ANIMA VLA – SmolVLA, RDT2-FM, FORGE students
## Intended Use

### Designed For
- Frozen vision encoder in VLA distillation pipelines
- Visual grounding of natural language robot instructions
- Zero-shot visual classification in robotic workspaces
- Feature extraction for downstream action prediction
### Limitations

- INT8 quantization may slightly reduce vision-language alignment on edge cases
- Fixed 384 × 384 input resolution
- Sigmoid loss calibration differs from CLIP's softmax, so it is not a drop-in replacement
- Inherits biases from WebLI training data

### Out of Scope

- Medical diagnosis without domain validation
- Facial recognition or biometric identification
- Surveillance applications
## Technical Details

### Compression Pipeline

```
Original SigLIP SO400M (FP32, 3.4 GB)
        │
        ├── torchao INT8 dynamic quantization (GPU-native, vision encoder)
        │       └── model_int8.pt (430 MB)
        │
        └── torch.onnx.export (opset 18, GPU-traced)
                └── model.onnx + model.onnx.data (1.6 GB)
```
- Quantization: INT8 dynamic activation + INT8 weight via `torchao` on an NVIDIA L4 GPU
- ONNX Export: traced on GPU using PyTorch 2.10, opset 18
- Hardware: NVIDIA L4 24GB, CUDA 13.0, Python 3.14
## Attribution

- Original Model: `google/siglip-so400m-patch14-384` by Google Research
- License: Apache-2.0, free for commercial and research use
- Paper: Sigmoid Loss for Language Image Pre-Training (Zhai et al., 2023)
- Compressed by: RobotFlowLabs using FORGE
## Citation

```bibtex
@inproceedings{zhai2023sigmoid,
  title={Sigmoid Loss for Language Image Pre-Training},
  author={Zhai, Xiaohua and Mustafa, Basil and Kolesnikov, Alexander and Beyer, Lucas},
  booktitle={IEEE/CVF International Conference on Computer Vision},
  year={2023}
}

@misc{robotflowlabs2026anima,
  title={ANIMA: Agentic Networked Intelligence for Modular Autonomy},
  author={RobotFlowLabs},
  year={2026},
  url={https://huggingface.co/robotflowlabs}
}
```
Built with FORGE by RobotFlowLabs
Optimizing foundation models for real robots.