SigLIP SO400M-patch14-384 β€” INT8 Quantized

Google's SigLIP vision-language encoder quantized to INT8 for real-time robotic perception. 1.6x smaller β€” from 3.4 GB to 2.1 GB β€” with both vision and text encoding capabilities preserved.

This model is part of the RobotFlowLabs model library, built for the ANIMA agentic robotics platform β€” a modular ROS2-native AI system that brings foundation model intelligence to real robots operating in the real world.

Why This Model Exists

SigLIP is the frozen vision encoder used in all FORGE student models and in leading VLA architectures like OpenVLA and SmolVLA. Its sigmoid-based contrastive learning produces better calibrated vision-language alignments than CLIP, especially for robotic manipulation where precise visual grounding of instructions matters. At 3.4 GB, running the full SigLIP model alongside student backbones and action heads is tight on edge hardware.
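A minimal, self-contained sketch of that sigmoid scoring (pure Python; `sigmoid_match_score` is a hypothetical helper, not part of this repo, and the `scale` and `bias` values are illustrative stand-ins for the model's learned parameters):

```python
import math

def sigmoid_match_score(img_emb, txt_emb, scale=100.0, bias=-10.0):
    """Score one image/text pair the SigLIP way: each pair gets an
    independent probability via a sigmoid, rather than a softmax over
    a batch of candidates. scale and bias are learned in the real
    model; the values here are illustrative only."""
    # cosine similarity of the two embeddings
    dot = sum(a * b for a, b in zip(img_emb, txt_emb))
    norm = math.sqrt(sum(a * a for a in img_emb)) * math.sqrt(sum(b * b for b in txt_emb))
    logit = scale * (dot / norm) + bias
    return 1.0 / (1.0 + math.exp(-logit))

# Two captions scored independently: probabilities need not sum to 1.
img = [0.2, 0.9, 0.1]
print(sigmoid_match_score(img, [0.2, 0.9, 0.1]))   # near 1.0: perfect match
print(sigmoid_match_score(img, [0.9, -0.2, 0.3]))  # near 0.0: mismatch
```

Because each pair is scored on its own, the probabilities are usable as calibrated match scores, which is what makes the grounding well-behaved for instruction following.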

We quantized SigLIP to INT8 and exported it to ONNX for faster inference and a smaller memory footprint.

Model Details

| Property | Value |
|---|---|
| Architecture | SigLIP (Sigmoid Loss for Language-Image Pre-training) |
| Parameters | 400M (vision + text) |
| Vision Hidden Dim | 1152 |
| Vision Layers | 27 transformer blocks |
| Vision Attention Heads | 16 |
| Vision MLP Dim | 4304 |
| Input Resolution | 384 × 384 |
| Patch Size | 14 × 14 |
| Vision Tokens | 729 (27 × 27 patches) |
| Text Hidden Dim | 1152 |
| Text Layers | 27 transformer blocks |
| Original Model | google/siglip-so400m-patch14-384 |
| License | Apache-2.0 |

Compression Results

Quantized on an NVIDIA L4 24GB GPU using INT8 dynamic quantization with ONNX Runtime export.

| Metric | Original | INT8 Quantized | Change |
|---|---|---|---|
| Total Size | 3,352 MB | 2,067 MB | 1.6x smaller |
| INT8 Weights | — | 430 MB | Vision encoder quantized |
| ONNX Graph | — | 1,637 MB | Full model with optimizations |
| Quantization | FP32 | INT8 Dynamic | Per-tensor symmetric |
| Format | PyTorch | PyTorch INT8 + ONNX | Dual format |
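
The headline ratio follows directly from the totals above:

```python
# Sizes from the compression table (MB)
original_mb = 3352
quantized_mb = 2067  # 430 MB INT8 state dict + ~1,637 MB ONNX graph with external data

ratio = original_mb / quantized_mb
print(f"{ratio:.2f}x smaller")  # 1.62x, reported as 1.6x above
```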

Included Files

siglip-so400m-patch14-384-int8/
β”œβ”€β”€ model_int8.pt              # 430 MB β€” INT8 quantized state dict
β”œβ”€β”€ model.onnx                 # 2.7 MB β€” ONNX graph structure
β”œβ”€β”€ model.onnx.data            # 1.6 GB β€” ONNX external weights
β”œβ”€β”€ config.json                # Model configuration
β”œβ”€β”€ preprocessor_config.json   # Image preprocessing config
β”œβ”€β”€ tokenizer_config.json      # Text tokenizer config
└── README.md                  # This file

Quick Start

PyTorch (INT8 Weights)

import torch
from PIL import Image
from transformers import SiglipModel, AutoProcessor

# Load original architecture
model = SiglipModel.from_pretrained("google/siglip-so400m-patch14-384")

# Load INT8 quantized vision encoder weights
int8_state = torch.load("model_int8.pt", map_location="cuda", weights_only=True)
model.vision_model.load_state_dict(int8_state, strict=False)
model.to("cuda").eval()

# Extract features
processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")
image = Image.open("frame.jpg")  # any RGB image, e.g. a camera frame
inputs = processor(
    images=image,
    text=["a robot arm", "a cup"],
    padding="max_length",  # SigLIP was trained with max-length text padding
    return_tensors="pt",
).to("cuda")
outputs = model(**inputs)

ONNX Runtime (Recommended for Deployment)

import onnxruntime as ort

# GPU inference
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)

# Preprocess image to (1, 3, 384, 384) float32
pixel_values = preprocess(image)
outputs = session.run(None, {"pixel_values": pixel_values})
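
The `preprocess` helper above is left to the caller. Assuming the image is already an RGB uint8 array resized to 384 × 384, it can be sketched in plain NumPy with the values from `preprocessor_config.json` (rescale by 1/255, then normalize with mean 0.5 and std 0.5 per channel):

```python
import numpy as np

def preprocess(image_rgb: np.ndarray) -> np.ndarray:
    """Minimal SigLIP preprocessing for an RGB uint8 array already
    resized to 384x384 (resizing itself is left to PIL/OpenCV)."""
    x = image_rgb.astype(np.float32) / 255.0   # rescale to [0, 1]
    x = (x - 0.5) / 0.5                        # normalize to [-1, 1]
    x = x.transpose(2, 0, 1)[np.newaxis, ...]  # HWC -> (1, 3, 384, 384)
    return x

dummy = np.random.randint(0, 256, (384, 384, 3), dtype=np.uint8)
batch = preprocess(dummy)
print(batch.shape, batch.dtype)  # (1, 3, 384, 384) float32
```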

With FORGE (ANIMA Integration)

from forge.vision import VisionEncoderRegistry

# SigLIP is the default FORGE vision encoder
encoder = VisionEncoderRegistry.load("siglip-so400m-patch14-384-int8")
features = encoder(image_tensor)  # (B, 729, 1152)

Role in FORGE Pipeline

SigLIP is the frozen vision encoder in every FORGE student model:

Image β†’ [SigLIP Vision Encoder] β†’ 729 tokens (1152-dim)
            ↓
      [Bridge Attention] β†’ N queries (d_model-dim)
            ↓
      [Language Backbone] β†’ contextualized features
            ↓
      [Action Head] β†’ robot actions

The vision encoder is never fine-tuned β€” it stays frozen across all distillation and fine-tuning stages. This preserves the rich visual representations that make SigLIP valuable while keeping training efficient.
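
In plain PyTorch the freeze is two idioms. The tiny `nn.Sequential` below is only a stand-in for the real SigLIP vision tower, which would be loaded as in the Quick Start section:

```python
from torch import nn

# Stand-in for model.vision_model (hidden dim 1152, as in the real tower)
vision_encoder = nn.Sequential(nn.Linear(1152, 1152), nn.GELU(), nn.Linear(1152, 1152))

# Freeze: no gradients flow, and eval() keeps norm/dropout deterministic.
for p in vision_encoder.parameters():
    p.requires_grad = False
vision_encoder.eval()

trainable = sum(p.numel() for p in vision_encoder.parameters() if p.requires_grad)
print(trainable)  # 0 -- only the bridge, backbone, and action head receive updates
```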

Use Cases in ANIMA

  • Visual Grounding β€” Match natural language instructions to visual observations via sigmoid similarity
  • Vision-Language Alignment β€” Joint embedding space for images and text
  • FORGE Student Input β€” Frozen vision backbone for all FORGE student variants
  • Zero-Shot Classification β€” Identify objects without task-specific training
  • Multi-Modal Retrieval β€” Search robot memory using text or image queries
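
The retrieval use case reduces to nearest-neighbor search in the shared embedding space. A sketch with random stand-in embeddings (the real ones would come from the vision or text tower):

```python
import numpy as np

def retrieve(query_emb: np.ndarray, memory: np.ndarray, top_k: int = 2):
    """Rank stored embeddings (robot memory) by cosine similarity to a
    query embedding from either the text or vision tower."""
    q = query_emb / np.linalg.norm(query_emb)
    m = memory / np.linalg.norm(memory, axis=1, keepdims=True)
    sims = m @ q
    order = np.argsort(-sims)[:top_k]
    return order, sims[order]

rng = np.random.default_rng(0)
memory = rng.normal(size=(100, 1152))              # 100 stored observation embeddings
query = memory[42] + 0.01 * rng.normal(size=1152)  # query near entry 42
idx, scores = retrieve(query, memory)
print(idx[0])  # 42: the near-duplicate observation ranks first
```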

About ANIMA

ANIMA is a modular, ROS2-native agentic robotics platform developed by RobotFlowLabs. It combines 58 specialized AI modules into a unified system for real-world robotic autonomy.

Intended Use

Designed For

  • Frozen vision encoder in VLA distillation pipelines
  • Visual grounding of natural language robot instructions
  • Zero-shot visual classification in robotic workspaces
  • Feature extraction for downstream action prediction

Limitations

  • INT8 quantization may slightly reduce vision-language alignment on edge cases
  • Fixed 384Γ—384 input resolution
  • Sigmoid loss calibration differs from CLIP's softmax β€” not a drop-in replacement
  • Inherits biases from WebLI training data
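
The sigmoid-vs-softmax caveat can be made concrete with toy logits: softmax scores compete within one distribution, sigmoid scores do not, so thresholds tuned for CLIP will not transfer:

```python
import numpy as np

logits = np.array([2.0, 1.5, -3.0])  # toy image-text logits for three captions

# CLIP-style: softmax forces a distribution over the candidate set.
softmax = np.exp(logits) / np.exp(logits).sum()

# SigLIP-style: each caption gets an independent match probability.
sigmoid = 1.0 / (1.0 + np.exp(-logits))

print(round(softmax.sum(), 6))  # 1.0 -- exactly one "winner" is implied
print(sigmoid)  # two captions can both score high; no sum-to-1 constraint
```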

Out of Scope

  • Medical diagnosis without domain validation
  • Facial recognition or biometric identification
  • Surveillance applications

Technical Details

Compression Pipeline

Original SigLIP SO400M (FP32, 3.4 GB)
    β”‚
    β”œβ”€β†’ torchao INT8 dynamic quantization (GPU-native, vision encoder)
    β”‚   └─→ model_int8.pt (430 MB)
    β”‚
    └─→ torch.onnx.export (opset 18, GPU-traced)
        └─→ model.onnx + model.onnx.data (1.6 GB)

  • Quantization: INT8 dynamic activation + INT8 weight via torchao on NVIDIA L4 GPU
  • ONNX Export: Traced on GPU using PyTorch 2.10, opset 18
  • Hardware: NVIDIA L4 24GB, CUDA 13.0, Python 3.14

Citation

@inproceedings{zhai2023sigmoid,
  title={Sigmoid Loss for Language Image Pre-Training},
  author={Zhai, Xiaohua and Mustafa, Basil and Kolesnikov, Alexander and Beyer, Lucas},
  booktitle={IEEE/CVF International Conference on Computer Vision},
  year={2023}
}

@misc{robotflowlabs2026anima,
  title={ANIMA: Agentic Networked Intelligence for Modular Autonomy},
  author={RobotFlowLabs},
  year={2026},
  url={https://huggingface.co/robotflowlabs}
}

Built with FORGE by RobotFlowLabs
Optimizing foundation models for real robots.
