# SigLIP SO400M-patch14-384 – INT8 Quantized
Google's SigLIP vision-language encoder quantized to INT8 for real-time robotic perception. 1.6x smaller (3.4 GB down to 2.1 GB) with both vision and text encoding capabilities preserved.

This model is part of the RobotFlowLabs model library, built for the ANIMA agentic robotics platform, a modular ROS2-native AI system that brings foundation model intelligence to real robots operating in the real world.
## Why This Model Exists
SigLIP is the frozen vision encoder used in all FORGE student models and in leading VLA architectures like OpenVLA and SmolVLA. Its sigmoid-based contrastive learning produces better calibrated vision-language alignments than CLIP, especially for robotic manipulation where precise visual grounding of instructions matters. At 3.4 GB, running the full SigLIP model alongside student backbones and action heads is tight on edge hardware.
We quantized SigLIP to INT8 and exported to ONNX for faster inference with smaller memory footprint.
## Model Details
| Property | Value |
|---|---|
| Architecture | SigLIP (Sigmoid Loss for Language-Image Pre-training) |
| Parameters | 400M (vision + text) |
| Vision Hidden Dim | 1152 |
| Vision Layers | 27 transformer blocks |
| Vision Attention Heads | 16 |
| Vision MLP Dim | 4304 |
| Input Resolution | 384 × 384 |
| Patch Size | 14 × 14 |
| Vision Tokens | 729 (27 × 27 patches) |
| Text Hidden Dim | 1152 |
| Text Layers | 27 transformer blocks |
| Original Model | google/siglip-so400m-patch14-384 |
| License | Apache-2.0 |
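The 729 vision tokens in the table follow directly from the patch grid: the patchify convolution strides past the 6-pixel remainder, giving floor(384 / 14) = 27 patches per side, and 27² = 729. A quick sanity check:

```python
image_size = 384
patch_size = 14

patches_per_side = image_size // patch_size  # 27 (integer division drops the remainder)
num_tokens = patches_per_side ** 2

print(num_tokens)  # 729
```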
## Compression Results
Quantized on an NVIDIA L4 24GB GPU using INT8 dynamic quantization with ONNX Runtime export.
| Metric | Original | INT8 Quantized | Change |
|---|---|---|---|
| Total Size | 3,352 MB | 2,067 MB | 1.6x smaller |
| INT8 Weights | – | 430 MB | Vision encoder quantized |
| ONNX Graph | – | 1,637 MB | Full model with optimizations |
| Quantization | FP32 | INT8 Dynamic | Per-tensor symmetric |
| Format | PyTorch | PyTorch INT8 + ONNX | Dual format |
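The headline ratio can be verified from the table's own numbers (sizes in MB, as reported on this card):

```python
original_mb = 3352
quantized_mb = 2067

ratio = original_mb / quantized_mb
saved_mb = original_mb - quantized_mb

print(f"{ratio:.2f}x smaller")  # 1.62x smaller, reported as 1.6x
print(f"{saved_mb} MB freed")   # 1285 MB freed for student backbones and action heads
```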
## Included Files

```
siglip-so400m-patch14-384-int8/
├── model_int8.pt             # 430 MB – INT8 quantized state dict
├── model.onnx                # 2.7 MB – ONNX graph structure
├── model.onnx.data           # 1.6 GB – ONNX external weights
├── config.json               # Model configuration
├── preprocessor_config.json  # Image preprocessing config
├── tokenizer_config.json     # Text tokenizer config
└── README.md                 # This file
```
## Quick Start

### PyTorch (INT8 Weights)

```python
import torch
from PIL import Image
from transformers import SiglipModel, AutoProcessor

# Load the original architecture, then swap in the INT8 vision-encoder weights
model = SiglipModel.from_pretrained("google/siglip-so400m-patch14-384")
int8_state = torch.load("model_int8.pt", map_location="cuda", weights_only=True)
model.vision_model.load_state_dict(int8_state, strict=False)
model.to("cuda").eval()

# Extract joint image/text features
processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")
image = Image.open("scene.jpg")  # any RGB workspace image
inputs = processor(images=image, text=["a robot arm", "a cup"], return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = model(**inputs)
```
### ONNX Runtime (Recommended for Deployment)

```python
import onnxruntime as ort

# GPU inference with CPU fallback
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Preprocess the image to a (1, 3, 384, 384) float32 array
pixel_values = preprocess(image)
outputs = session.run(None, {"pixel_values": pixel_values})
```
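The `preprocess` call above is left to the reader. Here is a minimal NumPy sketch consistent with SigLIP's standard preprocessing (rescale by 1/255, then normalize with mean 0.5 and std 0.5 per channel); it assumes the image is already resized to 384 × 384, and a real deployment should use the shipped `preprocessor_config.json` via `AutoProcessor` instead:

```python
import numpy as np

def preprocess(image: np.ndarray) -> np.ndarray:
    """Turn a 384x384x3 uint8 image into a (1, 3, 384, 384) float32 input.

    Mirrors SigLIP's preprocessor config: rescale to [0, 1], then
    normalize with mean 0.5 / std 0.5, mapping pixels to [-1, 1].
    Resizing is left to the caller (PIL, OpenCV, etc.).
    """
    x = image.astype(np.float32) / 255.0  # rescale to [0, 1]
    x = (x - 0.5) / 0.5                   # normalize to [-1, 1]
    return x.transpose(2, 0, 1)[None]     # HWC -> NCHW, add batch dim
```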
### With FORGE (ANIMA Integration)

```python
from forge.vision import VisionEncoderRegistry

# SigLIP is the default FORGE vision encoder
encoder = VisionEncoderRegistry.load("siglip-so400m-patch14-384-int8")
features = encoder(image_tensor)  # (B, 729, 1152)
```
## Role in FORGE Pipeline

SigLIP is the frozen vision encoder in every FORGE student model:

```
Image → [SigLIP Vision Encoder] → 729 tokens (1152-dim)
                 ↓
        [Bridge Attention] → N queries (d_model-dim)
                 ↓
        [Language Backbone] → contextualized features
                 ↓
        [Action Head] → robot actions
```
The vision encoder is never fine-tuned: it stays frozen across all distillation and fine-tuning stages. This preserves the rich visual representations that make SigLIP valuable while keeping training efficient.
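FORGE's internal training code is not shown here, but in plain PyTorch the frozen-encoder convention above typically looks like the sketch below (the `freeze` helper is illustrative, not a FORGE API):

```python
import torch
from torch import nn

def freeze(module: nn.Module) -> nn.Module:
    """Detach a module from training: no gradients, inference-mode layers."""
    for p in module.parameters():
        p.requires_grad = False
    module.eval()  # also fixes dropout / normalization behavior
    return module

# e.g. freeze(model.vision_model) before distilling a FORGE student;
# only the bridge, language backbone, and action head receive gradients.
```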
## Use Cases in ANIMA

- Visual Grounding – Match natural language instructions to visual observations via sigmoid similarity
- Vision-Language Alignment – Joint embedding space for images and text
- FORGE Student Input – Frozen vision backbone for all FORGE student variants
- Zero-Shot Classification – Identify objects without task-specific training
- Multi-Modal Retrieval – Search robot memory using text or image queries
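The sigmoid similarity used for visual grounding differs from CLIP in a concrete way: each image-text logit is mapped independently to a probability, so candidate labels are never forced to compete through a softmax. A minimal illustration:

```python
import math

def siglip_pair_probability(logit: float) -> float:
    """SigLIP scores each image-text pair with an independent sigmoid,
    unlike CLIP's softmax over all candidate captions."""
    return 1.0 / (1.0 + math.exp(-logit))

# Scores need not sum to 1 across labels: both can be high, or both low
print(round(siglip_pair_probability(2.0), 2))   # 0.88
print(round(siglip_pair_probability(-1.0), 2))  # 0.27
```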
## About ANIMA
ANIMA is a modular, ROS2-native agentic robotics platform developed by RobotFlowLabs. It combines 58 specialized AI modules into a unified system for real-world robotic autonomy.
## Other Collections

- ANIMA Vision – SAM2, DINOv2, CLIP, SigLIP, Depth Anything
- ANIMA Language – Qwen2.5, SmolLM2
- ANIMA VLM – Qwen2.5-VL
- ANIMA VLA – SmolVLA, RDT2-FM, FORGE students
## Intended Use

### Designed For
- Frozen vision encoder in VLA distillation pipelines
- Visual grounding of natural language robot instructions
- Zero-shot visual classification in robotic workspaces
- Feature extraction for downstream action prediction
### Limitations

- INT8 quantization may slightly reduce vision-language alignment on edge cases
- Fixed 384 × 384 input resolution
- Sigmoid loss calibration differs from CLIP's softmax, so it is not a drop-in replacement
- Inherits biases from WebLI training data

### Out of Scope

- Medical diagnosis without domain validation
- Facial recognition or biometric identification
- Surveillance applications
## Technical Details

### Compression Pipeline

```
Original SigLIP SO400M (FP32, 3.4 GB)
        │
        ├── torchao INT8 dynamic quantization (GPU-native, vision encoder)
        │       └── model_int8.pt (430 MB)
        │
        └── torch.onnx.export (opset 18, GPU-traced)
                └── model.onnx + model.onnx.data (1.6 GB)
```
- Quantization: INT8 dynamic activation + INT8 weight via `torchao` on an NVIDIA L4 GPU
- ONNX Export: traced on GPU using PyTorch 2.10, opset 18
- Hardware: NVIDIA L4 24GB, CUDA 13.0, Python 3.14
## Attribution

- Original Model: `google/siglip-so400m-patch14-384` by Google Research
- License: Apache-2.0, free for commercial and research use
- Paper: Sigmoid Loss for Language Image Pre-Training (Zhai et al., 2023)
- Compressed by: RobotFlowLabs using FORGE
## Citation

```bibtex
@inproceedings{zhai2023sigmoid,
  title={Sigmoid Loss for Language Image Pre-Training},
  author={Zhai, Xiaohua and Mustafa, Basil and Kolesnikov, Alexander and Beyer, Lucas},
  booktitle={IEEE/CVF International Conference on Computer Vision},
  year={2023}
}

@misc{robotflowlabs2026anima,
  title={ANIMA: Agentic Networked Intelligence for Modular Autonomy},
  author={RobotFlowLabs},
  year={2026},
  url={https://huggingface.co/robotflowlabs}
}
```
Built with FORGE by RobotFlowLabs
Optimizing foundation models for real robots.