# DINOv2 ViT-Large/14 (INT8 Quantized)

Meta's DINOv2 self-supervised vision encoder quantized to INT8 for real-time robotic feature extraction. 1.6x smaller (from 2.3 GB to 1.5 GB), with rich visual representations preserved for downstream robotic tasks.

This model is part of the RobotFlowLabs model library, built for the ANIMA agentic robotics platform: a modular, ROS2-native AI system that brings foundation model intelligence to real robots operating in the real world.
## Why This Model Exists

DINOv2 produces some of the best general-purpose visual features available: dense, semantic representations that transfer to a wide range of downstream tasks without fine-tuning. In robotics, these features power grasp prediction, place recognition, object matching, and scene similarity. But at 2.3 GB, running DINOv2-Large alongside segmentation, depth, and action models is expensive on edge GPUs.

We quantized DINOv2 to INT8 and exported it to ONNX so robots get rich visual features without VRAM bottlenecks.
## Model Details
| Property | Value |
|---|---|
| Architecture | Vision Transformer (ViT-L/14) |
| Parameters | 304M |
| Hidden Dimension | 1024 |
| Layers | 24 transformer blocks |
| Attention Heads | 16 |
| MLP Dimension | 4096 (4x ratio) |
| Input Resolution | 518 × 518 |
| Patch Size | 14 × 14 |
| Tokens | 1,370 (37 × 37 patches + 1 CLS) |
| Training | Self-supervised (no labels) on LVD-142M |
| Original Model | facebook/dinov2-large |
| License | Apache-2.0 |
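The token count in the table follows directly from the geometry: a 518 × 518 input divided into 14 × 14 patches yields a 37 × 37 grid of patch tokens, plus one CLS token. A quick check:

```python
# Derive the token count from the input resolution and patch size
image_size = 518
patch_size = 14

patches_per_side = image_size // patch_size  # 518 / 14 = 37 (divides exactly)
num_tokens = patches_per_side ** 2 + 1       # 37*37 patch tokens + 1 CLS
print(patches_per_side, num_tokens)          # 37 1370
```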
## Compression Results

Quantized on an NVIDIA L4 24GB GPU using INT8 dynamic quantization with ONNX Runtime export.
| Metric | Original | INT8 Quantized | Change |
|---|---|---|---|
| Total Size | 2,322 MB | 1,461 MB | 1.6x smaller |
| INT8 Weights | n/a | 298 MB | Quantized linear layers |
| ONNX Graph | n/a | 1,163 MB | Full model with optimizations |
| Quantization | FP32 | INT8 Dynamic | Per-tensor symmetric |
| Format | PyTorch | PyTorch INT8 + ONNX | Dual format |
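The headline compression ratio is just the size quotient from the table:

```python
# Compression ratio from the sizes reported above
original_mb = 2322
quantized_mb = 1461

ratio = original_mb / quantized_mb
print(f"{ratio:.2f}x smaller")  # 1.59x smaller (reported as 1.6x)
```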
## Included Files

```
dinov2-large-int8/
├── model_int8.pt             # 298 MB - INT8 quantized state dict
├── model.onnx                # 2.6 MB - ONNX graph structure
├── model.onnx.data           # 1.2 GB - ONNX external weights
├── config.json               # Model configuration
├── preprocessor_config.json  # Image preprocessing config
└── README.md                 # This file
```
## Quick Start

### PyTorch (INT8 Weights)

```python
import torch
from transformers import Dinov2Model, AutoImageProcessor

# Load original architecture
model = Dinov2Model.from_pretrained("facebook/dinov2-large")

# Load INT8 quantized weights
int8_state = torch.load("model_int8.pt", map_location="cuda", weights_only=True)
model.load_state_dict(int8_state, strict=False)
model.to("cuda").eval()

# Extract features (image: a PIL.Image or numpy array)
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-large")
inputs = processor(images=image, return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = model(**inputs)

features = outputs.last_hidden_state         # (1, 1370, 1024)
cls_token = outputs.last_hidden_state[:, 0]  # (1, 1024) - global image feature
```
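For dense tasks such as grasp prediction, the 1,369 patch tokens can be reshaped back into their 37 × 37 spatial grid. A minimal sketch, with a random tensor standing in for `outputs.last_hidden_state`:

```python
import torch

# Hypothetical features standing in for outputs.last_hidden_state
features = torch.randn(1, 1370, 1024)         # (B, 1 + 37*37 tokens, hidden)

patch_tokens = features[:, 1:, :]             # drop CLS token -> (1, 1369, 1024)
grid = patch_tokens.reshape(1, 37, 37, 1024)  # 2D map: one 1024-d vector per patch
print(tuple(grid.shape))                      # (1, 37, 37, 1024)
```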
### ONNX Runtime (Recommended for Deployment)

```python
import onnxruntime as ort
import numpy as np

# GPU inference with CPU fallback
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Preprocess image to (1, 3, 518, 518) float32 (preprocess is user-defined)
pixel_values = preprocess(image)
outputs = session.run(None, {"pixel_values": pixel_values})
```
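The `preprocess` call above is left to the user. A minimal NumPy sketch, assuming the standard ImageNet mean/std typically used by the DINOv2 processor (verify against the shipped `preprocessor_config.json`) and an input already resized to 518 × 518:

```python
import numpy as np

# Assumed ImageNet normalization constants; check preprocessor_config.json
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(image_hwc_uint8):
    """(518, 518, 3) uint8 HWC -> (1, 3, 518, 518) float32 NCHW."""
    x = image_hwc_uint8.astype(np.float32) / 255.0  # scale to [0, 1]
    x = (x - MEAN) / STD                            # channel-wise normalization
    x = x.transpose(2, 0, 1)[None]                  # HWC -> CHW, add batch dim
    return np.ascontiguousarray(x)

dummy = np.zeros((518, 518, 3), dtype=np.uint8)
print(preprocess(dummy).shape, preprocess(dummy).dtype)  # (1, 3, 518, 518) float32
```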
### With FORGE (ANIMA Integration)

```python
from forge.vision import VisionEncoderRegistry

# FORGE auto-detects INT8 weights and loads optimally
encoder = VisionEncoderRegistry.load("dinov2-large-int8")
features = encoder(image_tensor)  # (B, 1370, 1024)
```
## Use Cases in ANIMA

DINOv2 serves as the visual representation backbone across ANIMA modules:

- Grasp Prediction: Dense patch features for identifying graspable surfaces and grip points
- Place Recognition: CLS token matching for visual localization in mapped environments
- Object Matching: Patch-level similarity for re-identifying objects across viewpoints
- Scene Similarity: Detecting when the robot encounters familiar vs. novel environments
- Feature Conditioning: Rich visual tokens fed to VLA models for action prediction
- Affordance Detection: Identifying functional properties of surfaces and objects
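The place-recognition pattern above reduces to nearest-neighbor search over CLS tokens. A minimal sketch with random vectors standing in for real CLS features (names are illustrative, not the FORGE/ANIMA API):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
query_cls = rng.standard_normal(1024).astype(np.float32)        # CLS of current view
mapped_cls = rng.standard_normal((5, 1024)).astype(np.float32)  # CLS of 5 mapped places

scores = np.array([cosine_sim(query_cls, place) for place in mapped_cls])
best_match = int(scores.argmax())  # index of the most similar mapped place
```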
## About ANIMA

ANIMA is a modular, ROS2-native agentic robotics platform developed by RobotFlowLabs. It combines 58 specialized AI modules, from perception and planning to manipulation and safety, into a unified system that enables robots to understand, reason, and act in unstructured real-world environments.
## Other Collections

- ANIMA Vision: SAM2, DINOv2, CLIP, SigLIP, Depth Anything
- ANIMA Language: Qwen2.5, SmolLM2
- ANIMA VLM: Qwen2.5-VL
- ANIMA VLA: SmolVLA, RDT2-FM, FORGE students
## Intended Use

### Designed For
- Visual feature extraction for robotic manipulation and navigation
- Dense patch features for grasp prediction and affordance detection
- Scene-level representations for place recognition and mapping
- Feature backbone for downstream VLA models
### Limitations

- INT8 quantization may slightly reduce feature precision for very fine-grained tasks
- Fixed input resolution (518 × 518): images are resized/center-cropped
- Self-supervised features may not capture task-specific semantics without fine-tuning
- Inherits biases from the LVD-142M training data
### Out of Scope
- Medical diagnosis without domain-specific validation
- Facial recognition or biometric identification
- Surveillance applications
## Technical Details

### Compression Pipeline

```
Original DINOv2 ViT-L/14 (FP32, 2.3 GB)
│
├── torchao INT8 dynamic quantization (GPU-native)
│   └── model_int8.pt (298 MB)
│
└── torch.onnx.export (opset 18, GPU-traced)
    └── model.onnx + model.onnx.data (1.2 GB)
```

- Quantization: INT8 dynamic activation + INT8 weight via `torchao` on an NVIDIA L4 GPU
- ONNX Export: Traced on GPU using the PyTorch 2.10 dynamo-based exporter, opset 18
- Hardware: NVIDIA L4 24GB, CUDA 13.0, Python 3.14
## Attribution

- Original Model: `facebook/dinov2-large` by Meta AI (FAIR)
- License: Apache-2.0, free for commercial and research use
- Paper: DINOv2: Learning Robust Visual Features without Supervision (Oquab et al., 2023)
- Dataset: LVD-142M, 142M curated images
- Compressed by: RobotFlowLabs using FORGE
## Citation

```bibtex
@article{oquab2023dinov2,
  title={DINOv2: Learning Robust Visual Features without Supervision},
  author={Oquab, Maxime and Darcet, Timoth{\'e}e and Moutakanni, Th{\'e}o and Vo, Huy and Szafraniec, Marc and Khalidov, Vasil and Fernandez, Pierre and Haziza, Daniel and Massa, Francisco and El-Nouby, Alaaeldin and others},
  journal={arXiv preprint arXiv:2304.07193},
  year={2023}
}

@misc{robotflowlabs2026anima,
  title={ANIMA: Agentic Networked Intelligence for Modular Autonomy},
  author={RobotFlowLabs},
  year={2026},
  url={https://huggingface.co/robotflowlabs}
}
```
Built with FORGE by RobotFlowLabs
Optimizing foundation models for real robots.