DINOv2 ViT-Large/14 β€” INT8 Quantized

Meta's DINOv2 self-supervised vision encoder quantized to INT8 for real-time robotic feature extraction. 1.6x smaller β€” from 2.3 GB to 1.5 GB β€” with rich visual representations preserved for downstream robotic tasks.

This model is part of the RobotFlowLabs model library, built for the ANIMA agentic robotics platform β€” a modular ROS2-native AI system that brings foundation model intelligence to real robots operating in the real world.

Why This Model Exists

DINOv2 produces some of the strongest general-purpose visual features available — dense, semantic representations that transfer to many downstream tasks without fine-tuning. In robotics, these features power grasp prediction, place recognition, object matching, and scene similarity. But at 2.3 GB, running DINOv2-Large alongside segmentation, depth, and action models is expensive on edge GPUs.

We quantized DINOv2 to INT8 and exported to ONNX so robots get rich visual features without VRAM bottlenecks.

Model Details

| Property | Value |
|---|---|
| Architecture | Vision Transformer (ViT-L/14) |
| Parameters | 304M |
| Hidden Dimension | 1024 |
| Layers | 24 transformer blocks |
| Attention Heads | 16 |
| MLP Dimension | 4096 (4x ratio) |
| Input Resolution | 518 × 518 |
| Patch Size | 14 × 14 |
| Tokens | 1,370 (37×37 patches + 1 CLS) |
| Training | Self-supervised (no labels) on LVD-142M |
| Original Model | facebook/dinov2-large |
| License | Apache-2.0 |
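The token count in the table follows directly from the input resolution and patch size; a quick arithmetic check:

```python
# 518x518 input divided into 14x14 patches gives a 37x37 patch grid
patches_per_side = 518 // 14           # 37 (518 is exactly 14 * 37)
num_patches = patches_per_side ** 2    # 1369
num_tokens = num_patches + 1           # +1 for the CLS token
print(num_tokens)  # 1370
```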

Compression Results

Quantized on an NVIDIA L4 24GB GPU using INT8 dynamic quantization with ONNX Runtime export.

| Metric | Original | INT8 Quantized | Change |
|---|---|---|---|
| Total Size | 2,322 MB | 1,461 MB | 1.6x smaller |
| INT8 Weights | — | 298 MB | Quantized linear layers |
| ONNX Graph | — | 1,163 MB | Full model with optimizations |
| Quantization | FP32 | INT8 Dynamic | Per-tensor symmetric |
| Format | PyTorch | PyTorch INT8 + ONNX | Dual format |
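A quick sanity check on the table's totals (all numbers are taken from the rows above; "1.6x" is the rounded ratio):

```python
# Dual-format total = INT8 state dict + ONNX graph with external weights
int8_pt_mb, onnx_mb = 298, 1163
total_mb = int8_pt_mb + onnx_mb    # 1461 MB, matching the Total Size row
ratio = 2322 / total_mb            # ~1.59x, reported as "1.6x smaller"
print(total_mb, round(ratio, 2))
```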

Included Files

dinov2-large-int8/
β”œβ”€β”€ model_int8.pt              # 298 MB β€” INT8 quantized state dict
β”œβ”€β”€ model.onnx                 # 2.6 MB β€” ONNX graph structure
β”œβ”€β”€ model.onnx.data            # 1.2 GB β€” ONNX external weights
β”œβ”€β”€ config.json                # Model configuration
β”œβ”€β”€ preprocessor_config.json   # Image preprocessing config
└── README.md                  # This file

Quick Start

PyTorch (INT8 Weights)

import torch
from PIL import Image
from transformers import Dinov2Model, AutoImageProcessor

# Load the original architecture
model = Dinov2Model.from_pretrained("facebook/dinov2-large")

# Overlay the INT8 quantized weights (strict=False skips keys that
# differ after quantization)
int8_state = torch.load("model_int8.pt", map_location="cuda", weights_only=True)
model.load_state_dict(int8_state, strict=False)
model.to("cuda").eval()

# Extract features from an RGB image ("example.jpg" is a placeholder path)
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-large")
image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = model(**inputs)
features = outputs.last_hidden_state  # (1, 1370, 1024)
cls_token = outputs.last_hidden_state[:, 0]  # (1, 1024) — global feature
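For place recognition or scene similarity, the CLS token is typically compared with cosine similarity. A minimal NumPy sketch, using random stand-in vectors in place of real `cls_token` outputs:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins for two 1024-dim CLS tokens; in practice use
# cls_token.squeeze(0).cpu().numpy() from the extraction code above
rng = np.random.default_rng(0)
feat_a = rng.standard_normal(1024).astype(np.float32)
feat_b = rng.standard_normal(1024).astype(np.float32)

print(cosine_similarity(feat_a, feat_a))  # ~1.0: identical views
print(cosine_similarity(feat_a, feat_b))  # near 0: unrelated random vectors
```

A similarity threshold on this score is how "familiar vs novel environment" decisions are commonly made; the threshold itself is task-dependent.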

ONNX Runtime (Recommended for Deployment)

import onnxruntime as ort
import numpy as np
from PIL import Image

# GPU inference with CPU fallback
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)

# Preprocess to (1, 3, 518, 518) float32 with ImageNet mean/std.
# Simple resize shown here; the bundled preprocessor_config.json is
# authoritative for the exact resize/crop pipeline.
image = Image.open("example.jpg").convert("RGB").resize((518, 518))
pixel_values = np.asarray(image, dtype=np.float32) / 255.0
mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
pixel_values = ((pixel_values - mean) / std).transpose(2, 0, 1)[None]

outputs = session.run(None, {"pixel_values": pixel_values})
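Downstream modules usually want the patch tokens as a spatial feature map. Dropping the CLS token and reshaping the remaining 1,369 tokens to the 37×37 grid, demonstrated on a dummy array with the model's output shape:

```python
import numpy as np

# Dummy output matching DINOv2-L/14 at 518x518: (batch, 1370, 1024)
features = np.zeros((1, 1370, 1024), dtype=np.float32)

cls_token = features[:, 0]        # (1, 1024) global descriptor
patch_tokens = features[:, 1:]    # (1, 1369, 1024) per-patch features

# Rearrange to a channels-first spatial map for conv-style consumers
feature_map = patch_tokens.reshape(1, 37, 37, 1024).transpose(0, 3, 1, 2)
print(feature_map.shape)  # (1, 1024, 37, 37)
```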

With FORGE (ANIMA Integration)

from forge.vision import VisionEncoderRegistry

# FORGE auto-detects INT8 weights and loads optimally
encoder = VisionEncoderRegistry.load("dinov2-large-int8")
features = encoder(image_tensor)  # (B, 1370, 1024)

Use Cases in ANIMA

DINOv2 serves as the visual representation backbone across ANIMA modules:

  • Grasp Prediction β€” Dense patch features for identifying graspable surfaces and grip points
  • Place Recognition β€” CLS token matching for visual localization in mapped environments
  • Object Matching β€” Patch-level similarity for re-identifying objects across viewpoints
  • Scene Similarity β€” Detecting when the robot encounters familiar vs novel environments
  • Feature Conditioning β€” Rich visual tokens fed to VLA models for action prediction
  • Affordance Detection β€” Identifying functional properties of surfaces and objects
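Patch-level matching, as used for object re-identification above, reduces to nearest-neighbor search over L2-normalized patch features. A minimal NumPy sketch with stand-in features (real inputs would be the patch tokens from the extraction examples):

```python
import numpy as np

def match_patches(src: np.ndarray, dst: np.ndarray) -> np.ndarray:
    """For each source patch, index of the most similar destination patch."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    dst = dst / np.linalg.norm(dst, axis=1, keepdims=True)
    return (src @ dst.T).argmax(axis=1)  # cosine similarity, best match

# Stand-ins for patch features from two viewpoints (1369 patches x 1024 dims)
rng = np.random.default_rng(0)
view_a = rng.standard_normal((1369, 1024)).astype(np.float32)
view_b = view_a.copy()  # identical scene -> expect identity matching

matches = match_patches(view_a, view_b)
print((matches == np.arange(1369)).all())  # True: each patch matches itself
```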

About ANIMA

ANIMA is a modular, ROS2-native agentic robotics platform developed by RobotFlowLabs. It combines 58 specialized AI modules β€” from perception and planning to manipulation and safety β€” into a unified system that enables robots to understand, reason, and act in unstructured real-world environments.

Intended Use

Designed For

  • Visual feature extraction for robotic manipulation and navigation
  • Dense patch features for grasp prediction and affordance detection
  • Scene-level representations for place recognition and mapping
  • Feature backbone for downstream VLA models

Limitations

  • INT8 quantization may slightly reduce feature precision for very fine-grained tasks
  • Fixed input resolution (518Γ—518) β€” images are resized/center-cropped
  • Self-supervised features may not capture task-specific semantics without fine-tuning
  • Inherits biases from LVD-142M training data

Out of Scope

  • Medical diagnosis without domain-specific validation
  • Facial recognition or biometric identification
  • Surveillance applications

Technical Details

Compression Pipeline

Original DINOv2 ViT-L/14 (FP32, 2.3 GB)
    β”‚
    β”œβ”€β†’ torchao INT8 dynamic quantization (GPU-native)
    β”‚   └─→ model_int8.pt (298 MB)
    β”‚
    └─→ torch.onnx.export (opset 18, GPU-traced)
        └─→ model.onnx + model.onnx.data (1.2 GB)

  • Quantization: INT8 dynamic activation + INT8 weight via torchao on NVIDIA L4 GPU
  • ONNX Export: Traced on GPU using PyTorch 2.10 dynamo-based exporter, opset 18
  • Hardware: NVIDIA L4 24GB, CUDA 13.0, Python 3.14

Attribution

Citation

@article{oquab2023dinov2,
  title={DINOv2: Learning Robust Visual Features without Supervision},
  author={Oquab, Maxime and Darcet, Timoth{\'e}e and Moutakanni, Th{\'e}o and Vo, Huy and Szafraniec, Marc and Khalidov, Vasil and Fernandez, Pierre and Haziza, Daniel and Massa, Francisco and El-Nouby, Alaaeldin and others},
  journal={arXiv preprint arXiv:2304.07193},
  year={2023}
}

@misc{robotflowlabs2026anima,
  title={ANIMA: Agentic Networked Intelligence for Modular Autonomy},
  author={RobotFlowLabs},
  year={2026},
  url={https://huggingface.co/robotflowlabs}
}

Built with FORGE by RobotFlowLabs
Optimizing foundation models for real robots.
