# DINOv2 ViT-Large/14 (INT8 Quantized)

Meta's DINOv2 self-supervised vision encoder quantized to INT8 for real-time robotic feature extraction. 1.6x smaller (from 2.3 GB to 1.5 GB), with rich visual representations preserved for downstream robotic tasks.

This model is part of the RobotFlowLabs model library, built for the ANIMA agentic robotics platform: a modular, ROS2-native AI system that brings foundation model intelligence to real robots operating in the real world.
## Why This Model Exists

DINOv2 produces some of the best general-purpose visual features available: dense, semantic representations that transfer to a wide range of downstream tasks without fine-tuning. In robotics, these features power grasp prediction, place recognition, object matching, and scene similarity. But at 2.3 GB, running DINOv2-Large alongside segmentation, depth, and action models is expensive on edge GPUs.

We quantized DINOv2 to INT8 and exported it to ONNX so robots get rich visual features without VRAM bottlenecks.
## Model Details
| Property | Value |
|---|---|
| Architecture | Vision Transformer (ViT-L/14) |
| Parameters | 304M |
| Hidden Dimension | 1024 |
| Layers | 24 transformer blocks |
| Attention Heads | 16 |
| MLP Dimension | 4096 (4x ratio) |
| Input Resolution | 518 × 518 |
| Patch Size | 14 × 14 |
| Tokens | 1,370 (37 × 37 patches + 1 CLS) |
| Training | Self-supervised (no labels) on LVD-142M |
| Original Model | facebook/dinov2-large |
| License | Apache-2.0 |
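The token count in the table follows directly from the geometry: a 518 × 518 input divided into 14 × 14 patches yields a 37 × 37 grid of patch tokens, plus one CLS token. A quick check:

```python
# Derive the token count from the input resolution and patch size
image_size = 518
patch_size = 14

patches_per_side = image_size // patch_size  # 518 / 14 = 37 (divides exactly)
num_tokens = patches_per_side ** 2 + 1       # 37*37 patch tokens + 1 CLS
print(patches_per_side, num_tokens)          # 37 1370
```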
## Compression Results

Quantized on an NVIDIA L4 24GB GPU using INT8 dynamic quantization with ONNX Runtime export.
| Metric | Original | INT8 Quantized | Change |
|---|---|---|---|
| Total Size | 2,322 MB | 1,461 MB | 1.6x smaller |
| INT8 Weights | n/a | 298 MB | Quantized linear layers |
| ONNX Graph | n/a | 1,163 MB | Full model with optimizations |
| Quantization | FP32 | INT8 Dynamic | Per-tensor symmetric |
| Format | PyTorch | PyTorch INT8 + ONNX | Dual format |
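The headline compression ratio is just the size quotient from the table:

```python
# Compression ratio from the sizes reported above
original_mb = 2322
quantized_mb = 1461

ratio = original_mb / quantized_mb
print(f"{ratio:.2f}x smaller")  # 1.59x smaller (reported as 1.6x)
```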
## Included Files

```
dinov2-large-int8/
├── model_int8.pt             # 298 MB - INT8 quantized state dict
├── model.onnx                # 2.6 MB - ONNX graph structure
├── model.onnx.data           # 1.2 GB - ONNX external weights
├── config.json               # Model configuration
├── preprocessor_config.json  # Image preprocessing config
└── README.md                 # This file
```
## Quick Start

### PyTorch (INT8 Weights)

```python
import torch
from transformers import Dinov2Model, AutoImageProcessor

# Load original architecture
model = Dinov2Model.from_pretrained("facebook/dinov2-large")

# Load INT8 quantized weights
int8_state = torch.load("model_int8.pt", map_location="cuda", weights_only=True)
model.load_state_dict(int8_state, strict=False)
model.to("cuda").eval()

# Extract features (image: a PIL.Image or numpy array)
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-large")
inputs = processor(images=image, return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = model(**inputs)

features = outputs.last_hidden_state         # (1, 1370, 1024)
cls_token = outputs.last_hidden_state[:, 0]  # (1, 1024) - global image feature
```
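For dense tasks such as grasp prediction, the 1,369 patch tokens can be reshaped back into their 37 × 37 spatial grid. A minimal sketch, with a random tensor standing in for `outputs.last_hidden_state`:

```python
import torch

# Hypothetical features standing in for outputs.last_hidden_state
features = torch.randn(1, 1370, 1024)         # (B, 1 + 37*37 tokens, hidden)

patch_tokens = features[:, 1:, :]             # drop CLS token -> (1, 1369, 1024)
grid = patch_tokens.reshape(1, 37, 37, 1024)  # 2D map: one 1024-d vector per patch
print(tuple(grid.shape))                      # (1, 37, 37, 1024)
```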
### ONNX Runtime (Recommended for Deployment)

```python
import onnxruntime as ort
import numpy as np

# GPU inference with CPU fallback
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Preprocess image to (1, 3, 518, 518) float32 (preprocess is user-defined)
pixel_values = preprocess(image)
outputs = session.run(None, {"pixel_values": pixel_values})
```
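The `preprocess` call above is left to the user. A minimal NumPy sketch, assuming the standard ImageNet mean/std typically used by the DINOv2 processor (verify against the shipped `preprocessor_config.json`) and an input already resized to 518 × 518:

```python
import numpy as np

# Assumed ImageNet normalization constants; check preprocessor_config.json
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(image_hwc_uint8):
    """(518, 518, 3) uint8 HWC -> (1, 3, 518, 518) float32 NCHW."""
    x = image_hwc_uint8.astype(np.float32) / 255.0  # scale to [0, 1]
    x = (x - MEAN) / STD                            # channel-wise normalization
    x = x.transpose(2, 0, 1)[None]                  # HWC -> CHW, add batch dim
    return np.ascontiguousarray(x)

dummy = np.zeros((518, 518, 3), dtype=np.uint8)
print(preprocess(dummy).shape, preprocess(dummy).dtype)  # (1, 3, 518, 518) float32
```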
### With FORGE (ANIMA Integration)

```python
from forge.vision import VisionEncoderRegistry

# FORGE auto-detects INT8 weights and loads optimally
encoder = VisionEncoderRegistry.load("dinov2-large-int8")
features = encoder(image_tensor)  # (B, 1370, 1024)
```
## Use Cases in ANIMA

DINOv2 serves as the visual representation backbone across ANIMA modules:

- Grasp Prediction: Dense patch features for identifying graspable surfaces and grip points
- Place Recognition: CLS token matching for visual localization in mapped environments
- Object Matching: Patch-level similarity for re-identifying objects across viewpoints
- Scene Similarity: Detecting when the robot encounters familiar vs. novel environments
- Feature Conditioning: Rich visual tokens fed to VLA models for action prediction
- Affordance Detection: Identifying functional properties of surfaces and objects
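The place-recognition pattern above reduces to nearest-neighbor search over CLS tokens. A minimal sketch with random vectors standing in for real CLS features (names are illustrative, not the FORGE/ANIMA API):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
query_cls = rng.standard_normal(1024).astype(np.float32)        # CLS of current view
mapped_cls = rng.standard_normal((5, 1024)).astype(np.float32)  # CLS of 5 mapped places

scores = np.array([cosine_sim(query_cls, place) for place in mapped_cls])
best_match = int(scores.argmax())  # index of the most similar mapped place
```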
## About ANIMA

ANIMA is a modular, ROS2-native agentic robotics platform developed by RobotFlowLabs. It combines 58 specialized AI modules, from perception and planning to manipulation and safety, into a unified system that enables robots to understand, reason, and act in unstructured real-world environments.
## Other Collections

- ANIMA Vision: SAM2, DINOv2, CLIP, SigLIP, Depth Anything
- ANIMA Language: Qwen2.5, SmolLM2
- ANIMA VLM: Qwen2.5-VL
- ANIMA VLA: SmolVLA, RDT2-FM, FORGE students
## Intended Use

### Designed For
- Visual feature extraction for robotic manipulation and navigation
- Dense patch features for grasp prediction and affordance detection
- Scene-level representations for place recognition and mapping
- Feature backbone for downstream VLA models
### Limitations

- INT8 quantization may slightly reduce feature precision for very fine-grained tasks
- Fixed input resolution (518 × 518): images are resized/center-cropped
- Self-supervised features may not capture task-specific semantics without fine-tuning
- Inherits biases from the LVD-142M training data
### Out of Scope
- Medical diagnosis without domain-specific validation
- Facial recognition or biometric identification
- Surveillance applications
## Technical Details

### Compression Pipeline

```
Original DINOv2 ViT-L/14 (FP32, 2.3 GB)
│
├── torchao INT8 dynamic quantization (GPU-native)
│   └── model_int8.pt (298 MB)
│
└── torch.onnx.export (opset 18, GPU-traced)
    └── model.onnx + model.onnx.data (1.2 GB)
```

- Quantization: INT8 dynamic activation + INT8 weight via `torchao` on an NVIDIA L4 GPU
- ONNX Export: Traced on GPU using the PyTorch 2.10 dynamo-based exporter, opset 18
- Hardware: NVIDIA L4 24GB, CUDA 13.0, Python 3.14
## Attribution

- Original Model: `facebook/dinov2-large` by Meta AI (FAIR)
- License: Apache-2.0, free for commercial and research use
- Paper: DINOv2: Learning Robust Visual Features without Supervision (Oquab et al., 2023)
- Dataset: LVD-142M, 142M curated images
- Compressed by: RobotFlowLabs using FORGE
## Citation

```bibtex
@article{oquab2023dinov2,
  title={DINOv2: Learning Robust Visual Features without Supervision},
  author={Oquab, Maxime and Darcet, Timoth{\'e}e and Moutakanni, Th{\'e}o and Vo, Huy and Szafraniec, Marc and Khalidov, Vasil and Fernandez, Pierre and Haziza, Daniel and Massa, Francisco and El-Nouby, Alaaeldin and others},
  journal={arXiv preprint arXiv:2304.07193},
  year={2023}
}

@misc{robotflowlabs2026anima,
  title={ANIMA: Agentic Networked Intelligence for Modular Autonomy},
  author={RobotFlowLabs},
  year={2026},
  url={https://huggingface.co/robotflowlabs}
}
```
Built with FORGE by RobotFlowLabs
Optimizing foundation models for real robots.