SmolVLM2-500M-DepthAwareVLM
SmolVLM2-500M-DepthAwareVLM extends SmolVLM2-500M-Video-Instruct with a lightweight sidecar pipeline that fuses metric depth maps (from Depth-Anything-V2) and object-detection anchors (from YOLOv8-World) directly into the vision-language forward pass. The goal is grounded spatial reasoning, such as answering "How far is the car?". Basic depth-hint prompting works without any fine-tuning; the learned depth fusion only becomes active once the sidecar modules are fine-tuned.
Architecture
                     Image (RGB)
                          |
               +----------+----------+
               |                     |
       SigLIP ViT-SO/14      Depth-Anything-V2
       (Vision Encoder)      Metric-Outdoor-Small
       86.4M params          (external, not saved)
               |                     |
       Patch embeddings      Depth map (H x W, metres)
               |                     |
               +----> DepthBridge <--+      <- NEW (262 K params)
                   Gated residual fusion
                   gate alpha = 0.0 at init, learns during fine-tuning
                          |
               Connector (pixel-shuffle + MLP)
               11.8M params
                          |
                  LM token sequence
                          |
       [Optional] ObjectAnchorProjector     <- NEW (498 K params)
       YOLOv8-World detections -> K anchor tokens appended
                          |
       SmolLM2 Language Model (Llama backbone)
       361.9M params
                          |
                       Answer
Parameter Breakdown
| Component | Parameters | % of Total |
|---|---|---|
| Vision encoder (SigLIP) | 86,433,024 | 17.006% |
| Connector (pixel-shuffle MLP) | 11,796,480 | 2.321% |
| Language model (SmolLM2) | 361,944,000 | 71.215% |
| DepthBridge (sidecar) | 262,913 | 0.052% |
| ObjectAnchorProjector (sidecar) | 498,240 | 0.098% |
| Sidecar total | 761,153 | 0.150% |
| GRAND TOTAL | 508,243,457 | 100% |
The two sidecar modules contribute 761,153 parameters, about 0.15% of the 508M total; all remaining parameters belong to the frozen base model.
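The percentages in the table can be re-derived from the raw counts. The following check uses only the numbers listed above (the variable and function names are illustrative). It also shows that the itemised rows do not add up to the grand total: the difference is reported as "not itemised" because the table does not say which component it belongs to.

```python
# Parameter counts copied from the table above.
components = {
    "vision_encoder": 86_433_024,
    "connector": 11_796_480,
    "language_model": 361_944_000,
    "depth_bridge": 262_913,
    "object_anchor_projector": 498_240,
}
GRAND_TOTAL = 508_243_457


def share(count: int, total: int = GRAND_TOTAL) -> float:
    """Percentage of the grand total, rounded to three decimals like the table."""
    return round(100 * count / total, 3)


sidecar_total = components["depth_bridge"] + components["object_anchor_projector"]
listed_total = sum(components.values())
not_itemised = GRAND_TOTAL - listed_total  # parameters with no row of their own

for name, count in components.items():
    print(f"{name:>24}: {count:>12,}  {share(count):7.3f}%")
print(f"{'sidecar total':>24}: {sidecar_total:>12,}  {share(sidecar_total):7.3f}%")
print(f"{'not itemised':>24}: {not_itemised:>12,}  {share(not_itemised):7.3f}%")
```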
Sidecar Modules
1. DepthBridge
- Input: Metric depth map `(B, 1, H, W)` from Depth-Anything-V2-Metric-Outdoor-Small
- Architecture: `Conv2d(1->256, k=16, s=16)` -> `LayerNorm(256)` -> `Linear(256->768)`
- Fusion: Gated residual: `patch_emb = patch_emb + gate * depth_features`
- Gate alpha: Initialised at 0.0 (depth is inactive at init, rises naturally during fine-tuning)
- Effect: Vision patches receive metric depth context at the embedding level, before the connector
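The layer list above is enough to sanity-check the stated DepthBridge size. The sketch below counts parameters by hand in plain Python. It rests on assumptions that are not stated in the source: the convolution has a bias, the final linear layer does not, and the gate is a single scalar. These choices were picked because they reproduce the 262,913 figure from the parameter table; the helper names are hypothetical.

```python
def conv2d_params(c_in: int, c_out: int, kernel: int, bias: bool = True) -> int:
    """Weights of a square-kernel Conv2d, plus an optional bias per output channel."""
    return c_in * c_out * kernel * kernel + (c_out if bias else 0)


def layernorm_params(dim: int) -> int:
    """LayerNorm with learnable scale and shift."""
    return 2 * dim


def linear_params(d_in: int, d_out: int, bias: bool = True) -> int:
    """Weights of a Linear layer, plus an optional bias."""
    return d_in * d_out + (d_out if bias else 0)


def depth_bridge_params(hidden: int = 256, embed: int = 768, patch: int = 16,
                        linear_bias: bool = False) -> int:
    """Conv2d(1->hidden) + LayerNorm(hidden) + Linear(hidden->embed) + scalar gate."""
    return (conv2d_params(1, hidden, patch)
            + layernorm_params(hidden)
            + linear_params(hidden, embed, bias=linear_bias)
            + 1)  # assumed: one scalar gate parameter


def depth_tokens(height: int, width: int, patch: int = 16) -> int:
    """Number of depth patches from a stride-`patch`, kernel-`patch` conv without padding."""
    return (height // patch) * (width // patch)


print(depth_bridge_params())                  # 262913, matching the table
print(depth_bridge_params(linear_bias=True))  # 263681, does not match
print(depth_tokens(512, 512))                 # 1024 depth patches for a 512 x 512 map
```

Because the kernel and stride are both 16, each depth patch covers a non-overlapping 16 x 16 region, so the depth features can only be added to the patch embeddings when both grids have the same number of patches.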
2. ObjectAnchorProjector
- Input: YOLOv8-World detections: bounding boxes `(K, 4)` + CLIP class embeddings `(K, 512)` + depth `(K, 1)`
- Architecture: `Linear(517->960)` -> `LayerNorm(960)`
- Fusion: K anchor tokens appended to the LM input sequence after image-text merging
- Note: Enable after fine-tuning. Random weights before training add noise; disable with `config.object_integration = False`
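The input width of 517 follows from concatenating the three per-detection inputs: 4 box coordinates, a 512-dimensional class embedding, and 1 depth value. A minimal sketch of that step in plain Python is shown below. The concatenation order, the "keep the first `max_objects` detections" rule, and the function names are assumptions for illustration; the bias-free linear layer is likewise an assumption, chosen because it reproduces the 498,240 figure from the parameter table.

```python
from typing import List, Sequence

BOX_DIM = 4    # bounding-box coordinates per detection
DEPTH_DIM = 1  # one depth value per detection


def anchor_features(boxes: Sequence[Sequence[float]],
                    class_embs: Sequence[Sequence[float]],
                    depths: Sequence[float],
                    max_objects: int = 20) -> List[List[float]]:
    """Build one [box | class embedding | depth] row per detection, capped at max_objects."""
    if not (len(boxes) == len(class_embs) == len(depths)):
        raise ValueError("boxes, class_embs and depths must have the same length")
    rows = []
    for box, emb, depth in list(zip(boxes, class_embs, depths))[:max_objects]:
        if len(box) != BOX_DIM:
            raise ValueError("each box needs exactly 4 coordinates")
        rows.append([*box, *emb, float(depth)])
    return rows


def projector_params(feature_dim: int = 512, lm_dim: int = 960,
                     linear_bias: bool = False) -> int:
    """Linear(in_dim -> lm_dim) followed by LayerNorm(lm_dim)."""
    in_dim = BOX_DIM + feature_dim + DEPTH_DIM
    linear = in_dim * lm_dim + (lm_dim if linear_bias else 0)
    return linear + 2 * lm_dim  # LayerNorm scale and shift


# 25 dummy detections are truncated to the 20-object limit.
rows = anchor_features([[0.0, 0.0, 1.0, 1.0]] * 25, [[0.0] * 512] * 25, [5.0] * 25)
print(len(rows), len(rows[0]))  # 20 517
print(projector_params())       # 498240, matching the table
```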
Inference Pipeline
    Input image
      |--- Depth-Anything-V2-Metric-Outdoor-Small ---> depth_map (H x W, metres)
      |--- YOLOv8-World (open-vocab) ----------------> boxes, class_emb, depth_vals
      |
      +-> SmolVLM2-500M-DepthAwareVLM
            (depth_map fused via DepthBridge)
            (detections passed as text hint pre-fine-tuning)
              |
          Answer: "The car is 10.81 metres away."
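Before fine-tuning, the detections reach the model as a text hint rather than as anchor tokens. One way to produce such a hint is sketched below: read the depth values inside a detection box and format them in the `[Depth sensor]` style quoted under Limitations. Taking the median over the box is an assumption made here for robustness to background pixels, and `box_depth` and `depth_hint` are hypothetical helper names, not part of the released code. The depth map is a plain nested list so the example stays dependency-free.

```python
from statistics import median
from typing import Sequence, Tuple


def box_depth(depth_map: Sequence[Sequence[float]],
              box: Tuple[float, float, float, float]) -> float:
    """Median depth in metres inside an (x1, y1, x2, y2) pixel box."""
    x1, y1, x2, y2 = (int(v) for v in box)
    values = [d for row in depth_map[y1:y2] for d in row[x1:x2]]
    if not values:
        raise ValueError("box does not overlap the depth map")
    return median(values)


def depth_hint(label: str, metres: float) -> str:
    """Format a detection as a text hint to prepend to the user prompt."""
    return f"[Depth sensor] The {label} is {metres:.2f} metres away."


# Toy 4 x 4 depth map with a uniform depth of 10.81 m.
toy_depth_map = [[10.81] * 4 for _ in range(4)]
hint = depth_hint("car", box_depth(toy_depth_map, (1, 1, 3, 3)))
print(hint)
```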
Usage
Basic inference (PyTorch)
    import torch
    from PIL import Image
    from transformers import AutoProcessor, AutoModelForImageTextToText

    MODEL_ID = "anuragpradhan/SmolVLM2-500M-DepthAwareVLM"
    model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, dtype=torch.float32)
    processor = AutoProcessor.from_pretrained(MODEL_ID)
    # depth_integration=True is already in the saved config
    # DepthBridge is reconstructed automatically by SmolVLMModel.__init__

    image = Image.open("your_image.jpg").convert("RGB")
    messages = [
        {"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": "What is happening in this scene?"},
        ]}
    ]
    prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(images=image, text=prompt, return_tensors="pt")

    # Optional: pass a metric depth map (normalised to [0,1]) from Depth-Anything-V2
    depth_map = inputs.pop("depth_pixel_values", None)

    with torch.no_grad():
        output_ids = model.generate(**inputs, depth_maps=depth_map, max_new_tokens=200)

    # Strip the prompt tokens so only the newly generated answer is decoded
    n = inputs["input_ids"].shape[1]
    answer = processor.batch_decode(output_ids[:, n:], skip_special_tokens=True)[0].strip()
    print(answer)
Full sidecar demo
    # Clone repo and install editable transformers
    git clone https://github.com/huggingface/transformers
    cd transformers && pip install -e ".[dev]"
    pip install ultralytics num2words

    # Run the sidecar demo
    cd examples
    python sidecar_depth_demo.py your_image.jpg "What is the depth of the car?"
Fine-tuning (sidecar modules only)
    from transformers import AutoModelForImageTextToText

    model = AutoModelForImageTextToText.from_pretrained("anuragpradhan/SmolVLM2-500M-DepthAwareVLM")

    # Freeze the 508M base model, train only the 761K sidecar params
    model.freeze_base_models()

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Trainable params: {trainable:,}")  # ~761,153
External Models Required
| Model | Purpose | HF ID |
|---|---|---|
| Depth-Anything-V2-Metric-Outdoor-Small | Metric depth map generation | depth-anything/Depth-Anything-V2-Metric-Outdoor-Small-hf |
| YOLOv8-World | Open-vocabulary object detection | yolov8s-world.pt (ultralytics) |
Config Flags
| Flag | Default | Effect |
|---|---|---|
| `depth_integration` | `True` | Instantiates DepthBridge; passes depth maps through gated residual |
| `object_integration` | `True` | Instantiates ObjectAnchorProjector; appends anchor tokens to sequence |
| `depth_hidden_dim` | `256` | Intermediate channels in DepthBridge Conv2d |
| `object_feature_dim` | `512` | CLIP embedding dimension from YOLOv8-World |
| `max_objects` | `20` | Max YOLO detections per image |
| `depth_gate_init` | `0.0` | Initial value of DepthBridge gate (0 = depth inactive at init) |
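The `depth_gate_init` default of 0.0 is what makes depth fusion a no-op before fine-tuning. A minimal sketch of the fusion formula given earlier (`patch_emb + gate * depth_features`) illustrates this, using plain Python lists in place of tensors. The function name is hypothetical, and the gate is applied directly as in that formula; no squashing function such as tanh or sigmoid is assumed.

```python
from typing import List, Sequence


def gated_residual(patch_emb: Sequence[float],
                   depth_features: Sequence[float],
                   gate: float) -> List[float]:
    """Element-wise patch_emb + gate * depth_features."""
    if len(patch_emb) != len(depth_features):
        raise ValueError("patch and depth features must have the same length")
    return [p + gate * d for p, d in zip(patch_emb, depth_features)]


patch = [1.0, -2.0]
depth = [5.0, 7.0]

at_init = gated_residual(patch, depth, gate=0.0)     # gate at its initial value
after_tune = gated_residual(patch, depth, gate=0.5)  # example of a learned gate

print(at_init)     # [1.0, -2.0]: identical to the vision-only embedding
print(after_tune)  # [3.5, 1.5]: depth now shifts the embedding
```

Starting from an exact identity means the modified model reproduces the base model's outputs at step zero, so fine-tuning begins from the pretrained behaviour rather than from a perturbed one.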
Limitations
- Not fine-tuned for depth tasks. DepthBridge gate alpha = 0.0 at initialisation; depth fusion is inactive until fine-tuned on metric-depth QA data.
- ObjectAnchorProjector is random-initialised. Enabling it before fine-tuning adds noise; it is disabled by default for inference.
- Text hint dependency. Pre-fine-tuning, depth information is injected via a text prompt hint (e.g. `"[Depth sensor] The car is 10.81 metres away."`). The model reads this textually.
- Base model limitations apply. SmolVLM2-500M is a small model; complex spatial reasoning requires the sidecar fine-tuning stage.
Citation
@misc{smolvlm2-depthawarevlm,
title = {SmolVLM2-500M-DepthAwareVLM: Sidecar Depth and Object Grounding for Vision-Language Models},
author = {Anurag Pradhan},
year = {2025},
url = {https://huggingface.co/anuragpradhan/SmolVLM2-500M-DepthAwareVLM},
note = {Built on SmolVLM2-500M-Video-Instruct with DepthBridge and ObjectAnchorProjector sidecar modules}
}
Acknowledgements
- SmolVLM2 by HuggingFace
- Depth Anything V2
- YOLOv8-World