SmolVLM2-500M-DepthAwareVLM

SmolVLM2-500M-DepthAwareVLM extends SmolVLM2-500M-Video-Instruct with a lightweight sidecar pipeline. The pipeline feeds metric depth maps (from Depth-Anything-V2) and object detection anchors (from YOLOv8-World) into the vision-language forward pass, enabling grounded spatial reasoning such as "How far is the car?". Basic depth-hint prompting works without any fine-tuning; the learned fusion paths (DepthBridge and ObjectAnchorProjector) only contribute after the sidecar modules have been fine-tuned.

Architecture

                  Image (RGB)
                      |
           +----------+----------+
           |                     |
    SigLIP ViT-B/16          Depth-Anything-V2
    (Vision Encoder)         Metric-Outdoor-Small
    86.4M params             (external, not saved)
           |                     |
    Patch embeddings         Depth map (H x W, metres)
           |                     |
           +----> DepthBridge <--+   <- NEW (262 K params)
                  Gated residual fusion
                  gate alpha = 0.0 at init, learns during fine-tuning
           |
    Connector (pixel-shuffle + MLP)
    11.8M params
           |
    LM token sequence
           |
    [Optional] ObjectAnchorProjector  <- NEW (498 K params)
               YOLOv8-World detections -> K anchor tokens appended
           |
    SmolLM2 Language Model (Llama backbone)
    361.9M params
           |
         Answer

Parameter Breakdown

Component	Parameters	% of Total
Vision encoder (SigLIP)	86,433,024	17.006%
Connector (pixel-shuffle MLP)	11,796,480	2.321%
Language model (SmolLM2)	361,944,000	71.215%
DepthBridge (sidecar)	262,913	0.052%
ObjectAnchorProjector (sidecar)	498,240	0.098%
Sidecar total	761,153	0.150%
GRAND TOTAL	508,243,457	100%

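The two sidecar modules account for 761,153 parameters, about 0.15% of the 508M total; the rest of the model stays frozen during sidecar fine-tuning. The rows above sum to 460,934,657 (about 90.7%), so 47,308,800 parameters of the grand total are not itemised in this table.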
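The shares in the table can be re-derived from the raw counts. The short check below uses only the numbers listed above; the variable names are illustrative.

```python
# Parameter counts copied from the Parameter Breakdown table.
components = {
    "Vision encoder (SigLIP)": 86_433_024,
    "Connector (pixel-shuffle MLP)": 11_796_480,
    "Language model (SmolLM2)": 361_944_000,
    "DepthBridge (sidecar)": 262_913,
    "ObjectAnchorProjector (sidecar)": 498_240,
}
GRAND_TOTAL = 508_243_457

# Share of each component relative to the grand total, in percent.
shares = {name: 100 * count / GRAND_TOTAL for name, count in components.items()}

sidecar_total = (
    components["DepthBridge (sidecar)"]
    + components["ObjectAnchorProjector (sidecar)"]
)
# Parameters included in the grand total but not listed as a table row.
unlisted = GRAND_TOTAL - sum(components.values())

for name, share in shares.items():
    print(f"{name:34s}{share:7.3f}%")
print(f"Sidecar total: {sidecar_total:,} ({100 * sidecar_total / GRAND_TOTAL:.3f}%)")
print(f"Not itemised:  {unlisted:,}")
```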

Sidecar Modules

1. DepthBridge

Input: Metric depth map (B, 1, H, W) from Depth-Anything-V2-Metric-Outdoor-Small
Architecture: Conv2d(1->256, k=16, s=16) -> LayerNorm(256) -> Linear(256->768)
Fusion: Gated residual: patch_emb = patch_emb + gate * depth_features, where gate is the learnable scalar alpha
Gate alpha: Initialised at 0.0, so depth features contribute nothing at init; its value is learned during fine-tuning
Effect: Vision patches receive metric depth context at the embedding level, before the connector
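The description above can be sketched as a small PyTorch module. This is a minimal reconstruction, not the shipped implementation: the class name `DepthBridgeSketch`, the bias-free `Linear`, and the single scalar gate are assumptions, chosen because that combination reproduces the stated 262,913 parameters.

```python
import torch
from torch import nn


class DepthBridgeSketch(nn.Module):
    """Hypothetical reconstruction of the gated residual depth fusion."""

    def __init__(self, depth_hidden_dim=256, vision_dim=768, patch_size=16, gate_init=0.0):
        super().__init__()
        # Patchify the depth map with the same 16x16 grid as the vision patches.
        self.patchify = nn.Conv2d(1, depth_hidden_dim, kernel_size=patch_size, stride=patch_size)
        self.norm = nn.LayerNorm(depth_hidden_dim)
        # Assumption: no bias here, which is what matches the stated parameter count.
        self.proj = nn.Linear(depth_hidden_dim, vision_dim, bias=False)
        # Assumption: one learnable scalar gate, initialised at 0.0.
        self.gate = nn.Parameter(torch.tensor([gate_init]))

    def forward(self, patch_emb, depth_map):
        feats = self.patchify(depth_map)           # (B, 256, H/16, W/16)
        feats = feats.flatten(2).transpose(1, 2)   # (B, N, 256)
        feats = self.proj(self.norm(feats))        # (B, N, 768)
        return patch_emb + self.gate * feats       # gated residual


bridge = DepthBridgeSketch()
patch_emb = torch.randn(2, 1024, 768)    # 512x512 image -> 32x32 = 1024 patches
depth_map = torch.rand(2, 1, 512, 512)   # stand-in depth values
fused = bridge(patch_emb, depth_map)
n_params = sum(p.numel() for p in bridge.parameters())
print(fused.shape, n_params)
```

With the gate at 0.0 the output equals the input patch embeddings, which is why the base model's behaviour is unchanged until the gate has been trained.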

2. ObjectAnchorProjector

Input: YOLOv8-World detections — bounding boxes (K, 4) + CLIP class embeddings (K, 512) + depth (K, 1)
Architecture: Linear(517->960) -> LayerNorm(960)
Fusion: K anchor tokens appended to the LM input sequence after image-text merging
Note: Enable after fine-tuning. Random weights before training add noise; disable with config.object_integration = False
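A minimal sketch of the projector, under the same caveats: `ObjectAnchorProjectorSketch` is a hypothetical name, and the bias-free `Linear` is inferred from the stated 498,240 parameters rather than confirmed by the source. The 517-wide input is the concatenation of box (4), class embedding (512), and depth (1).

```python
import torch
from torch import nn


class ObjectAnchorProjectorSketch(nn.Module):
    """Hypothetical reconstruction: detections -> K anchor tokens."""

    def __init__(self, object_feature_dim=512, lm_dim=960):
        super().__init__()
        in_dim = 4 + object_feature_dim + 1  # box + class embedding + depth = 517
        # Assumption: no bias, which is what matches the stated parameter count.
        self.proj = nn.Linear(in_dim, lm_dim, bias=False)
        self.norm = nn.LayerNorm(lm_dim)

    def forward(self, boxes, class_emb, depth):
        feats = torch.cat([boxes, class_emb, depth], dim=-1)  # (K, 517)
        return self.norm(self.proj(feats))                    # (K, 960)


projector = ObjectAnchorProjectorSketch()
K = 3  # number of detections
anchors = projector(torch.rand(K, 4), torch.randn(K, 512), torch.rand(K, 1))

# Stand-in for the merged image-text embeddings fed to the language model.
lm_inputs = torch.randn(1, 50, 960)
extended = torch.cat([lm_inputs, anchors.unsqueeze(0)], dim=1)  # append K anchor tokens
n_params = sum(p.numel() for p in projector.parameters())
print(extended.shape, n_params)
```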

Inference Pipeline

Input image
   |--- Depth-Anything-V2-Metric-Outdoor-Small ---> depth_map  (H x W, metres)
   |--- YOLOv8-World (open-vocab) ----------------> boxes, class_emb, depth_vals
   |
   +-> SmolVLM2-500M-DepthAwareVLM
           (depth_map fused via DepthBridge)
           (detections passed as text hint pre-fine-tuning)
           |
        Answer: "The car is 10.81 metres away."
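Before fine-tuning, the detections reach the model as a text hint. The sketch below shows one way such a hint could be built from a depth map and a detection box. The hint wording follows the example in the Limitations section; the helper names and the choice of the median as the per-box depth statistic are assumptions, not the demo script's actual logic.

```python
from statistics import median


def box_depth(depth_map, box):
    """Median depth in metres inside an (x1, y1, x2, y2) pixel box (x2, y2 exclusive)."""
    x1, y1, x2, y2 = box
    values = [depth_map[y][x] for y in range(y1, y2) for x in range(x1, x2)]
    return median(values)


def depth_hint(label, depth_m):
    """Format the text hint that is prepended to the user's question."""
    return f"[Depth sensor] The {label} is {depth_m:.2f} metres away."


# Toy 4x4 depth map in metres: a near object in the centre, far background around it.
depth_map = [
    [30.0, 30.0, 30.0, 30.0],
    [30.0, 10.8, 10.8, 30.0],
    [30.0, 10.8, 10.9, 30.0],
    [30.0, 30.0, 30.0, 30.0],
]
hint = depth_hint("car", box_depth(depth_map, (1, 1, 3, 3)))
print(hint)  # [Depth sensor] The car is 10.80 metres away.
```

The median is less sensitive than the mean to background pixels that fall inside a loose bounding box.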

Usage

Basic inference (PyTorch)

import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

MODEL_ID = "anuragpradhan/SmolVLM2-500M-DepthAwareVLM"

model     = AutoModelForImageTextToText.from_pretrained(MODEL_ID, dtype=torch.float32)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# depth_integration=True is already in the saved config
# DepthBridge is reconstructed automatically by SmolVLMModel.__init__

image = Image.open("your_image.jpg").convert("RGB")

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What is happening in this scene?"},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt")

# Optional: a Depth-Anything-V2 depth map, normalised to [0, 1].
# depth_map is None when the processor output has no "depth_pixel_values" entry.
depth_map = inputs.pop("depth_pixel_values", None)

with torch.no_grad():
    output_ids = model.generate(**inputs, depth_maps=depth_map, max_new_tokens=200)

n = inputs["input_ids"].shape[1]
answer = processor.batch_decode(output_ids[:, n:], skip_special_tokens=True)[0].strip()
print(answer)

Full sidecar demo

# Clone repo and install editable transformers
git clone https://github.com/huggingface/transformers
cd transformers && pip install -e ".[dev]"
pip install ultralytics num2words

# Run the sidecar demo
cd examples
python sidecar_depth_demo.py your_image.jpg "What is the depth of the car?"

Fine-tuning (sidecar modules only)

from transformers import AutoModelForImageTextToText

model = AutoModelForImageTextToText.from_pretrained("anuragpradhan/SmolVLM2-500M-DepthAwareVLM")

# Freeze the base model (~507.5M params) and train only the 761K sidecar params
model.freeze_base_models()

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable params: {trainable:,}")  # ~761,153
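What a helper like `freeze_base_models()` does can be illustrated on a toy module. The sketch below is an assumption about the general pattern (toggle `requires_grad` by parameter name), not the model's actual implementation; the name fragments `depth_bridge` and `object_anchor_projector` are hypothetical.

```python
import torch
from torch import nn

# Hypothetical name fragments identifying the sidecar parameters.
SIDECAR_KEYS = ("depth_bridge", "object_anchor_projector")


def freeze_all_but_sidecars(model, sidecar_keys=SIDECAR_KEYS):
    """Disable gradients everywhere except parameters whose name contains a sidecar key."""
    for name, param in model.named_parameters():
        param.requires_grad = any(key in name for key in sidecar_keys)
    return [p for p in model.parameters() if p.requires_grad]


# Toy stand-in: a "backbone" that should stay frozen and a small sidecar.
toy = nn.ModuleDict({
    "backbone": nn.Linear(8, 8),
    "depth_bridge": nn.Linear(4, 8, bias=False),
})
trainable = freeze_all_but_sidecars(toy)

# Only the still-trainable parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
print(sum(p.numel() for p in trainable))  # 32
```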

External Models Required

Model	Purpose	HF ID
Depth-Anything-V2-Metric-Outdoor-Small	Metric depth map generation	`depth-anything/Depth-Anything-V2-Metric-Outdoor-Small-hf`
YOLOv8-World	Open-vocabulary object detection	`yolov8s-world.pt` (ultralytics)

Config Flags

Flag	Default	Effect
`depth_integration`	`True`	Instantiates DepthBridge; passes depth maps through gated residual
`object_integration`	`True`	Instantiates ObjectAnchorProjector; appends anchor tokens to sequence
`depth_hidden_dim`	`256`	Intermediate channels in DepthBridge Conv2d
`object_feature_dim`	`512`	CLIP embedding dimension from YOLOv8-World
`max_objects`	`20`	Max YOLO detections per image
`depth_gate_init`	`0.0`	Initial value of DepthBridge gate (0 = depth inactive at init)
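How the flags relate to the module shapes can be shown with a plain dataclass that mirrors the table. `SidecarFlags` and its two helpers are illustrative only and are not part of the model's config class.

```python
from dataclasses import dataclass


@dataclass
class SidecarFlags:
    """Illustrative mirror of the config flags and their defaults from the table."""

    depth_integration: bool = True
    object_integration: bool = True
    depth_hidden_dim: int = 256
    object_feature_dim: int = 512
    max_objects: int = 20
    depth_gate_init: float = 0.0

    @property
    def anchor_input_dim(self) -> int:
        # Box (4) + class embedding + depth (1) per detection.
        return 4 + self.object_feature_dim + 1

    def max_extra_tokens(self) -> int:
        # Upper bound on anchor tokens appended to the LM sequence.
        return self.max_objects if self.object_integration else 0


defaults = SidecarFlags()
inference_flags = SidecarFlags(object_integration=False)  # projector not yet trained
print(defaults.anchor_input_dim)           # 517
print(defaults.max_extra_tokens())         # 20
print(inference_flags.max_extra_tokens())  # 0
```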

Limitations

Not fine-tuned for depth tasks. DepthBridge gate alpha = 0.0 at initialisation; depth fusion is inactive until fine-tuned on metric-depth QA data.
ObjectAnchorProjector is randomly initialised. Enabling it before fine-tuning adds noise; keep it disabled for inference (config.object_integration = False) until it has been trained.
Text hint dependency. Pre-fine-tuning, depth information is injected via a text prompt hint (e.g. "[Depth sensor] The car is 10.81 metres away."). The model reads this textually.
Base model limitations apply. SmolVLM2-500M is a small model; complex spatial reasoning requires the sidecar fine-tuning stage.

Citation

@misc{smolvlm2-depthawarevlm,
  title   = {SmolVLM2-500M-DepthAwareVLM: Sidecar Depth and Object Grounding for Vision-Language Models},
  author  = {Anurag Pradhan},
  year    = {2025},
  url     = {https://huggingface.co/anuragpradhan/SmolVLM2-500M-DepthAwareVLM},
  note    = {Built on SmolVLM2-500M-Video-Instruct with DepthBridge and ObjectAnchorProjector sidecar modules}
}

Acknowledgements
