SpatialBot: Precise Spatial Understanding with Vision Language Models
Paper: arXiv 2406.13642
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("visual-question-answering", model="RussRobin/SpatialBot-3B-LoRA")

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("RussRobin/SpatialBot-3B-LoRA", dtype="auto")

SpatialBot is a vision language model (VLM) with spatial understanding and reasoning abilities. It achieves this by precisely understanding depth maps and using them in high-level tasks.
This Hugging Face repo provides the LoRA checkpoints of SpatialBot-3B, which is built on Phi-2 and SigLIP. The model performs well on general VLM tasks and on spatial understanding benchmarks such as SpatialBench.
To use these LoRA weights, you will also need to download the pretrained checkpoint.
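Fetching the checkpoint files can be sketched as below. This is a minimal sketch that assumes the `huggingface_hub` package is installed; the helper names `checkpoint_dir` and `fetch_checkpoint` and the `checkpoints` folder are hypothetical, chosen only for illustration. The only repo id used is the one from this card; the repo id of the pretrained checkpoint is not stated here, so it has to be supplied by the caller.

```python
from pathlib import Path
from typing import Optional

# Repo id taken from this model card.
LORA_REPO_ID = "RussRobin/SpatialBot-3B-LoRA"


def checkpoint_dir(root: str, repo_id: str) -> Path:
    """Map a Hub repo id to a local folder (hypothetical layout: root/owner__name)."""
    return Path(root) / repo_id.replace("/", "__")


def fetch_checkpoint(repo_id: str, root: str = "checkpoints",
                     token: Optional[str] = None) -> Path:
    """Download all files of a Hub repo into its own folder under `root`."""
    # Imported lazily so checkpoint_dir works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = checkpoint_dir(root, repo_id)
    # With token=None, huggingface_hub uses locally saved credentials if present
    # (for example those stored by `hf auth login`), which a gated repo requires.
    snapshot_download(repo_id=repo_id, local_dir=str(target), token=token)
    return target
```

Calling `fetch_checkpoint(LORA_REPO_ID)` would place the LoRA weights under `checkpoints/RussRobin__SpatialBot-3B-LoRA`. The pretrained checkpoint could be fetched the same way once its repo id is known.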
Paper: https://arxiv.org/abs/2406.13642
Code: https://github.com/BAAI-DCAI/SpatialBot
SpatialBench: https://huggingface.co/datasets/RussRobin/SpatialBench
# Gated model: log in with an HF token that has gated-access permission
hf auth login