	# SD-VLM-7B

	[🌐 Homepage](https://cpystan.github.io/SD_VLM_pages/) \| [🤗 Dataset](https://huggingface.co/datasets/cpystan/MSMU) \| [📖 arXiv](https://arxiv.org/abs/2509.17664) \| [GitHub](https://github.com/cpystan/SD-VLM)


	🎯 The SD-VLM architecture enhances a standard Vision-Language Model (VLM) with 3D spatial awareness through a minimal yet effective modification.

	1. Base VLM: Utilizes the LLaVA-1.5-7B framework, consisting of a CLIP-ViT vision encoder, a Vicuna large language model (LLM), and a linear projector connecting them.

	2. Depth Positional Encoding (DPE): The central innovation is the DPE module. It takes an input depth map (from an external estimator such as Depth-Anything-V2) and generates depth-aware embeddings (E_depth). These embeddings are then added directly to the standard image features (E_image) from the vision encoder, giving E_image + E_depth.

	This simple addition injects explicit 3D spatial priors into the model without altering the backbone architecture.

	3. Training Approach: The model is efficiently fine-tuned on the MSMU spatial dataset for one epoch using LoRA, keeping the vision encoder frozen. This allows the LLM and projector to learn how to interpret the depth-enhanced visual features for quantitative reasoning.
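To give a rough sense of why LoRA fine-tuning is cheap, the sketch below compares the trainable parameters of one full weight matrix with those of its low-rank LoRA update (two factors of shapes `d_out x r` and `r x d_in`). The hidden size 4096 matches a 7B Vicuna/LLaMA layer; the rank `r = 8` is purely illustrative and is not taken from the SD-VLM training configuration. The function names are hypothetical.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole d_out x d_in matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters of a LoRA update B @ A, with B: d_out x rank and A: rank x d_in."""
    return rank * (d_out + d_in)


hidden = 4096  # hidden size of a 7B LLaMA-family model
rank = 8       # illustrative rank, NOT the value used by SD-VLM

full = full_finetune_params(hidden, hidden)
lora = lora_params(hidden, hidden, rank)

print(full)          # 16777216
print(lora)          # 65536
print(full // lora)  # 256
```

For a square 4096 x 4096 projection, the low-rank update at this rank trains 256 times fewer parameters than updating the matrix directly.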

	In essence, SD-VLM's structure is defined by a streamlined integration: it upgrades a standard VLM to understand 3D space by fusing depth information into visual features through a parameter-free additive operation, all trained efficiently on targeted data.
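The additive fusion described above can be sketched as follows. This is a minimal illustration, not the released implementation: it assumes a sinusoidal encoding of one depth value per image patch, and the names `depth_positional_encoding` and `add_depth_to_image_features` are hypothetical. Plain Python lists are used so the example runs without dependencies; in the real model these would be tensors.

```python
import math


def depth_positional_encoding(depth: float, dim: int, max_period: float = 10000.0) -> list:
    """Encode one scalar depth value as a `dim`-long sinusoidal vector (assumed form)."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    enc = []
    for k in range(half):
        freq = 1.0 / (max_period ** (k / half))
        enc.append(math.sin(depth * freq))
        enc.append(math.cos(depth * freq))
    return enc


def add_depth_to_image_features(image_feats: list, patch_depths: list) -> list:
    """Return E_image + E_depth per patch; the addition itself has no learned parameters."""
    if len(image_feats) != len(patch_depths):
        raise ValueError("need exactly one depth value per image patch")
    fused = []
    for feat, depth in zip(image_feats, patch_depths):
        e_depth = depth_positional_encoding(depth, len(feat))
        fused.append([f + d for f, d in zip(feat, e_depth)])
    return fused


# Toy input: two patches with 4-dim features and made-up depths.
image_feats = [[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
patch_depths = [0.0, 2.5]
fused = add_depth_to_image_features(image_feats, patch_depths)
print(fused[0])  # [0.0, 1.0, 0.0, 1.0]
```

Because the fused features keep the shape of the original image features, the projector and LLM downstream need no architectural change, which is the point the summary makes.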

	### Model Framework


	<img src="https://huggingface.co/spaces/cpystan/images/resolve/main/framework.png"
	width="100%" />

	### Quick Start!

	```python
	import copy
	import os

	import torch
	from PIL import Image

	from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
	from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
	from llava.model.builder import load_pretrained_model

	model_path = "cpystan/SD-VLM-7B"

	# Load the tokenizer, model, and image processor.
	tokenizer, model, image_processor, context_len = load_pretrained_model(
	    model_path=model_path,
	    model_base=None,
	    model_name=get_model_name_from_path(model_path),
	)

	# Placeholders (not defined in the original snippet): replace with your own inputs.
	image_folder = "path/to/images"
	image_file = "example.jpg"
	prompt = DEFAULT_IMAGE_TOKEN + "\nYour question about the image."
	temperature = 0.2

	# Tokenize the prompt and preprocess the image.
	input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).cuda()
	image = Image.open(os.path.join(image_folder, image_file)).convert("RGB")
	ori_img = copy.deepcopy(image)  # unprocessed copy, passed to generate() as `ori_imgs`
	image_tensor = process_images([image], image_processor, model.config)[0]

	with torch.inference_mode():
	    output_ids = model.generate(
	        input_ids,
	        images=image_tensor.unsqueeze(0).half().to(input_ids.device),
	        image_sizes=[image.size],
	        do_sample=temperature > 0,
	        temperature=temperature,
	        top_p=None,
	        num_beams=1,
	        ori_imgs=[ori_img],
	        max_new_tokens=1024,
	        use_cache=True,
	    )
	response = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
	```


	## 🏆 Mini-Leaderboard
	The table below reports results for each sub-category and the overall average.

	### Results on MSMU-Bench

	| Model | Existence | Object<br>Counting | Scale<br>Est. | Grounding | Relative<br>Position | Absolute<br>Distance | Scale<br>Comparison | Ref. Object<br>Est. | Average |
	| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
	| Large Language Models (LLMs): Text only | | | | | | | | | |
	| GPT-4-Turbo | 12.76 | 5.21 | 13.51 | 12.64 | 24.84 | 7.50 | 36.79 | 12.04 | 15.66 |
	| Qwen2.5 | 4.25 | 0.00 | 0.78 | 13.79 | 0.62 | 0.00 | 16.04 | 1.57 | 4.63 |
	| DeepSeek-V3 | 0.00 | 5.24 | 1.54 | 6.90 | 10.56 | 0.00 | 25.47 | 5.24 | 7.39 |
	| Vision-Language Models (VLMs): Image + Text | | | | | | | | | |
	| GPT-4o | 44.68 | 41.67 | 3.86 | 27.59 | 67.08 | 20.00 | 54.72 | 2.09 | 32.28 |
	| Gemini-2 | 38.30 | 43.75 | 23.94 | 19.54 | 54.66 | 12.50 | 69.81 | 18.85 | 35.17 |
	| Qwen2.5-VL-72B | 59.57 | 35.42 | 1.54 | 13.79 | 57.76 | 2.50 | 66.04 | 9.95 | 30.82 |
	| Qwen2.5-VL-32B | 29.79 | 41.67 | 10.81 | 18.39 | 60.25 | 2.50 | 46.23 | 10.99 | 27.59 |
	| Qwen2.5-VL-7B | 12.76 | 4.17 | 0.00 | 1.15 | 1.24 | 0.00 | 5.66 | 0.52 | 3.19 |
	| Intern-VL3-78B | 47.62 | 42.71 | 6.47 | 26.32 | 56.94 | 13.33 | 64.10 | 16.46 | 33.63 |
	| Intern-VL3-8B | 36.17 | 41.67 | 4.63 | 18.39 | 60.25 | 2.50 | 49.06 | 8.38 | 28.54 |
	| LLaVA-1.5-7B | 1.54 | 36.46 | 5.02 | 20.69 | 42.86 | 5.00 | 38.68 | 0.52 | 19.45 |
	| Depth-encoded VLMs: Image + Depth + Text | | | | | | | | | |
	| SpatialBot | 10.64 | 46.88 | 15.83 | 28.74 | 66.46 | 5.00 | 50.94 | 8.90 | 29.17 |
	| SpatialRGPT | 10.64 | 36.46 | 20.08 | 17.24 | 60.25 | 15.00 | 62.26 | 9.95 | 28.98 |
	| SD-VLM-7B | 87.23 | 47.92 | 51.35 | 42.53 | 75.16 | 40.00 | 55.66 | 46.07 | 56.31 |

	## Examples

	<img src="https://huggingface.co/spaces/cpystan/images/resolve/main/result_vis.png"
	width="100%" />


	## Citation

	BibTeX:
	```bibtex
	@inproceedings{chen2025sdvlm,
	title={SD-VLM: Spatial Measuring and Understanding with Depth-Encoded Vision-Language Models},
	author={Pingyi Chen and Yujing Lou and Shen Cao and Jinhui Guo and Lubin Fan and Yue Wu and Lin Yang and Lizhuang Ma and Jieping Ye},
	booktitle={NeurIPS},
	year={2025},
	}
	```