---
license: other
license_name: gemma-terms-of-use
license_link: https://ai.google.dev/gemma/terms
tags:
- spatial-understanding
- 3d-vision
- vision-language-model
- paligemma2
base_model: google/paligemma2-3b-pt-448
pipeline_tag: visual-question-answering
---

# HiSpatial-3B

Official model weights for **[HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models](https://arxiv.org/abs/2603.25411)** (CVPR 2026).

- **Paper:** [arXiv:2603.25411](https://arxiv.org/abs/2603.25411)
- **Code:** [github.com/microsoft/HiSpatial](https://github.com/microsoft/HiSpatial)
- **Project Page:** [microsoft.github.io/HiSpatial](https://microsoft.github.io/HiSpatial/)

## Model Description

HiSpatial-3B is a vision-language model for hierarchical 3D spatial understanding. It extends [PaliGemma2-3B](https://huggingface.co/google/paligemma2-3b-pt-448) with a depth encoder (Conv2dForXYZ) that fuses RGB image features from SigLIP with 3D point cloud features through a combined multi-modal projector.

| Component | Detail |
|-----------|--------|
| Backbone | PaliGemma2-3B (448px) |
| Vision Encoder | SigLIP-SO400M |
| Language Model | Gemma2-2B |
| Depth Encoder | Conv2dForXYZ |
| Projector | CombinedMultiModalProjector |
| Checkpoint Format | PyTorch state_dict (`weights.pt`, ~14 GB) |

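The fusion described above can be sketched as a toy module: a Conv2d encodes per-pixel XYZ coordinates, and a linear projector maps the concatenated RGB + XYZ features into the language model's embedding space. Note that `ToyCombinedProjector`, all dimensions, and the kernel size below are illustrative assumptions — the actual internals of Conv2dForXYZ and CombinedMultiModalProjector are defined in the official repo, not here.

```python
import torch
import torch.nn as nn

class ToyCombinedProjector(nn.Module):
    """Toy analogue of the RGB + XYZ fusion; not the official implementation."""

    def __init__(self, vision_dim=64, xyz_dim=32, lm_dim=128):
        super().__init__()
        # Analogue of Conv2dForXYZ: 3 input channels (x, y, z) per pixel
        self.xyz_encoder = nn.Conv2d(3, xyz_dim, kernel_size=2, stride=2)
        # Analogue of CombinedMultiModalProjector
        self.proj = nn.Linear(vision_dim + xyz_dim, lm_dim)

    def forward(self, rgb_feats, xyz_map):
        # rgb_feats: (B, N, vision_dim) patch features from the vision encoder
        # xyz_map:   (B, 3, H, W) per-pixel 3D coordinates (e.g. from MoGe)
        xyz_feats = self.xyz_encoder(xyz_map)             # (B, xyz_dim, H/2, W/2)
        xyz_feats = xyz_feats.flatten(2).transpose(1, 2)  # (B, N, xyz_dim)
        # Concatenate per-patch RGB and XYZ features, then project to LM space
        return self.proj(torch.cat([rgb_feats, xyz_feats], dim=-1))

fusion = ToyCombinedProjector()
tokens = fusion(torch.randn(1, 16, 64), torch.randn(1, 3, 8, 8))
print(tokens.shape)  # torch.Size([1, 16, 128])
```

The key design point is that depth enters as a dense per-pixel coordinate map aligned with the image patches, so spatial cues are fused before the language model rather than bolted on afterward.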
## Usage

```python
from hispatial.inference import MoGeProcessor, HiSpatialPredictor

# Initialize the MoGe depth estimator and the HiSpatial predictor
moge = MoGeProcessor(device_name="cuda")
predictor = HiSpatialPredictor(model_load_path="lhzzzzzy/HiSpatial-3B")

# Path to an input image
image = "example.jpg"

# Estimate a 3D point cloud (per-pixel XYZ values) from the image
xyz_values = moge.apply_transform(image)

# Ask a spatial question
answer = predictor.query(
    image=image,
    prompt="Which object is closer to the camera, the chair or the table?",
    xyz_values=xyz_values,
)
print(answer)
```

See the [GitHub repo](https://github.com/microsoft/HiSpatial) for full installation and evaluation instructions.
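Since the checkpoint ships as a plain PyTorch state_dict (`weights.pt`, see the table above) rather than a `from_pretrained`-style directory, it can be inspected with `torch.load` directly. A minimal sketch, using a tiny stand-in module and an in-memory buffer in place of the real 14 GB file:

```python
import io
import torch
import torch.nn as nn

# Stand-in for the real checkpoint: save a tiny module's state_dict to a
# buffer, then load it back the same way you would load weights.pt.
model = nn.Linear(4, 2)
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
buffer.seek(0)

# weights_only=True restricts unpickling to tensors/containers (safer default)
state_dict = torch.load(buffer, map_location="cpu", weights_only=True)
print(sorted(state_dict.keys()))  # ['bias', 'weight']
```

In practice the `HiSpatialPredictor` in the usage example handles weight loading for you; this is only useful for inspecting or converting the raw checkpoint.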

## License

This model is a derivative of [PaliGemma2](https://huggingface.co/google/paligemma2-3b-pt-448), which is based on [Gemma](https://ai.google.dev/gemma). Use of this model is subject to the [Gemma Terms of Use](https://ai.google.dev/gemma/terms), including the [Gemma Prohibited Use Policy](https://ai.google.dev/gemma/prohibited_use_policy).

## Citation

```bibtex
@inproceedings{liang2026hispatial,
  title={HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models},
  author={Liang, Huizhi and Shen, Yichao and Deng, Yu and Xu, Sicheng and Feng, Zhiyuan and Zhang, Tong and Liang, Yaobo and Yang, Jiaolong},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}
```