---
license: other
license_name: gemma-terms-of-use
license_link: https://ai.google.dev/gemma/terms
tags:
- spatial-understanding
- 3d-vision
- vision-language-model
- paligemma2
base_model: google/paligemma2-3b-pt-448
pipeline_tag: visual-question-answering
---

# HiSpatial-3B

Official model weights for **[HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models](https://arxiv.org/abs/2603.25411)** (CVPR 2026).

- **Paper:** [arXiv:2603.25411](https://arxiv.org/abs/2603.25411)
- **Code:** [github.com/microsoft/HiSpatial](https://github.com/microsoft/HiSpatial)
- **Project Page:** [microsoft.github.io/HiSpatial](https://microsoft.github.io/HiSpatial/)

## Model Description

HiSpatial-3B is a vision-language model for hierarchical 3D spatial understanding. It extends [PaliGemma2-3B](https://huggingface.co/google/paligemma2-3b-pt-448) with a depth encoder (Conv2dForXYZ) that fuses RGB image features from SigLIP with 3D point cloud features through a combined multi-modal projector.

| Component | Detail |
|-----------|--------|
| Backbone | PaliGemma2-3B (448px) |
| Vision Encoder | SigLIP-SO400M |
| Language Model | Gemma2-2B |
| Depth Encoder | Conv2dForXYZ |
| Projector | CombinedMultiModalProjector |
| Checkpoint Format | PyTorch state_dict (`weights.pt`, ~14GB) |

## Usage

```python
from hispatial.inference import MoGeProcessor, HiSpatialPredictor

# Initialize MoGe depth estimator and HiSpatial predictor
moge = MoGeProcessor(device_name="cuda")
predictor = HiSpatialPredictor(model_load_path="lhzzzzzy/HiSpatial-3B")

# Load an image
image = "example.jpg"

# Estimate 3D point cloud from the image
xyz_values = moge.apply_transform(image)

# Ask a spatial question
answer = predictor.query(
    image=image,
    prompt="Which object is closer to the camera, the chair or the table?",
    xyz_values=xyz_values,
)
print(answer)
```

See the [GitHub repo](https://github.com/microsoft/HiSpatial) for full installation and evaluation instructions.

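
When asking several questions about the same image, it may be convenient to estimate the point cloud once and reuse it. The helper below is a minimal sketch, not part of the HiSpatial package: `ask_many` is a hypothetical name, and it assumes that `xyz_values` depends only on the image and can therefore be shared across calls to `predictor.query`. It relies only on the `apply_transform` and `query` calls shown in the usage example.

```python
from typing import Any, List, Sequence, Tuple


def ask_many(
    moge: Any,
    predictor: Any,
    image: Any,
    prompts: Sequence[str],
) -> List[Tuple[str, Any]]:
    """Ask several spatial questions about one image (hypothetical helper).

    `moge` is expected to expose `apply_transform(image)` and `predictor`
    to expose `query(image=..., prompt=..., xyz_values=...)`, as in the
    usage example.
    """
    # Assumption: the point cloud is a function of the image alone,
    # so it is estimated once and reused for every prompt.
    xyz_values = moge.apply_transform(image)

    results = []
    for prompt in prompts:
        answer = predictor.query(
            image=image,
            prompt=prompt,
            xyz_values=xyz_values,
        )
        results.append((prompt, answer))
    return results


# Example call, reusing `moge` and `predictor` from the usage example:
# for prompt, answer in ask_many(moge, predictor, "example.jpg", [
#     "Which object is closer to the camera, the chair or the table?",
#     "Is the lamp to the left of the sofa?",
# ]):
#     print(prompt, "->", answer)
```

Returning a list of `(prompt, answer)` pairs rather than a dictionary keeps the input order and tolerates repeated prompts.
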
## License

This model is a derivative of [PaliGemma2](https://huggingface.co/google/paligemma2-3b-pt-448), which is based on [Gemma](https://ai.google.dev/gemma). Use of this model is subject to the [Gemma Terms of Use](https://ai.google.dev/gemma/terms), including the [Gemma Prohibited Use Policy](https://ai.google.dev/gemma/prohibited_use_policy).

## Citation

```bibtex
@inproceedings{liang2026hispatial,
  title={HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models},
  author={Liang, Huizhi and Shen, Yichao and Deng, Yu and Xu, Sicheng and Feng, Zhiyuan and Zhang, Tong and Liang, Yaobo and Yang, Jiaolong},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}
```