---
license: other
license_name: gemma-terms-of-use
license_link: https://ai.google.dev/gemma/terms
tags:
- spatial-understanding
- 3d-vision
- vision-language-model
- paligemma2
base_model: google/paligemma2-3b-pt-448
pipeline_tag: visual-question-answering
---

# HiSpatial-3B

Official model weights for **[HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models](https://arxiv.org/abs/2603.25411)** (CVPR 2026).

- **Paper:** [arXiv:2603.25411](https://arxiv.org/abs/2603.25411)
- **Code:** [github.com/microsoft/HiSpatial](https://github.com/microsoft/HiSpatial)
- **Project Page:** [microsoft.github.io/HiSpatial](https://microsoft.github.io/HiSpatial/)

## Model Description

HiSpatial-3B is a vision-language model for hierarchical 3D spatial understanding. It extends [PaliGemma2-3B](https://huggingface.co/google/paligemma2-3b-pt-448) with a depth encoder (Conv2dForXYZ) that fuses RGB image features from SigLIP with 3D point cloud features through a combined multi-modal projector.

| Component | Detail |
|-----------|--------|
| Backbone | PaliGemma2-3B (448px) |
| Vision Encoder | SigLIP-SO400M |
| Language Model | Gemma2-2B |
| Depth Encoder | Conv2dForXYZ |
| Projector | CombinedMultiModalProjector |
| Checkpoint Format | PyTorch state_dict (`weights.pt`, ~14 GB) |

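The fusion described above can be sketched as a toy module: a Conv2d encodes per-pixel XYZ coordinates, and a linear projector maps the concatenated RGB + XYZ features into the language model's embedding space. Note that `ToyCombinedProjector`, all dimensions, and the kernel size below are illustrative assumptions — the actual internals of Conv2dForXYZ and CombinedMultiModalProjector are defined in the official repo, not here.

```python
import torch
import torch.nn as nn

class ToyCombinedProjector(nn.Module):
    """Toy analogue of the RGB + XYZ fusion; not the official implementation."""

    def __init__(self, vision_dim=64, xyz_dim=32, lm_dim=128):
        super().__init__()
        # Analogue of Conv2dForXYZ: 3 input channels (x, y, z) per pixel
        self.xyz_encoder = nn.Conv2d(3, xyz_dim, kernel_size=2, stride=2)
        # Analogue of CombinedMultiModalProjector
        self.proj = nn.Linear(vision_dim + xyz_dim, lm_dim)

    def forward(self, rgb_feats, xyz_map):
        # rgb_feats: (B, N, vision_dim) patch features from the vision encoder
        # xyz_map:   (B, 3, H, W) per-pixel 3D coordinates (e.g. from MoGe)
        xyz_feats = self.xyz_encoder(xyz_map)             # (B, xyz_dim, H/2, W/2)
        xyz_feats = xyz_feats.flatten(2).transpose(1, 2)  # (B, N, xyz_dim)
        # Concatenate per-patch RGB and XYZ features, then project to LM space
        return self.proj(torch.cat([rgb_feats, xyz_feats], dim=-1))

fusion = ToyCombinedProjector()
tokens = fusion(torch.randn(1, 16, 64), torch.randn(1, 3, 8, 8))
print(tokens.shape)  # torch.Size([1, 16, 128])
```

The key design point is that depth enters as a dense per-pixel coordinate map aligned with the image patches, so spatial cues are fused before the language model rather than bolted on afterward.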
## Usage

```python
from hispatial.inference import MoGeProcessor, HiSpatialPredictor

# Initialize the MoGe depth estimator and the HiSpatial predictor
moge = MoGeProcessor(device_name="cuda")
predictor = HiSpatialPredictor(model_load_path="lhzzzzzy/HiSpatial-3B")

# Path to an input image
image = "example.jpg"

# Estimate a 3D point cloud (per-pixel XYZ values) from the image
xyz_values = moge.apply_transform(image)

# Ask a spatial question
answer = predictor.query(
    image=image,
    prompt="Which object is closer to the camera, the chair or the table?",
    xyz_values=xyz_values,
)
print(answer)
```

See the [GitHub repo](https://github.com/microsoft/HiSpatial) for full installation and evaluation instructions.
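Since the checkpoint ships as a plain PyTorch state_dict (`weights.pt`, see the table above) rather than a `from_pretrained`-style directory, it can be inspected with `torch.load` directly. A minimal sketch, using a tiny stand-in module and an in-memory buffer in place of the real 14 GB file:

```python
import io
import torch
import torch.nn as nn

# Stand-in for the real checkpoint: save a tiny module's state_dict to a
# buffer, then load it back the same way you would load weights.pt.
model = nn.Linear(4, 2)
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
buffer.seek(0)

# weights_only=True restricts unpickling to tensors/containers (safer default)
state_dict = torch.load(buffer, map_location="cpu", weights_only=True)
print(sorted(state_dict.keys()))  # ['bias', 'weight']
```

In practice the `HiSpatialPredictor` in the usage example handles weight loading for you; this is only useful for inspecting or converting the raw checkpoint.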

## License

This model is a derivative of [PaliGemma2](https://huggingface.co/google/paligemma2-3b-pt-448), which is based on [Gemma](https://ai.google.dev/gemma). Use of this model is subject to the [Gemma Terms of Use](https://ai.google.dev/gemma/terms), including the [Gemma Prohibited Use Policy](https://ai.google.dev/gemma/prohibited_use_policy).

## Citation

```bibtex
@inproceedings{liang2026hispatial,
  title={HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models},
  author={Liang, Huizhi and Shen, Yichao and Deng, Yu and Xu, Sicheng and Feng, Zhiyuan and Zhang, Tong and Liang, Yaobo and Yang, Jiaolong},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}
```