---
license: other
license_name: gemma-terms-of-use
license_link: https://ai.google.dev/gemma/terms
tags:
- spatial-understanding
- 3d-vision
- vision-language-model
- paligemma2
base_model: google/paligemma2-3b-pt-448
pipeline_tag: visual-question-answering
---
# HiSpatial-3B
Official model weights for **[HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models](https://arxiv.org/abs/2603.25411)** (CVPR 2026).
- **Paper:** [arXiv:2603.25411](https://arxiv.org/abs/2603.25411)
- **Code:** [github.com/microsoft/HiSpatial](https://github.com/microsoft/HiSpatial)
- **Project Page:** [microsoft.github.io/HiSpatial](https://microsoft.github.io/HiSpatial/)
## Model Description
HiSpatial-3B is a vision-language model for hierarchical 3D spatial understanding. It extends [PaliGemma2-3B](https://huggingface.co/google/paligemma2-3b-pt-448) with a depth encoder (`Conv2dForXYZ`) that takes a 3D point cloud as input. A combined multi-modal projector (`CombinedMultiModalProjector`) then fuses the resulting point cloud features with the RGB image features from SigLIP.
| Component | Detail |
|-----------|--------|
| Backbone | PaliGemma2-3B (448px) |
| Vision Encoder | SigLIP-SO400M |
| Language Model | Gemma2-2B |
| Depth Encoder | Conv2dForXYZ |
| Projector | CombinedMultiModalProjector |
| Checkpoint Format | PyTorch state_dict (`weights.pt`, ~14 GB) |
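
Since the checkpoint is a plain PyTorch state_dict, it can be inspected before any model is built. The sketch below groups parameter counts by the first component of each key name. It assumes `weights.pt` is a flat mapping of parameter names to tensors, which this card does not confirm; `summarize_state_dict` is a hypothetical helper and not part of the HiSpatial codebase.

```python
import math
from collections import Counter


def summarize_state_dict(state_dict):
    """Count parameters per top-level key prefix.

    Hypothetical helper: works on any mapping of name -> object that
    exposes `.shape` (torch tensors and NumPy arrays both do).
    """
    counts = Counter()
    for name, tensor in state_dict.items():
        prefix = name.split(".", 1)[0]  # text before the first dot
        counts[prefix] += math.prod(tensor.shape)  # elements in this tensor
    return dict(counts)


# Assumed usage once weights.pt has been downloaded locally:
# import torch
# state_dict = torch.load("weights.pt", map_location="cpu")
# for prefix, n in summarize_state_dict(state_dict).items():
#     print(f"{prefix}: {n / 1e6:.1f}M parameters")
```

Loading with `map_location="cpu"` keeps the inspection off the GPU, which is useful given the size of the file.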
## Usage
```python
from hispatial.inference import MoGeProcessor, HiSpatialPredictor
# Initialize MoGe depth estimator and HiSpatial predictor
moge = MoGeProcessor(device_name="cuda")
predictor = HiSpatialPredictor(model_load_path="lhzzzzzy/HiSpatial-3B")
# Path to the input image
image = "example.jpg"
# Estimate 3D point cloud from the image
xyz_values = moge.apply_transform(image)
# Ask a spatial question
answer = predictor.query(
    image=image,
    prompt="Which object is closer to the camera, the chair or the table?",
    xyz_values=xyz_values,
)
print(answer)
```
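
When asking several questions about the same image, the point cloud likely only needs to be estimated once, since it is computed from the image alone. A minimal sketch under that assumption, using only the `predictor.query` call shown above; `ask_many` is a hypothetical helper, not part of the `hispatial` package.

```python
def ask_many(predictor, image, xyz_values, prompts):
    """Ask several questions about one image, reusing one point cloud.

    Hypothetical helper: `predictor` is any object with a
    `query(image=..., prompt=..., xyz_values=...)` method.
    Returns a dict mapping each prompt to its answer.
    """
    return {
        prompt: predictor.query(image=image, prompt=prompt, xyz_values=xyz_values)
        for prompt in prompts
    }


# Assumed usage with the objects created in the example above:
# answers = ask_many(
#     predictor,
#     image,
#     xyz_values,
#     [
#         "Which object is closer to the camera, the chair or the table?",
#         "Is the lamp to the left of the sofa?",
#     ],
# )
# for prompt, answer in answers.items():
#     print(prompt, "->", answer)
```
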
See the [GitHub repo](https://github.com/microsoft/HiSpatial) for full installation and evaluation instructions.
## License
This model is a derivative of [PaliGemma2](https://huggingface.co/google/paligemma2-3b-pt-448), which is based on [Gemma](https://ai.google.dev/gemma). Use of this model is subject to the [Gemma Terms of Use](https://ai.google.dev/gemma/terms), including the [Gemma Prohibited Use Policy](https://ai.google.dev/gemma/prohibited_use_policy).
## Citation
```bibtex
@inproceedings{liang2026hispatial,
title={HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models},
author={Liang, Huizhi and Shen, Yichao and Deng, Yu and Xu, Sicheng and Feng, Zhiyuan and Zhang, Tong and Liang, Yaobo and Yang, Jiaolong},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2026}
}
```