license: other
license_name: gemma-terms-of-use
license_link: https://ai.google.dev/gemma/terms
tags:
  - spatial-understanding
  - 3d-vision
  - vision-language-model
  - paligemma2
base_model: google/paligemma2-3b-pt-448
pipeline_tag: visual-question-answering

HiSpatial-3B

Official model weights for HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models (CVPR 2026).

Paper: arXiv:2603.25411
Code: github.com/microsoft/HiSpatial
Project Page: microsoft.github.io/HiSpatial

Model Description

HiSpatial-3B is a vision-language model for hierarchical 3D spatial understanding. It extends PaliGemma2-3B with a depth encoder (Conv2dForXYZ) that encodes 3D point cloud input. A combined multi-modal projector (CombinedMultiModalProjector) then fuses these point cloud features with the RGB image features from SigLIP.

Component	Detail
Backbone	PaliGemma2-3B (448px)
Vision Encoder	SigLIP-SO400M
Language Model	Gemma2-2B
Depth Encoder	Conv2dForXYZ
Projector	CombinedMultiModalProjector
Checkpoint Format	PyTorch state_dict (`weights.pt`, ~14GB)
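
Because the checkpoint ships as a plain PyTorch state_dict rather than a packaged model, it can be useful to check what it contains before loading it into the model. The following is a minimal sketch, not part of the HiSpatial codebase: `summarize_state_dict` and `load_and_summarize` are hypothetical helper names, and it assumes `weights.pt` has already been downloaded locally and holds a flat mapping from parameter names to tensors. The actual top-level prefixes in the file are not documented here, so the grouping is generic.

```python
from collections import defaultdict


def summarize_state_dict(state_dict):
    """Count parameters per top-level name prefix.

    `state_dict` is any mapping from parameter name to an object that
    exposes `numel()`, as torch.Tensor does.
    """
    totals = defaultdict(int)
    for name, tensor in state_dict.items():
        # "a.b.weight" is grouped under "a"; names without a dot group under themselves.
        prefix = name.split(".", 1)[0]
        totals[prefix] += tensor.numel()
    return dict(totals)


def load_and_summarize(path="weights.pt"):
    """Load a checkpoint on CPU and print parameter counts per prefix.

    Assumption: `path` points to a local copy of the checkpoint file.
    """
    import torch  # imported lazily so the summary helper works without PyTorch

    state_dict = torch.load(path, map_location="cpu")
    for prefix, count in sorted(summarize_state_dict(state_dict).items()):
        print(f"{prefix}: {count / 1e6:.1f}M parameters")
```

Loading with `map_location="cpu"` avoids allocating GPU memory just to inspect the file.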

Usage

from hispatial.inference import MoGeProcessor, HiSpatialPredictor

# Initialize MoGe depth estimator and HiSpatial predictor
moge = MoGeProcessor(device_name="cuda")
predictor = HiSpatialPredictor(model_load_path="lhzzzzzy/HiSpatial-3B")

# Path to the input image
image = "example.jpg"

# Estimate 3D point cloud from the image
xyz_values = moge.apply_transform(image)

# Ask a spatial question
answer = predictor.query(
    image=image,
    prompt="Which object is closer to the camera, the chair or the table?",
    xyz_values=xyz_values,
)
print(answer)
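
When several questions refer to the same image, the point cloud only needs to be estimated once. The sketch below wraps the `predictor.query` call shown above in a loop. `ask_many` is a hypothetical helper, not part of the `hispatial` package, and it assumes that `query` can be called repeatedly with the same `xyz_values`.

```python
def ask_many(predictor, image, xyz_values, prompts):
    """Ask several questions about one image, reusing one point cloud.

    `predictor` is any object with a `query(image=..., prompt=..., xyz_values=...)`
    method, such as the HiSpatialPredictor instance created above.
    Returns a dict mapping each prompt to its answer.
    """
    answers = {}
    for prompt in prompts:
        answers[prompt] = predictor.query(
            image=image,
            prompt=prompt,
            xyz_values=xyz_values,
        )
    return answers
```

With the `predictor`, `image`, and `xyz_values` objects from the example above, a call would look like `ask_many(predictor, image, xyz_values, ["Which object is closer to the camera, the chair or the table?", "Is the chair to the left of the table?"])`.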

See the GitHub repository (github.com/microsoft/HiSpatial) for full installation and evaluation instructions.

License

This model is a derivative of PaliGemma2, which is based on Gemma. Use of this model is subject to the Gemma Terms of Use, including the Gemma Prohibited Use Policy.

Citation

@inproceedings{liang2026hispatial,
  title={HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models},
  author={Liang, Huizhi and Shen, Yichao and Deng, Yu and Xu, Sicheng and Feng, Zhiyuan and Zhang, Tong and Liang, Yaobo and Yang, Jiaolong},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}