Instructions to use InternRobotics/G2VLM-Qwen2-VL-2B with libraries and local apps.

- Libraries
- Transformers

How to use InternRobotics/G2VLM-Qwen2-VL-2B with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="InternRobotics/G2VLM-Qwen2-VL-2B")
```

```python
# Load model directly
from transformers import AutoModel

model = AutoModel.from_pretrained("InternRobotics/G2VLM-Qwen2-VL-2B", dtype="auto")
```

- Local Apps
- vLLM

How to use InternRobotics/G2VLM-Qwen2-VL-2B with vLLM:

Install from pip and serve model

```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "InternRobotics/G2VLM-Qwen2-VL-2B"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "InternRobotics/G2VLM-Qwen2-VL-2B",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

Use Docker

```shell
docker model run hf.co/InternRobotics/G2VLM-Qwen2-VL-2B
```
- SGLang

How to use InternRobotics/G2VLM-Qwen2-VL-2B with SGLang:

Install from pip and serve model

```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "InternRobotics/G2VLM-Qwen2-VL-2B" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "InternRobotics/G2VLM-Qwen2-VL-2B",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

Use Docker images

```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "InternRobotics/G2VLM-Qwen2-VL-2B" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "InternRobotics/G2VLM-Qwen2-VL-2B",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

- Docker Model Runner

How to use InternRobotics/G2VLM-Qwen2-VL-2B with Docker Model Runner:

```shell
docker model run hf.co/InternRobotics/G2VLM-Qwen2-VL-2B
```
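The curl calls to the OpenAI-compatible `/v1/completions` endpoint can also be issued from Python. The following is a minimal sketch using only the standard library; the helper name `build_completion_request` is hypothetical, and the base URL assumes the vLLM server from the instructions is listening on port 8000 (use port 30000 for SGLang).

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=512, temperature=0.5):
    """Build a POST request with the same JSON body as the curl examples.

    Hypothetical helper for illustration; not part of vLLM or SGLang.
    """
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_completion_request(
    "http://localhost:8000",  # assumed vLLM default; SGLang example uses :30000
    "InternRobotics/G2VLM-Qwen2-VL-2B",
    "Once upon a time,",
)

# Sending the request needs a running server, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["choices"][0]["text"])
```

Building the request separately from sending it keeps the payload easy to inspect before a server is available.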
---
license: apache-2.0
language:
- en
pipeline_tag: image-text-to-text
tags:
- multimodal
library_name: transformers
base_model:
- Qwen/Qwen2-VL-2B
---

# G2VLM-Qwen2-VL-2B

## Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning

<p align="left">
  <img src="https://huggingface.co/InternRobotics/G2VLM-2B-MoT/resolve/main/assets/icon.png" alt="G2VLM" width="200"/>
</p>

<p align="left">
  <a href="https://gordonhu608.github.io/g2vlm.github.io/">
    <img
      src="https://img.shields.io/badge/G2VLM-Website-0A66C2?logo=safari&logoColor=white" style="display: inline-block; vertical-align: middle;"
      alt="G2VLM Website"
    />
  </a>
  <a href="https://arxiv.org/abs/2511.21688">
    <img
      src="https://img.shields.io/badge/G2VLM-Paper-red?logo=arxiv&logoColor=red" style="display: inline-block; vertical-align: middle;"
      alt="G2VLM Paper on arXiv"
    />
  </a>
  <a href="https://github.com/InternRobotics/G2VLM" target="_blank" style="margin: 2px;">
    <img
      src="https://img.shields.io/badge/G2VLM-Codebase-536af5?color=536af5&logo=github" style="display: inline-block; vertical-align: middle;"
      alt="G2VLM Codebase"
    />
  </a>
</p>

> We present <b>G<sup>2</sup>VLM</b>, a geometry grounded vision-language model proficient in both spatial 3D reconstruction and spatial understanding tasks. For spatial reasoning questions, G<sup>2</sup>VLM can natively predict 3D geometry and use interleaved reasoning to reach an answer.

This repository hosts the base model weights <b>BEFORE</b> the training of <b>G<sup>2</sup>VLM</b>, which are technically the same as Qwen2-VL-2B. We repackage them here so that users can more easily reproduce our training. For installation, usage instructions, and further documentation, please visit our [GitHub repository](https://github.com/InternRobotics/G2VLM).

<p align="left"><img src="https://huggingface.co/InternRobotics/G2VLM-2B-MoT/resolve/main/assets/teaser.png" width="100%"></p>

## 🧠 Method

G<sup>2</sup>VLM is a unified model that integrates a geometric perception expert for 3D reconstruction and a semantic perception expert for multimodal understanding and spatial reasoning tasks. All tokens participate in shared multimodal self-attention in each transformer block.

<p align="left"><img src="https://huggingface.co/InternRobotics/G2VLM-2B-MoT/resolve/main/assets/method.png" width="100%"></p>
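
To make the idea of two experts sharing one self-attention concrete, here is a toy sketch in pure Python. It is not the authors' implementation. It assumes that each expert owns its own query/key/value projections and that attention is unmasked, neither of which is stated in this card. All names (`make_expert`, `shared_attention`) and sizes are hypothetical.

```python
import math
import random


def matvec(weights, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(dim, rng):
    """Assumption: each expert has separate Q/K/V projection matrices."""
    def matrix():
        return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
    return {"q": matrix(), "k": matrix(), "v": matrix()}


def shared_attention(tokens, expert_ids, experts):
    """Project each token with its own expert, then attend over all tokens."""
    dim = len(tokens[0])
    q = [matvec(experts[e]["q"], t) for t, e in zip(tokens, expert_ids)]
    k = [matvec(experts[e]["k"], t) for t, e in zip(tokens, expert_ids)]
    v = [matvec(experts[e]["v"], t) for t, e in zip(tokens, expert_ids)]
    outputs = []
    for query in q:
        # Scores span every token, whichever expert produced its key.
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim) for key in k]
        weights = softmax(scores)
        outputs.append([sum(w * val[d] for w, val in zip(weights, v)) for d in range(dim)])
    return outputs


rng = random.Random(0)
experts = {"semantic": make_expert(4, rng), "geometric": make_expert(4, rng)}
tokens = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(5)]
expert_ids = ["semantic", "semantic", "geometric", "geometric", "geometric"]
mixed = shared_attention(tokens, expert_ids, experts)
```

Because the attention weights are computed over the full sequence, changing a token handled by the geometric expert also changes the outputs of tokens handled by the semantic expert, which is the sense in which the attention is shared.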

## License

G2VLM is licensed under the Apache 2.0 license.

## ✍️ Citation

```bibtex
@article{hu2025g2vlmgeometrygroundedvision,
  title={G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning},
  author={Wenbo Hu and Jingli Lin and Yilin Long and Yunlong Ran and Lihan Jiang and Yifan Wang and Chenming Zhu and Runsen Xu and Tai Wang and Jiangmiao Pang},
  year={2025},
  eprint={2511.21688},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.21688},
}
```