Instructions to use InternRobotics/G2VLM-Qwen2-VL-2B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use InternRobotics/G2VLM-Qwen2-VL-2B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="InternRobotics/G2VLM-Qwen2-VL-2B")

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("InternRobotics/G2VLM-Qwen2-VL-2B", dtype="auto")

Notebooks
Google Colab
Kaggle
Local Apps Settings

vLLM

How to use InternRobotics/G2VLM-Qwen2-VL-2B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "InternRobotics/G2VLM-Qwen2-VL-2B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "InternRobotics/G2VLM-Qwen2-VL-2B",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker

docker model run hf.co/InternRobotics/G2VLM-Qwen2-VL-2B

SGLang

How to use InternRobotics/G2VLM-Qwen2-VL-2B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "InternRobotics/G2VLM-Qwen2-VL-2B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "InternRobotics/G2VLM-Qwen2-VL-2B",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "InternRobotics/G2VLM-Qwen2-VL-2B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "InternRobotics/G2VLM-Qwen2-VL-2B",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use InternRobotics/G2VLM-Qwen2-VL-2B with Docker Model Runner:
```
docker model run hf.co/InternRobotics/G2VLM-Qwen2-VL-2B
```

G2VLM-Qwen2-VL-2B / README.md

gordonhubackup

upload

9e82a2e about 1 month ago

preview code

raw

history blame contribute delete

3.09 kB

	---
	license: apache-2.0
	language:
	- en
	pipeline_tag: image-text-to-text
	tags:
	- multimodal
	library_name: transformers
	base_model:
	- Qwen/Qwen2-VL-2B
	---

	# G2VLM-Qwen2-VL-2B
	## Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning

	<p align="left">
	<img src="https://huggingface.co/InternRobotics/G2VLM-2B-MoT/resolve/main/assets/icon.png" alt="G2VLM" width="200"/>
	</p>


	<p align="left">
	<a href="https://gordonhu608.github.io/g2vlm.github.io/">
	<img
	src="https://img.shields.io/badge/G2VLM-Website-0A66C2?logo=safari&logoColor=white" style="display: inline-block; vertical-align: middle;"
	alt="G2VLM Website"
	/>
	</a>
	<a href="https://arxiv.org/abs/2511.21688">
	<img
	src="https://img.shields.io/badge/G2VLM-Paper-red?logo=arxiv&logoColor=red" style="display: inline-block; vertical-align: middle;"
	alt="G2VLM Paper on arXiv"
	/>
	</a>
	<a href="https://github.com/InternRobotics/G2VLM" target="_blank" style="margin: 2px;">
	<img
	alt="Github" src="https://img.shields.io/badge/G2VLM-Codebase-536af5?color=536af5&logo=github" style="display: inline-block; vertical-align: middle;"
	alt="G2VLM Codebase"
	/>
	</a>
	</p>


	> We present <b>G<sup>2</sup>VLM</b>, a geometry grounded vision-language model proficient in both spatial 3D reconstruction and spatial understanding tasks. For spatial reasoning questions, G<sup>2</sup>VLM can natively predict 3D geometry and employ interleaved reasoning for an answer.


	This repository hosts the base model weights <b>BEFORE</b> the training of <b>G<sup>2</sup>VLM</b>, which is technically the same as Qwen2-VL-2B. Here we format it so it's easier for users to reproduce our trainings. For installation, usage instructions, and further documentation, please visit our [GitHub repository](https://github.com/InternRobotics/G2VLM).


	<p align="left"><img src="https://huggingface.co/InternRobotics/G2VLM-2B-MoT/resolve/main/assets/teaser.png" width="100%"></p>



	## 🧠 Method
	G<sup>2</sup>VLM is a unified model that integrates both a geometric perception expert for 3D reconstruction and a semantic perception expert for multimodal understanding and spatial reasoning tasks. All tokens can do shared multi-modal self attention in each transformer block.

	<p align="left"><img src="https://huggingface.co/InternRobotics/G2VLM-2B-MoT/resolve/main/assets/method.png" width="100%"></p>


	## License
	G2VLM is licensed under the Apache 2.0 license.

	## ✍️ Citation
	```bibtex
	@article{hu2025g2vlmgeometrygroundedvision,
	title={G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning},
	author={Wenbo Hu and Jingli Lin and Yilin Long and Yunlong Ran and Lihan Jiang and Yifan Wang and Chenming Zhu and Runsen Xu and Tai Wang and Jiangmiao Pang},
	year={2025},
	eprint={2511.21688},
	archivePrefix={arXiv},
	primaryClass={cs.CV},
	url={https://arxiv.org/abs/2511.21688},
	}
	```