	---
	license: apache-2.0
	language:
	- en
	library_name: transformers
	base_model:
	- Qwen/Qwen3-VL-8B-Thinking
	pipeline_tag: image-text-to-text
	tags:
	- visual-grounding
	- multimodal
	- qwen3-vl
	- reinforcement-learning
	- grpo
	---

	# EGM-Qwen3-VL-8B

	<p align="center">
	<a href="https://nvlabs.github.io/EGM">[Project Page]</a>
	<a href="https://github.com/NVlabs/EGM">[Code]</a>
	</p>

	<div align="center">
	<img src="https://nvlabs.github.io/EGM/figure4.jpeg" width="90%"/>
	</div>

	## Model Summary

	EGM-Qwen3-VL-8B is the flagship model of the [EGM (Efficient Visual Grounding Language Models)](https://nvlabs.github.io/EGM) family. It is built on top of [Qwen3-VL-8B-Thinking](https://huggingface.co/Qwen/Qwen3-VL-8B-Thinking) and trained with a two-stage pipeline: supervised fine-tuning (SFT) followed by reinforcement learning (RL) using GRPO (Group Relative Policy Optimization).

	EGM demonstrates that by increasing test-time computation, small vision-language models can outperform much larger models in visual grounding tasks while being significantly faster at inference.

	## Key Results

	- 91.4 average IoU on the RefCOCO benchmark (vs. 87.8 for the base Qwen3-VL-8B-Thinking)
	- +3.6 IoU improvement over the base model
	- Outperforms Qwen3-VL-235B-A22B-Instruct (88.2 avg IoU) and Qwen3-VL-235B-A22B-Thinking (90.7 avg IoU)
	- 5.9x faster inference than Qwen3-VL-235B (737 ms vs. 4,320 ms average latency)
	- 18.9x faster than Qwen3-VL-235B-Thinking

	### RefCOCO Benchmark Results

	| Model | RefCOCO val | RefCOCO test-A | RefCOCO test-B | RefCOCO+ val | RefCOCO+ test-A | RefCOCO+ test-B | RefCOCOg val | RefCOCOg test | Avg |
	|---|---|---|---|---|---|---|---|---|---|
	| Qwen3-VL-8B-Thinking | 91.0 | 92.5 | 86.6 | 86.2 | 91.2 | 80.5 | 87.8 | 88.6 | 87.8 |
	| EGM-Qwen3-VL-8B | 93.9 | 95.6 | 91.2 | 90.5 | 93.5 | 86.3 | 90.8 | 91.4 | 91.4 |
	| Qwen3-VL-235B-A22B-Instruct | 90.4 | 94.6 | 82.2 | 86.4 | 92.1 | 78.5 | 90.5 | 90.5 | 88.2 |
	| Qwen3-VL-235B-A22B-Thinking | 93.4 | 94.1 | 90.6 | 89.5 | 91.4 | 85.2 | 90.4 | 90.5 | 90.7 |
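
The scores above are reported as IoU (intersection over union) between a predicted box and the ground-truth box. The following is a minimal sketch of that quantity for a single pair of boxes, assuming an `(x1, y1, x2, y2)` corner format. The function name `box_iou` is illustrative and is not taken from the EGM codebase, and how per-sample values are aggregated into the table's numbers is not specified in this card.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region (zero if the boxes do not intersect).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    # Areas of the two boxes; degenerate boxes count as zero area.
    area_a = max(0.0, ax2 - ax1) * max(0.0, ay2 - ay1)
    area_b = max(0.0, bx2 - bx1) * max(0.0, by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

For example, two 2x2 boxes offset by one unit in each direction overlap in a 1x1 square, so `box_iou((0, 0, 2, 2), (1, 1, 3, 3))` evaluates to 1/7.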

	## How It Works

	VLMs of different sizes often share the same visual encoder. Small models fall behind large models primarily because of a gap in text understanding: 62.8% of small-model errors stem from complex prompts with multiple relational descriptions. EGM narrows this gap by having a small model generate many mid-quality tokens, matching the performance of large VLMs that produce fewer but more expensive tokens.

	### Training Pipeline

	1. SFT Stage: A proprietary VLM generates detailed chain-of-thought reasoning steps for visual grounding training data. The base model is fine-tuned on this data. The SFT checkpoint is available as [nvidia/EGM-8B-SFT](https://huggingface.co/nvidia/EGM-8B-SFT).
	2. RL Stage: GRPO is applied with a reward function combining IoU and task success metrics, further improving grounding accuracy.
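
The RL stage can be illustrated with a small sketch of GRPO's group-relative advantage: several responses are sampled for the same prompt, each gets a scalar reward, and each reward is normalized by the group's mean and standard deviation. The reward below is an assumption made for illustration only. This card says the reward combines IoU and task success but does not give the formula, so `grounding_reward`, its 0.5 threshold, and its weighting are hypothetical.

```python
import statistics


def grounding_reward(iou, success_threshold=0.5, success_weight=0.5):
    """Hypothetical reward: weighted sum of IoU and a binary success indicator."""
    # Assumption: "success" means the IoU clears a fixed threshold.
    success = 1.0 if iou >= success_threshold else 0.0
    return (1.0 - success_weight) * iou + success_weight * success


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: center and scale rewards within one sampled group."""
    mean = statistics.fmean(rewards)
    # Population standard deviation is used here; implementations may differ.
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One prompt, four sampled responses with different IoUs against the ground truth.
ious = [0.9, 0.6, 0.3, 0.0]
rewards = [grounding_reward(i) for i in ious]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive positive advantages and are reinforced; those below the mean receive negative advantages.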

	## Quickstart

	### Download

	```bash
	pip install -U huggingface_hub
	huggingface-cli download nvidia/EGM-8B --local-dir ./models/EGM-8B
	```
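
The same download can also be done from Python. This is a minimal sketch using `snapshot_download` from `huggingface_hub`; the wrapper name `download_egm` is illustrative, and the default directory simply mirrors the CLI command above.

```python
def download_egm(local_dir="./models/EGM-8B", repo_id="nvidia/EGM-8B"):
    """Download the model snapshot and return the local directory path."""
    # Imported lazily so this module loads even if huggingface_hub is missing.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


# Usage (performs a network download of the full checkpoint):
#   path = download_egm()
```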

	### Inference with SGLang

	Launch the server:

	```bash
	pip install "sglang[all]>=0.5.5"

	python -m sglang.launch_server \
	    --model-path nvidia/EGM-8B \
	    --chat-template=qwen3-vl \
	    --port 30000
	```

	Send a visual grounding request:

	```python
	import openai
	import base64

	client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

	# Load a local image as base64
	with open("example.jpg", "rb") as f:
	    image_base64 = base64.b64encode(f.read()).decode("utf-8")

	response = client.chat.completions.create(
	    model="nvidia/EGM-8B",
	    messages=[
	        {
	            "role": "user",
	            "content": [
	                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"}},
	                {"type": "text", "text": "Please provide the bounding box coordinate of the region this sentence describes: the person on the left."},
	            ],
	        }
	    ],
	    temperature=0.6,
	    top_p=0.95,
	    max_tokens=8192,
	)
	print(response.choices[0].message.content)
	```
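
The response is free-form text, so downstream code usually needs to pull the box out of it. The sketch below is one way to do that under two assumptions that this card does not confirm: that any reasoning trace is wrapped in `<think>...</think>` tags, and that the final answer contains four comma-separated numbers. The helper name `extract_box` is illustrative, and the coordinate convention (pixels or a normalized range) should be checked against the actual model output.

```python
import re


def extract_box(text):
    """Return the last group of four comma-separated numbers in the answer, or None."""
    # Assumption: reasoning, if present, ends with a closing </think> tag.
    answer = text.split("</think>")[-1]
    number = r"-?\d+(?:\.\d+)?"
    pattern = rf"({number})\s*,\s*({number})\s*,\s*({number})\s*,\s*({number})"
    matches = re.findall(pattern, answer)
    if not matches:
        return None
    # Prefer the last match in case the answer restates intermediate guesses.
    return tuple(float(v) for v in matches[-1])
```

With the request above, this could be called as `extract_box(response.choices[0].message.content)`; a `None` result means no box-like pattern was found and the raw text should be inspected.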

	## Model Architecture

	| Component | Details |
	|---|---|
	| Architecture | Qwen3VLForConditionalGeneration |
	| Text Hidden Size | 4096 |
	| Text Layers | 36 |
	| Attention Heads | 32 (8 KV heads) |
	| Text Intermediate Size | 12,288 |
	| Vision Hidden Size | 1152 |
	| Vision Layers | 27 |
	| Patch Size | 16 x 16 |
	| Max Position Embeddings | 262,144 |
	| Vocabulary Size | 151,936 |
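
Because the model generates long reasoning traces, the KV cache is a useful quantity to estimate from the table. The following back-of-the-envelope sketch assumes that the per-head dimension equals hidden size divided by the number of attention heads and that cache entries are stored in 16-bit precision. Neither assumption is stated in this card, and serving frameworks add their own overhead.

```python
# Values copied from the architecture table above.
HIDDEN_SIZE = 4096
NUM_LAYERS = 36
NUM_ATTENTION_HEADS = 32
NUM_KV_HEADS = 8


def kv_cache_bytes_per_token(bytes_per_value=2):
    """Estimate KV-cache bytes per text token (keys plus values, all layers)."""
    # Assumption: head_dim = hidden_size / num_attention_heads.
    head_dim = HIDDEN_SIZE // NUM_ATTENTION_HEADS
    # Factor of 2 covers one key vector and one value vector per KV head.
    return 2 * NUM_LAYERS * NUM_KV_HEADS * head_dim * bytes_per_value


per_token = kv_cache_bytes_per_token()
print(f"{per_token / 1024:.0f} KiB per token")
```

Under these assumptions the estimate is 147,456 bytes (144 KiB) per token. With only 8 KV heads shared across 32 query heads, this is a quarter of what the same layout would need if every attention head kept its own keys and values.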

	## Citation

	```bibtex
	@article{zhan2026EGM,
	  author  = {Zhan, Guanqi and Li, Changye and Liu, Zhijian and Lu, Yao and Wu, Yi and Han, Song and Zhu, Ligeng},
	  title   = {EGM: Efficient Visual Grounding Language Models},
	  journal = {arXiv},
	  year    = {2026}
	}
	```

	## Acknowledgment

	This repository benefits from [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL), [InternVL](https://github.com/OpenGVLab/InternVL), [verl](https://github.com/volcengine/verl) and [verl-internvl](https://github.com/Weiyun1025/verl-internvl).