---
license: apache-2.0
language:
  - en
library_name: transformers
base_model:
  - Qwen/Qwen3-VL-4B-Thinking
pipeline_tag: image-text-to-text
tags:
  - visual-grounding
  - multimodal
  - qwen3-vl
  - reinforcement-learning
  - grpo
---

# EGM-Qwen3-VL-4B

<p align="center">
  <a href="https://nvlabs.github.io/EGM">[Project Page]</a> &nbsp;
  <a href="https://github.com/NVlabs/EGM">[Code]</a> &nbsp;
</p>

<div align="center">
  <img src="https://nvlabs.github.io/EGM/figure4.jpeg" width="90%"/>
</div>

## Model Summary

**EGM-Qwen3-VL-4B** is an efficient visual grounding model from the [EGM (Efficient Visual Grounding Language Models)](https://nvlabs.github.io/EGM) family. It is built on top of [Qwen3-VL-4B-Thinking](https://huggingface.co/Qwen/Qwen3-VL-4B-Thinking) and trained with a two-stage pipeline: supervised fine-tuning (SFT) followed by reinforcement learning (RL) using GRPO (Group Relative Policy Optimization).

EGM shows that, with more test-time computation, a small vision-language model can **outperform much larger models** on visual grounding tasks while still running faster at inference.

## Key Results

- **90.9 average IoU** across the eight RefCOCO, RefCOCO+, and RefCOCOg splits (vs. 87.2 for the base Qwen3-VL-4B-Thinking)
- **+3.7 average IoU** over the base model
- Outperforms Qwen3-VL-235B-A22B-Instruct (88.2 average IoU) while running faster at inference

### RefCOCO Benchmark Results

| Model | RefCOCO val | RefCOCO test-A | RefCOCO test-B | RefCOCO+ val | RefCOCO+ test-A | RefCOCO+ test-B | RefCOCOg val | RefCOCOg test | Avg |
|---|---|---|---|---|---|---|---|---|---|
| Qwen3-VL-4B-Thinking | 90.0 | 92.7 | 85.6 | 85.2 | 89.5 | 79.3 | 87.0 | 87.7 | 87.2 |
| **EGM-Qwen3-VL-4B** | **93.5** | **95.1** | **90.0** | **89.7** | **93.1** | **84.9** | **90.4** | **90.8** | **90.9** |
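
The scores above are based on intersection over union (IoU) between a predicted box and the ground-truth box. As a reference, here is a minimal sketch of that computation, assuming boxes are given as `(x1, y1, x2, y2)` corners. The function name `box_iou` is illustrative, and the exact evaluation protocol behind the table is not specified in this card.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap; clamp to zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = max(0.0, ax2 - ax1) * max(0.0, ay2 - ay1)
    area_b = max(0.0, bx2 - bx1) * max(0.0, by2 - by1)
    union = area_a + area_b - inter

    # Degenerate (zero-area) inputs yield IoU 0 rather than a division error.
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    print(box_iou((0, 0, 10, 10), (5, 0, 15, 10)))
```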

## How It Works

VLMs of different sizes often share the same visual encoder, so small models fall behind large ones mainly because of a gap in **text understanding**: 62.8% of small-model errors stem from complex prompts with multiple relational descriptions. EGM narrows this gap by having the small model generate many cheap, mid-quality tokens at test time, matching the performance of large VLMs that produce fewer but more expensive tokens.

### Training Pipeline

1. **SFT Stage**: A proprietary VLM generates detailed chain-of-thought reasoning steps for visual grounding training data. The base model is fine-tuned on this data. The SFT checkpoint is available as [nvidia/EGM-4B-SFT](https://huggingface.co/nvidia/EGM-4B-SFT).
2. **RL Stage**: GRPO is applied with a reward function combining IoU and task success metrics, further improving grounding accuracy.
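
To make the RL stage more concrete, the sketch below shows one way a reward that mixes IoU with a success signal could feed GRPO's group-relative advantages. It is an illustration under stated assumptions, not the EGM training code: the 0.5 success threshold, the equal weighting, and the function names are all hypothetical, and the population standard deviation is just one common choice for the normalization.

```python
from statistics import mean, pstdev


def grounding_reward(iou, success_threshold=0.5, iou_weight=0.5):
    """Blend raw IoU with a binary success term (weights are assumptions)."""
    success = 1.0 if iou >= success_threshold else 0.0
    return iou_weight * iou + (1.0 - iou_weight) * success


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # IoUs of four responses sampled for the same prompt (made-up values).
    sampled_ious = [0.8, 0.4, 0.9, 0.1]
    rewards = [grounding_reward(iou) for iou in sampled_ious]
    print(group_relative_advantages(rewards))
```

Responses that score above the group mean receive positive advantages and are reinforced; those below the mean are discouraged, with no separate value model required.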

## Quickstart

### Download

```bash
pip install -U huggingface_hub
huggingface-cli download nvidia/EGM-4B --local-dir ./models/EGM-4B
```

### Inference with SGLang

Launch the server:

```bash
pip install "sglang[all]>=0.5.5"

python -m sglang.launch_server \
    --model-path nvidia/EGM-4B \
    --chat-template=qwen3-vl \
    --port 30000
```

Send a visual grounding request:

```python
import openai
import base64

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

# Load a local image as base64
with open("example.jpg", "rb") as f:
    image_base64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="nvidia/EGM-4B",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"}},
                {"type": "text", "text": "Please provide the bounding box coordinate of the region this sentence describes: the person on the left."},
            ],
        }
    ],
    temperature=0.6,
    top_p=0.95,
    max_tokens=8192,
)
print(response.choices[0].message.content)
```
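
The response is free-form text, so the box usually has to be extracted before it can be used. The helper below is a minimal sketch that assumes the model may emit a reasoning trace closed by `</think>` and then an answer containing four comma-separated numbers. That output format and the name `parse_bbox` are assumptions; this card does not specify either. Whether the coordinates are absolute pixels or normalized is also not stated here, so check the upstream Qwen3-VL documentation before rescaling.

```python
import re


def parse_bbox(text):
    """Return the last [x1, y1, x2, y2] found in the answer, or None."""
    # Keep only what follows the reasoning trace, if one is present (assumed format).
    answer = text.split("</think>")[-1]

    number = r"-?\d+(?:\.\d+)?"
    pattern = rf"({number})\s*,\s*({number})\s*,\s*({number})\s*,\s*({number})"
    matches = re.findall(pattern, answer)
    if not matches:
        return None
    # Use the last match in case earlier text also lists coordinates.
    return [float(v) for v in matches[-1]]


if __name__ == "__main__":
    sample = '<think>maybe 1, 2, 3, 4</think>{"bbox_2d": [120, 45, 380, 610]}'
    print(parse_bbox(sample))
```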

## Model Architecture

| Component | Details |
|---|---|
| Architecture | Qwen3VLForConditionalGeneration |
| Text Hidden Size | 2560 |
| Text Layers | 36 |
| Attention Heads | 32 (8 KV heads) |
| Text Intermediate Size | 9728 |
| Vision Hidden Size | 1024 |
| Vision Layers | 24 |
| Patch Size | 16 x 16 |
| Max Position Embeddings | 262,144 |
| Vocabulary Size | 151,936 |
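
Two quantities follow directly from the table and can be worked out with simple arithmetic, shown below as a rough sketch. The constants are copied from the table; `patch_grid` is a hypothetical helper that counts raw 16 x 16 patches only and ignores any resizing or patch merging done by the actual preprocessor.

```python
NUM_ATTENTION_HEADS = 32  # from the table
NUM_KV_HEADS = 8          # from the table
PATCH_SIZE = 16           # from the table

# Grouped-query attention: how many query heads share each key/value head.
queries_per_kv_head = NUM_ATTENTION_HEADS // NUM_KV_HEADS


def patch_grid(height, width, patch_size=PATCH_SIZE):
    """Rows and columns of raw patches covering an image (ceil division)."""
    rows = -(-height // patch_size)
    cols = -(-width // patch_size)
    return rows, cols


if __name__ == "__main__":
    print(queries_per_kv_head)
    print(patch_grid(480, 640))
```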

## Citation

```bibtex
@article{zhan2026EGM,
    author = {Zhan, Guanqi and Li, Changye and Liu, Zhijian and Lu, Yao and Wu, Yi and Han, Song and Zhu, Ligeng},
    title = {EGM: Efficient Visual Grounding Language Models},
    journal = {arXiv},
    year = {2026}
}
```

## Acknowledgment

This repository benefits from [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL), [InternVL](https://github.com/OpenGVLab/InternVL), [verl](https://github.com/volcengine/verl) and [verl-internvl](https://github.com/Weiyun1025/verl-internvl).