	---
	license: apache-2.0
	language:
	- en
	library_name: transformers
	base_model:
	- Qwen/Qwen3-VL-4B-Thinking
	pipeline_tag: image-text-to-text
	tags:
	- visual-grounding
	- multimodal
	- qwen3-vl
	- supervised-fine-tuning
	---

	# EGM-Qwen3-VL-4B-SFT

	<p align="center">
	<a href="https://nvlabs.github.io/EGM">[Project Page]</a>
	<a href="https://github.com/NVlabs/EGM">[Code]</a>
	</p>

	<div align="center">
	<img src="https://nvlabs.github.io/EGM/figure4.jpeg" width="90%"/>
	</div>

	## Model Summary

	EGM-Qwen3-VL-4B-SFT is the supervised fine-tuning (SFT) checkpoint from the first stage of the [EGM (Efficient Visual Grounding Language Models)](https://nvlabs.github.io/EGM) training pipeline. It is built on top of [Qwen3-VL-4B-Thinking](https://huggingface.co/Qwen/Qwen3-VL-4B-Thinking).

	This is an intermediate checkpoint intended as the starting point for further reinforcement learning (RL) training. For the final, best-performing model, see [nvidia/EGM-4B](https://huggingface.co/nvidia/EGM-4B).

	## Training Details

	### SFT Stage

	In the SFT stage, a proprietary vision-language model (VLM) generates detailed chain-of-thought reasoning steps for the visual grounding training data. The base Qwen3-VL-4B-Thinking model is then fine-tuned on this reasoning-augmented data to learn structured visual grounding with explicit reasoning.

	This SFT checkpoint serves as the initialization for the subsequent RL stage (GRPO), which yields the final [EGM-4B](https://huggingface.co/nvidia/EGM-4B) model.
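As background on the RL stage named above, the core idea usually associated with GRPO is to score each sampled response relative to the other responses drawn for the same prompt. The sketch below illustrates only that normalization step. The function name, the `eps` constant, and the choice of population standard deviation are assumptions made for illustration; the reward function and exact formulation used by EGM are not described in this card.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of responses sampled for the same prompt.

    Hypothetical helper for illustration; not taken from the EGM code base.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    # eps keeps the division finite when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Responses that score above the group mean receive a positive advantage and those below receive a negative one, so no separate value model is needed to form a baseline.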

	### How to Use for RL Training

	```bash
	pip install -U huggingface_hub
	huggingface-cli download nvidia/EGM-4B-SFT --local-dir ./models/EGM-4B-SFT
	```

	Then follow the installation and RL training instructions in the [EGM repository](https://github.com/NVlabs/EGM#rl-training).
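The same download can also be done from Python. The following is a minimal sketch that assumes `huggingface_hub` is installed and uses its `snapshot_download` function; the helper names `default_local_dir` and `download_checkpoint` are hypothetical and exist only for this example.

```python
from pathlib import Path

REPO_ID = "nvidia/EGM-4B-SFT"  # repository named in this card


def default_local_dir(repo_id: str, root: str = "./models") -> Path:
    """Mirror the --local-dir layout of the CLI command: <root>/<model name>."""
    return Path(root) / repo_id.split("/")[-1]


def download_checkpoint(repo_id: str = REPO_ID, root: str = "./models") -> Path:
    """Download every file in the repository and return the local directory."""
    # Imported lazily so the path helper above works without huggingface_hub.
    from huggingface_hub import snapshot_download

    target = default_local_dir(repo_id, root)
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target
```

Calling `download_checkpoint()` should leave the files under `./models/EGM-4B-SFT`, the same location the CLI command above writes to.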

	## Model Architecture

	| Component | Details |
	|---|---|
	| Architecture | Qwen3VLForConditionalGeneration |
	| Precision | bfloat16 |
	| Text Hidden Size | 2560 |
	| Text Layers | 36 |
	| Attention Heads | 32 (8 KV heads) |
	| Text Intermediate Size | 9728 |
	| Vision Hidden Size | 1024 |
	| Vision Layers | 24 |
	| Patch Size | 16 x 16 |
	| Max Position Embeddings | 262,144 |
	| Vocabulary Size | 151,936 |
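A few practical quantities follow from the table by simple arithmetic. The sketch below derives the grouped-query attention ratio, a rough KV-cache size per token, and the patch grid for an image. The per-head dimension is not listed in the table, so `head_dim` is an explicit parameter here; the value 128 used in the comments is an assumption that should be checked against the checkpoint's `config.json`. All class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextConfig:
    num_layers: int = 36      # "Text Layers" in the table
    num_heads: int = 32       # "Attention Heads" (query heads)
    num_kv_heads: int = 8     # "8 KV heads"
    bytes_per_value: int = 2  # bfloat16 stores each value in 2 bytes


def queries_per_kv_head(cfg: TextConfig) -> int:
    """Number of query heads that share one key/value head."""
    return cfg.num_heads // cfg.num_kv_heads


def kv_cache_bytes_per_token(cfg: TextConfig, head_dim: int) -> int:
    """Keys plus values (the factor 2) for every layer and every KV head."""
    return 2 * cfg.num_layers * cfg.num_kv_heads * head_dim * cfg.bytes_per_value


def patch_grid(height: int, width: int, patch: int = 16) -> tuple[int, int]:
    """Rows and columns of patches for an image already resized to multiples of `patch`."""
    if height % patch or width % patch:
        raise ValueError("height and width must be multiples of the patch size")
    return height // patch, width // patch


cfg = TextConfig()
group_size = queries_per_kv_head(cfg)                        # 4 query heads per KV head
per_token = kv_cache_bytes_per_token(cfg, head_dim=128)      # assumed head_dim, see lead-in
full_context_gib = per_token * 262_144 / 2**30               # at the max position count
```

Under the assumed `head_dim` of 128 this gives 147,456 bytes (144 KiB) of KV cache per token, or 36 GiB at the full 262,144-token context for a single sequence. The patch count is the raw grid only; any token merging applied by the vision tower afterwards is not covered here.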

	## Related Models

	| Model | Description |
	|---|---|
	| [nvidia/EGM-4B](https://huggingface.co/nvidia/EGM-4B) | Final RL-trained model (best performance) |
	| [nvidia/EGM-8B-SFT](https://huggingface.co/nvidia/EGM-8B-SFT) | SFT checkpoint for the 8B variant |
	| [nvidia/EGM-8B](https://huggingface.co/nvidia/EGM-8B) | Final RL-trained 8B model |

	## Citation

	```bibtex
	@article{zhan2026EGM,
	author = {Zhan, Guanqi and Li, Changye and Liu, Zhijian and Lu, Yao and Wu, Yi and Han, Song and Zhu, Ligeng},
	title = {EGM: Efficient Visual Grounding Language Models},
	journal = {arXiv},
	year = {2026}
	}
	```

	## Acknowledgment

	This repository benefits from [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL), [InternVL](https://github.com/OpenGVLab/InternVL), [verl](https://github.com/volcengine/verl) and [verl-internvl](https://github.com/Weiyun1025/verl-internvl).