| --- |
| license: apache-2.0 |
| language: |
| - en |
| library_name: transformers |
| base_model: |
| - Qwen/Qwen3-VL-4B-Thinking |
| pipeline_tag: image-text-to-text |
| tags: |
| - visual-grounding |
| - multimodal |
| - qwen3-vl |
| - supervised-fine-tuning |
| --- |
| |
| # EGM-Qwen3-VL-4B-SFT |
|
|
| <p align="center"> |
| <a href="https://nvlabs.github.io/EGM">[Project Page]</a> |
| <a href="https://github.com/NVlabs/EGM">[Code]</a> |
| </p> |
|
|
| <div align="center"> |
| <img src="https://nvlabs.github.io/EGM/figure4.jpeg" width="90%"/> |
| </div> |
|
|
| ## Model Summary |
|
|
| **EGM-Qwen3-VL-4B-SFT** is the supervised fine-tuning (SFT) checkpoint from the first stage of the [EGM (Efficient Visual Grounding Language Models)](https://nvlabs.github.io/EGM) training pipeline. It is built on top of [Qwen3-VL-4B-Thinking](https://huggingface.co/Qwen/Qwen3-VL-4B-Thinking). |
|
|
| This is an **intermediate checkpoint** intended as the starting point for further reinforcement learning (RL) training. For the final model with the best performance, see [nvidia/EGM-4B](https://huggingface.co/nvidia/EGM-4B). |
|
|
| ## Training Details |
|
|
| ### SFT Stage |
|
|
| In the SFT stage, a proprietary VLM generates detailed chain-of-thought reasoning steps for visual grounding training data. The base Qwen3-VL-4B-Thinking model is then fine-tuned on this reasoning-augmented data to learn structured visual grounding with explicit reasoning. |
|
|
| This SFT checkpoint serves as the initialization for the subsequent RL stage (GRPO), which yields the final [EGM-4B](https://huggingface.co/nvidia/EGM-4B) model. |
|
|
| ### How to Use for RL Training |
|
|
| ```bash |
| pip install -U huggingface_hub |
| huggingface-cli download nvidia/EGM-4B-SFT --local-dir ./models/EGM-4B-SFT |
| ``` |
|
|
| Then follow the installation and RL training instructions in the [EGM repository](https://github.com/NVlabs/EGM#rl-training). |
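As an alternative to the CLI command above, the download can also be scripted. The following is a minimal sketch, assuming `huggingface_hub` is installed and that the checkpoint ships a standard `config.json` with an `architectures` field. The helper names `download_checkpoint` and `has_expected_architecture` are hypothetical and are not part of the EGM repository.

```python
import json
from pathlib import Path

REPO_ID = "nvidia/EGM-4B-SFT"            # repo ID used in the CLI command above
LOCAL_DIR = Path("./models/EGM-4B-SFT")  # same target directory as the CLI command
EXPECTED_ARCH = "Qwen3VLForConditionalGeneration"  # from the Model Architecture table


def download_checkpoint(repo_id: str = REPO_ID, local_dir: Path = LOCAL_DIR) -> Path:
    """Download the checkpoint snapshot (requires network access)."""
    # Imported lazily so the local check below works without huggingface_hub.
    from huggingface_hub import snapshot_download

    return Path(snapshot_download(repo_id=repo_id, local_dir=str(local_dir)))


def has_expected_architecture(checkpoint_dir: Path, expected: str = EXPECTED_ARCH) -> bool:
    """Return True if config.json in checkpoint_dir lists the expected class."""
    config_path = Path(checkpoint_dir) / "config.json"
    if not config_path.is_file():
        return False
    config = json.loads(config_path.read_text(encoding="utf-8"))
    # Assumption: the class name is stored under the "architectures" key.
    return expected in (config.get("architectures") or [])
```

Calling `has_expected_architecture(download_checkpoint())` before starting an RL run is one quick way to catch an incomplete or mismatched download.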
|
|
| ## Model Architecture |
|
|
| | Component | Details | |
| |---|---| |
| | Architecture | Qwen3VLForConditionalGeneration | |
| | Precision | bfloat16 | |
| | Text Hidden Size | 2560 | |
| | Text Layers | 36 | |
| | Attention Heads | 32 (8 KV heads) | |
| | Text Intermediate Size | 9728 | |
| | Vision Hidden Size | 1024 | |
| | Vision Layers | 24 | |
| | Patch Size | 16 x 16 | |
| | Max Position Embeddings | 262,144 | |
| | Vocabulary Size | 151,936 | |
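Two entries in the table above lend themselves to a quick worked example. The "32 (8 KV heads)" entry describes grouped-query attention, in which several query heads share one key/value head, and the patch size determines how many patches an image is split into. The sketch below is illustrative only: the function names and the 448 x 448 image size are hypothetical, and the resizing and other preprocessing done by the model's processor are not modeled.

```python
# Values copied from the Model Architecture table.
NUM_ATTENTION_HEADS = 32
NUM_KV_HEADS = 8
PATCH_SIZE = 16


def query_heads_per_kv_head(num_heads: int = NUM_ATTENTION_HEADS,
                            num_kv_heads: int = NUM_KV_HEADS) -> int:
    """Number of query heads that share each key/value head."""
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be a multiple of num_kv_heads")
    return num_heads // num_kv_heads


def patch_grid(height: int, width: int, patch_size: int = PATCH_SIZE) -> tuple[int, int]:
    """Rows and columns of patches for an image whose sides divide evenly."""
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    return height // patch_size, width // patch_size


group_size = query_heads_per_kv_head()  # 32 // 8 = 4
rows, cols = patch_grid(448, 448)       # hypothetical image size: 28 x 28 patches
print(group_size, rows * cols)          # prints "4 784"
```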
|
|
| ## Related Models |
|
|
| | Model | Description | |
| |---|---| |
| | [nvidia/EGM-4B](https://huggingface.co/nvidia/EGM-4B) | Final RL-trained model (best performance) | |
| | [nvidia/EGM-8B-SFT](https://huggingface.co/nvidia/EGM-8B-SFT) | SFT checkpoint for the 8B variant | |
| | [nvidia/EGM-8B](https://huggingface.co/nvidia/EGM-8B) | Final RL-trained 8B model | |
|
|
| ## Citation |
|
|
| ```bibtex |
| @article{zhan2026EGM, |
| author = {Zhan, Guanqi and Li, Changye and Liu, Zhijian and Lu, Yao and Wu, Yi and Han, Song and Zhu, Ligeng}, |
| title = {EGM: Efficient Visual Grounding Language Models}, |
| journal = {arXiv}, |
| year = {2026} |
| } |
| ``` |
|
|
| ## Acknowledgment |
|
|
| This repository benefits from [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL), [InternVL](https://github.com/OpenGVLab/InternVL), [verl](https://github.com/volcengine/verl) and [verl-internvl](https://github.com/Weiyun1025/verl-internvl). |
|
|