|
|
--- |
|
|
base_model: |
|
|
- Qwen/Qwen2.5-VL-3B-Instruct |
|
|
pipeline_tag: image-text-to-text |
|
|
library_name: transformers |
|
|
--- |
|
|
|
|
|
# SeeNav-Agent: Enhancing Vision-Language Navigation with Visual Prompt and Step-Level Policy Optimization |
|
|
|
|
|
This repository contains the official implementation for the paper [SeeNav-Agent: Enhancing Vision-Language Navigation with Visual Prompt and Step-Level Policy Optimization](https://huggingface.co/papers/2512.02631). |
|
|
|
|
|
<div align="center"> |
|
|
<a href="https://github.com/WzcTHU/SeeNav-Agent"><img src="https://img.shields.io/badge/GitHub-Code-blue.svg?logo=github&" alt="GitHub Code"></a> |
|
|
<a href="https://huggingface.co/wangzc9865/SeeNav-Agent"><img src="https://img.shields.io/badge/π€ -HuggingFace-blue" alt="Hugging Face Model"></a> |
|
|
</div> |
|
|
|
|
|
## Overview |
|
|
We propose **SeeNav-Agent**, a novel LVLM-based embodied navigation framework that includes a zero-shot dual-view visual prompt technique for the input side and an efficient RFT algorithm named SRGPO for post-training. Existing Vision-Language Navigation (VLN) agents often suffer from perception, reasoning, and planning errors, which SeeNav-Agent aims to mitigate through its proposed techniques. |
|
|
|
|
|
## π Highlights |
|
|
|
|
|
* π« **Zero-Shot Visual Prompt:** No extra training for performance improvement with visual prompt. |
|
|
* π² **Efficient Step-Level Advantage Calculation:** Step-Level groups are randomly sampled from the entire batch. |
|
|
* π **Significant Gains:** +20.0pp (GPT4.1+VP) and +5.6pp (Qwen2.5-VL-3B+VP+SRGPO) improvements on EmbodiedBench-Navigation. |
|
|
|
|
|
## π Summary |
|
|
<div style="text-align: center;"> |
|
|
<img src="https://github.com/WzcTHU/SeeNav-Agent/raw/main/SeeNav/figs/framework.png" width="100%"> |
|
|
</div> |
|
|
|
|
|
* π¨ **Dual-View Visual Prompt:** We apply visual prompt techniques directly on the input dual-view image to reduce the visual hallucination. |
|
|
* π **Step Reward Group Policy Optimization (SRGPO):** By defining a state-independent verifiable process reward function, we achieve efficient step-level random grouping and advantage estimation. |
|
|
|
|
|
## π Results on EmbodiedBench-Navigation |
|
|
|
|
|
### π Main Results |
|
|
<div align="center"> |
|
|
<img src="https://github.com/WzcTHU/SeeNav-Agent/raw/main/SeeNav/figs/results.png" width="50%"/> |
|
|
</div> |
|
|
|
|
|
### ποΈ Training Curves for RFT |
|
|
<div align="center"> |
|
|
<img src="https://github.com/WzcTHU/SeeNav-Agent/raw/main/SeeNav/figs/training_curves.png" width="50%"/> |
|
|
</div> |
|
|
|
|
|
### ποΈ Testing Curves for OOD-Scenes |
|
|
<div align="center"> |
|
|
<img src="https://github.com/WzcTHU/SeeNav-Agent/raw/main/SeeNav/figs/ood_val.png" width="50%"/> |
|
|
</div> |
|
|
|
|
|
### π¦ Checkpoint |
|
|
|
|
|
| base model | env | π€ link | |
|
|
| :--: | :--: | :--: | |
|
|
| Qwen2.5-VL-3B-Instruct-SRGPO| EmbodiedBench-Nav | [Qwen2.5-VL-3B-Instruct-SRGPO](https://huggingface.co/wangzc9865/SeeNav-Agent) | |
|
|
|
|
|
## π οΈ Usage |
|
|
|
|
|
### Setup |
|
|
|
|
|
1. Setup a seperate environment for evaluation according to: [EmbodiedBench-Nav](https://github.com/EmbodiedBench/EmbodiedBench) and [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL) to support Qwen2.5-VL-3B-Instruct. |
|
|
|
|
|
2. Setup a seperate training environment according to: [verl-agent](https://github.com/langfengQ/verl-agent) and [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL) to support Qwen2.5-VL-3B-Instruct. |
|
|
|
|
|
### Evaluation |
|
|
|
|
|
Use the following command to evaluate the model on EmbodiedBench: |
|
|
|
|
|
```bash |
|
|
conda activate <your_env_for_eval> |
|
|
cd SeeNav |
|
|
python testEBNav.py |
|
|
``` |
|
|
|
|
|
Hint: you need to first set your endpoint, API-key and api_version in [`SeeNav/planner/models/remote_model.py`](https://github.com/WzcTHU/SeeNav-Agent/blob/main/SeeNav/planner/models/remote_model.py) |
|
|
|
|
|
### Training |
|
|
|
|
|
[`verl-agent/examples/srgpo_trainer`](https://github.com/WzcTHU/SeeNav-Agent/blob/main/verl-agent/examples/srgpo_trainer) contains example scripts for SRGPO-based training on EmbodiedBench-Navigation. |
|
|
|
|
|
1. Modify [`run_ebnav.sh`](https://github.com/WzcTHU/SeeNav-Agent/blob/main/verl-agent/examples/srgpo_trainer/run_ebnav.sh) according to your setup. |
|
|
|
|
|
2. Run the following command: |
|
|
|
|
|
```bash |
|
|
conda activate <your_env_for_train> |
|
|
cd verl-agent |
|
|
bash examples/srgpo_trainer/run_ebnav.sh |
|
|
``` |
|
|
|
|
|
## π Citation |
|
|
|
|
|
If you find this work helpful in your research, please consider citing: |
|
|
|
|
|
```bibtex |
|
|
@article{wang2025seenav, |
|
|
title={SeeNav-Agent: Enhancing Vision-Language Navigation with Visual Prompt and Step-Level Policy Optimization}, |
|
|
author={Zhengcheng Wang and Zichuan Lin and Yijun Yang and Haobo Fu and Deheng Ye}, |
|
|
journal={arXiv preprint arXiv:2512.02631}, |
|
|
year={2025} |
|
|
} |
|
|
``` |