--- base_model: - Qwen/Qwen2.5-VL-3B-Instruct pipeline_tag: image-text-to-text library_name: transformers --- # SeeNav-Agent: Enhancing Vision-Language Navigation with Visual Prompt and Step-Level Policy Optimization This repository contains the official implementation for the paper [SeeNav-Agent: Enhancing Vision-Language Navigation with Visual Prompt and Step-Level Policy Optimization](https://huggingface.co/papers/2512.02631).

## Overview We propose **SeeNav-Agent**, a novel LVLM-based embodied navigation framework that includes a zero-shot dual-view visual prompt technique for the input side and an efficient RFT algorithm named SRGPO for post-training. Existing Vision-Language Navigation (VLN) agents often suffer from perception, reasoning, and planning errors, which SeeNav-Agent aims to mitigate through its proposed techniques. ## 🚀 Highlights * 🚫 **Zero-Shot Visual Prompt:** No extra training for performance improvement with visual prompt. * 🗲 **Efficient Step-Level Advantage Calculation:** Step-Level groups are randomly sampled from the entire batch. * 📈 **Significant Gains:** +20.0pp (GPT4.1+VP) and +5.6pp (Qwen2.5-VL-3B+VP+SRGPO) improvements on EmbodiedBench-Navigation. ## 📖 Summary

* 🎨 **Dual-View Visual Prompt:** We apply visual prompt techniques directly on the input dual-view image to reduce the visual hallucination. * 🔁 **Step Reward Group Policy Optimization (SRGPO):** By defining a state-independent verifiable process reward function, we achieve efficient step-level random grouping and advantage estimation. ## 📋 Results on EmbodiedBench-Navigation ### 📝 Main Results

### 🖌️ Training Curves for RFT

### 🖍️ Testing Curves for OOD-Scenes

### 📦 Checkpoint | base model | env | 🤗 link | | :--: | :--: | :--: | | Qwen2.5-VL-3B-Instruct-SRGPO| EmbodiedBench-Nav | [Qwen2.5-VL-3B-Instruct-SRGPO](https://huggingface.co/wangzc9865/SeeNav-Agent) | ## 🛠️ Usage ### Setup 1. Setup a seperate environment for evaluation according to: [EmbodiedBench-Nav](https://github.com/EmbodiedBench/EmbodiedBench) and [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL) to support Qwen2.5-VL-3B-Instruct. 2. Setup a seperate training environment according to: [verl-agent](https://github.com/langfengQ/verl-agent) and [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL) to support Qwen2.5-VL-3B-Instruct. ### Evaluation Use the following command to evaluate the model on EmbodiedBench: ```bash conda activate cd SeeNav python testEBNav.py ``` Hint: you need to first set your endpoint, API-key and api_version in [`SeeNav/planner/models/remote_model.py`](https://github.com/WzcTHU/SeeNav-Agent/blob/main/SeeNav/planner/models/remote_model.py) ### Training [`verl-agent/examples/srgpo_trainer`](https://github.com/WzcTHU/SeeNav-Agent/blob/main/verl-agent/examples/srgpo_trainer) contains example scripts for SRGPO-based training on EmbodiedBench-Navigation. 1. Modify [`run_ebnav.sh`](https://github.com/WzcTHU/SeeNav-Agent/blob/main/verl-agent/examples/srgpo_trainer/run_ebnav.sh) according to your setup. 2. Run the following command: ```bash conda activate cd verl-agent bash examples/srgpo_trainer/run_ebnav.sh ``` ## 📚 Citation If you find this work helpful in your research, please consider citing: ```bibtex @article{wang2025seenav, title={SeeNav-Agent: Enhancing Vision-Language Navigation with Visual Prompt and Step-Level Policy Optimization}, author={Zhengcheng Wang and Zichuan Lin and Yijun Yang and Haobo Fu and Deheng Ye}, journal={arXiv preprint arXiv:2512.02631}, year={2025} } ```