File size: 4,523 Bytes
fa87ddb
 
 
b102434
 
20a757d
b102434
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
---
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
---

# SeeNav-Agent: Enhancing Vision-Language Navigation with Visual Prompt and Step-Level Policy Optimization

This repository contains the official implementation for the paper [SeeNav-Agent: Enhancing Vision-Language Navigation with Visual Prompt and Step-Level Policy Optimization](https://huggingface.co/papers/2512.02631).

<div align="center">
    <a href="https://github.com/WzcTHU/SeeNav-Agent"><img src="https://img.shields.io/badge/GitHub-Code-blue.svg?logo=github&" alt="GitHub Code"></a>
    <a href="https://huggingface.co/wangzc9865/SeeNav-Agent"><img src="https://img.shields.io/badge/πŸ€—&nbsp;-HuggingFace-blue" alt="Hugging Face Model"></a>
</div>

## Overview
We propose **SeeNav-Agent**, a novel LVLM-based embodied navigation framework that includes a zero-shot dual-view visual prompt technique for the input side and an efficient RFT algorithm named SRGPO for post-training. Existing Vision-Language Navigation (VLN) agents often suffer from perception, reasoning, and planning errors, which SeeNav-Agent aims to mitigate through its proposed techniques.

## πŸš€ Highlights

*   🚫 **Zero-Shot Visual Prompt:** No extra training for performance improvement with visual prompt.
*   πŸ—² **Efficient Step-Level Advantage Calculation:** Step-Level groups are randomly sampled from the entire batch.
*   πŸ“ˆ **Significant Gains:** +20.0pp (GPT4.1+VP) and +5.6pp (Qwen2.5-VL-3B+VP+SRGPO) improvements on EmbodiedBench-Navigation.

## πŸ“– Summary
<div style="text-align: center;">
  <img src="https://github.com/WzcTHU/SeeNav-Agent/raw/main/SeeNav/figs/framework.png" width="100%">
</div>

*   🎨 **Dual-View Visual Prompt:** We apply visual prompt techniques directly on the input dual-view image to reduce the visual hallucination.
*   πŸ” **Step Reward Group Policy Optimization (SRGPO):** By defining a state-independent verifiable process reward function, we achieve efficient step-level random grouping and advantage estimation.

## πŸ“‹ Results on EmbodiedBench-Navigation

### πŸ“ Main Results
<div align="center">
    <img src="https://github.com/WzcTHU/SeeNav-Agent/raw/main/SeeNav/figs/results.png" width="50%"/>
</div>

### πŸ–ŒοΈ Training Curves for RFT
<div align="center">
    <img src="https://github.com/WzcTHU/SeeNav-Agent/raw/main/SeeNav/figs/training_curves.png" width="50%"/>
</div>

### πŸ–οΈ Testing Curves for OOD-Scenes
<div align="center">
    <img src="https://github.com/WzcTHU/SeeNav-Agent/raw/main/SeeNav/figs/ood_val.png" width="50%"/>
</div>

### πŸ“¦ Checkpoint

| base model | env | πŸ€— link |
| :--: | :--: | :--: |
| Qwen2.5-VL-3B-Instruct-SRGPO| EmbodiedBench-Nav | [Qwen2.5-VL-3B-Instruct-SRGPO](https://huggingface.co/wangzc9865/SeeNav-Agent) |

## πŸ› οΈ Usage

### Setup

1.  Setup a seperate environment for evaluation according to: [EmbodiedBench-Nav](https://github.com/EmbodiedBench/EmbodiedBench) and [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL) to support Qwen2.5-VL-3B-Instruct.

2.  Setup a seperate training environment according to: [verl-agent](https://github.com/langfengQ/verl-agent) and [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL) to support Qwen2.5-VL-3B-Instruct.

### Evaluation

Use the following command to evaluate the model on EmbodiedBench:

```bash
conda activate <your_env_for_eval>
cd SeeNav
python testEBNav.py
```

Hint: you need to first set your endpoint, API-key and api_version in [`SeeNav/planner/models/remote_model.py`](https://github.com/WzcTHU/SeeNav-Agent/blob/main/SeeNav/planner/models/remote_model.py)

### Training

[`verl-agent/examples/srgpo_trainer`](https://github.com/WzcTHU/SeeNav-Agent/blob/main/verl-agent/examples/srgpo_trainer) contains example scripts for SRGPO-based training on EmbodiedBench-Navigation.

1.  Modify [`run_ebnav.sh`](https://github.com/WzcTHU/SeeNav-Agent/blob/main/verl-agent/examples/srgpo_trainer/run_ebnav.sh) according to your setup.

2.  Run the following command:

```bash
conda activate <your_env_for_train>
cd verl-agent
bash examples/srgpo_trainer/run_ebnav.sh
```

## πŸ“š Citation

If you find this work helpful in your research, please consider citing:

```bibtex
@article{wang2025seenav,
  title={SeeNav-Agent: Enhancing Vision-Language Navigation with Visual Prompt and Step-Level Policy Optimization},
  author={Zhengcheng Wang and Zichuan Lin and Yijun Yang and Haobo Fu and Deheng Ye},
  journal={arXiv preprint arXiv:2512.02631},
  year={2025}
}
```