How to use with the Transformers library
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="wangzc9865/SeeNav-Agent")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)
# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("wangzc9865/SeeNav-Agent")
model = AutoModelForImageTextToText.from_pretrained("wangzc9865/SeeNav-Agent")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

SeeNav-Agent: Enhancing Vision-Language Navigation with Visual Prompt and Step-Level Policy Optimization

This repository contains the official implementation for the paper SeeNav-Agent: Enhancing Vision-Language Navigation with Visual Prompt and Step-Level Policy Optimization.

GitHub Code | Hugging Face Model

Overview

We propose SeeNav-Agent, a novel LVLM-based embodied navigation framework that combines a zero-shot dual-view visual prompt technique on the input side with an efficient reinforcement fine-tuning (RFT) algorithm, SRGPO, for post-training. Existing Vision-Language Navigation (VLN) agents often suffer from perception, reasoning, and planning errors, which SeeNav-Agent mitigates through these two techniques.

πŸš€ Highlights

  • 🚫 Zero-Shot Visual Prompt: the visual prompt improves performance without any extra training.
  • πŸ—² Efficient Step-Level Advantage Calculation: step-level groups are randomly sampled from the entire batch.
  • πŸ“ˆ Significant Gains: +20.0pp (GPT4.1+VP) and +5.6pp (Qwen2.5-VL-3B+VP+SRGPO) improvements on EmbodiedBench-Navigation.
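To make the zero-shot visual prompt idea concrete, the sketch below overlays a coordinate grid on a navigation frame so the LVLM can ground spatial references. This is a hypothetical illustration of the general technique, not the paper's actual prompt design; `add_grid_prompt` and its parameters are invented for this example.

```python
from PIL import Image, ImageDraw

def add_grid_prompt(img, n=4, color=(255, 0, 0)):
    """Overlay an n x n grid on a copy of the input frame.

    Hypothetical sketch of a zero-shot visual prompt: the grid is
    drawn at inference time, so no training is required.
    """
    out = img.copy()
    draw = ImageDraw.Draw(out)
    w, h = out.size
    for i in range(1, n):
        # Vertical and horizontal grid lines at evenly spaced positions.
        draw.line([(w * i // n, 0), (w * i // n, h)], fill=color, width=3)
        draw.line([(0, h * i // n), (w, h * i // n)], fill=color, width=3)
    return out

# Demo on a synthetic dark frame; in practice this would be the
# agent's camera observation before it is passed to the LVLM.
frame = Image.new("RGB", (640, 480), (40, 40, 40))
prompted = add_grid_prompt(frame)
```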

πŸ“– Summary

  • 🎨 Dual-View Visual Prompt: We apply visual prompt techniques directly on the input dual-view image to reduce visual hallucination.
  • πŸ” Step Reward Group Policy Optimization (SRGPO): By defining a state-independent verifiable process reward function, we achieve efficient step-level random grouping and advantage estimation.

πŸ“‹ Results on EmbodiedBench-Navigation

πŸ“ Main Results

πŸ–ŒοΈ Training Curves for RFT

πŸ–οΈ Testing Curves for OOD-Scenes

πŸ“¦ Checkpoint

| base model | env | πŸ€— link |
| --- | --- | --- |
| Qwen2.5-VL-3B-Instruct-SRGPO | EmbodiedBench-Nav | Qwen2.5-VL-3B-Instruct-SRGPO |

πŸ› οΈ Usage

Setup

  1. Set up a separate evaluation environment according to EmbodiedBench-Nav and Qwen3-VL to support Qwen2.5-VL-3B-Instruct.

  2. Set up a separate training environment according to verl-agent and Qwen3-VL to support Qwen2.5-VL-3B-Instruct.

Evaluation

Use the following command to evaluate the model on EmbodiedBench:

conda activate <your_env_for_eval>
cd SeeNav
python testEBNav.py

Hint: first set your endpoint, API key, and api_version in SeeNav/planner/models/remote_model.py.

Training

The directory verl-agent/examples/srgpo_trainer contains example scripts for SRGPO-based training on EmbodiedBench-Navigation.

  1. Modify run_ebnav.sh according to your setup.

  2. Run the following command:

conda activate <your_env_for_train>
cd verl-agent
bash examples/srgpo_trainer/run_ebnav.sh

πŸ“š Citation

If you find this work helpful in your research, please consider citing:

@article{wang2025seenav,
  title={SeeNav-Agent: Enhancing Vision-Language Navigation with Visual Prompt and Step-Level Policy Optimization},
  author={Zhengcheng Wang and Zichuan Lin and Yijun Yang and Haobo Fu and Deheng Ye},
  journal={arXiv preprint arXiv:2512.02631},
  year={2025}
}