---
base_model:
  - Qwen/Qwen3-VL-4B-Instruct
language:
  - en
license: apache-2.0
pipeline_tag: image-to-image
library_name: transformers
tags:
  - autonomous-driving
  - vision-language-action
  - chain-of-thought
  - trajectory-prediction
  - VLA
---

# OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

📄 [Paper (arXiv)](https://arxiv.org/abs/2604.18486) | 💻 GitHub | 🌐 Project Page

OneVL is a Vision-Language-Action (VLA) framework for autonomous driving that achieves state-of-the-art trajectory prediction accuracy while matching the inference latency of answer-only autoregressive models.

## Overview

Prior latent Chain-of-Thought (CoT) methods compress reasoning into opaque hidden states: they are fast, but they consistently underperform explicit CoT on driving tasks. OneVL identifies the root cause: purely linguistic latents encode abstract semantic labels rather than the spatiotemporal causal dynamics that govern real driving scenes. OneVL addresses this with dual-modal auxiliary decoders that force compact latent tokens to encode both human-readable reasoning and future scene dynamics simultaneously.

At inference, both decoders are discarded and all latents are prefilled into the prompt context in a single parallel pass, matching the prediction speed of answer-only autoregressive models while recovering the interpretability of explicit CoT in both vision and language.
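For intuition, the sketch below contrasts the two decoding regimes in pseudocode. Function and variable names are illustrative only and do not correspond to the repository's API.

```python
import torch

def explicit_cot_decode(model, prompt_ids, max_cot_tokens, max_answer_tokens):
    # Explicit CoT baseline: the reasoning text itself is generated token by
    # token, so latency grows with the length of the chain of thought.
    ids = prompt_ids
    for _ in range(max_cot_tokens + max_answer_tokens):
        next_id = model(input_ids=ids).logits[:, -1].argmax(-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
    return ids

def onevl_prefill_decode(model, prompt_ids, latent_ids, max_answer_tokens):
    # OneVL: the 4 visual + 2 language latent tokens are appended to the prompt
    # and consumed in one parallel prefill pass; only the trajectory answer is
    # decoded autoregressively afterwards.
    ids = torch.cat([prompt_ids, latent_ids], dim=-1)  # single prefill step
    for _ in range(max_answer_tokens):
        next_id = model(input_ids=ids).logits[:, -1].argmax(-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
    return ids
```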

## Architecture

OneVL augments Qwen3-VL-4B-Instruct with the following components (a minimal code sketch follows the list):

- **Latent Token Interface**: 4 visual latent tokens + 2 language latent tokens inserted in the assistant response before the answer.
- **Visual Auxiliary Decoder**: Predicts future-frame visual tokens at t+0.5s and t+1.0s from the visual latent hidden states (using the Emu3.5 IBQ codebook), acting as a world model.
- **Language Auxiliary Decoder**: Reconstructs explicit CoT reasoning text from the language latent hidden states.
- **Prefill Inference**: Both decoders are discarded at inference; latent tokens are processed in one parallel pass, with only the trajectory generated autoregressively.
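The training-time wiring can be pictured roughly as below. This is a hand-written sketch: the class name, the single linear projections, and the vocabulary-size arguments are simplifications for illustration, not the released implementation.

```python
import torch.nn as nn

class DualAuxDecoders(nn.Module):
    """Illustrative stand-in for the two auxiliary decoders (training only)."""

    def __init__(self, hidden_size, ibq_vocab_size, text_vocab_size):
        super().__init__()
        # Visual auxiliary decoder: visual latent hidden states -> future-frame
        # visual tokens (Emu3.5 IBQ codebook) at t+0.5s and t+1.0s.
        self.visual_head = nn.Linear(hidden_size, ibq_vocab_size)
        # Language auxiliary decoder: language latent hidden states -> explicit
        # CoT reasoning tokens.
        self.language_head = nn.Linear(hidden_size, text_vocab_size)

    def forward(self, vis_latent_states, lang_latent_states):
        # vis_latent_states:  [batch, 4, hidden]  hidden states of the visual latents
        # lang_latent_states: [batch, 2, hidden]  hidden states of the language latents
        future_frame_logits = self.visual_head(vis_latent_states)
        cot_token_logits = self.language_head(lang_latent_states)
        return future_frame_logits, cot_token_logits
```

Both heads exist only to supervise the latent tokens; at inference they are dropped, leaving just the backbone and the six latent tokens.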

## Results

OneVL is the first latent CoT method to surpass explicit autoregressive CoT across major driving benchmarks.

### NAVSIM

| Method | Model Size | PDM-score ↑ | Latency (s) ↓ | Interpretability |
|---|---|---|---|---|
| AR CoT+Answer | 4B | 88.29 | 6.58 | Language |
| OneVL | 4B | 88.84 | 4.46 | Vision + Language |

### ROADWork

| Method | ADE (px) ↓ | FDE (px) ↓ | Latency (s) ↓ |
|---|---|---|---|
| AR CoT+Answer | 13.18 | 29.98 | 10.74 |
| OneVL | 12.49 | 28.80 | 4.71 |

## Usage

### Requirements

- `transformers >= 4.57.0` (required for `Qwen3VLForConditionalGeneration`)
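A minimal loading sketch with Transformers, assuming the checkpoint ships the standard Qwen3-VL configuration (the path is a placeholder):

```python
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

model_path = "/path/to/OneVL-checkpoint"  # placeholder, same as in the commands below
model = Qwen3VLForConditionalGeneration.from_pretrained(model_path).to("cuda:0")
processor = AutoProcessor.from_pretrained(model_path)
```

The `infer_onevl.py` commands below take care of loading from `--model_path`, so this snippet is only needed if you want to drive the model yourself.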

### Inference (Trajectory Prediction Only)

```bash
python infer_onevl.py \
    --model_path /path/to/OneVL-checkpoint \
    --test_set_path test_data/navsim_test.json \
    --image_base_path "" \
    --output_path output/navsim/results.json \
    --device cuda:0 \
    --num_latent 2 --num_latent_vis 4 \
    --max_new_tokens 1024 --answer_prefix "[" --prefix_k 0
```

### Inference with Language + Visual Explanation

```bash
python infer_onevl.py \
    --model_path /path/to/OneVL-checkpoint \
    --test_set_path test_data/navsim_test.json \
    --image_base_path "" \
    --output_path output/navsim/results_explain.json \
    --device cuda:0 \
    --num_latent 2 --num_latent_vis 4 \
    --max_new_tokens 1024 --answer_prefix "[" --prefix_k 0 \
    --decoder_explain --aux_visual_condition \
    --c_thought 2 --max_explain_tokens 1024 \
    --visual_decoder_explain --visual_aux_visual_condition \
    --c_thought_visual 4 --max_visual_tokens 2560
```

## Citation

```bibtex
@article{lu2026onevl,
  title={OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation},
  author={Lu, Jinghui and Guan, Jiayi and Huang, Zhijian and Li, Jinlong and Li, Guang and Kong, Lingdong and Li, Yingyan and Wang, Han and Xu, Shaoqing and Luo, Yuechen and others},
  journal={arXiv preprint arXiv:2604.18486},
  year={2026},
  url={https://arxiv.org/abs/2604.18486}
}
```

## License

Released under the Apache 2.0 license. The model weights are built on Qwen3-VL-4B-Instruct, and the visual tokenizer comes from Emu3.5-VisionTokenizer.