---
library_name: hermes++
license: apache-2.0
tags:
- Driving World Model
- Unified Understanding and Generation
pipeline_tag: image-to-3d
---

<div align="center">

<h2>HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation</h2>

<a href="https://lmd0311.github.io/">Xin Zhou</a><sup>1</sup>,
<a href="https://dk-liang.github.io/">Dingkang Liang</a><sup>1</sup>,
<a href="https://scholar.google.com/citations?user=PVMQa-IAAAAJ&amp;hl=en">Xiwu Chen</a><sup>2</sup>,
Feiyang Tan<sup>2</sup>,
Dingyuan Zhang<sup>1</sup>,
<a href="https://scholar.google.com/citations?user=4uE10I0AAAAJ&amp;hl=en">Hengshuang Zhao</a><sup>3</sup>,
<a href="https://scholar.google.com/citations?user=UeltiQ4AAAAJ&amp;hl=en">Xiang Bai</a><sup>1</sup>

<p>
<sup>1</sup>Huazhong University of Science and Technology, <sup>2</sup>Mach Drive, <sup>3</sup>The University of Hong Kong
</p>

<p>
  <a href="https://github.com/H-EmbodVis/HERMESV2"><img src="https://img.shields.io/badge/HERMES++-Code-181717?logo=github" alt="HERMES++ Code"></a>
  <a href="https://h-embodvis.github.io/HERMESV2/"><img src="https://img.shields.io/badge/HERMES++-Project_Page-2c7a3f?logo=githubpages" alt="HERMES++ Project Page"></a>
  <a href="https://github.com/LMD0311/HERMES"><img src="https://img.shields.io/badge/HERMES-Conference_Code_(ICCV25)-181717?logo=github" alt="HERMES Conference Code"></a>
  <a href="https://arxiv.org/abs/2501.14729"><img src="https://img.shields.io/badge/HERMES_(ICCV25)-arXiv-b31b1b?logo=arxiv" alt="HERMES Conference arXiv"></a>
</p>

</div>

## Abstract

Driving world models are important for autonomous driving because they simulate environmental dynamics and predict how a scene will evolve. Existing methods usually focus on future scene generation, while comprehensive 3D scene understanding is often handled by separate vision-language models. This separation leaves a gap between semantic interpretation and physical simulation. HERMES++ is a unified driving world model that integrates 3D scene understanding and future geometry prediction in a single framework. It uses a Bird's-Eye View (BEV) representation to consolidate multi-view spatial information into a structure compatible with Large Language Models (LLMs). LLM-enhanced world queries transfer semantic knowledge from the understanding branch to the generation branch, while a Current-to-Future Link conditions future geometric evolution on both scene context and language reasoning. HERMES++ further introduces Joint Geometric Optimization, combining explicit point-cloud constraints and implicit latent regularization to preserve structural consistency. Extensive experiments show that HERMES++ achieves strong performance on both future point cloud prediction and 3D scene understanding.

## TL;DR

- **Unified driving world model:** jointly supports 3D scene understanding and future geometry prediction.
- **BEV representation for LLMs:** compresses multi-view visual inputs into spatially consistent BEV tokens.
- **LLM-enhanced world queries:** transfer semantic and world knowledge from language reasoning to future generation.
- **Current-to-Future Link:** bridges current scene understanding and future geometric evolution.
- **Textual Injection:** uses text embeddings as conditioning signals for future scene generation.
- **Joint Geometric Optimization:** aligns latent features with geometry-aware priors through explicit and implicit constraints.
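
To make the last bullet concrete, here is a minimal sketch of how an explicit point-cloud term and an implicit latent term might be combined into one objective. This card does not specify the loss forms, so everything here is an assumption for illustration: the Chamfer distance stands in for the explicit constraint, a mean squared error against a prior vector stands in for the latent regularization, and the names `chamfer_distance`, `latent_l2`, `joint_geometric_loss`, and `implicit_weight` are hypothetical and not taken from the HERMES++ code.

```python
import math


def chamfer_distance(pred, target):
    """Symmetric Chamfer distance between two lists of (x, y, z) points."""
    def one_way(src, dst):
        # Mean distance from each source point to its nearest destination point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(pred, target) + one_way(target, pred)


def latent_l2(latent, prior):
    """Mean squared difference between a latent vector and a prior vector."""
    return sum((a - b) ** 2 for a, b in zip(latent, prior)) / len(latent)


def joint_geometric_loss(pred_points, gt_points, latent, prior, implicit_weight=0.1):
    """Explicit point-cloud term plus a weighted implicit latent term.

    Assumption: both terms and the weight are illustrative stand-ins,
    not the losses used by HERMES++.
    """
    explicit = chamfer_distance(pred_points, gt_points)
    implicit = latent_l2(latent, prior)
    return explicit + implicit_weight * implicit
```

The explicit term penalizes predicted points that are far from the ground-truth cloud, while the implicit term nudges the latent features toward a geometry-aware target; the weight controls how much the latter contributes.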

## Method Overview

HERMES++ unifies understanding and generation around a shared BEV representation:

1. Multi-view images are encoded and projected into BEV space.
2. BEV features are compressed into LLM-compatible visual tokens.
3. The LLM performs scene understanding and enriches world queries with semantic knowledge.
4. The Current-to-Future Link generates future latent representations conditioned on current BEV features, textual semantics, and future ego-motion.
5. A future geometry decoder predicts future point clouds, optimized with Joint Geometric Optimization.
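
Step 2 of the list above can be sketched as a spatial downsampling of the BEV grid followed by flattening into a token sequence. The card does not state which compression operator HERMES++ uses, so average pooling is an assumption chosen for simplicity, and `compress_bev` and `factor` are hypothetical names.

```python
def compress_bev(bev, factor):
    """Average-pool a square BEV grid by `factor` and flatten it into tokens.

    `bev` is a list of rows, each row a list of feature vectors (lists of floats).
    Assumption: average pooling is a stand-in for the actual compression module.
    """
    size = len(bev)
    if size % factor != 0:
        raise ValueError("grid size must be divisible by the pooling factor")

    tokens = []
    for row in range(0, size, factor):
        for col in range(0, size, factor):
            # Collect the feature vectors inside one pooling window.
            cells = [
                bev[row + dr][col + dc]
                for dr in range(factor)
                for dc in range(factor)
            ]
            dim = len(cells[0])
            # One token per window: the channel-wise mean of its cells.
            tokens.append(
                [sum(cell[k] for cell in cells) / len(cells) for k in range(dim)]
            )
    return tokens
```

Pooling a 4×4 grid with `factor=2` yields 4 tokens instead of 16, which shows why compression matters: the token count seen by the LLM shrinks quadratically with the pooling factor while the row-major order keeps the spatial layout.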

## Demo

<div align="center">
  <img src="figures/gifs/hermespp_demo_1.gif" width="90%" alt="HERMES++ Demo 1">
  <br>
  <em>Demo 1</em>
</div>
<div align="center">
  <img src="figures/gifs/hermespp_demo_2.gif" width="90%" alt="HERMES++ Demo 2">
  <br>
  <em>Demo 2</em>
</div>
<div align="center">
  <img src="figures/gifs/hermespp_demo_3.gif" width="90%" alt="HERMES++ Demo 3">
  <br>
  <em>Demo 3</em>
</div>

## Checkpoints

The released checkpoints are stored under the `ckpt/` directory:

- `ckpt/hermes++.stage1.pth`
- `ckpt/hermes++.stage2.1.pth`
- `ckpt/hermes++.stage2.2.pth`
- `ckpt/hermes++.stage3.pth`

Please refer to the [GitHub repository](https://github.com/H-EmbodVis/HERMESV2) for code, environment setup, data preparation, and evaluation details.
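
Before following the repository instructions, it can help to confirm that all four files are present locally. The snippet below is a small sketch using only the Python standard library; it assumes the files were downloaded into a local folder that mirrors the `ckpt/` layout above, and `find_checkpoints` is a hypothetical helper, not part of the HERMES++ code.

```python
from pathlib import Path

# Relative paths exactly as listed in this model card.
CHECKPOINTS = [
    "ckpt/hermes++.stage1.pth",
    "ckpt/hermes++.stage2.1.pth",
    "ckpt/hermes++.stage2.2.pth",
    "ckpt/hermes++.stage3.pth",
]


def find_checkpoints(root="."):
    """Map each released checkpoint path to whether it exists under `root`."""
    return {name: (Path(root) / name).is_file() for name in CHECKPOINTS}


if __name__ == "__main__":
    for name, present in find_checkpoints().items():
        print(f"{'found  ' if present else 'missing'} {name}")
```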

## Links

- HERMES++ code: [https://github.com/H-EmbodVis/HERMESV2](https://github.com/H-EmbodVis/HERMESV2)
- HERMES++ project page: [https://h-embodvis.github.io/HERMESV2/](https://h-embodvis.github.io/HERMESV2/)
- HERMES conference version: [https://github.com/LMD0311/HERMES](https://github.com/LMD0311/HERMES)
- HERMES conference paper: [https://arxiv.org/abs/2501.14729](https://arxiv.org/abs/2501.14729)

## License

The code and released model files are provided under the Apache 2.0 license.

## Citation

```bibtex
@article{zhou2026hermespp,
  title={HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation},
  author={Zhou, Xin and Liang, Dingkang and Chen, Xiwu and Tan, Feiyang and Zhang, Dingyuan and Zhao, Hengshuang and Bai, Xiang},
  journal={arXiv preprint},
  year={2026}
}

@inproceedings{zhou2025hermes,
  title={HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation},
  author={Zhou, Xin and Liang, Dingkang and Tu, Sifan and Chen, Xiwu and Ding, Yikang and Zhang, Dingyuan and Tan, Feiyang and Zhao, Hengshuang and Bai, Xiang},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2025}
}
```