---
library_name: hermes++
license: apache-2.0
tags:
- Driving World Model
- Unified Understanding and Generation
pipeline_tag: image-to-3d
---

<div align="center">


<h2>HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation</h2>


<a href="https://lmd0311.github.io/">Xin Zhou</a><sup>1</sup>,
<a href="https://dk-liang.github.io/">Dingkang Liang</a><sup>1</sup>,
<a href="https://scholar.google.com/citations?user=PVMQa-IAAAAJ&hl=en">Xiwu Chen</a><sup>2</sup>,
Feiyang Tan<sup>2</sup>,
Dingyuan Zhang<sup>1</sup>,
<a href="https://scholar.google.com/citations?user=4uE10I0AAAAJ&hl=en">Hengshuang Zhao</a><sup>3</sup>,
<a href="https://scholar.google.com/citations?user=UeltiQ4AAAAJ&hl=en">Xiang Bai</a><sup>1</sup>


<p>
<sup>1</sup>Huazhong University of Science and Technology, <sup>2</sup>Mach Drive, <sup>3</sup>The University of Hong Kong
</p>


<p>
<a href="https://github.com/H-EmbodVis/HERMESV2"><img src="https://img.shields.io/badge/HERMES++-Code-181717?logo=github" alt="HERMES++ Code"></a>
<a href="https://h-embodvis.github.io/HERMESV2/"><img src="https://img.shields.io/badge/HERMES++-Project_Page-2c7a3f?logo=githubpages" alt="HERMES++ Project Page"></a>
<a href="https://github.com/LMD0311/HERMES"><img src="https://img.shields.io/badge/HERMES-Conference_Code_(ICCV25)-181717?logo=github" alt="HERMES Conference Code"></a>
<a href="https://arxiv.org/abs/2501.14729"><img src="https://img.shields.io/badge/HERMES_(ICCV25)-arXiv-b31b1b?logo=arxiv" alt="HERMES Conference arXiv"></a>
</p>


</div>


## Abstract


Driving world models are important for autonomous driving because they simulate environmental dynamics and predict how a scene will evolve. Existing methods usually focus on future scene generation, while comprehensive 3D scene understanding is often handled by separate vision-language models. This separation leaves a gap between semantic interpretation and physical simulation. HERMES++ is a unified driving world model that integrates 3D scene understanding and future geometry prediction in a single framework. It uses a Bird's-Eye View (BEV) representation to consolidate multi-view spatial information into a structure compatible with Large Language Models (LLMs). LLM-enhanced world queries transfer semantic knowledge from the understanding branch to the generation branch, while a Current-to-Future Link conditions future geometric evolution on both scene context and language reasoning. HERMES++ further introduces Joint Geometric Optimization, combining explicit point-cloud constraints and implicit latent regularization to preserve structural consistency. Extensive experiments show that HERMES++ achieves strong performance on both future point cloud prediction and 3D scene understanding.


## TL;DR


- **Unified driving world model:** jointly supports 3D scene understanding and future geometry prediction.
- **BEV representation for LLMs:** compresses multi-view visual inputs into spatially consistent BEV tokens.
- **LLM-enhanced world queries:** transfer semantic and world knowledge from language reasoning to future generation.
- **Current-to-Future Link:** bridges current scene understanding and future geometric evolution.
- **Textual Injection:** uses text embeddings as conditioning signals for future scene generation.
- **Joint Geometric Optimization:** aligns latent features with geometry-aware priors through explicit and implicit constraints.


## Method Overview


HERMES++ unifies understanding and generation around a shared BEV representation:


1. Multi-view images are encoded and projected into BEV space.
2. BEV features are compressed into LLM-compatible visual tokens.
3. The LLM performs scene understanding and enriches world queries with semantic knowledge.
4. The Current-to-Future Link generates future latent representations conditioned on current BEV features, textual semantics, and future ego-motion.
5. A future geometry decoder predicts future point clouds, optimized with Joint Geometric Optimization.
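
The shape bookkeeping of the steps above can be sketched as follows. This is an illustrative NumPy toy, not the released model: every dimension, module, and the use of random projections is a hypothetical stand-in for the learned networks in the repository.

```python
# Illustrative sketch of the HERMES++ data flow. All names and sizes are
# hypothetical placeholders; the real pipeline uses learned networks.
import numpy as np

rng = np.random.default_rng(0)

# 1. Encode N multi-view camera images and project them into BEV space.
n_views, bev_h, bev_w, c = 6, 50, 50, 256
images = rng.standard_normal((n_views, 224, 224, 3))   # placeholder camera inputs
bev = rng.standard_normal((bev_h, bev_w, c))           # stand-in for BEV encoder output

# 2. Compress BEV features into a short sequence of LLM-compatible tokens.
n_tokens, d_llm = 576, 1024
tokens = (bev.reshape(-1, c) @ rng.standard_normal((c, d_llm)))[:n_tokens]

# 3. The LLM enriches a set of world queries with semantic knowledge
#    (here: a placeholder mixing step).
n_queries = 64
world_queries = rng.standard_normal((n_queries, d_llm))
enriched = world_queries + tokens.mean(axis=0)

# 4. Current-to-Future Link: condition future latents on the current scene
#    and future ego-motion (placeholder conditioning).
n_future = 3
ego_motion = rng.standard_normal((n_future, 6))
future_latents = np.stack([enriched + ego_motion[t].sum() for t in range(n_future)])

# 5. A geometry decoder maps each future latent to a point cloud.
n_points = 1024
decoder = rng.standard_normal((d_llm, n_points * 3))
future_point_clouds = (future_latents.mean(axis=1) @ decoder).reshape(n_future, n_points, 3)

print(future_point_clouds.shape)  # (3, 1024, 3): three future frames of 1024 points
```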


## Demo


<div align="center">
<img src="figures/gifs/hermespp_demo_1.gif" width="90%" alt="HERMES++ Demo 1">
<br>
<em>Demo 1</em>
</div>
<div align="center">
<img src="figures/gifs/hermespp_demo_2.gif" width="90%" alt="HERMES++ Demo 2">
<br>
<em>Demo 2</em>
</div>
<div align="center">
<img src="figures/gifs/hermespp_demo_3.gif" width="90%" alt="HERMES++ Demo 3">
<br>
<em>Demo 3</em>
</div>


## Checkpoints


The released checkpoints are stored under the `ckpt/` directory:


- `ckpt/hermes++.stage1.pth`
- `ckpt/hermes++.stage2.1.pth`
- `ckpt/hermes++.stage2.2.pth`
- `ckpt/hermes++.stage3.pth`
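
A small helper can resolve the released file for a given training stage. The stage names below mirror the checkpoint list; the helper itself and the suggested `torch.load` call are assumptions about a standard PyTorch workflow, not an official API of this repository.

```python
# Hypothetical helper mapping a training stage to its released checkpoint
# path. Only the file names come from the list above.
from pathlib import Path

STAGE_CHECKPOINTS = {
    "stage1": "hermes++.stage1.pth",
    "stage2.1": "hermes++.stage2.1.pth",
    "stage2.2": "hermes++.stage2.2.pth",
    "stage3": "hermes++.stage3.pth",
}

def checkpoint_path(stage: str, ckpt_dir: str = "ckpt") -> Path:
    """Return the path of the released checkpoint for a training stage."""
    try:
        return Path(ckpt_dir) / STAGE_CHECKPOINTS[stage]
    except KeyError:
        raise ValueError(
            f"unknown stage {stage!r}; expected one of {sorted(STAGE_CHECKPOINTS)}"
        )

print(checkpoint_path("stage3"))
# Loading would then follow the usual PyTorch pattern, e.g.:
#   state_dict = torch.load(checkpoint_path("stage3"), map_location="cpu")
```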


Please refer to the [GitHub repository](https://github.com/H-EmbodVis/HERMESV2) for code, environment setup, data preparation, and evaluation details.


## Links


- HERMES++ code: [https://github.com/H-EmbodVis/HERMESV2](https://github.com/H-EmbodVis/HERMESV2)
- HERMES++ project page: [https://h-embodvis.github.io/HERMESV2/](https://h-embodvis.github.io/HERMESV2/)
- HERMES conference version: [https://github.com/LMD0311/HERMES](https://github.com/LMD0311/HERMES)
- HERMES conference paper: [https://arxiv.org/abs/2501.14729](https://arxiv.org/abs/2501.14729)


## License


The code and released model files are provided under the Apache 2.0 license.


## Citation


```bibtex
@article{zhou2026hermespp,
  title={HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation},
  author={Zhou, Xin and Liang, Dingkang and Chen, Xiwu and Tan, Feiyang and Zhang, Dingyuan and Zhao, Hengshuang and Bai, Xiang},
  journal={arXiv preprint},
  year={2026}
}

@inproceedings{zhou2025hermes,
  title={HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation},
  author={Zhou, Xin and Liang, Dingkang and Tu, Sifan and Chen, Xiwu and Ding, Yikang and Zhang, Dingyuan and Tan, Feiyang and Zhao, Hengshuang and Bai, Xiang},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2025}
}
```

|