---
library_name: hermes++
license: apache-2.0
tags:
- Driving World Model
- Unified Understanding and Generation
pipeline_tag: image-to-3d
---

<div align="center">


<h2>HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation</h2>


<a href="https://lmd0311.github.io/">Xin Zhou</a><sup>1</sup>,
<a href="https://dk-liang.github.io/">Dingkang Liang</a><sup>1</sup>,
<a href="https://scholar.google.com/citations?user=PVMQa-IAAAAJ&hl=en">Xiwu Chen</a><sup>2</sup>,
Feiyang Tan<sup>2</sup>,
Dingyuan Zhang<sup>1</sup>,
<a href="https://scholar.google.com/citations?user=4uE10I0AAAAJ&hl=en">Hengshuang Zhao</a><sup>3</sup>,
<a href="https://scholar.google.com/citations?user=UeltiQ4AAAAJ&hl=en">Xiang Bai</a><sup>1</sup>


<p>
<sup>1</sup>Huazhong University of Science and Technology, <sup>2</sup>Mach Drive, <sup>3</sup>The University of Hong Kong
</p>


<p>
<a href="https://github.com/H-EmbodVis/HERMESV2"><img src="https://img.shields.io/badge/HERMES++-Code-181717?logo=github" alt="HERMES++ Code"></a>
<a href="https://h-embodvis.github.io/HERMESV2/"><img src="https://img.shields.io/badge/HERMES++-Project_Page-2c7a3f?logo=githubpages" alt="HERMES++ Project Page"></a>
<a href="https://github.com/LMD0311/HERMES"><img src="https://img.shields.io/badge/HERMES-Conference_Code_(ICCV25)-181717?logo=github" alt="HERMES Conference Code"></a>
<a href="https://arxiv.org/abs/2501.14729"><img src="https://img.shields.io/badge/HERMES_(ICCV25)-arXiv-b31b1b?logo=arxiv" alt="HERMES Conference arXiv"></a>
</p>


</div>


## Abstract


Driving world models are important for autonomous driving because they simulate environmental dynamics and predict how a scene will evolve. Existing methods usually focus on future scene generation, while comprehensive 3D scene understanding is often handled by separate vision-language models. This separation leaves a gap between semantic interpretation and physical simulation. HERMES++ is a unified driving world model that integrates 3D scene understanding and future geometry prediction in a single framework. It uses a Bird's-Eye View (BEV) representation to consolidate multi-view spatial information into a structure compatible with Large Language Models (LLMs). LLM-enhanced world queries transfer semantic knowledge from the understanding branch to the generation branch, while a Current-to-Future Link conditions future geometric evolution on both scene context and language reasoning. HERMES++ further introduces Joint Geometric Optimization, combining explicit point-cloud constraints and implicit latent regularization to preserve structural consistency. Extensive experiments show that HERMES++ achieves strong performance on both future point cloud prediction and 3D scene understanding.


## TL;DR


- **Unified driving world model:** jointly supports 3D scene understanding and future geometry prediction.
- **BEV representation for LLMs:** compresses multi-view visual inputs into spatially consistent BEV tokens.
- **LLM-enhanced world queries:** transfer semantic and world knowledge from language reasoning to future generation.
- **Current-to-Future Link:** bridges current scene understanding and future geometric evolution.
- **Textual Injection:** uses text embeddings as conditioning signals for future scene generation.
- **Joint Geometric Optimization:** aligns latent features with geometry-aware priors through explicit and implicit constraints.


## Method Overview


HERMES++ unifies understanding and generation around a shared BEV representation:


1. Multi-view images are encoded and projected into BEV space.
2. BEV features are compressed into LLM-compatible visual tokens.
3. The LLM performs scene understanding and enriches world queries with semantic knowledge.
4. The Current-to-Future Link generates future latent representations conditioned on current BEV features, textual semantics, and future ego-motion.
5. A future geometry decoder predicts future point clouds, optimized with Joint Geometric Optimization.
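
The shape bookkeeping of the steps above can be sketched as follows. This is an illustrative NumPy toy, not the released model: every dimension, module, and the use of random projections is a hypothetical stand-in for the learned networks in the repository.

```python
# Illustrative sketch of the HERMES++ data flow. All names and sizes are
# hypothetical placeholders; the real pipeline uses learned networks.
import numpy as np

rng = np.random.default_rng(0)

# 1. Encode N multi-view camera images and project them into BEV space.
n_views, bev_h, bev_w, c = 6, 50, 50, 256
images = rng.standard_normal((n_views, 224, 224, 3))   # placeholder camera inputs
bev = rng.standard_normal((bev_h, bev_w, c))           # stand-in for BEV encoder output

# 2. Compress BEV features into a short sequence of LLM-compatible tokens.
n_tokens, d_llm = 576, 1024
tokens = (bev.reshape(-1, c) @ rng.standard_normal((c, d_llm)))[:n_tokens]

# 3. The LLM enriches a set of world queries with semantic knowledge
#    (here: a placeholder mixing step).
n_queries = 64
world_queries = rng.standard_normal((n_queries, d_llm))
enriched = world_queries + tokens.mean(axis=0)

# 4. Current-to-Future Link: condition future latents on the current scene
#    and future ego-motion (placeholder conditioning).
n_future = 3
ego_motion = rng.standard_normal((n_future, 6))
future_latents = np.stack([enriched + ego_motion[t].sum() for t in range(n_future)])

# 5. A geometry decoder maps each future latent to a point cloud.
n_points = 1024
decoder = rng.standard_normal((d_llm, n_points * 3))
future_point_clouds = (future_latents.mean(axis=1) @ decoder).reshape(n_future, n_points, 3)

print(future_point_clouds.shape)  # (3, 1024, 3): three future frames of 1024 points
```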


## Demo


<div align="center">
<img src="figures/gifs/hermespp_demo_1.gif" width="90%" alt="HERMES++ Demo 1">
<br>
<em>Demo 1</em>
</div>
<div align="center">
<img src="figures/gifs/hermespp_demo_2.gif" width="90%" alt="HERMES++ Demo 2">
<br>
<em>Demo 2</em>
</div>
<div align="center">
<img src="figures/gifs/hermespp_demo_3.gif" width="90%" alt="HERMES++ Demo 3">
<br>
<em>Demo 3</em>
</div>


## Checkpoints


The released checkpoints are stored under the `ckpt/` directory:


- `ckpt/hermes++.stage1.pth`
- `ckpt/hermes++.stage2.1.pth`
- `ckpt/hermes++.stage2.2.pth`
- `ckpt/hermes++.stage3.pth`
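
A small helper can resolve the released file for a given training stage. The stage names below mirror the checkpoint list; the helper itself and the suggested `torch.load` call are assumptions about a standard PyTorch workflow, not an official API of this repository.

```python
# Hypothetical helper mapping a training stage to its released checkpoint
# path. Only the file names come from the list above.
from pathlib import Path

STAGE_CHECKPOINTS = {
    "stage1": "hermes++.stage1.pth",
    "stage2.1": "hermes++.stage2.1.pth",
    "stage2.2": "hermes++.stage2.2.pth",
    "stage3": "hermes++.stage3.pth",
}

def checkpoint_path(stage: str, ckpt_dir: str = "ckpt") -> Path:
    """Return the path of the released checkpoint for a training stage."""
    try:
        return Path(ckpt_dir) / STAGE_CHECKPOINTS[stage]
    except KeyError:
        raise ValueError(
            f"unknown stage {stage!r}; expected one of {sorted(STAGE_CHECKPOINTS)}"
        )

print(checkpoint_path("stage3"))
# Loading would then follow the usual PyTorch pattern, e.g.:
#   state_dict = torch.load(checkpoint_path("stage3"), map_location="cpu")
```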


Please refer to the [GitHub repository](https://github.com/H-EmbodVis/HERMESV2) for code, environment setup, data preparation, and evaluation details.


## Links


- HERMES++ code: [https://github.com/H-EmbodVis/HERMESV2](https://github.com/H-EmbodVis/HERMESV2)
- HERMES++ project page: [https://h-embodvis.github.io/HERMESV2/](https://h-embodvis.github.io/HERMESV2/)
- HERMES conference version: [https://github.com/LMD0311/HERMES](https://github.com/LMD0311/HERMES)
- HERMES conference paper: [https://arxiv.org/abs/2501.14729](https://arxiv.org/abs/2501.14729)


## License


The code and released model files are provided under the Apache 2.0 license.


## Citation


```bibtex
@article{zhou2026hermespp,
  title={HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation},
  author={Zhou, Xin and Liang, Dingkang and Chen, Xiwu and Tan, Feiyang and Zhang, Dingyuan and Zhao, Hengshuang and Bai, Xiang},
  journal={arXiv preprint},
  year={2026}
}

@inproceedings{zhou2025hermes,
  title={HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation},
  author={Zhou, Xin and Liang, Dingkang and Tu, Sifan and Chen, Xiwu and Ding, Yikang and Zhang, Dingyuan and Tan, Feiyang and Zhao, Hengshuang and Bai, Xiang},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2025}
}
```

|