Title: Scene as Occupancy

URL Source: https://arxiv.org/html/2306.02851

Markdown Content: Chonghao Sima$^{1,3\ast\dagger}$, Wenwen Tong$^{2\ast}$, Tai Wang$^{1,4}$, Li Chen$^{1,3}$, Silei Wu$^{2}$, Hanming Deng$^{2}$, Yi Gu$^{1}$, Lewei Lu$^{2}$, Ping Luo$^{3}$, Dahua Lin$^{1,4}$, Hongyang Li$^{1\dagger}$

$^{1}$ Shanghai AI Laboratory, $^{2}$ SenseTime Research, $^{3}$ The University of Hong Kong, $^{4}$ The Chinese University of Hong Kong

$^{\ast}$ Equal contribution, $^{\dagger}$ Project lead

https://github.com/OpenDriveLab/OccNet

Abstract

Human drivers can easily describe a complex traffic scene using their visual system. Such precise perception is essential for a driver's planning. To achieve this, a geometry-aware representation that quantizes the physical 3D scene into a structured grid map with semantic labels per cell, termed 3D Occupancy, would be desirable. Compared to the bounding box form, a key insight behind occupancy is that it can capture the fine-grained details of critical obstacles in the scene, and thereby facilitate subsequent tasks. Prior or concurrent literature mainly concentrates on a single scene completion task, whereas we argue that this occupancy representation has the potential for broader impact. In this paper, we propose OccNet, a multi-view vision-centric pipeline with a cascade and temporal voxel decoder to reconstruct 3D occupancy. At the core of OccNet is a general occupancy embedding to represent the 3D physical world. Such a descriptor can be applied to a wide span of driving tasks, including detection, segmentation and planning. To validate the effectiveness of this new representation and our proposed algorithm, we propose OpenOcc, the first dense high-quality 3D occupancy benchmark built on top of nuScenes. Empirical experiments show evident performance gains across multiple tasks, e.g., motion planning could witness a collision rate reduction by 15%-58%, demonstrating the superiority of our method.

1 Introduction

Figure 1: Scene as Occupancy. Representing objects as ViDAR (a) or 3D occupancy (b) has been endorsed by industry [1, 2], due to the fact that conventional 3D bounding box cannot describe in detail irregular vehicles in daily driving scenes, e.g., protruding tail in (a) or (c). Defining the 3D world as Occupancy in (d) serves better to represent obstacles and avoid collision. In this paper, we envision Occupancy as a general Scene Descriptor as in (e) for a wide span of driving tasks beyond detection, such as planning, and witness performance gain compared to previous alternatives.

When you are driving on the road, how would you describe the scene in 3D space through your eyes? Human drivers can easily describe the environment with statements such as “There is a Benz on the left side of my car in around 5 inches” or “There is a truck carrying huge protruding gas pipe on the rear, in around 50 meters ahead”. Having the ability to describe the real world in a “There is” form is essential for making safe autonomous driving (AD) a reality. This is non-trivial for vision-centric AD systems due to the diverse range of entities present in the scene, including vehicles such as cars, SUVs, and construction trucks, as well as static barriers, pedestrians, background buildings and vegetation. Quantizing the 3D scene into structured cells with semantic labels attached, termed 3D Occupancy, is an intuitive solution, and this form is also advocated in industry by companies such as Mobileye[1] and Tesla[2]. Compared to the 3D box, which oversimplifies the shape of objects, 3D occupancy is geometry-aware, depicting different objects and background shapes via collections of 3D cubes with different geometric structures. As illustrated in Figure 1(c-d), a 3D box can only describe the main body of the construction vehicle, while 3D occupancy can preserve the detail of its crane arm. Other conventional alternatives, such as point cloud segmentation and bird’s-eye-view (BEV) segmentation, while widely deployed in the context of AD, have their limitations in cost and granularity, respectively. A detailed comparison is given in Table 1. Such evident advantages of 3D occupancy encourage an investigation into its potential for augmenting conventional perception tasks and downstream planning.

Similar works have discussed 3D occupancy at an initial stage. Occupancy grid map, a similar concept in Robotics, is a typical representation in mobile navigation[30] but only serves as the search space of planning. 3D semantic scene completion (SSC) [34] can be regarded as a perception task to evaluate the idea of 3D occupancy. Exploiting temporal information as geometric prior is intuitive for the vision-centric models to reconstruct the geometry-aware 3D occupancy, yet previous attempts[17, 20, 5, 27] have failed to address this. A coarse-to-fine approach is also favorable in improving 3D geometric representation at affordable cost, while it is ignored by one-stage methods[17, 27, 5]. In addition, the community still seeks a practical approach to evaluate 3D occupancy in a full-stack autonomous driving spirit as vision-centric solutions[14] prevail.

To address the aforementioned issues, we propose OccNet, a multi-view vision-centric pipeline with a cascade voxel decoder that reconstructs 3D occupancy with the aid of temporal clues, and task-specific heads supporting a wide range of driving tasks. The core of OccNet is a compact and representative 3D occupancy embedding to describe the 3D scene. To achieve this, unlike straightforward voxel feature generation from image features or sole use of the BEV feature as in previous literature[21, 7, 36], OccNet decodes the 3D occupancy feature from the BEV feature in a cascade fashion. The decoder adopts a progressive scheme to recover the height information with voxel-based temporal self-attention and spatial cross-attention, bundled with a deformable 3D attention module for efficiency. Equipped with such a 3D occupancy descriptor, OccNet simultaneously supports general 3D perception tasks and facilitates the downstream planning task, i.e., 3D occupancy prediction, 3D detection, BEV segmentation, and motion planning. For fair comparison across methods, we build OpenOcc, a 3D occupancy benchmark with dense and high-quality annotations, based on the nuScenes dataset[4, 10]. It comprises 34149 annotated frames with over 1.4 billion 3D occupancy cells, each assigned to one of 16 classes to describe foreground objects and background stuff. Compared to the sparse alternative, such dense and semantically rich annotations help vision models learn superior 3D geometry. The benchmark also takes object motion into consideration with directional flow annotations, making it extensible to the planning task.

We evaluate OccNet on the OpenOcc benchmark, and empirical studies demonstrate the superiority of 3D occupancy as a scene representation over traditional alternatives in three aspects: 1) Better perception. 3D occupancy facilitates the acquisition of 3D geometry from vision-only models, as evidenced by point cloud segmentation performance comparable with LiDAR-based methods and by enhanced 3D detection performance with occupancy-based pre-training or joint training. 2) Better planning. More accurate perception also translates into improved planning performance. 3) Dense is better. Dense 3D occupancy proves more effective than the sparse form in supervising vision-only models. On the OpenOcc benchmark, OccNet outperforms the state of the art, e.g. TPVFormer[17], with a relative improvement of 14% in the semantic scene completion task. Compared with FCOS3D[37], the performance of the detection model pre-trained on OccNet increases by about 10 points when fine-tuned on small-scale data. For the motion planning task based on 3D occupancy, we can reduce the collision rate by 15%-58% compared with a planning policy based on BEV segmentation or 3D boxes.

To sum up, our contributions are two-fold: (1) We propose OccNet, a vision-centric pipeline with a cascade voxel decoder to generate 3D occupancy using temporal clues. It better captures the fine-grained details of the physical world and supports a wide range of driving tasks. (2) Based on the proposed OpenOcc benchmark with dense and high-quality annotations, we demonstrate the effectiveness of OccNet with an evident performance gain on perception and planning tasks. An initial conclusion is that 3D occupancy, as a scene representation, is superior to conventional alternatives.

| Representation | Output space | Foreground object | Background & Mapping | Description Granularity | Requires point cloud input |
| --- | --- | --- | --- | --- | --- |
| 3D Box | 3D | ✓ | - | $0.4 \sim 12$ m | - |
| BEV Seg. | BEV | ✓ | ✓ | $0.5 \sim 1$ m | - |
| Point cloud | 3D | ✓ | - | $\sim 0.02$ m | ✓ |
| 3D Occupancy | 3D | ✓ | ✓ | $0.25 \sim 0.5$ m | - |

Table 1: Comparison of different representations. 3D Occupancy unifies foreground objects and background stuff into a fine-grained and dense voxel space, and is input-modality-agnostic.

Figure 2: OccNet pipeline. The core of OccNet is to obtain a representative Occupancy Descriptor and apply it to various driving tasks. Our proposed algorithm consists of two stages. I. Reconstruction of Occupancy. Given multiple visual inputs, we first generate features from the BEV encoder. The Voxel Decoder operates in a cascade fashion where voxels are refined progressively. A 3D deformable attention (att.) unit serves a similar function to its 2D counterpart. Temporal voxels $V_{t-1}$ are also incorporated. Some connections are omitted for brevity. See the text for details. II. Exploitation of Occupancy. Equipped with the occupancy descriptor, we can proceed to tasks including semantic scene completion and 3D object detection. Compacting them in BEV space yields a BEV segmentation map, which can be directly fed into the planning pipeline [13]. Such a design can ensure desirable improvement in the planning task.

2 Related Work

3D object detection[33, 37, 21, 25] adopts 3D boxes as the objective of perception in AD since the box-form is well structured for downstream rule-based approaches. Such a representation abstracts 3D objects with different shapes into standardized cuboids, hence only cares about foreground objects and oversimplifies object shape. In contrast, 3D occupancy is a fine-grained description of the physical world and can differentiate objects with various shapes.

LiDAR segmentation[41, 29] is tasked as point-level 3D scene understanding. It requires point cloud as input, which is expensive and less portable. Since LiDAR inherently suffers from limited sensing range and sparsity in 3D scene description, it is not friendly to holistic 3D scene semantic understanding[34] using such a pipeline.

3D reconstruction and rendering. Inferring the 3D geometry of objects or scenes from 2D images [11, 28] has been a prevalent yet challenging problem in computer vision for many years. Most approaches in this domain[31, 6, 35] cope with a single object or scene. For AD applications, this is not feasible since strong generalization ability is required. Note that 3D reconstruction and rendering concentrates more on the quality of the scene geometry and visual appearance. It pays less attention to model efficiency and semantic understanding.

Semantic Scene Completion. The definition of occupancy prediction discussed in this work most closely resembles SSC [34]. MonoScene[5] first adopts U-Net to infer dense 3D occupancy with semantic labels from a single monocular RGB image. There has been a burst of related work released on arXiv recently. We deem these works concurrent and briefly discuss them below. VoxFormer [20] utilizes depth estimation to set voxel queries in a two-stage framework. OccDepth[27] also adopts a depth-aware spirit in a stereo setting with distillation to predict semantic occupancy. TPVFormer[17] employs LiDAR-based sparse 3D occupancy as the supervision and proposes a tri-perspective view representation to obtain features. Wang et al.[38] provide a carefully hand-crafted occupancy benchmark that could facilitate the community.

Although their settings differ from ours, with work conducted on Semantic-KITTI[3] and NYUv2[32] (monocular or RGB-D), prior and concurrent literature unanimously neglects the adoption of temporal context. Utilizing history voxel features is straightforward, and it has been verified by Tesla [2]. Yet no technical details or reports are available to the public. Moreover, we position our work as the first to investigate occupancy as a general descriptor that could enhance multiple tasks beyond detection.

3 Methodology

In this paper, we propose an effective and general framework, named OccNet, which obtains robust occupancy features from images and supports multiple driving tasks, as shown in Figure 2. Our method comprises two stages, Reconstruction of Occupancy and Exploitation of Occupancy. We term the part bridging them the Occupancy Descriptor, a unified description of the driving scene.

Reconstruction of Occupancy. The goal of this stage is to obtain a representative occupancy descriptor for supporting downstream tasks. Motivated by the fast development in BEV perception[21, 7, 22], OccNet is designed to exploit that gain for the voxel-wise prediction task in 3D space. The simplest architecture, which uses only the BEV feature in downstream tasks, is not suitable for height-aware tasks in 3D space. At the other extreme, directly constructing the voxel feature from images has a huge computational cost. We term these two extremes BEVNet and VoxelNet; the design of OccNet finds a balance between them, achieving the best performance at affordable cost. The reconstruction stage first extracts the multi-view feature $F_t$ from surrounding images, and feeds it into the BEV encoder along with the history BEV feature $B_{t-1}$ and the current BEV query $Q_t$ to get the current BEV feature. The BEV encoder follows the structure of BEVFormer [21], where $B_{t-1}$, $Q_t$ and $F_t$ go through a spatial-temporal transformer block. Then, the image feature and the history and current BEV features are decoded together into the occupancy descriptor via the Cascade Voxel Decoder. Details of the decoder are presented in Sec. 3.1.

Exploitation of Occupancy. A wide range of driving tasks can be deployed based on the reconstructed occupancy descriptor. Inspired by Uni-AD[14], an explicit design of each representation is preferred. Intuitively, 3D semantic scene completion[34] and 3D object detection are attached to the occupancy descriptor. Squeezing the 3D occupancy grid map and 3D boxes along the height generates a BEV segmentation map. Such a map can be directly fed into the motion planning head, along with a sampler of high-level commands, resulting in the ego-vehicle trajectory via argmin and a GRU module. A detailed illustration is provided in Sec. 3.2.

3.1 Cascade Voxel Decoder

To obtain a better voxel feature effectively and efficiently, we design a cascade structure in the decoder to progressively recover the height information in voxel feature.

From BEV to Cascaded Voxel. Based on the observation that directly using the BEV feature or directly reconstructing the voxel feature from the perspective view suffers from a performance or efficiency drop (see our ablation in Table 9), we break the reconstruction from the BEV feature ($B_t \in \mathbb{R}^{H \times W \times C_{\text{BEV}}}$) to the desired voxel feature ($V_t \in \mathbb{R}^{Z \times H \times W \times C_{\text{Voxel}}}$) into $N$ steps, which we call a cascade structure. Here $H$ and $W$ are the 2D spatial shape of the BEV space, $C$ is the feature dimension and $Z$ is the desired height of the voxel space.
Between the input BEV feature and the desired cascaded voxel feature, we denote the intermediate voxel feature with different heights as $V_{t,i}' \in \mathbb{R}^{Z_i \times H \times W \times C_i}$, where $Z_i$ and $C_i$ are uniformly distributed between $\{1, Z\}$ and $\{C_{\text{BEV}}, C_{\text{Voxel}}\}$ respectively.
As shown in Figure 2, $B_{t-1}$ and $B_t$ are lifted into $V_{t-1,i}'$ and $V_{t,i}'$ via a feed-forward network and go through the $i$-th voxel decoder to obtain a refined $V_{t,i}'$; the later steps follow the same scheme. Each voxel decoder comprises voxel-based temporal self-attention and voxel-based spatial cross-attention modules, which refine $V_{t,i}'$ with the history $V_{t-1,i}'$ and the image feature $F_t$ respectively.
Step by step, the model gradually increases $Z_i$ and decreases $C_i$ to learn the final occupancy descriptor $V_t$ effectively and efficiently.
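
As an illustration of the cascade, the following minimal sketch computes one possible schedule of intermediate heights and channel widths. It assumes that the height grows linearly from 1 towards the final height $Z$ and the channel width shrinks linearly from $C_{\text{BEV}}$ to $C_{\text{Voxel}}$ over the $N$ steps; the function name `cascade_schedule` and the concrete numbers (16, 256, 64, 3) are hypothetical and are not taken from the paper.

```python
def cascade_schedule(z_final, c_bev, c_voxel, num_steps):
    """Return [(Z_i, C_i)] for i = 1..num_steps using linear interpolation.

    Assumption: heights move evenly from 1 to z_final and channel widths
    move evenly from c_bev to c_voxel; the last step hits both targets.
    """
    if num_steps < 1:
        raise ValueError("num_steps must be >= 1")
    schedule = []
    for i in range(1, num_steps + 1):
        # Integer arithmetic keeps the schedule free of float rounding noise.
        z_i = 1 + round(i * (z_final - 1) / num_steps)
        c_i = c_bev + round(i * (c_voxel - c_bev) / num_steps)
        schedule.append((z_i, c_i))
    return schedule


# Hypothetical configuration: 3 cascade steps, 16 height bins, 256 -> 64 channels.
steps = cascade_schedule(z_final=16, c_bev=256, c_voxel=64, num_steps=3)
for z_i, c_i in steps:
    print(f"Z_i={z_i:2d}  C_i={c_i:3d}")
```

Each step trades channel capacity for height resolution, so the per-cell feature volume ($Z_i \times C_i$) stays bounded while the height axis is recovered.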

Voxel-based Temporal Self-Attention. Temporal information is crucial to represent the driving scene accurately [21]. Given the history voxel feature $V_{t-1,i}'$, we align it to the current occupancy feature $V_{t,i}'$ via the position of the ego-vehicle. In typical self-attention, each query attends to every key and value, so the computational cost is huge and grows by a factor of $Z^2$ in 3D space compared to the 2D case. To alleviate this cost, we design a voxel-based efficient attention, termed 3D Deformable Attention (3D-DA for short). By applying it in the voxel-based temporal self-attention, we ensure that each voxel query only needs to interact with local voxels of interest, making the computational cost affordable.
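
The $Z^2$ factor can be checked with a short back-of-the-envelope calculation. The sketch below counts query-key pairs only (it ignores the feature dimension and constant factors), and the grid size and the numbers of heads and sampled keys are hypothetical values chosen for illustration.

```python
def full_attention_pairs(h, w, z=1):
    """Query-key pairs when every cell attends to every cell."""
    n = z * h * w
    return n * n


def deformable_attention_pairs(h, w, z, num_heads, num_keys):
    """Query-key pairs when each query samples num_heads * num_keys locations."""
    return z * h * w * num_heads * num_keys


H, W, Z = 200, 200, 16  # hypothetical BEV size and voxel height
ratio_3d_vs_2d = full_attention_pairs(H, W, Z) // full_attention_pairs(H, W)
sparse_pairs = deformable_attention_pairs(H, W, Z, num_heads=8, num_keys=4)

print("dense 3D / dense 2D:", ratio_3d_vs_2d)  # equals Z ** 2
print("dense 3D pairs:     ", full_attention_pairs(H, W, Z))
print("deformable 3D pairs:", sparse_pairs)
```

Dense attention scales quadratically in the number of voxels, whereas the deformable variant scales linearly, which is what makes the voxel-level temporal attention affordable.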

3D Deformable Attention. We extend the traditional 2D deformable attention [40] to 3D form. Given a voxel feature $V_{t,i}' \in \mathbb{R}^{Z_i \times H \times W \times C_i}$, a voxel query with feature $\boldsymbol{q} \in \mathbb{R}^{C_i}$ and 3D reference point $\boldsymbol{p}$, the 3D deformable attention is represented by:

$$\operatorname{3D-DA}(\boldsymbol{q}, \boldsymbol{p}, V_{t,i}') = \sum_{m=1}^{M} W_m \sum_{k=1}^{K} A_{mk} W_k' V_{t,i}'(\boldsymbol{p} + \Delta\boldsymbol{p}_{mk}), \tag{1}$$

where $M$ is the number of attention heads, $K$ is the number of sampled keys with $K \ll Z_i H W$, $W_m \in \mathbb{R}^{C_i \times (C_i/M)}$ and $W_k' \in \mathbb{R}^{(C_i/M) \times C_i}$ are the learnable weights, $A_{mk}$ is the normalized attention weight, and $\boldsymbol{p} + \Delta\boldsymbol{p}_{mk}$ is the learnable sample point position in 3D space, at which the feature is computed by trilinear interpolation from the voxel feature.
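
To make Eq. (1) concrete, here is a deliberately simplified sketch: a single attention head on a single-channel voxel grid, with the projections $W_m$ and $W_k'$ replaced by the identity. In the actual module the sampling offsets and attention scores would be predicted from the query $\boldsymbol{q}$ by learned layers; here they are passed in directly. The function names and the toy volume are assumptions made for illustration only.

```python
import math


def trilinear_sample(volume, x, y, z):
    """Trilinearly interpolate a scalar volume indexed as volume[z][y][x].

    Coordinates are in voxel units; points outside the grid are clamped.
    """
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    z = min(max(z, 0.0), depth - 1.0)
    x0, y0, z0 = int(math.floor(x)), int(math.floor(y)), int(math.floor(z))
    x1, y1, z1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1), min(z0 + 1, depth - 1)
    fx, fy, fz = x - x0, y - y0, z - z0
    value = 0.0
    # Weighted sum over the 8 surrounding voxel corners.
    for zi, wz in ((z0, 1.0 - fz), (z1, fz)):
        for yi, wy in ((y0, 1.0 - fy), (y1, fy)):
            for xi, wx in ((x0, 1.0 - fx), (x1, fx)):
                value += wz * wy * wx * volume[zi][yi][xi]
    return value


def softmax(scores):
    """Normalize raw scores into attention weights A_k that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def deformable_attention_3d(volume, ref_point, offsets, scores):
    """Single-head, single-channel form of Eq. (1): sum_k A_k * V(p + dp_k)."""
    weights = softmax(scores)
    px, py, pz = ref_point
    return sum(
        a * trilinear_sample(volume, px + dx, py + dy, pz + dz)
        for a, (dx, dy, dz) in zip(weights, offsets)
    )


# Toy 2x2x2 volume whose value is x + 10*y + 100*z at each corner.
volume = [[[x + 10 * y + 100 * z for x in range(2)] for y in range(2)] for z in range(2)]
center_value = trilinear_sample(volume, 0.5, 0.5, 0.5)
attended = deformable_attention_3d(
    volume,
    ref_point=(0.25, 0.5, 0.5),
    offsets=[(0.0, 0.0, 0.0), (0.5, 0.0, 0.0)],  # K = 2 sampled keys
    scores=[0.0, 0.0],                           # equal scores -> equal weights
)
print(center_value, attended)
```

Because each query touches only $K$ interpolated locations rather than all $Z_i H W$ voxels, the cost per query is independent of the grid size.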

Voxel-based Spatial Cross-Attention. In the cross-attention, the voxel feature $V_{t,i}'$ interacts with the multi-scale image features $F_t$ via 2D deformable attention[40]. The $i$-th decoder directly samples $N_{ref,i}$ 3D points from each voxel, projects them to the image view, and interacts with the sampled image features. Such a design maintains the height information and ensures the learning of voxel-wise features.
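
The sampling step can be sketched as follows, under several stated assumptions: reference points are spaced evenly along the height of a voxel, the camera follows a plain pinhole model, and the points have already been transformed into the camera frame (a real multi-camera pipeline would first apply per-camera extrinsics). The helper names, the grid origin and the intrinsics below are hypothetical.

```python
def voxel_reference_points(ix, iy, iz, voxel_size, origin, num_ref):
    """Return num_ref points spaced evenly along the height of one voxel.

    voxel_size and origin are (x, y, z) tuples in metres.
    """
    x = origin[0] + (ix + 0.5) * voxel_size[0]
    y = origin[1] + (iy + 0.5) * voxel_size[1]
    z_low = origin[2] + iz * voxel_size[2]
    return [
        (x, y, z_low + (k + 0.5) / num_ref * voxel_size[2])
        for k in range(num_ref)
    ]


def project_to_image(point_cam, fx, fy, cx, cy, width, height):
    """Pinhole projection of a camera-frame point (x right, y down, z forward).

    Returns pixel coordinates (u, v), or None if the point is behind the
    camera or falls outside the image.
    """
    x, y, z = point_cam
    if z <= 1e-6:
        return None
    u = fx * x / z + cx
    v = fy * y / z + cy
    if 0.0 <= u < width and 0.0 <= v < height:
        return (u, v)
    return None


# Hypothetical grid: 0.5 m voxels, origin at (-50, -50, -5) m, 2 points per voxel.
refs = voxel_reference_points(0, 0, 0, (0.5, 0.5, 0.5), (-50.0, -50.0, -5.0), num_ref=2)

# Hypothetical intrinsics for a 1600x900 image.
pixel = project_to_image((1.0, 0.5, 10.0), 1000.0, 1000.0, 800.0, 450.0, 1600, 900)
behind = project_to_image((0.0, 0.0, -5.0), 1000.0, 1000.0, 800.0, 450.0, 1600, 900)
print(refs, pixel, behind)
```

Only the reference points that land inside an image contribute to the cross-attention for that camera; the image features around each hit are then sampled with 2D deformable attention.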

3.2 Exploiting Occupancy on Various Tasks

The OccNet depicts the scene in 3D space with fine-grained occupancy descriptor, which can be fed into various driving tasks without excessive computational overhead.

Semantic Scene Completion. For simplicity, we design an MLP head to predict the semantic label of each voxel, and apply the Focal loss[24] to handle the large imbalance between the numbers of occupied and empty voxels. In addition, a flow head with $L_1$ loss is attached to estimate the flow velocity of each occupied voxel.
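
The effect of the focal loss on the occupied/empty imbalance can be seen in a small numeric sketch. It uses the standard per-sample form $-\alpha (1 - p_t)^{\gamma} \log p_t$ with the commonly used defaults $\gamma = 2$ and $\alpha = 0.25$; the hyperparameters actually used by OccNet are not stated in this section, so these values are assumptions.

```python
import math


def cross_entropy(prob_true_class):
    """Plain cross-entropy for one voxel, given the probability of its true class."""
    p = min(max(prob_true_class, 1e-12), 1.0)
    return -math.log(p)


def focal_loss(prob_true_class, gamma=2.0, alpha=0.25):
    """Focal loss for one voxel: -alpha * (1 - p_t)^gamma * log(p_t)."""
    p = min(max(prob_true_class, 1e-12), 1.0)
    return -alpha * (1.0 - p) ** gamma * math.log(p)


easy_empty = 0.95     # confidently classified empty voxel (the abundant case)
hard_occupied = 0.10  # poorly classified occupied voxel (the rare case)

for name, p in (("easy empty", easy_empty), ("hard occupied", hard_occupied)):
    print(f"{name:14s} CE={cross_entropy(p):.4f}  focal={focal_loss(p):.6f}")
```

The modulating factor $(1 - p_t)^{\gamma}$ shrinks the contribution of the many easy empty voxels far more than that of the few hard occupied ones, so the gradient is not dominated by free space.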

3D Object Detection. Inspired by the head design in BEVFormer[21], we compact the occupancy descriptor into BEV, then apply a query-based detection head (a variant of Deformable DETR[40]) to predict the 3D boxes.

BEV segmentation. Following the spatial-temporal-fusion perception structure in ST-P3[13], map representation and semantic segmentation are predicted from the BEV feature as in 3D object detection. The BEV segmentation head includes the drivable-area head and the lane head for map representation, the vehicle segmentation head and the pedestrian segmentation head for semantic segmentation.

Motion Planning. For the motion planning task, either the predicted occupancy results from SSC or the 3D bounding boxes can be transformed into a BEV segmentation map, as shown in Figure 2. Both the 3D occupancy results and the 3D boxes are squeezed along the height dimension. All the semantic labels per BEV cell from either 3D occupancy or 3D boxes are turned into a 0-1 format, where 1 indicates that the cell is occupied and 0 that it is empty. Then, this BEV segmentation map is applied to the safety cost function, and we compute the safety, comfort and progress costs on the sampled trajectories. Note that compared to 3D boxes, the richer background information in occupancy scene completion leads to a more comprehensive safety cost function, and thus the safety cost value needs to be normalized between these two kinds of BEV segmentation. All candidate trajectories are sampled with random velocity, acceleration, and curvature. Under the guidance of a high-level command (forward, turn left or turn right), the trajectory with the lowest cost among those corresponding to the specific command is output. GRU refinement with the front-view vision feature is further performed on this trajectory, as in ST-P3[13], to obtain the final trajectory.
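
The squeeze-and-select logic can be sketched as below. This is a reduced illustration, not the planner itself: it keeps only a safety term (the comfort and progress costs, cost normalization, command conditioning and GRU refinement are omitted), trajectories are given directly as lists of BEV cell indices, and the label id `EMPTY = 0` is an assumed convention.

```python
EMPTY = 0  # assumed label id for free space


def squeeze_to_bev(occupancy):
    """Collapse occupancy[z][y][x] (semantic labels) into a 0-1 map bev[y][x].

    A BEV cell is 1 if any voxel in its height column is non-empty.
    """
    depth, height, width = len(occupancy), len(occupancy[0]), len(occupancy[0][0])
    return [
        [int(any(occupancy[z][y][x] != EMPTY for z in range(depth))) for x in range(width)]
        for y in range(height)
    ]


def safety_cost(bev, trajectory):
    """Count waypoints (x, y cell indices) that land on occupied BEV cells."""
    cost = 0
    for x, y in trajectory:
        if 0 <= y < len(bev) and 0 <= x < len(bev[0]):
            cost += bev[y][x]
    return cost


def select_trajectory(bev, candidates):
    """Return the candidate with the lowest safety cost (first one on ties)."""
    return min(candidates, key=lambda traj: safety_cost(bev, traj))


# Toy scene: 2 height layers over a 3x3 BEV grid with one occupied voxel.
occupancy = [[[EMPTY] * 3 for _ in range(3)] for _ in range(2)]
occupancy[1][1][1] = 4  # some non-empty class id at z=1, y=1, x=1

bev = squeeze_to_bev(occupancy)
straight = [(1, 0), (1, 1), (1, 2)]  # drives through the occupied cell
swerve = [(1, 0), (2, 1), (1, 2)]    # goes around it
best = select_trajectory(bev, [straight, swerve])
print(bev, best)
```

Since the occupancy-derived map also marks background structures as occupied, it penalizes more cells than a map built from 3D boxes alone, which is why the two safety costs need normalization before comparison.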

4 OpenOcc: 3D Occupancy Benchmark

To fairly evaluate the performance of occupancy across literature, we introduce the first 3D occupancy benchmark named as OpenOcc built on top of the prevailing nuScenes dataset[4, 10]. Compared with existing counterparts such as SemanticKITTI[3] with only front camera, OpenOcc provides surrounding camera views with the corresponding 3D occupancy and flow annotations.

4.1 Benchmark Overview

We generate occupancy data with dense and high-quality annotations using the sparse LiDAR information and 3D boxes. It comprises 34149 annotated frames for all 700 training and 150 validation scenes. We annotate over 1.4 billion voxels with 16 classes in the benchmark, including 10 foreground object classes and 6 background stuff classes. Moreover, we take foreground object motion into consideration with additional flow annotation of object voxels. We compare our occupancy data with other benchmarks in Table 2, indicating that our benchmark provides the most complete representation of the scene, including occupancy and flow information. As depicted in Figure 3, SparseOcc [17] only utilizes the sparse key-frame LiDAR data to voxelize the 3D space, which is too sparse to represent the 3D scene. In comparison, our occupancy can represent the complete scene with flow information and capture the local fine-grained scene geometry with high quality.

| Dataset | Multi-view | Scenes | Flow | Density |
| --- | --- | --- | --- | --- |
| SemanticKITTI [3] | - | 22 | - | - |
| SparseOcc [17] | ✓ | 850 | - | $\sim 0.11$ |
| OccData [9] | ✓ | 850 | - | $\sim 0.76$ |
| OpenOcc (Ours) | ✓ | 850 | ✓ | 1 |

Table 2: Comparison of OpenOcc with existing benchmarks. Multi-view indicates that the dataset uses multi-view images as input. Flow indicates that flow annotations are provided. Density measures the voxel density of the dataset.

Figure 3: Visual comparison on 3D occupancy annotations. Compared to (a) sparse occupancy[17] and (b) OccData[9], we generate (c) dense and high-quality annotations with (d) the additional flow annotation of foreground objects, which can be applied for motion planning.

| Method | Backbone | IoU$_{geo}$ | mIoU | barrier | bicycle | bus | car | const. veh. | motorcycle | pedestrian | traffic cone | trailer | truck | driv. surf. | other flat | sidewalk | terrain | manmade | vegetation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BEVDet4D [15] | ResNet50 | 18.27 | 9.85 | 13.56 | 0.00 | 13.04 | 26.98 | 0.61 | 1.20 | 6.76 | 0.93 | 1.93 | 12.63 | 27.23 | 11.09 | 13.64 | 12.04 | 6.42 | 9.56 |
| BEVDepth [19] | ResNet50 | 23.45 | 11.88 | 15.15 | 0.02 | 20.75 | 27.05 | 1.10 | 2.01 | 9.69 | 1.45 | 1.91 | 14.31 | 31.92 | 7.88 | 17.08 | 16.27 | 8.76 | 14.75 |
| BEVDet [16] | ResNet50 | 27.46 | 12.49 | 16.06 | 0.11 | 18.27 | 21.09 | 2.62 | 1.42 | 7.78 | 1.08 | 3.4 | 13.76 | 33.89 | 10.84 | 17.55 | 22.03 | 11.72 | 18.15 |
| OccNet (ours) | ResNet50 | 37.69 | 19.48 | 20.63 | 5.52 | 24.16 | 27.72 | 9.79 | 7.73 | 13.38 | 7.18 | 10.68 | 18.00 | 46.13 | 20.6 | 26.75 | 29.37 | 16.90 | 27.21 |
| TPVFormer$^{\ast}$ [17] | ResNet101 | 37.47 | 23.67 | 27.95 | 12.75 | 33.24 | 38.70 | 12.41 | 17.84 | 11.65 | 8.49 | 16.42 | 26.47 | 47.88 | 25.43 | 30.62 | 30.18 | 15.51 | 23.12 |
| OccNet (ours) | ResNet101 | 41.08 | 26.98 | 29.77 | 16.89 | 34.16 | 37.35 | 15.58 | 21.92 | 21.29 | 16.75 | 16.37 | 26.23 | 50.74 | 27.93 | 31.98 | 33.24 | 20.80 | 30.68 |

Table 3: 3D Occupancy Prediction in terms of Semantic Scene Completion. The semantic occupancy prediction and geometric prediction metrics are compared for models with RGB input. OccNet significantly outperforms previous SOTAs in terms of mIoU and $\text{IoU}_{geo}$. Methods marked with $^{\ast}$ are trained and evaluated on the OpenOcc dataset.

Figure 4: Qualitative results of occupancy prediction. Our method outperforms TPVFormer [17] in terms of scene details and the semantic classification accuracy of foreground objects, such as the pedestrian in the dashed region.

4.2 Generating High-quality Annotation

Independent Accumulation of Background and Foreground. To generate a dense representation, it is intuitive to accumulate all sparse LiDAR points from the key frames and intermediate frames [3]. However, directly accumulating points from intermediate frames by coordinate transformation is problematic because of moving objects. We propose to split the LiDAR points into static background points and foreground object points based on the 3D boxes and to accumulate them separately: static background points are accumulated in the global world coordinate system, and object points in the object coordinate system, to generate dense points.
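
The separate accumulation can be sketched as follows. This is a minimal illustration rather than the benchmark's actual pipeline: the frame dictionary layout, the names `accumulate_separately`, `in_box`, `sensor_to_world` and `object_to_sensor`, and the restriction to a single object are all assumptions made for the example.

```python
import numpy as np

def transform(points, matrix):
    """Apply a 4x4 rigid transform to (N, 3) points."""
    homogeneous = np.hstack([points, np.ones((points.shape[0], 1))])
    return (homogeneous @ matrix.T)[:, :3]

def accumulate_separately(frames):
    """Accumulate background in the world frame and object points in the object frame.

    Each frame is a dict (hypothetical layout) with:
      points           (N, 3) LiDAR points in the sensor frame
      in_box           (N,) bool mask, True for points inside the object's 3D box
      sensor_to_world  4x4 pose of the sensor in the world frame
      object_to_sensor 4x4 pose of the object box in the sensor frame
    """
    background, foreground = [], []
    for frame in frames:
        pts, mask = frame["points"], frame["in_box"]
        # Static points line up once they are expressed in world coordinates.
        background.append(transform(pts[~mask], frame["sensor_to_world"]))
        # Points on a moving object only line up in the object's own frame.
        sensor_to_object = np.linalg.inv(frame["object_to_sensor"])
        foreground.append(transform(pts[mask], sensor_to_object))
    return np.vstack(background), np.vstack(foreground)
```

With several objects, the same idea would presumably be applied per object instance, keeping one accumulated point set for each box.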

Generation of Annotation. Given dense background and object points, we first voxelize the 3D space and label each voxel by the majority vote of the labelled points it contains. Unlike existing benchmarks that provide only occupancy labels, we annotate the flow velocity of each voxel based on the 3D box velocity to facilitate downstream tasks such as motion planning. Using only key frames makes the generated occupancy data sparse, so we annotate the voxels containing unlabeled LiDAR points from intermediate frames based on the surrounding labelled voxels to further improve the data density. In addition, as nuScenes has the issue of missing translation in the z-axis, we refine the occupancy data by completing the scene, for example by filling holes on the road, for higher quality. Moreover, we mark part of the voxels as invisible from the camera view by ray tracing, which makes the data more applicable to tasks with camera input.
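
The majority-vote labelling step can be illustrated with a short sketch. It assumes each point already carries a flat voxel index and an integer class label; the function name `majority_vote` and the tie-breaking rule (lowest class index wins) are choices made for this example, not details given in the paper.

```python
import numpy as np

def majority_vote(voxel_ids, point_labels, num_classes):
    """Return the occupied voxel ids and the most frequent label in each.

    voxel_ids    (N,) flat voxel index of every labelled point
    point_labels (N,) integer class label of every point, in [0, num_classes)
    """
    unique_ids, inverse = np.unique(voxel_ids, return_inverse=True)
    inverse = inverse.reshape(-1)
    votes = np.zeros((unique_ids.size, num_classes), dtype=np.int64)
    # Count one vote per point in the (voxel, class) histogram.
    np.add.at(votes, (inverse, point_labels), 1)
    # argmax resolves ties in favour of the lowest class index (an assumption).
    return unique_ids, votes.argmax(axis=1)
```

For example, three points with labels 1, 1 and 2 that fall in the same voxel yield label 1 for that voxel.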

5 Experiments

Benchmark Details. We select a volume of $\mathcal{V}=[-50\text{m},50\text{m}]\times[-50\text{m},50\text{m}]\times[-5\text{m},3\text{m}]$ in the LiDAR coordinate system for occupancy data generation, and voxelize the 3D space at a resolution of $\Delta s=0.5\text{m}$ into $200\times 200\times 16$ voxels. The evaluation metrics are detailed in the Supplementary (Appendix A).
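
As a quick check of the grid arithmetic, the sketch below maps LiDAR-frame points to voxel indices for the stated volume and resolution. The constant and function names are illustrative, and treating each axis as a half-open interval (points exactly on the upper boundary are dropped) is an assumption.

```python
import numpy as np

RANGE_MIN = np.array([-50.0, -50.0, -5.0])   # metres, LiDAR frame
RANGE_MAX = np.array([50.0, 50.0, 3.0])
VOXEL_SIZE = 0.5                              # Delta s in metres

# (100 / 0.5, 100 / 0.5, 8 / 0.5) = (200, 200, 16)
GRID_SHAPE = np.round((RANGE_MAX - RANGE_MIN) / VOXEL_SIZE).astype(int)

def points_to_voxels(points):
    """Return (x, y, z) voxel indices of in-range points and the in-range mask."""
    idx = np.floor((points - RANGE_MIN) / VOXEL_SIZE).astype(int)
    valid = np.all((idx >= 0) & (idx < GRID_SHAPE), axis=1)
    return idx[valid], valid
```

Under these assumptions the LiDAR origin falls in voxel (100, 100, 10), and the whole grid contains 200 × 200 × 16 = 640,000 voxels per frame.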

OccNet Details. Following the experimental setting of BEVFormer [21], we use two types of backbone: ResNet50 [12] initialized from ImageNet [8], and ResNet101-DCN [12] initialized from FCOS3D [37]. We define the BEV feature as $B_t$ with $H=200$, $W=200$, and $C_{\text{BEV}}=256$. For the decoder, we design $N=4$ occupancy feature maps $V_{t,i}^{\prime}\in\mathbb{R}^{Z_i\times H\times W\times C_i}$ with $Z_i=2^i$, $C_1=C_2=128$, and $C_3=C_4=64$. For the voxel-spatial cross attention, we sample $N_{ref,i}=4$ points in each queried voxel. By default, we train OccNet for 24 epochs with a learning rate of $2\times 10^{-4}$.

5.1 Main Results

| Method | Input | mIoU | barrier | bicycle | bus | car | const. veh. | motorcycle | pedestrian | traffic cone | trailer | truck | driv. surf. | other flat | sidewalk | terrain | manmade | vegetation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RangeNet++ [29] | LiDAR | 65.50 | 66.00 | 21.30 | 77.20 | 80.90 | 30.20 | 66.80 | 69.60 | 52.10 | 54.20 | 72.20 | 94.10 | 66.60 | 63.50 | 70.10 | 83.10 | 79.80 |
| Cylinder3D [41] | LiDAR | 76.10 | 76.40 | 40.30 | 91.20 | 93.80 | 51.30 | 78.00 | 78.90 | 64.90 | 62.10 | 84.40 | 96.80 | 71.60 | 76.40 | 75.40 | 90.50 | 87.40 |
| TPVFormer$^{\ast}$ [17] | Camera | 58.45 | 65.99 | 24.50 | 80.88 | 74.28 | 47.04 | 47.09 | 33.42 | 14.52 | 53.96 | 70.79 | 88.55 | 61.63 | 59.46 | 63.15 | 75.76 | 74.17 |
| OccNet (ours) | Camera | 60.46 | 66.95 | 32.58 | 77.37 | 73.88 | 37.62 | 50.87 | 51.45 | 33.69 | 52.20 | 67.08 | 88.72 | 57.99 | 58.04 | 63.06 | 78.91 | 76.97 |

Table 4: The performance of OccNet (ResNet101) on the nuScenes validation set for the LiDAR segmentation task. OccNet with camera input is comparable with LiDAR-based methods. Methods marked with $^{\ast}$ are trained from scratch on the OpenOcc dataset.

Semantic Scene Completion. We compare OccNet with previous state-of-the-art methods on the semantic scene completion task in Table 3 and Figure 4. We reproduce BEVDet4D [15], BEVDepth [19] and BEVDet [16] by replacing the detection head with a scene completion head built on their BEV feature maps, and OccNet outperforms these methods by a large margin, as shown in Table 3. Compared with the BEV feature map, our occupancy descriptor is better suited to the voxel-wise prediction task. We also compare OccNet with TPVFormer [17], which was developed for surrounding 3D semantic occupancy prediction, and our model surpasses it by 3.31 points in mIoU (26.98 vs. 23.67), indicating that the occupancy descriptor is better than TPV features for scene representation. Note that TPVFormer surpasses OccNet on car, truck and trailer, because samples of these three classes are relatively large in the benchmark and TPVFormer learns better features on these classes from its sampling strategy. However, for small objects such as pedestrian and traffic cone, our method outperforms TPVFormer [17] by a large margin of roughly 8 to 10 points (21.29 vs. 11.65 and 16.75 vs. 8.49) in Table 3.

Occupancy for LiDAR Segmentation. Occupancy is a voxelized representation of points in 3D space, and semantic scene completion is equivalent to the semantic LiDAR prediction task when $\Delta s\to 0$. We transfer semantic occupancy prediction to LiDAR segmentation by assigning each point the label of its associated voxel, and then evaluate the model on the mIoU metric. As reported in Table 4, given camera input without LiDAR supervision, OccNet is comparable with the LiDAR segmentation model RangeNet++ [29] in terms of mIoU (60.46 vs. 65.50), and OccNet even outperforms RangeNet++ in the IoU of bicycle (32.58 vs. 21.30). OccNet also outperforms TPVFormer [17] by 2 points in mIoU (60.46 vs. 58.45).
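
The voxel-to-point label transfer can be sketched as a simple lookup. The function name, the (x, y, z) axis order of the label grid and the `ignore_label` value for out-of-range points are assumptions made for this illustration.

```python
import numpy as np

def labels_from_voxels(points, voxel_labels, range_min, voxel_size, ignore_label=255):
    """Give every LiDAR point the semantic label of the voxel that contains it.

    points       (N, 3) coordinates in the same frame as the occupancy grid
    voxel_labels (X, Y, Z) integer label grid predicted by the occupancy model
    """
    idx = np.floor((points - range_min) / voxel_size).astype(int)
    shape = np.array(voxel_labels.shape)
    valid = np.all((idx >= 0) & (idx < shape), axis=1)
    # Points outside the grid keep the ignore label (hypothetical convention).
    labels = np.full(points.shape[0], ignore_label, dtype=voxel_labels.dtype)
    labels[valid] = voxel_labels[idx[valid, 0], idx[valid, 1], idx[valid, 2]]
    return labels
```

Because all points inside one 0.5m voxel receive the same label, the transferred segmentation cannot resolve class boundaries finer than the voxel size, which is consistent with the equivalence only holding as the voxel size goes to zero.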

Occupancy for 3D Detection. In the scene completion task, the location of foreground objects is coarsely regressed, which can help the 3D detection task with 3D box regression. As shown in Table 5, joint training of scene completion and 3D detection improves NDS for all three of our models (BEVNet, VoxNet and OccNet) and improves mAP for BEVNet and VoxNet, while the mAP of OccNet is unchanged. Note that the voxelized representation of occupancy with $\Delta s=0.5\text{m}$ is too coarse when calculating metrics that depend on the precise center distance and IoU of the 3D box, and thus mATE and mASE increase slightly with joint training in most cases.

Pretrained Occupancy for 3D Detection and BEV Segmentation. OccNet trained on the semantic scene completion task obtains a general representation of 3D space because the scene is reconstructed in the occupancy descriptor. Thus, the learned occupancy descriptor can be directly transferred to downstream 3D perception tasks with model fine-tuning. As shown in Figure 5, 3D detection with pretrained OccNet is superior to that pretrained on the FCOS3D [37] detector across different scales of training dataset, with a performance gain of about 10 points in mAP and NDS. We also compare occupancy pretraining and detection pretraining for the BEV segmentation task; as shown in Table 6, occupancy pretraining helps BEV segmentation achieve higher IoU in the fine-tuning stage on both semantic and map segmentation.

| Method | Joint | mAP ↑ | NDS ↑ | mAOE ↓ | mAVE ↓ | mAAE ↓ | mATE ↓ | mASE ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BEVNet | - | 0.259 | 0.377 | 0.600 | 0.592 | 0.216 | 0.828 | 0.290 |
| BEVNet | ✓ | 0.271 | 0.390 | 0.578 | 0.541 | 0.211 | 0.835 | 0.293 |
| VoxNet | - | 0.271 | 0.380 | 0.603 | 0.616 | 0.219 | 0.832 | 0.284 |
| VoxNet | ✓ | 0.277 | 0.387 | 0.586 | 0.614 | 0.203 | 0.828 | 0.285 |
| OccNet | - | 0.276 | 0.382 | 0.655 | 0.588 | 0.209 | 0.817 | 0.290 |
| OccNet | ✓ | 0.276 | 0.390 | 0.585 | 0.570 | 0.190 | 0.842 | 0.292 |

Table 5: Joint training of 3D occupancy and 3D detection. Results reported on nuScenes validation set show that joint training of 3D occupancy and 3D detection can help the latter task.

Figure 5: Comparison of detector performance using different pretrained models and different scales of training dataset. OccNet (sparse) and OccNet (dense) denote OccNet trained on sparse and dense occupancy data, respectively. Best viewed in color.

| Task | Mean value | Drivable area | Lane | Vehicle | Pedestrian |
| --- | --- | --- | --- | --- | --- |
| Det | 18.18 | 44.59 | 13.62 | 12.29 | 2.21 |
| Occ | 19.17 | 47.21 | 13.83 | 12.91 | 2.74 |

Table 6: Different pretraining tasks for BEV segmentation. Occupancy task can help BEV segmentation task achieve higher IoU.

Occupancy for Planning. With the prediction results from upstream tasks, i.e., bounding boxes and occupancy, the final trajectory is obtained through a cost filter and a GRU refinement module [13] that take BEV segmentation as input. To obtain these segmentation results, we rasterize the outputs of OccNet in BEV space. We compare the rasterisation results of bounding boxes and occupancy using the predictions from OccNet, and also compare with the direct segmentation from ST-P3 [13]. For a fair comparison, we follow the same setup as ST-P3 and keep only the vehicle and pedestrian classes. We also add ground-truth rasterisation inputs as a reference. As shown in Table 7, the best performance is obtained by using the occupancy ground truth to filter trajectories. For predicted results, the collision rate is reduced by 15% - 58% with the occupancy prediction from OccNet. We also conduct an experiment using all 16 occupancy classes, which improves the L2 distance. As shown in Figure 6, planning with all occupancy classes can make decisions within the feasible area and avoid collisions with background objects.
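
One plausible way to rasterize a semantic occupancy grid into a BEV mask is to mark a BEV cell whenever any voxel in its vertical column belongs to a class of interest. The paper does not spell out its rasterisation rule, so the projection below and the name `rasterize_to_bev` are assumptions.

```python
import numpy as np

def rasterize_to_bev(occ_labels, keep_classes):
    """Project an (X, Y, Z) semantic occupancy grid onto an (X, Y) BEV mask.

    A BEV cell is True if any voxel in its column carries a label listed in
    keep_classes (for example the class ids of vehicle and pedestrian).
    """
    return np.isin(occ_labels, keep_classes).any(axis=2)
```

Passing only the vehicle and pedestrian class ids would mirror the restricted setup described above, while passing all occupied class ids would correspond to the full-class variant.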

| Input | Collision (%) ↓ 1s | Collision (%) ↓ 2s | Collision (%) ↓ 3s | L2 (m) ↓ 1s | L2 (m) ↓ 2s | L2 (m) ↓ 3s |
| --- | --- | --- | --- | --- | --- | --- |
| Bbox GT | 0.23 | 0.66 | 1.50 | 1.32 | 2.16 | 3.00 |
| Occupancy GT | 0.20 | 0.56 | 1.30 | 1.29 | 2.13 | 2.98 |
| Segmentation pred. [13] | 0.50 | 0.88 | 1.49 | 1.39 | 2.21 | 3.02 |
| Bbox pred. (OccNet) | 0.27 | 0.68 | 1.59 | 1.32 | 2.17 | 3.03 |
| Occupancy pred. (OccNet) | 0.21 | 0.55 | 1.35 | 1.31 | 2.18 | 3.07 |
| Occupancy pred. (OccNet, full) | 0.21 | 0.59 | 1.37 | 1.29 | 2.13 | 2.99 |

Table 7: Planning results with different scene representations. Occupancy representation helps the planning task achieve a lower collision rate and more accurate L2 distance in all time intervals.

Figure 6: Visualization of planning. The blue line represents the planned trajectory, and the lower figures are rasterisation results of bounding box and occupancy, respectively.

5.2 Discussion

Model Efficiency. In Table 8, we compare the performance and efficiency of different models on the semantic scene completion task. Compared with BEVNet and VoxelNet, OccNet obtains the best mIoU and $\text{IoU}_{geo}$, with a parameter count close to that of BEVNet (40M vs. 39M), lower memory usage than VoxelNet (18G vs. 23G), and a speed between the two (2.6 FPS vs. 4.5 and 1.9 FPS).

| Method | mIoU | $\text{IoU}_{geo}$ | Params ↓ | Memory ↓ | FPS ↑ |
| --- | --- | --- | --- | --- | --- |
| BEVNet | 17.37 | 36.11 | 39M | 8G | 4.5 |
| VoxelNet | 19.06 | 37.59 | 72M | 23G | 1.9 |
| OccNet | 19.48 | 37.69 | 40M | 18G | 2.6 |

Table 8: Efficiency and performance analysis with model structure. The evaluation is measured on a V100 GPU.

Irregular Object. Representing irregular objects such as construction vehicles, or background stuff such as traffic signs, with 3D boxes is difficult and inaccurate, as indicated in Figure 7. We convert 3D boxes into voxels to compare the 3D detection and occupancy tasks on irregular objects in Table 9, verifying that occupancy describes irregular objects better. To study the effect of voxel size, we also generate the dataset with $\Delta s=0.25\text{m}$. As $\Delta s$ decreases from 0.5m to 0.25m, the performance gap between 3D box and occupancy increases (from 2.21 to 3.51 points in $\text{mIoU}_{ir}$) because the finer granularity better depicts irregular objects.
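
Converting a 3D box into voxels could look like the sketch below, which marks every voxel whose center lies inside a yaw-rotated box. The exact conversion used for Table 9 is not described, so the center-inside criterion, the (length, width, height) size convention and the name `voxelize_box` are assumptions.

```python
import numpy as np

def voxelize_box(center, size, yaw, range_min, voxel_size, grid_shape):
    """Return an (X, Y, Z) bool grid, True where the voxel center is inside the box.

    center (3,) box center; size (3,) extents along the box's own x, y, z axes;
    yaw    rotation of the box about the z axis in radians.
    """
    axes = [range_min[d] + (np.arange(grid_shape[d]) + 0.5) * voxel_size
            for d in range(3)]
    gx, gy, gz = np.meshgrid(*axes, indexing="ij")
    dx, dy, dz = gx - center[0], gy - center[1], gz - center[2]
    cos, sin = np.cos(yaw), np.sin(yaw)
    # Rotate the offsets by -yaw to express them in the box frame.
    local_x = cos * dx + sin * dy
    local_y = -sin * dx + cos * dy
    half = np.asarray(size, dtype=float) / 2.0
    return ((np.abs(local_x) <= half[0])
            & (np.abs(local_y) <= half[1])
            & (np.abs(dz) <= half[2]))
```

A box filled this way always occupies a full cuboid, so for an object such as a construction vehicle with a raised arm it marks empty space as occupied, which is the kind of error the IoU comparison against occupancy labels exposes.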

| Task | $\Delta s$ | $\text{mIoU}_{ir}$ | truck | trailer | cons. veh. |
| --- | --- | --- | --- | --- | --- |
| Det | 0.5m | 15.92 | 23.19 | 9.57 | 15.00 |
| Occ | 0.5m | 18.13 (+2.21) | 25.59 | 12.29 | 16.51 |
| Det | 0.25m | 9.90 | 14.84 | 5.10 | 9.75 |
| Occ | 0.25m | 13.41 (+3.51) | 18.85 | 7.14 | 14.25 |

Table 9: The comparison of detection task and occupancy task on the recognition of irregular objects. $\text{mIoU}_{ir}$ denotes the mean IoU of truck, trailer and construction vehicle.

Figure 7: Visualization of 3D box and occupancy prediction.

Dense vs. Sparse Occupancy. Compared with sparse occupancy, dense occupancy depicts the complete geometry of background and foreground objects in detail, as shown in Figure 3. Intuitively, dense occupancy is better for 3D perception and motion planning because it provides richer input information. We validate that a model pretrained on dense occupancy benefits the downstream 3D detection task more, as shown in Figure 5.

6 Conclusion

We explore the potential of 3D occupancy as a scene representation and propose a general framework, OccNet, to evaluate the idea. Experiments on various downstream tasks validate the effectiveness of our method. The OpenOcc benchmark with dense and high-quality labels is also provided for the community.

Limitations and future work. Currently, the annotation is still based on a well-established dataset. Utilizing self-supervised learning to further reduce the human-annotation cost is a valuable direction. We hope the occupancy framework can serve as a foundation model for autonomous driving.

References

[1] CES 2020 by Mobileye. https://youtu.be/HPWGFzqd7pI, 2020.
[2] Tesla AI Day. https://www.youtube.com/watch?v=j0z4FweCy4M, 2021.
[3] Jens Behley, Martin Garbade, Andres Milioto, Jan Quenzel, Sven Behnke, Cyrill Stachniss, and Jurgen Gall. Semantickitti: A dataset for semantic scene understanding of lidar sequences. In ICCV, pages 9297–9307, 2019.
[4] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In CVPR, pages 11621–11631, 2020.
[5] Anh-Quan Cao and Raoul de Charette. Monoscene: Monocular 3d semantic scene completion. In CVPR, pages 3991–4001, 2022.
[6] Anh-Quan Cao and Raoul de Charette. Scenerf: Self-supervised monocular 3d scene reconstruction with radiance fields. arXiv preprint arXiv:2212.02501, 2022.
[7] Li Chen, Chonghao Sima, Yang Li, Zehan Zheng, Jiajie Xu, Xiangwei Geng, Hongyang Li, Conghui He, Jianping Shi, Yu Qiao, et al. Persformer: 3d lane detection via perspective transformer and the openlane benchmark. In ECCV, pages 550–567. Springer, 2022.
[8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
[9] Ming Fang and Zhiqi Li. occupancy-for-nuscenes. https://github.com/FANG-MING/occupancy-for-nuscenes, 2023.
[10] Whye Kit Fong, Rohit Mohan, Juana Valeria Hurtado, Lubing Zhou, Holger Caesar, Oscar Beijbom, and Abhinav Valada. Panoptic nuscenes: A large-scale benchmark for lidar panoptic segmentation and tracking. In ICRA, 2022.
[11] R.I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521540518, second edition, 2004.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
[13] Shengchao Hu, Li Chen, Penghao Wu, Hongyang Li, Junchi Yan, and Dacheng Tao. St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning. In ECCV, pages 533–549. Springer, 2022.
[14] Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, Lewei Lu, Xiaosong Jia, Qiang Liu, Jifeng Dai, Yu Qiao, and Hongyang Li. Planning-oriented autonomous driving. In CVPR, 2023.
[15] Junjie Huang and Guan Huang. Bevdet4d: Exploit temporal cues in multi-camera 3d object detection. arXiv preprint arXiv:2203.17054, 2022.
[16] Junjie Huang, Guan Huang, Zheng Zhu, and Dalong Du. Bevdet: High-performance multi-camera 3d object detection in bird-eye-view. arXiv preprint arXiv:2112.11790, 2021.
[17] Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, and Jiwen Lu. Tri-perspective view for vision-based 3d semantic occupancy prediction. arXiv preprint arXiv:2302.07817, 2023.
[18] Qi Li, Yue Wang, Yilun Wang, and Hang Zhao. Hdmapnet: An online hd map construction and evaluation framework. In ICRA, pages 4628–4634. IEEE, 2022.
[19] Yinhao Li, Zheng Ge, Guanyi Yu, Jinrong Yang, Zengran Wang, Yukang Shi, Jianjian Sun, and Zeming Li. Bevdepth: Acquisition of reliable depth for multi-view 3d object detection. arXiv preprint arXiv:2206.10092, 2022.
[20] Yiming Li, Zhiding Yu, Christopher Choy, Chaowei Xiao, Jose M Alvarez, Sanja Fidler, Chen Feng, and Anima Anandkumar. Voxformer: Sparse voxel transformer for camera-based 3d semantic scene completion. arXiv preprint arXiv:2302.12251, 2023.
[21] Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Yu Qiao, and Jifeng Dai. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In ECCV, pages 1–18. Springer, 2022.
[22] Tingting Liang, Hongwei Xie, Kaicheng Yu, Zhongyu Xia, Zhiwei Lin, Yongtao Wang, Tao Tang, Bing Wang, and Zhi Tang. Bevfusion: A simple and robust lidar-camera fusion framework. arXiv preprint arXiv:2205.13790, 2022.
[23] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, pages 2117–2125, 2017.
[24] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, pages 2980–2988, 2017.
[25] Zhijian Liu, Haotian Tang, Alexander Amini, Xingyu Yang, Huizi Mao, Daniela Rus, and Song Han. Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. arXiv preprint arXiv:2205.13542, 2022.
[26] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[27] Ruihang Miao, Weizhou Liu, Mingrui Chen, Zheng Gong, Weixin Xu, Chen Hu, and Shuchang Zhou. Occdepth: A depth-aware method for 3d semantic scene completion. arXiv preprint arXiv:2302.13540, 2023.
[28] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
[29] Andres Milioto, Ignacio Vizzo, Jens Behley, and Cyrill Stachniss. Rangenet++: Fast and accurate lidar semantic segmentation. In IROS, pages 4213–4220. IEEE, 2019.
[30] Julien Moras, J. Dezert, and Benjamin Pannetier. Grid occupancy estimation for environment perception based on belief functions and pcr6. volume 9474, 04 2015.
[31] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph., 41(4):102:1–102:15, July 2022.
[32] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In ECCV, 2012.
[33] Shaoshuai Shi, Chaoxu Guo, Li Jiang, Zhe Wang, Jianping Shi, Xiaogang Wang, and Hongsheng Li. Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. In CVPR, pages 10529–10538, 2020.
[34] Shuran Song, Fisher Yu, Andy Zeng, Angel X Chang, Manolis Savva, and Thomas Funkhouser. Semantic scene completion from a single depth image. In CVPR, pages 1746–1754, 2017.
[35] Matthew Tancik, Vincent Casser, Xinchen Yan, Sabeek Pradhan, Ben Mildenhall, Pratul P. Srinivasan, Jonathan T. Barron, and Henrik Kretzschmar. Block-nerf: Scalable large scene neural view synthesis. In CVPR, pages 8248–8258, June 2022.
[36] Tai Wang, Jiangmiao Pang, and Dahua Lin. Monocular 3d object detection with depth from motion. In ECCV, pages 386–403. Springer, 2022.
[37] Tai Wang, Xinge Zhu, Jiangmiao Pang, and Dahua Lin. Fcos3d: Fully convolutional one-stage monocular 3d object detection. In ICCV, pages 913–922, 2021.
[38] Xiaofeng Wang, Zheng Zhu, Wenbo Xu, Yunpeng Zhang, Yi Wei, Xu Chi, Yun Ye, Dalong Du, Jiwen Lu, and Xingang Wang. Openoccupancy: A large scale benchmark for surrounding semantic occupancy perception. arXiv preprint arXiv:2303.03991, 2023.
[39] Brady Zhou and Philipp Krähenbühl. Cross-view transformers for real-time map-view semantic segmentation. In CVPR, pages 13760–13769, 2022.
[40] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable transformers for end-to-end object detection. In ICLR, 2021.
[41] Xinge Zhu, Hui Zhou, Tai Wang, Fangzhou Hong, Yuexin Ma, Wei Li, Hongsheng Li, and Dahua Lin. Cylindrical and asymmetrical 3d convolution networks for lidar segmentation. In CVPR, pages 9939–9948, 2021.

Appendix

We provide the details of the evaluation metrics in the supplementary materials, along with more related work, visualizations, implementation and training details, further ablations of OccNet, and details of BEVNet, VoxelNet and the OpenOcc post-processing.

Appendix A Evaluation Metrics

Semantic Scene Completion (SSC) Metric. For the scene completion task, we predict the semantic label of each voxel in 3D space. The evaluation metric is defined by mean intersection-over-union (mIoU) over all classes:

$$\text{mIoU}=\frac{1}{C}\sum_{c=1}^{C}\frac{\text{TP}_{c}}{\text{TP}_{c}+\text{FP}_{c}+\text{FN}_{c}}, \tag{2}$$

where $C=16$ is the number of classes in the benchmark, and $\text{TP}_{c}$, $\text{FP}_{c}$ and $\text{FN}_{c}$ denote the true positive, false positive, and false negative predictions for class $c$, respectively. In addition, we consider the class-agnostic metric $\text{IoU}_{geo}$ to evaluate the geometric reconstruction quality of the scene.
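
A small sketch of both metrics on flattened voxel labels is given below. It follows Eq. (2) directly; skipping classes that appear in neither the prediction nor the ground truth, and defining $\text{IoU}_{geo}$ on binary occupied/free masks, are assumptions about details the text leaves open.

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """mIoU over classes, with IoU_c = TP_c / (TP_c + FP_c + FN_c)."""
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        denom = tp + fp + fn
        # A class absent from both pred and gt is skipped (assumption).
        ious.append(tp / denom if denom > 0 else np.nan)
    return float(np.nanmean(ious))

def iou_geo(pred_occupied, gt_occupied):
    """Class-agnostic IoU between boolean occupied masks."""
    intersection = np.sum(pred_occupied & gt_occupied)
    union = np.sum(pred_occupied | gt_occupied)
    return float(intersection / union) if union > 0 else float("nan")
```

For instance, with predictions `[0, 0, 1, 1]` and ground truth `[0, 1, 1, 1]`, the per-class IoUs are 1/2 and 2/3, giving an mIoU of 7/12.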

3D Object Detection Metric. We use the official evaluation metrics of the nuScenes dataset [4], including nuScenes detection score (NDS), mean average precision (mAP), average translation error (ATE), average scale error (ASE), average orientation error (AOE), average velocity error (AVE) and average attribute error (AAE).

Motion Planning Metric. For planning evaluation, we follow the metrics in ST-P3 [13]. In detail, the L2 distance between the planned trajectory and the ground-truth trajectory measures regression accuracy, and the collision rate (CR) with other vehicles and pedestrians measures the safety of future actions.
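
The two planning metrics can be illustrated in simplified form. This sketch does not reproduce the exact ST-P3 protocol: it treats the ego vehicle as a single BEV cell per time step and averages the L2 error over all waypoints, both of which are simplifying assumptions, as are the function names.

```python
import numpy as np

def l2_error(planned, ground_truth):
    """Mean Euclidean distance between two (T, 2) trajectories in metres."""
    return float(np.linalg.norm(planned - ground_truth, axis=1).mean())

def collides(trajectory_cells, obstacle_mask):
    """True if any waypoint lands on an occupied BEV cell.

    trajectory_cells (T, 2) integer BEV indices of the planned waypoints
    obstacle_mask    (X, Y) bool mask of vehicles and pedestrians
    """
    xs, ys = trajectory_cells[:, 0], trajectory_cells[:, 1]
    return bool(obstacle_mask[xs, ys].any())
```

The collision rate over a dataset would then be the fraction of evaluated trajectories for which `collides` returns True.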

Appendix B More Related Work

BEV segmentation [39, 18] implicitly squeezes the height information into each cell of the BEV map. However, in some challenging urban settings, explicit height information is necessary to capture entities above the ground, e.g., traffic lights and overpasses. As an alternative, 3D occupancy is 3D geometry-aware.

Appendix C Implementation Detail of OccNet

Backbone and Multi-scale Features.

Following previous works [21, 37], we adopt ResNet101 [12] as the backbone with FPN [23] to extract multi-scale features from multi-view images. We use the output features from stages $S_3$, $S_4$, and $S_5$ of ResNet101, where $S_n$ denotes a downsampling factor of $1/2^{n}$ with feature dimension $C_n=256\times 2^{n-2}$. In the FPN, the features are aggregated and transformed into three levels with sizes of 1/16, 1/32, and 1/64 and a dimension of $C_n=256$.

BEV Encoder.

The BEV encoder follows the structure of BEVFormer [21], where the multi-scale features from FPN are transformed into the BEV feature. The BEV encoder includes 2 encoder layers with temporal self-attention and spatial cross-attention. The BEV query $Q_t$ is gradually refined in the encoder layers with the spatial-temporal transformer mechanism to learn the scene representation in BEV space.

Feature Transformation in Voxel Decoder.

To lift the voxel feature $V_{t,i}^{\prime}\in\mathbb{R}^{Z_i\times H\times W\times C_i}$ to $V_{t,i+1}^{\prime}\in\mathbb{R}^{Z_{i+1}\times H\times W\times C_{i+1}}$, we use an MLP to transform the feature dimension from $Z_i\times C_i$ to $Z_{i+1}\times C_{i+1}$. To implement the spatial cross attention for $V_{t,i}^{\prime}$, the multi-scale image features from FPN with dimension $C_n=256$ are transformed into dimension $C_i$ using an MLP.
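
The lifting step can be pictured as flattening each vertical column of the voxel feature into one vector, passing it through an MLP, and reshaping the result to the taller grid. The sketch below uses plain NumPy with a single ReLU hidden layer; the hidden layer, the weight names and the function name are assumptions, since the text only states that an MLP maps $Z_i\times C_i$ to $Z_{i+1}\times C_{i+1}$.

```python
import numpy as np

def lift_voxel_feature(v, w1, b1, w2, b2, z_out, c_out):
    """Lift a (Z_in, H, W, C_in) feature to (z_out, H, W, c_out), per BEV cell.

    w1: (Z_in * C_in, hidden), w2: (hidden, z_out * c_out); biases match.
    """
    z_in, h, w, c_in = v.shape
    # Gather each vertical column into one vector of length Z_in * C_in.
    columns = np.transpose(v, (1, 2, 0, 3)).reshape(h, w, z_in * c_in)
    hidden = np.maximum(columns @ w1 + b1, 0.0)    # ReLU hidden layer (assumed)
    lifted = hidden @ w2 + b2                       # (H, W, z_out * c_out)
    return np.transpose(lifted.reshape(h, w, z_out, c_out), (2, 0, 1, 3))
```

With the stated configuration, the first lifting step would map $Z_1\times C_1 = 2\times 128$ to $Z_2\times C_2 = 4\times 128$ at each of the $200\times 200$ BEV locations.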

Training Strategy.

Following previous works [21, 37], we train OccNet for 24 epochs with a learning rate of $2\times 10^{-4}$, a batch size of 1 per GPU with six images, and the AdamW optimizer [26] with a weight decay of $1\times 10^{-2}$. For the implementation of downstream tasks, all the perception tasks (except BEV segmentation) are trained at once, and the remaining tasks are fine-tuned with the trained perception tasks frozen.

Details of VoxelNet and BEVNet.

Different from OccNet with its cascaded feature maps, we construct VoxelNet and BEVNet with a single-scale feature map. In detail, VoxelNet uses voxel queries $Q_{voxel}\in\mathbb{R}^{4\times H\times W}$ to construct the voxel feature map $F_{voxel}\in\mathbb{R}^{4\times H\times W\times C_1}$ from the image features using 3D-DA directly, and expands it to the full-scale occupancy $V\in\mathbb{R}^{16\times H\times W\times C_2}$ using a fully connected layer. BEVNet generates the BEV feature $F_{bev}\in\mathbb{R}^{H\times W\times C_1}$ as in BEVFormer and reshapes it to the voxel feature $V\in\mathbb{R}^{16\times H\times W\times C_2}$ directly. Here $C_1$ and $C_2$ stand for the number of channels. Both VoxelNet and BEVNet adopt temporal context fusion accordingly.

Appendix D More Detail about OpenOcc

Figure 8: The generation process of our occupancy data. (a) Generating the occupancy data from object points and the labeled subset of background points, where the black points denote unknown background points from the intermediate frames. (b) Annotating part of the unknown background points based on the generated occupancy data. (c) Removing the remaining unknown background points, which are regarded as noise. (d) Post-processing the occupancy data to ensure the completeness of the scene, e.g., filling the hole denoted by the red dashed box.

Accumulation of Foreground objects.

To accumulate the foreground objects, we split the LiDAR points into object points and background points. However, 3D box annotations for intermediate frames are not provided in the nuScenes dataset [4]. We therefore approximate the 3D boxes by linear interpolation between the two adjacent key frames, and can then accumulate dense object points from the available intermediate LiDAR points.
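The interpolation step can be sketched as follows. This is a minimal illustration under stated assumptions: a box is a hypothetical dictionary with `center`, `size`, and `yaw` fields, the object size is taken from the earlier key frame, and the yaw is interpolated along the shortest angular path. The paper specifies only that linear interpolation between adjacent key frames is used, so these details are not confirmed by the source.

```python
import math


def interpolate_box(box0, box1, t0, t1, t):
    """Linearly interpolate a 3D box between key frames at times t0 and t1.

    Each box is a dict: {"center": (x, y, z), "size": (w, l, h), "yaw": rad}.
    """
    if t1 <= t0 or not (t0 <= t <= t1):
        raise ValueError("require t0 < t1 and t0 <= t <= t1")
    a = (t - t0) / (t1 - t0)  # interpolation weight in [0, 1]
    center = tuple(
        (1.0 - a) * c0 + a * c1
        for c0, c1 in zip(box0["center"], box1["center"])
    )
    # Wrap the yaw difference into [-pi, pi) so the box turns the short way.
    dyaw = (box1["yaw"] - box0["yaw"] + math.pi) % (2.0 * math.pi) - math.pi
    yaw = box0["yaw"] + a * dyaw
    # Assumption: object dimensions do not change between key frames.
    return {"center": center, "size": box0["size"], "yaw": yaw}


key0 = {"center": (0.0, 0.0, 0.0), "size": (2.0, 4.5, 1.6), "yaw": 0.0}
key1 = {"center": (2.0, 4.0, 0.0), "size": (2.0, 4.5, 1.6), "yaw": 0.5}
mid = interpolate_box(key0, key1, t0=0.0, t1=0.5, t=0.25)
print(mid["center"], mid["yaw"])
```

LiDAR points that fall inside the interpolated box at an intermediate timestamp can then be transformed into the box frame and accumulated with the points from the key frames.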

Dataset Generation Pipeline.

With the accumulated dense background points and foreground object points, we generate the occupancy data following the pipeline shown in Figure 8. We gradually refine the occupancy data and obtain the 3D occupancy benchmark with dense, high-quality annotations, as in Figure 8(d).

Dataset Statistics.

We annotate 16 classes in 34149 frames across all 700 training and 150 validation scenes, with over 1.4 billion voxels. The label distribution of the 16 classes is shown in Figure 9, indicating great diversity in the benchmark. The dataset exhibits significant class imbalance: the 10 foreground object classes account for only 5.33% of the total labels, and bicycle and motorcycle in particular account for just 0.02% and 0.03%, respectively.

We provide additional flow annotations for eight foreground object classes, which are helpful for downstream tasks such as motion planning. We split objects into moving and stationary states based on the velocity threshold $v_{th}=0.2\,\text{m/s}$, and the percentage of moving objects for each class is given in Figure 10. Note that the percentage of moving foreground objects is over 50%, indicating the significance of motion information in autonomous driving scenes.
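The moving/stationary split can be illustrated with a short sketch. It assumes that the threshold is applied to the planar speed (the norm of a hypothetical `(vx, vy)` velocity) and that an object counts as moving when its speed is strictly greater than $v_{th}$; the paper gives only the threshold value, so both choices are assumptions.

```python
import math

V_TH = 0.2  # velocity threshold in m/s, as stated in the text


def is_moving(velocity, v_th=V_TH):
    """Return True if the planar speed of (vx, vy) exceeds the threshold."""
    vx, vy = velocity
    return math.hypot(vx, vy) > v_th


def moving_percentage(velocities, v_th=V_TH):
    """Percentage of entries classified as moving."""
    if not velocities:
        raise ValueError("need at least one velocity")
    moving = sum(is_moving(v, v_th) for v in velocities)
    return 100.0 * moving / len(velocities)


# Toy velocities for one class: two stationary entries, two moving entries.
car_velocities = [(0.0, 0.0), (0.3, 0.0), (0.1, 0.1), (1.0, 1.0)]
print(moving_percentage(car_velocities))
```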

Figure 9: The distribution of occupancy classes in the OpenOcc benchmark. We notice that the background stuff is the majority in 3D occupancy data.

Figure 10: The percentage of occupancy with velocity for each foreground object. For 10 foreground objects in the benchmark, we only consider the 8 movable classes.

Appendix E More Experiments

Ablations on Frame Number for Temporal Self-Attention.

We investigate the effect of the number of frames used for temporal self-attention during training. Table 10 and Table 11 show that increasing the number of temporal frames generally improves performance, with the gains leveling off once four frames are used. Meanwhile, using too few previous frames can hurt performance to some extent.

| #Num | IoU$_{geo}$ ↑ | mIoU ↑ | barrier ↑ | bicycle ↑ | bus ↑ |
|---|---|---|---|---|---|
| 0 | 37.49 | 19.21 | 20.07 | 4.70 | 24.11 |
| 1 | 36.89 | 18.35 | 18.77 | 4.51 | 21.66 |
| 2$^{\ast}$ | 37.69 | 19.48 | 20.63 | 5.52 | 24.16 |
| 3 | 38.36 | 20.30 | 21.39 | 6.47 | 24.65 |
| 4 | 39.21 | 20.81 | 22.30 | 5.66 | 25.13 |
| 9 | 39.36 | 20.68 | 20.75 | 7.83 | 24.79 |

Table 10: The effect of historical frames on the semantic scene completion task using OccNet with ResNet50 backbone. “#Num” denotes the number of historical frames used during training. $^{\ast}$ marks the number used in the main paper.

| #Num | mIoU ↑ | barrier ↑ | bicycle ↑ | bus ↑ | car ↑ | truck ↑ |
|---|---|---|---|---|---|---|
| 0 | 53.82 | 59.02 | 24.05 | 67.61 | 69.59 | 59.38 |
| 1 | 48.41 | 57.45 | 21.52 | 57.71 | 67.26 | 44.60 |
| 2 | 52.33 | 61.41 | 26.07 | 73.97 | 70.56 | 52.64 |
| 3 | 53.49 | 60.98 | 24.45 | 70.20 | 69.37 | 56.84 |
| 4 | 54.59 | 62.09 | 21.06 | 75.05 | 70.20 | 59.40 |
| 9 | 54.35 | 60.04 | 30.83 | 75.49 | 71.02 | 61.63 |

Table 11: The effect of historical frames on the LiDAR segmentation task using OccNet with ResNet50 backbone. The “#Num” denotes the historical frame number used during training.

Evaluation on Occupancy Metrics for Planning.

We use the ground-truth occupancy, instead of the bounding boxes of vehicles and pedestrians, as the basis of the metric for evaluating the planning model. Specifically, the collision rate with the trajectory is computed over all foreground occupancy voxels and four classes of background occupancy voxels, i.e., other flat, terrain, manmade, and vegetation. As shown in Table 12, using occupancy as the input to the planning model is still more advantageous in most of the intervals under this metric. In future research, a cost function designed specifically for occupancy input may further improve planning performance.
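One way to picture an occupancy-based collision check is the sketch below. It is a simplified illustration under several assumptions that the paper does not specify: occupancy is projected to a set of occupied BEV cells, the ego vehicle is treated as a point rather than a footprint, the cell size is 0.5 m, and a trajectory counts as colliding if any waypoint falls into an occupied cell. The function names are hypothetical.

```python
import math


def to_cell(x, y, cell_size=0.5):
    """Map a metric (x, y) position to integer BEV cell indices."""
    return (math.floor(x / cell_size), math.floor(y / cell_size))


def collides(trajectory, occupied_cells, cell_size=0.5):
    """True if any waypoint of the trajectory lies in an occupied cell."""
    return any(
        to_cell(x, y, cell_size) in occupied_cells for x, y in trajectory
    )


def collision_rate(trajectories, occupied_per_sample, cell_size=0.5):
    """Percentage of samples whose planned trajectory hits occupancy."""
    if not trajectories:
        raise ValueError("need at least one trajectory")
    hits = sum(
        collides(traj, occ, cell_size)
        for traj, occ in zip(trajectories, occupied_per_sample)
    )
    return 100.0 * hits / len(trajectories)


# Two toy samples sharing one occupied cell, covering x in [1.0, 1.5) m.
occupied = {(2, 0)}
traj_hit = [(0.2, 0.1), (0.7, 0.1), (1.2, 0.1)]   # last waypoint collides
traj_free = [(0.2, 0.1), (0.2, 0.7)]              # stays in free cells
print(collision_rate([traj_hit, traj_free], [occupied, occupied]))
```

Restricting `occupied_cells` to the foreground classes plus other flat, terrain, manmade, and vegetation would mirror the class selection described above.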

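| Input | Collision (%) ↓ 1s | Collision (%) ↓ 2s | Collision (%) ↓ 3s | L2 (m) ↓ 1s | L2 (m) ↓ 2s | L2 (m) ↓ 3s |
|---|---|---|---|---|---|---|
| Bbox GT | 1.66 | 2.88 | 4.37 | 1.33 | 2.18 | 3.03 |
| Occupancy GT | 1.63 | 2.85 | 4.29 | 1.29 | 2.13 | 2.99 |
| Bbox pred. (OccNet) | 1.75 | 2.85 | 4.37 | 1.33 | 2.17 | 3.04 |
| Occupancy pred. (OccNet) | 1.68 | 2.94 | 4.32 | 1.30 | 2.15 | 3.02 |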

Table 12: Planning results with different scene representations under occupancy metrics. The occupancy representation is still more advantageous in most of the intervals.

Pre-training for planning.

Having evaluated the pre-trained model on the 3D detection and BEV segmentation tasks in the main paper, we further compare its impact on the downstream planning task. Specifically, the perception module of ST-P3 [13] is replaced by the pre-trained OccNet, and the planning module is fine-tuned. Unfortunately, pre-training on OccNet does not provide an advantage for planning, as shown in Table 13. Therefore, combined with the planning experiment in the main paper, we should directly apply the scene completion results of occupancy to the planning task instead of these pre-trained features.

| Input | Collision (%) ↓ 1s | Collision (%) ↓ 2s | Collision (%) ↓ 3s | L2 (m) ↓ 1s | L2 (m) ↓ 2s | L2 (m) ↓ 3s |
|---|---|---|---|---|---|---|
| Det | 0.38 | 0.40 | 0.82 | 0.85 | 1.18 | 1.57 |
| Occ | 0.47 | 0.68 | 1.03 | 0.93 | 1.26 | 1.70 |

Table 13: Different pretraining tasks for planning. Pretrained features from occupancy do not directly bring performance benefits to planning.

Ablations in Semantic Scene Completion.

Table 14 compares BEVNet, VoxelNet, and OccNet on the semantic scene completion task. The cascaded voxel structure helps learn a better occupancy descriptor for representing 3D space.

| Method | Backbone | IoU$_{geo}$ | mIoU | barrier | bicycle | bus | car | const. veh. | motorcycle | pedestrian | traffic cone | trailer | truck | driv. surf. | other flat | sidewalk | terrain | manmade | vegetation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BEVNet | ResNet50 | 36.11 | 17.37 | 14.02 | 5.07 | 20.85 | 24.94 | 8.64 | 7.75 | 12.8 | 8.93 | 10.21 | 16.02 | 44.41 | 14.42 | 23.87 | 27.76 | 13.73 | 24.49 |
| VoxelNet | ResNet50 | 37.59 | 19.06 | 19.31 | 6.25 | 22.16 | 26.89 | 9.96 | 6.91 | 12.70 | 6.27 | 9.43 | 16.96 | 46.7 | 23.31 | 26.04 | 29.08 | 16.52 | 26.46 |
| OccNet | ResNet50 | 37.69 | 19.48 | 20.63 | 5.52 | 24.16 | 27.72 | 9.79 | 7.73 | 13.38 | 7.18 | 10.68 | 18.00 | 46.13 | 20.60 | 26.75 | 29.37 | 16.90 | 27.21 |
| BEVNet | ResNet101 | 40.15 | 24.62 | 26.39 | 15.79 | 32.07 | 35.83 | 11.93 | 19.72 | 19.75 | 15.38 | 12.82 | 23.90 | 49.16 | 21.52 | 30.57 | 31.39 | 18.99 | 28.71 |
| VoxelNet | ResNet101 | 40.73 | 26.06 | 27.98 | 15.95 | 32.31 | 36.15 | 14.88 | 20.55 | 20.72 | 16.52 | 15.13 | 25.94 | 49.07 | 27.82 | 31.04 | 32.43 | 20.45 | 29.99 |
| OccNet | ResNet101 | 41.08 | 26.98 | 29.77 | 16.89 | 34.16 | 37.35 | 15.58 | 21.92 | 21.29 | 16.75 | 16.37 | 26.23 | 50.74 | 27.93 | 31.98 | 33.24 | 20.8 | 30.68 |

Table 14: Ablation in semantic scene completion with different models. OccNet is superior to BEVNet and VoxelNet in performance.

Effect of Voxel Resolution on LiDAR Segmentation.

We voxelize the 3D space with resolution $\Delta s\in\{1.0\,\text{m},0.5\,\text{m},0.25\,\text{m}\}$ to investigate the effect of voxel resolution on LiDAR segmentation. Since we transfer semantic occupancy prediction to LiDAR segmentation by assigning each point the label of its associated voxel, LiDAR segmentation performance increases as $\Delta s$ decreases, as shown in Table 15. OccNet with camera input can achieve the performance of LiDAR-based methods as $\Delta s\to 0$.
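The point-label assignment can be sketched in a few lines. This is a minimal illustration, assuming that the predicted occupancy is available as a hypothetical dictionary from integer voxel indices to class ids, that the grid origin is known, and that points outside any labeled voxel receive a default label; none of these details are specified in the paper.

```python
import math


def point_labels_from_voxels(points, voxel_labels, origin, voxel_size, default=-1):
    """Assign each LiDAR point the label of the voxel that contains it."""
    labels = []
    for x, y, z in points:
        idx = (
            math.floor((x - origin[0]) / voxel_size),
            math.floor((y - origin[1]) / voxel_size),
            math.floor((z - origin[2]) / voxel_size),
        )
        labels.append(voxel_labels.get(idx, default))
    return labels


points = [(0.2, 0.2, 0.2), (0.8, 0.8, 0.8)]
origin = (0.0, 0.0, 0.0)

# At 1.0 m both points share one voxel, so they must share one label.
coarse = point_labels_from_voxels(points, {(0, 0, 0): 3}, origin, 1.0)

# At 0.5 m the points fall into different voxels and can differ in label.
fine = point_labels_from_voxels(
    points, {(0, 0, 0): 3, (1, 1, 1): 7}, origin, 0.5
)
print(coarse, fine)
```

The toy case shows why a smaller $\Delta s$ helps: points that share a coarse voxel are forced to share a label, while a finer grid can separate them.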

| Method | $\Delta s$ (m) | mIoU | barrier | bicycle | bus | car | const. veh. | motorcycle | pedestrian | traffic cone | trailer | truck | driv. surf. | other flat | sidewalk | terrain | manmade | vegetation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OccNet | 1.00 | 46.60 | 52.78 | 21.04 | 65.94 | 62.45 | 18.31 | 15.49 | 30.71 | 15.82 | 33.94 | 50.22 | 83.93 | 48.84 | 50.52 | 57.89 | 69.49 | 68.29 |
| OccNet | 0.50 | 47.29 | 59.06 | 20.63 | 48.32 | 63.05 | 24.12 | 20.24 | 41.82 | 18.84 | 23.38 | 41.12 | 86.46 | 53.12 | 52.03 | 59.14 | 71.55 | 73.68 |
| OccNet | 0.25 | 53.00 | 65.93 | 22.84 | 64.09 | 72.69 | 32.73 | 28.73 | 52.21 | 17.64 | 22.05 | 51.26 | 89.05 | 57.41 | 58.06 | 64.30 | 75.09 | 73.92 |

Table 15: The performance of OccNet with ResNet50 backbone on the nuScenes validation set for the LiDAR segmentation task. The setting with the smallest $\Delta s$ shows the best performance.

Figure 11: Visualization of planning. The blue line represents the planned trajectory, and the lower figures are rasterisation results of bounding box and occupancy, respectively. The trajectory obtained by the rasterized occupancy input can maintain a greater safety distance from the truck, because of the more accurate polygon representation.

Appendix F Visualization Results

We sample two scenes from the validation set and provide detailed visualizations of the occupancy prediction in Figure 12, indicating that OccNet can describe the scene geometry and semantics in detail. In Figure 11, we compare the rasterized occupancy with the rasterized bounding box as the input to the planning module, indicating that occupancy is superior to bounding boxes for the motion planning task.

Figure 12: Visualization of occupancy prediction. For each scene, the top-left figure is the surrounding camera input, and the bottom-left and right figures show the perspective view and top view of the occupancy prediction, respectively. The dashed regions highlight that OccNet predicts small or distant targets well.
