MGMap: Mask-Guided Learning for Online Vectorized HD Map Construction
Source: https://arxiv.org/html/2404.00876

Xiaolu Liu 1, Song Wang 1, Wentong Li 1, Ruizi Yang 1, Junbo Chen 2, Jianke Zhu 1
1 Zhejiang University 2 Udeer.ai
{xiaoluliu, songw, liwentong, ruiziyang, jkzhu}@zju.edu.cn, junbo@udeer.ai
Abstract
Currently, high-definition (HD) map construction is shifting toward lightweight online generation, which aims to provide timely and reliable road scene information. However, map elements contain strong shape priors, and their subtle and sparse annotations leave current detection-based frameworks ambiguous about the relevant feature scopes, causing the loss of detailed structures in prediction. To alleviate these problems, we propose MGMap, a mask-guided approach that highlights informative regions and achieves precise map element localization by introducing learned masks. Specifically, MGMap exploits learned masks built on enhanced multi-scale BEV features from two perspectives. At the instance level, we propose the Mask-Activated Instance (MAI) decoder, which incorporates global instance and structural information into instance queries through the activation of instance masks. At the point level, a novel Position-Guided Mask Patch Refinement (PG-MPR) module refines point locations from a finer-grained perspective, enabling the extraction of point-specific patch information. Compared to the baselines, our proposed MGMap achieves a notable improvement of around 10 mAP for different input modalities. Extensive experiments also demonstrate the strong robustness and generalization capability of our approach. Our code can be found at https://github.com/xiaolul2/MGMap.
1 Introduction
Figure 1: For some detailed structures, our proposed MGMap achieves effective map element localization by highlighting the informative regions through the learned masks.
High-definition (HD) maps play an important role in autonomous driving[19], as they provide centimeter-level road information for self-positioning[18], path planning[11, 7] and other downstream tasks[15, 14, 17]. Typically, offline HD map construction relies on manual annotation of global LiDAR point clouds, which requires substantial financial and manual investment. Moreover, offline maps struggle to reflect ever-changing road conditions due to the lack of real-time updates. To tackle these issues, a lightweight, real-time online map construction paradigm[19] has gradually become a promising approach by incorporating information from onboard sensors.
Some existing approaches[36, 19, 50, 10, 43] consider online map construction as a segmentation task, where pixel-level rasterized maps are learned in bird’s-eye-view (BEV) space. Nevertheless, for vectorized expression, an additional step of post-processing is required to cluster and fit lane instances. Recently, efficient and straightforward approaches like VectorMapNet[26] and MapTR[23] have been proposed for vectorized map construction, in which map elements are represented by sparse point sets. Transformer-based architectures are directly employed to update instance queries and regress point locations.
Despite their promising results, current map vectorization frameworks are still constrained by inherent issues. As shown in Figure 1, map elements such as road edges, dividing lines, and pedestrian crossings have strong shape priors. Obscure features and coarse locations can easily lead to the loss of detailed expression in prediction, especially for irregular boundaries and sudden changes in corner angles. Besides, subtle and sparse annotations pose significant challenges for deformable attention[51], the sparse and local feature extraction strategy employed in current vectorization frameworks. Given the sparsity of detection targets, such under-sampling can readily lead to coarse localization and the loss of effective information. These issues intensify the difficulty of determining the relevant feature scopes, confirming lane line instances, and accurately locating lane points.
To tackle the above issues, in this paper we propose a fine-grained approach called MGMap, which improves localization and highlights specific features by incorporating the guidance of learned map masks. Initially, enhanced multi-level BEV features are dynamically constructed for richer semantic and positional information. Based on these, masks are generated and utilized at both the holistic instance level and the more granular point level. At the instance level, we design the Mask-Activated Instance (MAI) decoder to guide the construction and feature aggregation of lane queries. By leveraging learned instance masks, the MAI decoder endows activated lane queries with global instance structure and shape characteristics. Furthermore, at the point level, a Position-Guided Mask Patch Refinement (PG-MPR) module is proposed to alleviate the difficulty of locating relevant features caused by sparse and irregular detection targets. By focusing on specific patch regions, binary mask features are extracted to gather more detailed information around lane lines, enabling finer regression of detailed structures and point locations.
Extensive experiments on nuScenes[1] and Argoverse2[46] datasets demonstrate that MGMap achieves state-of-the-art (SOTA) performance on the task of online HD map construction. Besides, the promising experiment results under different settings show the robustness and generalization capability of our presented model. The main contributions of our work can be summarized as follows:
• An effective approach for precise online HD map vectorization under the guidance of learned masks, in which instance masks and binary masks provide effective features for learning unique lane lines and shapes.
• A mask-activated instance decoder and a novel position-guided mask patch refinement module that decode map elements at the instance level and the point level, fully leveraging the potential of mask features.
• Promising results on two testbeds showing that our MGMap outperforms previous approaches by a large margin with strong robustness and generalization capability.
Figure 2: Overview of MGMap framework. MGMap mainly consists of three components: (1) BEV Extractor to obtain multi-scale BEV features by transforming from perspective view (PV) to BEV with the enhanced multi-level neck; (2) Mask-Activated Instance (MAI) Decoder is employed to construct and update queries at instance level; (3) Position-Guided Mask Patch Refinement (PG-MPR) module is designed to refine points’ positions from local patch features at point level.
2 Related Work
Online HD Map Construction. Unlike traditional offline HD map annotation on global LiDAR points[37, 38], recent studies[19, 43, 26, 23] explore online construction directly from onboard sensor data to reduce labeling cost and provide up-to-date road information. Some approaches[19, 20, 43, 10, 33] treat HD map construction as a segmentation task that predicts pixel-level rasterized maps, which requires post-processing for vectorized construction. For more straightforward vectorized construction, Liu et al.[26] propose the two-stage framework VectorMapNet with an auto-regressive decoder that recurrently connects vertices. Jeong et al.[39] detect points first and employ an adjacency matrix in InstaGraM to build connections among instance points. Further, Liao et al.[23, 24] propose MapTR, which represents map elements as ordered point sets of fixed size so that a transformer architecture can regress point positions simultaneously. Later, Qiao et al.[32] and Ding et al.[9] present new modeling strategies, in which BeMapNet and PivotNet utilize Bezier curves and dynamic pivot points, respectively, to model map elements for more detailed representations. In contrast to the above methods, our MGMap introduces the guidance of generated mask features to handle lane shapes in detail with specific feature enhancement.
Camera-based BEV Perception. HD map construction relies on high-quality BEV features, which also underpin most 3D perception tasks[44, 16, 25, 31, 45]. Generally, BEV features are extracted and transformed from perspective-view (PV) images. Initially, Reiher et al.[35] utilize homography transformation in IPM to project PV images into BEV space. After that, learning-based approaches are widely used to construct more reliable BEV features[2, 44]. Pan et al.[29] employ a fully connected layer in VPN to convert perspective-view features into BEV space. In[31] and[16], depth estimation is utilized to establish the connection between surround-view images and BEV features. To enhance model robustness and detection performance, multi-modality and temporal fusion strategies are used in BEVFormer[20] and BEVFusion[27, 22]. Besides, transformer-based architectures[30, 25, 20] are widely used for feature aggregation, where BEV features are represented by queries at different positions and updated by interaction with PV image features. To obtain a more reliable HD map, we employ a pyramid-like network for multi-scale BEV features, which captures rich semantic and location information for online HD map construction.
Mask Refinement for Segmentation. Mask refinement strategies are widely used across segmentation tasks to improve the quality of instance or semantic features. In previous work[5, 40], boundary-aware mask features are constructed by an extra branch to improve mask localization accuracy. Cheng et al.[6] exploit the interaction between learned mask features and instance activation maps to represent objects with instance features. In[21, 3, 4, 8], mask features are integrated into transformer-based architectures, where attention is utilized for feature extraction. Based on initial prediction outputs, Tang et al.[42] propose a refinement strategy with small boundary patches to improve mask quality. For map construction, our proposed MGMap takes advantage of learned mask features at both the instance level and the point level to highlight and enhance informative regions of subtle map annotations.
3 MGMap
Given surround-view images captured by onboard cameras, our goal is to distinguish local BEV map instances while locating their corresponding structures. Each map element comprises a class label $\mathbf{c}$ and an ordered point sequence $\mathbf{P}=\{(x_i, y_i)\}_{i=1}^{N}$ that represents the lane structure, with $N$ denoting the number of points for each lane.
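As a concrete illustration of this representation, the sketch below models a map element as a class label plus an ordered point list; the class names and the toy three-point divider are our own illustrative values, not from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class MapElement:
    """A vectorized map element: class label c plus ordered points P = {(x_i, y_i)}_{i=1}^N."""
    label: str                           # hypothetical names, e.g. "divider", "boundary", "ped_crossing"
    points: List[Tuple[float, float]]    # ordered lane points; len(points) == N

# Toy divider with N = 3 points (illustrative coordinates only).
elem = MapElement("divider", [(0.0, 0.0), (0.5, 0.1), (1.0, 0.0)])
assert len(elem.points) == 3
```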
Figure 2 illustrates the framework of our proposed MGMap. To obtain multi-scale BEV features, a pyramid network with fused attention is designed, which is built after initial PV-to-BEV feature transformations (Section 3.1). Subsequently, a cascaded transformer decoder is utilized to update mask-activated instance queries (Section 3.2). Finally, by interacting with the local semantic context derived from the patch of mask features, we achieve fine-grained position refinement at the point level (Section 3.3).
3.1 BEV Feature Extraction
Initially, we employ shared CNN backbones to extract 2D features from PV images. PV features are gathered into a unified BEV representation via the strategy in [20], which utilizes deformable attention[51] to update BEV queries by interacting with surround-view image features. The extracted BEV feature is denoted $\mathbf{F}_0 \in \mathbb{R}^{D \times H_{BEV} \times W_{BEV}}$, where $H_{BEV} \times W_{BEV}$ is the size of the BEV feature and $D$ is the feature dimension.
Enhanced Multi-Level Neck. To obtain BEV features with rich semantic and location information, we design a 3-layer Enhanced Multi-Level (EML) neck with fused attention in BEV space to construct unified BEV features. Multi-scale BEV features with larger receptive fields can thus be obtained for a better understanding of overall structures.
In the EML neck, we construct the cascaded layers using residual blocks based on the hybrid of channel attention (CA) and spatial attention (SA)[47], in which hybrid multiplications among channels and spaces are designed to capture both local and global contextual information. Such dynamic selections at different BEV spaces are beneficial for detecting lane lines with irregular shapes. The calculation of learnable attention maps can be formulated below
$$\mathbf{F}_{i+1}=(\text{CA}(\mathbf{F}_i)\times\mathbf{F}_i)\times\text{SA}(\mathbf{F}_i). \tag{1}$$
After that, multi-level BEV features $\{\mathbf{F}_i\}_{i=1}^{3}$ are obtained with shape $D_i \times \frac{H_{BEV}}{2^{i+1}} \times \frac{W_{BEV}}{2^{i+1}}$ for each layer $i$, where $D_i$ is the corresponding dimension. To preserve the information at all levels, we employ bilinear interpolation to upsample the 3-level outputs $\{\mathbf{F}_i\}_{i=1}^{3}$, aligning them to the resolution of the initial $\mathbf{F}_0$. Finally, a $3\times 3$ convolutional layer after concatenation aggregates the multi-level features into the enhanced BEV feature $\mathbf{F}_c$. The enhanced $\mathbf{F}_c$ contains local and semantic information across varying receptive fields, strengthening the capability to detect fine-grained structures.
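The hybrid attention of Eq. (1) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the CA/SA modules here use plain average pooling plus a sigmoid as stand-ins for the learned channel and spatial attention of [47], and the downsampling, residual connections, and learned convolutions of the actual EML neck are omitted.

```python
import numpy as np

def channel_attention(F):
    """CA stand-in: global-average-pool each channel, squash to (0, 1) weights."""
    w = F.mean(axis=(1, 2), keepdims=True)   # (D, 1, 1) per-channel statistic
    return 1.0 / (1.0 + np.exp(-w))          # sigmoid gate

def spatial_attention(F):
    """SA stand-in: pool across channels, squash to a (1, H, W) weight map."""
    w = F.mean(axis=0, keepdims=True)        # (1, H, W) per-location statistic
    return 1.0 / (1.0 + np.exp(-w))

def eml_block(F):
    """One hybrid-attention step following Eq. (1): (CA(F) * F) * SA(F)."""
    return (channel_attention(F) * F) * spatial_attention(F)

# Toy BEV feature: D = 4 channels over an 8 x 8 BEV grid.
F0 = np.random.randn(4, 8, 8)
F1 = eml_block(F0)
assert F1.shape == F0.shape                  # attention reweights, shape preserved
assert (np.abs(F1) <= np.abs(F0)).all()      # gates in (0,1) only attenuate
```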
3.2 Mask-Activated Instance Decoder
For each lane instance, specific query embeddings with instance and structure information are required for the regression of lane shapes and positions. Based on enriched BEV features, this section focuses on the design of mask-activated lane queries and the subsequent update through the cascaded deformable transformer decoder.
Mask-Activated Query. To achieve a more detailed and specific representation, MGMap employs a hybrid approach that combines lane queries $\mathbf{Q}_{lane}$ and point queries $\mathbf{Q}_{point}$ to encode individual map instances.
For lane queries, in contrast to fixed query designs[23, 51], $\mathbf{Q}_{lane} \in \mathbb{R}^{M \times D}$ is dynamically initialized with the guidance of learned instance masks, which is more flexible and carries specific shape priors and instance information. Here $M$ is the number of instance queries and $D$ is the dimension. Specifically, we obtain a set of instance segmentation maps $\mathbf{M}_{ins} \in \mathbb{R}^{M \times H_{BEV} \times W_{BEV}}$ by applying a basic convolution followed by the sigmoid function $\sigma(\cdot)$ to the enhanced BEV feature $\mathbf{F}_c$, as illustrated in the lower half of Figure 3. Leveraging $\mathbf{M}_{ins}$, we generate the mask-activated instance query $\mathbf{Q}_{lane}$ via multiplication between the instance masks $\mathbf{M}_{ins}$ and the transpose of the enhanced BEV features
$$\mathbf{Q}_{lane}=\sigma(\text{Conv}(\mathbf{F}_c))\times\mathbf{F}_c^{\top}. \tag{2}$$
During the training stage, $\mathbf{M}_{ins}$ is supervised by instance mask annotations, and bipartite matching pairs each instance mask with its ground-truth label. The activated instance query allows a more tailored and precise representation of each map element, in which unique features from the instance masks are aggregated into specific instance queries.
For point queries, $\mathbf{Q}_{point} \in \mathbb{R}^{N \times D}$ is constructed from a set of predefined learnable weights, which are broadcast by multi-layer perceptrons (MLP) and fused with each $\mathbf{Q}_{lane}$. Thus, the hybrid query $\mathbf{Q} \in \mathbb{R}^{M \times N \times D}$ is obtained by
$$\mathbf{Q}=\mathbf{Q}_{lane}+\text{MLP}(\mathbf{Q}_{point}). \tag{3}$$
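Shape-wise, Eqs. (2)-(3) amount to mask-weighted pooling of BEV features into $M$ instance queries, then broadcasting $N$ point queries over them. A minimal NumPy sketch, where random logits stand in for the learned $\text{Conv}(\mathbf{F}_c)$ and the MLP is replaced by an identity map:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, D, H, W = 3, 5, 16, 8, 8             # toy sizes: queries, points, dim, BEV grid

F_c = rng.standard_normal((D, H, W))       # enhanced BEV feature
logits = rng.standard_normal((M, H, W))    # stand-in for Conv(F_c) -> M instance maps
M_ins = 1.0 / (1.0 + np.exp(-logits))      # sigmoid -> soft instance masks

# Eq. (2): M_ins (M, HW) times F_c^T (HW, D) -> one D-dim query per instance.
Q_lane = M_ins.reshape(M, -1) @ F_c.reshape(D, -1).T   # (M, D)

# Eq. (3): broadcast N learnable point queries (identity MLP stand-in) over instances.
Q_point = rng.standard_normal((N, D))
Q = Q_lane[:, None, :] + Q_point[None, :, :]           # (M, N, D) hybrid query
assert Q.shape == (M, N, D)
```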
Deformable Decoder. To update the hybrid queries, an $L$-layer decoder is built following the multi-scale deformable DETR[51] structure. At each layer $l$, query embeddings are updated by interacting with the structured multi-level BEV features. Specifically, deformable DETR assigns reference points $\mathbf{P}^l$ as anchors to collect sparse features from $\{\{\mathbf{F}_i\}_{i=1}^{3}, \mathbf{F}_c\}$, where $\mathbf{P}^l$ is also the intermediate state of the normalized lane point positions. Simultaneously, the updated lane point positions $\mathbf{P}^{l+1}$ are obtained by adding learned offsets, derived from the MLP regression branch $\text{Reg}_{pos}$ on the query $\mathbf{Q}^l$, as follows
$$\mathbf{Q}^{l}=\text{DeformAttn}(\mathbf{Q}^{l-1},\mathbf{P}^{l},\{\{\mathbf{F}_i\}_{i=1}^{3},\mathbf{F}_c\}), \tag{4}$$
$$\mathbf{P}^{l+1}=\sigma(\sigma^{-1}(\mathbf{P}^{l})+\text{Reg}_{pos}(\mathbf{Q}^{l})). \tag{5}$$
The classification scores $\mathbf{c}^{l+1}$ are also updated by MLP regression on $\mathbf{Q}^{l}$. With iterative interaction in the $L$-layer decoder, MGMap progressively injects semantic and positional information into the query embeddings and continually revises the shape and position of the lane lines.
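The layer-wise position update of Eq. (5) refines points in logit space and squashes back through the sigmoid, which keeps coordinates in the normalized range. A minimal NumPy sketch, with the offset array standing in for the output of the $\text{Reg}_{pos}$ branch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def inv_sigmoid(p):
    p = np.clip(p, 1e-6, 1 - 1e-6)           # avoid infinities at 0 and 1
    return np.log(p / (1.0 - p))

def update_points(P, offsets):
    """Eq. (5): refine normalized positions in logit space, then squash back to (0, 1)."""
    return sigmoid(inv_sigmoid(P) + offsets)

P = np.array([[0.50, 0.50], [0.20, 0.80]])      # normalized (x, y) reference points
offsets = np.array([[0.1, -0.1], [0.0, 0.0]])   # stand-in for Reg_pos(Q^l)
P_next = update_points(P, offsets)

assert np.allclose(update_points(P, np.zeros_like(P)), P)  # zero offset = fixed point
assert ((P_next > 0) & (P_next < 1)).all()                 # stays in normalized range
```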
Figure 3: Illustration of mask constructions at different stages. In MAI decoder, instance masks are generated to activate lane queries, while binary masks are extracted to provide fine-grained patch features in PG-MPR.
3.3 Position-Guided Mask Patch Refinement
Although general shapes and structures can be regressed at the instance level, some detailed expressions remain hard to construct. Therefore, fine-grained refinement at a more specific point level is necessary. In this section, we design a refinement module that exploits binary mask features in a more targeted manner.
Mask Feature Construction. As shown in Figure 3, the binary mask $\mathbf{M}_b \in \mathbb{R}^{2 \times H_{BEV} \times W_{BEV}}$ is obtained by applying the sigmoid function after a basic convolution on $\mathbf{F}_c$, with an auxiliary loss under rasterized supervision at the training stage. With the guidance of binary segmentation learning, more highlighted features can be utilized to distinguish lane line information from the background.
After that, the binary mask feature $\mathbf{F}_m$ is constructed from the binary map $\mathbf{M}_b$. We densify the channel dimension of $\mathbf{M}_b$ from 2 to 32 by $\text{D}(\cdot)$. Then, we concatenate the densified binary map with $\mathbf{F}_c$ and a 2-channel normalized positional grid $\mathbf{G}_{bev}$, which provides relative location information for each pixel. Finally, these three kinds of features are fused by a convolutional operation as below
$$\mathbf{F}_m=\text{Conv}(\text{Concat}(\mathbf{F}_c,\text{D}(\mathbf{M}_b),\mathbf{G}_{bev})). \tag{6}$$
Compared to $\mathbf{F}_c$, $\mathbf{F}_m$ emphasizes location and semantic information around lane lines, differentiating these features from ambiguous backgrounds and providing clearer location cues.
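The construction of Eq. (6) can be sketched in NumPy as below. The learned densification $\text{D}(\cdot)$ and the fusing convolution are approximated by fixed random channel projections (illustrative stand-ins only), and the positional grid is the normalized $(x, y)$ coordinate of each BEV pixel:

```python
import numpy as np

D, H, W = 16, 8, 8
F_c = np.random.randn(D, H, W)                        # enhanced BEV feature
M_b = np.random.rand(2, H, W)                         # binary mask (fg/bg probabilities)

# Densify D(.): a 2 -> 32 channel projection stands in for the learned conv.
proj = np.random.randn(32, 2)
M_dense = np.einsum('kc,chw->khw', proj, M_b)         # (32, H, W)

# 2-channel normalized positional grid G_bev in [0, 1].
ys, xs = np.meshgrid(np.linspace(0, 1, H), np.linspace(0, 1, W), indexing='ij')
G_bev = np.stack([xs, ys])                            # (2, H, W)

# Eq. (6): concatenate along channels; a 1x1 "conv" (matrix) fuses back to D channels.
cat = np.concatenate([F_c, M_dense, G_bev], axis=0)   # (D + 32 + 2, H, W)
fuse = np.random.randn(D, D + 32 + 2)
F_m = np.einsum('dk,khw->dhw', fuse, cat)             # (D, H, W) mask feature
assert F_m.shape == (D, H, W)
```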
Patch Extraction and Refinement.
Figure 4: (a) The conventional deformable attention extracts sparse features from sampling points, which may select irrelevant features; (b) Our proposed Mask Patch Refinement extracts more relevant features from the region of reliable patch.
As shown in Figure 4 (b), for the $i$-th point of a lane line, the patch region is a square bounding box $B_i$ centered at the point's coordinates $(x_i, y_i)$, which are regressed from the last layer of the MAI decoder. The size of the patch region is determined by the hyper-parameter $d$.
Once the location and region scale are determined, we apply the function $f_{ext}$ to extract local patch features $\mathbf{v}_i$ for each point of each map instance. This is implemented with ROIAlign[13], which extracts and unifies the features as follows
$$\mathbf{v}_i=f_{ext}\big(\mathbf{F}_m,\text{R}(B_i),(x_i,y_i)\big), \tag{7}$$
where $\text{R}(\cdot)$ denotes the denormalized patch region computation. The extraction function $f_{ext}$ applies basic convolutional blocks after bilinear interpolation and pooling to locate and align the corresponding values in the patch features. Consequently, each extracted semantic patch feature $\mathbf{v}_i$ has shape $D \times 5 \times 5$, which effectively mitigates the misalignment issue. Compared to the sparse deformable attention design in Figure 4 (a), the mask patch design contains dense and more reliable features.
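The core of $f_{ext}$ is bilinear sampling of a fixed $5 \times 5$ grid over the $d \times d$ box, as in ROIAlign[13]. The sketch below implements only that sampling step (the learned convolutional blocks of the full $f_{ext}$ are omitted); the box center and size are illustrative values:

```python
import numpy as np

def bilinear_sample(F, x, y):
    """Bilinearly sample feature map F (D, H, W) at continuous pixel coords (x, y)."""
    D, H, W = F.shape
    x0 = int(np.clip(np.floor(x), 0, W - 2))
    y0 = int(np.clip(np.floor(y), 0, H - 2))
    dx, dy = x - x0, y - y0
    return ((1 - dx) * (1 - dy) * F[:, y0, x0] + dx * (1 - dy) * F[:, y0, x0 + 1]
            + (1 - dx) * dy * F[:, y0 + 1, x0] + dx * dy * F[:, y0 + 1, x0 + 1])

def extract_patch(F_m, cx, cy, d, out=5):
    """f_ext sampling step: an out x out grid over a d x d box centered at (cx, cy)."""
    v = np.zeros((F_m.shape[0], out, out))
    for r, yy in enumerate(np.linspace(cy - d / 2, cy + d / 2, out)):
        for c, xx in enumerate(np.linspace(cx - d / 2, cx + d / 2, out)):
            v[:, r, c] = bilinear_sample(F_m, xx, yy)
    return v

F_m = np.random.randn(16, 32, 32)                 # toy binary mask feature
v_i = extract_patch(F_m, cx=10.3, cy=7.8, d=4.0)  # patch around one lane point
assert v_i.shape == (16, 5, 5)                    # D x 5 x 5, as in the paper
```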
With the query embedding $\mathbf{Q}$ from the $L$-layer MAI decoder and the patch region features $\mathbf{V}=\{\mathbf{v}_i\}_{i=0}^{N}$, we leverage multi-head attention to refresh the query features, which can be written as follows
$$\mathbf{Q}^{s+1}=\text{MultiHeadAtt}\Big(\frac{\mathbf{Q}^{s}\mathbf{V}^{s}}{\sqrt{D}}\Big), \tag{8}$$
where $s$ denotes the stage in the PG-MPR module and $D$ is the feature dimension. Query embeddings are thus refreshed at a more detailed and specific point level.
Finally, the point coordinates and classification scores are regressed from the query $\mathbf{Q}^{s}$ through the MLP branches $\text{Reg}_{pos}$ and $\text{Reg}_{cls}$, which can be formulated as
$$\mathbf{P}^{s+1}=\sigma(\sigma^{-1}(\mathbf{P}^{s})+\text{Reg}_{pos}(\mathbf{Q}^{s})), \tag{9}$$
$$\mathbf{c}^{s+1}=\text{Reg}_{cls}(\mathbf{Q}^{s}). \tag{10}$$
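The refinement of Eq. (8) can be sketched as scaled dot-product attention of each point query over its own $5\times 5 = 25$ flattened patch cells. This is a single-head simplification (the paper uses multi-head attention with learned projections, which are omitted here):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))   # numerically stable softmax
    return e / e.sum(axis=axis, keepdims=True)

def refine_query(Q, V):
    """Single-head sketch of Eq. (8): each point query attends over its 25 patch cells.

    Q: (N, D) point queries; V: (N, 25, D) flattened 5x5 patch features per point.
    """
    D = Q.shape[-1]
    attn = softmax(np.einsum('nd,nkd->nk', Q, V) / np.sqrt(D))  # (N, 25) weights
    return np.einsum('nk,nkd->nd', attn, V)                     # refreshed queries

N, D = 4, 16
Q = np.random.randn(N, D)
V = np.random.randn(N, 25, D)
Q_next = refine_query(Q, V)
assert Q_next.shape == (N, D)
```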
3.4 Training Loss
MGMap is trained in an end-to-end manner. Bipartite matching is employed to pair predicted map instances with their ground-truth counterparts. Alongside the regression of points and class labels, an auxiliary loss is required for mask segmentation. Concretely, the total loss is the sum of the detection loss and the mask segmentation loss: $\mathcal{L}=\mathcal{L}_{det}+\mathcal{L}_{mask}$.
Detection Loss. Lane detection aims to regress lane coordinates and classification labels. As in [23], we employ the $L_{1}$ loss to calculate the point-to-point Manhattan distance between the predicted points $\hat{p}_{ij}$ and the ground truth $p_{ij}$. An edge direction loss between adjacent points is also considered via cosine similarity, and $\mathcal{L}_{focal}$ calculates the classification loss. These terms can be formulated as

$$\mathcal{L}_{lane}=\sum_{i,j=0}^{M,N}\big(\lambda_{dis}\,\text{Dis}(\hat{p}_{ij},p_{ij})+\lambda_{dir}\,\text{CosSim}(\hat{e}_{ij},e_{ij})\big), \qquad (11)$$

$$\mathcal{L}_{det}=\mathcal{L}_{lane}+\lambda_{cls}\sum_{i=0}^{M}\mathcal{L}_{focal}(\hat{\mathbf{c}}_{i},\mathbf{c}_{i}), \qquad (12)$$

where $p_{ij}$ is the $j$-th point of the $i$-th instance and $e_{ij}$ is the direction between two adjacent points. $\lambda_{dis}$, $\lambda_{dir}$, and $\lambda_{cls}$ are the weighting factors for the point regression, direction adjustment, and label classification losses, respectively.
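As a concrete illustration, the per-instance lane loss of Eq. (11) can be sketched as below. The helper names and loss weights are illustrative, and the direction term is written as $1-\text{CosSim}$ so that perfectly aligned edges incur zero loss; the paper's exact implementation may differ.

```python
import math

def l1_dist(p_hat, p):
    # Point-to-point Manhattan (L1) distance.
    return sum(abs(a - b) for a, b in zip(p_hat, p))

def dir_loss(e_hat, e):
    # 1 - cosine similarity between predicted and ground-truth edge directions.
    dot = sum(a * b for a, b in zip(e_hat, e))
    n1 = math.sqrt(sum(a * a for a in e_hat))
    n2 = math.sqrt(sum(b * b for b in e))
    return 1.0 - dot / (n1 * n2 + 1e-8)

def lane_loss(pred_pts, gt_pts, lam_dis=5.0, lam_dir=0.005):
    """Eq. (11) for a single instance: weighted L1 point term plus an
    edge-direction term over adjacent point pairs (weights illustrative)."""
    loss = sum(lam_dis * l1_dist(ph, pg) for ph, pg in zip(pred_pts, gt_pts))
    for j in range(len(pred_pts) - 1):
        e_hat = [b - a for a, b in zip(pred_pts[j], pred_pts[j + 1])]
        e_gt = [b - a for a, b in zip(gt_pts[j], gt_pts[j + 1])]
        loss += lam_dir * dir_loss(e_hat, e_gt)
    return loss
```

With identical predictions and ground truth the loss is (numerically) zero, and any coordinate shift increases it through the L1 term.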
Mask Construction Loss. Mask learning is able to reduce the risk of overfitting through dense pixel-level supervision. We use a combination of the cross-entropy loss and the dice loss [28] to handle the unbalanced segmentation problem. Given the output masks $\hat{\mathbf{M}}_{ins}$ and $\hat{\mathbf{M}}_{b}$, we formulate the auxiliary loss as

$$\mathcal{L}_{mask}=\lambda_{ins}\mathcal{L}_{ins}(\hat{\mathbf{M}}_{ins},\mathbf{M}_{ins})+\lambda_{b}\mathcal{L}_{b}(\hat{\mathbf{M}}_{b},\mathbf{M}_{b}), \qquad (13)$$

where $\lambda_{ins}$ and $\lambda_{b}$ are the corresponding loss weights.
4 Experiments
| Method | Backbone | Epochs | $\text{AP}_{chamfer}$ (ped. / div. / bou. / avg.) | $\text{AP}_{raster}$ (ped. / div. / bou. / avg.) | FPS |
|---|---|---|---|---|---|
| **Camera-Based Methods** | | | | | |
| HDMapNet [ICRA22][19] | EB0 | 30 | 14.4 / 21.7 / 33.0 / 23.0 | - | - |
| InstaGraM [CVPRW23][39] | EB4 | 30 | 33.8 / 47.2 / 44.0 / 41.7 | - | - |
| MapTR [ICLR23][23] | R50 | 24 | 46.3 / 51.5 / 53.1 / 50.3 | 32.4 / 23.5 / 17.1 / 24.3 | 15.7 |
| MapTR [ICLR23][23] | R50 | 30 | 45.2 / 53.8 / 54.3 / 51.1 | 32.9 / 24.9 / 18.9 / 25.6 | 15.7 |
| MapVR [NeurIPS23][49] | R50 | 24 | 47.7 / 54.4 / 51.4 / 51.2 | 37.5 / 33.1 / 23.0 / 31.2 | 15.7 |
| PivotNet [ICCV23][9] | R50 | 30 | 53.8 / 55.8 / 59.6 / 57.4 | - | 9.6 |
| BeMapNet [CVPR23][32] | R50 | 30 | 57.7 / 62.3 / 59.4 / 59.8 | - | 4.4 |
| MGMap (Ours) | R50 | 30 | 57.4 / 63.5 / 63.3 / 61.4 | 46.5 / 36.5 / 28.7 / 37.2 | 11.6 |
| MapTRv2 [arXiv23][24] | R50 | 24 | 59.8 / 62.4 / 62.4 / 61.5 | - | - |
| MGMap* (Ours) | R50 | 24 | 61.8 / 65.0 / 67.5 / 64.8 | - | - |
| VectorMapNet [ICML23][26] | R50 | 110+ft | 42.5 / 51.4 / 44.1 / 46.0 | - | - |
| MapTR [ICLR23][23] | R50 | 110 | 56.2 / 59.8 / 60.1 / 58.7 | 43.6 / 35.7 / 25.8 / 35.0 | 15.7 |
| MapVR [NeurIPS23][49] | R50 | 110 | 55.0 / 61.8 / 59.4 / 58.8 | 46.0 / 39.7 / 29.9 / 38.5 | 15.7 |
| BeMapNet [CVPR23][32] | R50 | 110 | 62.6 / 66.7 / 65.1 / 64.8 | - | 4.4 |
| MGMap (Ours) | R50 | 110 | 64.4 / 67.6 / 67.7 / 66.5 | 54.5 / 42.1 / 37.4 / 44.7 | 11.6 |
| **LiDAR-Based Methods** | | | | | |
| HDMapNet [ICRA22][19] | PP | 30 | 10.4 / 24.1 / 37.9 / 24.1 | - | - |
| VectorMapNet [ICML23][26] | PP | 110 | 42.5 / 51.4 / 44.1 / 34.0 | - | - |
| MapTR [ICLR23][23] | Sec | 24 | 48.5 / 53.7 / 64.7 / 55.6 | 38.9 / 30.1 / 41.1 / 36.7 | 6.0 |
| MGMap (Ours) | Sec | 24 | 63.5 / 66.7 / 73.6 / 67.9 | 53.4 / 44.4 / 52.6 / 50.1 | 5.5 |
| **Camera-LiDAR Fusion Methods** | | | | | |
| HDMapNet [ICRA22][19] | EB0 & PP | 30 | 16.3 / 29.6 / 46.7 / 31.0 | - | - |
| VectorMapNet [ICML23][26] | EB0 & PP | 110+ft | 48.2 / 60.1 / 53.0 / 53.7 | - | - |
| MapTR [ICLR23][23] | R50 & Sec | 24 | 55.9 / 62.3 / 69.3 / 62.5 | 46.4 / 38.4 / 49.2 / 44.7 | 5.2 |
| MapVR [NeurIPS23][49] | R50 & Sec | 24 | 60.4 / 62.7 / 67.2 / 63.5 | 52.4 / 46.4 / 54.4 / 51.1 | 5.2 |
| MGMap (Ours) | R50 & Sec | 24 | 67.7 / 71.1 / 76.2 / 71.7 | 59.6 / 47.3 / 54.6 / 53.8 | 4.8 |

Table 1: Quantitative evaluation of map vectorization on the nuScenes val. set at a $60m\times30m$ perception range under different input modalities and backbone settings. "EB0", "EB4", "R50", and "Sec" denote the backbones EfficientNet-B0, EfficientNet-B4[41], ResNet50[12], and SECOND[48] (for LiDAR), respectively. "ft" denotes the two-stage fine-tuning strategy. "MGMap*" denotes our structure re-implemented on the stronger MapTRv2[24]. Inference speed is measured on the same computer with a single NVIDIA Tesla V100 GPU.
4.1 Datasets and Benchmarks
We conduct extensive experiments on two public datasets, nuScenes[1] and Argoverse2[46]. The nuScenes dataset contains 1000 driving scenes collected in Boston and Singapore, with 750 and 150 scene sequences annotated for training and validation, respectively. Each scene sequence consists of 40 keyframes sampled at 2 Hz, and each keyframe provides 6 PV images and the corresponding point clouds from a 32-beam LiDAR. Argoverse2 contains 1000 scenes collected from six cities with 7 PV images. We use the subset of Argoverse2 provided by the Online HD Map Construction Challenge (https://github.com/Tsinghua-MARS-Lab/Online-HD-Map-Construction-CVPR2023). We mainly focus on three map elements: lane divider (div.), pedestrian crossing (ped.), and road boundary (bou.).
Figure 5: The visual results of MapTR[23], our proposed MGMap approach and the corresponding ground truth.
4.2 Evaluation Metrics
To facilitate comprehensive evaluations, we employ the Chamfer distance-based average precision $\text{AP}_{chamfer}$[19] and the IoU-based average precision $\text{AP}_{raster}$[49], which evaluate the model from the point-coordinate aspect and by treating each map element as a whole unit, respectively. This ensures that the map vectorization quality is assessed from different perspectives.
Chamfer-Distance AP. For fair comparisons, $\text{AP}_{chamfer}$ is inherited from previous map vectorization works[23, 26]. Specifically, the average Euclidean distance between the sampled points of each predicted map line and the nearest points in the ground-truth labels is measured. The AP is averaged over three distance thresholds, $\tau_{chamfer}\in\{0.5m, 1.0m, 1.5m\}$; a prediction is treated as a true positive (TP) when its distance is below the threshold. $\text{AP}_{chamfer}$ evaluates map construction quality from a point-level perspective, since it measures errors as distances between points.
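The matching criterion above can be sketched as a symmetric Chamfer distance between two sampled polylines; this minimal version averages nearest-neighbour Euclidean distances in both directions (the exact averaging convention of the benchmark is assumed, not quoted):

```python
import math

def chamfer_dist(line_a, line_b):
    """Symmetric Chamfer distance between two sampled polylines:
    mean nearest-neighbour Euclidean distance, averaged over both
    directions."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return 0.5 * (one_way(line_a, line_b) + one_way(line_b, line_a))

gt = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
pred = [(0.0, 0.0), (1.0, 0.1), (2.0, 0.0)]
d = chamfer_dist(pred, gt)   # small: the lines nearly coincide
is_tp = d < 0.5              # true positive under the 0.5 m threshold
```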
IoU-based AP. Following MapVR[49], $\text{AP}_{raster}$ is calculated via the Intersection over Union (IoU) among pixels. Both predictions and ground-truth labels are rasterized onto an HD map canvas of size $480\times240$, and each map element is dilated by two pixels on each side. Detection quality is evaluated by the IoU between the rasterized representations of the prediction and the ground truth. The AP is computed over the thresholds $\tau_{IoU}\in\{0.25, 0.30, \ldots, 0.50\}$ for line-shaped elements and $\tau_{IoU}\in\{0.50, 0.55, \ldots, 0.75\}$ for polygon-shaped pedestrian crossings, with a step size of 0.05, and averaged over all thresholds between the lower and upper bounds. Thus, $\text{AP}_{raster}$ considers each lane line at the pixel level and evaluates map quality from a whole-instance perspective.
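The rasterize-and-compare step can be sketched as follows. This toy version only stamps dilated sample points onto a grid rather than rasterizing full line segments, so it is a simplified stand-in for the actual MapVR protocol:

```python
def rasterize(points, h, w, dilate=2):
    """Stamp a polyline's normalized sample points onto an h x w grid,
    dilating each hit by `dilate` pixels on each side (a coarse stand-in
    for proper line rasterization)."""
    grid = [[0] * w for _ in range(h)]
    for x, y in points:
        cx, cy = int(x * (w - 1)), int(y * (h - 1))
        for dy in range(-dilate, dilate + 1):
            for dx in range(-dilate, dilate + 1):
                px, py = cx + dx, cy + dy
                if 0 <= px < w and 0 <= py < h:
                    grid[py][px] = 1
    return grid

def mask_iou(a, b):
    """Pixel-level IoU between two binary masks of the same shape."""
    inter = sum(x & y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    union = sum(x | y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return inter / union if union else 0.0
```

A prediction whose rasterized mask overlaps the ground truth above the threshold counts as a true positive at that threshold.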
4.3 Implementation Details
To ensure fair comparisons, we choose ResNet50[12] as the image backbone and SECOND[48] as the backbone for the LiDAR modality. The BEV size, defined as $H_{BEV}\times W_{BEV}$, is set to $200\times100$. The maximum numbers of instance and point queries are set to 50 and 20, respectively. In the PG-MPR module, the hyper-parameter $d$, the normalized patch size, is set to 0.1. We use the AdamW[34] optimizer with a learning rate of $6e^{-4}$. All models are trained on 8 NVIDIA Tesla V100 GPUs with a batch size of 6 per GPU. The weighting factors for the detection loss are the same as those in MapTR[23]. For the mask branches, $\{\lambda_{ins},\lambda_{b}\}$ are set to $\{2, 15\}$, respectively.
4.4 Main Results
Performance on nuScenes Dataset. As shown in Table 1, we compare MGMap against the SOTA methods on the nuScenes validation set under different settings, evaluated with $\text{AP}_{chamfer}$ and $\text{AP}_{raster}$. Our proposed approach outperforms previous methods and obtains the best performance. Compared with the baseline MapTR[23], MGMap achieves a 10.3 mAP improvement with multi-view camera input under the same settings of ResNet-50 and 30 training epochs. It is worth noting that MGMap achieves 67.9 mAP with LiDAR input and 71.7 mAP with camera-LiDAR fusion, which demonstrates the strong generalization capability of our scheme. Furthermore, the visual results of MGMap in several driving scenarios are shown in Figure 5. More comparison results under different conditions are included in the supplementary material.
Table 2: Performance comparison with baseline methods at a $60m\times30m$ perception range on a subset of Argoverse2 provided by the Online HD Map Construction Challenge.
Performance on Argoverse2 Dataset. Following the settings of the Online HD Map Construction Challenge, we re-implement MapTR and MGMap on the Argoverse2 dataset. Table 2 presents the experimental results. Our method achieves competitive performance on Argoverse2: MGMap obtains a 5.4 mAP improvement over MapTR, which further shows the effectiveness of our proposed approach.
Performance on Enlarged Perception Ranges. To evaluate the robustness of the model, we conduct experiments with enlarged perception ranges. Under the same settings, we re-implement MapTR and our MGMap with perception ranges of $60m\times60m$ and $30m\times90m$ along the X-axis and Y-axis in BEV space, in which the query numbers are proportionally enlarged to match the larger perception area. All models are trained for 30 epochs. Table 3 reports the results. Compared to MapTR, our MGMap achieves consistent performance improvements: 9.5 mAP at the $60m\times60m$ range and 10.2 mAP at the $30m\times90m$ range.
Table 3: Experimental results with enlarged perception range settings on the nuScenes dataset. Our proposed MGMap approach outperforms MapTR significantly on all evaluation metrics.
4.5 Ablation Study
In this section, ablation experiments are conducted to examine the effectiveness of our proposed modules and designs. For fair comparisons, all experiments are conducted on the camera modality of the nuScenes dataset with 30 epochs for training. ResNet50 is employed as the image backbone.
Ablation on Mask-Guided Design. Our mask-guided design comprises two main components, the MAI decoder and the PG-MPR module, which handle the overall shape structure and specific details at the instance level and the point level, respectively. The MAI decoder leverages the masks to obtain global structure information, which contributes to shape understanding, while the patch refinement operates at a more precise local level, allowing specific and fine-grained adjustments to individual points for precise localization. The ablation results in Table 4 demonstrate the impact of each level of the design. Compared to the absence of mask guidance, including the MAI decoder and the PG-MPR module leads to individual improvements of 1.9 mAP and 2.6 mAP, respectively. Combining the two components in the overall design achieves the highest performance, with 61.4 mAP.
Ablation on EML Neck. To investigate the effectiveness of the EML neck design, we conduct ablation experiments to compare it with an FPN in the PV image space. Multi-scale BEV features contain enriched semantic and location information with larger receptive fields, which benefits the positioning of irregular shapes. As shown in Table 5, compared to baselines without multi-level features, the EML design in BEV space ("BEV") achieves a larger performance gain, while the PV-space design ("PV") fails to achieve the expected effect. Besides, the enhanced BEV features also contribute to mask generation and boost the performance of the mask-guided design ("MG").
Ablation on PG-MPR Design. Finally, we conduct ablation experiments on the settings of the position-guided mask patch refinement module, comparing different selections of the patch size $d$ and the number of refinement stages $s$. The evaluated normalized patch sizes range from 0.08 to 0.12, and the refinement stages are set from 1 to 3. As illustrated in Table 6, applying the refinement strategy contributes to performance improvement.
Table 4: Ablation study of mask-guided design at the instance and point levels. Ins. denotes the instance-level MAI decoder and Point means the point-level PG-MPR module design.
Table 5: Ablation study of EML neck design by investigating the performance of multi-level features at PV and BEV stages. Empirical results show that the best performance is achieved by taking advantage of the BEV-level EML neck for mask-guided design.
Table 6: Performance of point-level PG-MPR design with different patch sizes (d 𝑑 d italic_d) and refinement stages (s 𝑠 s italic_s). The first row means the result without the point-level refinement.
Experiments also indicate that oversized patches introduce irrelevant information and adversely affect performance, while undersized patches lead to information loss and likewise yield suboptimal results. The best performance is achieved when the patch size is set to 0.1 along with two-stage refinement.
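To make the patch mechanism concrete, a hypothetical helper that crops a normalized $d\times d$ patch around a point from a BEV feature grid (zero-padding out-of-range cells) might look like the following; the function name and padding choice are illustrative, not the paper's implementation:

```python
def extract_patch(feat, center, d):
    """Crop a roughly d x d (normalized) patch of the BEV feature map
    around `center`, given in normalized coordinates in [0, 1]^2;
    out-of-range cells are zero-padded."""
    h, w = len(feat), len(feat[0])
    ph, pw = max(1, round(d * h)), max(1, round(d * w))
    cy, cx = int(center[1] * (h - 1)), int(center[0] * (w - 1))
    patch = []
    for y in range(cy - ph // 2, cy - ph // 2 + ph):
        row = []
        for x in range(cx - pw // 2, cx - pw // 2 + pw):
            row.append(feat[y][x] if 0 <= y < h and 0 <= x < w else 0.0)
        patch.append(row)
    return patch
```

The patch size $d$ directly controls how much local context each point sees, which is the trade-off probed in Table 6.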
5 Conclusion
In this paper, we propose MGMap, an effective approach to online HD map vectorization with the guidance of learned masks. By taking advantage of masks at both the instance and point levels, we alleviate the challenges of rough detection and loss of details arising from subtle and sparse annotations in HD maps. Our proposed MGMap not only demonstrates state-of-the-art performance but also exhibits strong robustness in online map vectorization across various experimental settings. For future work, fusing with other perceptual tasks to construct a more comprehensive representation is still a direction worth exploring, which holds promise for further advancements in autonomous driving.
Acknowledgments
This work is supported by the National Natural Science Foundation of China under Grant 62376244. It is also supported by the Information Technology Center and the State Key Lab of CAD&CG, Zhejiang University.
Supplementary Material
In this supplementary file, we provide more details, discussions, and experiments as follows:
• More details of our method;
• Additional experiments;
• Extensive qualitative results;
• Limitations and future work.
Appendix A More Details of Our Method
A.1 Enhanced Multi-level Neck
In this section, we present more details of the fused attention in the enhanced multi-level (EML) neck. The EML neck consists of three cascaded layers, each of which is a basic ResNet block[12] with fused channel attention (CA) and spatial attention (SA)[47]; these attention modules focus on semantic and positional information to adaptively learn the crucial regions of the BEV space.
For a BEV feature of shape $D\times H_{BEV}\times W_{BEV}$, channel attention computes a channel weight of size $D\times1\times1$ through average pooling and max pooling, emphasizing diverse channel information; a sigmoid function after an MLP produces the channel-wise attention map. For spatial attention, average pooling and max pooling are used to compress the channel dimension $D$, constructing a spatial weight of shape $1\times H_{BEV}\times W_{BEV}$ that captures location information. A spatial attention map is then generated by a $7\times7$ convolutional layer followed by the sigmoid function.
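A minimal sketch of the two attention paths is given below, assuming a plain-Python feature tensor as nested lists. Simple addition stands in for the shared MLP of the channel path, and the $7\times7$ convolution of the spatial path is omitted; only the pool-weight-rescale pattern is shown:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(feat):
    """feat: D channels, each an H x W grid. Fuse per-channel average
    and max pooling into a weight in (0, 1), then rescale the channel."""
    weights = []
    for ch in feat:
        flat = [v for row in ch for v in row]
        avg, mx = sum(flat) / len(flat), max(flat)
        weights.append(sigmoid(avg + mx))  # addition stands in for the MLP
    return [[[v * w for v in row] for row in ch] for ch, w in zip(feat, weights)]

def spatial_attention(feat):
    """Compress the channel dimension per pixel by mean and max, squash
    to a per-pixel weight, and rescale every channel at that location."""
    d, h, w = len(feat), len(feat[0]), len(feat[0][0])
    out = [[[0.0] * w for _ in range(h)] for _ in range(d)]
    for y in range(h):
        for x in range(w):
            vals = [feat[c][y][x] for c in range(d)]
            wgt = sigmoid(sum(vals) / d + max(vals))
            for c in range(d):
                out[c][y][x] = feat[c][y][x] * wgt
    return out
```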
A.2 Auxiliary Loss Setting
In addition to regressing point positions, an auxiliary loss for mask construction is required. As mentioned in the main paper, we combine the cross-entropy loss $\mathcal{L}_{ce}$ and the dice loss $\mathcal{L}_{dice}$[28] to supervise the instance mask $\hat{\mathbf{M}}_{ins}$ and the binary mask $\hat{\mathbf{M}}_{b}$. Specifically,

$$\mathcal{L}_{ins}=\lambda_{ins}\mathcal{L}_{ce}(\hat{\mathbf{M}}_{ins},\mathbf{M}_{ins})+\lambda_{d1}\mathcal{L}_{dice}(\hat{\mathbf{M}}_{ins},\mathbf{M}_{ins}), \qquad (14)$$

$$\mathcal{L}_{b}=\lambda_{b}\mathcal{L}_{ce}(\hat{\mathbf{M}}_{b},\mathbf{M}_{b})+\lambda_{d2}\mathcal{L}_{dice}(\hat{\mathbf{M}}_{b},\mathbf{M}_{b}), \qquad (15)$$

where $\{\lambda_{ins},\lambda_{d1},\lambda_{b},\lambda_{d2}\}$ are the corresponding loss weights for the two-level masks. $\mathcal{L}_{ce}$ weights the loss of every pixel equally, while $\mathcal{L}_{dice}$ focuses on mining the foreground areas, which can be formulated as

$$\mathcal{L}_{dice}=1-2\cdot\frac{\hat{\mathbf{M}}\cap\mathbf{M}}{\hat{\mathbf{M}}\cup\mathbf{M}}. \qquad (16)$$
To this end, we encourage a larger intersection over union between the predicted mask $\hat{\mathbf{M}}$ and the ground truth $\mathbf{M}$.
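Eq. (16) can be sketched as a soft dice loss over flattened mask probabilities, where an elementwise product stands in for the set intersection and the sum $|\hat{\mathbf{M}}|+|\mathbf{M}|$ for the union term; this smooth formulation is a common implementation choice, assumed here rather than taken from the paper:

```python
def dice_loss(pred, gt, eps=1e-6):
    """Soft dice loss over flattened mask probabilities:
    1 - 2*|pred ∩ gt| / (|pred| + |gt|), with eps for numerical safety."""
    inter = sum(p * g for p, g in zip(pred, gt))
    total = sum(pred) + sum(gt)
    return 1.0 - (2.0 * inter + eps) / (total + eps)
```

A perfect binary prediction yields a loss near zero, while disjoint masks approach one, which is what drives the foreground-mining behaviour described above.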
Appendix B Additional Experiments
B.1 Time Analysis
Table A1 shows the detailed time analysis of each component. Compared with MapTR[23], the EML neck and PG-MPR introduce only a slight delay, while the initial BEV feature extraction accounts for most of the runtime.
Table A1: Detailed runtime analysis and comparison with MapTR
B.2 Experiments under Different Conditions
We compare MGMap with the state-of-the-art method MapTR[23] under different weather and lighting conditions, using the nuScenes dataset[1] split provided by[32]. We employ ResNet-50[12] as the image backbone and SECOND[48] for the LiDAR modality. All models are trained for 24 epochs. Experiments are evaluated under $\text{mAP}_{chamfer}$ with thresholds of $\{0.5m, 1.0m, 1.5m\}$. As illustrated in Table A2, our method achieves stable improvements of more than +9 mAP under different conditions.
Table A2: Comparisons under several weather and lighting conditions with different input modalities, our MGMap approach consistently achieves significant improvements over MapTR.
B.3 Ablations on Auxiliary Loss
In this section, we investigate the effects of the auxiliary losses. As mentioned in the main paper, we adopt a parallel branch for mask predictions, which requires dense supervision in the BEV space to construct feature-prominent masks. Table A3 reports the experimental results. Compared with the model without the auxiliary loss, mask construction introduces dense pixel-level learning and alleviates overfitting to some extent, resulting in a noteworthy +1.7 mAP improvement (57.6 vs. 59.3). However, simple parallel segmentation learning does not make full use of the mask features. To synergize with the vectorization task, our mask-guided design is proposed to unlock the potential of the mask features and obtains the best performance with 61.4 mAP.
Table A3: Ablation studies on auxiliary loss. Adding parallel segmentation achieves a certain level of improvement, while mask-guided design further enhances performance with the best result.
Figure A1: The visual results of the learned masks, our proposed MGMap approach and the corresponding ground truth.
Appendix C Extensive Qualitative Results
Figure A1 presents the visualization results of the learned masks and the final predictions, in which the binary masks are constructed to assist the final map vectorization. Figure A2 provides a visual comparison with recent state-of-the-art methods. Figures A3, A4, A5, and A6 provide extensive visualization results of our MGMap compared with the state-of-the-art approach MapTR under different weather and lighting conditions. Our MGMap method consistently demonstrates promising capabilities across various scenarios.
Appendix D Limitations and Future Work
As shown in Figure A6, under some adverse conditions, such as low light, occlusion, and long-range perception, our image-based approach still has limitations in achieving reliable performance, mainly due to the lack of effective features and the inferior interpretation of driving scenes. In the future, multi-modal fusion, temporal information, and the introduction of road priors will be explored to address these shortcomings and obtain vectorized HD maps with higher precision.
Figure A2: Comparison with recent state-of-the-art methods.
Figure A3: Visualization results under the weather condition of sunny.
Figure A4: Visualization results under the weather condition of cloudy.
Figure A5: Visualization results under the weather condition of rainy.
Figure A6: Visualization results under the night condition.
References
- Caesar et al. [2020] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In CVPR, 2020.
- Can et al. [2021] Yigit Baran Can, Alexander Liniger, Danda Pani Paudel, and Luc Van Gool. Structured bird’s-eye-view traffic scene understanding from onboard images. In ICCV, pages 15661–15670, 2021.
- Cheng et al. [2021] Bowen Cheng, Alex Schwing, and Alexander Kirillov. Per-pixel classification is not all you need for semantic segmentation. In NeurIPS, pages 17864–17875, 2021.
- Cheng et al. [2022a] Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In CVPR, pages 1290–1299, 2022a.
- Cheng et al. [2020] Tianheng Cheng, Xinggang Wang, Lichao Huang, and Wenyu Liu. Boundary-preserving mask r-cnn. In ECCV, pages 660–676. Springer, 2020.
- Cheng et al. [2022b] Tianheng Cheng, Xinggang Wang, Shaoyu Chen, Wenqiang Zhang, Qian Zhang, Chang Huang, Zhaoxiang Zhang, and Wenyu Liu. Sparse instance activation for real-time instance segmentation. In CVPR, pages 4433–4442, 2022b.
- Da and Zhang [2022] Fang Da and Yu Zhang. Path-aware graph attention for hd maps in motion prediction. In ICRA, pages 6430–6436. IEEE, 2022.
- Ding et al. [2023a] Jian Ding, Nan Xue, Gui-Song Xia, Bernt Schiele, and Dengxin Dai. Hgformer: Hierarchical grouping transformer for domain generalized semantic segmentation. In CVPR, pages 15413–15423, 2023a.
- Ding et al. [2023b] Wenjie Ding, Limeng Qiao, Xi Qiu, and Chi Zhang. Pivotnet: Vectorized pivot learning for end-to-end hd map construction. In ICCV, pages 3672–3682, 2023b.
- Dong et al. [2022] Hao Dong, Xianjing Zhang, Xuan Jiang, Jun Zhang, Jintao Xu, Rui Ai, Weihao Gu, Huimin Lu, Juho Kannala, and Xieyuanli Chen. Superfusion: Multilevel lidar-camera fusion for long-range hd map generation and prediction. arXiv preprint arXiv:2211.15656, 2022.
- Gao et al. [2020] Jiyang Gao, Chen Sun, Hang Zhao, Yi Shen, Dragomir Anguelov, Congcong Li, and Cordelia Schmid. Vectornet: Encoding hd maps and agent dynamics from vectorized representation. In CVPR, pages 11525–11533, 2020.
- He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
- He et al. [2017] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In ICCV, pages 2961–2969, 2017.
- He et al. [2023] Yuzhe He, Shuang Liang, Xiaofei Rui, Chengying Cai, and Guowei Wan. Egovm: Achieving precise ego-localization using lightweight vectorized maps. arXiv preprint arXiv:2307.08991, 2023.
- Hu et al. [2023] Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. In CVPR, pages 17853–17862, 2023.
- Huang et al. [2021] Junjie Huang, Guan Huang, Zheng Zhu, Yun Ye, and Dalong Du. Bevdet: High-performance multi-camera 3d object detection in bird-eye-view. arXiv preprint arXiv:2112.11790, 2021.
- Jiang et al. [2022] Bo Jiang, Shaoyu Chen, Xinggang Wang, Bencheng Liao, Tianheng Cheng, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, and Chang Huang. Perceive, interact, predict: Learning dynamic and static clues for end-to-end motion prediction. arXiv preprint arXiv:2212.02181, 2022.
- Levinson et al. [2007] Jesse Levinson, Michael Montemerlo, and Sebastian Thrun. Map-based precision vehicle localization in urban environments. In Robotics: Science and Systems, Atlanta, GA, USA, 2007.
- Li et al. [2022a] Qi Li, Yue Wang, Yilun Wang, and Hang Zhao. Hdmapnet: An online hd map construction and evaluation framework. In ICRA, pages 4628–4634. IEEE, 2022a.
- Li et al. [2022b] Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Yu Qiao, and Jifeng Dai. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In ECCV, pages 1–18. Springer, 2022b.
- Liang et al. [2020] Justin Liang, Namdar Homayounfar, Wei-Chiu Ma, Yuwen Xiong, Rui Hu, and Raquel Urtasun. Polytransform: Deep polygon transformer for instance segmentation. In CVPR, pages 9131–9140, 2020.
- Liang et al. [2022] Tingting Liang, Hongwei Xie, Kaicheng Yu, Zhongyu Xia, Zhiwei Lin, Yongtao Wang, Tao Tang, Bing Wang, and Zhi Tang. Bevfusion: A simple and robust lidar-camera fusion framework. In NeurIPS, 2022.
- Liao et al. [2022] Bencheng Liao, Shaoyu Chen, Xinggang Wang, Tianheng Cheng, Qian Zhang, Wenyu Liu, and Chang Huang. Maptr: Structured modeling and learning for online vectorized hd map construction. In ICLR, 2022.
- Liao et al. [2023] Bencheng Liao, Shaoyu Chen, Yunchi Zhang, Bo Jiang, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Maptrv2: An end-to-end framework for online vectorized hd map construction. arXiv preprint arXiv:2308.05736, 2023.
- Liu et al. [2022] Yingfei Liu, Tiancai Wang, Xiangyu Zhang, and Jian Sun. Petr: Position embedding transformation for multi-view 3d object detection. In ECCV, pages 531–548. Springer, 2022.
- Liu et al. [2023a] Yicheng Liu, Tianyuan Yuan, Yue Wang, Yilun Wang, and Hang Zhao. Vectormapnet: End-to-end vectorized hd map learning. In ICML, pages 22352–22369. PMLR, 2023a.
- Liu et al. [2023b] Zhijian Liu, Haotian Tang, Alexander Amini, Xinyu Yang, Huizi Mao, Daniela L Rus, and Song Han. Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. In ICRA, pages 2774–2781. IEEE, 2023b.
- Milletari et al. [2016] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In 3DV, pages 565–571. IEEE, 2016.
- Pan et al. [2020] Bowen Pan, Jiankai Sun, Ho Yin Tiga Leung, Alex Andonian, and Bolei Zhou. Cross-view semantic segmentation for sensing surroundings. IEEE Robotics and Automation Letters, 5(3):4867–4873, 2020.
- Peng et al. [2023] Lang Peng, Zhirong Chen, Zhangjie Fu, Pengpeng Liang, and Erkang Cheng. Bevsegformer: Bird’s eye view semantic segmentation from arbitrary camera rigs. In WACV, pages 5935–5943, 2023.
- Philion and Fidler [2020] Jonah Philion and Sanja Fidler. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In ECCV, pages 194–210. Springer, 2020.
- Qiao et al. [2023] Limeng Qiao, Wenjie Ding, Xi Qiu, and Chi Zhang. End-to-end vectorized hd-map construction with piecewise bezier curve. In CVPR, pages 13218–13228, 2023.
- Qin et al. [2023] Zequn Qin, Jingyu Chen, Chao Chen, Xiaozhi Chen, and Xi Li. Unifusion: Unified multi-view fusion transformer for spatial-temporal representation in bird’s-eye-view. In ICCV, pages 8690–8699, 2023.
- Reddi et al. [2019] Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. arXiv preprint arXiv:1904.09237, 2019.
- Reiher et al. [2020] Lennart Reiher, Bastian Lampe, and Lutz Eckstein. A sim2real deep learning approach for the transformation of images from multiple vehicle-mounted cameras to a semantically segmented image in bird’s eye view. In ITSC, pages 1–7. IEEE, 2020.
- Roddick and Cipolla [2020] Thomas Roddick and Roberto Cipolla. Predicting semantic map representations from images using pyramid occupancy networks. In CVPR, pages 11138–11147, 2020.
- Shan and Englot [2018] Tixiao Shan and Brendan Englot. Lego-loam: Lightweight and ground-optimized lidar odometry and mapping on variable terrain. In IROS, pages 4758–4765. IEEE, 2018.
- Shan et al. [2020] Tixiao Shan, Brendan Englot, Drew Meyers, Wei Wang, Carlo Ratti, and Daniela Rus. Lio-sam: Tightly-coupled lidar inertial odometry via smoothing and mapping. In IROS, pages 5135–5142. IEEE, 2020.
- Shin et al. [2023] Juyeb Shin, Francois Rameau, Hyeonjun Jeong, and Dongsuk Kum. Instagram: Instance-level graph modeling for vectorized hd map learning. In CVPRW, 2023.
- Takikawa et al. [2019] Towaki Takikawa, David Acuna, Varun Jampani, and Sanja Fidler. Gated-scnn: Gated shape cnns for semantic segmentation. In ICCV, pages 5229–5238, 2019.
- Tan and Le [2019] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In ICML, pages 6105–6114. PMLR, 2019.
- Tang et al. [2021] Chufeng Tang, Hang Chen, Xiao Li, Jianmin Li, Zhaoxiang Zhang, and Xiaolin Hu. Look closer to segment better: Boundary patch refinement for instance segmentation. In CVPR, pages 13926–13935, 2021.
- Wang et al. [2023] Song Wang, Wentong Li, Wenyu Liu, Xiaolu Liu, and Jianke Zhu. Lidar2map: In defense of lidar-based semantic map construction using online camera distillation. In CVPR, pages 5186–5195, 2023.
- Wang et al. [2021] Tai Wang, Xinge Zhu, Jiangmiao Pang, and Dahua Lin. Fcos3d: Fully convolutional one-stage monocular 3d object detection. In ICCV, pages 913–922, 2021.
- Wang et al. [2022] Yue Wang, Vitor Campagnolo Guizilini, Tianyuan Zhang, Yilun Wang, Hang Zhao, and Justin Solomon. Detr3d: 3d object detection from multi-view images via 3d-to-2d queries. In CoRL, pages 180–191. PMLR, 2022.
- Wilson et al. [2021] Benjamin Wilson, William Qi, Tanmay Agarwal, John Lambert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Ratnesh Kumar, Andrew Hartnett, Jhony Kaesemodel Pontes, Deva Ramanan, Peter Carr, and James Hays. Argoverse 2: Next generation datasets for self-driving perception and forecasting. In NeurIPS, 2021.
- Woo et al. [2018] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. Cbam: Convolutional block attention module. In ECCV, pages 3–19, 2018.
- Yan et al. [2018] Yan Yan, Yuxing Mao, and Bo Li. Second: Sparsely embedded convolutional detection. Sensors, 18(10):3337, 2018.
- Zhang et al. [2023] Gongjie Zhang, Jiahao Lin, Shuang Wu, Yilin Song, Zhipeng Luo, Yang Xue, Shijian Lu, and Zuoguan Wang. Online map vectorization for autonomous driving: A rasterization perspective. In NeurIPS, 2023.
- Zhang et al. [2022] Yunpeng Zhang, Zheng Zhu, Wenzhao Zheng, Junjie Huang, Guan Huang, Jie Zhou, and Jiwen Lu. Beverse: Unified perception and prediction in birds-eye-view for vision-centric autonomous driving. arXiv preprint arXiv:2205.09743, 2022.
- Zhu et al. [2020] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr: Deformable transformers for end-to-end object detection. In ICLR, 2020.