Learning Flow-Guided Registration for RGB–Event Semantic Segmentation

Source: https://arxiv.org/html/2505.01548

Zhen Yao¹, Xiaowen Ying², Zhiyu Zhu³, Mooi Choo Chuah¹

¹Lehigh University, ²Qualcomm AI Research, ³City University of Hong Kong

zhy321@lehigh.edu, xying@qti.qualcomm.com,

zhiyuzhu2-c@my.cityu.edu.hk, chuah@cse.lehigh.edu

Abstract

Event cameras capture microsecond-level motion cues that complement RGB sensors. However, the prevailing paradigm of treating RGB-Event perception as a fusion problem is ill-posed, as it ignores the intrinsic (i) Spatiotemporal and (ii) Modal Misalignment, unlike other RGB-X sensing domains. To tackle these limitations, we recast RGB-Event segmentation from fusion to registration. We propose BRENet, a novel flow-guided bidirectional framework that adaptively matches correspondences between the asymmetric modalities. Specifically, it leverages temporally aligned optical flows as a coarse-grained guide, along with fine-grained event temporal features, to generate precise forward and backward pixel pairings for registration. This pairing mechanism converts the inherent motion lag into terms governed by the flow estimation error, bridging modality gaps. Moreover, we introduce the Motion-Enhanced Event Tensor (MET), a new representation that transforms sparse event streams into a dense, temporally coherent form. Extensive experiments on four large-scale datasets validate our approach, establishing flow-guided registration as a promising direction for RGB–Event segmentation.

1 Introduction

Semantic segmentation, the task of assigning pixel-wise semantic categories, has received much research attention. It is fundamental to computer vision tasks (Wu et al., 2025; 2024), e.g., medical imaging (Chen et al., 2022; Ginley et al., 2023) and robotics (Mosbach & Behnke, 2024; Panda et al., 2023). While most approaches have focused on the RGB modality, recent work has explored incorporating event cameras (Liang et al., 2023), e.g., the Dynamic Vision Sensor (DVS). Event cameras are bio-inspired devices that asynchronously capture edge motion with higher temporal resolution (10 µs vs 3 ms), higher dynamic range (120 dB vs 60 dB), and lower latency (Shiba et al., 2022). These inherent advantages enable robust motion estimation under challenging real-world scenarios.

Despite advances in multimodal integration modules, prior methods (Zihao Zhu et al., 2018; Yao & Chuah, 2024; Chen et al., 2023) treat RGB-Event perception as a fusion problem that implicitly assumes spatiotemporal co-registration. However, RGB-Event perception is intrinsically unaligned, unlike other RGB-X sensing domains: (1) Spatiotemporal Misalignment: Events are captured with microsecond-level latency while RGB images are sampled at a much lower rate, causing spatial shifts of corresponding scene points. (2) Modal Misalignment: Events record asynchronous, sparse brightness changes (temporal derivatives) while RGB captures synchronous, dense absolute intensities. This sensing mismatch presents significant challenges for pixel-level segmentation tasks that require spatially coherent, dense visual information. As illustrated in Figure 1, fusion-centric approaches overlook that RGB and event streams are intrinsically unregistered: fusing multiple event frames with an RGB image disrupts motion continuity and leaves modality gaps unresolved. Without explicit registration, these fusion-centric pipelines yield suboptimal predictions, limiting segmentation performance. Visualizations of the misalignments are provided in the Appendix.

Figure 1: Comparisons between existing fusion-centric models and our registration-centric method. (a) Fusion-centric methods assume spatiotemporal co-registration and ignore inherent misalignments. (b) In contrast, we rethink RGB-Event segmentation with a “registers first, then fuses” design principle, mitigating both misalignments.

To address these misalignments, we recast RGB-Event perception as a registration problem rather than fusion, following a “registers first, then fuses” design principle. We propose BRENet, a registration-centric Bidirectional RGB-Event semantic segmentation framework that estimates correspondence fields in forward and backward event flows to register asynchronous event streams to the reference RGB frame. A subsequent Temporal Fusion Module (TFM) performs spatial warping to correct spatiotemporal offsets and locate motion occlusions. The flow-guided bidirectional registration mechanism and TFM provide ensemble temporal cues and yield temporally coherent features, substantially mitigating the Spatiotemporal Misalignment.

Optical flow serves as the coarse-grained motion prior that guides registration and offers three benefits: (i) it reconciles the sampling-rate disparity through temporal alignment; (ii) it converts sparse, derivative-like events into dense representations compatible with RGB; and (iii) it captures motion changes that align with the inherent modal nature of events. Building on this, we introduce the Motion-Enhanced Event Tensor (MET). This novel representation combines coarse global motion from optical flows with fine temporal event cues, performing representation-level registration between RGB and event. It preserves low-level details in a multi-granularity manner, alleviating both misalignments.

In summary, our contributions in this paper include:

  • We formulate RGB–Event semantic segmentation as a registration problem. We propose a novel flow-guided registration-centric framework, BRENet, which estimates pixel-wise, bidirectional correspondences to pair the asynchronous event stream with the reference RGB frame. It then leverages a Temporal Fusion Module (TFM) for adaptive fusion. Rather than fusing RGB-Event data directly, BRENet registers first, then fuses, shifting the paradigm from a fusion-centric to a registration-centric approach.
  • We introduce a new event representation, the Motion-enhanced Event Tensor (MET), to integrate coarse-grained optical flows with fine-grained temporal visual cues. We redefine the role of optical flow: not as an alternative input modality, but as a bridge that dynamically aligns events with RGB. To the best of our knowledge, we are the first to employ optical flows for registration in RGB–Event perception.
  • We evaluate our proposed BRENet on the DDD17, DSEC, DELIVER, and M3ED datasets and demonstrate its effectiveness. Compared to SOTA models, BRENet achieves superior performance.

2 Related Work

2.1 RGB-Event Semantic Segmentation

Event modality provides complementary information to RGB modality, offering a new perspective on motion dynamics. However, as discussed in Section 1, two key misalignments arise when integrating these two modalities, reflecting differences in sparsity, frequency, and viewpoint.

To address these misalignments, researchers have developed various fusion-centric methods (Chen et al., 2021; 2024) to integrate multi-modal features. CMX (Zhang et al., 2023a) presents a Cross-modal Feature Rectification Module that uses one modality to rectify and refine multi-modal features, learning long-range contextual information. EVSNet (Yao & Chuah, 2024) learns short- and long-term temporal motions from events and then aggregates multi-modal features adaptively. EventSAM (Chen et al., 2023) presents a cross-modal adaptation of SAM (Kirillov et al., 2023) for the event modality, leveraging weighted knowledge distillation. HALSIE (Das Biswas et al., 2024) proposes a dual-encoder framework with a Spiking Neural Network (SNN) and an Artificial Neural Network (ANN) to improve cross-domain feature aggregation. SpikingEDN (Zhang et al., 2024) designs an efficient SNN model that employs a dual-path spiking module for spatially adaptive modulation.

Despite these advances, fusion-centric approaches assume implicit spatiotemporal co-registration and therefore ignore misaligned signals caused by temporal discontinuity and motion displacement. In contrast, our approach is registration-centric: we establish pixel-wise correspondences via flow-guided bidirectional pairing. This design philosophy converts the inherent Spatiotemporal Misalignment into learnable parameters of a registration module, which can be optimized in a model-agnostic way.

2.2 Event Representation

Researchers have explored different event representations. Each single event $e_i$ is represented as a 4-tuple $e_i = [x_i, y_i, p_i, t_i]$, where $x_i, y_i$ are the spatial coordinates, $t_i$ is the timestamp, and $p_i \in \{-1, +1\}$ indicates the polarity of the brightness change (increasing or decreasing).

Early works introduce image-based representations of event streams. Rebecq et al. (2017) consider events in overlapping windows and yield a motion-compensated event representation. EV-FlowNet (Zhu et al., 2018) processes the event streams into event frames where the value of each pixel is the number of events. Maqueda et al. (2018) separate all events into two streams by polarity (positive and negative) and then design a dual-branch model.

Following image-based representations, grid-based representations have been introduced. Zhu et al. (2019) discretize event streams into bins and stack all event bins to generate voxel grids. EST (Gehrig et al., 2019) samples voxel grids from event streams and learns the event representation using differentiable operations. Grid-based representations discretize the temporal dimension into $B$ discrete bins and accumulate all events within a fixed time interval $\Delta t$. Recent approaches explore refined representations based on voxel grids. EISNet (Xie et al., 2024a) leverages event counts as indicators of scene activity, capturing activity-aware features. SE-Adapter (Yao et al., 2024) proposes MSP, a multi-scale spatiotemporal feature-enhanced event representation. However, these representations overlook that the asynchronous and sparse nature of events is incompatible with pixel-level segmentation tasks, which require dense visual information.
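To make the grid-based scheme concrete, below is a minimal NumPy sketch of voxel-grid accumulation in the spirit of Zhu et al. (2019); it uses nearest-bin assignment rather than the bilinear temporal kernel of the original formulation, and the function name and array-based event layout are our own illustrative choices.

```python
import numpy as np

def events_to_voxel_grid(x, y, t, p, num_bins, height, width):
    # x, y: pixel coordinates; t: timestamps; p: polarities in {-1, +1}.
    voxel = np.zeros((num_bins, height, width), dtype=np.float32)
    # Normalize timestamps into [0, 1] and assign each event to a bin.
    t_norm = (t - t.min()) / max(float(t.max() - t.min()), 1e-9)
    bin_idx = np.minimum((t_norm * num_bins).astype(np.int64), num_bins - 1)
    # Accumulate signed polarity counts per (bin, y, x) cell.
    np.add.at(voxel, (bin_idx, y, x), p.astype(np.float32))
    return voxel

# Example: 5 events on a 4x4 sensor, split into B = 2 temporal bins.
x = np.array([0, 1, 2, 3, 3]); y = np.array([0, 0, 1, 2, 3])
t = np.array([0.0, 0.1, 0.5, 0.8, 1.0]); p = np.array([1, -1, 1, 1, -1])
grid = events_to_voxel_grid(x, y, t, p, num_bins=2, height=4, width=4)
```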

In this work, we tackle these limitations by proposing MET, which converts sparse, discrete event data to dense, continuous features. It reframes optical flows as a structural prior to register event frames into the reference RGB timestamp. By integrating coarse-grained flow information with fine-grained temporal correlations, MET effectively mitigates modal gaps and generates more robust features.


Figure 2: Illustration of overall framework. Given an input RGB-Event pair, the Coarse-to-Fine Estimator (CFE) generates bidirectional METs through coarse-grained optical flows and fine-grained event temporal features. The Bidirectional Registration Module (BRM) further adaptively registers METs into image features in both forward and backward directions. Finally, the Temporal Fusion Module (TFM) fuses bidirectional features to learn the temporal consistency.

3 Methodology

3.1 Motivation

Existing fusion-centric methods (Xie et al., 2024a) discretize time and accumulate polarity counts, integrating RGB-Event data without establishing pixel-wise correspondences. This leaves both the spatiotemporal discrepancy (motion-induced spatial displacement) and the modal gap unresolved.

Assume an RGB frame $I$ is captured at time $t_k$ and all events $E$ are continuously recorded during $[t_{k-1}, t_k]$, with event location $x$ and velocity $v$. The fusion function can be expressed as:

$\mathcal{F}_{\text{fuse}}\big(I_{t_k}, E_{[t_{k-1},t_k]}\big) = \sum_i w_i\, f\Big(I_{t_k}(x_0),\; E_i\big(x_0 + v\cdot(t_k - t_i)\big)\Big), \qquad t_k - t_i \ge 0. \qquad (1)$

where $w$ are the fusion weights. Assuming an average motion lag $\bar{\delta}$, the weighted spatial shift $\Delta_{\text{fuse}}(x)$ is:

$\|\Delta_{\text{fuse}}(x)\| = \Big\|\sum_i w_i \cdot v_i \cdot (t_k - t_i)\Big\| \approx \Big\|\sum_i w_i \cdot v_i \cdot \bar{\delta}\Big\| > 0 \qquad (2)$

Given the positive temporal lag $\bar{\delta} > 0$ and velocities of moving objects $v_i > 0$, learnable fusion weights $w$ cannot eliminate the irreducible spatial shift between RGB and events.
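As a quick numeric illustration of Eq. (2) (with made-up weights, velocities, and lags, not values from the paper), any convex combination of strictly positive per-event shifts remains strictly positive:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.dirichlet(np.ones(8))            # fusion weights w_i > 0, summing to 1
v = rng.uniform(5.0, 20.0, size=8)       # object velocities v_i > 0 (px/s)
lag = rng.uniform(1e-3, 33e-3, size=8)   # motion lags t_k - t_i within a frame
shift = abs(np.sum(w * v * lag))         # Eq. (2): weighted spatial shift
print(f"||Delta_fuse(x)|| ~= {shift:.4f} px, strictly positive for any w")
```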

Inspired by the success of optical flow in searching for shared regions in video stabilization (Yu & Ramamoorthi, 2020; Shi et al., 2022; Zhao et al., 2023), we employ it to address the misalignments and serve as visual guidance. In our registration-centric design, optical flow offers key advantages: (i) registering events onto the RGB frame grid through temporal alignment of the asynchronous event stream; (ii) generating dense motion fields that eliminate sparsity; (iii) transforming per-pixel intensity changes (log intensity) into modality-agnostic motion vectors in pixel space, analogous to RGB-derived features (Bardow et al., 2016). This paradigm shift is inherently suited to visual tasks with dense RGB information and can effectively mitigate both the Spatiotemporal (i) and Modal Misalignment (ii & iii) by providing motion and correspondence information (Shi et al., 2023).

Specifically, we convert the irreducible spatial shift into optical flow estimation errors through flow-guided registration. Define the warping function as:

$\phi_{t \to t_0}(x) = x - \int_{t}^{t_0} u(x, \tau)\, d\tau \qquad (3)$

where $u$ is the optical-flow field under a local constant-velocity approximation. Thus, the displacement after registration can be represented as:

$\|\Delta_{\text{reg}}(x)\| = \big\|\big(\phi_{t_i \to t_k} - \hat{\phi}_{t_i \to t_k}\big)(x)\big\| = \Big\|\int_{t_i}^{t_k} \big(u_i - \hat{u}_{t_k}\big)\, d\tau\Big\| \approx \|\mathbf{e}\| \cdot (t_k - t_i) \qquad (4)$

where $\hat{\phi}$ is the ground-truth warping. Assume one refinement step is a contraction $\mathcal{U}$ as below:

$u^{(j+1)} = \mathcal{U}\big(u^{(j)};\, C_{t_k}\big) \qquad (5)$

where $C_{t_k}$ is the cost volume used in estimating optical flows. After $J$ iterations, local convergence gives:

$\big\|u^{(J)} - u_{t_k}\big\| \le \rho^{J}\, \big\|u^{(0)} - u_{t_k}\big\| \qquad (6)$

Under temporal smoothness of the true flow and a bounded previous error $\big\|u_{t_{k-1}} - \hat{u}_{t_{k-1}}\big\| \le \epsilon_{t_{k-1}}$, the final flow error can be represented as:

$\|\mathbf{e}\| = \big\|u^{(J)} - u_{t_k}\big\| \le \rho^{J}\big(L \cdot (t_k - t_{k-1}) + \epsilon_{t_{k-1}}\big) \qquad (7)$

where $L$ is the temporal smoothness constant. Therefore, the spatiotemporal misalignment is no longer tied to motion lag but is expressed as the flow-estimation error $\mathbf{e}$, which contracts with $\rho^{J}$.
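For intuition, a discretized version of the warping in Eq. (3) can be realized with bilinear resampling; the sketch below (PyTorch, with the flow sign convention assumed to map the reference grid at $t_k$ back to $t_i$) is illustrative, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def warp_to_reference(feat, flow):
    # feat: (B, C, H, W) features at time t_i; flow: (B, 2, H, W) dense flow
    # in pixels from the reference grid at t_k to t_i (Eq. (3), discretized
    # under the local constant-velocity approximation).
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=feat.dtype),
                            torch.arange(w, dtype=feat.dtype), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).to(feat.device)       # (2, H, W)
    coords = grid.unsqueeze(0) + flow                         # sampling locations
    coords = torch.stack((2 * coords[:, 0] / (w - 1) - 1,     # normalize x
                          2 * coords[:, 1] / (h - 1) - 1),    # normalize y
                         dim=-1)                              # (B, H, W, 2)
    return F.grid_sample(feat, coords, align_corners=True)
```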

Table 1: Comparison of feature similarity on DDD17 and DSEC. Higher values indicate stronger cross-modal alignment.

Our additional quantitative analysis, reporting the Centered Kernel Alignment (CKA) (Kornblith et al., 2019) metric in Table 1, demonstrates this advantage. Flow-RGB exhibits consistently higher CKA scores than voxel grid-RGB across datasets, demonstrating its superior registration-centric and modality-generic capability over existing event representations.

We adopt a registration-centric formulation by using optical flow as a bridge that pairs the event stream onto the RGB frame's timestamp and pixel grid, providing pixel-wise correspondences. We then combine optical flow with event temporal features, maintaining multi-granular receptive fields while preserving critical low-level details. Our bidirectional scheme further alleviates the Spatiotemporal Misalignment by fusing forward and backward features after registration.

3.2 Architecture Overview

Our proposed framework, BRENet, as illustrated in Figure 2, has three core components: a Coarse-to-Fine Estimator (CFE) for generating Motion-enhanced Event Tensors, a Bidirectional Registration Module (BRM) for multimodal fusion, and a Temporal Fusion Module (TFM) that integrates bidirectional features adaptively.

Our proposed architecture begins by processing the raw event input $E$ through a sampling stage to obtain the event frames $I_E \in \mathbb{R}^{N \times H \times W \times B}$, where $N$ is the number of frames and $B$ is the bin size. These event frames then pass through the flow-guided event tensorization pipeline, comprising a flow encoder, a Temporal Convolution Module, and the proposed Coarse-to-Fine Estimator (CFE), to transform the raw event stream $E$ into the Motion-enhanced Event Tensor (MET) $M$. Note that the generated MET $\{M^f, M^b\}$ is bidirectional, consisting of a forward MET $M^f \in \mathbb{R}^{H \times W \times C}$ and a backward MET $M^b \in \mathbb{R}^{H \times W \times C}$. Simultaneously, multi-scale RGB features $F_I \in \mathbb{R}^{H \times W \times C}$ are extracted from the input image $I \in \mathbb{R}^{H \times W \times 3}$ using an image encoder. The bidirectional MET $\{M^f, M^b\}$ and the RGB features $F_I$ are then registered jointly through the Bidirectional Registration Module (BRM). The resulting forward registered features $F_r^f \in \mathbb{R}^{H \times W \times C}$ and backward registered features $F_r^b \in \mathbb{R}^{H \times W \times C}$ are fused by the Temporal Fusion Module (TFM), which outputs the final refined feature maps $F_I' \in \mathbb{R}^{H \times W \times C}$ for the image decoder to produce the semantic segmentation masks $\hat{Y} \in \mathbb{R}^{H \times W \times 1}$. The details of each module are explained in the following subsections.
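The data flow above can be summarized in a schematic forward pass; the sketch below assumes hypothetical module interfaces (flow_encoder, temporal_conv, cfe, brm, tfm, image_encoder, decoder) mirroring the components named in this section, not the authors' actual code:

```python
import torch

def brenet_forward(image, event_frames, m):
    # m: dict of sub-networks; event_frames: N sampled frames of the stream E.
    O_f = m["flow_encoder"](event_frames)                    # forward flow O^f
    O_b = m["flow_encoder"](torch.flip(event_frames, [1]))   # backward flow O^b
    h = m["temporal_conv"](event_frames)                     # fine temporal cues
    M_f, M_b = m["cfe"](O_f, h), m["cfe"](O_b, h)            # bidirectional METs
    F_I = m["image_encoder"](image)                          # RGB features
    F_r_f, F_r_b = m["brm"](F_I, M_f, M_b)                   # registration (BRM)
    F_prime = m["tfm"](F_I, F_r_f, F_r_b)                    # temporal fusion (TFM)
    return m["decoder"](F_prime)                             # masks Y_hat
```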

3.3 Motion-Enhanced Event Tensor (MET)

Flow-Guided Event Tensorization. We integrate optical flow with event features, enabling the capture of continuous motion trajectories. Specifically, coarse-grained optical flows derived from a flow encoder (Gehrig et al., 2021b) capture pixel correspondences under spatiotemporal displacements, while fine-grained event features model complex temporal dependencies over all time horizons via a Temporal Convolution Module.

Given an input event stream $E$, we split it into $N$ temporal windows and sample $N$ event frames $I_E$ from these snippets. After preprocessing the event data, we first adopt a pre-trained flow encoder (Gehrig et al., 2021b) to estimate dense optical flows $\{O^f, O^b\}$ from the $N$ sampled event frames. The optical flows $\{O^f, O^b\}$ serve as coarse-grained motion dynamics that capture global motion information. Specifically, we estimate $O^f$ from $I_E$ and then reverse the order of all event frames to generate the backward optical flow $O^b$. Meanwhile, the Temporal Convolution Module is designed to capture event temporal features $h$, which serve as the fine-grained event features that capture local boundary information along the temporal dimension. The module consists of three sub-blocks, each comprising a 3D convolutional layer with kernel size 2 × 3 × 3, followed by a 2D convolutional layer with kernel size 3 × 3 and an average pooling layer.
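A minimal PyTorch sketch of one such sub-block is given below; channel widths, activations, and the per-frame application of the 2D convolution are our assumptions, since the text only fixes the kernel sizes and pooling:

```python
import torch
import torch.nn as nn

class TemporalConvBlock(nn.Module):
    # One sub-block: a 3D conv (2x3x3) over temporal bins, a 3x3 2D conv
    # applied per time step, then 2x2 average pooling (assumes even H and W).
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv3d = nn.Conv3d(c_in, c_out, (2, 3, 3), padding=(0, 1, 1))
        self.conv2d = nn.Conv2d(c_out, c_out, 3, padding=1)
        self.pool = nn.AvgPool2d(2)

    def forward(self, x):                       # x: (B, C, T, H, W)
        x = torch.relu(self.conv3d(x))          # -> (B, C', T-1, H, W)
        b, c, t, h, w = x.shape
        y = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        y = self.pool(torch.relu(self.conv2d(y)))      # per-frame 2D conv + pool
        return y.view(b, t, c, h // 2, w // 2).permute(0, 2, 1, 3, 4)

blk = TemporalConvBlock(8, 16)
out = blk(torch.randn(2, 8, 5, 64, 64))         # -> (2, 16, 4, 32, 32)
```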


Figure 3: Illustration of Coarse-to-Fine Estimator (CFE).

Coarse-to-Fine Estimator. To provide a registration-centric representation and address the modal discrepancy, we propose a Coarse-to-Fine Estimator (CFE). By leveraging local adaptive modeling through deformable convolution (Dai et al., 2017), it effectively captures fine-grained structural details and enhances motion-aware representations, yielding a richer and more precise scene understanding for semantic segmentation.

As illustrated in Figure 3, we first apply a Multi-Layer Perceptron (MLP) (Rosenblatt, 1958) to the input optical flows $\{O^f, O^b\}$. We further employ two additional MLPs to generate the offsets and masks required by the subsequent Deformable Convolution Layer, based on the event temporal features $h$. We then introduce a Deformable Convolution Layer which uses these offsets and masks, with the temporal features $h$ acting as conditions to guide adaptive sampling. This enables the kernel to adjust its spatial sampling locations based on local context, alleviating spatial misalignment through flexible receptive fields.

Next, we apply a 2D Fast Fourier Transform (FFT) (Nussbaumer & Nussbaumer, 1982) to the optical flow and convolved features, transforming them into the frequency domain. We leverage this domain based on the observation that optical flow and event representations exhibit distinct but complementary characteristics (Kim et al., 2024). We then concatenate the real and imaginary components of the FFT results to obtain frequency representations $\mathcal{F}(\mathcal{O}) \in \mathbb{R}^{H \times \lfloor \frac{W}{2}+1 \rfloor \times 2C}$ and $\mathcal{F}(\mathcal{C}) \in \mathbb{R}^{H \times \lfloor \frac{W}{2}+1 \rfloor \times 2C}$. The final Motion-enhanced Event Tensor (MET), $M$, is obtained through element-wise multiplication, followed by an MLP and a skip connection as follows:

$M = f\big(\text{FFT}^{-1}(\mathcal{F}(\mathcal{O}) \otimes \mathcal{F}(\mathcal{C}))\big) + f(O) \qquad (8)$

where $f(\cdot)$ denotes the MLP block; $\text{FFT}^{-1}$ represents the inverse Fast Fourier Transform operation; and $\mathcal{F}(\mathcal{O})$ and $\mathcal{F}(\mathcal{C})$ denote the frequency representations (concatenations of the real and imaginary parts) of the optical flow features and convolved features, respectively.
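The sketch below illustrates Eq. (8) in PyTorch; note it uses a complex-valued product in place of the paper's real/imaginary concatenation, and represents the MLP block $f(\cdot)$ with a single shared 1x1 convolution, both simplifying assumptions:

```python
import torch
import torch.nn as nn

class METFusionSketch(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.f = nn.Conv2d(c, c, kernel_size=1)     # stand-in for the MLP block

    def forward(self, O, C_feat):                   # flow and convolved features
        F_O = torch.fft.rfft2(O)                    # F(O), complex-valued
        F_C = torch.fft.rfft2(C_feat)               # F(C), complex-valued
        fused = torch.fft.irfft2(F_O * F_C, s=O.shape[-2:])  # FFT^{-1}(F(O) ⊗ F(C))
        return self.f(fused) + self.f(O)            # MLP + skip connection, Eq. (8)
```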

3.4 Registration-Centric Propagation

Bidirectional Registration Module. The registration module is flexible and adaptable to multiple network architectures. Here we follow FEVD (Kim et al., 2024) and extend their Frequency-aware Cross-modal Feature Enhancement (FCFE) module into a bidirectional setting. We leverage the frequency domain in both forward and backward directions to capture low- and high-frequency components in a domain-invariant manner. Module details can be found in the Appendix.


Figure 4: Illustration of Temporal Fusion Module (TFM).

Temporal Fusion Module. The Temporal Fusion Module (TFM) fuses the forward and backward registered features $F_r$ with the image features $F_I$ to capture temporal coherence and enhance contextual representation across the temporal dimension. As shown in Figure 4, the module utilizes a Deformable Convolution Layer to align features from different time steps by jointly warping the same regions of the bidirectional registered features $F_r$ into the input image features $F_I$:

$F_r^{f\prime} = \mathcal{DC}\big(F_I, f_1(F_r^f), f_2(F_r^f)\big) \qquad (9)$

$F_r^{b\prime} = \mathcal{DC}\big(F_I, f_3(F_r^b), f_4(F_r^b)\big) \qquad (10)$

where $f_n(\cdot)$ denotes different MLP blocks and $\mathcal{DC}(\cdot)$ denotes the Deformable Convolution Layer. The output features are multiplied by the projected image features $f(F_I)$, concatenated with $f(F_I)$, and subsequently passed through a Depth-wise Convolution Layer (Chollet, 2017) with skip connections as follows:

$F_I' = \mathrm{Concat}\big(F_r^{f\prime} \otimes f(F_I),\; F_r^{b\prime} \otimes f(F_I),\; f(F_I)\big) \qquad (11)$

$F' = \mathcal{DWC}(F_I') + F_I \qquad (12)$

where $f(\cdot)$ denotes an MLP block and $\mathcal{DWC}(\cdot)$ denotes the Depth-wise Convolution Layer. The output of the TFM is the refined features $F'$.
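A compact sketch of Eqs. (9)-(12) using torchvision's deformable convolution is shown below; the offset-predicting convolutions stand in for the MLP blocks $f_1$-$f_4$, the modulation masks are omitted, and the grouped 3C-to-C convolution approximates the depth-wise stage, all of which are assumptions:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class TFMSketch(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.off_f = nn.Conv2d(c, 18, 3, padding=1)  # offsets from F_r^f (f1)
        self.off_b = nn.Conv2d(c, 18, 3, padding=1)  # offsets from F_r^b (f3)
        self.dc = DeformConv2d(c, c, 3, padding=1)   # DC(.) in Eqs. (9)-(10)
        self.proj = nn.Conv2d(c, c, 1)               # f(F_I) in Eq. (11)
        self.dwc = nn.Conv2d(3 * c, c, 3, padding=1, groups=c)  # depth-wise stage

    def forward(self, F_I, F_r_f, F_r_b):
        A_f = self.dc(F_I, self.off_f(F_r_f))        # Eq. (9), mask omitted
        A_b = self.dc(F_I, self.off_b(F_r_b))        # Eq. (10)
        g = self.proj(F_I)
        F_cat = torch.cat([A_f * g, A_b * g, g], 1)  # Eq. (11)
        return self.dwc(F_cat) + F_I                 # Eq. (12)
```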

Finally, BRENet adopts a lightweight image decoder, consisting of one MLP block, to predict the segmentation results $\hat{Y}$.

4 Experiments

4.1 Implementation Details

Training details. We implement our work using PyTorch (Paszke, 2019) and MMSeg (Contributors, 2020). The loss function is the per-pixel cross-entropy loss, following common practice, with an Online Hard Example Mining (OHEM) strategy. We train the model using the AdamW optimizer (Loshchilov & Hutter, 2017) and a poly learning rate schedule with an initial LR of 6e-5. We use 2 NVIDIA RTX A5000 GPUs for training. All models are trained for 80k iterations. More details are in the Appendix.
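For reference, a hedged sketch of this optimization setup follows; the poly power and weight decay are assumptions, and the OHEM wrapper around the cross-entropy loss is omitted:

```python
import torch

model = torch.nn.Conv2d(3, 11, 1)   # stand-in for BRENet (11 DSEC classes)
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-5, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.PolynomialLR(
    optimizer, total_iters=80_000, power=1.0)            # poly decay over 80k iters
criterion = torch.nn.CrossEntropyLoss(ignore_index=255)  # per-pixel CE (no OHEM)
```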

4.2 Datasets

We use 4 public large-scale datasets: DDD17 (Binas et al., 2017), DSEC (Gehrig et al., 2021a), DELIVER (Zhang et al., 2023b), and M3ED (Chaney et al., 2023). DDD17 contains 15950 grayscale images for training and 3890 images for testing, with 6 categories. DSEC contains 11 video sequences (10891 frames) with 11 categories. DELIVER targets RGB-X semantic segmentation, and we evaluate our model on it for robustness. For M3ED, we chose 4 sequences (5516 images) for training and 2 sequences (2481 images) for testing from the Urban Day subset, after manual inspection.

Table 2: Baseline comparisons on DDD17 and DSEC dataset. Improvements over the second-best are highlighted in green.

Table 3: Baseline comparisons on DELIVER and M3ED datasets using mIoU.

Table 4: Model complexity on DDD17 dataset.

4.3 Quantitative Results

We evaluate BRENet and SOTA models on the DDD17 (Binas et al., 2017) and DSEC (Gehrig et al., 2021a) datasets in Table 2. The SOTA models evaluated include RGB-based models (SegFormer (Xie et al., 2021), SegNeXt (Guo et al., 2022)), event-based models (EV-SegNet (Alonso & Murillo, 2019), ESS (Sun et al., 2022)), and RGB-Event models (EDCNet-S2D (Zhang et al., 2021), CMX (Zhang et al., 2023a), CMNeXt (Zhang et al., 2023b), HALSIE (Das Biswas et al., 2024), CMESS (Xie et al., 2024b), OpenESS (Kong et al., 2024), SpikingEDN (Zhang et al., 2024), SE-Adapter (Yao et al., 2024), EISNet (Xie et al., 2024a), Spike-BRGNet (Long et al., 2024), Hybrid-Seg (Li et al., 2025), Any2Seg (Zheng et al., 2024), GeminiFusion (Jia et al., 2024)), using their default settings.

As shown in Table 2, BRENet achieves 78.56% mIoU and 96.61% accuracy on DDD17, and 74.94% mIoU and 95.85% accuracy on DSEC, using the MiT-B2 backbone (Xie et al., 2021). It outperforms SOTA methods on both datasets by a large margin. We present results exclusively with MiT-B2 for direct comparison, as most SOTA models adopt this backbone. We further evaluate model performance on the DELIVER (Zhang et al., 2023b) and M3ED (Chaney et al., 2023) datasets in Table 3. The table indicates that BRENet achieves 63.13% and 67.28% mIoU on DELIVER and M3ED, respectively. Our model achieves the best performance on all datasets and outperforms SOTA methods significantly, improving over the previous best models by 7.37% mIoU on M3ED, 4.63% mIoU on DELIVER, 3.53% mIoU on DDD17, and 1.87% mIoU on DSEC. It is noteworthy that BRENet demonstrates substantial performance improvements over the SAM-based model, SE-Adapter (Yao et al., 2024).

Additionally, we analyze model complexity on the DDD17 dataset in Table 4. While earlier approaches exhibit the smallest parameter sizes, they achieve substantially lower mIoU. In contrast, our method achieves SOTA performance (78.56% mIoU) while having a model size and MACs comparable to recent high-performing approaches. Although our bidirectional propagation slightly increases overhead, our efficient flow encoder ensures comparable processing latency. The results demonstrate that our method effectively balances the trade-off between accuracy and computational efficiency, offering a more practical solution for real-world event-based vision applications.


Figure 5: Qualitative results on DSEC dataset. The proposed BRENet produces images with enhanced boundary details and more robust predictions compared to SOTA methods. More qualitative results are in the Appendix.

Figure 6: Effects of adding MET, visualized with t-SNE. (a) RGB; (b) RGB w/ MET.


Figure 7: Visualization of forward and backward optical flows and feature maps after adding MET. Best viewed with zoom.

4.4 Qualitative Results

We compare the qualitative results of our method with SOTA models (e.g., CMX (Zhang et al., 2023a), CMNeXt (Zhang et al., 2023b), and EISNet (Xie et al., 2024a)) using their default settings in Figure 5. We analyze qualitative results across diverse scenarios, including varying lighting conditions (rows 1-2) and crowded scenes (rows 3-4). These models struggle to capture small moving objects (e.g., people in row 2) and locate accurate boundaries between different categories (e.g., road signs in row 1), while BRENet enhances boundary accuracy and effectively resolves the blurring in fast motion.

We additionally present the t-SNE visualization of image features after adding MET in Figure 6. In contrast to (a), we observe more distinct and well-separated clusters in (b), indicating enhanced feature differentiation. This suggests that incorporating MET improves the model's ability to learn more discriminative features, further enhancing performance.

To further analyze the practical benefits of MET, we visualize features in Figure 7. It shows that incorporating MET selectively highlights areas with rich semantic information (e.g., buildings) while reducing attention to less important regions (e.g., sidewalks). MET’s multi-granular receptive fields preserve low-level details with motion semantics, leading to clearer boundary details with less noise.

4.5 Ablation Study

Design choices for event representations. To validate the impact of different event representations, we compare our proposed MET with commonly used event representations, e.g., the voxel grid (Zhu et al., 2019) and AET (Xie et al., 2024a). We take different event representations as input to the baseline and concatenate the event features with image features in the intermediate layer. The corresponding results are listed in rows 1-6 of Table 5. Our proposed MET outperforms other SOTA representations by 2.06% (voxel grid) and 0.92% (AET). Rows 4-6 provide a granularity analysis of our design, showing that both coarse-grained and fine-grained features are important. With a single input (variants 4 & 5), event temporal features bring larger improvements, enhancing low-level details. Adding optical flow (variant 6) restores high-level motion and semantics.

Design choices for proposed modules. To validate the effectiveness of each component, we further evaluate four variants (7-10) of BRENet. Specifically, they are: (7) removing bidirectional propagation, (8) removing the BRM, (9) removing the TFM and applying concatenation, and (10) employing all designed components. Results are summarized in Table 5. Without bidirectional propagation, performance drops drastically (comparing variants 7 & 10). This is because adding bidirectional contexts and extending the RM to a bidirectional RM enhances motion dynamics in challenging scenarios. Incorporating the BRM (variant 8) and TFM (variant 9) improves results, indicating that they successfully mitigate temporal and spatial misalignments.

Table 5: Ablation study on DDD17 dataset.

Table 6: Ablation study on plug-and-play performance of MET in SOTA methods using mIoU.

Plug-and-play performance of MET. We provide additional results of incorporating MET into SOTA methods, CMX (Zhang et al., 2023a) and EISNet (Xie et al., 2024a), on DDD17 and DSEC in Table 6. We observe that CMX with MET achieves 71.18% and 68.34% mIoU on DDD17 and DSEC, improving the original method by 3.71% and 3.05%. Similarly, EISNet with MET achieves 75.64% and 73.56% mIoU on DDD17 and DSEC, improving the original architecture by 0.61% and 0.49%.

5 Conclusion

In this paper, we recast RGB-Event semantic segmentation as a registration problem rather than a fusion problem. Our proposed framework, BRENet, establishes pixel-wise correspondences through flow-guided bidirectional registration and then fuses the aligned features. We further introduce the Motion-enhanced Event Tensor (MET) representation, which converts asynchronous, sparse events into a dense, temporally coherent form by combining coarse optical flows with fine event temporal features. Bidirectional registration and MET effectively capture temporal context while bridging modality gaps, mitigating both Spatiotemporal and Modal Misalignment. Experimental results on DDD17, DSEC, DELIVER, and M3ED demonstrate the effectiveness of our model. Our findings and the “registers first, then fuses” design philosophy suggest a promising direction for future research in RGB-Event perception.

References

  • Alonso & Murillo (2019) Inigo Alonso and Ana C Murillo. Ev-segnet: Semantic segmentation for event-based cameras. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 0–0, 2019.
  • Bardow et al. (2016) Patrick Bardow, Andrew J Davison, and Stefan Leutenegger. Simultaneous optical flow and intensity estimation from an event camera. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 884–892, 2016.
  • Bazazian & Parés (2021) Dena Bazazian and M Eulàlia Parés. Edc-net: Edge detection capsule network for 3d point clouds. Applied Sciences, 11(4):1833, 2021.
  • Binas et al. (2017) Jonathan Binas, Daniel Neil, Shih-Chii Liu, and Tobi Delbruck. Ddd17: End-to-end davis driving dataset. arXiv preprint arXiv:1711.01458, 2017.
  • Chaney et al. (2023) Kenneth Chaney, Fernando Cladera, Ziyun Wang, Anthony Bisulco, M Ani Hsieh, Christopher Korpela, Vijay Kumar, Camillo J Taylor, and Kostas Daniilidis. M3ed: Multi-robot, multi-sensor, multi-environment event dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4016–4023, 2023.
  • Chen et al. (2022) Zhang Chen, Zhiqiang Tian, Jihua Zhu, Ce Li, and Shaoyi Du. C-cam: Causal cam for weakly supervised semantic segmentation on medical image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11676–11685, 2022.
  • Chen et al. (2021) Zhimin Chen, Longlong Jing, Yang Liang, YingLi Tian, and Bing Li. Multimodal semi-supervised learning for 3d objects. arXiv preprint arXiv:2110.11601, 2021.
  • Chen et al. (2024) Zhimin Chen, Longlong Jing, Yingwei Li, and Bing Li. Bridging the domain gap: Self-supervised 3d scene understanding with foundation models. Advances in Neural Information Processing Systems, 36, 2024.
  • Chen et al. (2023) Zhiwen Chen, Zhiyu Zhu, Yifan Zhang, Junhui Hou, Guangming Shi, and Jinjian Wu. Segment any events via weighted adaptation of pivotal tokens. arXiv preprint arXiv:2312.16222, 2023.
  • Chollet (2017) François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1251–1258, 2017.
  • Contributors (2020) MMSegmentation Contributors. MMSegmentation: Openmmlab semantic segmentation toolbox and benchmark. https://github.com/open-mmlab/mmsegmentation, 2020.
  • Dai et al. (2017) Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 764–773, 2017.
  • Das Biswas et al. (2024) Shristi Das Biswas, Adarsh Kosta, Chamika Liyanagedera, Marco Apolinario, and Kaushik Roy. Halsie: Hybrid approach to learning segmentation by simultaneously exploiting image and event modalities. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 5964–5974, 2024.
  • Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Ieee, 2009.
  • Gehrig et al. (2019) Daniel Gehrig, Antonio Loquercio, Konstantinos G Derpanis, and Davide Scaramuzza. End-to-end learning of representations for asynchronous event-based data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5633–5643, 2019.
  • Gehrig et al. (2021a) Mathias Gehrig, Willem Aarents, Daniel Gehrig, and Davide Scaramuzza. Dsec: A stereo event camera dataset for driving scenarios. IEEE Robotics and Automation Letters, 6(3):4947–4954, 2021a.
  • Gehrig et al. (2021b) Mathias Gehrig, Mario Millhäusler, Daniel Gehrig, and Davide Scaramuzza. E-raft: Dense optical flow from event cameras. In 2021 International Conference on 3D Vision (3DV), pp. 197–206. IEEE, 2021b.
  • Ginley et al. (2023) Brandon Ginley, Nicholas Lucarelli, Jarcy Zee, Sanjay Jain, Seung Seok Han, Luis Rodrigues, Michelle L Wong, Kuang-yu Jen, and Pinaki Sarder. Automated reference kidney histomorphometry using a panoptic segmentation neural network correlates to patient demographics and creatinine. In Medical Imaging 2023: Digital and Computational Pathology, volume 12471, pp. 458–462. SPIE, 2023.
  • Guo et al. (2022) Meng-Hao Guo, Cheng-Ze Lu, Qibin Hou, Zhengning Liu, Ming-Ming Cheng, and Shi-Min Hu. Segnext: Rethinking convolutional attention design for semantic segmentation. Advances in Neural Information Processing Systems, 35:1140–1156, 2022.
  • Jia et al. (2024) Ding Jia, Jianyuan Guo, Kai Han, Han Wu, Chao Zhang, Chang Xu, and Xinghao Chen. Geminifusion: Efficient pixel-wise multimodal fusion for vision transformer. arXiv preprint arXiv:2406.01210, 2024.
  • Kim et al. (2024) Taewoo Kim, Hoonhee Cho, and Kuk-Jin Yoon. Frequency-aware event-based video deblurring for real-world motion blur. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 24966–24976, 2024.
  • Kirillov et al. (2023) Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4015–4026, 2023.
  • Kong et al. (2024) Lingdong Kong, Youquan Liu, Lai Xing Ng, Benoit R Cottereau, and Wei Tsang Ooi. Openess: Event-based semantic scene understanding with open vocabularies. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15686–15698, 2024.
  • Kornblith et al. (2019) Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. In International conference on machine learning, pp. 3519–3529. PMLR, 2019.
  • Li et al. (2025) Hebei Li, Yansong Peng, Jiahui Yuan, Peixi Wu, Jin Wang, Yueyi Zhang, and Xiaoyan Sun. Efficient event-based semantic segmentation via exploiting frame-event fusion: A hybrid neural network approach. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp. 18296–18304, 2025.
  • Liang et al. (2023) Jinxiu Liang, Yixin Yang, Boyu Li, Peiqi Duan, Yong Xu, and Boxin Shi. Coherent event guided low-light video enhancement. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10615–10625, 2023.
  • Liu et al. (2023) Haotian Liu, Guang Chen, Sanqing Qu, Yanping Zhang, Zhijun Li, Alois Knoll, and Changjun Jiang. Tma: Temporal motion aggregation for event-based optical flow. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9685–9694, 2023.
  • Long et al. (2024) Xianlei Long, Xiaxin Zhu, Fangming Guo, Chao Chen, Xiangwei Zhu, Fuqiang Gu, Songyu Yuan, and Chunlong Zhang. Spike-brgnet: Efficient and accurate event-based semantic segmentation with boundary region-guided spiking neural networks. IEEE Transactions on Circuits and Systems for Video Technology, 2024.
  • Loshchilov & Hutter (2017) Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  • Luo et al. (2023) Xinglong Luo, Kunming Luo, Ao Luo, Zhengning Wang, Ping Tan, and Shuaicheng Liu. Learning optical flow from event camera with rendered dataset. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9847–9857, 2023.
  • Luo et al. (2024) Xinglong Luo, Ao Luo, Zhengning Wang, Chunyu Lin, Bing Zeng, and Shuaicheng Liu. Efficient meshflow and optical flow estimation from event cameras. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19198–19207, 2024.
  • Maqueda et al. (2018) Ana I Maqueda, Antonio Loquercio, Guillermo Gallego, Narciso García, and Davide Scaramuzza. Event-based vision meets deep learning on steering prediction for self-driving cars. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5419–5427, 2018.
  • Mosbach & Behnke (2024) Malte Mosbach and Sven Behnke. Grasp anything: Combining teacher-augmented policy gradient learning with instance segmentation to grasp arbitrary objects. Proceedings of IEEE ICRA, 2024.
  • Nussbaumer & Nussbaumer (1982) Henri J Nussbaumer and Henri J Nussbaumer. The fast Fourier transform. Springer, 1982.
  • Panda et al. (2023) Shivam K Panda, Yongkyu Lee, and M Khalid Jawed. Agronav: Autonomous navigation framework for agricultural robots and vehicles using semantic segmentation and semantic line detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6272–6281, 2023.
  • Paszke (2019) A Paszke. Pytorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703, 2019.
  • Patil & Kulkarni (2018) Priyanka A Patil and Charudatta Kulkarni. A survey on multiply accumulate unit. In 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA), pp. 1–5. IEEE, 2018.
  • Rebecq et al. (2017) Henri Rebecq, Timo Horstschaefer, and Davide Scaramuzza. Real-time visual-inertial odometry for event cameras using keyframe-based nonlinear optimization. In Proceedings of the British Machine Vision Conference (BMVC), pp. 16–1, 2017.
  • Rosenblatt (1958) Frank Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological review, 65(6):386, 1958.
  • Shi et al. (2023) Xiaoyu Shi, Zhaoyang Huang, Dasong Li, Manyuan Zhang, Ka Chun Cheung, Simon See, Hongwei Qin, Jifeng Dai, and Hongsheng Li. Flowformer++: Masked cost volume autoencoding for pretraining optical flow estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 1599–1610, 2023.
  • Shi et al. (2022) Zhenmei Shi, Fuhao Shi, Wei-Sheng Lai, Chia-Kai Liang, and Yingyu Liang. Deep online fused video stabilization. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp. 1250–1258, 2022.
  • Shiba et al. (2022) Shintaro Shiba, Yoshimitsu Aoki, and Guillermo Gallego. Secrets of event-based optical flow. In European Conference on Computer Vision, pp. 628–645. Springer, 2022.
  • Sun et al. (2022) Zhaoning Sun, Nico Messikommer, Daniel Gehrig, and Davide Scaramuzza. Ess: Learning event-based semantic segmentation from still images. In European Conference on Computer Vision, pp. 341–357. Springer, 2022.
  • Wu et al. (2024) Wangyu Wu, Xianglin Qiu, Siqi Song, Xiaowei Huang, Fei Ma, and Jimin Xiao. Prompt categories cluster for weakly supervised semantic segmentation. arXiv preprint arXiv:2412.13823, 2024.
  • Wu et al. (2025) Wangyu Wu, Tianhong Dai, Zhenhong Chen, Xiaowei Huang, Fei Ma, and Jimin Xiao. Generative prompt controlled diffusion for weakly supervised semantic segmentation. Neurocomputing, 2025.
  • Xie et al. (2024a) Bochen Xie, Yongjian Deng, Zhanpeng Shao, and Youfu Li. Eisnet: A multi-modal fusion network for semantic segmentation with events and images. IEEE Transactions on Multimedia, 2024a.
  • Xie et al. (2024b) Chuyun Xie, Wei Gao, and Ren Guo. Cross-modal learning for event-based semantic segmentation via attention soft alignment. IEEE Robotics and Automation Letters, 9(3):2359–2366, 2024b.
  • Xie et al. (2021) Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer: Simple and efficient design for semantic segmentation with transformers. Advances in neural information processing systems, 34:12077–12090, 2021.
  • Yao et al. (2024) Bowen Yao, Yongjian Deng, Yuhan Liu, Hao Chen, Youfu Li, and Zhen Yang. Sam-event-adapter: Adapting segment anything model for event-rgb semantic segmentation. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 9093–9100. IEEE, 2024.
  • Yao & Chuah (2024) Zhen Yao and Mooi Choo Chuah. Event-guided low-light video semantic segmentation. arXiv preprint arXiv:2411.00639, 2024.
  • Yu & Ramamoorthi (2020) Jiyang Yu and Ravi Ramamoorthi. Learning video stabilization using optical flow. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8159–8167, 2020.
  • Zhang et al. (2021) Jiaming Zhang, Kailun Yang, and Rainer Stiefelhagen. Exploring event-driven dynamic context for accident scene segmentation. IEEE Transactions on Intelligent Transportation Systems, 23(3):2606–2622, 2021.
  • Zhang et al. (2023a) Jiaming Zhang, Huayao Liu, Kailun Yang, Xinxin Hu, Ruiping Liu, and Rainer Stiefelhagen. Cmx: Cross-modal fusion for rgb-x semantic segmentation with transformers. IEEE Transactions on intelligent transportation systems, 2023a.
  • Zhang et al. (2023b) Jiaming Zhang, Ruiping Liu, Hao Shi, Kailun Yang, Simon Reiß, Kunyu Peng, Haodong Fu, Kaiwei Wang, and Rainer Stiefelhagen. Delivering arbitrary-modal semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1136–1147, 2023b.
  • Zhang et al. (2024) Rui Zhang, Luziwei Leng, Kaiwei Che, Hu Zhang, Jie Cheng, Qinghai Guo, Jianxing Liao, and Ran Cheng. Accurate and efficient event-based semantic segmentation using adaptive spiking encoder–decoder network. IEEE Transactions on Neural Networks and Learning Systems, 2024.
  • Zhao et al. (2023) Weiyue Zhao, Xin Li, Zhan Peng, Xianrui Luo, Xinyi Ye, Hao Lu, and Zhiguo Cao. Fast full-frame video stabilization with iterative optimization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 23534–23544, 2023.
  • Zheng et al. (2024) Xu Zheng, Yuanhuiyi Lyu, and Lin Wang. Learning modality-agnostic representation for semantic segmentation from any modalities. In European Conference on Computer Vision, pp. 146–165. Springer, 2024.
  • Zhu et al. (2018) Alex Zihao Zhu, Liangzhe Yuan, Kenneth Chaney, and Kostas Daniilidis. Ev-flownet: Self-supervised optical flow estimation for event-based cameras. arXiv preprint arXiv:1802.06898, 2018.
  • Zhu et al. (2019) Alex Zihao Zhu, Liangzhe Yuan, Kenneth Chaney, and Kostas Daniilidis. Unsupervised event-based learning of optical flow, depth, and egomotion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 989–997, 2019.
  • Zihao Zhu et al. (2018) Alex Zihao Zhu, Liangzhe Yuan, Kenneth Chaney, and Kostas Daniilidis. Unsupervised event-based optical flow using motion compensation. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, pp. 0–0, 2018.

Appendix Overview

In this Appendix, we provide additional details to complement the content of the paper, including the model details of the BRM (Section A), additional visualizations (Section B), additional ablation studies (Section C), a performance vs. model size analysis (Section D), failure cases (Section E), misalignment visualization (Section F), and limitations of the proposed method (Section G).


Figure 8: Illustration of Bidirectional Registration Module (BRM). From left to right, the components are Spatial and Channel Attention Blocks.

Appendix A More Details

A.1 Architectural Details

A.1.1 Bidirectional Registration Module

Inspired by FEVD (Kim et al., 2024), we build the Bidirectional Registration Module (BRM), in which we leverage the frequency domain in both the forward and backward directions to mitigate the misalignment between RGB features and MET features. It consists of two main components: the Spatial Attention Block and the Channel Attention Block.

As shown in Figure 8, we first concatenate the image features $F_I$ from the image encoder with the bidirectional METs $\{M^f, M^b\}$, and apply an MLP to each modality's features.

For the Spatial Attention Block, we apply 2D Fast Fourier Transform (FFT) to MET features. The generated real and imaginary components are concatenated to preserve both amplitude and phase information as follows:

$\mathcal{F}(\mathcal{M}) = \text{FFT}\big(f(\mathrm{Concat}(M, F_I))\big) \qquad (13)$

where $\mathcal{F}(\mathcal{M})$ can be viewed as the frequency representation of the MET features in both directions, and $\text{FFT}(\cdot)$ includes the Fast Fourier Transform operation followed by concatenation. We then adopt a Convolution Layer to generate the Spatial Attention Mask, and the proposed Spatial Attention Block is calculated as:

$\mathcal{A}_s = \sigma\big(\mathcal{C}_{3\times 3}(\mathcal{F}(\mathcal{M}))\big) \qquad (14)$

$F_s = \text{FFT}^{-1}\big(\mathcal{A}_s \otimes f(F_I)\big) \qquad (15)$

where $\mathcal{C}_{3\times 3}(\cdot)$ represents a Convolution Layer with kernel size 3 × 3, followed by ReLU and Sigmoid functions; $\sigma(\cdot)$ indicates the Sigmoid activation function; and $\text{FFT}^{-1}(\cdot)$ refers to the inverse Fast Fourier Transform operation. Note that the Spatial Attention Block is applied in both the forward and backward directions, generating spatially correlated features $\{F_s^f, F_s^b\} \in \mathbb{R}^{H \times W \times C}$.

For the Channel Attention Block, we similarly apply a 2D Fast Fourier Transform (FFT) (Nussbaumer & Nussbaumer, 1982) to the spatially correlated features $F_s$. The Channel Attention Mask and the subsequent channel-correlated features $F_c$ are calculated based on the Average Pooling Layer as:

$\mathcal{A}_c = \mathcal{AP}\big(F_s \oplus f(\mathrm{Concat}(M, F_I))\big) \qquad (16)$

$F_c = \text{FFT}^{-1}\big(\mathcal{A}_c \otimes \text{FFT}(F_s)\big) \qquad (17)$

where $\mathcal{AP}(\cdot)$ represents the Average Pooling Layer with 2D Adaptive Average Pooling, followed by ReLU and Sigmoid functions.

An additional Cross-attention Layer is employed to generate the final registered features $F_r^f$ and $F_r^b$ in both directions, where the image features act as Query and the channel-correlated features $F_c$ serve as Key/Value.
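One plausible reading of the Spatial Attention Block (Eqs. (13)-(15)) is sketched below; the channel bookkeeping (the 2C-to-C projections) and the placement of the FFT on $f(F_I)$ are our interpretation of the loose notation, not the authors' code:

```python
import torch
import torch.nn as nn

class SpatialAttentionSketch(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.fuse = nn.Conv2d(2 * c, c, 1)             # f(Concat(M, F_I))
        self.mask = nn.Conv2d(2 * c, c, 3, padding=1)  # C_3x3 over real/imag parts
        self.proj = nn.Conv2d(c, c, 1)                 # f(F_I)

    def forward(self, M, F_I):
        F_M = torch.fft.rfft2(self.fuse(torch.cat([M, F_I], 1)))            # Eq. (13)
        A_s = torch.sigmoid(self.mask(torch.cat([F_M.real, F_M.imag], 1)))  # Eq. (14)
        F_freq = torch.fft.rfft2(self.proj(F_I))
        return torch.fft.irfft2(A_s * F_freq, s=F_I.shape[-2:])             # Eq. (15)
```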

A.1.2 Decoder

We employ a lightweight MLP decoder that significantly reduces high computational costs compared with existing methods. The key to enabling such a simple decoder is that our proposed MET representations and bidirectional mechanism alleviate spatiotemporal and modal misalignments, leading to better feature extraction ability.

A.2 Other Implementation Details

Training details. We use MiT-B2 from SegFormer (Xie et al., 2021), pre-trained on the ImageNet-1K dataset (Deng et al., 2009), as the backbone, following the same setting as most recent state-of-the-art methods. For the flow encoder, we adopt E-RAFT (Gehrig et al., 2021b), an efficient event-based optical flow estimation framework pre-trained on the DSEC dataset (Gehrig et al., 2021a). We fine-tune the flow encoder when training on other datasets.

Data preprocessing. The data augmentation used in our work includes random cropping, random flipping, and photometric distortion, following Xie et al. (2024a). During training, we crop the RGB images to 512 × 512 solely on the M3ED dataset (Chaney et al., 2023). We do not apply random cropping on the other datasets since their input resolutions are much smaller.

Evaluation metrics. We use mean Intersection over Union (mIoU) and pixel accuracy to measure segmentation performance. The model complexity is measured by parameter size and MACs (Patil & Kulkarni, 2018).
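For completeness, a standard confusion-matrix implementation of mIoU is sketched below (a common recipe, not tied to this paper's evaluation code):

```python
import numpy as np

def mean_iou(pred, gt, num_classes, ignore_index=255):
    # pred, gt: integer label maps of equal shape.
    mask = gt != ignore_index
    conf = np.bincount(num_classes * gt[mask] + pred[mask],
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(conf)
    union = conf.sum(0) + conf.sum(1) - inter
    with np.errstate(divide="ignore", invalid="ignore"):
        iou = inter / union          # NaN for classes absent from both maps
    return float(np.nanmean(iou))
```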


Figure 9: Qualitative results on DDD17 dataset. It shows that our model generates more robust and temporally consistent results than SOTA methods.

Appendix B Additional Visualization

B.1 Additional Qualitative Results

To further validate our model, we present visualizations of segmentation predictions on several samples from the DDD17 (Binas et al., 2017) dataset to highlight the robustness of our approach, as shown in Figure 9. Specifically, we compare the qualitative results of our method with the baseline EISNet (Xie et al., 2024a), using its trained weights. In the predictions generated by EISNet, some small objects are omitted, such as traffic signs in rows 1-3 and trees in row 4. This indicates its limitations in accurately recognizing dynamic motions and reducing blurring effects. In contrast, BRENet captures smaller objects, preserves intricate details, and generates sharper and more precise object boundaries. These samples showcase BRENet's ability to tackle challenging scenarios involving rapid motion and complex visual environments, effectively addressing limitations of prior methods.

Figure 10: Comparisons between EISNet and BRENet via t-SNE visualization. (a) EISNet; (b) BRENet. Best viewed with zoom.

B.2 Additional t-SNE Visualization

Figure 10 depicts the t-SNE visualization of the feature maps taken before the prediction head for both EISNet and the proposed BRENet on the DSEC (Gehrig et al., 2021a) dataset. Whereas EISNet features form partially overlapping groups, BRENet produces more distinct and well-grouped clusters, indicating stronger class discriminability and robustness, properties that align with its superior quantitative performance.


Figure 11: Per-class comparisons with SOTA method on DSEC and DDD17 datasets.

B.3 Per-class Comparison

We present a comprehensive analysis across all classes in Figure 11 on DSEC (Gehrig et al., 2021a) and DDD17 (Binas et al., 2017), focusing on per-class mIoU performance. BRENet outperforms EISNet in every category on DDD17 and in all but the "fence" and "wall" classes on DSEC, where the scores are comparable. This demonstrates BRENet's balanced and robust performance across classes, achieving high mIoU scores in previously challenging areas. We attribute BRENet's leading performance to the following three factors: (1) our proposed MET transforms sparse events into visual-based tensors with optical flows, alleviating modal misalignment; (2) the bidirectional feature propagation leverages temporal coherence in both forward and backward directions to improve spatiotemporal consistency; and (3) the TFM adaptively aligns bidirectional features, further addressing the spatial misalignment issue.

Table 7: Ablation study of different flow encoders on DSEC

Table 8: Ablation study of our model with different backbones on DSEC dataset.

Appendix C Additional Ablation Study

Selection of different flow encoders. We intentionally adopt E-RAFT (Gehrig et al., 2021b) as our default choice rather than relying on the latest advancements. To thoroughly analyze robustness, we evaluate BRENet with various flow encoders. As shown in Table 7, our model achieves stable performance using ADM-Flow (Luo et al., 2023) and EEMFlow (Luo et al., 2024). While using TMA (Liu et al., 2023) achieves lower mIoU, it still significantly beats SOTA models, e.g., EISNet (Xie et al., 2024a). This shows that our approach is not sensitive to the flow encoder, remaining effective under different flow qualities.

Selection of different backbones. In addition, we validate our model using different backbones, as summarized in Table 8. Specifically, we use MiT-B0, MiT-B2, and MiT-B5 to analyze the impact of varying network capacities on performance. Leveraging more powerful backbones consistently improves the results. Replacing MiT-B0 with MiT-B2 leads to a significant mIoU increase of 9.0%. Similarly, replacing MiT-B2 with MiT-B5 yields a further mIoU improvement of 0.4%, accompanied by a 151.9% increase in model size. After employing the most powerful backbone MiT-B5, BRENet achieves 75.22% mIoU, outperforming SOTA methods by 2.9%.

Selection of event bin number. We also perform ablation studies on the DSEC dataset (Gehrig et al., 2021a) to investigate the impact of different event bin numbers. Results in Table 9 indicate that increasing $N$ likely contributes to learning better temporal dynamics. However, the improvement is modest and not decisive. Therefore, the performance gain comes not mainly from a larger event bin number but from our unique design of MET and the subsequent modules.

Table 9: Ablation study of event bin number N N on DSEC dataset.

Table 10: Ablation study on DSEC dataset.

Design choices on DSEC dataset. We further validate the designs of MET and the subsequent modules on the DSEC (Gehrig et al., 2021a) dataset in Table 10. Variant 6, which employs MET, achieves 73.24% mIoU and outperforms the other event representations. To validate the effectiveness of each component, we again evaluate the four variants (7-10) of BRENet: (7) removing bidirectional propagation, (8) removing the BRM, (9) removing the TFM and applying concatenation, and (10) employing all designed components. Variant 10, equipped with all proposed modules, improves SOTA performance by 2.6% mIoU. Bidirectional propagation, the RM, and the TFM contribute 1.3%, 0.9%, and 0.7% mIoU improvements, respectively. From these findings, we draw a similar conclusion: combining MET with bidirectional propagation, the RM, and the TFM effectively mitigates misalignments and enhances spatiotemporal coherence.

Appendix D Performance vs. Model Size

We further conduct a performance vs. model size analysis on the DDD17 dataset (Binas et al., 2017) in Figure 12. The results demonstrate that BRENet achieves significant improvements over SOTA methods in segmentation performance while maintaining a comparable or slightly increased model size. Specifically, BRENet increases the parameter size by approximately 13M compared to the smallest method, EDCNet-S2D (Zhang et al., 2021), and only 3M more than the most recent approach, EISNet (Xie et al., 2024a). Despite these modest increases in model size, BRENet achieves a remarkable mIoU improvement of 4.7%. This analysis demonstrates BRENet's efficiency in balancing performance gains with computational cost.


Figure 12: Performance vs. model size on DDD17 dataset.


Figure 13: Visualization on DSEC-Night testing set.

Appendix E Failure Case

We perform zero-shot evaluations on DSEC-Night (Gehrig et al., 2021a) for challenging low-light scenarios. These experiments reveal that our model maintains strong segmentation accuracy even when flow estimation becomes noisy, as shown in Figure 13. Specifically, in moderately dark regions, events remain robust and support reliable flow estimation to guide RGB features. In extremely dark regions, event data still preserves subtle motion cues while optical flows naturally degrade significantly. As a result, predictions in these areas become more ambiguous, particularly for fine-grained boundaries (e.g., tree vs. sky).

Appendix F Misalignment visualization

As discussed in Section 1, RGB-Event fusion suffers from two major misalignments: Spatiotemporal and Modal Misalignment. Figure 14 illustrates both types of misalignment on the DSEC dataset. On one hand, asynchronous sampling causes temporal lag and spatial shifts, resulting in noticeable boundary mismatches, especially in fast-motion scenarios, e.g., driving. Such spatial shifts make pixel-wise prediction more challenging. On the other hand, the event modality differs from RGB: it records only brightness changes, generating sparse, edge-focused scattered points resembling point clouds. In contrast, RGB frames provide dense appearance and semantic contexts. This disparity leads to a significant representation gap, making feature alignment across modalities difficult.

Appendix G Limitation

Although BRENet attains similar latency compared to SOTA models owing to its efficient flow encoder and TFM design, the bidirectional propagation and deformable convolutions still add non-trivial computational costs. Consequently, the current model is not yet suitable for edge devices with limited computational resources. Developing a lightweight variant that preserves accuracy under such hardware budgets is an intriguing topic for future work.

Appendix H The Use of Large Language Models (LLMs)

We used LLMs as a general-purpose assistive tool to improve the writing. Specifically, they helped check typos, grammar and style issues, and minor notation inconsistencies, and suggested alternative phrasings for clarity. The LLMs did not propose research ideas or design models. All LLM suggestions were manually reviewed before incorporation.


Figure 14: Misalignment visualization. The scattered events exhibit a clear domain gap and obvious spatial shifts due to the different temporal resolutions, compared to the RGB modality.
