
Title: Multi-head Spatial-Spectral Mamba for Hyperspectral Image Classification

URL Source: https://arxiv.org/html/2408.01224

Published Time: Tue, 27 Aug 2024 01:08:39 GMT

Markdown Content: Muhammad Ahmad a, Muhammad Hassaan Farooq Butt b, Muhammad Usama a, Hamad Ahmed Altuwaijri c, Manuel Mazzara d, Salvatore Distefano e

a M. Ahmad and M. Usama are with the Department of Computer Science, National University of Computer and Emerging Sciences, Islamabad, Chiniot-Faisalabad Campus, Chiniot 35400, Pakistan.

b M. H. F. Butt is with the Institute of Artificial Intelligence, School of Mechanical and Electrical Engineering, Shaoxing University, Shaoxing 312000, China.

c H.A. Altuwaijri is with the Department of Geography, College of Humanities and Social Sciences, King Saud University, Riyadh, 11451 Saudi Arabia.

d M. Mazzara is with the Institute of Software Development and Engineering, Innopolis University, Innopolis, 420500, Russia.

e S. Distefano is with Dipartimento di Matematica e Informatica—MIFT, University of Messina, Messina 98121, Italy.

Abstract

Spatial-Spectral Mamba (SSM) improves computational efficiency and captures long-range dependencies, addressing Transformer limitations. However, traditional Mamba models overlook rich spectral information in HSIs and struggle with high dimensionality and sequential data. To address these issues, we propose the SSM with multi-head self-attention and token enhancement (MHSSMamba). This model integrates spectral and spatial information by enhancing spectral tokens and using multi-head attention to capture complex relationships between spectral bands and spatial locations. It also manages long-range dependencies and the sequential nature of HSI data, preserving contextual information across spectral bands. MHSSMamba achieved remarkable classification accuracies of 97.62% on Pavia University, 96.92% on the University of Houston, 96.85% on Salinas, and 99.49% on Wuhan-longKou datasets. The source code is available at GitHub.

keywords:

Hyperspectral Imaging, Spatial-Spectral Mamba, Multi-head Self-Attention, Hyperspectral Image Classification

Article type: Letter

1 Introduction

Hyperspectral Image Classification (HSIC) is essential in various fields due to its capacity to capture detailed spectral information across numerous narrow bands, enabling precise material identification and analysis. This capability has facilitated significant advancements and applications in domains such as remote sensing [1], Earth observation [2], urban planning [3], agriculture [4], mineral exploration [5], environmental monitoring [6], and climate change [7]. Additionally, HSIC is valuable in less geoscience-related fields such as food processing [8, 9], bakery products [10], bloodstain identification [11, 12], and meat processing [13, 14].

The extensive spectral data in Hyperspectral Images (HSIs) presents both challenges and opportunities for effective classification [15]. The successful application of Neural Networks has bolstered recent progress in HSIC [16, 17, 18, 19, 20], with a growing interest in Transformer models to enhance HSI analysis. The Transformer architecture has significantly advanced HSIC compared to traditional deep learning (TDL) methods [21, 22, 23, 24, 25, 26, 27, 28, 29]. Transformers, with their self-attention mechanisms, can capture long-range dependencies within spatial-spectral features, enabling a comprehensive understanding of the intricate relationships between spectral bands and spatial information. However, the quadratic computational complexity of Transformers presents significant challenges, particularly with high-dimensional HSI data, and can limit their practical applications. Additionally, Transformers typically require a large amount of labeled data for training to achieve high performance; otherwise, they are prone to overfitting.

The emergence of Mamba, a state space model (SSM)-based approach, offers a promising solution to the challenges of high-dimensional HSI data. Mamba can capture long-range dependencies like Transformers but with significantly greater computational efficiency, making it well-suited for processing large datasets without compromising performance [30, 31]. Originally designed for sequence data in natural language processing, Mamba models have proven effective by incorporating time-varying parameters into SSMs, achieving linear time complexity while maintaining strong representational capabilities [32]. This approach has been extended to various visual tasks, including HSIC, where TDL often struggles with the data’s high dimensionality and complexity [33].

Mamba’s unique cross-scan strategy bridges 1D and 2D sequence scanning, effectively modeling spatial dependencies in multidimensional data. The architecture by Li and Wang et al. [34, 35] includes selective scanning mechanisms and optimizations that enhance training and inference efficiency. Mamba’s integration with hybrid models, like combining CNNs with SSMs, offers a robust framework for high-resolution HSIs [36, 37]. However, Mamba models often neglect the rich spectral information in HSIs, leading to suboptimal performance in tasks requiring spectral feature distinction. Existing models [32] struggle to balance spectral and spatial features, missing critical information. Additionally, Mamba faces challenges in capturing long-range dependencies, significant for HSIs with features spread across different regions, and struggles with the sequential nature of HSI data, losing contextual information across spectral bands [30].

In response to the challenges posed by traditional Mamba architecture and the complexities of HSI data, we introduce a multi-head Spatial-Spectral Mamba (MHSSMamba) for HSIC. By building on the Mamba framework, the proposed architecture incorporates cutting-edge token enhancement and multi-head self-attention mechanisms. The MHSSMamba dramatically enhances spectral token representation, leading to superior feature extraction and classification performance. Our key contributions are:

1. The Spectral-Spatial token extracts separate spectral and spatial tokens from the input HSI patches, allowing the MHSSMamba model to independently leverage both types of information. This approach potentially enhances feature representation and improves classification performance.
2. The customized Multi-head self-attention (MHSA) mechanism is specifically designed to process spatial-spectral tokens by projecting queries, keys, and values through dense layers and appropriately reshaping them. This design enables the MHSSMamba to perform attention operations more efficiently and effectively, capturing complex relationships between different spectral bands and spatial locations.
3. The spectral-spatial feature enhancement module introduces a dual gating mechanism that separately enhances spectral and spatial tokens processed by MHSA using learned gating signals. This dual-gate approach enables the model to adaptively refine features based on both spatial and spectral contexts, potentially increasing the discriminative power of the learned representations.
4. The State Space Model (SSM) introduces a novel way of capturing temporal dynamics by maintaining and updating state representations through learned transitions and updates. This model component integrates sequential dependencies into the HSI processing pipeline, enhancing the ability to model temporal patterns in time-series HSI data.

In summary, the MHSSMamba presents a comprehensive end-to-end pipeline that integrates token generation, multi-head attention, feature enhancement, and state space modeling. This hybrid architecture combines several advanced techniques into a unified framework, offering an intricate approach to HSIC that leverages both spatial and spectral information as well as temporal dynamics. These contributions underscore the model’s innovative techniques for addressing HSIC challenges, providing potential improvements in feature representation, attention mechanisms, and temporal modeling.

2 Spatial-Spectral Mamba with Multi-Head Self-Attention

Assume the HSI data has a shape of $(H, W, C)$, where $H$ and $W$ are the height and width, and $C$ is the number of bands. Figure 1 provides an overview of the MHSSMamba method. The HSI cube $X$ is divided into $N = \left(\frac{H}{P} \times \frac{W}{P}\right)$ overlapping 3D patches, where $\mathrm{Patch}(X) \in \mathbb{R}^{N \times (P \times P \times C)}$ and $P$ is the patch size. These patches are divided into spectral and spatial patches, each undergoing separate processing to generate spectral and spatial tokens.
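As a quick check of the patch arithmetic above, the following sketch computes the patch count $N$ and the flattened patch length $P \times P \times C$. The function name and the cube size in the usage line are hypothetical, and floor division is an assumption for sizes that $P$ does not divide evenly; the text does not say how borders or overlaps are handled.

```python
def patch_grid(height, width, bands, patch):
    """Return (N, flattened patch length) for N = (H / P) x (W / P)."""
    if patch <= 0:
        raise ValueError("patch size must be positive")
    # Assumption: floor division; border handling is not specified in the text.
    n_patches = (height // patch) * (width // patch)
    return n_patches, patch * patch * bands


# Hypothetical 8 x 8 cube with 3 bands and P = 4.
n, length = patch_grid(8, 8, 3, 4)
print(n, length)  # 4 48
```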

Figure 1: A joint spatial-spectral feature token is first computed from the HSI. These tokens are then encoded in the MHSSMamba model, which includes token enhancement and a multi-head attention module, allowing for a more selective and effective representation of information compared to standard fixed-dimension encodings. The output is subsequently processed through a state-space model, followed by normalization and a linear layer, before being passed to the classification head for ground truth generation.

HSIs have rich spatial and spectral information. By flattening 2D spatial data into spatial tokens $\mathbf{S}$ and 1D spectral data into spectral tokens $\mathbf{F}$, the model captures intricate spatial-spectral relationships. This token representation enables the multi-head attention module to jointly attend to spatial and spectral cues, which is crucial for accurate HSI classification. We generate spatial and spectral tokens $\mathbf{S}$ and $\mathbf{F}$ as $\mathbf{S} = [\mathbf{s}_1, \mathbf{s}_2, \dots, \mathbf{s}_C] \in \mathbb{R}^{B \times (HW) \times C}$ and $\mathbf{F} = [\mathbf{f}_1, \mathbf{f}_2, \dots, \mathbf{f}_{HW}] \in \mathbb{R}^{B \times (HW) \times C}$. The token enhancement applies a gating mechanism to spatial and spectral tokens, using the center region of the HSI sample as context [30]. This allows MHSSMamba to adjust the importance of different tokens dynamically, enhancing feature extraction. Given the center tokens, the token enhancements are made as follows:

$$\widetilde{\mathbf{S}}^{(l)} = \mathbf{S}^{(l)} \odot \sigma(\mathbf{W}_s \mathbf{c} + \mathbf{b}_s) \tag{1}$$

$$\widetilde{\mathbf{F}}^{(l)} = \mathbf{F}^{(l)} \odot \sigma(\mathbf{W}_f \mathbf{c} + \mathbf{b}_f) \tag{2}$$

where $\mathbf{S}^{(l)}$ and $\mathbf{F}^{(l)}$ represent the spatial and spectral tokens at layer $l$ of the model, which capture spatial and spectral features from the HSI samples, respectively. $\mathbf{c}$ denotes the center region of the HSI sample used as context; it provides additional information about the central part of the patch to guide the gating mechanism. $\mathbf{W}_s$ and $\mathbf{W}_f$ are the learned weight matrices that project the center context $\mathbf{c}$ into the same space as the spatial and spectral tokens, respectively. $\mathbf{b}_s$ and $\mathbf{b}_f$ are the bias terms added to the transformed context for spatial and spectral tokens. $\sigma$ denotes the sigmoid function, which implements the gate: its outputs lie between 0 and 1, allowing the model to adjust the importance of different tokens dynamically. $\odot$ denotes element-wise multiplication, which applies the gating values (obtained from the sigmoid function) to the spatial and spectral tokens.
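The gating in Eqs. (1) and (2) can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names, the tensor sizes, and the assumption that the center context is a single vector per sample are all hypothetical, and the code uses the row-vector convention `c @ W` in place of $\mathbf{W}\mathbf{c}$.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def enhance_tokens(tokens, center, weight, bias):
    """Gate tokens with a sigmoid of the projected center context.

    tokens: (B, L, D), center: (B, Dc), weight: (Dc, D), bias: (D,)
    """
    gate = sigmoid(center @ weight + bias)  # (B, D), values in (0, 1)
    return tokens * gate[:, None, :]        # broadcast the gate over the L tokens


rng = np.random.default_rng(0)
B, L, D = 2, 16, 8                      # hypothetical sizes
S = rng.normal(size=(B, L, D))          # spatial tokens
F = rng.normal(size=(B, L, D))          # spectral tokens
c = rng.normal(size=(B, D))             # center context (assumed one vector per sample)
W_s, b_s = rng.normal(size=(D, D)), np.zeros(D)
W_f, b_f = rng.normal(size=(D, D)), np.zeros(D)

S_tilde = enhance_tokens(S, c, W_s, b_s)  # Eq. (1)
F_tilde = enhance_tokens(F, c, W_f, b_f)  # Eq. (2)
```

Because the gate lies in $(0, 1)$, each enhanced token is a per-feature attenuation of the original; the gate never amplifies or flips the sign of a feature.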

The multi-head attention mechanism enables the model to learn diverse and informative feature representations by attending to different parts of the input tokens (both spatial and spectral). By projecting the input tokens into multiple attention heads, the model captures complex dependencies within HSI data, essential for effective feature extraction. The scaled dot-product attention, followed by a softmax operation, dynamically assigns importance weights to the input tokens, enhancing the model’s focus on the most relevant features. Let the spatial and spectral tokens $\widetilde{\mathbf{S}}^{(l)}$ and $\widetilde{\mathbf{F}}^{(l)}$ be the inputs, and $Q, K, V$ be the query, key, and value respectively. For each head $i$: $Q_i = \widetilde{\mathbf{S}}^{(l)} W^Q_i$, $K_i = \widetilde{\mathbf{F}}^{(l)} W^K_i$, and $V_i = \widetilde{\mathbf{F}}^{(l)} W^V_i$, where $W^Q_i, W^K_i, W^V_i$ are learned weight matrices. The attention scores are computed as:

$$A_i = \text{softmax}\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) \tag{3}$$

where the attention output is $O_i = A_i V_i$, while the concatenated heads are $O = \mathrm{Concat}(O_1, O_2, \dots, O_h)$. Given a sequence of enhanced tokens $O = (E_1, E_2, E_3, \dots, E_T)$, the state transition is computed as:

$$h_t = \mathrm{ReLU}(W_{transition} h_{t-1} + W_{update} E_t) \tag{4}$$

where $O = (E_1, E_2, E_3, \dots, E_T)$ denotes a sequence of enhanced tokens, $E_t$ represents the token at time step $t$, and $T$ is the total number of tokens in the sequence. $h_t$ represents the hidden state at time step $t$. It captures the context and dependencies from previous tokens and is updated based on the enhanced token $E_t$. $W_{transition}$ is a learned weight matrix used in the transition function. It transforms the previous hidden state $h_{t-1}$ to combine with the current token $E_t$ in the state update process. $W_{update}$ is a learned weight matrix applied to the current enhanced token $E_t$ during the state update process. $\mathrm{ReLU}$ denotes the Rectified Linear Unit activation function, which introduces non-linearity into the model by outputting the maximum of 0 and the input value. It is applied to the weighted sum of the previous hidden state and the current token. The final output is obtained by applying a linear classifier on $h_t$:

$$y = \sigma(h_t W_{classifier}) \tag{5}$$

where $W_{classifier}$ is a learned weight matrix used in the linear classifier. It projects the hidden state $h_t$ to obtain the final output. $\sigma$ denotes the sigmoid function, which is used to generate probabilities for classification. It outputs values between 0 and 1. $y$ represents the final output of the model, which is obtained by applying the linear classifier to the hidden state $h_t$ and passing it through the sigmoid function.
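The recurrence in Eq. (4) and the classifier in Eq. (5) can be sketched as a plain loop over the token sequence. This is an illustrative sketch under stated assumptions: the function names and sizes are hypothetical, the initial state $h_0$ is assumed to be zero (the text does not specify it), and the classifier is applied to the final state $h_T$.

```python
import numpy as np


def ssm_scan(tokens, w_transition, w_update):
    """Run h_t = ReLU(W_transition h_{t-1} + W_update E_t) over tokens of shape (T, D_in)."""
    h = np.zeros(w_transition.shape[0])  # assumption: h_0 = 0
    for e_t in tokens:
        h = np.maximum(0.0, w_transition @ h + w_update @ e_t)  # Eq. (4)
    return h


def classify(h, w_classifier):
    """Sigmoid over a linear projection of the hidden state, Eq. (5)."""
    return 1.0 / (1.0 + np.exp(-(h @ w_classifier)))


rng = np.random.default_rng(0)
T, d_in, d_state, n_classes = 5, 8, 16, 3  # hypothetical sizes
E = rng.normal(size=(T, d_in))             # enhanced tokens E_1..E_T
W_tr = 0.1 * rng.normal(size=(d_state, d_state))
W_up = 0.1 * rng.normal(size=(d_state, d_in))
W_cls = rng.normal(size=(d_state, n_classes))

h_T = ssm_scan(E, W_tr, W_up)
y = classify(h_T, W_cls)
print(y.shape)  # (3,)
```

Each step costs one `d_state x d_state` and one `d_state x d_in` matrix-vector product, so the scan is linear in the number of tokens $T$.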

Integrating spectral-spatial token generation, token enhancement, multi-head attention, and the state space module, MHSSMamba captures complex dependencies in HSI data. By concatenating final spatial and spectral representations and applying a linear classifier, the model leverages complementary information from both modalities for improved classification performance.
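The attention step of Eq. (3) can be sketched in NumPy as below, with queries taken from the enhanced spatial tokens and keys and values from the enhanced spectral tokens, as in the definitions of $Q_i$, $K_i$, and $V_i$. The function names and tensor sizes are hypothetical. Using one `D x D` projection per role and slicing it into heads is an implementation assumption; it is equivalent to holding a separate matrix per head.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(s_tilde, f_tilde, w_q, w_k, w_v, num_heads):
    """s_tilde, f_tilde: (B, L, D); w_q, w_k, w_v: (D, D). Returns (output, weights)."""
    B, L, D = s_tilde.shape
    d_k = D // num_heads

    def split(x):  # (B, L, D) -> (B, heads, L, d_k)
        return x.reshape(B, -1, num_heads, d_k).transpose(0, 2, 1, 3)

    q = split(s_tilde @ w_q)  # queries from spatial tokens
    k = split(f_tilde @ w_k)  # keys from spectral tokens
    v = split(f_tilde @ w_v)  # values from spectral tokens
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d_k)  # (B, heads, L, L)
    attn = softmax(scores, axis=-1)                      # Eq. (3)
    out = attn @ v                                       # O_i = A_i V_i per head
    # Concatenate heads back into the embedding dimension.
    return out.transpose(0, 2, 1, 3).reshape(B, L, D), attn


rng = np.random.default_rng(0)
B, L, D, heads = 2, 16, 64, 4  # 64-dim embedding and 4 heads as in the reported setup
S_tilde = rng.normal(size=(B, L, D))
F_tilde = rng.normal(size=(B, L, D))
W_q, W_k, W_v = (0.1 * rng.normal(size=(D, D)) for _ in range(3))

O, A = multi_head_attention(S_tilde, F_tilde, W_q, W_k, W_v, heads)
print(O.shape, A.shape)  # (2, 16, 64) (2, 4, 16, 16)
```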

3 Experimental Results and Discussion

The proposed MHSSMamba model is evaluated on several publicly available hyperspectral datasets: WHU-Hi-LongKou [38, 39], University of Pavia (UP), Salinas (SA), and University of Houston (UH). MHSSMamba’s weights were initialized randomly and optimized over 50 epochs using the Adam optimizer with a learning rate of 0.001 and softmax loss. Training used mini-batches of 256 samples. The Mamba block’s embedding dimension was set to 64, with 4 heads for multi-head attention and a state space dimension of 128.
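For reference, the reported hyperparameters can be collected in a small configuration object. This is only a sketch: the class and field names are hypothetical and do not come from the authors' code. The derived per-head dimension assumes the embedding is split evenly across heads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Values as reported in the text; field names are hypothetical.
    epochs: int = 50
    batch_size: int = 256
    learning_rate: float = 1e-3
    optimizer: str = "adam"
    embed_dim: int = 64
    num_heads: int = 4
    state_dim: int = 128

    @property
    def head_dim(self) -> int:
        """Per-head dimension, assuming an even split of the embedding."""
        if self.embed_dim % self.num_heads != 0:
            raise ValueError("embed_dim must be divisible by num_heads")
        return self.embed_dim // self.num_heads


cfg = TrainConfig()
print(cfg.head_dim)  # 16
```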

This setup allowed MHSSMamba to learn patterns by adjusting its parameters to minimize loss. Training, validation, and test samples were randomly selected, with various random split percentages tested. The initial patch size was $4 \times 4$, but different sizes were also evaluated, and an optimized set of samples and patch sizes was chosen for comparative results. All experiments were conducted on Google Colab, utilizing a Python 3 notebook with a GPU, 25 GB of RAM, and 358.27 GB of cold storage.
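A random train/validation/test split of the labeled samples can be sketched as follows. The function name and the fixed seed are hypothetical, and the split is a plain shuffle without per-class stratification, which is an assumption; the text only states that samples were randomly selected.

```python
import random


def split_indices(n_samples, train_pct, val_pct, seed=0):
    """Shuffle sample indices and split them by integer percentages."""
    if train_pct + val_pct > 100:
        raise ValueError("train and validation percentages exceed 100")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = n_samples * train_pct // 100
    n_val = n_samples * val_pct // 100
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]      # remainder goes to the test set
    return train, val, test


# A 10/10/80 split of 1000 hypothetical labeled pixels.
train, val, test = split_indices(1000, 10, 10)
print(len(train), len(val), len(test))  # 100 100 800
```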

Table 1 presents MHSSMamba’s performance with different training and test percentages using a $4 \times 4$ patch size, and with various patch sizes. Larger patches capture minute details and local patterns but are prone to noise and overfitting, especially with smaller samples. Smaller patches encapsulate global features and contextual information, enhancing robustness against noise. As sample sizes increase, MHSSMamba’s robustness and generalization improve. Computational performance is influenced by internet speed and available RAM. Figures 2, 3, 4, and 5 show qualitative results, with quantitative results in Table 1.

Table 1: The performance of the proposed model is evaluated using various patch sizes and different test splits. For all test splits, a $4 \times 4$ patch size is used over 50 epochs with a batch size of 256. For all patch sizes, 10% of the data is allocated for training and validation samples.

| Dataset | Tr/Va/Te (%) | AA | OA | $\kappa$ | Train Time (s) | Patch | AA | OA | $\kappa$ | Train Time (s) |
|---|---|---|---|---|---|---|---|---|---|---|
| WHU-Hi-LongKou | 5/5/90 | 98.41 | 99.26 | 99.02 | 43.78 | $2 \times 2$ | 96.36 | 98.72 | 98.32 | 50.72 |
| | 10/10/80 | 98.28 | 99.35 | 99.14 | 78.08 | $4 \times 4$ | 98.74 | 99.53 | 99.39 | 77.46 |
| | 15/15/70 | 99.31 | 99.70 | 99.60 | 115.60 | $6 \times 6$ | 99.12 | 99.68 | 99.58 | 136.40 |
| | 20/20/60 | 99.31 | 99.69 | 99.60 | 168.82 | $8 \times 8$ | 99.23 | 99.63 | 99.52 | 246.62 |
| | 25/25/50 | 98.84 | 99.65 | 99.55 | 192.55 | $10 \times 10$ | 98.89 | 99.56 | 99.42 | 357.10 |
| Pavia University | 5/5/90 | 75.15 | 85.70 | 80.59 | 22.33 | $2 \times 2$ | 91.47 | 93.68 | 90.88 | 22.09 |
| | 10/10/80 | 94.33 | 95.53 | 94.04 | 42.74 | $4 \times 4$ | 93.68 | 95.96 | 94.64 | 27.88 |
| | 15/15/70 | 95.75 | 96.60 | 95.48 | 38.94 | $6 \times 6$ | 94.46 | 96.40 | 95.23 | 85.37 |
| | 20/20/60 | 96.36 | 97.51 | 96.70 | 84.58 | $8 \times 8$ | 89.44 | 93.08 | 90.80 | 70.25 |
| | 25/25/50 | 96.41 | 97.62 | 96.85 | 204.22 | $10 \times 10$ | 88.39 | 93.11 | 90.81 | 143.55 |
| Salinas | 5/5/90 | 97.31 | 94.39 | 93.75 | 18.09 | $2 \times 2$ | 97.12 | 94.18 | 93.51 | 20.96 |
| | 10/10/80 | 97.44 | 94.73 | 94.11 | 42.84 | $4 \times 4$ | 97.79 | 95.28 | 94.75 | 33.68 |
| | 15/15/70 | 97.87 | 95.52 | 95.01 | 83.51 | $6 \times 6$ | 97.81 | 95.12 | 94.56 | 83.53 |
| | 20/20/60 | 98.21 | 96.27 | 95.84 | 83.53 | $8 \times 8$ | 98.26 | 96.48 | 96.08 | 143.56 |
| | 25/25/50 | 98.55 | 97.00 | 96.66 | 73.99 | $10 \times 10$ | 97.94 | 94.90 | 94.33 | 144.19 |
| University of Houston | 5/5/90 | 90.87 | 91.60 | 90.92 | 11.90 | $2 \times 2$ | 94.10 | 94.55 | 94.10 | 8.11 |
| | 10/10/80 | 90.70 | 90.71 | 89.95 | 12.54 | $4 \times 4$ | 92.24 | 93.28 | 92.73 | 11.95 |
| | 15/15/70 | 95.42 | 95.40 | 95.03 | 15.70 | $6 \times 6$ | 93.43 | 93.42 | 92.88 | 19.56 |
| | 20/20/60 | 96.61 | 96.82 | 96.57 | 23.08 | $8 \times 8$ | 94.04 | 93.85 | 93.35 | 27.48 |
| | 25/25/50 | 96.73 | 96.92 | 96.67 | 42.58 | $10 \times 10$ | 92.37 | 91.09 | 90.37 | 42.60 |

Figure 2: Qualitative results of the University of Houston Dataset. Panels: (a) $2 \times 2$, (b) $4 \times 4$, (d) $8 \times 8$, (e) $10 \times 10$.

Figure 3: Qualitative results of the Pavia University Dataset. Panels: (a) $2 \times 2$, (b) $4 \times 4$, (d) $8 \times 8$, (e) $10 \times 10$.

Figure 4: Qualitative results of the Salinas Dataset. Panels: (a) $2 \times 2$, (b) $4 \times 4$, (d) $8 \times 8$, (e) $10 \times 10$.

Figure 5: Qualitative results of the WHU-Hi-LongKou Dataset. Panels: (a) $2 \times 2$, (b) $4 \times 4$, (d) $8 \times 8$, (e) $10 \times 10$.

3.1 Comparative Methods

To demonstrate the effectiveness of MHSSMamba, various HSIC methods were selected for comparison: S3L: Spectrum Transformer for Self-Supervised Learning in HSIC [40], RIAN: Rotation-Invariant Attention Network for HSIC [41], CAT: Center Attention Transformer With Stratified Spatial–Spectral Token for HSIC [42], SPRLT: Local Transformer With Spatial Partition Restore for HSIC [43], and CMT: A Center-Masked Transformer for HSIC [44]. DBDA [45], an advanced CNN model with a double-branch dual-attention mechanism, serves as a benchmark against Transformer-based approaches. MSSG [46] employs a super-pixel structured graph U-Net to learn multiscale features. SSFTT [47] is a spatial-spectral Transformer with a novel tokenization approach and CNN-generated local features. LSFAT [48] aggregates local semantic features to learn multiscale features effectively. CT-Mixer [49] combines CNN and Transformer frameworks. SS-Mamba [30] includes a spectral-spatial token generation module with stacked spectral-spatial Mamba blocks.

Evaluation. To assess the classification performance of the MHSSMamba model, we used three common evaluation metrics: Overall Accuracy (OA), Average Accuracy (AA), and kappa coefficient ($\kappa$). All methods were tested using optimal experimental settings or re-implemented with their official code where applicable.
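The three metrics can be computed from a confusion matrix using their standard definitions, as sketched below. The function name and the toy matrix are hypothetical; rows are taken to be true classes and columns predicted classes, and every class is assumed to have at least one sample.

```python
def classification_metrics(confusion):
    """Return (OA, AA, kappa) for confusion[i][j] = true class i predicted as class j."""
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n_classes))
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(confusion[i][j] for i in range(n_classes)) for j in range(n_classes)]

    oa = correct / total  # overall accuracy: fraction of all samples classified correctly
    # Average accuracy: mean of per-class recalls (assumes no empty class).
    aa = sum(confusion[i][i] / row_sums[i] for i in range(n_classes)) / n_classes
    # Cohen's kappa: agreement corrected for the chance agreement p_e.
    p_e = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2
    kappa = (oa - p_e) / (1 - p_e)
    return oa, aa, kappa


# Toy two-class example.
oa, aa, kappa = classification_metrics([[45, 5], [10, 40]])
print(round(oa, 2), round(aa, 2), round(kappa, 2))  # 0.85 0.85 0.7
```

OA weights every sample equally, so it is dominated by large classes, whereas AA weights every class equally; reporting both alongside $\kappa$ guards against a model that only performs well on frequent classes.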

Table 2: University of Houston Classification Results over several SOTA methods.

For the University of Houston dataset, CAT, MSSG, DBDA, SSFTT, LSFAT, and CT-Mixer demonstrated superior performance compared to SPRLT, CMT, S3L, and RIAN, with improvements of approximately 6-10% in OA, AA, and $\kappa$ accuracy metrics. SS-Mamba surpassed these models by an additional 2% across OA, AA, and $\kappa$ metrics. Our proposed model, MHSSMamba, further enhanced performance, exceeding SS-Mamba by 2.62% in OA, 1.77% in AA, and 2.83% in $\kappa$ accuracy. These results underscore MHSSMamba’s effectiveness across various learning frameworks. Transformer-based models, requiring higher learning rates and more epochs, showed lower performance on Houston data, likely due to dataset characteristics.

Table 3: Pavia University Classification Results over several SOTA methods.

For the Pavia University dataset, traditional Spatial-Spectral models like SPRLT and CMT were less effective compared to RIAN, CAT, and S3L, which showed significant improvements of 12% in OA, AA, and $\kappa$ metrics. RIAN, CAT, and S3L also outperformed DBDA, SSFTT, LSFAT, and CT-Mixer by 1-3% in OA, AA, and $\kappa$ accuracy. MSSG surpassed RIAN, CAT, and S3L with gains of 0.50% in OA, 3.10% in AA, and 1.05% in $\kappa$ accuracy. The SS-Mamba model outperformed MSSG by 1.0% in OA, AA, and $\kappa$ accuracy. Lastly, MHSSMamba showed slight improvements over SS-Mamba in OA and $\kappa$ accuracy, while SS-Mamba achieved slightly better results in AA.

4 Computational Complexity

The computational complexity of the proposed model can be analyzed by evaluating each of its primary components. The token generation layer, which generates the spatial-spectral tokens using a dense layer, has a complexity of $O(B \times H \times W \times C \times \mathit{out\_channels})$. The multi-head self-attention layer involves matrix multiplications and attention score calculations, resulting in a complexity of $O(B \times L \times \mathit{embed\_dim}^{2} + B \times \mathit{num\_heads} \times L^{2} \times \mathit{head\_dim})$. The feature enhancement layer, which enhances features using dense layers and element-wise operations, contributes $O(B \times L \times \mathit{out\_channels}^{2})$. Finally, the SSM layer, which updates the state through dense layers over $T$ timesteps, has a complexity of $O(T \times B \times \mathit{state\_dim}^{2})$. Combining all layers, the overall complexity of MHSSMamba is $O(B \times H \times W \times C \times \mathit{out\_channels} + B \times L \times \mathit{embed\_dim}^{2} + B \times \mathit{num\_heads} \times L^{2} \times \mathit{head\_dim} + B \times L \times \mathit{out\_channels}^{2} + T \times B \times \mathit{state\_dim}^{2})$, reflecting the cumulative computational demands of token generation, attention mechanisms, feature enhancement, and state updates. Here we also discuss the best-, average-, and worst-case scenarios of the proposed model.
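To make the per-component terms concrete, the following minimal sketch evaluates each term of the overall complexity expression for a given configuration, with constant factors dropped. The function name and all numeric settings are hypothetical and chosen only for illustration; $L$ is assumed to be the token sequence length, and $\mathit{head\_dim}$ is assumed to equal $\mathit{embed\_dim} / \mathit{num\_heads}$, as in standard multi-head attention.

```python
def mhssmamba_cost_terms(B, H, W, C, L, out_channels,
                         embed_dim, num_heads, state_dim, T):
    """Evaluate each big-O term of the complexity expression.

    Constant factors are dropped, so the numbers are relative
    operation counts, not measured FLOPs.
    """
    # Assumption: heads split the embedding evenly.
    head_dim = embed_dim // num_heads
    return {
        "token_generation": B * H * W * C * out_channels,
        "attention_projections": B * L * embed_dim ** 2,
        "attention_scores": B * num_heads * L ** 2 * head_dim,
        "feature_enhancement": B * L * out_channels ** 2,
        "ssm_state_update": T * B * state_dim ** 2,
    }


# Hypothetical toy configuration (not taken from the paper's experiments).
terms = mhssmamba_cost_terms(B=1, H=4, W=4, C=10, L=16, out_channels=8,
                             embed_dim=8, num_heads=2, state_dim=4, T=16)
total = sum(terms.values())

for name, cost in terms.items():
    print(f"{name:>22}: {cost:6d} ({100 * cost / total:.1f}% of total)")
```

Even in this toy setting, the attention-score term is the largest single contribution, because it is the only term that grows with $L^{2}$.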

Best Case: When the model operates with minimal feature and token sizes, and there is no need for extensive processing, the complexity of the token generation is $O(B \times H \times W \times C)$, where $B$ is the batch size, $H$ and $W$ are the spatial dimensions, and $C$ is the number of bands. The multi-head self-attention under minimal head and embedding sizes has a complexity of $O(B \times L \times \mathit{embed\_dim})$, and the SSM, assuming a minimal state dimension and number of timesteps, operates at $O(T \times B \times \mathit{state\_dim})$.

Average Case: With typical sizes, the complexity of the token generation is $O(B \times H \times W \times C \times \mathit{out\_channels})$. The multi-head self-attention, with standard numbers of attention heads and embedding sizes, results in $O(B \times L \times \mathit{embed\_dim}^{2} + B \times \mathit{num\_heads} \times L^{2} \times \mathit{head\_dim})$. For the feature enhancement module, the complexity is $O(B \times L \times \mathit{out\_channels}^{2})$, while the SSM operates at $O(T \times B \times \mathit{state\_dim}^{2})$.

Worst Case: In scenarios with maximal feature and token sizes, the complexity of the token generation can reach $O(B \times H \times W \times C \times \mathit{out\_channels})$. The multi-head self-attention complexity in the worst case is $O(B \times L^{2} \times \mathit{embed\_dim}^{2})$ due to the large attention scores and weights. The feature enhancement layers scale to $O(B \times L \times \mathit{out\_channels}^{2})$, and the SSM could require $O(T \times B \times \mathit{state\_dim}^{2})$ in the most extensive cases.
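The difference between the cases comes largely from how each term scales with sequence length. As a rough sketch, the snippet below compares the growth of the attention-score term (quadratic in $L$) with the SSM term (linear in $T$) when the length doubles. The helper names and the sizes used are hypothetical, and the two lengths are varied over the same range only for the sake of comparison.

```python
def attention_score_cost(B, L, num_heads, head_dim):
    # Attention-score term: quadratic in the sequence length L.
    return B * num_heads * L ** 2 * head_dim


def ssm_cost(T, B, state_dim):
    # SSM state-update term: linear in the number of timesteps T.
    return T * B * state_dim ** 2


def growth_ratio(cost_fn, small, large):
    """How much the cost grows when the length goes from small to large."""
    return cost_fn(large) / cost_fn(small)


# Hypothetical sizes: 4 heads of dimension 16, state dimension 16, batch of 1.
attn_ratio = growth_ratio(lambda L: attention_score_cost(1, L, 4, 16), 64, 128)
ssm_ratio = growth_ratio(lambda T: ssm_cost(T, 1, 16), 64, 128)

print(f"attention-score cost grows {attn_ratio:.1f}x when L doubles")
print(f"SSM cost grows {ssm_ratio:.1f}x when T doubles")
```

Under these assumptions, doubling the length quadruples the attention-score cost but only doubles the SSM cost, which is why the attention term tends to dominate as the token sequence grows.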

5 Conclusions

This paper introduces the Multihead Attention-Based Mamba (MHSSMamba) architecture for HSIC. Tested on benchmark datasets, MHSSMamba was evaluated using Overall Accuracy (OA), Average Accuracy (AA), and the kappa coefficient ($\kappa$). It outperformed state-of-the-art methods, achieving 97.82% OA, 96.41% AA, and 96.85% $\kappa$ on the Pavia University dataset, and 96.92% OA, 96.41% AA, and 97.62% $\kappa$ on the University of Houston dataset. These results highlight the model’s robustness and effectiveness in capturing spectral-spatial features. The superior performance is due to its advanced multi-head attention mechanisms, which enhance spectral-spatial information extraction.
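For readers unfamiliar with the three reported metrics, the sketch below shows how OA, AA, and $\kappa$ are conventionally computed from a confusion matrix. It is a generic illustration, not the evaluation code used for the experiments. The function name and the toy matrix are hypothetical; rows are assumed to be true classes and columns predicted classes, with at least one sample per class.

```python
def classification_metrics(confusion):
    """Return (OA, AA, kappa) for a square confusion matrix.

    Assumption: confusion[i][j] counts samples of true class i
    that were predicted as class j.
    """
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    diag = [confusion[i][i] for i in range(k)]
    row_tot = [sum(row) for row in confusion]
    col_tot = [sum(confusion[i][j] for i in range(k)) for j in range(k)]

    oa = sum(diag) / n                                    # overall accuracy
    aa = sum(diag[i] / row_tot[i] for i in range(k)) / k  # mean per-class accuracy
    # Agreement expected by chance, from the row and column marginals.
    pe = sum(row_tot[i] * col_tot[i] for i in range(k)) / n ** 2
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa


# Toy two-class example with made-up counts.
oa, aa, kappa = classification_metrics([[8, 2], [1, 9]])
print(f"OA={oa:.2f}, AA={aa:.2f}, kappa={kappa:.2f}")
```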

Disclosure statement

The authors declare that no conflict of interest exists.

Data Availability

The data used in this study is publicly available and can be accessed from the corresponding data repository Hyperspectral Datasets.

References

[1] M.Ahmad, S.Shabbir, S.K. Roy, D.Hong, X.Wu, J.Yao, A.M. Khan, M.Mazzara, S.Distefano, and J.Chanussot, “Hyperspectral image classification—traditional to deep models: A survey for future prospects,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2021.
[2] D.Hong, C.Li, B.Zhang, N.Yokoya, J.A. Benediktsson, and J.Chanussot, “Multimodal artificial intelligence foundation models: Unleashing the power of remote sensing big data in earth observation,” The Innovation Geoscience, vol.2, no.1, p. 100055, 2024.
[3] Y.Li, D.Hong, C.Li, J.Yao, and J.Chanussot, “Hd-net: High-resolution decoupled network for building footprint extraction via deeply supervised body and boundary decomposition,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 209, pp. 51–65, 2024.
[4] B.Lu, P.D. Dao, J.Liu, Y.He, and J.Shang, “Recent advances of hyperspectral imaging technology and applications in agriculture,” Remote Sensing, vol.12, no.16, p. 2659, 2020.
[5] E.Bedini, “The use of hyperspectral remote sensing for mineral exploration: A review,” Journal of Hyperspectral Remote Sensing, vol.7, no.4, pp. 189–211, 2017.
[6] M.B. Stuart, A.J. McGonigle, and J.R. Willmott, “Hyperspectral imaging in environmental monitoring: A review of recent developments and technological advances in compact field deployable systems,” Sensors, vol.19, no.14, p. 3071, 2019.
[7] C.B. Pande and K.N. Moharir, “Application of hyperspectral remote sensing role in precision farming and sustainable agriculture under climate change: A review,” Climate Change Impacts on Natural Resources, Ecosystems and Agricultural Systems, pp. 503–520, 2023.
[8] M.H. Khan, Z.Saleem, M.Ahmad, A.Sohaib, H.Ayaz, M.Mazzara, and R.A. Raza, “Hyperspectral imaging-based unsupervised adulterated red chili content transformation for classification: Identification of red chili adulterants,” Neural Computing and Applications, vol.33, no.21, pp. 14 507–14 521, 2021.
[9] M.H. Khan, Z.Saleem, M.Ahmad, A.Sohaib, H.Ayaz, and M.Mazzara, “Hyperspectral imaging for color adulteration detection in red chili,” Applied Sciences, vol.10, no.17, p. 5955, 2020.
[10] Z.Saleem, M.H. Khan, M.Ahmad, A.Sohaib, H.Ayaz, and M.Mazzara, “Prediction of microbial spoilage and shelf-life of bakery products through hyperspectral imaging,” IEEE Access, vol.8, pp. 176 986–176 996, 2020.
[11] M.H.F. Butt, H.Ayaz, M.Ahmad, J.P. Li, and R.Kuleev, “A fast and compact hybrid cnn for hyperspectral imaging-based bloodstain classification,” in 2022 IEEE Congress on Evolutionary Computation (CEC). IEEE, 2022, pp. 1–8.
[12] M.Zulfiqar, M.Ahmad, A.Sohaib, M.Mazzara, and S.Distefano, “Hyperspectral imaging for bloodstain identification,” Sensors, vol.21, no.9, p. 3045, 2021.
[13] H.Ayaz, M.Ahmad, M.Mazzara, and A.Sohaib, “Hyperspectral imaging for minced meat classification using nonlinear deep features,” Applied Sciences, vol.10, no.21, p. 7783, 2020.
[14] H.Ayaz, M.Ahmad, A.Sohaib, M.N. Yasir, M.A. Zaidan, M.Ali, M.H. Khan, and Z.Saleem, “Myoglobin-based classification of minced meat using hyperspectral imaging,” Applied Sciences, vol.10, no.19, p. 6862, 2020.
[15] D.Hong, B.Zhang, X.Li, Y.Li, C.Li, J.Yao, N.Yokoya, H.Li, P.Ghamisi, X.Jia, A.Plaza, P.Gamba, J.A. Benediktsson, and J.Chanussot, “Spectralgpt: Spectral remote sensing foundation model,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, doi: 10.1109/TPAMI.2024.3362475.
[16] M.Ahmad, A.M. Khan, M.Mazzara, S.Distefano, M.Ali, and M.S. Sarfraz, “A fast and compact 3-d cnn for hyperspectral image classification,” IEEE Geoscience and Remote Sensing Letters, 2020.
[17] U.Ghous, M.S. Sarfraz, M.Ahmad, C.Li, and D.Hong, “(2+1)d extreme xception net for hyperspectral image classification,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, pp. 1–14, 2024.
[18] D.Hong, J.Yao, C.Li, D.Meng, N.Yokoya, and J.Chanussot, “Decoupled-and-coupled networks: Self-supervised hyperspectral image super-resolution with subpixel fusion,” IEEE Transactions on Geoscience and Remote Sensing, 2023.
[19] A.Jamali, S.K. Roy, D.Hong, P.M. Atkinson, and P.Ghamisi, “Attention graph convolutional network for disjoint hyperspectral image classification,” IEEE Geoscience and Remote Sensing Letters, pp. 1–1, 2024.
[20] M.Ahmad and M.Mazzara, “Scsnet: Sharpened cosine similarity-based neural network for hyperspectral image classification,” IEEE Geoscience and Remote Sensing Letters, vol.21, pp. 1–4, 2024.
[21] J.Yao, B.Zhang, C.Li, D.Hong, and J.Chanussot, “Extended vision transformer (exvit) for land use and land cover classification: A multimodal deep learning framework,” IEEE Transactions on Geoscience and Remote Sensing, 2023.
[22] M.Ahmad, U.Ghous, M.Usama, and M.Mazzara, “Waveformer: Spectral–spatial wavelet transformer for hyperspectral image classification,” IEEE Geoscience and Remote Sensing Letters, vol.21, pp. 1–5, 2024.
[23] X.Huang, M.Dong, J.Li, and X.Guo, “A 3-d-swin transformer-based hierarchical contrastive learning method for hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing, vol.60, pp. 1–15, 2022.
[24] Y.Wang, Y.Gu, and A.Nanding, “Sstu: Swin-spectral transformer u-net for hyperspectral whole slide image reconstruction,” Computerized Medical Imaging and Graphics, vol. 114, p. 102367, 2024. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0895611124000442
[25] Y.Long, X.Wang, M.Xu, S.Zhang, S.Jiang, and S.Jia, “Dual self-attention swin transformer for hyperspectral image super-resolution,” IEEE Transactions on Geoscience and Remote Sensing, vol.61, pp. 1–12, 2023.
[26] L.Wang, Z.Zheng, N.Kumar, C.Wang, F.Guo, and P.Zhang, “Multilevel class token transformer with cross tokenmixer for hyperspectral images classification,” IEEE Transactions on Geoscience and Remote Sensing, vol.62, pp. 1–13, 2024.
[27] J.Li, Z.Zhang, R.Song, Y.Li, and Q.Du, “Scformer: Spectral coordinate transformer for cross-domain few-shot hyperspectral image classification,” IEEE Transactions on Image Processing, vol.33, pp. 840–855, 2024.
[28] Z.Shu, Y.Wang, and Z.Yu, “Dual attention transformer network for hyperspectral image classification,” Engineering Applications of Artificial Intelligence, vol. 127, p. 107351, 2024.
[29] J.Ma, Y.Zou, X.Tang, X.Zhang, F.Liu, and L.Jiao, “Spatial pooling transformer network and noise-tolerant learning for noisy hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing, 2024.
[30] L.Huang, Y.Chen, and X.He, “Spectral-spatial mamba for hyperspectral image classification,” Remote Sensing, vol.16, no.13, 2024.
[31] H.Zhang, Y.Zhu, D.Wang, L.Zhang, T.Chen, Z.Wang, and Z.Ye, “A survey on visual mamba,” Applied Sciences, vol.14, no.13, 2024.
[32] J.Yao, D.Hong, C.Li, and J.Chanussot, “Spectralmamba: Efficient mamba for hyperspectral image classification,” 2024. [Online]. Available: https://arxiv.org/abs/2404.08489
[33] J.X. Yang, J.Zhou, J.Wang, H.Tian, and A.W.C. Liew, “Hsimamba: Hyperpsectral imaging efficient feature learning with bidirectional state space for classification,” 2024. [Online]. Available: https://arxiv.org/abs/2404.00272
[34] X.Liu, C.Zhang, and L.Zhang, “Vision mamba: A comprehensive survey and taxonomy,” 2024. [Online]. Available: https://arxiv.org/abs/2405.04404
[35] G.Wang, X.Zhang, Z.Peng, T.Zhang, X.Jia, and L.Jiao, “S$^2$mamba: A spatial-spectral state space model for hyperspectral image classification,” 2024. [Online]. Available: https://arxiv.org/abs/2404.18213
[36] W.Zhou, S.-I. Kamata, H.Wang, M.-S. Wong, and H.Hou, “Mamba-in-mamba: Centralized mamba-cross-scan in tokenized mamba model for hyperspectral image classification,” 2024. [Online]. Available: https://arxiv.org/abs/2405.12003
[37] Y.He, B.Tu, B.Liu, J.Li, and A.Plaza, “3dss-mamba: 3d-spectral-spatial mamba for hyperspectral image classification,” 2024. [Online]. Available: https://arxiv.org/abs/2405.12487
[38] Y.Zhong, X.Wang, Y.Xu, S.Wang, T.Jia, X.Hu, J.Zhao, L.Wei, and L.Zhang, “Mini-uav-borne hyperspectral remote sensing: From observation and processing to applications,” IEEE Geoscience and Remote Sensing Magazine, vol.6, no.4, pp. 46–62, 2018.
[39] Y.Zhong, X.Hu, C.Luo, X.Wang, J.Zhao, and L.Zhang, “Whu-hi: Uav-borne hyperspectral with high spatial resolution (h2) benchmark datasets and classifier for precise crop identification based on deep convolutional neural network with crf,” Remote Sensing of Environment, vol. 250, p. 112012, 2020.
[40] H.Guo and W.Liu, “S3l: Spectrum transformer for self-supervised learning in hyperspectral image classification,” Remote Sensing, vol.16, no.6, 2024.
[41] X.Zheng, H.Sun, X.Lu, and W.Xie, “Rotation-invariant attention network for hyperspectral image classification,” IEEE Transactions on Image Processing, vol.31, pp. 4251–4265, 2022.
[42] J.Feng, Q.Wang, G.Zhang, X.Jia, and J.Yin, “Cat: Center attention transformer with stratified spatial–spectral token for hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing, vol.62, pp. 1–15, 2024.
[43] Z.Xue, Q.Xu, and M.Zhang, “Local transformer with spatial partition restore for hyperspectral image classification,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol.15, pp. 4307–4325, 2022.
[44] S.Jia, Y.Wang, S.Jiang, and R.He, “A center-masked transformer for hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing, vol.62, pp. 1–16, 2024.
[45] R.Li, S.Zheng, C.Duan, Y.Yang, and X.Wang, “Classification of hyperspectral image based on double-branch dual-attention mechanism network,” Remote Sensing, vol.12, no.3, 2020. [Online]. Available: https://www.mdpi.com/2072-4292/12/3/582
[46] Q.Liu, L.Xiao, J.Yang, and Z.Wei, “Multilevel superpixel structured graph u-nets for hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing, vol.60, pp. 1–15, 2022.
[47] L.Sun, G.Zhao, Y.Zheng, and Z.Wu, “Spectral–spatial feature tokenization transformer for hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing, vol.60, pp. 1–14, 2022.
[48] B.Tu, X.Liao, Q.Li, Y.Peng, and A.Plaza, “Local semantic feature aggregation-based transformer for hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing, vol.60, pp. 1–15, 2022.
[49] J.Zhang, Z.Meng, F.Zhao, H.Liu, and Z.Chang, “Convolution transformer mixer for hyperspectral image classification,” IEEE Geoscience and Remote Sensing Letters, vol.19, pp. 1–5, 2022.
