# Explainable IDS

**Making IDS decisions interpretable and assessing explanation reliability**

**Deep Learning Project 5**  
**Major:** ICCN INE2  
**Academic Year:** 2025-2026

**Prepared by:**  
Mohamed Anaddam  
EL FARME AYMAN

**Supervised by:**  
Pr. Tarik Fissaa

---

## Abstract

Intrusion Detection Systems (IDS) are used to monitor network traffic and identify malicious behavior. Deep learning can improve detection capability, but most neural IDS models act as black boxes: they output a prediction without explaining which traffic characteristics caused the alert. In a security operations context, this lack of transparency is a serious limitation because analysts must decide whether an alert is credible, whether it requires escalation, and whether the model is relying on meaningful attack indicators or on spurious correlations.

This project implements an Explainable Intrusion Detection System (X-IDS) on the NSL-KDD dataset. The system trains and compares three deep learning architectures: a Multi-Layer Perceptron (MLP), a Long Short-Term Memory network (LSTM), and a one-dimensional Convolutional Neural Network (1D-CNN). The best performing model was the LSTM, reaching a weighted F1-score of 0.7800, ROC-AUC of 0.9434, and PR-AUC of 0.9222 on the test set.

After training, the project applies two post-hoc explainability methods: SHAP and LIME. SHAP identifies `logged_in`, `dst_host_rerror_rate`, `protocol_type`, `rerror_rate`, and `dst_host_serror_rate` as major contributors to anomaly predictions. LIME highlights a partially overlapping but substantially different set of features. The Spearman rank correlation between SHAP and LIME feature rankings is only 0.0714 (p = 0.8665), indicating essentially no agreement between the two rankings.

The project then evaluates explanation reliability. SHAP explanations are stable under very small perturbations, with a mean Pearson correlation coefficient (PCC) of 0.6293 at epsilon = 0.01, but fall below the 0.6 stability threshold at epsilon = 0.03 and 0.05. LIME is borderline stable across random seeds, with an average Spearman correlation of 0.6054. Faithfulness is evaluated by masking the top SHAP features; masking the top 10 features causes an average confidence drop of 0.4938, indicating that SHAP identifies features that materially affect the model output.

Finally, the report analyzes security implications. Explanations are useful for analysts, but they can also leak information to attackers. If adversaries learn which features drive detection, they may manipulate controllable traffic features to evade the model. Therefore, explainability in IDS must be paired with access control, query monitoring, and defense-in-depth.

---

## 1. Introduction

Modern networks generate large volumes of traffic, making manual security monitoring impossible. Intrusion Detection Systems help automate this task by identifying suspicious activity, policy violations, and possible attacks. Traditional IDS solutions often rely on signatures or hand-crafted rules. While such systems are interpretable, they struggle to generalize to unseen attack patterns. Machine learning and deep learning approaches can learn complex patterns from data, but they often sacrifice interpretability.

This trade-off is especially important in cybersecurity. A model that says “attack” without explanation is difficult to trust. Security analysts need to know whether the model is reacting to meaningful indicators such as failed logins, abnormal error rates, suspicious protocol behavior, or repeated connections to the same host. Explanations can help analysts validate alerts, prioritize investigation, debug models, and communicate evidence.

However, explainability also creates a new security risk. If explanations are exposed to the wrong user, they can reveal which features the model relies on. An attacker may then craft malicious traffic that avoids those features. This creates a tension: explanations help defenders, but may also help adversaries.

The goal of this project is to bridge these two aspects: interpretability and adversarial security awareness. The project does not only train an IDS model; it also evaluates whether the model's explanations are stable, faithful, and safe to expose.

---

## 2. Project Requirements and Objectives

The project specification defined the objective as follows:

> Project 5 focuses on Explainable IDS: making IDS decisions interpretable, assessing explanation reliability, training an intrusion detection model, applying explainability techniques, evaluating stability, and analyzing security implications.

From this description, the project objectives are:

1. Train an IDS model on NSL-KDD.
2. Compare multiple deep learning architectures to avoid relying on a single model choice.
3. Apply explainability methods to interpret predictions.
4. Evaluate explanation quality, especially stability and faithfulness.
5. Analyze adversarial risks related to exposing model explanations.
6. Produce a reproducible technical pipeline with code, notebook, figures, and report.

The central research question is:

> Can we make IDS decisions interpretable without compromising detection performance, and are the explanations stable enough to be trusted in security-critical settings?

---

## 3. Background

### 3.1 Intrusion Detection Systems

An Intrusion Detection System monitors network or host activity and attempts to detect malicious behavior. IDS approaches can be broadly divided into:

- Signature-based IDS: detects known attack patterns using predefined rules.
- Anomaly-based IDS: learns patterns of normal behavior and flags deviations.
- Machine-learning-based IDS: learns discriminative patterns from labeled data.

The approach in this project is supervised, machine-learning-based detection framed as binary classification: each NSL-KDD connection is labeled as either normal or anomalous.

### 3.2 Explainable Artificial Intelligence

Explainable AI (XAI) aims to make model decisions understandable to humans. In this project, explainability is used to answer questions such as:

- Which features caused the model to classify a connection as anomalous?
- Are the top features meaningful from a security perspective?
- Do two explanation methods agree?
- Are explanations stable under small perturbations?
- Could explanations reveal useful information to attackers?

### 3.3 SHAP

SHAP, or SHapley Additive exPlanations, is based on Shapley values from cooperative game theory. The intuition is to treat each feature as a “player” contributing to a prediction. SHAP estimates how much each feature moves the prediction away from a baseline expected output.

SHAP has several advantages:

- It provides local explanations for individual predictions.
- Local explanations can be aggregated into global feature importance.
- It has a strong theoretical foundation.
- KernelExplainer is model-agnostic, so it can explain any model through input-output queries.

Its limitations include computational cost and sensitivity to background data choice.
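To make the "features as players" intuition concrete, the following minimal sketch computes exact Shapley values by enumerating every coalition of features for a tiny hypothetical scorer. The function `toy_model`, the input `x`, and the all-zero `baseline` are illustrative assumptions, not part of the project code. KernelExplainer approximates this quantity by sampling, because exact enumeration grows as 2^n in the number of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values by enumerating every coalition of features.

    Features outside a coalition are replaced by their baseline value.
    """
    n = len(x)

    def value(coalition):
        masked = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(masked)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Hypothetical scorer with an interaction between features 0 and 1.
def toy_model(v):
    return 0.5 * v[0] + 0.2 * v[0] * v[1] + 0.1 * v[2]


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

In this toy case the interaction term is split equally between features 0 and 1, and the attributions sum to the difference between the model output at `x` and at `baseline`, which is the additivity property that SHAP relies on.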

### 3.4 LIME

LIME, or Local Interpretable Model-Agnostic Explanations, explains a single prediction by sampling perturbed versions of the input, querying the black-box model, and fitting a simple interpretable surrogate model locally. The surrogate approximates the black-box model near the instance being explained.

LIME is intuitive and flexible, but it is stochastic and can be unstable. Different random seeds, perturbation samples, or neighborhood definitions can lead to different explanations.
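The sampling-and-surrogate procedure described above can be sketched in plain Python. This is a simplified illustration, not the `lime` library's implementation: it assumes Gaussian perturbations, an exponential proximity kernel, and an unregularized weighted least-squares surrogate, and the names `lime_like_explanation`, `scale`, and `kernel_width` are hypothetical.

```python
import math
import random


def solve_linear_system(a, b):
    """Solve a @ w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (m[r][n] - tail) / m[r][r]
    return w


def lime_like_explanation(model, x, n_samples=500, scale=0.1,
                          kernel_width=0.25, seed=0):
    """Fit a proximity-weighted linear surrogate around x; return its slopes."""
    rng = random.Random(seed)
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        z = [xi + rng.gauss(0.0, scale) for xi in x]      # perturbed input
        dist2 = sum((zi - xi) ** 2 for zi, xi in zip(z, x))
        rows.append([1.0] + z)                            # 1.0 = intercept column
        targets.append(model(z))                          # black-box query
        weights.append(math.exp(-dist2 / kernel_width ** 2))
    k = len(x) + 1
    # Weighted normal equations: (X^T W X) w = X^T W y
    xtwx = [[sum(wt * r[i] * r[j] for r, wt in zip(rows, weights))
             for j in range(k)] for i in range(k)]
    xtwy = [sum(wt * r[i] * y for r, wt, y in zip(rows, weights, targets))
            for i in range(k)]
    return solve_linear_system(xtwx, xtwy)[1:]            # drop the intercept


# Hypothetical black box that is exactly linear, so the surrogate recovers it.
def linear_black_box(z):
    return 1.0 + 2.0 * z[0] - 3.0 * z[1]


coefs = lime_like_explanation(linear_black_box, [0.4, 0.6])
print(coefs)
```

For a linear black box the surrogate recovers the true slopes regardless of the seed. For a nonlinear model such as a neural IDS, the fitted slopes depend on which perturbations were sampled, which is the source of the instability noted above.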

### 3.5 Explanation Stability and Faithfulness

A good explanation should be stable and faithful.

Stability means that similar inputs should produce similar explanations. If a tiny perturbation changes the explanation drastically, an analyst cannot rely on it.

Faithfulness means that the explanation reflects the model's actual decision process. If an explanation claims that a feature is important, then changing or masking that feature should affect the model output.

This project evaluates stability using perturbations and correlation metrics, and evaluates faithfulness using feature masking.

---

## 4. Dataset: NSL-KDD

The dataset used in this project is NSL-KDD, an improved version of the KDD Cup 99 intrusion detection dataset. It contains network connection records with 41 features and class labels.

### 4.1 Dataset Size

| Split | Records |
|---|---:|
| Train | 151,165 |
| Test | 34,394 |

### 4.2 Class Distribution

| Split | Normal | Anomaly |
|---|---:|---:|
| Train | 80,792 | 70,373 |
| Test | 11,863 | 22,531 |

The train and test distributions differ. In the training set, normal records are slightly more frequent (about 53.4%). In the test set, anomalies are more frequent (about 65.5%). This distribution shift makes the problem more realistic and challenging because the model must generalize to a different test distribution.

### 4.3 Feature Types

The dataset has 41 features:

- 3 categorical features: `protocol_type`, `service`, `flag`
- 38 numerical features

The features can be grouped into:

1. Basic connection features: duration, protocol, service, flag, bytes.
2. Content features: login status, failed logins, root shell, compromised count.
3. Traffic features: connection counts and error rates.
4. Host-based statistics: destination-host counts and rates.

This feature structure is useful for interpretation because SHAP and LIME results can be mapped back to meaningful network concepts.

---

## 5. Preprocessing

The preprocessing pipeline converts raw NSL-KDD records into neural-network-ready tensors.

### 5.1 Label Encoding

The target is converted into binary labels:

- anomaly = 0
- normal = 1

The categorical input features are encoded using `LabelEncoder`:

| Feature | Number of Categories |
|---|---:|
| `protocol_type` | 3 |
| `service` | 70 |
| `flag` | 11 |

LabelEncoder was chosen because it preserves the original 41-feature structure, which makes explanation outputs easier to interpret. One-hot encoding would expand the feature space and make explanations less readable. The drawback is that LabelEncoder introduces an artificial ordering among categories; this is discussed as a limitation in Section 13.
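The artificial ordering can be seen in a minimal sketch that mimics how scikit-learn's `LabelEncoder` assigns integers (distinct categories in sorted order). The helper `fit_label_encoder` and the sample values are hypothetical and stand in for the real preprocessing code.

```python
def fit_label_encoder(values):
    """Map each distinct category to an integer in sorted order,
    mirroring how scikit-learn's LabelEncoder orders its classes."""
    classes = sorted(set(values))
    return {category: index for index, category in enumerate(classes)}


protocols = ["tcp", "udp", "icmp", "tcp", "tcp", "udp"]  # illustrative values
mapping = fit_label_encoder(protocols)
encoded = [mapping[p] for p in protocols]

print(mapping)  # {'icmp': 0, 'tcp': 1, 'udp': 2}
print(encoded)  # [1, 2, 0, 1, 1, 2]
```

After encoding, `tcp` sits numerically between `icmp` and `udp`, even though the protocols have no such order. Both the network and perturbation-based explainers then treat a small numeric change in this column as a small semantic change, which it is not.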

### 5.2 Feature Scaling

All features are scaled to the range [0, 1] using MinMaxScaler.

This is necessary because NSL-KDD features have very different scales. For example, byte counts can be extremely large, while error rates are already between 0 and 1. Without scaling, large-valued features could dominate gradient updates and perturbation-based explanation analysis.

Scaling is also important for stability evaluation. Since all features are in [0,1], perturbation values such as epsilon = 0.01, 0.03, and 0.05 have a consistent meaning across features.
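As a minimal sketch of the transformation, min-max scaling maps each value to `(x - min) / (max - min)`. The helper names and the raw byte counts below are hypothetical; in practice the minimum and maximum would normally be fit on the training split and reused for the test split.

```python
def fit_min_max(column):
    """Return the (min, max) pair used to scale a feature column."""
    return min(column), max(column)


def transform_min_max(column, lo, hi):
    """Scale values to [0, 1]; constant columns map to 0.0."""
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in column]


src_bytes = [0, 250, 1_000, 50_000]   # hypothetical raw byte counts
lo, hi = fit_min_max(src_bytes)
scaled_bytes = transform_min_max(src_bytes, lo, hi)

# A perturbation of epsilon in scaled space, expressed in raw units.
epsilon = 0.01
raw_shift = epsilon * (hi - lo)

print(scaled_bytes)  # [0.0, 0.005, 0.02, 1.0]
print(raw_shift)     # 500.0
```

In this toy column, epsilon = 0.01 corresponds to 500 raw bytes, while for a rate feature already in [0, 1] it corresponds to a change of 0.01. This is the sense in which the perturbation has a consistent relative meaning across features.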

---

## 6. Model Architectures

Three deep learning models were implemented and compared.

### 6.1 MLP Baseline

The MLP architecture is:

```text
Input(41) -> Linear(256) -> BatchNorm -> ReLU -> Dropout(0.3)
          -> Linear(128) -> BatchNorm -> ReLU -> Dropout(0.2)
          -> Linear(64)  -> ReLU
          -> Linear(2)
```

The MLP has 52,802 parameters. It is a strong baseline for tabular data. Batch normalization stabilizes training, and dropout reduces overfitting.
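The reported parameter count can be checked by hand from the architecture above. The sketch below assumes every linear layer has a bias and every batch-normalization layer has a learnable scale and shift, which is the default in common frameworks; the helper names are illustrative.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def batchnorm_params(n_features):
    """Learnable scale and shift of a batch-normalization layer."""
    return 2 * n_features


mlp_total = (
    linear_params(41, 256) + batchnorm_params(256)
    + linear_params(256, 128) + batchnorm_params(128)
    + linear_params(128, 64)
    + linear_params(64, 2)
)
print(mlp_total)  # 52802
```

Under these assumptions the total matches the 52,802 parameters stated in the report.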

### 6.2 LSTM Model

The LSTM treats the 41 features as a sequence:

```text
Input(41) -> reshape to (41,1)
          -> 2-layer LSTM(hidden=64, dropout=0.2)
          -> final hidden state
          -> fully connected classifier
```

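The LSTM has 52,578 parameters. Although NSL-KDD is not a time-series dataset, its features are ordered in semantic groups (basic, content, traffic, and host-based). Reading the 41 features as a length-41 sequence may let the LSTM capture dependencies within and across those groups.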

### 6.3 1D-CNN Model

The 1D-CNN architecture is:

```text
Input(41) -> reshape to (1,41)
          -> Conv1d(64, kernel=3)
          -> Conv1d(128, kernel=3)
          -> AdaptiveAvgPool1d(8)
          -> fully connected classifier
```

The 1D-CNN has 91,074 parameters. It attempts to learn local feature patterns, especially among neighboring traffic-rate features.

### 6.4 Training Configuration

| Parameter | Value |
|---|---:|
| Optimizer | Adam |
| Learning rate | 1e-3 |
| Weight decay | 1e-4 |
| Batch size | 256 |
| Epochs | 50 |
| Loss | CrossEntropyLoss |
| Random seed | 42 |
| Hardware | Tesla T4 GPU |

Class weights were used in CrossEntropyLoss to account for class imbalance.
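The report does not state how the class weights were derived. One common choice, shown here purely as an assumption, is the inverse-frequency ("balanced") heuristic `n_samples / (n_classes * class_count)`, applied to the training counts from Section 4.2.

```python
# Training-set class counts taken from Section 4.2.
train_counts = {"anomaly": 70_373, "normal": 80_792}

n_total = sum(train_counts.values())
n_classes = len(train_counts)

# Inverse-frequency heuristic (an assumption, not confirmed by the report).
class_weights = {
    label: n_total / (n_classes * count)
    for label, count in train_counts.items()
}

for label, weight in class_weights.items():
    print(f"{label}: {weight:.3f}")
# anomaly: 1.074
# normal: 0.936
```

With the encoding anomaly = 0 and normal = 1, such weights would be passed in that order to the `weight` argument of `CrossEntropyLoss`. Because the training set is only mildly imbalanced, both weights stay close to 1.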

---

## 7. Classification Results

The table below summarizes model performance.

| Model | Parameters | Weighted F1 | ROC-AUC | PR-AUC | Time |
|---|---:|---:|---:|---:|---:|
| MLP | 52,802 | 0.7639 | 0.9231 | 0.8699 | 145.1s |
| LSTM | 52,578 | 0.7800 | 0.9434 | 0.9222 | 162.9s |
| 1D-CNN | 91,074 | 0.7579 | 0.9410 | 0.9182 | 173.1s |

![Training curves](report_figures/fig_cell_12_0.png)

Figure 1. Training loss and test accuracy curves for the three models.

### 7.1 Interpretation

The LSTM achieved the strongest results across weighted F1, ROC-AUC, and PR-AUC. This suggests that modeling inter-feature dependencies as a sequence can be beneficial for NSL-KDD. The MLP was faster and simpler, but slightly weaker. The CNN had the most parameters but did not outperform the LSTM, indicating that local convolutional patterns may be less suitable than sequential feature dependencies for this feature ordering.

The differences between models are modest (weighted F1 ranges from 0.7579 to 0.7800), but the LSTM leads on all three metrics, which identifies it as the best detector in this experiment.
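For readers unfamiliar with the headline metric, weighted F1 averages the per-class F1 scores using each class's share of the true labels as the weight. The sketch below is a plain-Python illustration on hypothetical labels; it is intended to match the usual definition behind scikit-learn's `f1_score(average="weighted")`, but the report does not state which implementation was used.

```python
def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    total = 0.0
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * (tp + fn)          # tp + fn is the class support
    return total / len(y_true)


# Hypothetical labels using the report's encoding: anomaly = 0, normal = 1.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1]
score = weighted_f1(y_true, y_pred)
print(round(score, 4))  # 0.8381
```

Because the test set is dominated by anomalies, the anomaly-class F1 carries roughly two thirds of the weight in the reported scores.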

---

## 8. SHAP Explanation Analysis

SHAP was applied using KernelExplainer. A background set of 100 training samples was used, and SHAP values were computed for 150 test samples.

### 8.1 Global Feature Importance

The top SHAP features for the anomaly class were:

| Rank | Feature | Mean absolute SHAP value |
|---:|---|---:|
| 1 | `logged_in` | 0.0950 |
| 2 | `dst_host_rerror_rate` | 0.0619 |
| 3 | `protocol_type` | 0.0573 |
| 4 | `rerror_rate` | 0.0479 |
| 5 | `dst_host_serror_rate` | 0.0427 |
| 6 | `count` | 0.0398 |
| 7 | `serror_rate` | 0.0380 |
| 8 | `dst_host_same_srv_rate` | 0.0319 |
| 9 | `same_srv_rate` | 0.0256 |
| 10 | `dst_host_srv_serror_rate` | 0.0252 |
| 11 | `srv_rerror_rate` | 0.0231 |
| 12 | `srv_serror_rate` | 0.0230 |
| 13 | `dst_host_same_src_port_rate` | 0.0209 |
| 14 | `dst_host_count` | 0.0156 |
| 15 | `dst_host_diff_srv_rate` | 0.0153 |

![SHAP summary plot](report_figures/fig_cell_16_0.png)

Figure 2. SHAP summary plot for anomaly-class explanations.

![SHAP bar plot](report_figures/fig_cell_17_0.png)

Figure 3. Global SHAP feature importance ranking.
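The global ranking is an aggregation of local explanations: for each feature, the absolute SHAP values are averaged over all explained samples. A minimal sketch with hypothetical SHAP values (three samples, three features) is shown below; `global_importance` is an illustrative helper, not the project's function.

```python
def global_importance(shap_matrix, feature_names):
    """Rank features by mean absolute SHAP value across samples."""
    n = len(shap_matrix)
    scores = {
        name: sum(abs(row[j]) for row in shap_matrix) / n
        for j, name in enumerate(feature_names)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


feature_names = ["logged_in", "count", "rerror_rate"]
shap_matrix = [            # hypothetical values, one row per sample
    [0.10, -0.02, 0.05],
    [-0.08, 0.04, 0.03],
    [0.12, -0.03, -0.07],
]
ranking = global_importance(shap_matrix, feature_names)
for name, score in ranking:
    print(f"{name}: {score:.4f}")
```

Taking absolute values before averaging matters: a feature that pushes some samples toward "anomaly" and others toward "normal" would otherwise cancel out and look unimportant.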

### 8.2 Interpretation of SHAP Features

The strongest SHAP feature, `logged_in`, is meaningful because login status is highly relevant to differentiating normal user behavior from certain attacks. Error-rate features such as `rerror_rate`, `serror_rate`, and host-level error rates are also security-relevant because scans, failed connections, and denial-of-service behavior often produce abnormal connection error patterns.

The presence of destination-host statistics among top features suggests that the model uses aggregated traffic behavior, not only the attributes of an individual connection. This is positive from a security perspective because many host-level statistics are harder for an attacker to directly control.

### 8.3 Local Explanation Example

A local SHAP explanation was generated for an individual test sample. The sample was predicted as normal with very high confidence, and the true label was also normal.

![SHAP force plot](report_figures/fig_cell_18_1.png)

Figure 4. Local SHAP force plot for one prediction.

Local explanations are useful for analysts because they show why a specific alert or non-alert occurred, rather than only showing global feature rankings.

---

## 9. LIME Explanation Analysis

LIME was applied to 30 test samples using LimeTabularExplainer. For each sample, the top 10 local features were extracted, and feature frequencies were counted.

### 9.1 LIME Feature Frequency

| Feature | Frequency in Top-10 Explanations |
|---|---:|
| `wrong_fragment` | 30/30 |
| `rerror_rate` | 30/30 |
| `protocol_type` | 30/30 |
| `dst_host_rerror_rate` | 30/30 |
| `num_failed_logins` | 21/30 |
| `num_shells` | 21/30 |
| `logged_in` | 18/30 |
| `root_shell` | 18/30 |
| `su_attempted` | 17/30 |
| `hot` | 17/30 |

![LIME vs SHAP ranking](report_figures/fig_cell_21_0.png)

Figure 5. SHAP and LIME feature ranking comparison.

### 9.2 SHAP vs LIME Agreement

The Spearman rank correlation between SHAP and LIME feature rankings was:

```text
Spearman rho = 0.0714
p-value = 0.8665
common features = 8
```

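This is close to zero, and the p-value of 0.8665 means the correlation is not statistically distinguishable from no association. The estimate is also based on only 8 common features, so it should be read with caution. In short, SHAP and LIME show essentially no agreement on feature ordering.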
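For untied ranks, Spearman's rho reduces to `1 - 6 * sum(d^2) / (n * (n^2 - 1))`, where `d` is the rank difference per feature. The sketch below uses two hypothetical rankings of 8 features, chosen only so that the squared rank differences sum to 78. Assuming no ties, that sum is what a rho of 0.0714 over 8 common features implies; these are not the project's actual rankings.

```python
def spearman_from_ranks(rank_a, rank_b):
    """Spearman's rho for two rankings without ties."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


# Hypothetical ranks of the same 8 features under two explanation methods.
shap_rank = [1, 2, 3, 4, 5, 6, 7, 8]
lime_rank = [5, 8, 1, 2, 3, 7, 4, 6]

rho = spearman_from_ranks(shap_rank, lime_rank)
print(round(rho, 4))  # 0.0714
```

For comparison, identical rankings give rho = 1 and fully reversed rankings give rho = -1, so a value near 0 means the two orderings are essentially unrelated rather than opposed.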

### 9.3 Interpretation

This disagreement is an important finding. It shows that explanation results are method-dependent. SHAP and LIME are both model-agnostic, but they make different assumptions:

- SHAP estimates feature contributions using a game-theoretic framework.
- LIME fits a local surrogate model using perturbed samples.

Because IDS explanations may influence security decisions, relying on only one explanation method may be risky.

---

## 10. Explanation Stability Evaluation

Explanation stability was evaluated using perturbation tests inspired by the SAFARI robustness evaluation approach.

### 10.1 SHAP Perturbation Stability

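Small uniform noise, bounded by epsilon, was added to test samples. Since the features were scaled to [0,1], epsilon values have a consistent interpretation. Two metrics are reported: SENS_MAX, the maximum sensitivity of the explanation to the perturbation (lower is more stable), and PCC, the Pearson correlation coefficient between the original and perturbed explanations (higher is more stable).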

| Epsilon | Mean SENS_MAX | Mean PCC | Status |
|---:|---:|---:|---|
| 0.01 | 0.3130 | 0.6293 | Stable |
| 0.03 | 0.3751 | 0.5861 | Unstable |
| 0.05 | 0.4527 | 0.5676 | Unstable |

![Stability summary](report_figures/fig_cell_26_0.png)

Figure 6. SHAP sensitivity, SHAP stability, and faithfulness summary.

### 10.2 Interpretation

At epsilon 0.01, SHAP explanations had PCC above the 0.6 stability threshold. At epsilon 0.03 and 0.05, PCC dropped below the threshold. This means that SHAP explanations are stable only for very small perturbations.

The increase in SENS_MAX from 0.3130 to 0.4527 shows that explanation sensitivity grows as perturbation size increases.
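The perturbation test can be sketched as follows. The report does not give exact metric definitions, so this sketch assumes that PCC is the Pearson correlation between the original and perturbed attribution vectors, and that SENS_MAX is the largest Euclidean change in the attribution vector over the sampled perturbations. The explainer `toy_explain` and its weights are hypothetical stand-ins for SHAP applied to the trained model.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length vectors."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def stability_under_noise(explain, x, epsilon, n_trials=20, seed=42):
    """Mean PCC and maximum attribution shift under uniform noise."""
    rng = random.Random(seed)
    base = explain(x)
    pccs, shifts = [], []
    for _ in range(n_trials):
        # Uniform noise in [-epsilon, epsilon], clipped to the scaled range.
        noisy = [min(1.0, max(0.0, v + rng.uniform(-epsilon, epsilon)))
                 for v in x]
        attr = explain(noisy)
        pccs.append(pearson(base, attr))
        shifts.append(math.dist(base, attr))
    return {"mean_pcc": sum(pccs) / n_trials, "sens_max": max(shifts)}


# Hypothetical explainer: attributions of a fixed linear scorer, w_i * x_i.
WEIGHTS = [0.9, -0.4, 0.2, 0.6]


def toy_explain(x):
    return [w * v for w, v in zip(WEIGHTS, x)]


x = [0.5, 0.2, 0.8, 0.1]
result = stability_under_noise(toy_explain, x, epsilon=0.01)
status = "Stable" if result["mean_pcc"] >= 0.6 else "Unstable"
print(result, status)
```

A linear toy explainer is almost perfectly stable, which is the point of the contrast: the much lower PCC values in the table reflect the nonlinearity of the neural model and the sampling noise of Kernel SHAP, not the metric itself.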

### 10.3 LIME Stochastic Stability

LIME was evaluated by running explanations multiple times with different random seeds and computing pairwise Spearman rank correlation.

| Sample | Mean Spearman |
|---:|---:|
| 1 | 0.5912 |
| 2 | 0.6087 |
| 3 | 0.5953 |
| 4 | 0.6018 |
| 5 | 0.6198 |
| 6 | 0.6155 |
| Average | 0.6054 |

LIME's average stability (0.6054) is just above the 0.6 threshold, and two of the six samples (samples 1 and 3) fall below it, so LIME is at best borderline stable.
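The per-sample score can be sketched as the mean Spearman correlation over all pairs of runs. The three runs below are hypothetical feature rankings from repeated LIME calls with different seeds, and the final lines recompute the table's average from its six per-sample values.

```python
from itertools import combinations
from statistics import mean


def spearman_from_ranks(rank_a, rank_b):
    """Spearman's rho for two rankings without ties."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


def mean_pairwise_spearman(runs):
    """Average rho over every pair of explanation runs for one sample."""
    return mean(spearman_from_ranks(a, b) for a, b in combinations(runs, 2))


# Hypothetical ranks of 5 features from three LIME runs on the same sample.
runs = [
    [1, 2, 3, 4, 5],
    [2, 1, 3, 5, 4],
    [1, 3, 2, 4, 5],
]
score = mean_pairwise_spearman(runs)
print(round(score, 4))  # 0.7667

# Per-sample values from the table above.
per_sample = [0.5912, 0.6087, 0.5953, 0.6018, 0.6198, 0.6155]
average = round(mean(per_sample), 4)
print(average)  # 0.6054
```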

---

## 11. Faithfulness Evaluation

Faithfulness was evaluated by masking the top SHAP features and measuring confidence drop.

| Masked Features | Confidence Drop |
|---|---:|
| Top-3 | 0.3355 +/- 0.4244 |
| Top-5 | 0.3592 +/- 0.4142 |
| Top-10 | 0.4938 +/- 0.4420 |

The confidence drop increases as more top-ranked features are masked. This supports the faithfulness of SHAP explanations: the features identified as important are actually used by the model.

The standard deviations are large, which means faithfulness varies between individual samples. This is expected in IDS because different attack types may depend on different feature subsets.
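The masking procedure can be illustrated with a toy logistic scorer. The report does not specify the masking value, so this sketch assumes that masked features are replaced by a baseline of 0; the weights, input, and feature ranking are hypothetical.

```python
import math


def predict_proba(x, weights, bias):
    """Confidence of a toy logistic scorer for the predicted class."""
    z = bias + sum(w * v for w, v in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def confidence_drop(x, ranked_features, k, weights, bias, baseline):
    """Drop in confidence after masking the k top-ranked features."""
    masked = list(x)
    for idx in ranked_features[:k]:
        masked[idx] = baseline[idx]      # assumption: mask = baseline value
    return predict_proba(x, weights, bias) - predict_proba(masked, weights, bias)


weights = [3.0, 2.0, 1.0, 0.5]           # hypothetical model weights
bias = -2.0
x = [0.9, 0.8, 0.7, 0.6]                 # hypothetical scaled input
baseline = [0.0, 0.0, 0.0, 0.0]
ranked = [0, 1, 2, 3]                    # features ordered by importance

drops = {k: confidence_drop(x, ranked, k, weights, bias, baseline)
         for k in (1, 2, 3)}
for k, drop in drops.items():
    print(f"top-{k}: {drop:.4f}")
```

If the ranking is faithful, the drop grows as `k` increases, which is the pattern reported in the table. Masking features in a random order provides a natural reference point for judging how large a drop is meaningful.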

---

## 12. Security Implications

Explainability improves analyst understanding but creates an attack surface.

### 12.1 Explanation Leakage

If an attacker can query the IDS and observe SHAP or LIME explanations, they may infer which features are most important. Then they can attempt evasion by modifying controllable features.

For example:

1. The attacker sends probe traffic and observes explanations.
2. Explanations reveal that error-rate features are important.
3. The attacker adjusts traffic to reduce visible error rates.
4. The model may become more likely to classify the traffic as normal.

### 12.2 Feature Manipulability

Top SHAP features were categorized by manipulability.

| Rank | Feature | SHAP | Manipulability |
|---:|---|---:|---|
| 1 | `logged_in` | 0.0950 | Non-manipulable |
| 2 | `dst_host_rerror_rate` | 0.0619 | Non-manipulable |
| 3 | `protocol_type` | 0.0573 | Partial |
| 4 | `rerror_rate` | 0.0479 | Partial |
| 5 | `dst_host_serror_rate` | 0.0427 | Non-manipulable |
| 6 | `count` | 0.0398 | Partial |
| 7 | `serror_rate` | 0.0380 | Partial |
| 8 | `dst_host_same_srv_rate` | 0.0319 | Non-manipulable |
| 9 | `same_srv_rate` | 0.0256 | Non-manipulable |
| 10 | `dst_host_srv_serror_rate` | 0.0252 | Non-manipulable |

Many important features are non-manipulable host-side statistics or only partially manipulable. This is positive because it makes simple evasion harder. However, some important features are partially controllable, so the risk is not eliminated.
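The balance between the two categories can be quantified directly from the table by summing the mean absolute SHAP values per manipulability class. The sketch below only re-uses the values listed above.

```python
# (feature, mean |SHAP|, manipulability) taken from the table above.
top10 = [
    ("logged_in", 0.0950, "Non-manipulable"),
    ("dst_host_rerror_rate", 0.0619, "Non-manipulable"),
    ("protocol_type", 0.0573, "Partial"),
    ("rerror_rate", 0.0479, "Partial"),
    ("dst_host_serror_rate", 0.0427, "Non-manipulable"),
    ("count", 0.0398, "Partial"),
    ("serror_rate", 0.0380, "Partial"),
    ("dst_host_same_srv_rate", 0.0319, "Non-manipulable"),
    ("same_srv_rate", 0.0256, "Non-manipulable"),
    ("dst_host_srv_serror_rate", 0.0252, "Non-manipulable"),
]

totals = {}
for _, value, category in top10:
    totals[category] = totals.get(category, 0.0) + value

grand_total = sum(totals.values())
shares = {category: value / grand_total for category, value in totals.items()}

for category, share in shares.items():
    print(f"{category}: {share:.1%}")
# Non-manipulable: 60.7%
# Partial: 39.3%
```

Roughly 61% of the top-10 SHAP mass sits on features labeled non-manipulable, while about 39% sits on partially manipulable features, which is the residual evasion risk noted above.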

### 12.3 Threat Scenarios

**Scenario 1: Evasion via Explanation Leakage**

An attacker repeatedly queries the IDS explanation interface, identifies top detection features, and crafts traffic to reduce those signals.

**Scenario 2: LIME Instability Exploitation**

If LIME explanations vary across runs, analysts may receive inconsistent justifications for the same traffic. This can reduce trust and slow incident response.

**Scenario 3: Backdoor with Clean Explanations**

A poisoned model may classify triggered malicious traffic as normal while still producing plausible explanations. This shows that explanations cannot replace model integrity checks.

### 12.4 Mitigations

Recommended mitigations include:

- restrict explanation access to trusted analysts,
- rate-limit explanation API queries,
- log explanation requests,
- avoid exposing raw SHAP values externally,
- aggregate explanations for dashboards,
- combine ML IDS with rule-based IDS,
- validate model behavior under adversarial testing.
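Two of these mitigations, rate limiting and request logging, can be sketched together as a small gate placed in front of a hypothetical explanation endpoint. The class name `ExplanationGate`, the limits, and the analyst identifier are illustrative assumptions, and timestamps are passed in explicitly to keep the example deterministic.

```python
from collections import defaultdict, deque


class ExplanationGate:
    """Sliding-window rate limiter with an audit log for explanation requests."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._history = defaultdict(deque)   # analyst_id -> recent timestamps
        self.audit_log = []                  # (timestamp, analyst_id, allowed)

    def allow(self, analyst_id, now):
        recent = self._history[analyst_id]
        # Forget requests that have left the sliding window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        allowed = len(recent) < self.max_requests
        if allowed:
            recent.append(now)
        # Every request is logged, including rejected ones.
        self.audit_log.append((now, analyst_id, allowed))
        return allowed


gate = ExplanationGate(max_requests=3, window_seconds=60)
decisions = [gate.allow("analyst-1", t) for t in (0, 10, 20, 30, 61)]
print(decisions)  # [True, True, True, False, True]
```

Logging rejected requests as well as accepted ones is a deliberate choice: a burst of denied explanation queries is itself a signal that someone may be probing the model.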

---

## 13. Limitations

This project has several limitations.

First, NSL-KDD is a benchmark dataset and does not fully represent modern network traffic. Real enterprise traffic is more complex, dynamic, and noisy.

Second, LabelEncoder preserves interpretability but imposes artificial ordering on categorical features. In future work, embeddings or one-hot encoding could be compared.

Third, Kernel SHAP is computationally expensive. Therefore, explanations were computed on sampled subsets rather than the full test set.

Fourth, the project focuses on binary classification. A full IDS should also distinguish between attack families such as DoS, Probe, R2L, and U2R.

Fifth, the security analysis is conceptual and feature-based. It does not include full adversarial attack generation or real network deployment.

---

## 14. Future Work

Future extensions include:

1. Evaluate on newer datasets such as CIC-IDS2017 or UNSW-NB15.
2. Extend from binary classification to multiclass attack classification.
3. Compare LabelEncoder, one-hot encoding, and learned embeddings.
4. Test adversarial evasion attacks guided by SHAP and LIME.
5. Evaluate explanation stability across more samples and attack types.
6. Build a small analyst dashboard with controlled explanation access.
7. Compare post-hoc explanations with inherently interpretable models.

---

## 15. Conclusion

This project implemented an Explainable Intrusion Detection System using deep learning on NSL-KDD. Three models were trained and compared, with the LSTM achieving the best overall detection performance. SHAP and LIME were applied to interpret model predictions. SHAP identified meaningful IDS features such as login status and host-level error rates, while LIME produced a different ranking of important features.

The low SHAP-LIME correlation shows that explanation methods can disagree, which is critical in security settings. Stability evaluation showed that SHAP explanations are reliable only under very small perturbations, and LIME is borderline stable. Faithfulness evaluation showed that masking the top 10 SHAP features reduces model confidence by 0.4938 on average, supporting the usefulness of SHAP explanations.

The security analysis shows that explainability is a double-edged sword. It helps defenders understand alerts, but it may also leak information to attackers. Therefore, explanations should be used as analyst-support tools, not as unrestricted public outputs.

The final conclusion is:

> Explainable IDS is possible and useful, but trustworthy explainable IDS requires not only accurate models, but also rigorous evaluation of explanation stability, faithfulness, and adversarial risk.

---

## References

1. Tavallaee, M., Bagheri, E., Lu, W., & Ghorbani, A. A. (2009). *A Detailed Analysis of the KDD CUP 99 Data Set*. IEEE Symposium on Computational Intelligence for Security and Defense Applications.
2. Lundberg, S. M., & Lee, S.-I. (2017). *A Unified Approach to Interpreting Model Predictions*. NeurIPS.
3. Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). *"Why Should I Trust You?": Explaining the Predictions of Any Classifier*. KDD.
4. Huang, W., Zhao, X., Jin, G., & Huang, X. (2023). *SAFARI: Versatile and Efficient Evaluations for Robustness of Interpretability*. ICCV.
5. UNB Canadian Institute for Cybersecurity. NSL-KDD Dataset. <https://www.unb.ca/cic/datasets/nsl.html>