Teacher Presentation Guide — Explainable Intrusion Detection System (X-IDS)

Repo: cathrica/deep-learning-project
Project: ICCN-INE2 Deep Learning — Project 5: Explainable IDS
Dataset: NSL-KDD | Models: MLP, LSTM, 1D-CNN | XAI: SHAP + LIME

1. The 30-Second Elevator Pitch

We built an Explainable Intrusion Detection System that detects malicious network connections using deep learning, then explains why each decision was made using SHAP and LIME. We also evaluated whether those explanations are stable, faithful, and safe to expose in a security environment.

Best model: LSTM with weighted F1 = 0.7800, ROC-AUC = 0.9434, PR-AUC = 0.9222. SHAP and LIME did not agree (Spearman = 0.0714), and explanations lost stability as input perturbations grew. Security analysis showed that exposing raw explanations can help attackers evade detection, so access must be controlled.

2. Why This Project Matters (Motivation)

Traditional IDS alerts are black-box — analysts get a flag but no evidence.
Deep learning improves detection but hides reasoning.
In cybersecurity, a false positive wastes analyst time; a false negative lets attacks through.
Explainability can help defenders prioritize alerts and verify model behavior.
Risk: if attackers see which features matter most, they can craft evasion attacks.
Our project asks: Can we explain IDS decisions without destroying trust or security?

3. Dataset — NSL-KDD

Property	Value
Source	UNB Canadian Institute for Cybersecurity
HF Hub	`Mireu-Lab/NSL-KDD`
Records	Train: 151,165 / Test: 34,394
Features	41 (3 categorical + 38 numerical)
Categorical	`protocol_type` (3), `service` (70), `flag` (11)
Task	Binary classification: Normal vs Anomaly
Train distribution	53% Normal / 47% Anomaly
Test distribution	34% Normal / 66% Anomaly

Important detail: The test set has a distribution shift, with anomalies making up 66% of test records versus 47% of training records. This makes generalization harder and is worth mentioning as a realistic challenge.

Preprocessing Choices

Step	Method	Why
Categorical encoding	LabelEncoder	Preserves the 41-feature structure so SHAP/LIME outputs map cleanly to the original features. OneHot would expand the 3 categorical features into 84 binary columns (122 columns in total) and hurt interpretability.
Scaling	MinMaxScaler [0,1]	Features have wildly different ranges (e.g., `src_bytes` up to 1.3B vs `serror_rate` 0–1). Scaling stabilizes training and makes ε-perturbations meaningful for stability testing.
Reproducibility	Seed 42, fixed splits	Every experiment is deterministic.

Teacher might ask: Why LabelEncoder instead of OneHot?
Answer: OneHot would replace the 3 categorical features with 84 binary columns (3 + 70 + 11). SHAP and LIME would then attribute importance to individual binary columns, such as a single service value, instead of to the semantic feature, which is much harder for analysts to interpret. The trade-off is that LabelEncoder imposes an artificial ordering on categories, which we acknowledge as a limitation.
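
The encoding and scaling choices above can be sketched in a few lines. This is a minimal pure-Python illustration, not the project's pipeline (the guide names `LabelEncoder` and `MinMaxScaler`); the helper names and the toy rows are made up for the example.

```python
def fit_label_encoder(values):
    """Map each category to an integer by sorted order (the usual label-encoding convention)."""
    return {cat: idx for idx, cat in enumerate(sorted(set(values)))}


def min_max_scale(column):
    """Rescale a numeric column to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


# Toy rows: (protocol_type, src_bytes). Values are illustrative only.
rows = [("tcp", 491.0), ("udp", 146.0), ("icmp", 0.0), ("tcp", 1_300_000.0)]

encoder = fit_label_encoder(r[0] for r in rows)
encoded = [float(encoder[r[0]]) for r in rows]
scaled_protocol = min_max_scale(encoded)
scaled_bytes = min_max_scale([r[1] for r in rows])

# Column count if the three categorical features were one-hot encoded instead,
# using the cardinalities from the dataset table (3, 70, 11) plus 38 numeric features.
one_hot_binary_columns = 3 + 70 + 11
one_hot_total_columns = one_hot_binary_columns + 38
```

The encoded `protocol_type` makes the trade-off visible: `icmp < tcp < udp` is an ordering the encoder invents, not one that exists in the data.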

4. Models & Architecture Choices

We compared three lightweight architectures with the same training config:

Parameter	Value
Optimizer	Adam
Learning rate	1e-3
Weight decay	1e-4
Batch size	256
Epochs	50
Loss	CrossEntropyLoss with inverse-frequency class weights
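
The guide does not state the exact formula behind the inverse-frequency class weights. A common choice, assumed here, is the "balanced" heuristic w_c = N / (K · n_c), where N is the number of samples, K the number of classes, and n_c the count of class c. The counts below are illustrative and simply mirror the 53% / 47% train split.

```python
def inverse_frequency_weights(class_counts):
    """Return weights w_c = N / (K * n_c): rarer classes get larger weights."""
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {c: total / (n_classes * n) for c, n in class_counts.items()}


# Illustrative counts matching a 53% / 47% split over 1,000 samples (not project data).
counts = {"normal": 530, "anomaly": 470}
weights = inverse_frequency_weights(counts)
```

In a PyTorch setup these values would typically be passed as the `weight` tensor of `CrossEntropyLoss`, ordered by class index.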

4.1 MLP (Baseline)

Input(41) → Linear(256) → BatchNorm → ReLU → Dropout(0.3)
          → Linear(128) → BatchNorm → ReLU → Dropout(0.2)
          → Linear(64)  → ReLU
          → Linear(2 classes)

Parameters: ~50K
Why: Standard tabular baseline. BatchNorm stabilizes gradients; dropout regularizes.

4.2 LSTM (Best Performer)

Input(41) → reshape to (41, 1) → 2-layer LSTM(hidden=64, dropout=0.2)
          → last hidden state → Linear(2 classes)

Parameters: ~35K
Why: Treats the 41 features as a sequence. NSL-KDD features are semantically grouped (basic → content → time-based → host-based), so the LSTM can learn dependencies between these groups. This inductive bias is our best explanation for why it generalized best despite having fewer parameters than the CNN; shuffling the feature order would be a direct way to test it.

4.3 1D-CNN

Input(41) → reshape to (1, 41) → Conv1d(64, k=3, pad=1) → ReLU
          → Conv1d(128, k=3, pad=1) → ReLU → AdaptiveAvgPool1d(8)
          → Flatten → Linear(64) → ReLU → Linear(2 classes)

Parameters: ~45K
Why: Learns local patterns between neighboring features, which suits the rate-based feature blocks. However, it trailed the LSTM on weighted F1 (0.7579 vs 0.7800), suggesting that more parameters do not help when the architectural bias does not match the data structure.
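
Parameter counts can be sanity-checked by hand from the layer shapes listed in 4.1 to 4.3. The sketch below assumes PyTorch's default parametrization (bias terms enabled, two bias vectors per LSTM layer, affine BatchNorm) and that the layers are exactly as drawn. The real implementation may differ, so treat the result as a cross-check rather than a replacement for the model summary.

```python
def linear(n_in, n_out):
    """Weights plus bias of a fully connected layer."""
    return n_in * n_out + n_out


def batchnorm(n_features):
    """Learnable scale and shift (running statistics are not trainable)."""
    return 2 * n_features


def lstm_layer(n_in, hidden):
    """Four gates, each with input weights, recurrent weights, and two bias vectors."""
    return 4 * (n_in * hidden + hidden * hidden + 2 * hidden)


def conv1d(c_in, c_out, kernel):
    """One kernel per (input, output) channel pair, plus a bias per output channel."""
    return c_in * c_out * kernel + c_out


mlp = (linear(41, 256) + batchnorm(256) + linear(256, 128) + batchnorm(128)
       + linear(128, 64) + linear(64, 2))
lstm = lstm_layer(1, 64) + lstm_layer(64, 64) + linear(64, 2)
# AdaptiveAvgPool1d(8) leaves 128 channels x 8 positions = 1024 inputs to the first Linear.
cnn = conv1d(1, 64, 3) + conv1d(64, 128, 3) + linear(128 * 8, 64) + linear(64, 2)

print(mlp, lstm, cnn)  # prints: 52802 50562 90690
```

Under these assumptions the LSTM and 1D-CNN totals come out higher than the approximate counts quoted above, so it is worth confirming the figures against the actual model before presenting them. The ordering (LSTM smaller than CNN) holds either way.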

Performance Results

Model	Weighted F1	ROC-AUC	PR-AUC	Training Time
LSTM	0.7800	0.9434	0.9222	162.9s
MLP	0.7639	0.9231	0.8699	145.1s
1D-CNN	0.7579	0.9410	0.9182	173.1s

Teacher might ask: Why did LSTM win despite fewer parameters?
Answer: Our interpretation is that the LSTM's sequential processing matches the semantic grouping of NSL-KDD features. The CNN assumes local spatial patterns, which is less natural for this tabular feature ordering. The MLP is fully connected, so it can combine any features, but it has no built-in notion of feature order or grouping and must learn group-level dependencies from scratch.
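
For readers unsure how the weighted F1 in the results table is defined, here is a minimal sketch: per-class F1, averaged with each class weighted by its share of true samples. The labels are toy values, not project predictions.

```python
def f1_for_class(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_f1(y_true, y_pred):
    """Average per-class F1, weighting each class by its number of true samples."""
    total = len(y_true)
    score = 0.0
    for cls in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        support = tp + fn
        score += (support / total) * f1_for_class(tp, fp, fn)
    return score


# Toy labels (1 = anomaly, 0 = normal); not project data.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0]
score = weighted_f1(y_true, y_pred)  # both classes have F1 = 2/3, so the result is 2/3
```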

5. Explainability — SHAP & LIME

We used post-hoc explainability (explaining a trained model rather than building an inherently interpretable one) so that we could keep the more expressive deep learning models and still attach an explanation to each prediction.

SHAP (SHapley Additive exPlanations)

Method: KernelExplainer (model-agnostic)
What it does: Estimates how much each feature pushes the prediction away from the average prediction, based on game-theoretic Shapley values.
Top anomaly features: logged_in (0.0950), dst_host_rerror_rate (0.0619), protocol_type (0.0573), rerror_rate (0.0479), dst_host_serror_rate (0.0427)
Why these make sense: Login status and error rates are classic intrusion indicators.
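
To make the Shapley idea concrete, the sketch below computes exact Shapley values for a tiny hypothetical model by enumerating every feature coalition. This is not what `KernelExplainer` does internally (it approximates these values by sampling, because enumeration costs 2^n model calls), and the toy model and baseline are assumptions chosen for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values by enumerating every coalition of features.

    Features outside a coalition are replaced by their baseline value.
    """
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Hypothetical 3-feature scoring function with an interaction term.
def toy_model(v):
    return 0.5 * v[0] + 0.2 * v[1] + 0.3 * v[0] * v[2]


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
```

The attributions sum to `toy_model(x) - toy_model(baseline)`, and the interaction term is split evenly between features 0 and 2, which is the additivity behaviour that makes SHAP values comparable across features.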

LIME (Local Interpretable Model-Agnostic Explanations)

Method: LimeTabularExplainer
What it does: Perturbs the input, observes predictions, fits a simple linear model locally to approximate the black-box model near that point.
Top features (frequency in 30 explanations): wrong_fragment (30/30), rerror_rate (30/30), protocol_type (30/30), dst_host_rerror_rate (30/30)
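
The perturb, predict, and fit loop described above can be sketched without any library. This is a simplified stand-in for `LimeTabularExplainer`, which differs in its details (for example, it can discretize continuous features). The Gaussian sampling, the kernel width, and the black-box function are all assumptions made for the example.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def local_surrogate(model, x, n_samples=200, scale=0.1, kernel_width=0.25, seed=42):
    """Fit a proximity-weighted linear model around x; return [intercept, coef_0, ...]."""
    rng = random.Random(seed)
    d = len(x)
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        z = [xi + rng.gauss(0.0, scale) for xi in x]
        dist2 = sum((zi - xi) ** 2 for zi, xi in zip(z, x))
        rows.append([1.0] + z)
        targets.append(model(z))
        weights.append(math.exp(-dist2 / kernel_width ** 2))
    # Weighted normal equations: (X^T W X) beta = X^T W y
    xtwx = [[sum(w * r[i] * r[j] for r, w in zip(rows, weights)) for j in range(d + 1)]
            for i in range(d + 1)]
    xtwy = [sum(w * r[i] * t for r, t, w in zip(rows, targets, weights)) for i in range(d + 1)]
    return solve(xtwx, xtwy)


# Hypothetical black box: a sigmoid over two features.
def black_box(v):
    return 1.0 / (1.0 + math.exp(-(3.0 * v[0] - 2.0 * v[1])))


coefs = local_surrogate(black_box, [0.2, 0.4])
```

Changing `seed` changes the sampled neighbourhood and therefore the fitted coefficients, which is the source of the run-to-run variation discussed in the stability section.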

Key Finding: SHAP vs LIME Disagreement

Metric	Value
Spearman rank correlation	0.0714
p-value	0.8665

Interpretation: A rank correlation of 0.0714 with p = 0.8665 is statistically indistinguishable from no agreement, so the two methods rank features almost independently. This is critical: explanations are method-dependent, and you cannot trust one method blindly.

Teacher might ask: Which method do you trust more?
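Answer: SHAP has stronger theoretical foundations (Shapley values from game theory, with consistency and additivity guarantees). Kernel SHAP still estimates those values by sampling, so it is only reproducible with a fixed seed and enough samples, but it converges to a well-defined target. LIME is intuitive but depends on its perturbation and kernel settings, and its rankings vary between runs. For security-critical decisions, I would prefer SHAP but still validate with stability and faithfulness tests.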
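
The agreement metric can be reproduced by hand: rank each method's importance scores, then take the Pearson correlation of the two rank vectors. The scores below are hypothetical, chosen only so that the coefficient lands near the reported 0.0714; they are not the project's attributions.

```python
import math


def ranks(scores):
    """Rank scores from 1 (largest) downward, giving tied scores their average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = avg_rank
        pos = end + 1
    return result


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(scores_a, scores_b):
    """Spearman rank correlation: Pearson correlation computed on ranks."""
    return pearson(ranks(scores_a), ranks(scores_b))


# Hypothetical importance scores for eight features under two methods.
shap_like = [0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
lime_like = [0.4, 0.3, 0.8, 0.6, 0.1, 0.7, 0.2, 0.5]
rho = spearman(shap_like, lime_like)  # 3 / 42, roughly 0.0714
```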

6. Stability & Faithfulness

An explanation is only useful if it is reliable.

6.1 Stability — Perturbation Test

We added small ε-bounded noise to inputs and compared the SHAP attributions before and after perturbation using the Pearson Correlation Coefficient (PCC). A PCC near 1 means the explanation barely changed.

Epsilon	PCC	Verdict
0.01	0.6293	✅ Stable (≥ 0.6 threshold)
0.03	0.5861	❌ Unstable
0.05	0.5676	❌ Unstable

Threshold 0.6 is inspired by the SAFARI framework (Huang et al. 2022).

LIME stochastic stability: Mean Spearman across 20 runs = 0.6054 — borderline stable.
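
The perturbation test can be sketched as follows. To stay self-contained, the example swaps in a toy attribution (gradient times input for a logistic model) where the project uses SHAP, and assumes uniform noise in [-ε, ε] clipped to the [0, 1] range produced by min-max scaling. The weights and input are hypothetical.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def attribution(weights, x):
    """Toy attribution: gradient times input for a logistic model (stand-in for SHAP)."""
    z = sum(w * xi for w, xi in zip(weights, x))
    p = 1.0 / (1.0 + math.exp(-z))
    grad_scale = p * (1.0 - p)
    return [grad_scale * w * xi for w, xi in zip(weights, x)]


def stability(weights, x, epsilon, n_trials=50, seed=42):
    """Mean PCC between the original attribution and attributions of perturbed inputs."""
    rng = random.Random(seed)
    base = attribution(weights, x)
    scores = []
    for _ in range(n_trials):
        noisy = [min(1.0, max(0.0, xi + rng.uniform(-epsilon, epsilon))) for xi in x]
        scores.append(pearson(base, attribution(weights, noisy)))
    return sum(scores) / len(scores)


weights = [1.5, -2.0, 0.7, 0.3, -1.1]   # hypothetical model weights
x = [0.2, 0.9, 0.5, 0.1, 0.6]           # one min-max-scaled input
for eps in (0.01, 0.03, 0.05):
    pcc = stability(weights, x, eps)
    print(f"eps={eps:.2f}  PCC={pcc:.4f}  stable={pcc >= 0.6}")
```

The toy model is far smoother than a trained network, so its PCC values will not resemble the table above; the point is the procedure and the 0.6 threshold check.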

6.2 Faithfulness — Feature Masking

If SHAP says a feature is important, removing it should hurt confidence.

Masked features	Confidence drop
Top 3	0.3355
Top 5	0.3592
Top 10	0.4938

Interpretation: The more top features we mask, the bigger the confidence drop, which is consistent with SHAP identifying features the model actually uses.

Teacher might ask: What is the difference between stability and faithfulness?
Answer: Stability asks: "Do similar inputs get similar explanations?" Faithfulness asks: "Does the explanation actually reflect what the model cares about?" You need both for a trustworthy explanation.
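
A minimal sketch of the masking test: replace the k most important features with a baseline value and measure how far the model's confidence falls. The logistic model, the zero baseline, and the importance scores (weight times input) are assumptions for illustration; the project uses SHAP rankings on the trained networks.

```python
import math


def confidence(weights, bias, x):
    """Anomaly probability of a toy logistic model."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def confidence_drop(weights, bias, x, importance, k, baseline=0.0):
    """Replace the k most important features with a baseline value and report the drop."""
    top_k = sorted(range(len(x)), key=lambda i: -abs(importance[i]))[:k]
    masked = [baseline if i in top_k else xi for i, xi in enumerate(x)]
    return confidence(weights, bias, x) - confidence(weights, bias, masked)


weights = [2.0, 1.5, 1.0, 0.5, 0.25, 0.1]   # hypothetical model weights
bias = -1.0
x = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
importance = [w * xi for w, xi in zip(weights, x)]
drops = {k: confidence_drop(weights, bias, x, importance, k) for k in (1, 3, 5)}
```

A faithful ranking produces drops that grow with k, as they do here. Comparing against the drop from masking k randomly chosen features is a natural control.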

7. Security Implications

7.1 The Dual-Edged Sword

Good: Explanations help analysts verify alerts and prioritize investigations.
Bad: If attackers see explanations, they learn which features to manipulate.

7.2 Feature Manipulability

Category	Manipulable?	Examples
Packet content	✅ Yes	`src_bytes`, `dst_bytes`, `hot`
Connection behavior	⚠️ Partially	`duration`, `count`, `srv_count`
Protocol fields	⚠️ Constrained	`protocol_type`, `flag`
Network statistics	❌ No	`dst_host_count`, `dst_host_same_srv_rate`
Error rates	⚠️ Partially	`serror_rate`, `rerror_rate`

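Good news: Two of our top five SHAP features (dst_host_rerror_rate, dst_host_serror_rate) are host-based statistics that the sensor aggregates over many connections, so a single attacker can shift them only indirectly. That makes evasion harder than if the model relied only on attacker-controlled payload fields.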

7.3 Attack Scenarios

Evasion via explanation leakage: Attacker queries the explanation API, sees that serror_rate and count drive detection, then crafts traffic to spoof those features.
LIME inconsistency exploitation: LIME gives different rankings on rerun. Analysts waste time chasing inconsistent explanations.
Backdoor with clean explanations: A poisoned model misclassifies triggered inputs but shows plausible benign SHAP values.

7.4 Mitigations

Restrict explanation access to trusted analysts
Rate-limit explanation APIs
Log all explanation queries
Aggregate explanations instead of exposing raw per-sample values
Never replace rule-based IDS with ML explanations alone
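
The first four mitigations can be combined in a thin wrapper around whatever function produces explanations. Everything here is hypothetical (the class name, its parameters, and the sliding-window policy); it is a sketch of the idea rather than a hardened access-control layer.

```python
import time
from collections import defaultdict, deque


class ExplanationGateway:
    """Hypothetical wrapper applying access control, rate limiting, and audit logging."""

    def __init__(self, explain_fn, trusted_users, max_queries, window_seconds,
                 clock=time.monotonic):
        self.explain_fn = explain_fn
        self.trusted_users = set(trusted_users)
        self.max_queries = max_queries
        self.window_seconds = window_seconds
        self.clock = clock
        self.history = defaultdict(deque)   # user -> timestamps of recent queries
        self.audit_log = []                 # (timestamp, user, outcome)

    def explain(self, user, sample):
        now = self.clock()
        if user not in self.trusted_users:
            self.audit_log.append((now, user, "denied"))
            raise PermissionError(f"{user} may not request explanations")
        recent = self.history[user]
        # Drop timestamps that have left the sliding window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_queries:
            self.audit_log.append((now, user, "rate_limited"))
            raise RuntimeError("explanation rate limit exceeded")
        recent.append(now)
        self.audit_log.append((now, user, "served"))
        return self.explain_fn(sample)


def aggregate(explanations):
    """Mean absolute attribution per feature across samples (no per-sample detail)."""
    n = len(explanations)
    return [sum(abs(e[i]) for e in explanations) / n for i in range(len(explanations[0]))]
```

Every request, including refused ones, lands in `audit_log`, so unusual query patterns can be reviewed later. `aggregate` is what a dashboard would expose in place of raw per-sample values.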

8. Limitations (Say These Confidently)

Dataset age: NSL-KDD is a 2009 benchmark derived from the KDD Cup 1999 data. Modern traffic (TLS 1.3, encrypted payloads, IoT protocols) looks very different.
LabelEncoder trade-off: Preserves interpretability but imposes artificial ordering on categories.
Computational cost: Kernel SHAP is expensive; we used sampled subsets.
LIME stochasticity: Results vary across random seeds.
Scope: We evaluated explanation quality, not adversarial robustness of the classifier itself. That is a separate (harder) problem.

Teacher might ask: What would you improve with more time?
Answer: Test on modern datasets (CIC-IDS2017, UNSW-NB15), use embeddings or target encoding for categorical features, evaluate multiclass attack-type detection, and run adversarial evasion experiments using the top SHAP features.

9. Likely Teacher Questions & Model Answers

Q: What is your main contribution?

A: We didn't just build an IDS. We built an IDS + explainability pipeline + stability evaluation + security risk analysis. The contribution is showing that explainability in security requires trust evaluation, not just visualization.

Q: Why use deep learning if you need explainability?

A: We prioritized detection performance, and deep models can capture non-linear feature interactions that simpler models may miss. Post-hoc explainability (SHAP/LIME) lets us keep that capacity while adding interpretability. Inherently interpretable models (decision trees, linear models) are easier to audit but often give up some detection performance.

Q: Why is PR-AUC more important than accuracy?

A: The class balance differs between splits (47% anomaly in training, 66% in test), so accuracy at a single threshold depends heavily on the class mix and can hide poor performance on one class. PR-AUC summarizes precision and recall for the anomaly class across all thresholds, which is what matters when false negatives (missed attacks) are costly.
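
If asked how PR-AUC is obtained, one common estimator is average precision: walk down the samples sorted by anomaly score and average the precision at each rank where a true anomaly appears. The guide does not say which estimator the project used, so this is an assumption, and the labels and scores are toy values.

```python
def average_precision(y_true, scores):
    """Mean of the precision values at each rank where a positive sample is retrieved."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Toy example (1 = anomaly); not project data.
y_true = [1, 0, 1, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.1]
ap = average_precision(y_true, scores)  # (1/1 + 2/3 + 3/4) / 3
```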

Q: What is the practical takeaway for a SOC analyst?

A: The model can flag anomalies and show which features drove the decision (e.g., error rates, login status). The analyst uses this as supporting evidence, not as sole proof. Explanations are shown internally only, with access control and logging.

Q: Why binary classification instead of 5-class (DoS, Probe, R2L, U2R)?

A: Binary normal/anomaly is the core IDS problem and keeps the explainability evaluation clean. Multiclass is a natural next step — U2R has only ~52 samples in training, which is extremely challenging.

Q: What does it mean that SHAP and LIME disagree?

A: It means there is no single "true" explanation for a black-box model. Different methods make different assumptions. This is why we evaluate stability and faithfulness — to filter out unreliable explanations regardless of the method.

Q: How do you prevent attackers from using explanations against you?

A: Access control, rate limiting, logging, and aggregation. Our analysis also showed that the model relies partly on sensor-side statistics that an attacker cannot manipulate directly, which makes evasion harder than if it relied only on attacker-controlled fields.

10. Key Numbers Cheat Sheet

Memorize these for instant credibility:

Fact	Number
Train records	151,165
Test records	34,394
Features	41
Best model	LSTM
Best weighted F1	0.7800
Best ROC-AUC	0.9434
Best PR-AUC	0.9222
SHAP-LIME Spearman	0.0714
SHAP PCC at ε=0.01	0.6293 (stable)
SHAP PCC at ε=0.05	0.5676 (unstable)
LIME stochastic stability	0.6054 (borderline)
Top-10 masking confidence drop	0.4938
Random seed	42

11. Glossary of Terms

Term	Definition
IDS	Intrusion Detection System — monitors network traffic for malicious activity
X-IDS	Explainable Intrusion Detection System
NSL-KDD	Standard benchmark dataset for intrusion detection
MLP	Multi-Layer Perceptron — fully connected neural network
LSTM	Long Short-Term Memory — recurrent network with memory gates
1D-CNN	One-dimensional convolutional network
SHAP	Feature attribution based on Shapley values from game theory
LIME	Local surrogate model for explaining individual predictions
ROC-AUC	Threshold-independent ranking quality metric
PR-AUC	Precision-recall area — informative for imbalanced data
Weighted F1	F1-score averaged by class support
PCC	Pearson correlation — measures explanation similarity under perturbation
Spearman	Rank correlation — compares feature ranking between methods
SENS_MAX	Maximum explanation shift under bounded perturbation
Faithfulness	Whether highlighted features actually affect model predictions
Evasion	Attacker modifying traffic to avoid detection
Explanation leakage	Attacker learning model behavior from exposed explanations

12. Final One-Liner to Close

"Explainability makes IDS useful, but only stability, faithfulness, and security analysis make it trustworthy."

Good luck on the presentation! 🎓