File size: 4,720 Bytes
498baf6
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
# Model Architecture & Design Choices

## 1. Design Philosophy

The project requires **lightweight models** (max 2h training per experiment). We prioritize:
- Simplicity over complexity β€” fewer layers, interpretable capacity
- Reproducibility β€” fixed seeds, deterministic operations
- Fair comparison β€” same preprocessing, same train/test split across all models

## 2. Preprocessing Pipeline

```
Raw NSL-KDD (41 features + label)
    β”‚
    β”œβ”€ Categorical encoding: LabelEncoder for protocol_type (3), service (70), flag (11)
    β”œβ”€ Feature scaling: MinMaxScaler [0, 1] on all 41 features
    β”œβ”€ Label encoding: Binary (anomaly=0, normal=1)
    β”‚
    └─ Output: X_train (151165, 41), X_test (34394, 41), float32
```

**Why MinMaxScaler?** Network features have vastly different ranges (src_bytes: 0-1.3B vs. serror_rate: 0-1). Scaling to [0,1] prevents large-valued features from dominating gradient updates and makes perturbation-based explainability (Ξ΅-bounded noise) meaningful.

**Why LabelEncoder (not OneHot)?** OneHot would expand 3 categorical features to 84 columns. This makes SHAP/LIME explanations harder to interpret (84 binary features vs 41 semantic features). LabelEncoder preserves the original feature space for cleaner explanations.

## 3. Model Architectures

### 3.1 MLP (Primary Baseline)

```
Input (41) β†’ Linear(256) β†’ BatchNorm β†’ ReLU β†’ Dropout(0.3)
          β†’ Linear(128) β†’ BatchNorm β†’ ReLU β†’ Dropout(0.2)
          β†’ Linear(64) β†’ ReLU
          β†’ Linear(num_classes)
```

**Parameters**: ~50K
**Justification**:
- 3 hidden layers with decreasing width is standard for tabular classification
- BatchNorm stabilizes training, enables higher learning rates
- Dropout (0.3β†’0.2) regularizes; heavier in early layers where more parameters
- No final activation β€” CrossEntropyLoss includes LogSoftmax

### 3.2 LSTM (Temporal Variant)

```
Input (41) β†’ reshape to (41, 1) β†’ LSTM(hidden=64, layers=2, dropout=0.2)
          β†’ take last hidden state β†’ Linear(num_classes)
```

**Parameters**: ~35K
**Justification**:
- Treats 41 features as a sequence β€” captures inter-feature dependencies
- 2 layers with 64 hidden units is minimal while allowing feature interaction
- LSTM processes features in order: basic→content→time-based→host-based
- This ordering has semantic meaning in NSL-KDD (groups of related features)

### 3.3 1D-CNN (Spatial Variant)

```
Input (41) β†’ reshape to (1, 41) β†’ Conv1d(64, k=3, pad=1) β†’ ReLU
          β†’ Conv1d(128, k=3, pad=1) β†’ ReLU β†’ AdaptiveAvgPool1d(8)
          β†’ Flatten β†’ Linear(64) β†’ ReLU β†’ Linear(num_classes)
```

**Parameters**: ~45K
**Justification**:
- 1D convolutions learn local feature patterns (neighboring features)
- Kernel size 3 captures triplets of features
- AdaptiveAvgPool compresses to fixed size regardless of input length
- Useful for detecting patterns in rate-based features (contiguous block)

## 4. Training Configuration

| Parameter | Value | Justification |
|-----------|-------|---------------|
| Optimizer | Adam | Standard for neural networks; adaptive lr per parameter |
| Learning rate | 1e-3 | Default Adam lr; works well for tabular tasks |
| Weight decay | 1e-4 | Light L2 regularization prevents overfitting |
| Batch size | 256 | Good balance of speed and gradient stability |
| Epochs | 50 | Sufficient for convergence on NSL-KDD (~151K samples) |
| Loss | CrossEntropyLoss | Standard for multi-class; includes class weights for imbalance |
| Class weights | Inverse frequency | Addresses class imbalance between normal (53%) and anomaly (47%) |
| Seed | 42 | Fixed for reproducibility |

## 5. Why These Models for Explainability

| Model | SHAP Method | Speed | Explanation Quality |
|-------|-------------|-------|-------------------|
| MLP | KernelExplainer | Medium | Clean, model-agnostic attributions |
| LSTM | KernelExplainer | Medium | Sequential attributions may differ |
| 1D-CNN | KernelExplainer | Medium | Convolutional attributions capture local patterns |

All three use **KernelExplainer** (model-agnostic SHAP), enabling:
- Direct comparison of feature attributions across architectures
- Analysis of whether model architecture affects explanation stability
- Consistent methodology across all models

## 6. Expected Baseline Performance

Based on published NSL-KDD benchmarks (Tavallaee et al., Revathi & Malathi 2013):

| Model | Binary Accuracy | Binary Weighted F1 |
|-------|----------------|---------------------|
| MLP | 78-85% | 78-83% |
| LSTM | 76-82% | 75-80% |
| 1D-CNN | 77-83% | 76-81% |

**Known challenge**: Test set has more anomaly (65%) than train (47%) β€” distribution shift tests generalization.