Add architecture & design choices document
- docs/architecture.md +109 -0
docs/architecture.md
ADDED
@@ -0,0 +1,109 @@
# Model Architecture & Design Choices

## 1. Design Philosophy

The project requires **lightweight models** (max 2h training per experiment). We prioritize:
- Simplicity over complexity: fewer layers, interpretable capacity
- Reproducibility: fixed seeds, deterministic operations
- Fair comparison: same preprocessing, same train/test split across all models

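A minimal sketch of what "fixed seeds, deterministic operations" could look like in Python. The helper name `seed_everything` is hypothetical, and NumPy and PyTorch are seeded only if they happen to be installed; the project's actual seeding code is not shown in this document.

```python
import random


def seed_everything(seed: int = 42) -> None:
    """Seed every random number generator available in this environment."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        # Ask cuDNN for deterministic kernels and disable autotuning.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


seed_everything(42)
first = random.random()
seed_everything(42)
second = random.random()
print(first == second)  # True: the same seed reproduces the same draw
```

Calling the helper once at the start of every experiment keeps weight initialization and batch shuffling comparable across the three models.
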
## 2. Preprocessing Pipeline

```
Raw NSL-KDD (41 features + label)
    │
    ├─ Categorical encoding: LabelEncoder for protocol_type (3), service (70), flag (11)
    ├─ Feature scaling: MinMaxScaler [0, 1] on all 41 features
    ├─ Label encoding: Binary (anomaly=0, normal=1)
    │
    └─ Output: X_train (151165, 41), X_test (34394, 41), float32
```

**Why MinMaxScaler?** Network features have vastly different ranges (src_bytes: 0-1.3B vs. serror_rate: 0-1). Scaling to [0,1] prevents large-valued features from dominating gradient updates and makes perturbation-based explainability (ε-bounded noise) meaningful.

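The scaling step can be sketched in plain Python. This is an illustration, not the project's code: it assumes the minimum and maximum are learned from the training split only, and the sample values are made up. The helper names `fit_min_max` and `apply_min_max` are hypothetical.

```python
def fit_min_max(column):
    """Learn (min, max) from a training column."""
    return min(column), max(column)


def apply_min_max(column, lo, hi):
    """Scale values with previously learned statistics; constant columns map to 0."""
    span = hi - lo
    if span == 0:
        return [0.0 for _ in column]
    return [(value - lo) / span for value in column]


# Illustrative src_bytes values (assumed, not taken from the dataset).
train_src_bytes = [0, 491, 1_300_000_000]
test_src_bytes = [146, 2_000_000_000]

lo, hi = fit_min_max(train_src_bytes)
scaled_train = apply_min_max(train_src_bytes, lo, hi)
scaled_test = apply_min_max(test_src_bytes, lo, hi)

print(scaled_train[0], scaled_train[-1])  # 0.0 1.0
print(scaled_test[-1] > 1.0)              # True: unseen extremes can leave [0, 1]
```

Test values larger than anything seen in training fall outside [0, 1] unless they are clipped; scikit-learn's `MinMaxScaler` leaves them unclipped by default, which matters when ε-bounded perturbations assume a unit range.
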
**Why LabelEncoder (not OneHot)?** OneHot would expand the 3 categorical features to 84 columns (3 + 70 + 11), giving 122 input columns in total. This makes SHAP/LIME explanations harder to interpret (122 mostly binary columns vs 41 semantic features). LabelEncoder preserves the original feature space for cleaner explanations.

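The column arithmetic and the encoding itself can be checked with a short sketch. The `label_encode` helper is a hypothetical stand-in that mimics `LabelEncoder` by assigning integers to the sorted distinct categories; the cardinalities are the ones quoted in the pipeline diagram.

```python
CARDINALITIES = {"protocol_type": 3, "service": 70, "flag": 11}
N_FEATURES = 41


def label_encode(values):
    """Map each distinct category to an integer, using sorted category order."""
    classes = sorted(set(values))
    index = {category: code for code, category in enumerate(classes)}
    return [index[value] for value in values], classes


codes, classes = label_encode(["tcp", "udp", "icmp", "tcp"])
print(classes)  # ['icmp', 'tcp', 'udp']
print(codes)    # [1, 2, 0, 1]

# One-hot replaces each categorical column with one column per category.
one_hot_columns = sum(CARDINALITIES.values())
one_hot_width = N_FEATURES - len(CARDINALITIES) + one_hot_columns
print(one_hot_columns, one_hot_width)  # 84 122
```
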
## 3. Model Architectures

### 3.1 MLP (Primary Baseline)

```
Input (41) → Linear(256) → BatchNorm → ReLU → Dropout(0.3)
           → Linear(128) → BatchNorm → ReLU → Dropout(0.2)
           → Linear(64) → ReLU
           → Linear(num_classes)
```

**Parameters**: ~50K
**Justification**:
- 3 hidden layers with decreasing width is standard for tabular classification
- BatchNorm stabilizes training and enables higher learning rates
- Dropout (0.3→0.2) regularizes; it is heavier on the wider early layer
- No final activation: CrossEntropyLoss applies LogSoftmax internally

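The parameter count can be derived by hand from the layer diagram. The sketch below assumes `num_classes=2` (the binary labels from the preprocessing step) and counts only trainable weights: BatchNorm contributes a scale and a shift per channel, and Dropout and ReLU contribute nothing. The helper names are hypothetical.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def batchnorm_params(n_channels):
    """Learnable scale and shift; running statistics are not trainable."""
    return 2 * n_channels


def mlp_params(n_in=41, num_classes=2):
    total = linear_params(n_in, 256) + batchnorm_params(256)
    total += linear_params(256, 128) + batchnorm_params(128)
    total += linear_params(128, 64)
    total += linear_params(64, num_classes)
    return total


print(mlp_params())  # 52802
```
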
### 3.2 LSTM (Temporal Variant)

```
Input (41) → reshape to (41, 1) → LSTM(hidden=64, layers=2, dropout=0.2)
           → take last hidden state → Linear(num_classes)
```

**Parameters**: ~51K (counted from the layers above with `num_classes=2`)
**Justification**:
- Treats the 41 features as a length-41 sequence, which lets the model capture inter-feature dependencies
- 2 layers with 64 hidden units keeps the model small while still allowing feature interaction
- LSTM processes features in order: basic→content→time-based→host-based
- This ordering has semantic meaning in NSL-KDD (groups of related features)

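The size of this model follows from the standard LSTM gate structure. The sketch assumes PyTorch's parameterization (four gates per layer, each with input weights, recurrent weights and two bias vectors), an input size of 1 per step, and `num_classes=2`; the function names are hypothetical. The result can be compared with the figure quoted above.

```python
def lstm_layer_params(n_in, hidden):
    """Four gates, each with input weights, recurrent weights and two biases."""
    return 4 * (hidden * n_in + hidden * hidden + 2 * hidden)


def lstm_model_params(hidden=64, layers=2, num_classes=2):
    total = lstm_layer_params(1, hidden)                      # first layer sees 1 value per step
    total += (layers - 1) * lstm_layer_params(hidden, hidden)  # deeper layers see the hidden state
    total += hidden * num_classes + num_classes               # linear head on the last hidden state
    return total


print(lstm_layer_params(1, 64))   # 17152
print(lstm_layer_params(64, 64))  # 33280
print(lstm_model_params())        # 50562
```
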
### 3.3 1D-CNN (Spatial Variant)

```
Input (41) → reshape to (1, 41) → Conv1d(64, k=3, pad=1) → ReLU
           → Conv1d(128, k=3, pad=1) → ReLU → AdaptiveAvgPool1d(8)
           → Flatten → Linear(64) → ReLU → Linear(num_classes)
```

**Parameters**: ~91K (counted from the layers above with `num_classes=2`; the Linear layer after Flatten accounts for most of them)
**Justification**:
- 1D convolutions learn local feature patterns (neighboring features)
- Kernel size 3 captures triplets of adjacent features
- AdaptiveAvgPool compresses to a fixed size regardless of input length
- Useful for detecting patterns in rate-based features (contiguous block)

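Tracing the tensor shapes through the diagram shows where the parameters sit. This is a sketch under stated assumptions: stride 1 for both convolutions, `num_classes=2`, and hypothetical helper names. The result can be compared with the figure quoted above.

```python
def conv1d_out_len(length, kernel, pad, stride=1):
    """Output length of a 1D convolution."""
    return (length + 2 * pad - kernel) // stride + 1


def conv1d_params(c_in, c_out, kernel):
    """One kernel per (input channel, output channel) pair, plus one bias per output channel."""
    return c_out * c_in * kernel + c_out


def cnn_summary(n_features=41, num_classes=2, pooled=8):
    length = conv1d_out_len(n_features, kernel=3, pad=1)  # padding keeps the length at 41
    length = conv1d_out_len(length, kernel=3, pad=1)
    flat = 128 * pooled                                    # channels x pooled length after Flatten
    params = conv1d_params(1, 64, 3) + conv1d_params(64, 128, 3)
    params += flat * 64 + 64                               # Linear(64) on the flattened vector
    params += 64 * num_classes + num_classes               # classification head
    return length, flat, params


print(cnn_summary())  # (41, 1024, 90690)
```
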
## 4. Training Configuration

| Parameter | Value | Justification |
|-----------|-------|---------------|
| Optimizer | Adam | Standard for neural networks; adaptive lr per parameter |
| Learning rate | 1e-3 | Default Adam lr; works well for tabular tasks |
| Weight decay | 1e-4 | Light L2 regularization prevents overfitting |
| Batch size | 256 | Good balance of speed and gradient stability |
| Epochs | 50 | Sufficient for convergence on NSL-KDD (~151K samples) |
| Loss | CrossEntropyLoss | Standard for multi-class; includes class weights for imbalance |
| Class weights | Inverse frequency | Addresses class imbalance between normal (53%) and anomaly (47%) |
| Seed | 42 | Fixed for reproducibility |

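One common way to compute inverse-frequency class weights is `n_samples / (n_classes * count)`. The sketch below assumes that formula (the document does not say which variant is used) and a toy label list that mirrors the 53% / 47% split; `inverse_frequency_weights` is a hypothetical helper.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * counts[cls]) for cls in sorted(counts)}


# Toy labels: 53 normal (1) and 47 anomaly (0) per 100 samples.
labels = [1] * 53 + [0] * 47
weights = inverse_frequency_weights(labels)

print(round(weights[0], 4), round(weights[1], 4))  # 1.0638 0.9434
```

With a split this close to balanced the two weights differ by only about 12%, so their effect on the loss is small.
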
## 5. Why These Models for Explainability

| Model | SHAP Method | Speed | Explanation Quality |
|-------|-------------|-------|---------------------|
| MLP | KernelExplainer | Medium | Clean, model-agnostic attributions |
| LSTM | KernelExplainer | Medium | Sequential attributions may differ |
| 1D-CNN | KernelExplainer | Medium | Convolutional attributions capture local patterns |

All three use **KernelExplainer** (model-agnostic SHAP), enabling:
- Direct comparison of feature attributions across architectures
- Analysis of whether model architecture affects explanation stability
- Consistent methodology across all models

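One simple way to compare attributions across architectures is the overlap of their top-k features. The sketch below is only an illustration of that idea: the metric choice (Jaccard overlap), the helper names, and the attribution numbers are all assumptions, not results from this project.

```python
def top_k_features(attributions, k):
    """Return the k feature names with the largest absolute attribution."""
    ranked = sorted(attributions, key=lambda name: abs(attributions[name]), reverse=True)
    return set(ranked[:k])


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Made-up mean |SHAP| values for two models (illustrative only).
mlp_attr = {"src_bytes": 0.30, "dst_bytes": 0.22, "flag": 0.15, "service": 0.05}
cnn_attr = {"src_bytes": 0.28, "flag": 0.20, "service": 0.12, "dst_bytes": 0.02}

overlap = jaccard(top_k_features(mlp_attr, 3), top_k_features(cnn_attr, 3))
print(overlap)  # 0.5
```

Because every model is explained with the same method over the same 41 features, the attribution vectors share an index space and can be compared directly like this.
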
## 6. Expected Baseline Performance

Based on published NSL-KDD benchmarks (Tavallaee et al. 2009, Revathi & Malathi 2013):

| Model | Binary Accuracy | Binary Weighted F1 |
|-------|-----------------|--------------------|
| MLP | 78-85% | 78-83% |
| LSTM | 76-82% | 75-80% |
| 1D-CNN | 77-83% | 76-81% |

**Known challenge**: The test set has a larger anomaly share (65%) than the training set (47%); this distribution shift tests generalization.
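To put the accuracy and weighted F1 columns in context, the sketch below computes both for a trivial classifier that labels every test sample as anomaly. It assumes the usual support-weighted average of per-class F1, uses counts per 100 samples that mirror the 65% / 35% test split, and treats `normal` (label 1) as the positive class; the helper names are hypothetical.

```python
def f1(tp, fp, fn):
    """F1 score for one class from its true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_f1(tn, fp, fn, tp):
    """Support-weighted mean of the per-class F1 scores of a binary confusion matrix."""
    support_pos = tp + fn
    support_neg = tn + fp
    f1_pos = f1(tp, fp, fn)
    f1_neg = f1(tn, fn, fp)  # roles swap when anomaly is treated as the class of interest
    return (support_pos * f1_pos + support_neg * f1_neg) / (support_pos + support_neg)


# Always predicting "anomaly": 65 anomalies correct, 35 normals missed.
tn, fp, fn, tp = 65, 0, 35, 0
accuracy = (tn + tp) / (tn + fp + fn + tp)

print(accuracy)                                # 0.65
print(round(weighted_f1(tn, fp, fn, tp), 4))   # 0.5121
```

The trivial classifier reaches 65% accuracy but only about 51% weighted F1, so the two columns should be read together when judging a model against the ranges above.
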