cathrica committed on
Commit 498baf6 · verified · 1 Parent(s): 1136020

Add architecture & design choices document

Files changed (1)
  1. docs/architecture.md +109 -0
docs/architecture.md ADDED
@@ -0,0 +1,109 @@
+ # Model Architecture & Design Choices
+
+ ## 1. Design Philosophy
+
+ The project requires **lightweight models** (max 2h training per experiment). We prioritize:
+ - Simplicity over complexity: fewer layers, interpretable capacity
+ - Reproducibility: fixed seeds, deterministic operations
+ - Fair comparison: same preprocessing, same train/test split across all models
+
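As a minimal sketch of the reproducibility point, the helper below seeds whichever random number generators are present. The name `set_seed` is hypothetical, and the NumPy and PyTorch branches are assumptions about the training stack; they are skipped when those libraries are not installed. The seed value 42 is the one listed in the training configuration.

```python
import random


def set_seed(seed: int = 42) -> None:
    """Seed every random number generator available in this environment."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)  # seeds the CPU and CUDA generators
        # Ask cuDNN for deterministic kernels and disable autotuning.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


# Re-seeding reproduces the same random sequence.
set_seed(42)
first = [random.random() for _ in range(3)]
set_seed(42)
second = [random.random() for _ in range(3)]
print(first == second)  # True
```

Calling the helper once at the start of every experiment, before the data split and model initialization, keeps runs comparable across the three architectures.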
+ ## 2. Preprocessing Pipeline
+
+ ```
+ Raw NSL-KDD (41 features + label)
+ │
+ ├─ Categorical encoding: LabelEncoder for protocol_type (3), service (70), flag (11)
+ ├─ Feature scaling: MinMaxScaler [0, 1] on all 41 features
+ ├─ Label encoding: Binary (anomaly=0, normal=1)
+ │
+ └─ Output: X_train (151165, 41), X_test (34394, 41), float32
+ ```
+
+ **Why MinMaxScaler?** Network features have vastly different ranges (src_bytes: 0-1.3B vs. serror_rate: 0-1). Scaling to [0, 1] prevents large-valued features from dominating gradient updates and makes perturbation-based explainability (ε-bounded noise) meaningful.
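The scaling step can be illustrated in plain Python. This is a sketch of the min-max formula only, not the project's preprocessing code: the helper names and the sample values are hypothetical, chosen to mirror the src_bytes and serror_rate ranges quoted above.

```python
def fit_min_max(column):
    """Return the (min, max) of a training column."""
    return min(column), max(column)


def transform_min_max(column, lo, hi):
    """Scale values with x' = (x - lo) / (hi - lo), using training statistics."""
    span = hi - lo
    if span == 0:
        return [0.0 for _ in column]  # constant feature: map everything to 0
    return [(x - lo) / span for x in column]


# Hypothetical training values for two features with very different ranges.
src_bytes_train = [0, 491, 1_300_000_000]
serror_rate_train = [0.0, 0.5, 1.0]

lo, hi = fit_min_max(src_bytes_train)
src_bytes_scaled = transform_min_max(src_bytes_train, lo, hi)
serror_scaled = transform_min_max(serror_rate_train, *fit_min_max(serror_rate_train))

# A test value larger than anything seen in training lands outside [0, 1].
src_bytes_test_scaled = transform_min_max([2_600_000_000], lo, hi)
print(src_bytes_test_scaled)  # [2.0]
```

The minimum and maximum are fitted on the training split and reused for the test split, so test values beyond the training range fall outside [0, 1] unless they are clipped. scikit-learn's `MinMaxScaler` behaves the same way with its default `clip=False`.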
+
+ **Why LabelEncoder (not OneHot)?** One-hot encoding would expand the 3 categorical features into 84 columns (3 + 70 + 11), giving 122 input columns in total. This makes SHAP/LIME explanations harder to interpret (122 mostly binary columns vs. 41 semantic features). LabelEncoder preserves the original 41-feature space for cleaner explanations.
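The column arithmetic behind this choice can be checked with a short sketch. The encoder below mimics the sorted-category mapping that scikit-learn's `LabelEncoder` produces; `fit_label_encoder` and the sample values are hypothetical, while the cardinalities come from the pipeline diagram.

```python
def fit_label_encoder(values):
    """Map each distinct category to an integer, in sorted order."""
    return {category: index for index, category in enumerate(sorted(set(values)))}


protocol_types = ["tcp", "udp", "icmp", "tcp"]
mapping = fit_label_encoder(protocol_types)
encoded = [mapping[value] for value in protocol_types]
print(mapping)  # {'icmp': 0, 'tcp': 1, 'udp': 2}
print(encoded)  # [1, 2, 0, 1]

# Column counts under each encoding, using the cardinalities listed above.
cardinalities = {"protocol_type": 3, "service": 70, "flag": 11}
n_features = 41
one_hot_columns = sum(cardinalities.values())
total_with_one_hot = n_features - len(cardinalities) + one_hot_columns
print(one_hot_columns, total_with_one_hot)  # 84 122
```

One trade-off to keep in mind when reading attributions: the integer codes imply an ordering (icmp < tcp < udp) that the categories do not actually have.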
+
+ ## 3. Model Architectures
+
+ ### 3.1 MLP (Primary Baseline)
+
+ ```
+ Input (41) → Linear(256) → BatchNorm → ReLU → Dropout(0.3)
+            → Linear(128) → BatchNorm → ReLU → Dropout(0.2)
+            → Linear(64) → ReLU
+            → Linear(num_classes)
+ ```
+
+ **Parameters**: ~53K (52,802 for num_classes = 2)
+ **Justification**:
+ - 3 hidden layers with decreasing width is standard for tabular classification
+ - BatchNorm stabilizes training and enables higher learning rates
+ - Dropout (0.3 → 0.2) regularizes, with the higher rate applied after the widest layer
+ - No final activation: CrossEntropyLoss applies LogSoftmax internally, so the model outputs raw logits
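The parameter count for this stack can be worked out by hand. The sketch below assumes biases on every Linear layer, learnable scale and shift in each BatchNorm, and binary classification (`num_classes=2`); the helper names are hypothetical.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def batchnorm_params(n_features):
    """Learnable scale and shift; running statistics are not trained."""
    return 2 * n_features


def mlp_param_count(n_features=41, num_classes=2):
    total = linear_params(n_features, 256) + batchnorm_params(256)
    total += linear_params(256, 128) + batchnorm_params(128)
    total += linear_params(128, 64)
    total += linear_params(64, num_classes)
    return total


print(mlp_param_count())  # 52802
```

Most of the capacity sits in the 256 → 128 layer (32,896 parameters), while the output layer contributes only 130.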
+
+ ### 3.2 LSTM (Temporal Variant)
+
+ ```
+ Input (41) → reshape to (41, 1) → LSTM(hidden=64, layers=2, dropout=0.2)
+            → take last hidden state → Linear(num_classes)
+ ```
+
+ **Parameters**: ~51K (50,562 for num_classes = 2, assuming PyTorch's `nn.LSTM` parameterization with input size 1)
+ **Justification**:
+ - Treats the 41 features as a length-41 sequence of scalars, so the recurrent state can capture inter-feature dependencies
+ - 2 layers with 64 hidden units is minimal while allowing feature interaction
+ - LSTM processes features in order: basic→content→time-based→host-based
+ - This ordering has semantic meaning in NSL-KDD (groups of related features)
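A hand count of the LSTM parameters, as a sketch under stated assumptions: each layer has four gates, each gate has input weights, recurrent weights, and two bias vectors (the layout PyTorch's `nn.LSTM` uses), the input size is 1 after the reshape, and `num_classes=2`. The function names are hypothetical.

```python
def lstm_layer_params(input_size, hidden_size):
    """Four gates, each with input weights, recurrent weights, and two biases."""
    per_gate = input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size
    return 4 * per_gate


def lstm_model_params(input_size=1, hidden_size=64, num_layers=2, num_classes=2):
    # The first layer reads the raw scalar input; later layers read hidden states.
    total = lstm_layer_params(input_size, hidden_size)
    total += (num_layers - 1) * lstm_layer_params(hidden_size, hidden_size)
    # Final Linear layer applied to the last hidden state.
    total += hidden_size * num_classes + num_classes
    return total


print(lstm_layer_params(1, 64))   # 17152
print(lstm_layer_params(64, 64))  # 33280
print(lstm_model_params())        # 50562
```

The second layer holds roughly two thirds of the parameters because its input is the 64-dimensional hidden state rather than a single scalar.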
+
+ ### 3.3 1D-CNN (Spatial Variant)
+
+ ```
+ Input (41) → reshape to (1, 41) → Conv1d(64, k=3, pad=1) → ReLU
+            → Conv1d(128, k=3, pad=1) → ReLU → AdaptiveAvgPool1d(8)
+            → Flatten → Linear(64) → ReLU → Linear(num_classes)
+ ```
+
+ **Parameters**: ~91K (90,690 for num_classes = 2; the Linear(64) layer after flattening 128 × 8 = 1024 values accounts for about 66K)
+ **Justification**:
+ - 1D convolutions learn local feature patterns (neighboring features)
+ - Kernel size 3 captures triplets of adjacent features
+ - AdaptiveAvgPool compresses to a fixed size regardless of input length
+ - Useful for detecting patterns in rate-based features, which form a contiguous block in the feature order
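The shapes and parameter count of this stack can be traced step by step. The sketch assumes stride 1, biases on every layer, and `num_classes=2`; the helper names are hypothetical.

```python
def conv1d_out_len(length, kernel, padding, stride=1):
    """Output length of a 1D convolution."""
    return (length + 2 * padding - kernel) // stride + 1


def conv1d_params(c_in, c_out, kernel):
    """Kernel weights plus one bias per output channel."""
    return c_in * c_out * kernel + c_out


length = 41
length = conv1d_out_len(length, kernel=3, padding=1)  # padding 1 keeps length 41
length = conv1d_out_len(length, kernel=3, padding=1)  # still 41
pooled_length = 8                                     # AdaptiveAvgPool1d(8)
flat_features = 128 * pooled_length                   # 1024 values after Flatten

total = conv1d_params(1, 64, 3)        # 256
total += conv1d_params(64, 128, 3)     # 24704
total += flat_features * 64 + 64       # 65600, first Linear layer
total += 64 * 2 + 2                    # 130, output layer
print(length, flat_features, total)    # 41 1024 90690
```

With kernel size 3 and padding 1 the sequence length is preserved, so the pooling layer alone determines the size of the flattened vector.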
+
+ ## 4. Training Configuration
+
+ | Parameter | Value | Justification |
+ |-----------|-------|---------------|
+ | Optimizer | Adam | Standard for neural networks; adaptive learning rate per parameter |
+ | Learning rate | 1e-3 | Default Adam learning rate; works well for tabular tasks |
+ | Weight decay | 1e-4 | Light L2 regularization to limit overfitting |
+ | Batch size | 256 | Good balance of speed and gradient stability |
+ | Epochs | 50 | Sufficient for convergence on NSL-KDD (~151K samples) |
+ | Loss | CrossEntropyLoss | Standard for classification; accepts per-class weights for imbalance |
+ | Class weights | Inverse frequency | Compensates for the mild imbalance between normal (53%) and anomaly (47%) in the training set |
+ | Seed | 42 | Fixed for reproducibility |
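The table does not say how the inverse-frequency weights are normalized. The sketch below assumes the common "balanced" form, weight_c = N / (K · n_c), where N is the number of samples, K the number of classes, and n_c the count for class c. The function name and the per-100-sample counts are hypothetical, derived from the 53% / 47% split above.

```python
def inverse_frequency_weights(counts):
    """Return weight_c = N / (K * n_c) for each class c."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}


# Label convention from the preprocessing pipeline: anomaly=0, normal=1.
counts = {0: 47, 1: 53}
weights = inverse_frequency_weights(counts)
print({label: round(w, 4) for label, w in weights.items()})  # {0: 1.0638, 1: 0.9434}
```

With a split this close to even, both weights stay near 1, so the weighting changes the loss only slightly. The resulting values would be passed to the loss as a per-class weight vector ordered by label index.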
+
+ ## 5. Why These Models for Explainability
+
+ | Model | SHAP Method | Speed | Explanation Quality |
+ |-------|-------------|-------|---------------------|
+ | MLP | KernelExplainer | Medium | Clean, model-agnostic attributions |
+ | LSTM | KernelExplainer | Medium | Sequential attributions may differ |
+ | 1D-CNN | KernelExplainer | Medium | Convolutional attributions capture local patterns |
+
+ All three use **KernelExplainer** (model-agnostic SHAP), enabling:
+ - Direct comparison of feature attributions across architectures
+ - Analysis of whether model architecture affects explanation stability
+ - Consistent methodology across all models
+
+ ## 6. Expected Baseline Performance
+
+ Based on published NSL-KDD benchmarks (Tavallaee et al., Revathi & Malathi 2013):
+
+ | Model | Binary Accuracy | Binary Weighted F1 |
+ |-------|-----------------|--------------------|
+ | MLP | 78-85% | 78-83% |
+ | LSTM | 76-82% | 75-80% |
+ | 1D-CNN | 77-83% | 76-81% |
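For reference, the weighted F1 in the table's second metric column is the support-weighted mean of the per-class F1 scores. The sketch below is a plain-Python illustration of that definition, not the project's evaluation code; the function names and toy labels are hypothetical.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 = 2*TP / (2*TP + FP + FN) for one class treated as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_f1(y_true, y_pred):
    """Average per-class F1, weighted by each class's share of the true labels."""
    n = len(y_true)
    return sum(
        y_true.count(cls) / n * f1_for_class(y_true, y_pred, cls)
        for cls in sorted(set(y_true))
    )


# Toy labels: 6 anomalies (0) and 4 normal (1); three anomalies are missed.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                             # 0.7
print(round(weighted_f1(y_true, y_pred), 4))  # 0.6909
```

Because the weights follow the class proportions of the evaluated split, the same model can score differently on splits with different class balance.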
+
+ **Known challenge**: The test set contains a larger share of anomalies (65%) than the training set (47%), so this distribution shift tests generalization.