	# Model Architecture & Design Choices

	## 1. Design Philosophy

	The project requires lightweight models (max 2h training per experiment). We prioritize:
	- Simplicity over complexity — fewer layers, interpretable capacity
	- Reproducibility — fixed seeds, deterministic operations
	- Fair comparison — same preprocessing, same train/test split across all models
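
The reproducibility point can be sketched as a small seeding helper. This is only an illustration: the function name `set_seed` is hypothetical, and NumPy and PyTorch are treated as optional so the snippet also runs with the standard library alone.

```python
import os
import random


def set_seed(seed: int = 42) -> None:
    """Seed the RNGs that are available; NumPy and PyTorch are optional."""
    # Only affects hash randomization in Python processes started after this point.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)  # seeds the CPU and CUDA generators
        # Ask cuDNN for deterministic kernels and disable autotuning.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass


set_seed(42)
first = random.random()
set_seed(42)
second = random.random()  # same value as `first` because the seed was reset
```

Calling the helper once at the start of every experiment script is one way to keep runs comparable across the three models.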

	## 2. Preprocessing Pipeline

	```
	Raw NSL-KDD (41 features + label)
	│
	├─ Categorical encoding: LabelEncoder for protocol_type (3), service (70), flag (11)
	├─ Feature scaling: MinMaxScaler [0, 1] on all 41 features
	├─ Label encoding: Binary (anomaly=0, normal=1)
	│
	└─ Output: X_train (151165, 41), X_test (34394, 41), float32
	```

	Why MinMaxScaler? Network features have vastly different ranges (src_bytes: 0-1.3B vs. serror_rate: 0-1). Scaling to [0,1] prevents large-valued features from dominating gradient updates and makes perturbation-based explainability (ε-bounded noise) meaningful.
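
To make the scaling argument concrete, here is a standard-library sketch of what min-max scaling does to two columns with very different ranges. The helper names and the sample values are illustrative assumptions, not taken from the project code; in practice `sklearn.preprocessing.MinMaxScaler` would do the same job.

```python
def min_max_fit(column):
    """Return the (min, max) pair learned from the training column."""
    return min(column), max(column)


def min_max_transform(column, lo, hi):
    """Map values to [0, 1] using statistics learned on the training set."""
    span = hi - lo
    if span == 0:
        # Constant feature: every value maps to 0.0.
        return [0.0 for _ in column]
    return [(value - lo) / span for value in column]


# Illustrative values only: a byte counter and a rate feature.
src_bytes = [0, 491, 1_300_000_000]
serror_rate = [0.0, 0.5, 1.0]

lo, hi = min_max_fit(src_bytes)
src_bytes_scaled = min_max_transform(src_bytes, lo, hi)
serror_rate_scaled = min_max_transform(serror_rate, *min_max_fit(serror_rate))
```

After scaling, both columns live in [0, 1], so an ε-bounded perturbation has the same relative size on each feature. Test-set values are transformed with the training-set minimum and maximum, so they can fall slightly outside [0, 1].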

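	Why LabelEncoder (not OneHot)? OneHot would expand the 3 categorical features into 84 binary columns (3 + 70 + 11), giving 122 input columns in total. This makes SHAP/LIME explanations harder to interpret (84 binary indicators in place of 3 semantic features). LabelEncoder preserves the original 41-feature space for cleaner explanations, at the cost of imposing an arbitrary integer order on the categories.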
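
The column arithmetic behind this choice can be checked with a few lines of standard-library Python. The `label_encode` helper is a hypothetical stand-in that mimics the sorted-class behaviour of scikit-learn's `LabelEncoder`; the cardinalities are the ones listed in the pipeline above.

```python
# Cardinalities of the three categorical features, as listed in the pipeline.
CARDINALITY = {"protocol_type": 3, "service": 70, "flag": 11}
N_FEATURES = 41


def label_encode(values):
    """Map each category to its index in the sorted list of unique categories."""
    classes = sorted(set(values))
    index = {category: i for i, category in enumerate(classes)}
    return [index[v] for v in values], classes


# One-hot: each categorical feature becomes one column per category.
one_hot_columns = sum(CARDINALITY.values())
one_hot_total = N_FEATURES - len(CARDINALITY) + one_hot_columns

# Label encoding: the feature count stays at 41.
codes, classes = label_encode(["tcp", "udp", "icmp", "tcp"])
```

Here `one_hot_columns` is 84 and `one_hot_total` is 122, while label encoding keeps one integer column per categorical feature.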

	## 3. Model Architectures

	### 3.1 MLP (Primary Baseline)

	```
	Input (41) → Linear(256) → BatchNorm → ReLU → Dropout(0.3)
	→ Linear(128) → BatchNorm → ReLU → Dropout(0.2)
	→ Linear(64) → ReLU
	→ Linear(num_classes)
	```

	Parameters: ~53K (52,802 with num_classes = 2)
	Justification:
	- 3 hidden layers with decreasing width is standard for tabular classification
	- BatchNorm stabilizes training, enables higher learning rates
	- Dropout (0.3→0.2) regularizes; it is heavier after the widest layer, whose activations feed the largest weight matrix (256×128)
	- No final activation — CrossEntropyLoss includes LogSoftmax
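
A minimal PyTorch sketch of the MLP as drawn above, assuming `num_classes = 2` for the binary task. The class name `MLP` and the helper `count_parameters` are hypothetical and may not match the project's code.

```python
import torch
from torch import nn


class MLP(nn.Module):
    """Three hidden layers of decreasing width, returning raw logits."""

    def __init__(self, in_features: int = 41, num_classes: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, num_classes),  # no softmax: CrossEntropyLoss expects logits
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


def count_parameters(model: nn.Module) -> int:
    """Count trainable parameters (BatchNorm running statistics are buffers)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


model = MLP()
logits = model(torch.randn(8, 41))  # batch of 8 samples, 41 features each
n_params = count_parameters(model)
```

With two output classes the count comes to 52,802 trainable parameters, most of them in the 256×128 layer.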

	### 3.2 LSTM (Temporal Variant)

	```
	Input (41) → reshape to (41, 1) → LSTM(hidden=64, layers=2, dropout=0.2)
	→ take last hidden state → Linear(num_classes)
	```

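	Parameters: ~51K for the layers as drawn (50,562 with num_classes = 2, counting PyTorch's two bias vectors per LSTM layer)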
	Justification:
	- Treats 41 features as a sequence — captures inter-feature dependencies
	- 2 layers with 64 hidden units is minimal while allowing feature interaction
	- LSTM processes features in order: basic→content→time-based→host-based
	- This ordering has semantic meaning in NSL-KDD (groups of related features)
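
A minimal PyTorch sketch of this variant, assuming `batch_first` tensors and `num_classes = 2`. The class name `LSTMClassifier` is hypothetical. Note that the `dropout` argument of `nn.LSTM` is applied between the stacked layers, not after the last one.

```python
import torch
from torch import nn


class LSTMClassifier(nn.Module):
    """Reads the 41 features as a length-41 sequence of scalars."""

    def __init__(self, hidden_size: int = 64, num_layers: int = 2, num_classes: int = 2):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=1,
            hidden_size=hidden_size,
            num_layers=num_layers,
            dropout=0.2,
            batch_first=True,
        )
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        seq = x.unsqueeze(-1)          # (batch, 41) -> (batch, 41, 1)
        _, (h_n, _) = self.lstm(seq)   # h_n: (num_layers, batch, hidden)
        return self.head(h_n[-1])      # final hidden state of the top layer


model = LSTMClassifier()
logits = model(torch.randn(8, 41))
n_params = sum(p.numel() for p in model.parameters())
```

Because the input size is 1, the first LSTM layer is small; most of the parameters sit in the second layer, which maps 64 inputs and 64 hidden units to four gates.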

	### 3.3 1D-CNN (Spatial Variant)

	```
	Input (41) → reshape to (1, 41) → Conv1d(64, k=3, pad=1) → ReLU
	→ Conv1d(128, k=3, pad=1) → ReLU → AdaptiveAvgPool1d(8)
	→ Flatten → Linear(64) → ReLU → Linear(num_classes)
	```

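	Parameters: ~91K for the layers as drawn (90,690 with num_classes = 2; the Linear(1024→64) layer after flattening accounts for about 66K)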
	Justification:
	- 1D convolutions learn local feature patterns (neighboring features)
	- Kernel size 3 captures triplets of features
	- AdaptiveAvgPool compresses to fixed size regardless of input length
	- Useful for detecting patterns in rate-based features (contiguous block)
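
A minimal PyTorch sketch of the 1D-CNN as drawn, again assuming `num_classes = 2`. The class name `CNN1D` is hypothetical.

```python
import torch
from torch import nn


class CNN1D(nn.Module):
    """Two Conv1d layers over the feature axis, pooled to a fixed length of 8."""

    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(8),  # output length is 8 for any input length
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 8, 64), nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.unsqueeze(1)  # (batch, 41) -> (batch, 1, 41): one input channel
        return self.classifier(self.features(x))


model = CNN1D()
logits = model(torch.randn(8, 41))
logits_short = model(torch.randn(4, 30))  # a different feature count also works
n_params = sum(p.numel() for p in model.parameters())
```

The second call illustrates the adaptive-pooling point: the classifier head always sees 128 × 8 = 1024 values, whatever the input length.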

	## 4. Training Configuration

	| Parameter | Value | Justification |
	|-----------|-------|---------------|
	| Optimizer | Adam | Standard for neural networks; adaptive lr per parameter |
	| Learning rate | 1e-3 | Default Adam lr; works well for tabular tasks |
	| Weight decay | 1e-4 | Light L2 regularization prevents overfitting |
	| Batch size | 256 | Good balance of speed and gradient stability |
	| Epochs | 50 | Sufficient for convergence on NSL-KDD (~151K samples) |
	| Loss | CrossEntropyLoss | Standard for multi-class; includes class weights for imbalance |
	| Class weights | Inverse frequency | Addresses class imbalance between normal (53%) and anomaly (47%) |
	| Seed | 42 | Fixed for reproducibility |
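
The table can be wired together roughly as follows. This is a sketch under assumptions: the inverse-frequency formula `n_samples / (n_classes * count)` is one common choice and may differ from the project's, the `nn.Linear` model is only a stand-in, and the data is random.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(42)


def inverse_frequency_weights(labels: torch.Tensor, num_classes: int) -> torch.Tensor:
    """Weight each class by n_samples / (n_classes * class_count).

    Assumes every class appears at least once in `labels`.
    """
    counts = torch.bincount(labels, minlength=num_classes).float()
    return counts.sum() / (num_classes * counts)


# Toy data mirroring the 47% anomaly (0) / 53% normal (1) training split.
labels = torch.tensor([0] * 47 + [1] * 53)
features = torch.rand(100, 41)
loader = DataLoader(TensorDataset(features, labels), batch_size=256, shuffle=True)

model = nn.Linear(41, 2)  # stand-in for any of the three architectures
weights = inverse_frequency_weights(labels, num_classes=2)
criterion = nn.CrossEntropyLoss(weight=weights)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

for batch_x, batch_y in loader:  # one pass; the table specifies 50 epochs
    optimizer.zero_grad()
    loss = criterion(model(batch_x), batch_y)
    loss.backward()
    optimizer.step()
```

With a 47/53 split the two weights are close to 1 (about 1.06 and 0.94), so the weighting has only a mild effect on the loss.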

	## 5. Why These Models for Explainability

	| Model | SHAP Method | Speed | Explanation Quality |
	|-------|-------------|-------|---------------------|
	| MLP | KernelExplainer | Medium | Clean, model-agnostic attributions |
	| LSTM | KernelExplainer | Medium | Sequential attributions may differ |
	| 1D-CNN | KernelExplainer | Medium | Convolutional attributions capture local patterns |

	All three use KernelExplainer (model-agnostic SHAP), enabling:
	- Direct comparison of feature attributions across architectures
	- Analysis of whether model architecture affects explanation stability
	- Consistent methodology across all models
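
Since KernelExplainer only needs a function from a NumPy array to model outputs, one wrapper can serve all three architectures. The sketch below assumes each model accepts a `(batch, 41)` float tensor and returns logits; `make_predict_fn` and `explain` are hypothetical helper names, and the `shap` package is imported only inside `explain`, so the wrapper itself has no dependency on it.

```python
import numpy as np
import torch
from torch import nn


def make_predict_fn(model: nn.Module):
    """Wrap a PyTorch classifier as a NumPy-in, NumPy-out probability function."""
    model.eval()  # disable dropout and use BatchNorm running statistics

    def predict(x: np.ndarray) -> np.ndarray:
        with torch.no_grad():
            logits = model(torch.as_tensor(x, dtype=torch.float32))
            return torch.softmax(logits, dim=1).numpy()

    return predict


def explain(model: nn.Module, background: np.ndarray, samples: np.ndarray, nsamples: int = 200):
    """Run model-agnostic SHAP; requires the third-party `shap` package."""
    import shap

    explainer = shap.KernelExplainer(make_predict_fn(model), background)
    return explainer.shap_values(samples, nsamples=nsamples)


# Stand-in model; any of the three architectures could be passed instead.
predict = make_predict_fn(nn.Sequential(nn.Linear(41, 2)))
probs = predict(np.random.rand(5, 41))
```

KernelExplainer's cost grows with the size of `background`, so a small summary of the training set (for example a few dozen rows) is the usual choice there.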

	## 6. Expected Baseline Performance

	Based on published NSL-KDD benchmarks (Tavallaee et al., Revathi & Malathi 2013):

	| Model | Binary Accuracy | Binary Weighted F1 |
	|-------|-----------------|--------------------|
	| MLP | 78-85% | 78-83% |
	| LSTM | 76-82% | 75-80% |
	| 1D-CNN | 77-83% | 76-81% |

	Known challenge: the test set contains a larger share of anomalies (65%) than the training set (47%), so this distribution shift tests how well the models generalize.
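
For reference, the weighted F1 reported above is the per-class F1 averaged with class-support weights, which matters when the test mix differs from the training mix. The standard-library sketch below uses made-up predictions on a toy test set with 65% anomalies; the counts are illustrative only and are not project results.

```python
from collections import Counter


def f1_for_class(y_true, y_pred, cls):
    """F1 of one class treated as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighted by each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(f1_for_class(y_true, y_pred, c) * n / total for c, n in support.items())


# Toy test set: 65 anomalies (0) and 35 normal samples (1).
y_true = [0] * 65 + [1] * 35
# Hypothetical predictions: 20 anomalies missed, 2 normal samples flagged.
y_pred = [0] * 45 + [1] * 20 + [1] * 33 + [0] * 2

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = weighted_f1(y_true, y_pred)
```

In this toy case the anomaly class carries 65% of the weight, so missed anomalies pull the weighted F1 down more than false alarms on normal traffic do.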