
Model Architecture & Design Choices

1. Design Philosophy

The project requires lightweight models (max 2h training per experiment). We prioritize:

  • Simplicity over complexity: fewer layers, interpretable capacity
  • Reproducibility: fixed seeds, deterministic operations
  • Fair comparison: same preprocessing, same train/test split across all models
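
The reproducibility point can be sketched as a single seeding helper. This is a minimal sketch that assumes the models are trained with PyTorch; the name `set_seed` is hypothetical and the exact set of flags the project uses is not stated in this document.

```python
import os
import random

import numpy as np
import torch


def set_seed(seed: int = 42) -> None:
    """Seed every random number generator the training code may touch."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # seeds the CPU generator and any CUDA generators
    # Some CUDA kernels need this variable set before deterministic mode works.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
    # Raise an error if an operation without a deterministic implementation is used.
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False


# Re-seeding reproduces the same random draw.
set_seed(42)
first = torch.rand(3)
set_seed(42)
second = torch.rand(3)
print(torch.equal(first, second))
```

Calling the helper once at the start of every experiment keeps weight initialization and batch shuffling identical across runs.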

2. Preprocessing Pipeline

Raw NSL-KDD (41 features + label)
    │
    ├─ Categorical encoding: LabelEncoder for protocol_type (3), service (70), flag (11)
    ├─ Feature scaling: MinMaxScaler [0, 1] on all 41 features
    ├─ Label encoding: Binary (anomaly=0, normal=1)
    │
    └─ Output: X_train (151165, 41), X_test (34394, 41), float32

Why MinMaxScaler? Network features have vastly different ranges (src_bytes: 0 to 1.3B vs. serror_rate: 0 to 1). Scaling to [0, 1] prevents large-valued features from dominating gradient updates and makes perturbation-based explainability (ε-bounded noise) meaningful, because a perturbation of size ε then covers the same fraction of the range for every feature.

Why LabelEncoder (not OneHot)? One-hot encoding would expand the 3 categorical features into 84 binary columns (3 + 70 + 11), giving 122 input features in total. SHAP/LIME explanations over 84 binary indicator columns are harder to interpret than attributions over the 41 original semantic features. LabelEncoder preserves the original feature space for cleaner explanations.
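
The pipeline above can be sketched on a toy frame as follows. This is an illustration under assumptions, not the project's actual loader: the rows are made up, only 3 of the 41 feature columns are shown, and fitting the encoder and scaler on the training split only is an assumption that the document does not confirm.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Toy stand-in for NSL-KDD rows (hypothetical values).
train = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "icmp", "tcp"],
    "src_bytes": [0, 1500, 320, 1_300_000_000],
    "serror_rate": [0.0, 0.5, 1.0, 0.25],
    "label": ["normal", "anomaly", "anomaly", "normal"],
})
test = pd.DataFrame({
    "protocol_type": ["udp", "tcp"],
    "src_bytes": [800, 650_000_000],
    "serror_rate": [0.1, 0.9],
    "label": ["anomaly", "normal"],
})

feature_cols = ["protocol_type", "src_bytes", "serror_rate"]
categorical_cols = ["protocol_type"]  # the full pipeline also encodes service and flag

# 1) Integer-encode categorical columns with one mapping shared by both splits.
for col in categorical_cols:
    encoder = LabelEncoder().fit(train[col])
    train[col] = encoder.transform(train[col])
    test[col] = encoder.transform(test[col])  # raises on categories unseen in train

# 2) Scale every feature to [0, 1] using the training split's min and max.
scaler = MinMaxScaler(feature_range=(0, 1))
X_train = scaler.fit_transform(train[feature_cols]).astype(np.float32)
X_test = scaler.transform(test[feature_cols]).astype(np.float32)

# 3) Binary labels with the document's convention: anomaly=0, normal=1.
y_train = (train["label"] == "normal").astype(np.int64).to_numpy()
y_test = (test["label"] == "normal").astype(np.int64).to_numpy()

print(X_train.min(axis=0), X_train.max(axis=0))
print(y_train, y_test)
```

After scaling, the billion-range `src_bytes` column and the unit-range `serror_rate` column both span [0, 1] on the training split, and the feature count stays the same as in the raw data.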

3. Model Architectures

3.1 MLP (Primary Baseline)

Input (41) → Linear(256) → BatchNorm → ReLU → Dropout(0.3)
          → Linear(128) → BatchNorm → ReLU → Dropout(0.2)
          → Linear(64) → ReLU
          → Linear(num_classes)

Parameters: ~50K

Justification:

  • 3 hidden layers with decreasing width is standard for tabular classification
  • BatchNorm stabilizes training and enables higher learning rates
  • Dropout (0.3 → 0.2) regularizes; it is heavier after the first, widest hidden layer
  • No final activation: CrossEntropyLoss applies LogSoftmax internally
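
The layer stack above can be written as a small PyTorch module. This is a sketch assuming `BatchNorm` means `nn.BatchNorm1d` and `num_classes=2` for the binary task; the class name `MLP` is hypothetical.

```python
import torch
from torch import nn


class MLP(nn.Module):
    """Three hidden layers with decreasing width, returning raw logits."""

    def __init__(self, in_features: int = 41, num_classes: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, num_classes),  # no softmax here; the loss handles it
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


model = MLP()
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
logits = model(torch.rand(8, 41))  # batch of 8 scaled feature vectors
print(n_params, tuple(logits.shape))
```

Counting trainable parameters this way gives a quick check of the sketch against the quoted model size.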

3.2 LSTM (Temporal Variant)

Input (41) → reshape to (41, 1) → LSTM(hidden=64, layers=2, dropout=0.2)
          → take last hidden state → Linear(num_classes)

Parameters: ~51K (two LSTM layers with input size 1 and hidden size 64, plus the output layer)

Justification:

  • Treats 41 features as a sequence: captures inter-feature dependencies
  • 2 layers with 64 hidden units is minimal while allowing feature interaction
  • LSTM processes features in order: basic → content → time-based → host-based
  • This ordering has semantic meaning in NSL-KDD (groups of related features)
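
The reshape-then-LSTM flow above can be sketched in PyTorch as follows. It assumes `nn.LSTM` with `batch_first=True` and `num_classes=2`; the class name `FeatureLSTM` is hypothetical, and the parameter count it reports applies to this sketch only.

```python
import torch
from torch import nn


class FeatureLSTM(nn.Module):
    """Reads the 41 features one at a time, as a length-41 sequence of scalars."""

    def __init__(self, num_classes: int = 2, hidden_size: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=1, hidden_size=hidden_size,
            num_layers=2, dropout=0.2, batch_first=True,
        )
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        seq = x.unsqueeze(-1)          # (batch, 41) -> (batch, 41, 1)
        _, (h_n, _) = self.lstm(seq)   # h_n: (num_layers, batch, hidden)
        return self.head(h_n[-1])      # final hidden state of the top layer


model = FeatureLSTM()
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
logits = model(torch.rand(8, 41))
print(n_params, tuple(logits.shape))
```

Because the last hidden state is reached only after all 41 steps, features late in the column order sit closer to the output than early ones, which is why the column ordering matters for this variant.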

3.3 1D-CNN (Spatial Variant)

Input (41) → reshape to (1, 41) → Conv1d(64, k=3, pad=1) → ReLU
          → Conv1d(128, k=3, pad=1) → ReLU → AdaptiveAvgPool1d(8)
          → Flatten → Linear(64) → ReLU → Linear(num_classes)

Parameters: ~91K (dominated by the Linear layer that maps the 128 × 8 pooled values to 64 units)

Justification:

  • 1D convolutions learn local feature patterns (neighboring features)
  • Kernel size 3 captures triplets of adjacent features
  • AdaptiveAvgPool compresses to a fixed size regardless of input length
  • Useful for detecting patterns in rate-based features, which form a contiguous block of columns
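
The convolutional stack above can be sketched in PyTorch as follows, assuming a single input channel and `num_classes=2`. The class name `FeatureCNN` is hypothetical, and the parameter count it reports applies to this sketch only.

```python
import torch
from torch import nn


class FeatureCNN(nn.Module):
    """Slides width-3 kernels over the 41 features treated as a 1-channel signal."""

    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(8),  # (batch, 128, 41) -> (batch, 128, 8)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),             # (batch, 128, 8) -> (batch, 1024)
            nn.Linear(128 * 8, 64), nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.unsqueeze(1)            # (batch, 41) -> (batch, 1, 41)
        return self.classifier(self.features(x))


model = FeatureCNN()
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
logits = model(torch.rand(8, 41))
print(n_params, tuple(logits.shape))
```

With `padding=1` and kernel size 3, both convolutions preserve the length of 41, so only the adaptive pooling layer changes the sequence length.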

4. Training Configuration

| Parameter | Value | Justification |
|---|---|---|
| Optimizer | Adam | Standard for neural networks; adaptive learning rate per parameter |
| Learning rate | 1e-3 | Default Adam learning rate; works well for tabular tasks |
| Weight decay | 1e-4 | Light L2 regularization prevents overfitting |
| Batch size | 256 | Good balance of speed and gradient stability |
| Epochs | 50 | Sufficient for convergence on NSL-KDD (~151K samples) |
| Loss | CrossEntropyLoss | Standard for multi-class; includes class weights for imbalance |
| Class weights | Inverse frequency | Addresses class imbalance between normal (53%) and anomaly (47%) |
| Seed | 42 | Fixed for reproducibility |
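
The configuration above might be wired together as in the sketch below. The data is synthetic, the two-layer model is a placeholder, and the inverse-frequency normalization `N / (num_classes * count_c)` is one common convention assumed here; the document does not state which variant the project uses.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(42)

# Synthetic stand-in data: 1024 rows of 41 scaled features, labels anomaly=0 / normal=1.
X = torch.rand(1024, 41)
y = torch.rand(1024).gt(0.47).long()  # roughly 53% normal, as in the training split

# Inverse-frequency class weights: the rarer class gets the larger weight.
counts = torch.bincount(y, minlength=2).float()
class_weights = counts.sum() / (len(counts) * counts)

model = nn.Sequential(nn.Linear(41, 64), nn.ReLU(), nn.Linear(64, 2))  # placeholder model
criterion = nn.CrossEntropyLoss(weight=class_weights)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
loader = DataLoader(TensorDataset(X, y), batch_size=256, shuffle=True)


def train_one_epoch() -> float:
    """Run one pass over the loader and return the mean of the batch losses."""
    model.train()
    batch_losses = []
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
        batch_losses.append(loss.item())
    return sum(batch_losses) / len(batch_losses)


losses = [train_one_epoch() for _ in range(3)]  # the table specifies 50 epochs
print(class_weights, losses)
```

With a 53/47 split the two weights are close to 1, so the weighting acts as a mild correction rather than a strong rebalancing.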

5. Why These Models for Explainability

| Model | SHAP Method | Speed | Explanation Quality |
|---|---|---|---|
| MLP | KernelExplainer | Medium | Clean, model-agnostic attributions |
| LSTM | KernelExplainer | Medium | Attributions may differ because features are processed in sequence |
| 1D-CNN | KernelExplainer | Medium | Convolutional attributions capture local patterns |

All three use KernelExplainer (model-agnostic SHAP), enabling:

  • Direct comparison of feature attributions across architectures
  • Analysis of whether model architecture affects explanation stability
  • Consistent methodology across all models
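
Using one explainer for all three models comes down to giving each model the same NumPy-in, probabilities-out interface. The sketch below assumes PyTorch models and the `shap` package; the helper names `make_predict_fn` and `explain` are hypothetical, and `nsamples=200` is an arbitrary illustrative value.

```python
import numpy as np
import torch
from torch import nn


def make_predict_fn(model: nn.Module):
    """Wrap a PyTorch classifier as a NumPy-in, NumPy-out probability function."""
    model.eval()  # disable dropout and use BatchNorm running statistics

    def predict_proba(x: np.ndarray) -> np.ndarray:
        with torch.no_grad():
            logits = model(torch.as_tensor(x, dtype=torch.float32))
            return torch.softmax(logits, dim=1).numpy()

    return predict_proba


def explain(model: nn.Module, background: np.ndarray, samples: np.ndarray, nsamples: int = 200):
    """Run KernelExplainer on any of the three models through the same wrapper."""
    import shap  # imported lazily so the wrapper above works without shap installed

    explainer = shap.KernelExplainer(make_predict_fn(model), background)
    return explainer.shap_values(samples, nsamples=nsamples)


# Placeholder model; the MLP, LSTM, or 1D-CNN would be passed in the same way.
toy_model = nn.Sequential(nn.Linear(41, 16), nn.ReLU(), nn.Linear(16, 2))
predict = make_predict_fn(toy_model)
probs = predict(np.random.default_rng(42).random((5, 41)))
print(probs.shape, probs.sum(axis=1))
```

KernelExplainer evaluates the model on many perturbed copies of each sample combined with the background set, so a small background subset of the training data keeps the runtime manageable.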

6. Expected Baseline Performance

Based on published NSL-KDD benchmarks (Tavallaee et al., Revathi & Malathi 2013):

| Model | Binary Accuracy | Binary Weighted F1 |
|---|---|---|
| MLP | 78-85% | 78-83% |
| LSTM | 76-82% | 75-80% |
| 1D-CNN | 77-83% | 76-81% |

Known challenge: The test set contains more anomalies (65%) than the training set (47%), so this distribution shift tests generalization.
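
A short calculation shows why the class mix alone can move accuracy. Accuracy equals the prior-weighted average of per-class recall, so a model that is weaker on anomalies loses accuracy when anomalies become more frequent. The recall values below are hypothetical, and the sketch assumes per-class recall stays the same between the two splits.

```python
def expected_accuracy(anomaly_fraction: float, recall_anomaly: float, recall_normal: float) -> float:
    """Accuracy as the class-prior-weighted average of per-class recall."""
    return anomaly_fraction * recall_anomaly + (1.0 - anomaly_fraction) * recall_normal


# Hypothetical per-class recalls, chosen only to illustrate the effect.
recall_anomaly, recall_normal = 0.70, 0.95

acc_train_mix = expected_accuracy(0.47, recall_anomaly, recall_normal)  # 47% anomalies
acc_test_mix = expected_accuracy(0.65, recall_anomaly, recall_normal)   # 65% anomalies
print(f"{acc_train_mix:.4f} {acc_test_mix:.4f}")
```

With these illustrative recalls, accuracy falls from 83.25% under the training mix to 78.75% under the test mix, even though the model itself is unchanged.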