CICIDS-2017 SOC Tier-1 Intrusion Detector
Calibrated LightGBM for 5-class network intrusion detection trained on the official CICIDS-2017 dataset.
Performance (calibrated, temporal test split)
| Class | Precision | Recall | F1 |
|---|---|---|---|
| Normal | 0.98 | 1.00 | 0.99 |
| DoS | 1.00 | 0.85 | 0.92 |
| PortScan | 0.95 | 1.00 | 0.97 |
| Brute Force | 0.99 | 0.99 | 0.99 |
| Web Attack | 0.86 | 0.91 | 0.88 |
| Macro F1 | | | 0.9518 |
Web Attack precision improved from 0.19 to 0.86 relative to the standard baseline.
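As a quick sanity check on the table above, macro F1 is the unweighted mean of the per-class F1 scores. The sketch below recomputes it from the rounded precision and recall values in the table. Because those inputs are rounded to two decimals, the result only approximately matches the reported 0.9518. The variable and function names are illustrative and are not part of the released artifacts.

```python
# Per-class (precision, recall), copied from the performance table above.
per_class = {
    "Normal": (0.98, 1.00),
    "DoS": (1.00, 0.85),
    "PortScan": (0.95, 1.00),
    "Brute Force": (0.99, 0.99),
    "Web Attack": (0.86, 0.91),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Macro F1 weights every class equally, regardless of its support.
f1_scores = {name: f1(p, r) for name, (p, r) in per_class.items()}
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

print(f"macro F1 from rounded table values: {macro_f1:.3f}")
```

Equal weighting is what makes the rare Web Attack class count as much as Normal in the headline number.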
Artifacts
| File | Description |
|---|---|
| `tier1_lgbm_calibrated.pkl` | Calibrated LightGBM (isotonic, natural-dist cal set) |
| `scaler.pkl` | StandardScaler fitted on training data only |
| `feature_selector.pkl` | RF-based SelectFromModel (24 of 71 features kept) |
| `selected_features.pkl` | List of 24 selected feature names |
| `feature_cols.pkl` | Full list of 71 input features (pre-selection) |
| `label_encoder.pkl` | LabelEncoder: Brute Force=0, DoS=1, Normal=2, PortScan=3, Web Attack=4 |
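scikit-learn's `LabelEncoder` assigns integer codes in sorted order of the class names, which is why the mapping above is alphabetical. Assuming the model was trained on these encoded labels, the same codes should also index the columns of `predict_proba`. A minimal plain-Python sketch of that ordering follows; it does not load `label_encoder.pkl`, and the variable names are illustrative.

```python
# The five classes, listed in the order used by the performance table.
class_names = ["Normal", "DoS", "PortScan", "Brute Force", "Web Attack"]

# Sorting the names reproduces the integer codes a LabelEncoder would assign.
label_to_code = {name: code for code, name in enumerate(sorted(class_names))}
code_to_label = {code: name for name, code in label_to_code.items()}

print(label_to_code)
```

When the real encoder is loaded, `le.classes_` is the authoritative source for this order.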
Usage
```python
import joblib

# Load the fitted preprocessing objects and the calibrated model.
feature_cols = joblib.load('feature_cols.pkl')
le = joblib.load('label_encoder.pkl')
scaler = joblib.load('scaler.pkl')
selector = joblib.load('feature_selector.pkl')
model = joblib.load('tier1_lgbm_calibrated.pkl')

# flows: DataFrame with columns matching feature_cols (71 CICFlowMeter features).
# Indexing by feature_cols keeps the column order the scaler was fitted on.
X = scaler.transform(flows[feature_cols].to_numpy(dtype='float32'))
X = selector.transform(X)

preds = le.inverse_transform(model.predict(X))
probas = model.predict_proba(X)  # calibrated; P(Normal) ~0.83 for benign traffic
```
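Because the probabilities are calibrated, they can be used directly as a confidence score. The sketch below shows one possible way to flag low-confidence flows for analyst review. The `triage` function, the `min_confidence` threshold of 0.6, and the demo array are all hypothetical and are not part of the released artifacts; the class order assumes the `LabelEncoder` mapping listed under Artifacts.

```python
import numpy as np

# Column order assumed to follow the LabelEncoder codes (0..4).
CLASSES = np.array(["Brute Force", "DoS", "Normal", "PortScan", "Web Attack"])


def triage(probas, min_confidence=0.6):
    """Return (labels, confidence, needs_review) for each row of probas.

    min_confidence is an illustrative threshold, not a tuned value.
    """
    probas = np.asarray(probas, dtype=float)
    top = probas.argmax(axis=1)       # index of the most likely class
    confidence = probas.max(axis=1)   # its calibrated probability
    needs_review = confidence < min_confidence
    return CLASSES[top], confidence, needs_review


# Synthetic stand-in for model.predict_proba(X); each row sums to 1.
demo = np.array([
    [0.02, 0.05, 0.83, 0.05, 0.05],   # confident Normal
    [0.05, 0.10, 0.35, 0.05, 0.45],   # uncertain Web Attack
    [0.00, 0.97, 0.01, 0.01, 0.01],   # confident DoS
])
labels, confidence, needs_review = triage(demo)
for label, conf, flag in zip(labels, confidence, needs_review):
    print(f"{label:12s} p={conf:.2f} review={bool(flag)}")
```

The threshold should be chosen on held-out data. Since benign traffic sits near P(Normal) of 0.83, a cutoff above that value would send most normal flows to review.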
Key design decisions
- **Temporal split**: per-class 80/20 chronological cut per daily file
- **SMOTE**: Web Attack 1.5k → 5k, Brute Force 7k → 15k before undersampling
- **Early stopping**: balanced resampled eval set (not natural-dist) so all 5 classes contribute equally to the stopping signal
- **Calibration**: `FrozenEstimator` + isotonic on natural-distribution val set (83% Normal)
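The temporal split listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the training code: the function name `chrono_split_per_class`, its arguments, and the toy data are hypothetical, and the function would be applied to each daily file separately.

```python
import numpy as np


def chrono_split_per_class(timestamps, labels, train_frac=0.8):
    """Split row indices so that, within each class, the earliest
    train_frac of flows go to train and the remainder to test."""
    timestamps = np.asarray(timestamps)
    labels = np.asarray(labels)
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        # Order this class's rows by time; stable sort keeps ties in file order.
        idx = idx[np.argsort(timestamps[idx], kind="stable")]
        cut = int(len(idx) * train_frac)
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return np.array(train_idx, dtype=int), np.array(test_idx, dtype=int)


# Toy "daily file": 5 Normal flows and 5 DoS flows with shuffled timestamps.
timestamps = np.array([5, 1, 4, 2, 3, 10, 9, 8, 7, 6])
labels = np.array(["Normal"] * 5 + ["DoS"] * 5)
train_idx, test_idx = chrono_split_per_class(timestamps, labels)
print("train rows:", train_idx.tolist())
print("test rows:", test_idx.tolist())
```

Cutting per class keeps rare attack classes in both splits, while the chronological order ensures that no test flow of a class precedes a training flow of that class.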