Microsoft SOC Dataset: Cybersecurity Incident Triage
Student: Yonatane Ben Aroch, Reichman University
Dataset: (https://archive.ics.uci.edu/dataset/498/incident+management+process+enriched+event+log)
Date: May 2026
The Problem
SOC analysts waste 40-60% of their time reviewing false positives.
Goal: Build ML models to automate triage: predict whether an alert is a real attack (TP), harmless activity (BP), or a false alarm (FP).
Dataset
- Source: (https://archive.ics.uci.edu/dataset/498/incident+management+process+enriched+event+log)
- Original size: 4.1M evidence-level rows, 45 features
- After cleaning: 100K sampled rows, 35 features (10 dropped for >60% missing)
- Targets:
  - Regression: evidence_count (attack complexity)
  - Classification: IncidentGrade (TP/BP/FP)
- Class split: 42% BP / 36% TP / 22% FP
Feature Engineering
After cleaning, I had 35 features to work with. I created 5 additional engineered features to capture entity involvement patterns:
| Feature | Type | Description |
|---|---|---|
| has_ip | Binary | 1 if an IP address was involved in the incident |
| has_account | Binary | 1 if a user account was involved |
| has_device | Binary | 1 if a device was involved |
| has_sha256 | Binary | 1 if a file hash was present (malware execution indicator) |
| entity_score | Numeric | Total entity involvement count: sum of the flags above (0-4) |
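The flags above are straightforward to derive in pandas. A minimal sketch, assuming the raw evidence columns are named `IpAddress`, `AccountName`, `DeviceName`, and `Sha256` (illustrative stand-ins, not necessarily the dataset's exact column names):

```python
import pandas as pd

# Toy evidence-level frame; None marks an entity type not present.
df = pd.DataFrame({
    "IpAddress":   ["10.0.0.5", None, "10.0.0.9"],
    "AccountName": [None, "alice", "bob"],
    "DeviceName":  ["host-1", None, None],
    "Sha256":      [None, None, "ab12"],
})

# Binary flags: 1 if the entity type was involved in the incident.
flags = {"has_ip": "IpAddress", "has_account": "AccountName",
         "has_device": "DeviceName", "has_sha256": "Sha256"}
for flag, col in flags.items():
    df[flag] = df[col].notna().astype(int)

# entity_score: total entity involvement, an integer in 0-4.
df["entity_score"] = df[list(flags)].sum(axis=1)
print(df["entity_score"].tolist())  # [2, 1, 3]
```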
Data Cleaning
Removed OrgId because it caused data leakage.
Missing Values
Ten columns had more than 60% missing values and were dropped immediately to avoid capturing more noise than signal.
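A missing-value threshold like this is a one-liner in pandas. A minimal sketch with toy column names (the real drop list came from the 45 original features):

```python
import numpy as np
import pandas as pd

# Toy frame: 'sparse' is 80% missing, well past the 60% cutoff.
df = pd.DataFrame({
    "keep":   [1, 2, 3, 4, 5],
    "sparse": [np.nan, np.nan, np.nan, np.nan, 1.0],
})

threshold = 0.60
missing_frac = df.isna().mean()  # fraction of NaNs per column
dropped = missing_frac[missing_frac > threshold].index.tolist()
df = df.drop(columns=dropped)
print(dropped, list(df.columns))  # ['sparse'] ['keep']
```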
Duplicates
No duplicate rows found in the 100K sample.
Outliers
Kept outliers in evidence_count: the bimodal distribution is real signal
EDA
Class Distribution
Shows: Class distribution of incident grades
Finding: 42% BenignPositive, 36% TruePositive, 22% FalsePositive
Why it matters: Moderate imbalance, so macro F1 (equal weight to all classes) was used as the evaluation metric
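To see why macro F1 rather than accuracy matters under this imbalance, here is a pure-Python sketch of the metric (equivalent in spirit to sklearn's `f1_score(..., average="macro")`); the toy labels are made up for illustration:

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1: each class counts equally."""
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(labels)

# Toy labels: the rare FP class is missed entirely, dragging macro F1
# down even though raw accuracy is 4/6.
y_true = ["BP", "BP", "TP", "FP", "TP", "BP"]
y_pred = ["BP", "TP", "TP", "BP", "TP", "BP"]
print(round(macro_f1(y_true, y_pred, ["BP", "TP", "FP"]), 3))  # 0.489
```

Accuracy alone would reward a model that ignores the 22% FalsePositive class; macro F1 penalizes it directly.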
Entity Type Patterns
Shows: Which entity types (IP, User, Machine, etc.) appear in each grade
Finding: IP entities dominate TruePositive incidents (~11K count)
Why it matters: IP-based alerts deserve higher priority than User-based alerts
Threat Category Profiles
Shows: Threat categories (InitialAccess, Malware, Exfiltration, etc.) by grade
Finding:
- InitialAccess: 54% TruePositive (high threat)
- Malware: 74% BenignPositive (mostly noise)
Why it matters: Not all alert categories are equal; category alone is a strong predictor
Research Questions
Q1 β Which threat categories generate real attacks?
Finding: InitialAccess has the highest TruePositive rate at 54%. Exfiltration is 70% BenignPositive despite sounding alarming. Malware is 74% BenignPositive; most antivirus triggers are routine.
Why this matters: Category is one of the strongest predictors in the classifier. An analyst seeing an InitialAccess alert should treat it very differently than a Malware alert, even if the automated severity scores look similar.
Q2 β Do real attacks leave more evidence?
Finding: Yes, dramatically. TruePositive incidents average 117 evidence items vs 25 for FalsePositive, a 4.7× difference. BenignPositive sits in between at 40.
Why this matters: Attack complexity measured by evidence volume is one of the strongest separating signals available. The bimodal distribution isn't noise; it reflects fundamentally different operational realities between incident types.
Q3 β Can clustering identify high-threat groups?
Finding: KMeans with k=3 produced a cluster (Cluster 1) with a 47% TruePositive rate, nearly 2× higher than Cluster 0 at 22%. The three clusters have meaningfully different threat profiles with clear PCA separation.
Why this matters: Cluster distance from center became the 2nd most important feature (30% importance) in the final classifier. Incidents that don't fit neatly into any cluster (high distance scores) tend to be the real attacks.
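The distance-to-center feature can be sketched as follows. This is a minimal NumPy k-means on synthetic 2-blob data (the project itself would use `sklearn.cluster.KMeans` with k=3 on the scaled incident features); the seeding from one point per blob is for determinism:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two synthetic blobs standing in for scaled incident features.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])

# Minimal k-means loop: assign points to nearest centroid, recompute means.
centroids = X[[0, 50]].copy()
for _ in range(10):
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])

# cluster_distance: distance to the assigned cluster's center, the kind of
# engineered feature described above; large values flag incidents that
# don't fit any cluster well.
cluster_distance = np.linalg.norm(X - centroids[labels], axis=1)
```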
Model Results
Baseline Regression
Shows: LinearRegression performance (residual plot)
Finding: Two distinct peaks; the bimodal distribution is not captured
Result: RMSE = 96.1, R² = 0.023 (failed)
Why it matters: Linear models can't handle this data; tree models are needed
Regression Model Comparison
Shows: RandomForest feature importance for regression
Finding:
- AlertTitle: 61% importance
- cluster_distance: 30%
- Category: 15%
Why it matters: Which detector fired is by far the strongest predictor
Shows: RandomForest actual vs predicted scatter plot
Finding: Good fit along the diagonal; the model captures the bimodal pattern
Why it matters: Tree splits naturally handle 25-item vs 117-item peaks
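The linear-vs-tree gap on a bimodal target can be reproduced on synthetic data. A sketch under assumed numbers (peaks at 25 and 117, as in the report; the nonlinear condition on one feature is made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
# Synthetic stand-in: the target sits near 25 or 117 depending on a
# nonlinear condition on one feature, mimicking the two evidence-count peaks.
X = rng.uniform(0, 1, (500, 1))
y = np.where((X[:, 0] > 0.3) & (X[:, 0] < 0.6), 117.0, 25.0)

# A straight line through two peaks explains almost nothing; two tree
# splits (near 0.3 and 0.6) separate the peaks exactly.
lin_r2 = LinearRegression().fit(X, y).score(X, y)
tree_r2 = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y).score(X, y)
print(round(lin_r2, 2), round(tree_r2, 2))
```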
Classification Setup
Shows: Target class distribution for classification task
Finding: Same 42/36/22 split as before
Why it matters: Confirms macro F1 is the right metric (not accuracy)
Classification Results
RandomForest won with macro F1 = 0.700. Logistic Regression collapsed to near-zero FalsePositive recall with default parameters (class imbalance killed it), recovering to F1 = 0.404 only with class_weight='balanced'. XGBoost scored 0.669. The feature importance shows Alert Type at 43%, cluster distance at 30%, and Threat Category at 15%.
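The `class_weight='balanced'` recovery described above can be demonstrated in miniature. A sketch on synthetic data with roughly the report's 42/36/22 imbalance (the real pipeline trains on the engineered SOC features, not `make_classification` output):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class problem mimicking the 42/36/22 class split.
X, y = make_classification(n_samples=3000, n_classes=3, n_features=10,
                           n_informative=6, weights=[0.42, 0.36, 0.22],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# class_weight='balanced' scales each sample's loss by the inverse of its
# class frequency, so the minority class is no longer ignored.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
macro = f1_score(y_te, clf.predict(X_te), average="macro")
print(round(macro, 3))
```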
Key Insights
Threat categories have completely different noise levels. InitialAccess at 54% TP vs Malware at 74% BP; they require different response strategies. Treating all alert categories the same wastes significant analyst time.
Real attacks leave 4.7× more forensic evidence. TP incidents average 117 items vs 25 for FP. This bimodal distribution is the clearest separation between real threats and false alarms in the entire dataset.
Clustering found a natural high-threat group. Cluster 1 has a 47% TruePositive rate. Incidents falling into this cluster are nearly twice as likely to be real attacks, and distance from cluster center became the 2nd most important classification feature at 30%.
Alert Type dominates prediction at 43% importance. Which detector fired tells you more about threat grade than any other single signal. The binary flags I engineered added almost nothing for tree models, but they significantly helped the baseline linear model, which shows that feature engineering value depends on the model architecture.
Files
| File | Description |
|---|---|
| models/incident_regressor.pkl | RandomForest; predicts evidence_count. RMSE 70.9, R² 0.469 |
| models/soc_classifier.pkl | RandomForest; predicts IncidentGrade (TP/BP/FP). Macro F1 0.700 |
| figures/ (14 PNG files) | All visualizations: EDA, baseline, clustering, model comparisons |
| Assignment_2_*.ipynb | Complete notebook with all cell outputs |
| data/processed/train.csv | 80K training rows |
| data/processed/test.csv | 20K test rows |
| data/processed/summary_stats.csv | Descriptive statistics after cleaning |
Try It Live: Interactive Demo
Launch SOC Triage Predictor →
Test the model yourself with real threat categories and see predictions in real-time. Select alert properties, check threat indicators, and get instant triage recommendations.
Conclusions
This dataset turned out to be a genuinely good fit for automated triage. The features I engineered, especially the KMeans clustering approach, made a real difference: switching from the baseline linear model to RandomForest with engineered features dropped RMSE by 26%, and the classification macro F1 reached 0.700, meaning the model performs well across all three classes rather than just the majority one. The main challenge was the OrgId leakage: I spent time debugging why performance was suspiciously high before realizing the model had simply memorized which organizations generate more incidents. Removing it was the right call even though it hurt the numbers on paper. If I had to deploy this model today, I'd use it as a first-pass filter: auto-close the high-confidence FalsePositives, escalate the high-confidence TruePositives, and route only the ambiguous middle cases to human analysts. That alone could recover a meaningful chunk of the 40-60% analyst time currently lost to noise.
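The first-pass-filter routing described in the conclusion could be sketched like this. The probabilities below are hypothetical classifier outputs (shaped like `predict_proba` over FP/BP/TP), and the 0.80 cutoff is an assumed value that would need tuning on a validation set:

```python
import numpy as np

# Hypothetical predict_proba output over classes (FP, BP, TP) for 4 incidents.
proba = np.array([
    [0.92, 0.05, 0.03],   # confident FalsePositive
    [0.05, 0.10, 0.85],   # confident TruePositive
    [0.40, 0.35, 0.25],   # ambiguous
    [0.10, 0.80, 0.10],   # confident BenignPositive
])
classes = np.array(["FP", "BP", "TP"])
THRESHOLD = 0.80  # assumed confidence cutoff, to be tuned on validation data

conf = proba.max(axis=1)
pred = classes[proba.argmax(axis=1)]
# Low-confidence cases go to an analyst; confident ones are routed by grade.
route = np.where(conf < THRESHOLD, "analyst",
         np.where(pred == "FP", "auto-close",
          np.where(pred == "TP", "escalate", "low-priority")))
print(route.tolist())  # ['auto-close', 'escalate', 'analyst', 'low-priority']
```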
Yonatane Ben-Aroch, May 2026