Microsoft SOC Dataset: Cybersecurity Incident Triage
Student: Yonatane Ben Aroch, Reichman University
Dataset: (https://archive.ics.uci.edu/dataset/498/incident+management+process+enriched+event+log)
Date: May 2026
The Problem
SOC analysts waste 40-60% of their time reviewing false positives.
Goal: Build ML models to automate triage: predict whether an alert is a real attack (TP), harmless activity (BP), or a false alarm (FP).
Dataset
- Source: (https://archive.ics.uci.edu/dataset/498/incident+management+process+enriched+event+log)
- Original size: 4.1M evidence-level rows, 45 features
- After cleaning: 100K sampled rows, 35 features (10 dropped for >60% missing)
- Targets:
  - Regression: evidence_count (attack complexity)
  - Classification: IncidentGrade (TP/BP/FP)
- Class split: 42% BP / 36% TP / 22% FP
Feature Engineering
After cleaning, I had 35 features to work with. I created 5 additional engineered features to capture entity involvement patterns:
| Feature | Type | Description |
|---|---|---|
| has_ip | Binary | 1 if an IP address was involved in the incident |
| has_account | Binary | 1 if a user account was involved |
| has_device | Binary | 1 if a device was involved |
| has_sha256 | Binary | 1 if a file hash was present (malware execution indicator) |
| entity_score | Numeric | Total entity involvement count: sum of the flags above (0-4) |
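The flags above are straightforward to derive in pandas. A minimal sketch, assuming the raw evidence columns are named `IpAddress`, `AccountName`, `DeviceName`, and `Sha256` (illustrative stand-ins, not necessarily the dataset's exact column names):

```python
import pandas as pd

# Toy evidence-level frame; None marks an entity type not present.
df = pd.DataFrame({
    "IpAddress":   ["10.0.0.5", None, "10.0.0.9"],
    "AccountName": [None, "alice", "bob"],
    "DeviceName":  ["host-1", None, None],
    "Sha256":      [None, None, "ab12"],
})

# Binary flags: 1 if the entity type was involved in the incident.
flags = {"has_ip": "IpAddress", "has_account": "AccountName",
         "has_device": "DeviceName", "has_sha256": "Sha256"}
for flag, col in flags.items():
    df[flag] = df[col].notna().astype(int)

# entity_score: total entity involvement, an integer in 0-4.
df["entity_score"] = df[list(flags)].sum(axis=1)
print(df["entity_score"].tolist())  # [2, 1, 3]
```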
Data Cleaning
Removed OrgId because it caused data leakage.
Missing Values
Ten columns had more than 60% missing values and were dropped immediately to avoid capturing more noise than signal.
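A missing-value threshold like this is a one-liner in pandas. A minimal sketch with toy column names (the real drop list came from the 45 original features):

```python
import numpy as np
import pandas as pd

# Toy frame: 'sparse' is 80% missing, well past the 60% cutoff.
df = pd.DataFrame({
    "keep":   [1, 2, 3, 4, 5],
    "sparse": [np.nan, np.nan, np.nan, np.nan, 1.0],
})

threshold = 0.60
missing_frac = df.isna().mean()  # fraction of NaNs per column
dropped = missing_frac[missing_frac > threshold].index.tolist()
df = df.drop(columns=dropped)
print(dropped, list(df.columns))  # ['sparse'] ['keep']
```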
Duplicates
No duplicate rows found in the 100K sample.
Outliers
Kept outliers in evidence_count: the bimodal distribution is real signal
EDA
Class Distribution
Shows: Class distribution of incident grades
Finding: 42% BenignPositive, 36% TruePositive, 22% FalsePositive
Why it matters: Moderate imbalance, so macro F1 (equal weight to all classes) was used as the evaluation metric
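To see why macro F1 rather than accuracy matters under this imbalance, here is a pure-Python sketch of the metric (equivalent in spirit to sklearn's `f1_score(..., average="macro")`); the toy labels are made up for illustration:

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1: each class counts equally."""
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(labels)

# Toy labels: the rare FP class is missed entirely, dragging macro F1
# down even though raw accuracy is 4/6.
y_true = ["BP", "BP", "TP", "FP", "TP", "BP"]
y_pred = ["BP", "TP", "TP", "BP", "TP", "BP"]
print(round(macro_f1(y_true, y_pred, ["BP", "TP", "FP"]), 3))  # 0.489
```

Accuracy alone would reward a model that ignores the 22% FalsePositive class; macro F1 penalizes it directly.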
Entity Type Patterns
Shows: Which entity types (IP, User, Machine, etc.) appear in each grade
Finding: IP entities dominate TruePositive incidents (~11K count)
Why it matters: IP-based alerts deserve higher priority than User-based alerts
Threat Category Profiles
Shows: Threat categories (InitialAccess, Malware, Exfiltration, etc.) by grade
Finding:
- InitialAccess: 54% TruePositive (high threat)
- Malware: 74% BenignPositive (mostly noise)
Why it matters: Not all alert categories are equal; category alone is a strong predictor
Research Questions
Q1 β Which threat categories generate real attacks?
Finding: InitialAccess has the highest TruePositive rate at 54%. Exfiltration is 70% BenignPositive despite sounding alarming. Malware is 74% BenignPositive; most antivirus triggers are routine.
Why this matters: Category is one of the strongest predictors in the classifier. An analyst seeing an InitialAccess alert should treat it very differently than a Malware alert, even if the automated severity scores look similar.
Q2 β Do real attacks leave more evidence?
Finding: Yes, dramatically. TruePositive incidents average 117 evidence items vs 25 for FalsePositive, a 4.7× difference. BenignPositive sits in between at 40.
Why this matters: Attack complexity measured by evidence volume is one of the strongest separating signals available. The bimodal distribution isn't noise; it reflects fundamentally different operational realities between incident types.
Q3 β Can clustering identify high-threat groups?
Finding: KMeans with k=3 produced a cluster (Cluster 1) with a 47% TruePositive rate, nearly 2× higher than Cluster 0 at 22%. The three clusters have meaningfully different threat profiles with clear PCA separation.
Why this matters: Cluster distance from center became the 2nd most important feature (30% importance) in the final classifier. Incidents that don't fit neatly into any cluster (high distance scores) tend to be the real attacks.
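The distance-to-center feature can be sketched as follows. This is a minimal NumPy k-means on synthetic 2-blob data (the project itself would use `sklearn.cluster.KMeans` with k=3 on the scaled incident features); the seeding from one point per blob is for determinism:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two synthetic blobs standing in for scaled incident features.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])

# Minimal k-means loop: assign points to nearest centroid, recompute means.
centroids = X[[0, 50]].copy()
for _ in range(10):
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])

# cluster_distance: distance to the assigned cluster's center, the kind of
# engineered feature described above; large values flag incidents that
# don't fit any cluster well.
cluster_distance = np.linalg.norm(X - centroids[labels], axis=1)
```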
Model Results
Baseline Regression
Shows: LinearRegression performance (residual plot)
Finding: Two distinct peaks; the bimodal distribution is not captured
Result: RMSE = 96.1, R² = 0.023 (failed)
Why it matters: Linear models can't handle this data; tree models are needed
Regression Model Comparison
Shows: RandomForest feature importance for regression
Finding:
- AlertTitle: 61% importance
- cluster_distance: 30%
- Category: 15%
Why it matters: Which detector fired is by far the strongest predictor
Shows: RandomForest actual vs predicted scatter plot
Finding: Good fit along the diagonal; the model captures the bimodal pattern
Why it matters: Tree splits naturally handle 25-item vs 117-item peaks
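The linear-vs-tree gap on a bimodal target can be reproduced on synthetic data. A sketch under assumed numbers (peaks at 25 and 117, as in the report; the nonlinear condition on one feature is made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
# Synthetic stand-in: the target sits near 25 or 117 depending on a
# nonlinear condition on one feature, mimicking the two evidence-count peaks.
X = rng.uniform(0, 1, (500, 1))
y = np.where((X[:, 0] > 0.3) & (X[:, 0] < 0.6), 117.0, 25.0)

# A straight line through two peaks explains almost nothing; two tree
# splits (near 0.3 and 0.6) separate the peaks exactly.
lin_r2 = LinearRegression().fit(X, y).score(X, y)
tree_r2 = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y).score(X, y)
print(round(lin_r2, 2), round(tree_r2, 2))
```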
Classification Setup
Shows: Target class distribution for classification task
Finding: Same 42/36/22 split as before
Why it matters: Confirms macro F1 is the right metric (not accuracy)
Classification Results
RandomForest won with macro F1 = 0.700. Logistic Regression collapsed to near-zero FalsePositive recall with default parameters (class imbalance killed it), recovering to F1 = 0.404 only with class_weight='balanced'. XGBoost scored 0.669. The feature importance shows Alert Type at 43%, cluster distance at 30%, and Threat Category at 15%.
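The `class_weight='balanced'` recovery described above can be demonstrated in miniature. A sketch on synthetic data with roughly the report's 42/36/22 imbalance (the real pipeline trains on the engineered SOC features, not `make_classification` output):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class problem mimicking the 42/36/22 class split.
X, y = make_classification(n_samples=3000, n_classes=3, n_features=10,
                           n_informative=6, weights=[0.42, 0.36, 0.22],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# class_weight='balanced' scales each sample's loss by the inverse of its
# class frequency, so the minority class is no longer ignored.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
macro = f1_score(y_te, clf.predict(X_te), average="macro")
print(round(macro, 3))
```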
Key Insights
Threat categories have completely different noise levels. InitialAccess at 54% TP vs Malware at 74% BP; they require different response strategies. Treating all alert categories the same wastes significant analyst time.
Real attacks leave 4.7× more forensic evidence. TP incidents average 117 items vs 25 for FP. This bimodal distribution is the clearest separation between real threats and false alarms in the entire dataset.
Clustering found a natural high-threat group. Cluster 1 has a 47% TruePositive rate. Incidents falling into this cluster are nearly twice as likely to be real attacks, and distance from cluster center became the 2nd most important classification feature at 30%.
Alert Type dominates prediction at 43% importance. Which detector fired tells you more about threat grade than any other single signal. The binary flags I engineered added almost nothing for tree models, but they significantly helped the baseline linear model, which shows that feature engineering value depends on the model architecture.
Files
| File | Description |
|---|---|
| models/incident_regressor.pkl | RandomForest; predicts evidence_count. RMSE 70.9, R² 0.469 |
| models/soc_classifier.pkl | RandomForest; predicts IncidentGrade (TP/BP/FP). Macro F1 0.700 |
| figures/ (14 PNG files) | All visualizations: EDA, baseline, clustering, model comparisons |
| Assignment_2_*.ipynb | Complete notebook with all cell outputs |
| data/processed/train.csv | 80K training rows |
| data/processed/test.csv | 20K test rows |
| data/processed/summary_stats.csv | Descriptive statistics after cleaning |
Try It Live: Interactive Demo
Launch SOC Triage Predictor →
Test the model yourself with real threat categories and see predictions in real-time. Select alert properties, check threat indicators, and get instant triage recommendations.
Conclusions
This dataset turned out to be a genuinely good fit for automated triage. The features I engineered, especially the KMeans clustering approach, made a real difference: switching from the baseline linear model to RandomForest with engineered features dropped RMSE by 26%, and the classification macro F1 reached 0.700, meaning the model performs well across all three classes rather than just the majority one. The main challenge was the OrgId leakage: I spent time debugging why performance was suspiciously high before realizing the model had simply memorized which organizations generate more incidents. Removing it was the right call even though it hurt the numbers on paper. If I had to deploy this model today, I'd use it as a first-pass filter: auto-close the high-confidence FalsePositives, escalate the high-confidence TruePositives, and route only the ambiguous middle cases to human analysts. That alone could recover a meaningful chunk of the 40-60% analyst time currently lost to noise.
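The first-pass-filter routing described in the conclusion could be sketched like this. The probabilities below are hypothetical classifier outputs (shaped like `predict_proba` over FP/BP/TP), and the 0.80 cutoff is an assumed value that would need tuning on a validation set:

```python
import numpy as np

# Hypothetical predict_proba output over classes (FP, BP, TP) for 4 incidents.
proba = np.array([
    [0.92, 0.05, 0.03],   # confident FalsePositive
    [0.05, 0.10, 0.85],   # confident TruePositive
    [0.40, 0.35, 0.25],   # ambiguous
    [0.10, 0.80, 0.10],   # confident BenignPositive
])
classes = np.array(["FP", "BP", "TP"])
THRESHOLD = 0.80  # assumed confidence cutoff, to be tuned on validation data

conf = proba.max(axis=1)
pred = classes[proba.argmax(axis=1)]
# Low-confidence cases go to an analyst; confident ones are routed by grade.
route = np.where(conf < THRESHOLD, "analyst",
         np.where(pred == "FP", "auto-close",
          np.where(pred == "TP", "escalate", "low-priority")))
print(route.tolist())  # ['auto-close', 'escalate', 'analyst', 'low-priority']
```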
Yonatane Ben-Aroch, May 2026