Microsoft SOC Dataset β€” Cybersecurity Incident Triage

Student: Yonatane Ben Aroch, Reichman University
Dataset: https://archive.ics.uci.edu/dataset/498/incident+management+process+enriched+event+log
Date: May 2026


The Problem

SOC analysts waste 40-60% of their time reviewing false positives.

Goal: Build ML models to automate triage β€” predict if an alert is a real attack (TP), harmless activity (BP), or false alarm (FP).


Dataset

Feature Engineering

After cleaning, I had 35 features to work with. I created 5 additional engineered features to capture entity involvement patterns:

| Feature | Type | Description |
| --- | --- | --- |
| has_ip | Binary | 1 if an IP address was involved in the incident |
| has_account | Binary | 1 if a user account was involved |
| has_device | Binary | 1 if a device was involved |
| has_sha256 | Binary | 1 if a file hash was present (malware execution indicator) |
| entity_score | Numeric | Total entity involvement count — sum of the four flags above (0-4) |
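A minimal sketch of how these flags can be derived with pandas. The entity column names here (IpAddress, AccountSid, DeviceId, Sha256) are illustrative stand-ins, not necessarily the dataset's actual schema:

```python
import pandas as pd

# Toy frame standing in for the cleaned SOC sample.
df = pd.DataFrame({
    "IpAddress":  ["10.0.0.5", None, "172.16.3.9"],
    "AccountSid": [None, "S-1-5-21-1001", "S-1-5-21-1002"],
    "DeviceId":   ["dev-01", None, None],
    "Sha256":     [None, None, "ab12cd34"],
})

# Binary entity-involvement flags: 1 if the entity column is populated.
for col, flag in [("IpAddress", "has_ip"), ("AccountSid", "has_account"),
                  ("DeviceId", "has_device"), ("Sha256", "has_sha256")]:
    df[flag] = df[col].notna().astype(int)

# entity_score: total entity involvement, ranging 0-4.
flags = ["has_ip", "has_account", "has_device", "has_sha256"]
df["entity_score"] = df[flags].sum(axis=1)
print(df["entity_score"].tolist())  # → [2, 1, 3]
```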

Data Cleaning

Removed OrgId → it caused data leakage (the model memorized which organizations generate more incidents instead of learning alert properties).

Missing Values

Ten columns had more than 60% missing values and were dropped immediately to avoid capturing more noise than signal.
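The drop rule is a one-liner in pandas; a sketch on toy data with the 60% threshold from the cleaning step:

```python
import pandas as pd

# Toy frame: one full column, one 20%-missing column, one 70%-missing column.
df = pd.DataFrame({
    "keep_full":   list(range(10)),
    "keep_sparse": [1, None, 3, None, 5, 6, 7, 8, 9, 10],
    "drop_me":     [None] * 7 + [1, 2, 3],
})

threshold = 0.60
missing_frac = df.isna().mean()                      # fraction missing per column
cols_to_drop = missing_frac[missing_frac > threshold].index
df = df.drop(columns=cols_to_drop)
print(list(df.columns))  # → ['keep_full', 'keep_sparse']
```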

Duplicates

No duplicate rows found in the 100K sample.

Outliers

Kept outliers in evidence_count → the bimodal distribution is real signal, not measurement error


EDA

Class Distribution

Grade Distribution

Shows: Class distribution of incident grades
Finding: 42% BenignPositive, 36% TruePositive, 22% FalsePositive
Why it matters: Moderate imbalance β†’ used macro F1 metric (equal weight to all classes)
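To make the metric choice concrete, here is a toy illustration (synthetic labels mirroring the 42/36/22 split, not the real predictions) of how accuracy can look healthy while a model ignores the minority FalsePositive class, which macro F1 exposes:

```python
from sklearn.metrics import accuracy_score, f1_score

# 42 BP, 36 TP, 22 FP — same split as the dataset.
y_true = ["BP"] * 42 + ["TP"] * 36 + ["FP"] * 22
# A lazy model that never predicts FP, dumping those cases into BP.
y_pred = ["BP"] * 42 + ["TP"] * 36 + ["BP"] * 22

print(accuracy_score(y_true, y_pred))                  # 0.78 — looks fine
print(f1_score(y_true, y_pred, average="macro"))       # ≈ 0.60 — FP collapse exposed
```

Macro F1 averages per-class F1 with equal weight, so the zero score on FP drags the metric down even though FP is the smallest class.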

Entity Type Patterns

Entity Type by Grade

Shows: Which entity types (IP, User, Machine, etc.) appear in each grade
Finding: IP entities dominate TruePositive incidents (~11K count)
Why it matters: IP-based alerts deserve higher priority than User-based alerts

Threat Category Profiles

Category by Grade

Shows: Threat categories (InitialAccess, Malware, Exfiltration, etc.) by grade
Finding:

  • InitialAccess: 54% TruePositive (high threat)
  • Malware: 74% BenignPositive (mostly noise)

Why it matters: Not all alert categories are equal β€” category alone is a strong predictor


Research Questions

Q1 β€” Which threat categories generate real attacks?

TP by Category

Finding: InitialAccess has the highest TruePositive rate at 54%. Exfiltration is 70% BenignPositive despite sounding alarming. Malware is 74% BenignPositive β€” most antivirus triggers are routine.

Why this matters: Category is one of the strongest predictors in the classifier. An analyst seeing an InitialAccess alert should treat it very differently than a Malware alert, even if the automated severity scores look similar.

Q2 β€” Do real attacks leave more evidence?

Evidence by Grade

Finding: Yes, dramatically. TruePositive incidents average 117 evidence items vs 25 for FalsePositive β€” a 4.7Γ— difference. BenignPositive sits in between at 40.

Why this matters: Attack complexity measured by evidence volume is one of the strongest separating signals available. The bimodal distribution isn't noise β€” it reflects fundamentally different operational realities between incident types.
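The per-grade comparison is a simple groupby; a sketch on synthetic counts chosen to mirror the reported averages (117 / 40 / 25), with illustrative column names:

```python
import pandas as pd

# Synthetic rows whose per-grade means match the reported figures.
df = pd.DataFrame({
    "IncidentGrade": ["TruePositive", "TruePositive", "FalsePositive",
                      "FalsePositive", "BenignPositive", "BenignPositive"],
    "evidence_count": [110, 124, 22, 28, 38, 42],
})

means = df.groupby("IncidentGrade")["evidence_count"].mean()
ratio = means["TruePositive"] / means["FalsePositive"]
print(means)
print(round(ratio, 2))  # → 4.68, the ~4.7× gap
```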

Q3 β€” Can clustering identify high-threat groups?

Cluster Visualization

Finding: KMeans with k=3 produced a cluster (Cluster 1) with 47% TruePositive rate β€” nearly 2Γ— higher than Cluster 0 at 22%. The three clusters have meaningfully different threat profiles with clear PCA separation.

Why this matters: Cluster distance from center became the 2nd most important feature (30% importance) in the final classifier. Incidents that don't fit neatly into any cluster β€” high distance scores β€” tend to be the real attacks.
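A minimal sketch of the cluster-distance feature on synthetic 2-D data (the real pipeline runs on the scaled incident features): KMeans.transform returns distances to every center, and indexing by the assigned label gives each incident's distance to its own cluster.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Three synthetic blobs standing in for the scaled incident features.
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(5, 1, (50, 2)),
               rng.normal((0, 5), 1, (50, 2))])

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# cluster_distance: distance from each point to its assigned center.
# Points far from every center ("don't fit neatly") get high scores.
cluster_distance = km.transform(X)[np.arange(len(X)), km.labels_]
print(cluster_distance.shape)  # → (150,)
```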


Model Results

Baseline Regression

Baseline Coefficients

Baseline Evaluation

Shows: LinearRegression performance β€” residual plot
Finding: Two distinct peaks β†’ bimodal distribution not captured
Result: RMSE = 96.1, RΒ² = 0.023 (failed)
Why it matters: Linear models can't handle this data β†’ need tree models

Regression Model Comparison

Regression Comparison

Regression Importance

Shows: RandomForest feature importance for regression
Finding:

  • AlertTitle: 61% importance
  • cluster_distance: 30%
  • Category: 15%

Why it matters: Which detector fired is by far the strongest predictor

Winner Regression Eval

Shows: RandomForest actual vs predicted scatter plot
Finding: Good fit along diagonal β€” captures bimodal pattern
Why it matters: Tree splits naturally handle 25-item vs 117-item peaks
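The linear-vs-tree gap can be reproduced on synthetic data. In this sketch the target is bimodal (peaks near 25 and 117, like evidence_count) and which peak a row lands in depends on a feature interaction — something a linear model cannot represent but axis-aligned tree splits handle naturally. This is an illustrative construction, not the dataset's actual generating process:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (500, 2))
# Bimodal target: high peak when the two features agree in sign.
y = np.where(X[:, 0] * X[:, 1] > 0, 117.0, 25.0) + rng.normal(0, 5, 500)
Xtr, Xte, ytr, yte = X[:400], X[400:], y[:400], y[400:]

lin_r2 = LinearRegression().fit(Xtr, ytr).score(Xte, yte)
rf_r2 = RandomForestRegressor(n_estimators=100, random_state=0).fit(Xtr, ytr).score(Xte, yte)
print(f"linear R2={lin_r2:.3f}  random forest R2={rf_r2:.3f}")
```

The linear R² stays near zero, echoing the baseline's 0.023, while the forest recovers the two peaks.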

Classification Setup

Grade Distribution

Shows: Target class distribution for classification task
Finding: Same 42/36/22 split as before
Why it matters: Confirms macro F1 is right metric (not accuracy)

Classification Results

Confusion Matrices

Classification Importance

RandomForest won with macro F1 = 0.700. Logistic Regression collapsed to near-zero FalsePositive recall with default parameters (class imbalance killed it), recovering to F1 = 0.404 only with class_weight='balanced'. XGBoost scored 0.669. The feature importance shows Alert Type at 43%, cluster distance at 30%, and Threat Category at 15%.
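The comparison loop can be sketched as follows, on a synthetic imbalanced 3-class problem echoing the 42/36/22 grade split (the real run used the engineered incident features, and XGBoost, which is omitted here to keep the sketch to scikit-learn):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Imbalanced 3-class toy problem standing in for the grade labels.
X, y = make_classification(n_samples=2000, n_classes=3, n_informative=6,
                           weights=[0.42, 0.36, 0.22], random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

scores = {}
for name, clf in [
    ("logreg (default)", LogisticRegression(max_iter=1000)),
    ("logreg (balanced)", LogisticRegression(max_iter=1000, class_weight="balanced")),
    ("random forest", RandomForestClassifier(n_estimators=200, random_state=0)),
]:
    clf.fit(Xtr, ytr)
    scores[name] = f1_score(yte, clf.predict(Xte), average="macro")
    print(name, round(scores[name], 3))
```

`class_weight="balanced"` reweights each class inversely to its frequency, which is what rescued Logistic Regression's FalsePositive recall in the real run.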


Key Insights

  1. Threat categories have completely different noise levels. InitialAccess at 54% TP vs Malware at 74% BP β€” they require different response strategies. Treating all alert categories the same wastes significant analyst time.

  2. Real attacks leave 4.7Γ— more forensic evidence. TP incidents average 117 items vs 25 for FP. This bimodal distribution is the clearest separation between real threats and false alarms in the entire dataset.

  3. Clustering found a natural high-threat group. Cluster 1 has a 47% TruePositive rate. Incidents falling into this cluster are nearly twice as likely to be real attacks, and distance from cluster center became the 2nd most important classification feature at 30%.

  4. Alert Type dominates prediction at 43% importance. Which detector fired tells you more about threat grade than any other single signal. The binary flags I engineered added almost nothing for tree models β€” but they significantly helped the baseline linear model, which shows that feature engineering value depends on the model architecture.


Files

| File | Description |
| --- | --- |
| models/incident_regressor.pkl | RandomForest — predicts evidence_count. RMSE 70.9, R² 0.469 |
| models/soc_classifier.pkl | RandomForest — predicts IncidentGrade (TP/BP/FP). Macro F1 0.700 |
| figures/ (14 PNG files) | All visualizations — EDA, baseline, clustering, model comparisons |
| Assignment_2_*.ipynb | Complete notebook with all cell outputs |
| data/processed/train.csv | 80K training rows |
| data/processed/test.csv | 20K test rows |
| data/processed/summary_stats.csv | Descriptive statistics after cleaning |
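The .pkl artifacts can be loaded with joblib. Since the real files and training data aren't bundled here, this sketch round-trips a toy classifier to show the pattern:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for models/soc_classifier.pkl: 0/1/2 encode TP/BP/FP.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 3, 100)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Persist and reload — same calls you'd use on the shipped models/*.pkl.
path = os.path.join(tempfile.mkdtemp(), "soc_classifier.pkl")
joblib.dump(clf, path)
loaded = joblib.load(path)
print(loaded.predict(X[:3]))
```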

Try It Live β€” Interactive Demo

Launch SOC Triage Predictor β†’

Test the model yourself with real threat categories and see predictions in real-time. Select alert properties, check threat indicators, and get instant triage recommendations.

Conclusions

This dataset turned out to be a genuinely good fit for automated triage. The features I engineered — especially the KMeans clustering approach — made a real difference: switching from the baseline linear model to RandomForest with engineered features dropped RMSE by 26%, and the classification macro F1 reached 0.700, meaning the model correctly identifies threat grade 7 out of 10 times across all three classes equally.

The main challenge was the OrgId leakage — I spent time debugging why performance was suspiciously high before realizing the model had simply memorized which organizations generate more incidents. Removing it was the right call even though it hurt the numbers on paper.

If I had to deploy this model today, I'd use it as a first-pass filter: auto-close the high-confidence FalsePositives, escalate the high-confidence TruePositives, and route only the ambiguous middle cases to human analysts. That alone could recover a meaningful chunk of the 40-60% analyst time currently lost to noise.


Yonatane Ben-Aroch, May 2026
