CYB008 Baseline Classifier

SOC alert triage classifier trained on the CYB008 synthetic SOC alert sample. It predicts which of 5 triage outcome classes (auto_resolved_soar / duplicate_merged / false_positive_closed / true_positive_remediated / true_positive_escalated) an alert will reach, from per-alert features. It also ships a leakage diagnostic for the three structural-oracle columns dropped from the feature pipeline.

Read this first. This repo ships two related artifacts: (1) a working baseline classifier for resolution_outcome (the primary product), and (2) a leakage_diagnostic.json file documenting (a) the three structural oracle columns that were dropped from the feature set, and (b) the separate finding that the CYB008 dataset README's second suggested use case, MITRE ATT&CK tactic classification, is not learnable on this sample. Both artifacts matter; the diagnostic is required reading for anyone evaluating CYB008 for a triage product.

Model overview

Property	Value
Primary task	5-class `resolution_outcome` classification (SOC alert triage)
Secondary artifact	`leakage_diagnostic.json` — structural oracle + unlearnable-target audit
Training data	`xpertsystems/cyb008-sample` (9,200 alerts)
Models	XGBoost + PyTorch MLP
Input features	53 (after one-hot encoding)
Split	Stratified random (no natural group key in this dataset — see rationale below)
Validation	Single seed (artifact) + multi-seed aggregate across 10 seeds
License	CC-BY-NC-4.0 (matches dataset)
Status	Reference baseline + leakage diagnostic

Why this task — and what was dropped

The CYB008 README lists alert triage (TP vs FP prediction) as its first suggested use case and MITRE ATT&CK tactic classification as its second. We piloted both on the sample dataset:

Triage outcome: works honestly. After dropping 3 structural oracle columns, the model achieves acc 0.777 ± 0.007, ROC-AUC 0.955 ± 0.003 on 5-class classification. This is the primary baseline.
MITRE tactic classification: does NOT work on this sample. Without mitre_technique_id (which is a perfect ATT&CK-by-design oracle), the per-tactic feature distributions are nearly identical (raw_score 0.37–0.39 across all 12 tactics, similar for enriched score and fatigue). A trained XGBoost achieves accuracy 0.08, below the majority baseline of 0.14. The README's stated use case cannot be honestly demonstrated on the sample. See leakage_diagnostic.json for the full finding and our recommendation to the dataset author.

The three structural oracle columns (dropped)

CYB008 has three columns that structurally encode the resolution_outcome label:

Column	Oracle relationship
`alert_lifecycle_phase`	3 of 4 values deterministically map to specific outcomes (auto_closed → auto_resolved_soar; escalated → true_positive_escalated; suppressed_duplicate → duplicate_merged)
`automation_resolved`	Exact 1:1 with `auto_resolved_soar` outcome
`escalation_flag`	1319 escalation flags = 1319 `true_positive_escalated` outcomes (near-1:1)

With all three present, plain XGBoost achieves 100% test accuracy across all seeds, which is mechanical rather than learned. With all three dropped, accuracy is about 0.78 with ROC-AUC about 0.96 (10-seed means 0.777 and 0.955): real learning on a non-trivial 5-class task. The published baseline trains with these three columns excluded.
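A check of this kind can be sketched in a few lines. The snippet below is a minimal illustration, not the audit that produced leakage_diagnostic.json: the helper name `oracle_values` and the toy records are invented here, and `in_triage` is a placeholder for the fourth lifecycle phase, whose real name is not listed above.

```python
from collections import defaultdict

def oracle_values(records, column, target="resolution_outcome"):
    """Return {value: outcome} for each value of `column` that always
    co-occurs with exactly one outcome (a deterministic oracle value)."""
    outcomes_seen = defaultdict(set)
    for rec in records:
        outcomes_seen[rec[column]].add(rec[target])
    return {
        value: next(iter(outcomes))
        for value, outcomes in outcomes_seen.items()
        if len(outcomes) == 1
    }

# Toy records, invented for illustration only. "in_triage" is a
# hypothetical name for the one phase that does not fix the outcome.
toy = [
    {"alert_lifecycle_phase": "auto_closed", "resolution_outcome": "auto_resolved_soar"},
    {"alert_lifecycle_phase": "auto_closed", "resolution_outcome": "auto_resolved_soar"},
    {"alert_lifecycle_phase": "escalated", "resolution_outcome": "true_positive_escalated"},
    {"alert_lifecycle_phase": "suppressed_duplicate", "resolution_outcome": "duplicate_merged"},
    {"alert_lifecycle_phase": "in_triage", "resolution_outcome": "false_positive_closed"},
    {"alert_lifecycle_phase": "in_triage", "resolution_outcome": "true_positive_remediated"},
]

leaky = oracle_values(toy, "alert_lifecycle_phase")
print(leaky)
```

Any column for which most values appear in the returned mapping is a candidate for exclusion before training.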

Two model artifacts are published. They are designed to be used together — disagreement is a useful triage signal:

model_xgb.json — gradient-boosted trees
model_mlp.safetensors — PyTorch MLP in SafeTensors format

On CYB008 the MLP slightly outperforms XGBoost on the test fold (0.767 vs 0.766 accuracy, 0.955 vs 0.952 ROC-AUC at seed 42) — only the second SKU in the XpertSystems baseline catalog where this happens (after CYB007).
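One way to use the two models together is to flag alerts where their top predictions differ. The sketch below assumes each model yields an (n_alerts, 5) probability array with the same class order; the function name and the probability values are made up for illustration.

```python
import numpy as np

def disagreement_mask(proba_a, proba_b):
    """True for each row where the two models pick different top classes."""
    return np.argmax(proba_a, axis=1) != np.argmax(proba_b, axis=1)

# Made-up probabilities for three alerts over the 5 outcome classes.
proba_xgb = np.array([
    [0.70, 0.05, 0.10, 0.10, 0.05],
    [0.10, 0.05, 0.15, 0.40, 0.30],
    [0.05, 0.80, 0.05, 0.05, 0.05],
])
proba_mlp = np.array([
    [0.65, 0.05, 0.15, 0.10, 0.05],
    [0.10, 0.05, 0.10, 0.30, 0.45],
    [0.05, 0.75, 0.10, 0.05, 0.05],
])

needs_review = disagreement_mask(proba_xgb, proba_mlp)
print(needs_review.tolist())
```

Only the second toy alert is flagged, because the two models rank different classes first; such alerts could be routed to an analyst rather than handled automatically.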

Quick start

pip install xgboost scikit-learn numpy torch safetensors pandas huggingface_hub

from huggingface_hub import hf_hub_download
import json, numpy as np, torch, xgboost as xgb
from safetensors.torch import load_file

REPO = "xpertsystems/cyb008-baseline-classifier"

paths = {n: hf_hub_download(REPO, n) for n in [
    "model_xgb.json", "model_mlp.safetensors",
    "feature_engineering.py", "feature_meta.json", "feature_scaler.json",
]}

import sys, os
sys.path.insert(0, os.path.dirname(paths["feature_engineering.py"]))
from feature_engineering import transform_single, load_meta, INT_TO_LABEL

meta = load_meta(paths["feature_meta.json"])
xgb_model = xgb.XGBClassifier(); xgb_model.load_model(paths["model_xgb.json"])

# Predict (see inference_example.ipynb for the full pattern)
# Note: do NOT include alert_lifecycle_phase, automation_resolved, or
# escalation_flag in your record - those were the oracle columns.
X = transform_single(my_alert_record, meta)
proba = xgb_model.predict_proba(X)[0]
print(INT_TO_LABEL[int(np.argmax(proba))])

See inference_example.ipynb for the full copy-paste demo.

Training data

Trained on the public sample of CYB008, 9,200 per-alert records:

Outcome	Alerts	Class share
`false_positive_closed`	2,996	32.6%
`auto_resolved_soar`	2,642	28.7%
`true_positive_remediated`	1,848	20.1%
`true_positive_escalated`	1,319	14.3%
`duplicate_merged`	395	4.3%

Stratified split (no natural group key)

CYB008 does not have a natural row-level group key for group-aware splitting:

25 analysts — group-aware split would yield only ~4 test analysts
5 SOCs — would yield 1 test SOC
589 incidents — only 9% of alerts have a non-null incident_id

Alerts are essentially independent given features, so we use StratifiedShuffleSplit (nested 70/15/15), the same approach as CYB001 for network flow classification:

Fold	Alerts
Train	6,440
Validation	1,380
Test	1,380
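The nested split can be reproduced in outline with scikit-learn. This is a sketch under assumptions: the private training script is not published, so the two-stage arrangement below (hold out 30%, then halve the hold-out) and the reuse of seed 42 for both stages are guesses at how a nested 70/15/15 split would be built. Labels are synthesized from the class counts in the table above.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

CLASS_COUNTS = {
    "false_positive_closed": 2996,
    "auto_resolved_soar": 2642,
    "true_positive_remediated": 1848,
    "true_positive_escalated": 1319,
    "duplicate_merged": 395,
}

# Synthetic integer labels with the published class counts (9,200 rows).
y = np.repeat(np.arange(len(CLASS_COUNTS)), list(CLASS_COUNTS.values()))
rows = np.arange(len(y))

# Outer split: hold out 30% of the rows, stratified by outcome.
outer = StratifiedShuffleSplit(
    n_splits=1, test_size=round(0.30 * len(y)), random_state=42
)
train_idx, hold_idx = next(outer.split(rows, y))

# Inner split: halve the hold-out into validation and test folds.
inner = StratifiedShuffleSplit(
    n_splits=1, test_size=len(hold_idx) // 2, random_state=42
)
val_pos, test_pos = next(inner.split(hold_idx, y[hold_idx]))
val_idx, test_idx = hold_idx[val_pos], hold_idx[test_pos]

print(len(train_idx), len(val_idx), len(test_idx))
```

The fold sizes match the table, but the row assignments will not match the published artifact unless the real script uses the same construction.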

Class imbalance is addressed with balanced class weights (the class_weight='balanced' heuristic), applied to XGBoost as per-row sample_weight and to the MLP as weighted cross-entropy.
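For reference, the usual "balanced" heuristic sets each class weight to n_samples / (n_classes × class_count). The sketch below applies it to the full-sample counts from the table above; the actual training presumably computes weights on the training fold only, so the exact values there may differ slightly.

```python
CLASS_COUNTS = {
    "false_positive_closed": 2996,
    "auto_resolved_soar": 2642,
    "true_positive_remediated": 1848,
    "true_positive_escalated": 1319,
    "duplicate_merged": 395,
}

def balanced_class_weights(counts):
    """weight_c = n_samples / (n_classes * count_c)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}

weights = balanced_class_weights(CLASS_COUNTS)
for label, w in weights.items():
    print(f"{label:26s} {w:.3f}")

# Per-row weights (what a sample_weight argument would receive) are
# obtained by looking up each row's class, e.g. weights[row_label].
```

Under this scheme every class contributes the same total weight, so the 395 duplicate_merged alerts count as much in aggregate as the 2,996 false_positive_closed alerts.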

Feature pipeline

The bundled feature_engineering.py is the canonical feature recipe. 53 features survive after encoding, drawn from:

Per-alert numeric (9): raw_score, enriched_score, time_in_phase_minutes, queue_depth_at_ingestion, soar_playbook_triggered, sla_breached_flag, mttd_minutes, mttr_minutes, fatigue_score_at_alert
Per-alert categorical (5, one-hot): alert_severity (7 values), alert_source (8 values), mitre_tactic (12 values), analyst_tier (3 values), siem_platform (8 values)
Engineered (6): enrichment_lift, log_mttr, log_mttd, queue_pressure, enrichment_per_minute, is_high_confidence
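The 53-feature total follows from the three groups above, assuming every categorical level gets its own one-hot column (no dropped reference level). A quick arithmetic check:

```python
NUMERIC = [
    "raw_score", "enriched_score", "time_in_phase_minutes",
    "queue_depth_at_ingestion", "soar_playbook_triggered",
    "sla_breached_flag", "mttd_minutes", "mttr_minutes",
    "fatigue_score_at_alert",
]
# Number of distinct levels per categorical column, as listed above.
CATEGORICAL_LEVELS = {
    "alert_severity": 7,
    "alert_source": 8,
    "mitre_tactic": 12,
    "analyst_tier": 3,
    "siem_platform": 8,
}
ENGINEERED = [
    "enrichment_lift", "log_mttr", "log_mttd",
    "queue_pressure", "enrichment_per_minute", "is_high_confidence",
]

one_hot_width = sum(CATEGORICAL_LEVELS.values())
total_features = len(NUMERIC) + one_hot_width + len(ENGINEERED)
print(one_hot_width, total_features)
```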

Excluded columns

Oracle columns (dropped to allow honest evaluation):

Column	Why excluded
`alert_lifecycle_phase`	3 of 4 values are deterministic outcome oracles
`automation_resolved`	1:1 with `auto_resolved_soar` outcome
`escalation_flag`	Near-1:1 with `true_positive_escalated` outcome

High-cardinality columns (dropped for tractability):

Column	Why excluded
`mitre_technique_id`	36 unique values; perfect oracle for `mitre_tactic` but unrelated to this target
`detection_rule_id`	656 unique values; one-hot explosion with no real per-tactic affinity (only 5% of rules map to a single tactic)

Partial-oracle features (kept as legitimate observables)

soar_playbook_triggered is a necessary but not sufficient condition for auto_resolved_soar — when 0, the alert is never auto-resolved; when 1, the outcome is auto-resolved 68% of the time but can also be TP-remediated, TP-escalated, FP-closed, or duplicate-merged. This is a legitimate observable that downstream operators would already have on hand at decision time. KEPT in the pipeline.

Evaluation

Test-set metrics, seed 42 (n = 1,380 alerts)

XGBoost (the published model_xgb.json artifact)

Metric	Value
Macro ROC-AUC (OvR)	0.9522
Accuracy	0.7659
Macro-F1	0.7430
Weighted-F1	0.7672

MLP (the published model_mlp.safetensors artifact) — slightly outperforms XGBoost

Metric	Value
Macro ROC-AUC (OvR)	0.9552
Accuracy	0.7674
Macro-F1	0.7510
Weighted-F1	0.7691

With 6,440 training rows and 53 features, the MLP has enough data to compete favorably with boosted trees. Both models are published.

Multi-seed robustness (XGBoost, 10 seeds)

Very stable performance — std 0.007 on accuracy is among the tightest in the XpertSystems catalog:

Metric	Mean	Std	Min	Max
Accuracy	0.777	0.007	0.766	0.792
Macro-F1	0.765	0.011	0.743	0.783
Macro ROC-AUC OvR	0.955	0.003	0.950	0.960

Full per-seed results in multi_seed_results.json. All 10 seeds yielded all 5 classes in the test fold (stratified split guarantees this).
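The summary columns in the table are ordinary aggregates over per-seed scores. The sketch below shows the computation on placeholder numbers; they are not the real per-seed results, the layout of multi_seed_results.json is not assumed, and the use of the sample standard deviation (rather than the population one) is a guess.

```python
import statistics

def summarize(values):
    """Mean, sample std, min and max of one metric across seeds."""
    return {
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample std; an assumption
        "min": min(values),
        "max": max(values),
    }

# Placeholder accuracies for three imaginary seeds (NOT the real results).
placeholder_accuracy = [0.70, 0.80, 0.90]

summary = summarize(placeholder_accuracy)
print({name: round(value, 3) for name, value in summary.items()})
```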

Per-class F1 (seed 42)

Outcome	Class share	XGBoost F1	MLP F1
`false_positive_closed`	32.6%	0.904	0.910
`duplicate_merged`	4.3%	0.794	0.825
`auto_resolved_soar`	28.7%	0.757	0.751
`true_positive_remediated`	20.1%	0.701	0.698
`true_positive_escalated`	14.3%	0.559	0.571

The model performs best on false_positive_closed (clearest behavioural profile — low scores, fast resolution by L1 analysts) and duplicate_merged (smallest class but distinctive — duplicate-suppressed severity is a strong tell). The hardest discrimination is between true_positive_remediated and true_positive_escalated — both are genuine threats, differing primarily by whether the alert was closed by the original analyst or passed to a higher tier. In production this matters less because both are TP outcomes; binary TP-vs-FP recall is much higher.
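To see why confusion between the two true-positive classes costs little in a binary view, the 5-class labels can be collapsed before scoring. The sketch below treats the two true_positive_* outcomes as positive and everything else as negative; that grouping and the toy labels are assumptions made for illustration.

```python
TP_CLASSES = {"true_positive_remediated", "true_positive_escalated"}

def is_tp(label):
    return label in TP_CLASSES

def binary_tp_recall(y_true, y_pred):
    """Share of truly-TP alerts whose prediction is any TP class."""
    actual_tp = [(t, p) for t, p in zip(y_true, y_pred) if is_tp(t)]
    if not actual_tp:
        return float("nan")
    return sum(is_tp(p) for _, p in actual_tp) / len(actual_tp)

# Toy labels: two alerts are confused between the two TP classes.
y_true = ["true_positive_remediated", "true_positive_escalated",
          "true_positive_escalated", "false_positive_closed"]
y_pred = ["true_positive_escalated", "true_positive_remediated",
          "true_positive_escalated", "false_positive_closed"]

five_class_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = binary_tp_recall(y_true, y_pred)
print(five_class_acc, recall)
```

In the toy case half of the 5-class predictions are wrong, yet every true positive is still recognized as a true positive.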

Ablation: which feature groups matter

Configuration	Accuracy	Macro-F1	ROC-AUC	Δ accuracy
Full feature set (published)	0.7659	0.7430	0.9522	—
No alert severity	0.5138	0.3933	0.7304	−0.2522
No `soar_playbook_triggered`	0.6188	0.5773	0.8369	−0.1471
No analyst tier	0.7717	0.7471	0.9524	+0.0058
No siem platform	0.7681	0.7474	0.9522	+0.0022
No alert source	0.7638	0.7406	0.9511	−0.0022
No engineered features	0.7681	0.7480	0.9533	+0.0022
No mitre_tactic	0.7812	0.7656	0.9530	+0.0152
No timing features	0.7775	0.7572	0.9547	+0.0116
No score features	0.7710	0.7569	0.9541	+0.0051

Four findings:

Alert severity carries the dominant signal (drops 25 pp accuracy, 22 pp ROC-AUC). This is intuitive: severity directly drives triage priority, which drives outcome. false_positive severity → false_positive_closed; duplicate_suppressed severity → duplicate_merged.
soar_playbook_triggered is the second-strongest signal (drops 15 pp accuracy). It's a partial oracle for the auto_resolved_soar outcome class.
MITRE tactic and analyst tier contribute essentially nothing. The model performs marginally better without them — they add noise that the trees over-fit on the training set.
Engineered features and timing features are near-flat. The trees recover composites from raw inputs. Kept in the pipeline as a documented baseline reference.
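A leave-one-group-out ablation like the one in the table can be organized as below. This is a sketch of the bookkeeping only: `run_ablation` is a hypothetical helper, and `toy_evaluate` is a stand-in with invented scores for whatever routine retrains the model on a feature subset and returns test accuracy.

```python
def run_ablation(feature_groups, evaluate):
    """Score the full feature set, then re-score with each group removed.

    Returns (baseline_score, {group_name: score_without_group - baseline}).
    """
    all_features = [f for group in feature_groups.values() for f in group]
    baseline = evaluate(all_features)
    deltas = {}
    for name, group in feature_groups.items():
        removed = set(group)
        kept = [f for f in all_features if f not in removed]
        deltas[name] = evaluate(kept) - baseline
    return baseline, deltas

def toy_evaluate(features):
    """Stand-in scorer with invented numbers; replace with train + test."""
    score = 0.50
    if "alert_severity" in features:
        score += 0.25
    if "soar_playbook_triggered" in features:
        score += 0.15
    return score

groups = {
    "alert_severity": ["alert_severity"],
    "soar_playbook_triggered": ["soar_playbook_triggered"],
    "analyst_tier": ["analyst_tier"],
}
baseline, deltas = run_ablation(groups, toy_evaluate)
print(round(baseline, 2), {k: round(v, 2) for k, v in deltas.items()})
```

A negative delta means the group carried signal; a delta near zero or above zero means the model did not need it.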

Architecture

XGBoost: multi-class gradient boosting (multi:softprob, 5 classes), hist tree method, class-balanced sample weights, early stopping on validation mlogloss.

MLP: 53 → 128 → 64 → 5, each hidden layer followed by BatchNorm1d → ReLU → Dropout(0.3), weighted cross-entropy loss, AdamW optimizer, early stopping on validation macro-F1.
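The MLP description above maps onto a small PyTorch module. The sketch below is one plausible reading of it with randomly initialized weights. The parameter names inside model_mlp.safetensors are not documented here, so loading the published weights into this module may need key remapping, and inputs would first need standardizing with the values in feature_scaler.json.

```python
import torch
from torch import nn

def build_mlp(n_features=53, n_classes=5, dropout=0.3):
    """53 -> 128 -> 64 -> 5, with BatchNorm1d -> ReLU -> Dropout after
    each hidden Linear layer, as described in the text."""
    return nn.Sequential(
        nn.Linear(n_features, 128),
        nn.BatchNorm1d(128),
        nn.ReLU(),
        nn.Dropout(dropout),
        nn.Linear(128, 64),
        nn.BatchNorm1d(64),
        nn.ReLU(),
        nn.Dropout(dropout),
        nn.Linear(64, n_classes),  # raw logits; softmax applied at inference
    )

model = build_mlp()
model.eval()  # BatchNorm uses running statistics, Dropout is disabled

with torch.no_grad():
    logits = model(torch.zeros(4, 53))    # 4 dummy, already-scaled alerts
    proba = torch.softmax(logits, dim=1)  # per-class probabilities

print(tuple(proba.shape))
```

Calling `model.eval()` before inference matters here: in training mode Dropout randomizes the output and BatchNorm1d uses per-batch statistics.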

Training hyperparameters are held internally by XpertSystems.

Limitations

This is a baseline reference, not a production SOC triage system.

MITRE tactic classification is unlearnable on this sample. The README lists it as a suggested use case but the per-tactic feature distributions are too similar (raw_score 0.37–0.39 across all 12 tactics). See leakage_diagnostic.json for the full audit. Real SOC data has stronger per-tactic feature signatures.
TP-remediated vs TP-escalated is the hardest discrimination. F1 0.56 on TP-escalated is the weakest per-class result. Both are genuine threats; the difference is workflow rather than threat nature. For most operational uses (TP-vs-FP recall, SLA-breach reduction), this confusion does not matter.
MLP modestly outperforms XGBoost. Both are shipped; we recommend running both and treating disagreement as a triage signal. The gap is small enough that, for production deployment, the choice between them is essentially an engineering preference.
Synthetic-vs-real transfer. The dataset is synthetic and calibrated to 12 SOC-operations benchmark metrics drawn from published sources (SANS SOC Survey, IBM Cost of Data Breach, Mandiant M-Trends, Forrester Wave SOAR, Gartner SIEM Magic Quadrant, SOC.OS, CrowdStrike, Splunk State of Security, Verizon DBIR). Real SOC telemetry has different noise characteristics, and the structural-oracle pattern documented above (alert_lifecycle_phase deterministically encoding outcome) would not be present in real data, where lifecycle phases transition stochastically. Do not assume metrics transfer end-to-end.
9,200 alerts is a modest training set. The 1,380-alert test fold yields stable multi-seed metrics (std 0.007), but full confidence intervals for downstream production decisions should come from the full ~280k-alert product.

Notes on dataset schema

The CYB008 sample dataset README describes some fields differently from the actual schema. The model was trained on the actual schema; this note helps buyers reconcile what they read with what they receive.

What the README says	What the data actually contains
`incident_summary` has 8 columns	Data has 23 columns including incident_type, kill_chain_stages_observed, false_positive_rate, soar_actions_taken, etc.
`alert_severity` has 6 values (info / low / medium / high / critical / false_positive)	7 values: adds `duplicate_suppressed`. All values are suffixed (`high_severity`, `low_severity`, `critical_confirmed`, `informational`).
`analyst_tier` has 4 values (tier_1 / tier_2 / tier_3 / manager)	3 values on alerts (`L1_junior`, `L2_senior`, `L3_threat_hunter`); 4 on `soc_topology` (adds `L4_incident_commander`).
14 MITRE ATT&CK tactics	12 tactics in the data (no `reconnaissance` or `resource_development` from PRE-ATT&CK).
Detection source mix: edr, siem, ndr, ids, ueba, casb, deception, threat intel	Field is `alert_source` (not `detection_source`); 8 values: `edr_behavioural_engine`, `nids_signature`, `ueba_user_anomaly`, `cspm_cloud_rule`, `siem_correlation_rule`, `threat_intel_ioc_match`, `honeypot_trigger`, `itdr_identity_anomaly`.
`triage_score` / `enrichment_score` columns	Actual names: `raw_score` / `enriched_score`.
`alert_timestamp` (ISO string)	Actual: `alert_timestamp_min` (integer minutes from epoch).
`kill_chain_stage`, `storm_event_flag` columns on alerts	Not present in the data.
`resolution_outcome` values (true_positive / false_positive / duplicate / suppressed)	Actual 5 values: `auto_resolved_soar`, `duplicate_merged`, `false_positive_closed`, `true_positive_escalated`, `true_positive_remediated`.
Extra columns in data not in README	`shift_id`, `time_in_phase_minutes`, `queue_depth_at_ingestion`, `fatigue_score_at_alert`, `siem_platform`, `soar_playbook_id`, `detection_rule_id`, `alert_lifecycle_phase`

None of these affects model correctness — the feature pipeline uses the actual column names. If you build your own pipeline against the dataset, use the actual columns.

Intended use

Evaluating fit of the CYB008 dataset for your SOC-triage research
Baseline reference for new model architectures
Reference example of structural-leakage diagnostics in synthetic SOC datasets — the diagnostic methodology is reusable
Feature engineering reference for per-alert SOC telemetry

Out-of-scope use

Production SOC triage decisions on real telemetry
MITRE ATT&CK tactic prediction (this baseline establishes that task is unlearnable on the sample)
SLA-breach prediction (also tested as unlearnable on the sample — acc 0.68 vs majority 0.82)
Any operational decision affecting actual security operations without further validation on your own data

Reproducibility

Outputs above were produced with seed = 42 (published artifact), nested StratifiedShuffleSplit (70/15/15), on the published sample (xpertsystems/cyb008-sample, version 1.0.0, generated 2026-05-16). The feature pipeline in feature_engineering.py is deterministic and the trained weights in this repo correspond exactly to the metrics above.

Multi-seed results (seeds 42, 7, 13, 17, 23, 31, 45, 99, 123, 200) in multi_seed_results.json confirm robust performance across splits.

The training script itself is private to XpertSystems.

Files in this repo

File	Purpose
`model_xgb.json`	XGBoost weights (seed 42)
`model_mlp.safetensors`	PyTorch MLP weights (seed 42)
`feature_engineering.py`	Feature pipeline
`feature_meta.json`	Feature column order + categorical levels
`feature_scaler.json`	MLP input mean/std (XGBoost ignores)
`validation_results.json`	Per-class metrics, confusion matrix, architecture
`ablation_results.json`	Per-feature-group ablation
`multi_seed_results.json`	XGBoost metrics across 10 seeds
`leakage_diagnostic.json`	Structural-oracle audit + unlearnable-target finding
`inference_example.ipynb`	End-to-end inference demo notebook
`README.md`	This file

Contact and full product

The full CYB008 dataset contains ~335,000 rows across four files, with calibrated benchmark validation against 12 metrics drawn from authoritative SOC operations and threat intelligence sources (SANS SOC Survey, IBM Cost of Data Breach, Mandiant M-Trends, Forrester Wave SOAR, Gartner SIEM Magic Quadrant, SOC.OS, CrowdStrike, Splunk State of Security, Verizon DBIR). The full XpertSystems.ai synthetic data catalogue spans 41 SKUs across Cybersecurity, Healthcare, Insurance & Risk, Oil & Gas, and Materials & Energy.

📧 pradeep@xpertsystems.ai
🌐 https://xpertsystems.ai
🗂 Dataset: https://huggingface.co/datasets/xpertsystems/cyb008-sample
🤖 Companion models:
- https://huggingface.co/xpertsystems/cyb001-baseline-classifier (network traffic)
- https://huggingface.co/xpertsystems/cyb002-baseline-classifier (ATT&CK kill-chain)
- https://huggingface.co/xpertsystems/cyb003-baseline-classifier (malware execution phase)
- https://huggingface.co/xpertsystems/cyb004-baseline-classifier (phishing campaign phase)
- https://huggingface.co/xpertsystems/cyb005-baseline-classifier (ransomware actor-tier attribution)
- https://huggingface.co/xpertsystems/cyb006-baseline-classifier (user risk tier + leakage diagnostic)
- https://huggingface.co/xpertsystems/cyb007-baseline-classifier (insider threat type)

Citation

@misc{xpertsystems_cyb008_baseline_2026,
  title  = {CYB008 Baseline Classifier: XGBoost and MLP for SOC Alert Triage Outcome Classification, with Structural-Leakage and Unlearnable-Target Diagnostic},
  author = {XpertSystems.ai},
  year   = {2026},
  url    = {https://huggingface.co/xpertsystems/cyb008-baseline-classifier},
  note   = {Baseline reference model trained on xpertsystems/cyb008-sample}
}
