CYB008 Baseline Classifier
SOC alert triage classifier trained on the CYB008 synthetic SOC alert
sample. Predicts which of 5 triage outcome classes
(auto_resolved_soar / duplicate_merged / false_positive_closed /
true_positive_remediated / true_positive_escalated) an alert
will reach, from per-alert features. ALSO ships a leakage diagnostic
for the three structural-oracle columns dropped from the feature
pipeline.
Read this first. This repo ships two related artifacts: (1) a working baseline classifier for resolution_outcome (the primary product), and (2) a leakage_diagnostic.json file documenting (a) the three structural oracle columns that were dropped from the feature set, and (b) the separate finding that one of the README's suggested use cases, MITRE ATT&CK tactic classification, is not learnable on this sample. Both files matter; the diagnostic is required reading for anyone evaluating CYB008 for a triage product.
Model overview
| Property | Value |
|---|---|
| Primary task | 5-class resolution_outcome classification (SOC alert triage) |
| Secondary artifact | leakage_diagnostic.json — structural oracle + unlearnable-target audit |
| Training data | xpertsystems/cyb008-sample (9,200 alerts) |
| Models | XGBoost + PyTorch MLP |
| Input features | 53 (after one-hot encoding) |
| Split | Stratified random (no natural group key in this dataset — see rationale below) |
| Validation | Single seed (artifact) + multi-seed aggregate across 10 seeds |
| License | CC-BY-NC-4.0 (matches dataset) |
| Status | Reference baseline + leakage diagnostic |
Why this task — and what was dropped
The CYB008 README lists alert triage (TP vs FP prediction) as its first suggested use case and MITRE ATT&CK tactic classification as its second. We piloted both on the sample dataset:
Triage outcome: works honestly. After dropping 3 structural oracle columns, the model achieves acc 0.777 ± 0.007, ROC-AUC 0.955 ± 0.003 on 5-class classification. This is the primary baseline.
MITRE tactic classification: does NOT work on this sample. Without mitre_technique_id (which is a perfect ATT&CK-by-design oracle), the per-tactic feature distributions are nearly identical (raw_score 0.37–0.39 across all 12 tactics, similar for enriched score and fatigue). A trained XGBoost achieves accuracy 0.08, below the majority baseline of 0.14. The README's stated use case cannot be honestly demonstrated on the sample. See leakage_diagnostic.json for the full finding and our recommendation to the dataset author.
The three structural oracle columns (dropped)
CYB008 has three columns that structurally encode the
resolution_outcome label:
| Column | Oracle relationship |
|---|---|
| alert_lifecycle_phase | 3 of 4 values deterministically map to specific outcomes (auto_closed → auto_resolved_soar; escalated → true_positive_escalated; suppressed_duplicate → duplicate_merged) |
| automation_resolved | Exact 1:1 with auto_resolved_soar outcome |
| escalation_flag | 1319 escalation flags = 1319 true_positive_escalated outcomes (near-1:1) |
With all three present, plain XGBoost achieves 100% test accuracy across all seeds, which is mechanical rather than learned. With all three dropped, accuracy is 0.777 ± 0.007 with ROC-AUC 0.955 ± 0.003 across 10 seeds: real learning on a non-trivial 5-class task. The published baseline trains with these three columns excluded.
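A check of this kind can be sketched in a few lines. This is a minimal illustration, not the audit code behind leakage_diagnostic.json: the helper name `value_purity`, the toy rows, and the `analyst_closed` phase value are hypothetical; only the column names and the three oracle mappings come from the table above.

```python
from collections import Counter, defaultdict

def value_purity(rows, column, label="resolution_outcome"):
    """For each value of `column`, return (most common label, share of rows with it)."""
    by_value = defaultdict(Counter)
    for row in rows:
        by_value[row[column]][row[label]] += 1
    report = {}
    for value, counts in by_value.items():
        top_label, top_count = counts.most_common(1)[0]
        report[value] = (top_label, top_count / sum(counts.values()))
    return report

# Toy rows, for illustration only (not taken from the dataset).
rows = [
    {"alert_lifecycle_phase": "auto_closed", "resolution_outcome": "auto_resolved_soar"},
    {"alert_lifecycle_phase": "auto_closed", "resolution_outcome": "auto_resolved_soar"},
    {"alert_lifecycle_phase": "escalated", "resolution_outcome": "true_positive_escalated"},
    {"alert_lifecycle_phase": "analyst_closed", "resolution_outcome": "false_positive_closed"},
    {"alert_lifecycle_phase": "analyst_closed", "resolution_outcome": "true_positive_remediated"},
]

report = value_purity(rows, "alert_lifecycle_phase")
# A value whose rows all share one label acts as a deterministic oracle for that label.
oracle_values = sorted(v for v, (_, purity) in report.items() if purity == 1.0)
print(oracle_values)
```

A column where most values reach purity 1.0 is a candidate for exclusion; a column with mixed purities is more likely a legitimate observable.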
Two model artifacts are published. They are designed to be used together — disagreement is a useful triage signal:
- model_xgb.json: gradient-boosted trees
- model_mlp.safetensors: PyTorch MLP in SafeTensors format
On CYB008 the MLP slightly outperforms XGBoost on the test fold (0.767 vs 0.766 accuracy, 0.955 vs 0.952 ROC-AUC at seed 42) — only the second SKU in the XpertSystems baseline catalog where this happens (after CYB007).
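Using model disagreement as a triage signal can be sketched as follows. This is a minimal example under stated assumptions: the function name `triage_signal`, the alphabetical label order, and the two probability vectors are hypothetical. In practice the vectors would come from the two published models and the label order from INT_TO_LABEL.

```python
# Assumed label order for illustration; use INT_TO_LABEL from feature_engineering.py in practice.
LABELS = [
    "auto_resolved_soar",
    "duplicate_merged",
    "false_positive_closed",
    "true_positive_escalated",
    "true_positive_remediated",
]

def argmax(probs):
    """Index of the largest probability."""
    return max(range(len(probs)), key=probs.__getitem__)

def triage_signal(xgb_proba, mlp_proba, labels=LABELS):
    """Compare the top-1 class of both models and flag disagreement for review."""
    xgb_label = labels[argmax(xgb_proba)]
    mlp_label = labels[argmax(mlp_proba)]
    return {"xgb": xgb_label, "mlp": mlp_label, "needs_review": xgb_label != mlp_label}

# Hypothetical probability vectors for a single alert.
xgb_proba = [0.05, 0.02, 0.08, 0.45, 0.40]
mlp_proba = [0.04, 0.03, 0.10, 0.38, 0.45]
signal = triage_signal(xgb_proba, mlp_proba)
print(signal)
```

Here the two models split between the two true-positive classes, which is the confusion the per-class results identify as hardest, so the alert is flagged for a second look.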
Quick start
```bash
pip install xgboost torch safetensors pandas huggingface_hub
```

```python
import os
import sys

import numpy as np
import xgboost as xgb
from huggingface_hub import hf_hub_download

REPO = "xpertsystems/cyb008-baseline-classifier"
FILES = [
    "model_xgb.json", "model_mlp.safetensors",
    "feature_engineering.py", "feature_meta.json", "feature_scaler.json",
]
# Download each artifact and keep its local path.
paths = {name: hf_hub_download(REPO, name) for name in FILES}

# Make the bundled feature pipeline importable.
sys.path.insert(0, os.path.dirname(paths["feature_engineering.py"]))
from feature_engineering import transform_single, load_meta, INT_TO_LABEL

meta = load_meta(paths["feature_meta.json"])
xgb_model = xgb.XGBClassifier()
xgb_model.load_model(paths["model_xgb.json"])

# Predict (see inference_example.ipynb for the full pattern).
# my_alert_record is a placeholder for your own alert record.
# Note: do NOT include alert_lifecycle_phase, automation_resolved, or
# escalation_flag in your record; those were the oracle columns.
X = transform_single(my_alert_record, meta)
proba = xgb_model.predict_proba(X)[0]
print(INT_TO_LABEL[int(np.argmax(proba))])
```
See inference_example.ipynb for the full
copy-paste demo.
Training data
Trained on the public sample of CYB008, 9,200 per-alert records:
| Outcome | Alerts | Class share |
|---|---|---|
| false_positive_closed | 2,996 | 32.6% |
| auto_resolved_soar | 2,642 | 28.7% |
| true_positive_remediated | 1,848 | 20.1% |
| true_positive_escalated | 1,319 | 14.3% |
| duplicate_merged | 395 | 4.3% |
Stratified split (no natural group key)
CYB008 does not have a natural row-level group key for group-aware splitting:
- 25 analysts — group-aware split would yield only ~4 test analysts
- 5 SOCs — would yield 1 test SOC
- 589 incidents: only 9% of alerts have a non-null incident_id
Alerts are essentially independent given features, so we use StratifiedShuffleSplit (nested 70/15/15), the same approach as CYB001 for network flow classification:
| Fold | Alerts |
|---|---|
| Train | 6,440 |
| Validation | 1,380 |
| Test | 1,380 |
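The idea behind a stratified 70/15/15 split can be sketched with the standard library alone. This is a simplified stand-in for scikit-learn's StratifiedShuffleSplit, not the training code: the function name `stratified_split` is hypothetical, and because it rounds per class, fold sizes land within a few rows of the published 6,440 / 1,380 / 1,380 rather than matching them exactly.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=42):
    """Shuffle indices within each class, then cut each class into train/val/test."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test fold
    return train, val, test

# Class counts from the training-data table; labels are synthesized to match them.
counts = {
    "false_positive_closed": 2996,
    "auto_resolved_soar": 2642,
    "true_positive_remediated": 1848,
    "true_positive_escalated": 1319,
    "duplicate_merged": 395,
}
labels = [name for name, n in counts.items() for _ in range(n)]
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
```

Every fold keeps roughly the same class shares as the full sample, which is what lets the small duplicate_merged class appear in each fold.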
Class imbalance is addressed with class_weight='balanced' (XGBoost
sample_weight) and weighted cross-entropy (MLP).
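For reference, the common "balanced" heuristic assigns each class the weight n_samples / (n_classes × class_count). The sketch below applies it to the full-sample counts from the table above. That is an assumption made for illustration: in training the weights would be computed on the training fold, and the exact weighting used for the published models is not documented here.

```python
# Class counts from the training-data table (full sample, 9,200 alerts).
counts = {
    "false_positive_closed": 2996,
    "auto_resolved_soar": 2642,
    "true_positive_remediated": 1848,
    "true_positive_escalated": 1319,
    "duplicate_merged": 395,
}

n_total = sum(counts.values())
n_classes = len(counts)

# "Balanced" heuristic: rare classes get proportionally larger weights.
class_weight = {name: n_total / (n_classes * n) for name, n in counts.items()}

# Accuracy of always predicting the largest class, as a floor for comparison.
majority_baseline = max(counts.values()) / n_total

for name, weight in class_weight.items():
    print(f"{name:26s} weight={weight:.3f}")
print(f"majority baseline accuracy = {majority_baseline:.3f}")
```

For XGBoost, each row's sample weight would be the weight of its class; for the MLP, the same per-class values can serve as the class weights of the cross-entropy loss.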
Feature pipeline
The bundled feature_engineering.py is the canonical feature recipe.
53 features survive after encoding, drawn from:
- Per-alert numeric (9): raw_score, enriched_score, time_in_phase_minutes, queue_depth_at_ingestion, soar_playbook_triggered, sla_breached_flag, mttd_minutes, mttr_minutes, fatigue_score_at_alert
- Per-alert categorical (5, one-hot): alert_severity (7 values), alert_source (8 values), mitre_tactic (12 values), analyst_tier (3 values), siem_platform (8 values)
- Engineered (6): enrichment_lift, log_mttr, log_mttd, queue_pressure, enrichment_per_minute, is_high_confidence
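The 53-feature total can be checked from the three groups listed above. The sketch assumes full one-hot encoding with no dropped reference level, which is consistent with the stated total but is not confirmed against feature_engineering.py.

```python
numeric = [
    "raw_score", "enriched_score", "time_in_phase_minutes",
    "queue_depth_at_ingestion", "soar_playbook_triggered", "sla_breached_flag",
    "mttd_minutes", "mttr_minutes", "fatigue_score_at_alert",
]
# Number of levels per categorical column; each level becomes one one-hot column.
categorical_levels = {
    "alert_severity": 7,
    "alert_source": 8,
    "mitre_tactic": 12,
    "analyst_tier": 3,
    "siem_platform": 8,
}
engineered = [
    "enrichment_lift", "log_mttr", "log_mttd",
    "queue_pressure", "enrichment_per_minute", "is_high_confidence",
]

n_one_hot = sum(categorical_levels.values())
n_features = len(numeric) + n_one_hot + len(engineered)
print(len(numeric), n_one_hot, len(engineered), n_features)
```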
Excluded columns
Oracle columns (dropped to allow honest evaluation):
| Column | Why excluded |
|---|---|
| alert_lifecycle_phase | 3 of 4 values are deterministic outcome oracles |
| automation_resolved | 1:1 with auto_resolved_soar outcome |
| escalation_flag | Near-1:1 with true_positive_escalated outcome |
High-cardinality columns (dropped for tractability):
| Column | Why excluded |
|---|---|
| mitre_technique_id | 36 unique values; perfect oracle for mitre_tactic but unrelated to this target |
| detection_rule_id | 656 unique values; one-hot explosion with no real per-tactic affinity (only 5% of rules map to a single tactic) |
Partial-oracle features (kept as legitimate observables)
soar_playbook_triggered is a necessary but not sufficient condition
for auto_resolved_soar — when 0, the alert is never auto-resolved;
when 1, the outcome is auto-resolved 68% of the time but can also be
TP-remediated, TP-escalated, FP-closed, or duplicate-merged. This is
a legitimate observable that downstream operators would already have
on hand at decision time. KEPT in the pipeline.
Evaluation
Test-set metrics, seed 42 (n = 1,380 alerts)
XGBoost (the published model_xgb.json artifact)
| Metric | Value |
|---|---|
| Macro ROC-AUC (OvR) | 0.9522 |
| Accuracy | 0.7659 |
| Macro-F1 | 0.7430 |
| Weighted-F1 | 0.7672 |
MLP (the published model_mlp.safetensors artifact) — slightly outperforms XGBoost
| Metric | Value |
|---|---|
| Macro ROC-AUC (OvR) | 0.9552 |
| Accuracy | 0.7674 |
| Macro-F1 | 0.7510 |
| Weighted-F1 | 0.7691 |
With 6,440 training rows and 53 features, the MLP has enough data to compete favorably with boosted trees. Both models are published.
Multi-seed robustness (XGBoost, 10 seeds)
Very stable performance — std 0.007 on accuracy is among the tightest in the XpertSystems catalog:
| Metric | Mean | Std | Min | Max |
|---|---|---|---|---|
| Accuracy | 0.777 | 0.007 | 0.766 | 0.792 |
| Macro-F1 | 0.765 | 0.011 | 0.743 | 0.783 |
| Macro ROC-AUC OvR | 0.955 | 0.003 | 0.950 | 0.960 |
Full per-seed results in multi_seed_results.json.
All 10 seeds yielded all 5 classes in the test fold (stratified split
guarantees this).
Per-class F1 (seed 42)
| Outcome | Class share | XGBoost F1 | MLP F1 |
|---|---|---|---|
| false_positive_closed | 32.6% | 0.904 | 0.910 |
| duplicate_merged | 4.3% | 0.794 | 0.825 |
| auto_resolved_soar | 28.7% | 0.757 | 0.751 |
| true_positive_remediated | 20.1% | 0.701 | 0.698 |
| true_positive_escalated | 14.3% | 0.559 | 0.571 |
The model performs best on false_positive_closed (clearest behavioural
profile — low scores, fast resolution by L1 analysts) and
duplicate_merged (smallest class but distinctive — duplicate-suppressed
severity is a strong tell). The hardest discrimination is between
true_positive_remediated and true_positive_escalated — both are
genuine threats, differing primarily by whether the alert was closed
by the original analyst or passed to a higher tier. In production this
matters less because both are TP outcomes; binary TP-vs-FP recall is
much higher.
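The per-class table ties back to the headline metrics: macro-F1 is the unweighted mean of the five per-class F1 scores, and weighted-F1 weights each score by class share. The sketch below recomputes both from the rounded table values. It uses the sample-wide class shares as a proxy for test-fold support, so the weighted figures only approximately match the reported 0.7672 and 0.7691.

```python
# (class share, XGBoost F1, MLP F1) per outcome, copied from the table above.
per_class = {
    "false_positive_closed":    (0.326, 0.904, 0.910),
    "duplicate_merged":         (0.043, 0.794, 0.825),
    "auto_resolved_soar":       (0.287, 0.757, 0.751),
    "true_positive_remediated": (0.201, 0.701, 0.698),
    "true_positive_escalated":  (0.143, 0.559, 0.571),
}

def macro_f1(scores):
    """Unweighted mean: every class counts equally, however rare."""
    return sum(scores) / len(scores)

def weighted_f1(shares, scores):
    """Mean weighted by class share: frequent classes dominate."""
    return sum(s * f for s, f in zip(shares, scores)) / sum(shares)

shares = [v[0] for v in per_class.values()]
xgb_f1 = [v[1] for v in per_class.values()]
mlp_f1 = [v[2] for v in per_class.values()]

print("macro-F1    XGBoost / MLP:", round(macro_f1(xgb_f1), 3), round(macro_f1(mlp_f1), 3))
print("weighted-F1 XGBoost / MLP:",
      round(weighted_f1(shares, xgb_f1), 3), round(weighted_f1(shares, mlp_f1), 3))
```

Macro-F1 sits below weighted-F1 for both models because the weakest class, true_positive_escalated, counts as a full fifth of the macro average but only 14.3% of the weighted one.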
Ablation: which feature groups matter
| Configuration | Accuracy | Macro-F1 | ROC-AUC | Δ accuracy |
|---|---|---|---|---|
| Full feature set (published) | 0.7659 | 0.7430 | 0.9522 | — |
| No alert severity | 0.5138 | 0.3933 | 0.7304 | −0.2522 |
| No soar_playbook_triggered | 0.6188 | 0.5773 | 0.8369 | −0.1471 |
| No analyst tier | 0.7717 | 0.7471 | 0.9524 | +0.0058 |
| No siem platform | 0.7681 | 0.7474 | 0.9522 | +0.0022 |
| No alert source | 0.7638 | 0.7406 | 0.9511 | −0.0022 |
| No engineered features | 0.7681 | 0.7480 | 0.9533 | +0.0022 |
| No mitre_tactic | 0.7812 | 0.7656 | 0.9530 | +0.0152 |
| No timing features | 0.7775 | 0.7572 | 0.9547 | +0.0116 |
| No score features | 0.7710 | 0.7569 | 0.9541 | +0.0051 |
Four findings:
- Alert severity carries the dominant signal (drops 25 pp accuracy, 22 pp ROC-AUC). This is intuitive: severity directly drives triage priority, which drives outcome. false_positive severity → false_positive_closed; duplicate_suppressed severity → duplicate_merged.
- soar_playbook_triggered is the second-strongest signal (drops 15 pp accuracy). It's a partial oracle for the auto_resolved_soar outcome class.
- MITRE tactic and analyst tier contribute essentially nothing. The model performs marginally better without them; they add noise that the trees over-fit on the training set.
- Engineered features and timing features are near-flat. The trees recover composites from raw inputs. Kept in the pipeline as a documented baseline reference.
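The ranking behind these findings can be reproduced from the ablation table. The sketch below recomputes Δ accuracy from the rounded accuracies, so individual deltas can differ from the table in the fourth decimal; the dictionary keys are shorthand labels chosen for this example.

```python
BASELINE_ACCURACY = 0.7659  # full feature set, seed 42

# Test accuracy with one feature group removed, from the ablation table.
ablation_accuracy = {
    "alert_severity": 0.5138,
    "soar_playbook_triggered": 0.6188,
    "analyst_tier": 0.7717,
    "siem_platform": 0.7681,
    "alert_source": 0.7638,
    "engineered_features": 0.7681,
    "mitre_tactic": 0.7812,
    "timing_features": 0.7775,
    "score_features": 0.7710,
}

# Negative delta: removing the group hurts. Positive delta: removing it helps.
deltas = {name: acc - BASELINE_ACCURACY for name, acc in ablation_accuracy.items()}
ranked = sorted(deltas.items(), key=lambda item: item[1])

for name, delta in ranked:
    print(f"{name:26s} {delta:+.4f}")
```

Only the first two entries show a large negative delta; the remaining groups sit within about 1.5 pp of the baseline, which on a 1,380-alert test fold is small enough that single-seed differences should be read with caution.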
Architecture
XGBoost: multi-class gradient boosting (multi:softprob, 5 classes),
hist tree method, class-balanced sample weights, early stopping on
validation mlogloss.
MLP: 53 → 128 → 64 → 5, each hidden layer followed by BatchNorm1d
→ ReLU → Dropout(0.3), weighted cross-entropy loss, AdamW optimizer,
early stopping on validation macro-F1.
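To give a sense of scale, the trainable parameter count of the MLP follows from the layer sizes. The sketch assumes PyTorch defaults that are not stated above: every Linear layer has a bias, and every BatchNorm1d is affine (one scale and one shift per unit). Running statistics are buffers and are not counted.

```python
LAYER_SIZES = [53, 128, 64, 5]  # input, two hidden layers, output

def count_parameters(sizes):
    """Trainable parameters for Linear layers, plus BatchNorm1d after each hidden layer."""
    total = 0
    n_layers = len(sizes) - 1
    for i, (n_in, n_out) in enumerate(zip(sizes, sizes[1:])):
        total += n_in * n_out + n_out   # Linear weights + bias
        if i < n_layers - 1:            # hidden layers only, not the output layer
            total += 2 * n_out          # BatchNorm1d scale + shift
    return total

n_params = count_parameters(LAYER_SIZES)
print(n_params)
```

Under these assumptions the network is small relative to the 6,440 training rows, which fits its role as a reference baseline alongside the boosted trees.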
Training hyperparameters are held internally by XpertSystems.
Limitations
This is a baseline reference, not a production SOC triage system.
MITRE tactic classification is unlearnable on this sample. The README lists it as a suggested use case but the per-tactic feature distributions are too similar (raw_score 0.37–0.39 across all 12 tactics). See leakage_diagnostic.json for the full audit. Real SOC data has stronger per-tactic feature signatures.
TP-remediated vs TP-escalated is the hardest discrimination. F1 0.56 on TP-escalated is the weakest per-class result. Both are genuine threats; the difference is workflow rather than threat nature. For most operational uses (TP-vs-FP recall, SLA-breach reduction), this confusion does not matter.
MLP modestly outperforms XGBoost. Both are shipped; we recommend running both and treating disagreement as a triage signal. The gap is small enough that for production deployment, the choice between them is essentially an engineering preference.
Synthetic-vs-real transfer. The dataset is synthetic and calibrated against 12 benchmark metrics drawn from SOC-operations sources (SANS SOC Survey, IBM Cost of Data Breach, Mandiant M-Trends, Forrester Wave SOAR, Gartner SIEM Magic Quadrant, SOC.OS, CrowdStrike, Splunk State of Security, Verizon DBIR). Real SOC telemetry has different noise characteristics, and the structural-oracle pattern documented above (alert_lifecycle_phase deterministically encoding outcome) would not be present in real data, where lifecycle phases transition stochastically. Do not assume metrics transfer end-to-end.
9,200 alerts is a modest training set. The 1,380-alert test fold yields stable multi-seed metrics (std 0.007), but full confidence intervals for downstream production decisions should come from the full ~280k-alert product.
Notes on dataset schema
The CYB008 sample dataset README describes some fields differently from the actual schema. The model was trained on the actual schema; this note helps buyers reconcile what they read with what they receive.
| What the README says | What the data actually contains |
|---|---|
| incident_summary has 8 columns | Data has 23 columns including incident_type, kill_chain_stages_observed, false_positive_rate, soar_actions_taken, etc. |
| alert_severity has 6 values (info / low / medium / high / critical / false_positive) | 7 values: adds duplicate_suppressed. All values are suffixed (high_severity, low_severity, critical_confirmed, informational). |
| analyst_tier has 4 values (tier_1 / tier_2 / tier_3 / manager) | 3 values on alerts (L1_junior, L2_senior, L3_threat_hunter); 4 on soc_topology (adds L4_incident_commander). |
| 14 MITRE ATT&CK tactics | 12 tactics in the data (no reconnaissance or resource_development from PRE-ATT&CK). |
| Detection source mix: edr, siem, ndr, ids, ueba, casb, deception, threat intel | Field is alert_source (not detection_source); 8 values: edr_behavioural_engine, nids_signature, ueba_user_anomaly, cspm_cloud_rule, siem_correlation_rule, threat_intel_ioc_match, honeypot_trigger, itdr_identity_anomaly. |
| triage_score / enrichment_score columns | Actual names: raw_score / enriched_score. |
| alert_timestamp (ISO string) | Actual: alert_timestamp_min (integer minutes from epoch). |
| kill_chain_stage, storm_event_flag columns on alerts | Not present in the data. |
| resolution_outcome values (true_positive / false_positive / duplicate / suppressed) | Actual 5 values: auto_resolved_soar, duplicate_merged, false_positive_closed, true_positive_escalated, true_positive_remediated. |
| Extra columns in data not in README | shift_id, time_in_phase_minutes, queue_depth_at_ingestion, fatigue_score_at_alert, siem_platform, soar_playbook_id, detection_rule_id, alert_lifecycle_phase |
None of these affects model correctness — the feature pipeline uses the actual column names. If you build your own pipeline against the dataset, use the actual columns.
Intended use
- Evaluating fit of the CYB008 dataset for your SOC-triage research
- Baseline reference for new model architectures
- Reference example of structural-leakage diagnostics in synthetic SOC datasets — the diagnostic methodology is reusable
- Feature engineering reference for per-alert SOC telemetry
Out-of-scope use
- Production SOC triage decisions on real telemetry
- MITRE ATT&CK tactic prediction (this baseline establishes that task is unlearnable on the sample)
- SLA-breach prediction (also tested as unlearnable on the sample — acc 0.68 vs majority 0.82)
- Any operational decision affecting actual security operations without further validation on your own data
Reproducibility
Outputs above were produced with seed = 42 (published artifact),
nested StratifiedShuffleSplit (70/15/15), on the published sample
(xpertsystems/cyb008-sample, version 1.0.0, generated 2026-05-16).
The feature pipeline in feature_engineering.py is deterministic and
the trained weights in this repo correspond exactly to the metrics
above.
Multi-seed results (seeds 42, 7, 13, 17, 23, 31, 45, 99, 123, 200)
in multi_seed_results.json confirm robust performance across splits.
The training script itself is private to XpertSystems.
Files in this repo
| File | Purpose |
|---|---|
| model_xgb.json | XGBoost weights (seed 42) |
| model_mlp.safetensors | PyTorch MLP weights (seed 42) |
| feature_engineering.py | Feature pipeline |
| feature_meta.json | Feature column order + categorical levels |
| feature_scaler.json | MLP input mean/std (XGBoost ignores) |
| validation_results.json | Per-class metrics, confusion matrix, architecture |
| ablation_results.json | Per-feature-group ablation |
| multi_seed_results.json | XGBoost metrics across 10 seeds |
| leakage_diagnostic.json | Structural-oracle audit + unlearnable-target finding |
| inference_example.ipynb | End-to-end inference demo notebook |
| README.md | This file |
Contact and full product
The full CYB008 dataset contains ~335,000 rows across four files, with calibrated benchmark validation against 12 metrics drawn from authoritative SOC operations and threat intelligence sources (SANS SOC Survey, IBM Cost of Data Breach, Mandiant M-Trends, Forrester Wave SOAR, Gartner SIEM Magic Quadrant, SOC.OS, CrowdStrike, Splunk State of Security, Verizon DBIR). The full XpertSystems.ai synthetic data catalogue spans 41 SKUs across Cybersecurity, Healthcare, Insurance & Risk, Oil & Gas, and Materials & Energy.
- 📧 pradeep@xpertsystems.ai
- 🌐 https://xpertsystems.ai
- 🗂 Dataset: https://huggingface.co/datasets/xpertsystems/cyb008-sample
- 🤖 Companion models:
- https://huggingface.co/xpertsystems/cyb001-baseline-classifier (network traffic)
- https://huggingface.co/xpertsystems/cyb002-baseline-classifier (ATT&CK kill-chain)
- https://huggingface.co/xpertsystems/cyb003-baseline-classifier (malware execution phase)
- https://huggingface.co/xpertsystems/cyb004-baseline-classifier (phishing campaign phase)
- https://huggingface.co/xpertsystems/cyb005-baseline-classifier (ransomware actor-tier attribution)
- https://huggingface.co/xpertsystems/cyb006-baseline-classifier (user risk tier + leakage diagnostic)
- https://huggingface.co/xpertsystems/cyb007-baseline-classifier (insider threat type)
Citation
@misc{xpertsystems_cyb008_baseline_2026,
title = {CYB008 Baseline Classifier: XGBoost and MLP for SOC Alert Triage Outcome Classification, with Structural-Leakage and Unlearnable-Target Diagnostic},
author = {XpertSystems.ai},
year = {2026},
url = {https://huggingface.co/xpertsystems/cyb008-baseline-classifier},
note = {Baseline reference model trained on xpertsystems/cyb008-sample}
}