CYB008 Baseline Classifier

SOC alert triage classifier trained on the CYB008 synthetic SOC alert sample. Predicts which of 5 triage outcome classes (auto_resolved_soar / duplicate_merged / false_positive_closed / true_positive_remediated / true_positive_escalated) an alert will reach, from per-alert features. ALSO ships a leakage diagnostic for the three structural-oracle columns dropped from the feature pipeline.

Read this first. This repo ships two related artifacts: (1) a working baseline classifier for resolution_outcome (the primary product), and (2) a leakage_diagnostic.json file documenting (a) the three structural oracle columns that were dropped from the feature set, and (b) the separate finding that MITRE ATT&CK tactic classification, the second use case suggested in the CYB008 dataset README, is not learnable on this sample. Both artifacts matter; the diagnostic is required reading for anyone evaluating CYB008 for a triage product.

Model overview

| Property | Value |
|---|---|
| Primary task | 5-class resolution_outcome classification (SOC alert triage) |
| Secondary artifact | leakage_diagnostic.json: structural oracle + unlearnable-target audit |
| Training data | xpertsystems/cyb008-sample (9,200 alerts) |
| Models | XGBoost + PyTorch MLP |
| Input features | 53 (after one-hot encoding) |
| Split | Stratified random (no natural group key in this dataset; see rationale below) |
| Validation | Single seed (artifact) + multi-seed aggregate across 10 seeds |
| License | CC-BY-NC-4.0 (matches dataset) |
| Status | Reference baseline + leakage diagnostic |

Why this task — and what was dropped

The CYB008 README lists alert triage (TP vs FP prediction) as its first suggested use case and MITRE ATT&CK tactic classification as its second. We piloted both on the sample dataset:

  • Triage outcome: works honestly. After dropping 3 structural oracle columns, the model achieves acc 0.777 ± 0.007, ROC-AUC 0.955 ± 0.003 on 5-class classification. This is the primary baseline.

  • MITRE tactic classification: does NOT work on this sample. Without mitre_technique_id (which is a perfect ATT&CK-by-design oracle), the per-tactic feature distributions are nearly identical (raw_score 0.37–0.39 across all 12 tactics, similar for enriched score and fatigue). A trained XGBoost achieves accuracy 0.08, below the majority baseline of 0.14. The README's stated use case cannot be honestly demonstrated on the sample. See leakage_diagnostic.json for the full finding and our recommendation to the dataset author.

The three structural oracle columns (dropped)

CYB008 has three columns that structurally encode the resolution_outcome label:

| Column | Oracle relationship |
|---|---|
| alert_lifecycle_phase | 3 of 4 values deterministically map to specific outcomes (auto_closed → auto_resolved_soar; escalated → true_positive_escalated; suppressed_duplicate → duplicate_merged) |
| automation_resolved | Exact 1:1 with auto_resolved_soar outcome |
| escalation_flag | 1,319 escalation flags = 1,319 true_positive_escalated outcomes (near-1:1) |

With all three present, plain XGBoost achieves 100% test accuracy across all seeds: the result is mechanical, not learned. With all three dropped, accuracy is 0.777 ± 0.007 with ROC-AUC 0.955 ± 0.003 across 10 seeds: real learning on a non-trivial 5-class task. The published baseline trains with these three columns excluded.
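One way to surface columns like these is to check, for each value of a candidate column, how concentrated the label is. The sketch below is only an illustration on a toy frame, not the audit behind leakage_diagnostic.json. The helper name oracle_purity, the toy rows, and the phase value analyst_review (a stand-in for the unnamed fourth value) are assumptions.

```python
import pandas as pd


def oracle_purity(df: pd.DataFrame, column: str, label: str) -> pd.Series:
    """For each value of `column`, return the share of its most common label.

    A share of 1.0 means that value maps to exactly one outcome in the data.
    """
    table = pd.crosstab(df[column], df[label], normalize="index")
    return table.max(axis=1)


# Toy rows for illustration only; these are not CYB008 records.
# "analyst_review" is a made-up name for the fourth lifecycle phase value.
toy = pd.DataFrame({
    "alert_lifecycle_phase": [
        "auto_closed", "auto_closed", "escalated",
        "suppressed_duplicate", "analyst_review", "analyst_review",
    ],
    "resolution_outcome": [
        "auto_resolved_soar", "auto_resolved_soar", "true_positive_escalated",
        "duplicate_merged", "false_positive_closed", "true_positive_remediated",
    ],
})

purity = oracle_purity(toy, "alert_lifecycle_phase", "resolution_outcome")
deterministic = purity[purity == 1.0].index.tolist()
print(purity)
print("values that fully determine the outcome:", deterministic)
```

On real data a threshold slightly below 1.0 may be more useful, since a near-1:1 column such as escalation_flag can leak almost as much as an exact one.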

Two model artifacts are published. They are designed to be used together — disagreement is a useful triage signal:

  • model_xgb.json — gradient-boosted trees
  • model_mlp.safetensors — PyTorch MLP in SafeTensors format

On CYB008 the MLP slightly outperforms XGBoost on the test fold (0.767 vs 0.766 accuracy, 0.955 vs 0.952 ROC-AUC at seed 42) — only the second SKU in the XpertSystems baseline catalog where this happens (after CYB007).
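As a rough illustration of using the two artifacts together, the sketch below flags alerts where the two models pick different classes. The probability arrays are made up, and the helper name disagreement_mask is ours; in practice the arrays would come from each model's predicted class probabilities over the same alerts and the same class order.

```python
import numpy as np


def disagreement_mask(proba_a: np.ndarray, proba_b: np.ndarray) -> np.ndarray:
    """Return a boolean mask that is True where the two models pick different classes."""
    return proba_a.argmax(axis=1) != proba_b.argmax(axis=1)


# Made-up probabilities for 3 alerts over 5 classes (each row sums to 1).
proba_xgb = np.array([
    [0.70, 0.05, 0.10, 0.10, 0.05],
    [0.10, 0.05, 0.15, 0.40, 0.30],
    [0.05, 0.80, 0.05, 0.05, 0.05],
])
proba_mlp = np.array([
    [0.65, 0.05, 0.15, 0.10, 0.05],
    [0.10, 0.05, 0.10, 0.30, 0.45],
    [0.05, 0.75, 0.10, 0.05, 0.05],
])

# Alerts where the models disagree are candidates for manual review.
needs_review = disagreement_mask(proba_xgb, proba_mlp)
print(needs_review)
```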

Quick start

Install the dependencies:

pip install xgboost torch safetensors pandas numpy huggingface_hub

Then download the artifacts and run a prediction:

import json
import os
import sys

import numpy as np
import torch
import xgboost as xgb
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

REPO = "xpertsystems/cyb008-baseline-classifier"

# Download the model weights and the feature pipeline from the Hub.
paths = {n: hf_hub_download(REPO, n) for n in [
    "model_xgb.json", "model_mlp.safetensors",
    "feature_engineering.py", "feature_meta.json", "feature_scaler.json",
]}

# feature_engineering.py ships as a plain file, so make its folder importable.
sys.path.insert(0, os.path.dirname(paths["feature_engineering.py"]))
from feature_engineering import transform_single, load_meta, INT_TO_LABEL

meta = load_meta(paths["feature_meta.json"])
xgb_model = xgb.XGBClassifier()
xgb_model.load_model(paths["model_xgb.json"])

# Predict (see inference_example.ipynb for the full pattern).
# my_alert_record is your own alert record; it is not defined in this snippet.
# Do NOT include alert_lifecycle_phase, automation_resolved, or
# escalation_flag in your record: those were the oracle columns.
X = transform_single(my_alert_record, meta)
proba = xgb_model.predict_proba(X)[0]
print(INT_TO_LABEL[int(np.argmax(proba))])

See inference_example.ipynb for the full copy-paste demo.

Training data

Trained on the public sample of CYB008, 9,200 per-alert records:

| Outcome | Alerts | Class share |
|---|---|---|
| false_positive_closed | 2,996 | 32.6% |
| auto_resolved_soar | 2,642 | 28.7% |
| true_positive_remediated | 1,848 | 20.1% |
| true_positive_escalated | 1,319 | 14.3% |
| duplicate_merged | 395 | 4.3% |
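The class shares follow directly from the counts, and the largest share is also the accuracy that an always-predict-the-majority baseline would reach on this task. A short check in plain Python (the variable names are ours):

```python
# Outcome counts copied from the table above.
counts = {
    "false_positive_closed": 2996,
    "auto_resolved_soar": 2642,
    "true_positive_remediated": 1848,
    "true_positive_escalated": 1319,
    "duplicate_merged": 395,
}

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}

# A classifier that always predicts the most common outcome scores this accuracy.
majority_class = max(counts, key=counts.get)
majority_baseline = shares[majority_class]

for name, share in shares.items():
    print(f"{name:28s} {share:6.1%}")
print(f"majority baseline: {majority_class} at {majority_baseline:.3f}")
```

This puts the reported accuracy figures in context: the comparison point for a 5-class model here is roughly 0.33, not 0.20.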

Stratified split (no natural group key)

CYB008 does not have a natural row-level group key for group-aware splitting:

  • 25 analysts — group-aware split would yield only ~4 test analysts
  • 5 SOCs — would yield 1 test SOC
  • 589 incidents — only 9% of alerts have a non-null incident_id

Alerts are essentially independent given features, so we use StratifiedShuffleSplit (nested 70/15/15), the same approach as CYB001 for network flow classification:

| Fold | Alerts |
|---|---|
| Train | 6,440 |
| Validation | 1,380 |
| Test | 1,380 |
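A nested 70/15/15 stratified split can be sketched with scikit-learn as two chained StratifiedShuffleSplit calls. This is a minimal illustration and not the private training script: the random_state values, the two-stage 30% then 50% layout, and the placeholder feature matrix are assumptions. The labels are synthesised from the class counts in the training-data table.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Label vector built from the published class counts (9,200 alerts in total).
class_counts = [2996, 2642, 1848, 1319, 395]
y = np.repeat(np.arange(5), class_counts)
X = np.zeros((len(y), 1))  # placeholder features; only the labels drive the split

# Stage 1: hold out 30% of the rows, stratified by outcome.
outer = StratifiedShuffleSplit(n_splits=1, test_size=0.30, random_state=42)
train_idx, holdout_idx = next(outer.split(X, y))

# Stage 2: split the hold-out in half into validation and test folds.
inner = StratifiedShuffleSplit(n_splits=1, test_size=0.50, random_state=42)
val_rel, test_rel = next(inner.split(X[holdout_idx], y[holdout_idx]))
val_idx, test_idx = holdout_idx[val_rel], holdout_idx[test_rel]

print(len(train_idx), len(val_idx), len(test_idx))
```

Under these assumptions the fold sizes come out as 6,440 / 1,380 / 1,380, matching the table, and every fold contains all 5 classes.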

Class imbalance is addressed with balanced class weights, applied as per-row sample_weight in XGBoost and as a weighted cross-entropy loss in the MLP.
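For reference, the usual "balanced" heuristic sets each class weight to n_samples / (n_classes × class_count). The sketch below computes it with NumPy from the full-sample counts; using the full sample rather than the training fold, and the variable names, are assumptions made for illustration.

```python
import numpy as np

# Counts in table order: false_positive_closed, auto_resolved_soar,
# true_positive_remediated, true_positive_escalated, duplicate_merged.
class_counts = np.array([2996, 2642, 1848, 1319, 395])
y = np.repeat(np.arange(len(class_counts)), class_counts)

# "balanced" heuristic: weight_c = n_samples / (n_classes * count_c)
class_weight = len(y) / (len(class_counts) * class_counts)

# Per-row weights: each alert inherits the weight of its class.
sample_weight = class_weight[y]

print(np.round(class_weight, 3))
```

With these weights every class contributes the same total weight (9,200 / 5 = 1,840), so the rare duplicate_merged class counts as much in the loss as false_positive_closed. A per-row vector like sample_weight is what a fit call accepting sample_weight expects, while a per-class vector like class_weight is the shape a weighted cross-entropy loss expects.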

Feature pipeline

The bundled feature_engineering.py is the canonical feature recipe. 53 features survive after encoding, drawn from:

  • Per-alert numeric (9): raw_score, enriched_score, time_in_phase_minutes, queue_depth_at_ingestion, soar_playbook_triggered, sla_breached_flag, mttd_minutes, mttr_minutes, fatigue_score_at_alert
  • Per-alert categorical (5, one-hot): alert_severity (7 values), alert_source (8 values), mitre_tactic (12 values), analyst_tier (3 values), siem_platform (8 values)
  • Engineered (6): enrichment_lift, log_mttr, log_mttd, queue_pressure, enrichment_per_minute, is_high_confidence
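The 53-feature total can be checked from the list above: 9 numeric columns, 38 one-hot columns (7 + 8 + 12 + 3 + 8), and 6 engineered columns. The sketch below repeats that arithmetic and shows one-hot encoding against a fixed level list. The helper one_hot and its handling of unseen values (an all-zero vector) are assumptions; the canonical behaviour is whatever feature_engineering.py does.

```python
import numpy as np

n_numeric = 9
categorical_levels = {
    "alert_severity": 7,
    "alert_source": 8,
    "mitre_tactic": 12,
    "analyst_tier": 3,
    "siem_platform": 8,
}
n_engineered = 6

n_features = n_numeric + sum(categorical_levels.values()) + n_engineered
print(n_features)

# Fixed level order for one categorical column (values taken from the schema notes).
ANALYST_TIERS = ["L1_junior", "L2_senior", "L3_threat_hunter"]


def one_hot(value: str, levels: list) -> np.ndarray:
    """Encode `value` against a fixed level list; unseen values map to all zeros."""
    vec = np.zeros(len(levels), dtype=np.float32)
    if value in levels:
        vec[levels.index(value)] = 1.0
    return vec


print(one_hot("L2_senior", ANALYST_TIERS))
```

Encoding against a stored level list, rather than the levels seen in a single record, keeps the column order stable between training and inference.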

Excluded columns

Oracle columns (dropped to allow honest evaluation):

| Column | Why excluded |
|---|---|
| alert_lifecycle_phase | 3 of 4 values are deterministic outcome oracles |
| automation_resolved | 1:1 with auto_resolved_soar outcome |
| escalation_flag | Near-1:1 with true_positive_escalated outcome |

High-cardinality columns (dropped for tractability):

| Column | Why excluded |
|---|---|
| mitre_technique_id | 36 unique values; perfect oracle for mitre_tactic but unrelated to this target |
| detection_rule_id | 656 unique values; one-hot explosion with no real per-tactic affinity (only 5% of rules map to a single tactic) |

Partial-oracle features (kept as legitimate observables)

soar_playbook_triggered is a necessary but not sufficient condition for auto_resolved_soar — when 0, the alert is never auto-resolved; when 1, the outcome is auto-resolved 68% of the time but can also be TP-remediated, TP-escalated, FP-closed, or duplicate-merged. This is a legitimate observable that downstream operators would already have on hand at decision time. KEPT in the pipeline.
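The "necessary but not sufficient" relationship can be checked with two conditional rates: the target outcome should never occur when the flag is 0, and should occur in only some cases when the flag is 1. The sketch below runs that check on made-up rows; the helper outcome_rates and the toy data are assumptions, so the 75% rate it prints is not the 68% measured on CYB008.

```python
from collections import Counter


def outcome_rates(flags, outcomes, flag_value):
    """Share of each outcome among rows where the flag equals `flag_value`."""
    selected = [o for f, o in zip(flags, outcomes) if f == flag_value]
    if not selected:
        return {}
    counts = Counter(selected)
    return {outcome: n / len(selected) for outcome, n in counts.items()}


# Toy rows for illustration only; these are not CYB008 records.
flags = [0, 0, 0, 1, 1, 1, 1]
outcomes = [
    "false_positive_closed", "true_positive_remediated", "duplicate_merged",
    "auto_resolved_soar", "auto_resolved_soar", "auto_resolved_soar",
    "true_positive_escalated",
]

target = "auto_resolved_soar"
rate_when_off = outcome_rates(flags, outcomes, 0).get(target, 0.0)
rate_when_on = outcome_rates(flags, outcomes, 1).get(target, 0.0)

necessary = rate_when_off == 0.0   # target never occurs without the flag
sufficient = rate_when_on == 1.0   # target always occurs with the flag
print(necessary, sufficient, rate_when_on)
```

A column that came out both necessary and sufficient would be a full oracle and would belong with the dropped columns instead.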

Evaluation

Test-set metrics, seed 42 (n = 1,380 alerts)

XGBoost (the published model_xgb.json artifact)

| Metric | Value |
|---|---|
| Macro ROC-AUC (OvR) | 0.9522 |
| Accuracy | 0.7659 |
| Macro-F1 | 0.7430 |
| Weighted-F1 | 0.7672 |

MLP (the published model_mlp.safetensors artifact) — slightly outperforms XGBoost

| Metric | Value |
|---|---|
| Macro ROC-AUC (OvR) | 0.9552 |
| Accuracy | 0.7674 |
| Macro-F1 | 0.7510 |
| Weighted-F1 | 0.7691 |

With 6,440 training rows and 53 features, the MLP has enough data to compete favorably with boosted trees. Both models are published.

Multi-seed robustness (XGBoost, 10 seeds)

Very stable performance — std 0.007 on accuracy is among the tightest in the XpertSystems catalog:

| Metric | Mean | Std | Min | Max |
|---|---|---|---|---|
| Accuracy | 0.777 | 0.007 | 0.766 | 0.792 |
| Macro-F1 | 0.765 | 0.011 | 0.743 | 0.783 |
| Macro ROC-AUC OvR | 0.955 | 0.003 | 0.950 | 0.960 |

Full per-seed results in multi_seed_results.json. All 10 seeds yielded all 5 classes in the test fold (stratified split guarantees this).

Per-class F1 (seed 42)

| Outcome | Class share | XGBoost F1 | MLP F1 |
|---|---|---|---|
| false_positive_closed | 32.6% | 0.904 | 0.910 |
| duplicate_merged | 4.3% | 0.794 | 0.825 |
| auto_resolved_soar | 28.7% | 0.757 | 0.751 |
| true_positive_remediated | 20.1% | 0.701 | 0.698 |
| true_positive_escalated | 14.3% | 0.559 | 0.571 |

The model performs best on false_positive_closed (clearest behavioural profile — low scores, fast resolution by L1 analysts) and duplicate_merged (smallest class but distinctive — duplicate-suppressed severity is a strong tell). The hardest discrimination is between true_positive_remediated and true_positive_escalated — both are genuine threats, differing primarily by whether the alert was closed by the original analyst or passed to a higher tier. In production this matters less because both are TP outcomes; binary TP-vs-FP recall is much higher.
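The seed-42 macro-F1 and weighted-F1 figures can be approximately recovered from the per-class table: macro-F1 is the unweighted mean of the per-class F1 scores, and weighted-F1 is their mean weighted by class support. The check below uses the overall class share as a stand-in for test-fold support, which is an assumption that the stratified split makes reasonable, so the weighted values agree only to about three decimals.

```python
# outcome: (class share, XGBoost F1, MLP F1), copied from the per-class table.
per_class = {
    "false_positive_closed": (0.326, 0.904, 0.910),
    "duplicate_merged": (0.043, 0.794, 0.825),
    "auto_resolved_soar": (0.287, 0.757, 0.751),
    "true_positive_remediated": (0.201, 0.701, 0.698),
    "true_positive_escalated": (0.143, 0.559, 0.571),
}

rows = list(per_class.values())
total_share = sum(share for share, _, _ in rows)

# Macro-F1: every class counts equally, regardless of size.
macro_xgb = sum(xgb_f1 for _, xgb_f1, _ in rows) / len(rows)
macro_mlp = sum(mlp_f1 for _, _, mlp_f1 in rows) / len(rows)

# Weighted-F1: classes count in proportion to their share of alerts.
weighted_xgb = sum(share * xgb_f1 for share, xgb_f1, _ in rows) / total_share
weighted_mlp = sum(share * mlp_f1 for share, _, mlp_f1 in rows) / total_share

print(f"XGBoost macro {macro_xgb:.3f}, weighted {weighted_xgb:.3f}")
print(f"MLP     macro {macro_mlp:.3f}, weighted {weighted_mlp:.3f}")
```

Macro-F1 sits below weighted-F1 for both models because the weakest class, true_positive_escalated, carries a full one-fifth of the macro average but only about 14% of the weighted one.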

Ablation: which feature groups matter

| Configuration | Accuracy | Macro-F1 | ROC-AUC | Δ accuracy |
|---|---|---|---|---|
| Full feature set (published) | 0.7659 | 0.7430 | 0.9522 | |
| No alert severity | 0.5138 | 0.3933 | 0.7304 | −0.2522 |
| No soar_playbook_triggered | 0.6188 | 0.5773 | 0.8369 | −0.1471 |
| No analyst tier | 0.7717 | 0.7471 | 0.9524 | +0.0058 |
| No siem platform | 0.7681 | 0.7474 | 0.9522 | +0.0022 |
| No alert source | 0.7638 | 0.7406 | 0.9511 | −0.0022 |
| No engineered features | 0.7681 | 0.7480 | 0.9533 | +0.0022 |
| No mitre_tactic | 0.7812 | 0.7656 | 0.9530 | +0.0152 |
| No timing features | 0.7775 | 0.7572 | 0.9547 | +0.0116 |
| No score features | 0.7710 | 0.7569 | 0.9541 | +0.0051 |

Four findings:

  1. Alert severity carries the dominant signal (drops 25 pp accuracy, 22 pp ROC-AUC). This is intuitive: severity directly drives triage priority, which drives outcome. false_positive severity → false_positive_closed; duplicate_suppressed severity → duplicate_merged.
  2. soar_playbook_triggered is the second-strongest signal (drops 15 pp accuracy). It's a partial oracle for the auto_resolved_soar outcome class.
  3. MITRE tactic and analyst tier contribute essentially nothing. The model performs marginally better without them — they add noise that the trees over-fit on the training set.
  4. Engineered features and timing features are near-flat. The trees recover composites from raw inputs. Kept in the pipeline as a documented baseline reference.
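The Δ accuracy column is the ablated accuracy minus the full-feature accuracy. The sketch below recomputes it in percentage points from the rounded table values and ranks the feature groups. Because it starts from four-decimal figures, the last digit can differ slightly from the table, which was presumably computed from unrounded values.

```python
baseline_acc = 0.7659  # full feature set, seed 42

# Accuracy with one feature group removed, copied from the ablation table.
ablated_acc = {
    "No alert severity": 0.5138,
    "No soar_playbook_triggered": 0.6188,
    "No analyst tier": 0.7717,
    "No siem platform": 0.7681,
    "No alert source": 0.7638,
    "No engineered features": 0.7681,
    "No mitre_tactic": 0.7812,
    "No timing features": 0.7775,
    "No score features": 0.7710,
}

# Negative values mean the model got worse without the group.
delta_pp = {
    name: round((acc - baseline_acc) * 100, 2)
    for name, acc in ablated_acc.items()
}

# Most harmful removals first.
ranked = sorted(delta_pp.items(), key=lambda item: item[1])
for name, delta in ranked:
    print(f"{name:28s} {delta:+6.2f} pp")
```

Only the first two rows move accuracy by more than about 1.5 points. The remaining differences are small next to the 0.007 multi-seed standard deviation, so their signs should not be over-read.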

Architecture

XGBoost: multi-class gradient boosting (multi:softprob, 5 classes), hist tree method, class-balanced sample weights, early stopping on validation mlogloss.

MLP: 53 → 128 → 64 → 5, each hidden layer followed by BatchNorm1d → ReLU → Dropout(0.3), weighted cross-entropy loss, AdamW optimizer, early stopping on validation macro-F1.

Training hyperparameters are held internally by XpertSystems.
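For readers who want to see the MLP shape in code, the following is a minimal PyTorch sketch of a 53 → 128 → 64 → 5 network with BatchNorm1d, ReLU and Dropout(0.3) after each hidden layer. It is reconstructed from the description above only. The function name build_mlp and the use of nn.Sequential are assumptions, and the layer names here may not match the keys stored in model_mlp.safetensors.

```python
import torch
from torch import nn


def build_mlp(n_features: int = 53, n_classes: int = 5, dropout: float = 0.3) -> nn.Sequential:
    """Build a 53 -> 128 -> 64 -> 5 MLP with BatchNorm1d, ReLU and Dropout per hidden layer."""

    def block(n_in: int, n_out: int) -> list:
        return [nn.Linear(n_in, n_out), nn.BatchNorm1d(n_out), nn.ReLU(), nn.Dropout(dropout)]

    return nn.Sequential(*block(n_features, 128), *block(128, 64), nn.Linear(64, n_classes))


model = build_mlp()
model.eval()  # inference mode: dropout off, BatchNorm uses running statistics

with torch.no_grad():
    logits = model(torch.zeros(2, 53))    # two dummy, already-scaled feature rows
    proba = torch.softmax(logits, dim=1)  # class probabilities per row

print(tuple(proba.shape))
```

The weights here are randomly initialised, so the output is only a shape check. Real inputs would also need the mean/std scaling stored in feature_scaler.json before being passed to the network.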

Limitations

This is a baseline reference, not a production SOC triage system.

  1. MITRE tactic classification is unlearnable on this sample. The README lists it as a suggested use case but the per-tactic feature distributions are too similar (raw_score 0.37–0.39 across all 12 tactics). See leakage_diagnostic.json for the full audit. Real SOC data has stronger per-tactic feature signatures.

  2. TP-remediated vs TP-escalated is the hardest discrimination. F1 0.56 on TP-escalated is the weakest per-class result. Both are genuine threats; the difference is workflow rather than threat nature. For most operational uses (TP-vs-FP recall, SLA-breach reduction), this confusion does not matter.

  3. MLP modestly outperforms XGBoost. Both are shipped; we recommend running both and treating disagreement as a triage signal. The margin is small enough that, for production deployment, the choice between them is essentially an engineering preference.

  4. Synthetic-vs-real transfer. The dataset is synthetic and calibrated against 12 benchmark metrics drawn from SOC-operations sources (SANS SOC Survey, IBM Cost of Data Breach, Mandiant M-Trends, Forrester Wave SOAR, Gartner SIEM Magic Quadrant, SOC.OS, CrowdStrike, Splunk State of Security, Verizon DBIR). Real SOC telemetry has different noise characteristics, and the structural-oracle pattern documented above (alert_lifecycle_phase deterministically encoding outcome) would not be present in real data, where lifecycle phases transition stochastically. Do not assume metrics transfer end-to-end.

  5. 9,200 alerts is a modest training set. The 1,380-alert test fold yields stable multi-seed metrics (std 0.007), but full confidence intervals for downstream production decisions should come from the full ~280k-alert product.
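Regarding limitation 2: if only the TP-vs-not-TP decision matters, the 5-class probabilities can be collapsed by summing the two true-positive classes. The sketch below shows the idea on made-up probabilities. The class order in LABELS, the helper name, and the 0.5 threshold are assumptions; the real index-to-label mapping is INT_TO_LABEL in feature_engineering.py.

```python
import numpy as np

# Hypothetical class order, for illustration only.
LABELS = [
    "auto_resolved_soar",
    "duplicate_merged",
    "false_positive_closed",
    "true_positive_remediated",
    "true_positive_escalated",
]
TP_CLASSES = {"true_positive_remediated", "true_positive_escalated"}


def true_positive_probability(proba: np.ndarray, labels=LABELS) -> np.ndarray:
    """Sum the probability mass of the two true-positive classes for each alert."""
    tp_cols = [i for i, name in enumerate(labels) if name in TP_CLASSES]
    return proba[:, tp_cols].sum(axis=1)


# Made-up probabilities for 2 alerts (each row sums to 1).
proba = np.array([
    [0.05, 0.05, 0.10, 0.45, 0.35],  # split between the two TP classes
    [0.10, 0.05, 0.70, 0.10, 0.05],  # mostly false_positive_closed
])

p_tp = true_positive_probability(proba)
is_tp = p_tp >= 0.5
print(p_tp, is_tp)
```

In the first row the model is unsure whether the alert is remediated or escalated, yet the combined true-positive probability is high, which is why confusion between those two classes costs little in a binary view.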

Notes on dataset schema

The CYB008 sample dataset README describes some fields differently from the actual schema. The model was trained on the actual schema; this note helps buyers reconcile what they read with what they receive.

| What the README says | What the data actually contains |
|---|---|
| incident_summary has 8 columns | Data has 23 columns including incident_type, kill_chain_stages_observed, false_positive_rate, soar_actions_taken, etc. |
| alert_severity has 6 values (info / low / medium / high / critical / false_positive) | 7 values: adds duplicate_suppressed. All values are suffixed (high_severity, low_severity, critical_confirmed, informational). |
| analyst_tier has 4 values (tier_1 / tier_2 / tier_3 / manager) | 3 values on alerts (L1_junior, L2_senior, L3_threat_hunter); 4 on soc_topology (adds L4_incident_commander). |
| 14 MITRE ATT&CK tactics | 12 tactics in the data (no reconnaissance or resource_development from PRE-ATT&CK). |
| Detection source mix: edr, siem, ndr, ids, ueba, casb, deception, threat intel | Field is alert_source (not detection_source); 8 values: edr_behavioural_engine, nids_signature, ueba_user_anomaly, cspm_cloud_rule, siem_correlation_rule, threat_intel_ioc_match, honeypot_trigger, itdr_identity_anomaly. |
| triage_score / enrichment_score columns | Actual names: raw_score / enriched_score. |
| alert_timestamp (ISO string) | Actual: alert_timestamp_min (integer minutes from epoch). |
| kill_chain_stage, storm_event_flag columns on alerts | Not present in the data. |
| resolution_outcome values (true_positive / false_positive / duplicate / suppressed) | Actual 5 values: auto_resolved_soar, duplicate_merged, false_positive_closed, true_positive_escalated, true_positive_remediated. |
| Extra columns in data not in README | shift_id, time_in_phase_minutes, queue_depth_at_ingestion, fatigue_score_at_alert, siem_platform, soar_playbook_id, detection_rule_id, alert_lifecycle_phase |

None of these affects model correctness — the feature pipeline uses the actual column names. If you build your own pipeline against the dataset, use the actual columns.

Intended use

  • Evaluating fit of the CYB008 dataset for your SOC-triage research
  • Baseline reference for new model architectures
  • Reference example of structural-leakage diagnostics in synthetic SOC datasets — the diagnostic methodology is reusable
  • Feature engineering reference for per-alert SOC telemetry

Out-of-scope use

  • Production SOC triage decisions on real telemetry
  • MITRE ATT&CK tactic prediction (this baseline establishes that task is unlearnable on the sample)
  • SLA-breach prediction (also tested as unlearnable on the sample — acc 0.68 vs majority 0.82)
  • Any operational decision affecting actual security operations without further validation on your own data

Reproducibility

Outputs above were produced with seed = 42 (published artifact), nested StratifiedShuffleSplit (70/15/15), on the published sample (xpertsystems/cyb008-sample, version 1.0.0, generated 2026-05-16). The feature pipeline in feature_engineering.py is deterministic and the trained weights in this repo correspond exactly to the metrics above.

Multi-seed results (seeds 42, 7, 13, 17, 23, 31, 45, 99, 123, 200) in multi_seed_results.json confirm robust performance across splits.

The training script itself is private to XpertSystems.

Files in this repo

| File | Purpose |
|---|---|
| model_xgb.json | XGBoost weights (seed 42) |
| model_mlp.safetensors | PyTorch MLP weights (seed 42) |
| feature_engineering.py | Feature pipeline |
| feature_meta.json | Feature column order + categorical levels |
| feature_scaler.json | MLP input mean/std (XGBoost ignores) |
| validation_results.json | Per-class metrics, confusion matrix, architecture |
| ablation_results.json | Per-feature-group ablation |
| multi_seed_results.json | XGBoost metrics across 10 seeds |
| leakage_diagnostic.json | Structural-oracle audit + unlearnable-target finding |
| inference_example.ipynb | End-to-end inference demo notebook |
| README.md | This file |

Contact and full product

The full CYB008 dataset contains ~335,000 rows across four files, with calibrated benchmark validation against 12 metrics drawn from authoritative SOC operations and threat intelligence sources (SANS SOC Survey, IBM Cost of Data Breach, Mandiant M-Trends, Forrester Wave SOAR, Gartner SIEM Magic Quadrant, SOC.OS, CrowdStrike, Splunk State of Security, Verizon DBIR). The full XpertSystems.ai synthetic data catalogue spans 41 SKUs across Cybersecurity, Healthcare, Insurance & Risk, Oil & Gas, and Materials & Energy.

Citation

@misc{xpertsystems_cyb008_baseline_2026,
  title  = {CYB008 Baseline Classifier: XGBoost and MLP for SOC Alert Triage Outcome Classification, with Structural-Leakage and Unlearnable-Target Diagnostic},
  author = {XpertSystems.ai},
  year   = {2026},
  url    = {https://huggingface.co/xpertsystems/cyb008-baseline-classifier},
  note   = {Baseline reference model trained on xpertsystems/cyb008-sample}
}