CYB007 Baseline Classifier

Insider-threat type classifier trained on the CYB007 synthetic insider-threat sample. Predicts which of 3 actor types (negligent_user / malicious_employee / privileged_insider) is behind an observed insider incident from per-timestep trajectory telemetry.

Baseline reference, not for production use. This model demonstrates that the CYB007 sample dataset is learnable end-to-end and gives prospective buyers a working starting point for insider-threat detection research. It is not a production UEBA system, DLP engine, or HR-investigation tool. See Limitations.

Model overview

Property	Value
Task	3-class actor_threat_type classification
Training data	`xpertsystems/cyb007-sample` (32,500 timesteps across 500 incidents)
Models	XGBoost + PyTorch MLP
Input features	28 (after one-hot encoding)
Split	Group-aware by incident_id (disjoint train/val/test incidents)
Validation	Single seed (artifact) + multi-seed aggregate across 10 seeds
License	CC-BY-NC-4.0 (matches dataset)
Status	Reference baseline

Why this task — CYB007 ships the README's stated headline use case

This is the second XpertSystems baseline (after CYB005) that ships the dataset's stated headline use case rather than pivoting away from it. The CYB007 README's first suggested use case is "training insider threat classifier models (4-tier actor attribution)", and that is the task this baseline trains on (with one schema correction: the sample data contains 3 of the 4 tiers — compromised_account is absent from the sample).

CYB003 (malware family), CYB004 (phishing actor tier), and CYB006 (threat-actor tier) all had to pivot away from their README headline targets: 100 groups are too few to support group-aware tier classification, and CYB006 in particular had structural distributional leakage. CYB007's 500 incidents (comparable to CYB005's profile of 500 campaigns × 75 timesteps) are enough for tier attribution to be learned honestly under group-aware splitting, with no oracle features and a multi-seed accuracy std of just 0.012.

Two model artifacts are published. They are designed to be used together — disagreement is a useful triage signal. Unusually for the XpertSystems baseline catalog, on CYB007 the MLP slightly outperforms XGBoost on the test fold (0.869 vs 0.853 accuracy at seed 42, 0.966 vs 0.963 ROC-AUC):

model_xgb.json — gradient-boosted trees
model_mlp.safetensors — PyTorch MLP in SafeTensors format
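One way to use the two artifacts together is to flag rows where their top predictions differ. The sketch below assumes you already have class-probability arrays from both models with the same class order. The function name `disagreement_mask` and the toy probabilities are illustrative and not part of this repo.

```python
import numpy as np

def disagreement_mask(proba_xgb, proba_mlp):
    """Return a boolean array that is True where the two models' top classes differ."""
    proba_xgb = np.asarray(proba_xgb)
    proba_mlp = np.asarray(proba_mlp)
    if proba_xgb.shape != proba_mlp.shape:
        raise ValueError("probability arrays must have the same shape")
    return proba_xgb.argmax(axis=1) != proba_mlp.argmax(axis=1)

# Toy probabilities for three rows and three classes (made up for illustration).
toy_xgb = np.array([[0.7, 0.2, 0.1], [0.4, 0.5, 0.1], [0.1, 0.2, 0.7]])
toy_mlp = np.array([[0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.2, 0.1, 0.7]])

mask = disagreement_mask(toy_xgb, toy_mlp)
print(mask.tolist())  # True marks rows worth a manual look
```

Rows where the mask is True are candidates for manual triage. How much weight to give a disagreement is a judgment call that this card does not quantify.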

Quick start

pip install xgboost torch safetensors pandas huggingface_hub

import os
import sys

import numpy as np
import xgboost as xgb
from huggingface_hub import hf_hub_download

REPO = "xpertsystems/cyb007-baseline-classifier"

# Download the model weights and the feature-pipeline files from the Hub.
# The MLP weights and scaler are fetched too, but this snippet only runs XGBoost.
paths = {n: hf_hub_download(REPO, n) for n in [
    "model_xgb.json", "model_mlp.safetensors",
    "feature_engineering.py", "feature_meta.json", "feature_scaler.json",
]}

# Make the downloaded feature_engineering.py importable.
sys.path.insert(0, os.path.dirname(paths["feature_engineering.py"]))
from feature_engineering import transform_single, load_meta, INT_TO_LABEL

meta = load_meta(paths["feature_meta.json"])
xgb_model = xgb.XGBClassifier()
xgb_model.load_model(paths["model_xgb.json"])

# Predict (see inference_example.ipynb for the full pattern).
# my_timestep_record is a placeholder for one per-timestep record that you supply.
X = transform_single(my_timestep_record, meta)
proba = xgb_model.predict_proba(X)[0]
print(INT_TO_LABEL[int(np.argmax(proba))])

See inference_example.ipynb for the full copy-paste demo.

Training data

Trained on the public sample of CYB007, 32,500 per-timestep telemetry rows from 500 insider threat incidents (65 timesteps per incident):

Tier	Incidents	Timestep rows	Class share
`negligent_user`	250	16,250	50.0%
`malicious_employee`	150	9,750	30.0%
`privileged_insider`	100	6,500	20.0%
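The imbalance in this table is what class weighting compensates for. As a rough illustration, the sketch below applies the common "balanced" heuristic, n_samples / (n_classes × class count), to the full-sample row counts. The weights actually used in training would be computed on the training fold, whose exact class counts are not stated in this card, so treat these numbers as approximate.

```python
# Timestep rows per tier, copied from the table above.
rows_per_tier = {
    "negligent_user": 16_250,
    "malicious_employee": 9_750,
    "privileged_insider": 6_500,
}

def balanced_class_weights(counts):
    """weight_c = n_samples / (n_classes * count_c), the usual 'balanced' heuristic."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * n) for label, n in counts.items()}

weights = balanced_class_weights(rows_per_tier)
for label, w in weights.items():
    print(f"{label}: {w:.3f}")
```

With these counts the minority tier receives a weight 2.5 times that of the majority tier, which mirrors the 50% to 20% ratio of class shares.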

Group-aware split

A single incident generates 65 highly-correlated timesteps. Random row-level splitting would put timesteps from the same incident in both train and test, inflating metrics in a way that does not generalize to new incidents.

This release uses GroupShuffleSplit by incident_id (nested, 70/15/15):

Fold	Incidents	Timesteps
Train	350	22,750
Validation	75	4,875
Test	75	4,875

All test incidents are completely unseen during training. Class imbalance is addressed with balanced class weights, applied as per-row sample_weight for XGBoost and as weighted cross-entropy for the MLP.
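The principle behind the group-aware split can be sketched with the standard library alone. The release itself uses scikit-learn's GroupShuffleSplit. The function below is a simplified stand-in that shuffles unique incident IDs and cuts them 70/15/15, so its fold membership will not match the published split. The incident ID format is made up.

```python
import random
from collections import Counter

def group_split(incident_ids, seed=42, frac_train=0.70, frac_val=0.15):
    """Assign each unique incident to exactly one fold, so no incident is split."""
    unique_ids = sorted(set(incident_ids))
    random.Random(seed).shuffle(unique_ids)
    n_train = round(frac_train * len(unique_ids))
    n_val = round(frac_val * len(unique_ids))
    fold_of = {}
    for i, inc in enumerate(unique_ids):
        if i < n_train:
            fold_of[inc] = "train"
        elif i < n_train + n_val:
            fold_of[inc] = "val"
        else:
            fold_of[inc] = "test"
    return fold_of

# Synthetic stand-in for the incident_id column: 500 incidents x 65 timesteps.
incident_ids = [f"INC{i:04d}" for i in range(500) for _ in range(65)]
fold_of = group_split(incident_ids)

incidents_per_fold = Counter(fold_of.values())
rows_per_fold = Counter(fold_of[inc] for inc in incident_ids)
print(dict(incidents_per_fold))
print(dict(rows_per_fold))
```

Because every row inherits the fold of its incident, the row counts come out as 65 times the incident counts, matching the fold table above.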

Feature pipeline

The bundled feature_engineering.py is the canonical feature recipe. 28 features survive after encoding, drawn from:

Per-timestep numeric (7): timestep, data_access_volume_mb, privilege_event_count, communication_anomaly_score, dlp_confidence_score, exfiltration_volume_mb_cumulative, behavioural_risk_score
Per-timestep categorical (3, one-hot): incident_phase (8 values), detection_outcome (4 values), target_data_sensitivity_tier (3 values)
Engineered (6): log_data_volume, log_cumulative_exfil, exfil_velocity, is_privileged_event, risk_x_dlp_composite, is_late_stage
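To make the feature count concrete, the sketch below one-hot encodes a categorical value against a fixed, ordered list of levels and then adds up the groups listed above. The helper `one_hot` and the level names are hypothetical. The real levels and column order are stored in feature_meta.json, and the canonical recipe remains feature_engineering.py.

```python
def one_hot(value, levels, prefix):
    """Encode one categorical value against a fixed, ordered list of levels."""
    return {f"{prefix}={level}": int(value == level) for level in levels}

# Hypothetical level names; the real ones live in feature_meta.json.
sensitivity_levels = ["low", "medium", "high"]
encoded = one_hot("medium", sensitivity_levels, "target_data_sensitivity_tier")
print(encoded)

# Feature-count check: numeric + one-hot levels + engineered.
n_numeric = 7
n_one_hot = 8 + 4 + 3  # incident_phase, detection_outcome, target_data_sensitivity_tier
n_engineered = 6
n_features = n_numeric + n_one_hot + n_engineered
print(n_features)
```

Encoding against a stored level list rather than the values seen at inference time keeps the column order stable. In this sketch an unseen value simply yields all zeros.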

Leakage audit

Two features have strongly tier-correlated means but with substantial distributional overlap. Neither was dropped:

Feature	Distribution by tier	Verdict
`data_access_volume_mb`	negligent [0, 88] mean 14 / malicious [0, 328] mean 44 / privileged [0, 2541] mean 302; median ~9 MB for all three	Massive overlap in [0, 88]; real signal, not oracle. KEEP.
`exfiltration_volume_mb_cumulative`	negligent [0, ~50] mean 5 / malicious [0, ~500] mean 90 / privileged [0, ~10000] mean 818	Heavy-tailed with overlap in low-quantile region. KEEP.

The honest test: dropping the volume features collapses accuracy from 0.85 to 0.49 (macro-F1 0.47; see the "No volume features" ablation row), below the 0.50 majority baseline. Together with the distributional overlap shown above, this indicates that they carry legitimate discriminative signal that defines what privileged_insider means (a privileged user with elevated data access) rather than acting as an oracle leak.

detection_outcome is a near-oracle for incident phase (purity 0.79, max 1.00 for reconnaissance which is 100% suppressed). But its purity vs tier is uniform (~0.50 across all tiers), so it has no oracle relationship to the target. KEEP.

No columns dropped for this task.
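The card does not define "purity" formally. One reading consistent with the figures above is: for each target class, the share of its rows taken by the most common value of the feature. The sketch below implements that reading on toy data. Both the definition and the function name are assumptions.

```python
from collections import Counter, defaultdict

def purity_by_class(feature_values, target_values):
    """For each target class, the share of its rows held by its most common feature value."""
    per_class = defaultdict(Counter)
    for f, t in zip(feature_values, target_values):
        per_class[t][f] += 1
    return {t: max(c.values()) / sum(c.values()) for t, c in per_class.items()}

# Toy rows (made up): class "a" always shows value "x", class "b" is split evenly.
feature = ["x", "x", "x", "x", "y", "z", "y", "z"]
target = ["a", "a", "a", "a", "b", "b", "b", "b"]

purity = purity_by_class(feature, target)
print(purity)
```

Under this reading, a purity near 1.00 for a class means the feature takes essentially one value within that class, while similar, moderate purities across classes suggest the feature does not separate them.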

Evaluation

Test-set metrics, seed 42 (n = 4,875 timesteps from 75 disjoint incidents)

XGBoost (the published model_xgb.json artifact)

Metric	Value
Macro ROC-AUC (OvR)	0.9628
Accuracy	0.8529
Macro-F1	0.8496
Weighted-F1	0.8543

MLP (the published model_mlp.safetensors artifact) — slightly outperforms XGBoost

Metric	Value
Macro ROC-AUC (OvR)	0.9661
Accuracy	0.8685
Macro-F1	0.8636
Weighted-F1	0.8682

The MLP outperforming XGBoost is unusual for tabular data and unusual within the XpertSystems baseline catalog — CYB001–CYB006 all had XGBoost ahead. With 22,750 training rows and only 28 features, the MLP has enough data to fit cleanly and the tabular advantage of trees is reduced. Both models are published.

Multi-seed robustness (XGBoost, 10 seeds)

Very stable performance — std 0.012 on accuracy is among the tightest in the XpertSystems catalog:

Metric	Mean	Std	Min	Max
Accuracy	0.855	0.012	0.831	0.873
Macro-F1	0.839	0.010	0.829	0.860
Macro ROC-AUC OvR	0.961	0.007	0.949	0.972

Full per-seed results in multi_seed_results.json. All 10 seeds yielded all 3 tiers in the test fold.
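The aggregate columns can be reproduced from per-seed values with a few lines. The sketch below uses placeholder accuracies, not the published per-seed results, and assumes the sample standard deviation. The card does not say whether sample or population std was used, and the layout of multi_seed_results.json is not described here.

```python
import statistics

def summarize(values):
    """Mean, sample standard deviation, min and max of one metric across seeds."""
    return {
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }

# Placeholder accuracies for five imaginary seeds, NOT the published results.
placeholder_accuracy = [0.84, 0.86, 0.85, 0.87, 0.83]

summary = summarize(placeholder_accuracy)
print({name: round(value, 3) for name, value in summary.items()})
```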

Per-class F1 (seed 42)

Tier	Class share	XGBoost F1	MLP F1
`negligent_user`	50%	0.876	0.894
`privileged_insider`	20%	0.846	0.856
`malicious_employee`	30%	0.826	0.841

The model performs evenly across all three tiers, with no class collapse. negligent_user scores highest, and privileged_insider comes second despite being the minority class (20%), which suggests that the volume-based behavioural signature (sustained large data access) is reliably discriminative. malicious_employee is marginally the hardest tier because these actors operate in a middle zone: more aggressive than negligent users but without the access volumes that distinguish privileged insiders.
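As a consistency check, macro-F1 is the unweighted mean of the per-class F1 scores. The sketch below recomputes it from the three-decimal values in the table above, so small differences from the reported four-decimal macro-F1 are expected from rounding.

```python
# Per-class F1 at seed 42, copied from the table above.
f1 = {
    "XGBoost": {"negligent_user": 0.876, "privileged_insider": 0.846, "malicious_employee": 0.826},
    "MLP": {"negligent_user": 0.894, "privileged_insider": 0.856, "malicious_employee": 0.841},
}

def macro_f1(per_class):
    """Unweighted mean of the per-class F1 scores."""
    return sum(per_class.values()) / len(per_class)

macro = {name: macro_f1(scores) for name, scores in f1.items()}
for name, value in macro.items():
    print(f"{name}: macro-F1 = {value:.4f}")
```

Both recomputed values land within 0.001 of the reported macro-F1 figures (0.8496 for XGBoost, 0.8636 for the MLP).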

Ablation: which feature groups matter

Configuration	Accuracy	Macro-F1	ROC-AUC	Δ accuracy
Full feature set (published)	0.8529	0.8496	0.9628	—
No volume features	0.4890	0.4736	0.6828	−0.3639
No behavioural features	0.7126	0.7055	0.8961	−0.1403
No `timestep`	0.8394	0.8336	0.9569	−0.0135
No context features	0.8544	0.8490	0.9632	+0.0015
No engineered features	0.8597	0.8560	0.9629	+0.0068

Four findings:

Volume features carry the overwhelmingly dominant signal (drops 36 pp accuracy, 28 pp ROC-AUC when removed). This is by design — privileged insiders are defined by access to large data volumes, and the synthetic generator models this faithfully.
Behavioural features (privilege events, communication anomaly, DLP confidence, risk scores) contribute 14 pp accuracy. They add a second axis of discrimination beyond pure volume.
timestep contributes only 1 pp. Tier attribution is largely invariant to where in the incident lifecycle you are — different from phase prediction, which is strongly timestep-driven.
Context features (incident_phase, sensitivity tier) and engineered composites are recovered by the trees from raw inputs: removing either group changes accuracy by less than 1 pp. They are retained in the pipeline as a documented baseline reference but add essentially nothing beyond the raw inputs.
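The Δ accuracy column is each configuration's accuracy minus the full-feature accuracy. A minimal sketch that recomputes it from the accuracies in the ablation table:

```python
# Accuracy per configuration, copied from the ablation table.
accuracy = {
    "Full feature set": 0.8529,
    "No volume features": 0.4890,
    "No behavioural features": 0.7126,
    "No timestep": 0.8394,
    "No context features": 0.8544,
    "No engineered features": 0.8597,
}

baseline = accuracy["Full feature set"]
delta = {
    name: round(acc - baseline, 4)
    for name, acc in accuracy.items()
    if name != "Full feature set"
}
for name, d in delta.items():
    print(f"{name}: {d:+.4f}")
```

A negative delta means the removed group was helping. A small positive delta means the model did marginally better without it on this test fold.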

Architecture

XGBoost: multi-class gradient boosting (multi:softprob, 3 classes), hist tree method, class-balanced sample weights, early stopping on validation mlogloss.

MLP: 28 → 128 → 64 → 3, each hidden layer followed by BatchNorm1d → ReLU → Dropout(0.3), weighted cross-entropy loss, AdamW optimizer, early stopping on validation macro-F1.
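The MLP layout described above can be written down directly in PyTorch. This is a sketch of the stated architecture only. The class name, attribute names, and layer ordering inside `nn.Sequential` are assumptions, so the state-dict keys in model_mlp.safetensors may not match this module. Training hyperparameters are not published and are not shown.

```python
import torch
from torch import nn

class BaselineMLP(nn.Module):
    """28 -> 128 -> 64 -> 3, each hidden layer followed by BatchNorm1d -> ReLU -> Dropout(0.3)."""

    def __init__(self, n_features=28, n_classes=3, p_drop=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(128, 64), nn.BatchNorm1d(64), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(64, n_classes),  # raw logits, one per tier
        )

    def forward(self, x):
        return self.net(x)

model = BaselineMLP().eval()  # eval mode: dropout off, BatchNorm uses running statistics
with torch.no_grad():
    logits = model(torch.randn(4, 28))  # 4 random rows standing in for scaled features

n_params = sum(p.numel() for p in model.parameters())
print(tuple(logits.shape), n_params)
```

At roughly 12.5k trainable parameters the network is small relative to the 22,750 training rows, which is consistent with the card's remark that the MLP has enough data to fit cleanly.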

Training hyperparameters are held internally by XpertSystems.

Limitations

This is a baseline reference, not a production insider-threat detection system.

The dataset has 3 tiers, not 4. The CYB007 README claims a 4-tier scheme including compromised_account but the sample contains only negligent_user, malicious_employee, and privileged_insider. If your work requires the 4th tier, request regeneration.
Volume-feature dominance is a property of the dataset. Real insider-threat telemetry has more variance: some negligent users accidentally trigger large data downloads, and some privileged insiders work patiently with small transfers. The sample's per-tier volume distributions overlap, but not as much as in real environments. Buyers should test the model on their own data before assuming the 0.85 to 0.87 accuracy transfers.
MLP modestly outperforms XGBoost. With 22,750 training rows, the MLP has enough data to compete favorably. On smaller training sets (n < 1k rows) we would expect XGBoost to be stronger.
Synthetic-vs-real transfer. The dataset is synthetic and calibrated to insider-threat research benchmarks (CERT Insider Threat Center, Verizon DBIR, IBM Cost of Insider Threats, Ponemon Institute, MITRE ATT&CK, NIST SP 800-53 / SP 800-207, Securonix, Forrester UEBA, Gartner ZTNA, CrowdStrike, Mandiant). Real insider telemetry has different noise characteristics, and adversarial insiders may deliberately mimic negligent-user patterns. Do not assume metrics transfer.
Adversarial robustness not evaluated. The dataset does not simulate insiders deliberately spoofing a different tier's behavioural footprint to evade attribution.
The 75-incident test fold is small. The multi-seed std of 0.012 on accuracy indicates the metric is stable across splits, but confidence intervals for downstream production decisions should come from the full ~4,800-incident product.

Notes on dataset schema

The CYB007 sample dataset README describes some fields differently from the actual schema. The model was trained on the actual schema; this note helps buyers reconcile what they read with what they receive.

What the README says	What the data actually contains
4 actor tiers including `compromised_account`	3 tiers only: `negligent_user`, `malicious_employee`, `privileged_insider`. No `compromised_account` rows in the sample.
6 incident phases	8 phases: adds `idle_dwell` and `lateral_access` to the 6 documented
Per-timestep columns: `payload_entropy`, `cover_actions_taken`, `dlp_alerts_raised`, `detection_flag`, `blast_radius`, `sensitive_data_accessed`, `threat_type_tier`	Actual per-timestep columns: `privilege_event_count`, `communication_anomaly_score`, `dlp_confidence_score`, `detection_outcome` (categorical 4-value, not boolean), `behavioural_risk_score`, `target_data_sensitivity_tier`, `actor_threat_type`
Summary field `ueba_status`	Actual field is `ueba_deployment_status` (only on `org_topology.csv`, not on `insider_trajectories.csv` or `incident_summary.csv`)
Summary field `collusion_flag`	Actual: `coordinated_incident_flag`
Summary field `lateral_access_flag`	Actual: `lateral_access_count` (not boolean)
Summary field `sabotage_flag`	Actual: `sabotage_events_executed` (count)
Summary field `cover_tracks_flag`	Actual: `cover_tracks_events` (count)
Summary field `hr_trigger_flag`	Actual: `hr_case_triggers_caused` (count)
Summary field `exfiltration_success_flag`	Actual: `exfiltration_successes` (count) and `exfiltration_success_rate` (float)
Summary field `dwell_time_ratio`	Not present in summary; `actor_efficiency_score` is the closest analog

None of these affects model correctness — the feature pipeline uses the actual column names. If you build your own pipeline against the dataset, use the actual columns.
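If existing code was written against the README's field names, a small lookup built from the table above can translate them. The helper name `translate_columns` is hypothetical. Note that several of the actual columns are counts rather than booleans, so renaming alone does not make the semantics identical.

```python
# README field name -> actual column name, taken from the table above.
README_TO_ACTUAL = {
    "ueba_status": "ueba_deployment_status",
    "collusion_flag": "coordinated_incident_flag",
    "lateral_access_flag": "lateral_access_count",
    "sabotage_flag": "sabotage_events_executed",
    "cover_tracks_flag": "cover_tracks_events",
    "hr_trigger_flag": "hr_case_triggers_caused",
    "exfiltration_success_flag": "exfiltration_successes",
}

def translate_columns(requested):
    """Map README-style names to actual names; leave unknown names unchanged."""
    return [README_TO_ACTUAL.get(name, name) for name in requested]

translated = translate_columns(["collusion_flag", "sabotage_flag", "incident_id"])
print(translated)
```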

Intended use

Evaluating fit of the CYB007 dataset for your insider-threat research
Baseline reference for new model architectures (sequence models, graph models considering collusion structure)
Teaching and demo for multi-class tabular classification on insider-threat telemetry
Feature engineering reference for per-timestep insider activity

Out-of-scope use

Production insider-threat detection on real telemetry
HR investigation or employment decisions
Adversarial-evasion evaluation (dataset not adversarially generated)
Any operational or legal decision affecting actual persons

Reproducibility

Outputs above were produced with seed = 42 (published artifact), group-aware nested GroupShuffleSplit (70/15/15 by incident_id), on the published sample (xpertsystems/cyb007-sample, version 1.0.0, generated 2026-05-16). The feature pipeline in feature_engineering.py is deterministic and the trained weights in this repo correspond exactly to the metrics above.

Multi-seed results (seeds 42, 7, 13, 17, 23, 31, 45, 99, 123, 200) in multi_seed_results.json confirm robust performance across splits.

The training script itself is private to XpertSystems.

Files in this repo

File	Purpose
`model_xgb.json`	XGBoost weights (seed 42)
`model_mlp.safetensors`	PyTorch MLP weights (seed 42)
`feature_engineering.py`	Feature pipeline
`feature_meta.json`	Feature column order + categorical levels
`feature_scaler.json`	MLP input mean/std (XGBoost ignores)
`validation_results.json`	Per-class metrics, confusion matrix, architecture
`ablation_results.json`	Per-feature-group ablation
`multi_seed_results.json`	XGBoost metrics across 10 seeds
`inference_example.ipynb`	End-to-end inference demo notebook
`README.md`	This file
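The table describes feature_scaler.json as holding the MLP input mean and std. Standardization with stored statistics typically looks like the sketch below. The key names "mean" and "std" and the toy values are assumptions, since the actual JSON layout is not documented in this card.

```python
import json

import numpy as np

def standardize(X, scaler):
    """Apply (x - mean) / std column-wise using stored training statistics."""
    mean = np.asarray(scaler["mean"], dtype=np.float32)
    std = np.asarray(scaler["std"], dtype=np.float32)
    return (np.asarray(X, dtype=np.float32) - mean) / std

# Toy scaler with two features (made up); the real file covers all 28 features
# and may use key names other than "mean" and "std".
toy_scaler = json.loads('{"mean": [10.0, 0.5], "std": [2.0, 0.25]}')

scaled = standardize([[12.0, 0.5], [8.0, 1.0]], toy_scaler)
print(scaled.tolist())
```

Tree models are insensitive to this kind of monotonic rescaling, which is why the table notes that XGBoost ignores the scaler.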

Contact and full product

The full CYB007 dataset contains ~335,000 rows across four files, with calibrated benchmark validation against 12 metrics drawn from authoritative insider-threat research sources (CERT Insider Threat Center, Verizon DBIR, IBM Cost of Insider Threats, Ponemon Institute, MITRE ATT&CK, NIST SP 800-53 / SP 800-207, Securonix, Forrester UEBA, Gartner ZTNA, CrowdStrike, Mandiant M-Trends). The full XpertSystems.ai synthetic data catalogue spans 41 SKUs across Cybersecurity, Healthcare, Insurance & Risk, Oil & Gas, and Materials & Energy.

📧 pradeep@xpertsystems.ai
🌐 https://xpertsystems.ai
🗂 Dataset: https://huggingface.co/datasets/xpertsystems/cyb007-sample
🤖 Companion models:
- https://huggingface.co/xpertsystems/cyb001-baseline-classifier (network traffic)
- https://huggingface.co/xpertsystems/cyb002-baseline-classifier (ATT&CK kill-chain)
- https://huggingface.co/xpertsystems/cyb003-baseline-classifier (malware execution phase)
- https://huggingface.co/xpertsystems/cyb004-baseline-classifier (phishing campaign phase)
- https://huggingface.co/xpertsystems/cyb005-baseline-classifier (ransomware actor-tier attribution)
- https://huggingface.co/xpertsystems/cyb006-baseline-classifier (user risk tier + leakage diagnostic)

Citation

@misc{xpertsystems_cyb007_baseline_2026,
  title  = {CYB007 Baseline Classifier: XGBoost and MLP for Insider Threat Type Classification},
  author = {XpertSystems.ai},
  year   = {2026},
  url    = {https://huggingface.co/xpertsystems/cyb007-baseline-classifier},
  note   = {Baseline reference model trained on xpertsystems/cyb007-sample}
}

