CYB006 Baseline Classifier

User-risk-tier classifier trained on the CYB006 synthetic login activity sample. It predicts which of 3 risk tiers (low / medium / high) a user belongs to, from per-user identity aggregates and non-leaky session aggregates. It also ships a leakage diagnostic for the README's stated headline use case (threat-actor tier classification).

Read this first. This repo ships two artifacts: (1) a working baseline classifier for user_risk_tier (the primary product), and (2) a separate diagnostic file (leakage_diagnostic.json) documenting why the README's stated headline use case — 4-class threat-actor tier classification — is not a usable ML task on the sample dataset. Both matter; the diagnostic is required reading for anyone evaluating CYB006 for a threat-detection product.

Model overview

Property	Value
Primary task	3-class user_risk_tier classification (`low`/`medium`/`high`)
Secondary artifact	`leakage_diagnostic.json` — audit of threat-actor detection on this sample
Training data	`xpertsystems/cyb006-sample` (200 users × 25 sessions = 5,000 sessions)
Models	XGBoost + PyTorch MLP
Input features	34 (per-user aggregates + session aggregates + engineered)
Split	Stratified by user_risk_tier (this is a user-level task, n=200)
Validation	Single seed (artifact) + multi-seed aggregate across 10 seeds
License	CC-BY-NC-4.0 (matches dataset)
Status	Reference baseline + structural-leakage diagnostic

Why this task — and why not threat-actor classification?

The CYB006 README's first suggested use case is "training account takeover (ATO) detection models" and second is "threat-actor tier classification — 4-class with realistic class imbalance". We piloted the threat-actor target first and discovered that the sample dataset contains structural distributional non-overlap between threat-actor and legitimate session populations across at least six independent feature groups:

Oracle feature	Actor range / value	Non-actor range / value
`velocity_anomaly_score`	[0.52, 0.82]	[0.00, 0.25] (zero overlap)
`session_timestamp_utc`	[6,417, 1,440,062]	[1,445,187, 18,000,137] (disjoint windows)
`credential_attempt_count`	[1, 59] (mean 12.9)	[1, 2] (mean 1.07)
`login_outcome`	`failure_account_locked`, `account_takeover_confirmed`, `session_hijacked`, `success_anomalous` (actors only)	`success_normal` (non-actors only)
`geo_country_code`	`KP`, `XX`, `CN`, `BY` (actors only)	none of these four codes
`device_trust_level`	neither `trusted_managed` nor `compliant_enrolled`	`trusted_managed`, `compliant_enrolled` (non-actors only)

As a consequence, plain XGBoost achieves 100% test accuracy on binary threat-actor detection (any actor vs none) across every random seed, and stays at 97% accuracy and AUC 0.99 even with all six oracle feature groups dropped (37 columns excluded). This is not a useful ML benchmark; it is a property of the synthetic generator. Real identity-security telemetry has substantial overlap between threat and legitimate behaviour, with state-of-the-art detection systems operating at AUC 0.7–0.9, not 1.0.
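The range comparison in the table above can be checked in a few lines. This is an illustrative sketch, not the code behind leakage_diagnostic.json: the helper name `ranges_overlap` is hypothetical, and the intervals are simply the min/max values quoted in the table.

```python
def ranges_overlap(a, b):
    """Return True if closed intervals a=(lo, hi) and b=(lo, hi) share any point."""
    return max(a[0], b[0]) <= min(a[1], b[1])

# (actor range, non-actor range), copied from the table above
published = {
    "velocity_anomaly_score": ((0.52, 0.82), (0.00, 0.25)),
    "session_timestamp_utc": ((6_417, 1_440_062), (1_445_187, 18_000_137)),
    "credential_attempt_count": ((1, 59), (1, 2)),
}

# Features whose ranges are disjoint can be split by a single threshold
separable = [name for name, (actor, other) in published.items()
             if not ranges_overlap(actor, other)]
print(separable)
```

Two of the three numeric features are separable by one threshold on range alone. `credential_attempt_count` overlaps in range, so its separation comes from the distribution (mean 12.9 vs 1.07) rather than from the extremes.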

The diagnostic finding is documented quantitatively in leakage_diagnostic.json and summarised in the Leakage diagnostic section below.

We therefore pivoted to user_risk_tier (3-class user-level classification) as the primary baseline target. This task:

Has overlapping per-tier feature distributions — no oracle features
Carries modest real signal (seed-42 accuracy 0.67 and macro ROC-AUC 0.80, against a majority-class accuracy of 0.57)
Targets a legitimate use case (the README lists "Insider threat scoring with composite behavioral indicators")
Demonstrates honest ML rigor on the dataset

Two model artifacts are published. They are designed to be used together — disagreement is a useful triage signal:

model_xgb.json — gradient-boosted trees, primary recommendation
model_mlp.safetensors — PyTorch MLP in SafeTensors format
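One way to use disagreement as a triage signal is sketched below. It assumes both models return an (n_users, 3) probability array with the same class order; the helper name `disagreement_mask` and the toy probabilities are hypothetical, not real model output.

```python
import numpy as np

def disagreement_mask(proba_a, proba_b):
    """Boolean mask of rows where the two models predict different classes."""
    return np.argmax(proba_a, axis=1) != np.argmax(proba_b, axis=1)

# Toy probabilities for three users; columns assumed to be (low, medium, high)
proba_xgb = np.array([[0.7, 0.2, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7]])
proba_mlp = np.array([[0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.2, 0.2, 0.6]])

# Users flagged here would be routed to manual review
needs_review = disagreement_mask(proba_xgb, proba_mlp)
print(needs_review)
```

In this toy case only the second user is flagged, because the two models pick different tiers for it.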

Quick start

pip install xgboost torch safetensors pandas huggingface_hub

from huggingface_hub import hf_hub_download
import json, numpy as np, torch, xgboost as xgb
from safetensors.torch import load_file

REPO = "xpertsystems/cyb006-baseline-classifier"

paths = {n: hf_hub_download(REPO, n) for n in [
    "model_xgb.json", "model_mlp.safetensors",
    "feature_engineering.py", "feature_meta.json", "feature_scaler.json",
]}

import sys, os
sys.path.insert(0, os.path.dirname(paths["feature_engineering.py"]))
from feature_engineering import (
    transform_single, load_meta, INT_TO_LABEL,
    compute_session_aggregates_for_user
)

meta = load_meta(paths["feature_meta.json"])
xgb_model = xgb.XGBClassifier(); xgb_model.load_model(paths["model_xgb.json"])

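# Compose a per-user record from one user_risk_summary.csv row plus session aggregates.
# NOTE: `user_summary_row` (one user's row, as a pandas Series) and `user_sessions`
# (that user's rows from login_sessions.csv) are not defined in this snippet;
# load them from the dataset first (see inference_example.ipynb).
user_record = user_summary_row.to_dict()
user_record.update(compute_session_aggregates_for_user(user_sessions))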

X = transform_single(user_record, meta)
proba = xgb_model.predict_proba(X)[0]
print(INT_TO_LABEL[int(np.argmax(proba))])

See inference_example.ipynb for the full copy-paste demo.

Training data

Trained on the public sample of CYB006, 200 per-user rows from user_risk_summary.csv enriched with per-user session aggregates computed from login_sessions.csv:

Tier	Users	Class share
`low`	114	57%
`medium`	47	23.5%
`high`	39	19.5%

The CYB006 README claims a 4-tier scheme (low/medium/high/critical). The sample data contains only 3 — there is no critical tier present.

Stratified split

This is a user-level task (one row per user, 200 users total). Group-aware splitting does not apply since there is no many-rows-per-group structure to leak. We use StratifiedShuffleSplit (nested 70/15/15) to preserve the 3-tier class distribution across folds:

Fold	Users
Train	139
Validation	31
Test	30
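A nested stratified split with these fold sizes could look like the sketch below. The training script is not published, so this is an assumption about the mechanics rather than the actual code: the labels are rebuilt from the class counts above, and integer fold sizes (30 and 31) are passed directly to avoid rounding ambiguity.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Placeholder labels with the published class counts (0=low, 1=medium, 2=high)
y = np.array([0] * 114 + [1] * 47 + [2] * 39)
placeholder_X = np.zeros(len(y))  # the splitter only needs the sample count

# Outer split: hold out 30 test users
outer = StratifiedShuffleSplit(n_splits=1, test_size=30, random_state=42)
trainval_idx, test_idx = next(outer.split(placeholder_X, y))

# Inner split: carve 31 validation users out of the remaining 170
inner = StratifiedShuffleSplit(n_splits=1, test_size=31, random_state=42)
train_rel, val_rel = next(inner.split(trainval_idx, y[trainval_idx]))
train_idx, val_idx = trainval_idx[train_rel], trainval_idx[val_rel]

print(len(train_idx), len(val_idx), len(test_idx))
```

The inner split returns positions relative to the 170 remaining users, so they are mapped back through `trainval_idx` before use.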

Class imbalance is addressed with "balanced" class weights, applied as per-sample sample_weight for XGBoost and as a weighted cross-entropy loss for the MLP.
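The "balanced" heuristic sets each class weight to n_samples / (n_classes × class_count). The sketch below applies it to the full-sample class counts for illustration; in practice the weights would be computed on the training fold only, and the helper name `balanced_class_weights` is hypothetical.

```python
import numpy as np

def balanced_class_weights(y):
    """Weight per class: n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Labels rebuilt from the published class counts (0=low, 1=medium, 2=high)
y = np.array([0] * 114 + [1] * 47 + [2] * 39)
class_w = balanced_class_weights(y)

# Per-row weights, in the form an XGBoost fit call accepts as sample_weight
sample_weight = np.array([class_w[label] for label in y.tolist()])
print({k: round(v, 3) for k, v in class_w.items()})
```

With these counts the weights come out near 0.585 (low), 1.418 (medium) and 1.709 (high), so each tier contributes the same total weight to the loss.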

Feature pipeline

The bundled feature_engineering.py is the canonical feature recipe. 34 features survive after encoding, drawn from:

Per-user numeric (14, from user_risk_summary.csv): total_login_attempts, successful_logins, failed_logins, mfa_failures, impossible_travel_events, lateral_hop_count, privilege_escalations, account_lockout_count, geo_dispersion_score, login_velocity_score, session_anomaly_rate, ueba_alert_count, overall_identity_risk_score, insider_threat_indicator_score
Per-user categorical (1, one-hot): peak_privilege_level_accessed (6 values)
Session aggregates (8, derived from login_sessions.csv): avg_session_duration_seconds, avg_mfa_response_latency_ms, avg_geo_anomaly_score, max_geo_anomaly_score, frac_impossible_travel, n_unique_countries, n_unique_devices, n_unique_applications
Engineered (6): failed_login_rate, mfa_failure_rate, ueba_alerts_per_session, hops_per_escalation, geo_velocity_composite, composite_anomaly_score
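The exact formulas for the engineered features live in feature_engineering.py and are not restated here. The sketch below only illustrates the general shape of ratio features with a guard against zero denominators; the three definitions shown are assumptions inferred from the feature names, and the user record is made up.

```python
def safe_ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

# Hypothetical user record; field names come from the per-user numeric list above
user = {
    "total_login_attempts": 40,
    "failed_logins": 6,
    "mfa_failures": 2,
    "lateral_hop_count": 3,
    "privilege_escalations": 0,
}

# ASSUMED definitions, for illustration only (see feature_engineering.py for the real ones)
engineered = {
    "failed_login_rate": safe_ratio(user["failed_logins"], user["total_login_attempts"]),
    "mfa_failure_rate": safe_ratio(user["mfa_failures"], user["total_login_attempts"]),
    "hops_per_escalation": safe_ratio(user["lateral_hop_count"], user["privilege_escalations"]),
}
print(engineered)
```

The guard matters for features such as `hops_per_escalation`, where a user with no privilege escalations would otherwise raise a division error.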

Leakage exclusions

Three columns from user_risk_summary.csv are dropped to avoid contamination:

threat_actor_flag — perfect oracle for tier='high' subset (only high-tier users can be threat actors)
account_takeover_flag — 2 positive cases out of 200 (1%); too sparse and oracle-prone
credential_attack_victim_flag — 1 positive case out of 200 (0.5%); same issue

Four columns from login_sessions.csv are NOT aggregated into session features because they exhibited the structural non-overlap documented in Leakage diagnostic:

velocity_anomaly_score, session_timestamp_utc, credential_attempt_count, login_outcome

Evaluation

Test-set metrics, seed 42 (n = 30 disjoint users)

XGBoost (the published model_xgb.json artifact)

Metric	Value
Macro ROC-AUC (OvR)	0.8017
Accuracy	0.6667
Macro-F1	0.6454
Weighted-F1	0.6606

MLP (the published model_mlp.safetensors artifact)

Metric	Value
Macro ROC-AUC (OvR)	0.6974
Accuracy	0.6000
Macro-F1	0.5914
Weighted-F1	0.6068

Multi-seed robustness (XGBoost, 10 seeds)

Metric	Mean	Std	Min	Max
Accuracy	0.700	0.082	0.533	0.867
Macro-F1	0.638	0.093	0.445	0.814
Macro ROC-AUC OvR	0.812	0.048	0.738	0.877

Full per-seed results in multi_seed_results.json. With only 30 test users per seed, single-seed accuracy varies materially (0.53–0.87 across seeds). ROC-AUC 0.812 ± 0.048 is the more reliable performance estimate. All 10 seeds yield all 3 tiers in the test fold thanks to stratification.
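A rough sanity check on the seed-to-seed spread: if each of the 30 test predictions is treated as an independent draw with success probability equal to the mean accuracy, the binomial standard error is close to the observed standard deviation. This is a back-of-envelope sketch under that independence assumption, not part of the published evaluation.

```python
import math

def binomial_se(p, n):
    """Standard error of a proportion p estimated from n independent trials."""
    return math.sqrt(p * (1.0 - p) / n)

# Mean multi-seed accuracy and test-fold size from the table above
se = binomial_se(0.700, 30)
print(round(se, 3))
```

The result is about 0.084, against an observed standard deviation of 0.082, which suggests most of the accuracy variation is explained by the small test fold rather than by instability in the model.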

Per-class F1 (seed 42)

Tier	Class share	XGBoost F1	MLP F1
`low`	57%	0.727	0.647
`medium`	23.5%	0.286	0.400
`high`	19.5%	0.923	0.727

The model performs best on high (the most behaviourally distinct tier — high failed-login rates, frequent impossible travel, elevated anomaly scores) and low (the majority class). The medium tier is hardest, which is the expected behaviour for a 3-tier ordinal task — mid-class samples sit between two boundaries and pick up confusion from both sides.
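Macro-F1 is the unweighted mean of the per-class F1 scores, so the headline figures can be recovered from the table above. The sketch uses the rounded values from the table, so agreement with the reported macro-F1 is only to about three decimals.

```python
from statistics import mean

# Per-class F1 at seed 42, copied from the table above
per_class_f1 = {
    "xgboost": {"low": 0.727, "medium": 0.286, "high": 0.923},
    "mlp": {"low": 0.647, "medium": 0.400, "high": 0.727},
}

# Macro-F1 gives every tier equal weight regardless of class share
macro_f1 = {model: mean(scores.values()) for model, scores in per_class_f1.items()}
print({model: round(value, 3) for model, value in macro_f1.items()})
```

Because every tier counts equally, the weak medium-tier F1 pulls the macro average down much more than its 23.5% class share would under a weighted average.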

Ablation: which feature groups matter

Configuration	Accuracy	Macro-F1	Δ accuracy
Full feature set (published)	0.6667	0.6454	—
No user aggregates (count features)	0.5333	0.4586	−0.1333
No risk scores	0.5667	0.5300	−0.1000
No engineered features	0.5667	0.5444	−0.1000
No session aggregates	0.7000	0.6130	+0.0333

Findings:

User-level count features matter most (failed logins, lateral hops, MFA failures). Dropping them costs 13 pp accuracy.
Risk scores and engineered features each contribute ~10 pp. With only 139 training users, the trees can't fully recover engineered composites from raw inputs.
Session aggregates slightly hurt accuracy at seed 42: dropping them gains 3 pp (one test user out of 30), although macro-F1 falls from 0.645 to 0.613. With only 200 users, extra features can add variance faster than they add signal. Session aggregates are kept in the published pipeline because they help on most other seeds.
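With a 30-user test fold, every accuracy value is a multiple of 1/30, so the ablation deltas translate directly into a count of test users. The conversion below uses the accuracies from the table above; the dictionary keys are shorthand labels chosen here.

```python
N_TEST = 30  # test users at seed 42

# Accuracy per configuration, copied from the ablation table
accuracy = {
    "full": 0.6667,
    "no_user_aggregates": 0.5333,
    "no_risk_scores": 0.5667,
    "no_engineered": 0.5667,
    "no_session_aggregates": 0.7000,
}

# Number of correctly classified test users in each configuration
correct = {name: round(acc * N_TEST) for name, acc in accuracy.items()}

# Change relative to the full feature set, in users rather than percentage points
delta_users = {name: n - correct["full"] for name, n in correct.items()}
print(delta_users)
```

The largest effect (no user aggregates) corresponds to 4 users, and the session-aggregate result to a single user, which is why single-seed ablation deltas should be read with caution.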

Architecture

XGBoost: multi-class gradient boosting (multi:softprob, 3 classes), hist tree method, class-balanced sample weights, early stopping on validation mlogloss.

MLP: 34 → 128 → 64 → 3, each hidden layer followed by BatchNorm1d → ReLU → Dropout(0.3), weighted cross-entropy loss, AdamW optimizer, early stopping on validation macro-F1.
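To put the MLP size in context, the trainable parameter count can be worked out from the layer widths. The sketch assumes bias terms on every linear layer and learnable scale and shift in each BatchNorm1d (the PyTorch defaults); the published weights may differ if those defaults were changed.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def batchnorm_params(n_features):
    """Learnable scale and shift of a batch-norm layer."""
    return 2 * n_features

layer_sizes = [34, 128, 64, 3]  # input -> hidden -> hidden -> output

total = 0
for i, (n_in, n_out) in enumerate(zip(layer_sizes, layer_sizes[1:])):
    total += linear_params(n_in, n_out)
    is_hidden = i < len(layer_sizes) - 2  # the output layer has no batch norm
    if is_hidden:
        total += batchnorm_params(n_out)

print(total)  # 13315
```

Under these assumptions the network has 13,315 trainable parameters, roughly 96 per training user, which is consistent with the MLP trailing XGBoost on a 139-user training fold.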

Training hyperparameters are held internally by XpertSystems.

Leakage diagnostic

This is the most important section of the model card. The full diagnostic is in leakage_diagnostic.json. Summary:

Setup: Train an XGBoost binary classifier to predict threat_actor_capability_tier != 'none' from per-session features. Use group-aware split by user_id (15% test = 30 disjoint users). Cumulatively drop suspected oracle feature groups and re-evaluate.
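A group-aware split keeps every session of a given user on one side of the split. The sketch below shows the idea with synthetic user ids (200 users × 25 sessions, matching the sample's shape); it is an illustration of the technique, not the private training script, and the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for the user_id column: 200 users, 25 sessions each
user_ids = np.repeat(np.arange(200), 25)

# Choose 30 whole users (15%) for the test fold
test_users = rng.choice(np.unique(user_ids), size=30, replace=False)

# A session goes to test if and only if its user was chosen
test_mask = np.isin(user_ids, test_users)
train_sessions = np.flatnonzero(~test_mask)
test_sessions = np.flatnonzero(test_mask)

print(len(train_sessions), len(test_sessions))
```

Splitting on users rather than sessions prevents the model from seeing other sessions of a test user during training, which would otherwise inflate test accuracy for a separate reason.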

Configuration	n_features	Accuracy	ROC-AUC
Full feature set	166	1.0000	1.0000
− behavioural oracles (velocity, timestamp, credential count)	163	0.9991	1.0000
− login_outcome	154	0.9982	1.0000
− geo_country_code	138	0.9987	1.0000
− device_trust_level	133	0.9982	0.9999
− user_risk_tier	130	0.9978	0.9996
− geo_anomaly_score	129	0.9707	0.9897

Even after dropping six oracle feature groups (37 columns), the model still achieves 97% test accuracy and AUC 0.99. The leakage is not localised to a few suspect features; it is distributed across the entire feature space because the synthetic generator produces threat-actor sessions that are anomalous on every dimension simultaneously, with no overlap into legitimate behaviour.
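The column bookkeeping behind the table can be reproduced from the n_features column alone. In the sketch below, the per-step counts are derived by differencing consecutive n_features values; the helper name `cumulative_ablation` is hypothetical.

```python
def cumulative_ablation(n_start, dropped_per_step):
    """Return (step name, features remaining) after each cumulative drop."""
    remaining = n_start
    history = []
    for name, n_dropped in dropped_per_step:
        remaining -= n_dropped
        history.append((name, remaining))
    return history

# Columns removed at each step, from differences in the table's n_features column
steps = [
    ("behavioural oracles", 3),
    ("login_outcome", 9),
    ("geo_country_code", 16),
    ("device_trust_level", 5),
    ("user_risk_tier", 3),
    ("geo_anomaly_score", 1),
]

history = cumulative_ablation(166, steps)
total_dropped = sum(n for _, n in steps)
print(history[-1], total_dropped)
```

The six steps remove 37 columns in total and leave 129 features. The 9 columns for `login_outcome` and 3 for `user_risk_tier` match the number of distinct values reported for those fields, which would be consistent with one-hot encoding.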

Recommendation to dataset author

For threat-actor detection to be a useful ML benchmark on this dataset, the next generator version should introduce distributional overlap between threat-actor and legitimate session populations across all anomaly indicators:

velocity_anomaly_score: extend non-actor distribution into [0.0, 0.5] and shrink actor to [0.3, 0.9] for substantial overlap in [0.3, 0.5]
session_timestamp_utc: interleave threat-actor and legitimate sessions across the same time window
credential_attempt_count: allow some non-actor users to exhibit elevated counts (mistyped passwords, MFA fatigue)
login_outcome: allow failure_account_locked and success_anomalous for some legitimate sessions
geo_country_code: include a baseline frequency of risky-country logins among legitimate users (business travel, contractors)
device_trust_level: allow threat actors to occasionally use compliant devices (token theft scenarios)

Target operating regime: real-world detection AUC 0.7–0.9, not 1.0.

What this means for buyers

If you're evaluating CYB006 for a threat-detection product, you should know that:

The sample dataset cannot be used to honestly benchmark threat-actor detection models. Even a plain, untuned model scores close to 100%, so the benchmark cannot differentiate good detection systems from bad ones.
The user-risk-tier task shipped in this baseline is a legitimate ML benchmark on the sample data. It generalises modestly (AUC 0.81) and is the right starting point for evaluating insider-threat scoring on the sample.
The full ~1.1M-row CYB006 product may or may not have the same structural property. Confirm with XpertSystems before committing to a threat-detection use case.

Limitations

This is a baseline reference, not a production identity-security system.

Small held-out test fold (n=30). With only 30 test users per seed, single-seed metrics swing 0.53–0.87 in accuracy. The multi-seed ROC-AUC of 0.81 ± 0.05 is the reliable estimate. The full ~1.1M-row product would tighten the confidence interval substantially.
The medium tier is harder than the others. F1 0.29 on medium (vs 0.92 on high) is expected — ordinal middle classes are typically the hardest under a flat-classification setup.
MLP weaker than XGBoost. AUC 0.70 vs 0.80. With only 139 training users, the MLP cannot match boosted trees on tabular data.
Threat-actor detection task is not usable on this sample. See Leakage diagnostic above.
Synthetic-vs-real transfer. The dataset is synthetic and calibrated to identity-security benchmarks (Microsoft Digital Defense Report, Okta Customer Identity Trends, Verizon DBIR, CISA Joint Advisories, Mandiant M-Trends, MITRE ATT&CK Evaluations). Real identity telemetry has different noise characteristics; do not assume metrics transfer.
3 tiers, not 4. README lists low/medium/high/critical but the data contains only 3. If you need 4-class support, wait for a regenerated sample.

Notes on dataset schema

The CYB006 sample dataset README describes some fields differently from the actual schema. The model was trained on the actual schema; this note helps buyers reconcile what they read with what they receive.

What the README says	What the data actually contains
`session_phase` has 6 values	All 5,000 rows have `session_phase = session_termination` — the field is constant. There is no usable session-phase target.
`login_outcome` has 4 values (`success / failed / mfa_required / blocked`)	9 values: `success_normal`, `failure_bad_password`, `failure_account_locked`, `failure_mfa_rejected`, `failure_device_untrusted`, `failure_geo_blocked`, `success_anomalous`, `account_takeover_confirmed`, `session_hijacked`
4 actor tiers	5 values: 4 tier labels + `none` (92% of rows have `none`)
`mfa_challenge_type` has 5 values	7: adds `authenticator_app`, `hardware_token`, `voice_call`
`authentication_method` has 4 values	5: no `api_key`; adds `password_plus_mfa`, `phishing_resistant_fido2`
`user_risk_tier` has 4 values (`low/medium/high/critical`)	3 values: no `critical`
`session_timestamp_utc` is an ISO timestamp string	It is an integer
`user_risk_summary.csv` columns listed	Adds `peak_privilege_level_accessed`, `credential_attack_victim_flag` (not in README)

None of these affects model correctness — the feature pipeline uses the actual column names. If you build your own pipeline against the dataset, use the actual columns.
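If you build your own pipeline, a small check of observed categories against the values you expect can catch these mismatches early. The sketch below compares the `login_outcome` values listed in the README against those found in the data, both copied from the table above; the helper name `check_categories` is hypothetical.

```python
def check_categories(observed, expected):
    """Report values seen in the data but not expected, and vice versa."""
    observed, expected = set(observed), set(expected)
    return {
        "unexpected": sorted(observed - expected),
        "missing": sorted(expected - observed),
    }

# login_outcome values as documented in the dataset README
documented = {"success", "failed", "mfa_required", "blocked"}

# login_outcome values actually present in the sample
actual = {
    "success_normal", "failure_bad_password", "failure_account_locked",
    "failure_mfa_rejected", "failure_device_untrusted", "failure_geo_blocked",
    "success_anomalous", "account_takeover_confirmed", "session_hijacked",
}

report = check_categories(actual, documented)
print(report)
```

Here none of the documented values appears in the data, so all four are reported as missing and all nine actual values as unexpected.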

Intended use

Evaluating fit of the CYB006 dataset for your insider-threat or user-risk-scoring research
Baseline reference for new model architectures
Reference example of structural-leakage diagnostics in synthetic cybersecurity datasets — the diagnostic methodology in train_classifier.py is reusable
Feature engineering reference for per-user identity aggregates

Out-of-scope use

Production identity-security detection on real telemetry
Threat-actor attribution (this baseline does not address that task; see why above)
Any operational security or law-enforcement decision

Reproducibility

Outputs above were produced with seed = 42 (published artifact), nested StratifiedShuffleSplit (70/15/15 by user_risk_tier), on the published sample (xpertsystems/cyb006-sample, version 1.0.0, generated 2026-05-16). The feature pipeline in feature_engineering.py is deterministic and the trained weights in this repo correspond exactly to the metrics above.

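Multi-seed results (seeds 42, 7, 13, 17, 23, 31, 45, 99, 123, 200) are in multi_seed_results.json. Macro ROC-AUC is fairly stable across splits (0.74 to 0.88), while accuracy varies more (0.53 to 0.87) because of the 30-user test fold.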

The training script itself is private to XpertSystems.

Files in this repo

File	Purpose
`model_xgb.json`	XGBoost weights (seed 42)
`model_mlp.safetensors`	PyTorch MLP weights (seed 42)
`feature_engineering.py`	Feature pipeline
`feature_meta.json`	Feature column order + categorical levels
`feature_scaler.json`	MLP input mean/std (XGBoost ignores)
`validation_results.json`	Per-class metrics, confusion matrix, architecture
`ablation_results.json`	Per-feature-group ablation
`multi_seed_results.json`	XGBoost metrics across 10 seeds
`leakage_diagnostic.json`	Structural-leakage audit on threat-actor detection
`inference_example.ipynb`	End-to-end inference demo notebook
`README.md`	This file

Contact and full product

The full CYB006 dataset contains ~1.1 million rows across four files, with 12 calibrated benchmark validation tests drawn from authoritative identity security and threat intelligence sources (Microsoft Digital Defense Report, Okta Customer Identity Trends, Verizon DBIR, CISA Joint Advisories, Mandiant M-Trends, MITRE ATT&CK Evaluations). The full XpertSystems.ai synthetic data catalogue spans 41 SKUs across Cybersecurity, Healthcare, Insurance & Risk, Oil & Gas, and Materials & Energy.

📧 pradeep@xpertsystems.ai
🌐 https://xpertsystems.ai
🗂 Dataset: https://huggingface.co/datasets/xpertsystems/cyb006-sample
🤖 Companion models:
- https://huggingface.co/xpertsystems/cyb001-baseline-classifier (network traffic)
- https://huggingface.co/xpertsystems/cyb002-baseline-classifier (ATT&CK kill-chain)
- https://huggingface.co/xpertsystems/cyb003-baseline-classifier (malware execution phase)
- https://huggingface.co/xpertsystems/cyb004-baseline-classifier (phishing campaign phase)
- https://huggingface.co/xpertsystems/cyb005-baseline-classifier (ransomware actor-tier attribution)

Citation

@misc{xpertsystems_cyb006_baseline_2026,
  title  = {CYB006 Baseline Classifier: XGBoost and MLP for User Risk Tier Classification, with Structural-Leakage Diagnostic on Threat-Actor Detection},
  author = {XpertSystems.ai},
  year   = {2026},
  url    = {https://huggingface.co/xpertsystems/cyb006-baseline-classifier},
  note   = {Baseline reference model trained on xpertsystems/cyb006-sample}
}
