CYB003 Baseline Classifier

Malware execution-phase classifier trained on the CYB003 synthetic malware behaviour sample. Predicts which of 10 execution phases a per-timestep telemetry record belongs to, from observable behavioural and PE-static features.

Baseline reference, not for production use. This model demonstrates that the CYB003 sample dataset is learnable end-to-end and gives prospective buyers a working starting point. It is not a production sandbox, EDR, or threat-detection system. See Limitations.

Model overview

Property	Value
Task	10-class execution_phase classification
Training data	`xpertsystems/cyb003-sample` (6,000 timesteps across 100 malware samples)
Models	XGBoost + PyTorch MLP
Input features	69 (after one-hot encoding)
Split	Group-aware by sample_id (disjoint train/val/test samples)
Validation	Single seed (artifact) + multi-seed aggregate across 10 seeds
License	CC-BY-NC-4.0 (matches dataset)
Status	Reference baseline

Why this task instead of malware family classification?

The CYB003 dataset README leads with "training malware family classifiers" as a suggested use case. We piloted that target first and found it is not learnable from the sample dataset under proper group-aware evaluation: with only 100 unique samples spread across 10 families, XGBoost on per-timestep features lands at ~15% accuracy and ROC-AUC ~0.58 — at majority baseline. Per-sample aggregation gives the same result.

This is a sample-size constraint, not a feature-engineering failure. With 10 samples per family on average, of which only ~7 fall in the 69-sample training fold, the model sees too few examples of each family to generalize, and a held-out test set of 15 samples covers at most ~8 families. The full 280k-row CYB003 product, with ~28 samples per family at the sample's distribution, will not have this constraint.

We pivoted to execution_phase prediction, which has 6,000 rows of per-timestep data and learns cleanly: about 91% accuracy and ROC-AUC 0.98 on average across 10 seeds (92% accuracy on the published seed-42 split). This is a legitimate SOC use case, since dynamic-analysis tools and EDR systems regularly need to tag which phase of execution an observed malware activity belongs to, and it shows the dataset is well-calibrated even when the headline product use case needs more data.

Two model artifacts are published. They are designed to be used together — disagreement is a useful triage signal:

model_xgb.json — gradient-boosted trees, primary recommendation
model_mlp.safetensors — PyTorch MLP in SafeTensors format
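
The disagreement signal mentioned above can be sketched in a few lines. This is a minimal illustration, not the repo's code: the `triage` helper and the toy 3-class probabilities are hypothetical, and in practice the two inputs would be the per-class probabilities from each published model.

```python
def triage(proba_xgb, proba_mlp):
    """Return (row index, XGBoost class, MLP class) for rows where the models disagree."""
    flagged = []
    for i, (p_x, p_m) in enumerate(zip(proba_xgb, proba_mlp)):
        c_x = max(range(len(p_x)), key=p_x.__getitem__)  # argmax of XGBoost probabilities
        c_m = max(range(len(p_m)), key=p_m.__getitem__)  # argmax of MLP probabilities
        if c_x != c_m:
            flagged.append((i, c_x, c_m))
    return flagged


# Toy 3-class probabilities for three records (illustrative values only)
proba_xgb = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.3, 0.3, 0.4]]
proba_mlp = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.2, 0.2, 0.6]]

disagreements = triage(proba_xgb, proba_mlp)
print(disagreements)
```

Only the second record is flagged here, because it is the only one where the two argmax classes differ; such rows are the natural candidates for human review.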

Quick start

pip install xgboost torch safetensors pandas huggingface_hub

import os
import sys

import numpy as np
import xgboost as xgb
from huggingface_hub import hf_hub_download

REPO = "xpertsystems/cyb003-baseline-classifier"

# Download the model artifacts and the feature pipeline from the Hub
paths = {n: hf_hub_download(REPO, n) for n in [
    "model_xgb.json", "model_mlp.safetensors",
    "feature_engineering.py", "feature_meta.json", "feature_scaler.json",
]}

# Make the downloaded feature_engineering.py importable
sys.path.insert(0, os.path.dirname(paths["feature_engineering.py"]))
from feature_engineering import transform_single, load_meta, INT_TO_LABEL

meta = load_meta(paths["feature_meta.json"])
xgb_model = xgb.XGBClassifier()
xgb_model.load_model(paths["model_xgb.json"])

# Predict; my_timestep_record is a placeholder for your own per-timestep record
# (see inference_example.ipynb for the full pattern)
X = transform_single(my_timestep_record, meta)
proba = xgb_model.predict_proba(X)[0]
print(INT_TO_LABEL[int(np.argmax(proba))])

See inference_example.ipynb for the full copy-paste demo.
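
The MLP expects standardized inputs, which is why feature_scaler.json ships alongside the weights. A minimal sketch of that step follows. The JSON keys `mean` and `std` are assumptions (the real file layout is not documented here), and `standardize` is a hypothetical helper, so check the actual file before relying on it.

```python
import json


def standardize(x, mean, std, eps=1e-8):
    """Z-score each feature; features with (near-)zero std are only centred."""
    return [(v - m) / (s if s > eps else 1.0) for v, m, s in zip(x, mean, std)]


# Hypothetical scaler content with two features; the real file covers all model inputs
scaler = json.loads('{"mean": [1.0, 10.0], "std": [2.0, 0.0]}')

z = standardize([3.0, 10.0], scaler["mean"], scaler["std"])
print(z)
```

XGBoost does not need this step, since tree splits are invariant to monotonic rescaling of each feature.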

Training data

Trained on the public sample of CYB003, 6,000 per-timestep telemetry rows from 100 malware samples (60 timesteps per sample):

Phase	Total rows	Share of total	Test rows (seed 42)
`initial_drop`	801	13.4%	120
`lateral_movement`	799	13.3%	120
`persistence_establishment`	787	13.1%	119
`data_exfiltration`	783	13.1%	100
`c2_communication`	709	11.8%	87
`privilege_escalation`	705	11.8%	107
`payload_execution`	705	11.8%	109
`dormancy_dwell`	250	4.2%	83
`sandbox_evasion_stall`	234	3.9%	32
`self_destruct_cleanup`	227	3.8%	23

Group-aware split

A single malware sample generates 60 highly-correlated timesteps. Random row-level splitting would put timesteps from the same sample in both train and test, inflating metrics in a way that does not generalize to new samples.

This release uses GroupShuffleSplit by sample_id (nested, 70/15/15):

Fold	Samples	Timesteps
Train	69	4,140
Validation	16	960
Test	15	900
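
The idea behind a group-aware split can be sketched without any ML library. This is an illustration only: it shuffles sample IDs with Python's `random` module, so it will not reproduce the exact fold membership of scikit-learn's GroupShuffleSplit. The `group_split` helper and the synthetic `sample_XXX` IDs are hypothetical, and the fold sizes are taken from the table above.

```python
import random


def group_split(sample_ids, n_val, n_test, seed=42):
    """Assign whole samples (groups) to folds so no sample spans two folds."""
    ids = sorted(set(sample_ids))
    random.Random(seed).shuffle(ids)
    test = set(ids[:n_test])
    val = set(ids[n_test:n_test + n_val])
    train = set(ids[n_test + n_val:])
    return train, val, test


# 100 synthetic samples x 60 timesteps, mirroring the shape of the public sample
rows = [(f"sample_{s:03d}", t) for s in range(100) for t in range(60)]

train_ids, val_ids, test_ids = group_split([sid for sid, _ in rows], n_val=16, n_test=15)

# Count timesteps per fold by looking up each row's sample ID
folds = {"train": train_ids, "val": val_ids, "test": test_ids}
counts = {name: sum(1 for sid, _ in rows if sid in ids) for name, ids in folds.items()}
print(counts)
```

Because every row follows its sample ID into exactly one fold, the timestep counts come out as multiples of 60, matching the table.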

All test samples are completely unseen during training. Class imbalance is addressed with balanced class weights (applied as per-row sample_weight in XGBoost) and weighted cross-entropy (MLP).
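
For readers unfamiliar with balanced weighting, the usual formula is weight = N / (K × n_c), where N is the number of rows, K the number of classes and n_c the rows in class c. The sketch below applies it to the total row counts from the phase table. That is an assumption made for illustration: the per-class counts of the training fold are not published, and the exact weighting used for the released models is not disclosed.

```python
# Total rows per phase, copied from the phase table (not the training-fold counts)
phase_counts = {
    "initial_drop": 801, "lateral_movement": 799, "persistence_establishment": 787,
    "data_exfiltration": 783, "c2_communication": 709, "privilege_escalation": 705,
    "payload_execution": 705, "dormancy_dwell": 250, "sandbox_evasion_stall": 234,
    "self_destruct_cleanup": 227,
}


def balanced_class_weights(counts):
    """weight_c = N / (K * n_c): rare classes get proportionally larger weights."""
    n_rows = sum(counts.values())
    n_classes = len(counts)
    return {label: n_rows / (n_classes * n) for label, n in counts.items()}


weights = balanced_class_weights(phase_counts)

# A per-row weight vector is then a lookup on each row's label
labels = ["initial_drop", "self_destruct_cleanup", "dormancy_dwell"]
sample_weight = [weights[label] for label in labels]
print(sample_weight)
```

With these weights every class contributes the same total weight to the loss, which is what lets the rare phases influence training despite their low row counts.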

Feature pipeline

The bundled feature_engineering.py is the canonical feature recipe. 69 features survive after encoding, drawn from:

Per-timestep numeric (10): timestep, api_call_rate, registry_write_count, network_connection_count, process_injection_flag, c2_beacon_interval_sec, av_signature_hit_flag, sandbox_evasion_flag, lateral_propagation_count, privilege_escalation_flag
PE static features (11): pe_entropy_mean, pe_entropy_std, import_hash_cluster, section_count, packed_section_ratio, string_entropy_mean, byte_histogram_chi2, code_section_rx_ratio, resource_section_entropy, suspicious_import_count, packer_detected_flag
Categorical (6, one-hot encoded): malware_family, threat_actor_tier, target_platform, obfuscation_technique, detection_outcome, ep_stack
Engineered (6): api_burst_score, is_c2_active, is_high_net_volume, is_stealth_step, is_destructive_step, lateral_activity_score
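
One-hot encoding against a fixed list of levels keeps the column order stable between training and inference, which is presumably why feature_meta.json stores the categorical levels. The sketch below shows the idea for ep_stack using the eight values found in the data. The `one_hot` helper and the `ep_stack_<level>` column naming are assumptions, and the bundled feature_engineering.py may do this differently.

```python
# The eight ep_stack values present in the sample data
EP_STACK_LEVELS = [
    "legacy_av_only", "ngav_ml_based", "edr_endpoint_detect", "av_plus_firewall",
    "xdr_extended_detect", "managed_detection_response", "deception_honeypot",
    "no_protection",
]


def one_hot(value, levels, prefix):
    """Encode one categorical value against a fixed, ordered list of levels."""
    return {f"{prefix}_{level}": int(value == level) for level in levels}


encoded = one_hot("edr_endpoint_detect", EP_STACK_LEVELS, "ep_stack")
unseen = one_hot("some_new_stack", EP_STACK_LEVELS, "ep_stack")  # value not in the list
print(encoded)
```

A value outside the known levels encodes to all zeros in this sketch rather than raising an error, so unexpected categories are worth checking for before prediction.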

Leakage audit

No categorical feature value maps to a single phase with purity above 0.17 (an even split over 10 phases would give 0.10), so nothing in the dataset is an oracle for the target. The model relies on a mix of timestep (strong but not deterministic) and behavioural features.
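
The card does not spell out how purity is computed. One plausible reading is: for each value of a categorical feature, take the share of its rows that fall in the most common phase, then report the maximum over values. The sketch below implements that reading on toy rows. The `max_purity` helper, the feature name `f` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def max_purity(rows, feature, target="execution_phase"):
    """Largest share of a single target class among rows sharing one feature value."""
    by_value = defaultdict(Counter)
    for row in rows:
        by_value[row[feature]][row[target]] += 1
    return max(max(c.values()) / sum(c.values()) for c in by_value.values())


# Toy data: in `leaky`, the feature value fully determines the phase
leaky = [
    {"f": "a", "execution_phase": "initial_drop"},
    {"f": "a", "execution_phase": "initial_drop"},
    {"f": "b", "execution_phase": "c2_communication"},
    {"f": "b", "execution_phase": "c2_communication"},
]
# In `mixed`, each feature value is split evenly across two phases
mixed = [
    {"f": "a", "execution_phase": "initial_drop"},
    {"f": "a", "execution_phase": "c2_communication"},
    {"f": "b", "execution_phase": "initial_drop"},
    {"f": "b", "execution_phase": "c2_communication"},
]

print(max_purity(leaky, "f"), max_purity(mixed, "f"))
```

A purity near 1.0 would mean the feature acts as an oracle for the label; values close to the even-split level indicate no such shortcut.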

Evaluation

Test-set metrics, seed 42 (n = 900 timesteps from 15 disjoint samples)

XGBoost (the published model_xgb.json artifact)

Metric	Value
Macro ROC-AUC (OvR)	0.9792
Accuracy	0.9178
Macro-F1	0.7781
Weighted-F1	0.9173

MLP (the published model_mlp.safetensors artifact)

Metric	Value
Macro ROC-AUC (OvR)	0.9681
Accuracy	0.8222
Macro-F1	0.7072
Weighted-F1	0.8278

Multi-seed robustness (XGBoost, 10 seeds)

Accuracy and ROC-AUC are tight across seeds — the task is genuinely learnable, not seed-lucky:

Metric	Mean	Std	Min	Max
Accuracy	0.905	0.010	0.882	0.921
Macro-F1	0.784	0.013	0.759	0.807
Macro ROC-AUC OvR	0.975	0.002	0.972	0.979

Full per-seed results in multi_seed_results.json. All 10 seeds yielded all 10 classes in the test fold, supporting clean multi-class ROC-AUC computation.

Per-class F1 (seed 42) — where the signal is and isn't

Phase	XGBoost F1	MLP F1	Note
`c2_communication`	1.000	1.000	Trivial: tight timestep window 52-59 + c2_beacon signal
`persistence_establishment`	0.992	0.870	Tight timestep window 9-17 + registry writes
`lateral_movement`	0.992	0.907	Tight timestep window 26-34 + lateral_propagation
`privilege_escalation`	0.991	0.915	Tight timestep window 18-25 + privilege flag
`data_exfiltration`	0.970	0.918	Tight timestep window 43-51 + network volume
`payload_execution`	0.963	0.698	Tight timestep window 35-42 + API bursts
`initial_drop`	0.945	0.886	Tight timestep window 0-8
`dormancy_dwell`	0.530	0.520	Hard: spans full 0-59 timestep range
`self_destruct_cleanup`	0.273	0.282	Hard: spans full 0-59, low row count (227)
`sandbox_evasion_stall`	0.125	0.077	Hard: spans full 0-59, low row count (234)

Seven phases are near-trivially classified because they sit in tight timestep windows with characteristic behavioural signatures. Three phases — dormancy_dwell, sandbox_evasion_stall, self_destruct_cleanup — scatter across the full 0–59 timestep range and lack distinctive behavioural features (idle/evasion phases have low activity by design), so a flat-tabular event-level model can't reliably disambiguate them. Sequence models that consider neighbouring timesteps would help here.
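
The gap between accuracy (0.92) and macro-F1 (0.78) follows directly from the per-class table, because macro-F1 is the unweighted mean of the ten per-class F1 scores. A short sketch using the XGBoost column makes this concrete; the grouping into "hard" and "easy" phases follows the text above.

```python
# XGBoost per-class F1 on the seed-42 test fold, copied from the table
xgb_f1 = {
    "c2_communication": 1.000, "persistence_establishment": 0.992,
    "lateral_movement": 0.992, "privilege_escalation": 0.991,
    "data_exfiltration": 0.970, "payload_execution": 0.963,
    "initial_drop": 0.945, "dormancy_dwell": 0.530,
    "self_destruct_cleanup": 0.273, "sandbox_evasion_stall": 0.125,
}
hard = {"dormancy_dwell", "self_destruct_cleanup", "sandbox_evasion_stall"}

# Macro-F1 weights every class equally, regardless of how many rows it has
macro_f1 = sum(xgb_f1.values()) / len(xgb_f1)
easy_mean = sum(v for k, v in xgb_f1.items() if k not in hard) / (len(xgb_f1) - len(hard))
hard_mean = sum(xgb_f1[k] for k in hard) / len(hard)

print(f"macro-F1={macro_f1:.4f} easy={easy_mean:.3f} hard={hard_mean:.3f}")
```

The mean reproduces the reported macro-F1 of 0.7781, and the split shows that the three scattered phases account for essentially all of the shortfall.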

Ablation: which feature groups matter

Configuration	Accuracy	Macro-F1	ROC-AUC	Δ accuracy
Full feature set (published)	0.9178	0.7781	0.9792	—
No `timestep`	0.6933	0.5963	0.9264	−0.2244
No behavioural features	0.9089	0.7579	0.9705	−0.0089
No PE static features	0.9167	0.7808	0.9786	−0.0011
No engineered features	0.9200	0.7931	0.9797	+0.0022

Three clear findings:

timestep is by far the dominant feature: accuracy drops 22 pp when it is removed, although ROC-AUC stays at 0.93. Malware execution progresses in time, and where you are in that timeline carries most of the phase signal.
PE static features are barely used for phase prediction (−0.1 pp when removed). This is expected: PE features (entropy, packed sections, import hashes) inform family classification, not phase classification. A buyer doing family work should expect to use them; for phase work they can be dropped.
Behavioural features contribute about 1 pp of accuracy, while the engineered features contribute nothing measurable: removing them changes accuracy by +0.2 pp, well within the 1 pp seed-to-seed standard deviation. Trees recover most of the engineered features on their own.
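
The accuracy deltas can be recomputed from the ablation table in percentage points. This is plain arithmetic on the published, rounded figures, so the last digit can differ slightly from the Δ column, which was presumably computed from unrounded values.

```python
full_accuracy = 0.9178  # full feature set, seed 42

# Accuracy with each feature group removed, copied from the ablation table
ablation_accuracy = {
    "No timestep": 0.6933,
    "No behavioural features": 0.9089,
    "No PE static features": 0.9167,
    "No engineered features": 0.9200,
}

# Change relative to the full model, in percentage points
delta_pp = {
    name: round((acc - full_accuracy) * 100, 2)
    for name, acc in ablation_accuracy.items()
}

for name, delta in delta_pp.items():
    print(f"{name}: {delta:+.2f} pp")
```

Only the timestep ablation moves accuracy by more than the 1 pp standard deviation seen across seeds; the other three deltas are small enough to be indistinguishable from split-to-split noise.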

Architecture

XGBoost: multi-class gradient boosting (multi:softprob, 10 classes), hist tree method, class-balanced sample weights, early stopping on validation mlogloss.

MLP: 69 → 128 → 64 → 10, each hidden layer followed by BatchNorm1d → ReLU → Dropout(0.3), weighted cross-entropy loss, AdamW optimizer, early stopping on validation macro-F1.
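
The MLP description above maps onto a short PyTorch module. The following is a sketch under stated assumptions, not the training code: the layer order follows the text, but the `build_mlp` helper is hypothetical and the parameter names in the published SafeTensors file may not match an `nn.Sequential`, so loading the released weights into it could require key remapping.

```python
import torch
from torch import nn


def build_mlp(n_features=69, n_classes=10, dropout=0.3):
    """69 -> 128 -> 64 -> 10, each hidden layer followed by BatchNorm1d, ReLU, Dropout."""
    return nn.Sequential(
        nn.Linear(n_features, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(dropout),
        nn.Linear(128, 64), nn.BatchNorm1d(64), nn.ReLU(), nn.Dropout(dropout),
        nn.Linear(64, n_classes),  # raw logits; softmax is applied at inference time
    )


model = build_mlp().eval()  # eval() disables dropout and uses BatchNorm running stats

with torch.no_grad():
    logits = model(torch.randn(4, 69))  # random stand-in for 4 standardized feature rows
    proba = torch.softmax(logits, dim=1)

print(proba.shape)
```

Keeping the final layer as raw logits matches the weighted cross-entropy loss named above, since `nn.CrossEntropyLoss` applies log-softmax internally.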

Training hyperparameters (learning rate, batch size, n_estimators, early-stopping patience, weight decay, class-weighting strategy) are held internally by XpertSystems and are not part of this release.

Limitations

This is a baseline reference, not a production sandbox or threat detector.

Three phases are genuinely hard at this sample size. dormancy_dwell, sandbox_evasion_stall, and self_destruct_cleanup span the full 0–59 timestep range and have low row counts, and XGBoost per-class F1 for them ranges from 0.13 to 0.53. By design these phases lack distinctive moment-to-moment features (the malware is being quiet to evade detection). Sequence models or per-sample aggregation would likely improve these substantially.
The pivot away from malware family classification is dataset-limited, not method-limited. Family classification on 100 samples with 10 classes is at majority baseline. The full 280k-row CYB003 product provides ~5,600 samples and supports proper family classification.
Synthetic-vs-real transfer. The dataset is synthetic and calibrated to threat-intelligence and AV-testing benchmark targets (VirusTotal, AV-TEST, MITRE ATT&CK Evaluations, Mandiant M-Trends, CrowdStrike GTR, Verizon DBIR). Real malware telemetry has different noise characteristics, adversary adaptation, and instrumentation gaps. Do not assume metrics transfer.
Adversarial robustness not evaluated. The dataset is not adversarially generated; the model has not been red-teamed against evasive samples.
MLP brittleness on OOD inputs. With ~4k training timesteps, the MLP can produce confidently-wrong predictions on hand-crafted records far from the training manifold. XGBoost is more robust. Use both; treat disagreement as a signal for human review.
timestep dominance is a property of the dataset. Real malware in production doesn't have a clean "timestep" feature on a per-sample 60-step normalized timeline — that's a simulator artifact. A buyer transferring this baseline to real sandbox traces would need to recover an equivalent temporal-position feature from execution-trace timestamps relative to detonation.
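
As an illustration of the last point, an equivalent temporal-position feature could be derived by binning each event's elapsed time since detonation onto the same 0–59 scale. This is a sketch under strong assumptions: the `temporal_bin` helper is hypothetical, it assumes the total trace duration is known, and linear binning may not match how the simulator lays out phases.

```python
def temporal_bin(event_ts, detonation_ts, trace_duration_sec, n_bins=60):
    """Map an event timestamp to a normalized bin in [0, n_bins - 1]."""
    if trace_duration_sec <= 0:
        raise ValueError("trace_duration_sec must be positive")
    fraction = (event_ts - detonation_ts) / trace_duration_sec
    fraction = min(max(fraction, 0.0), 1.0)  # clamp events outside the trace window
    return min(int(fraction * n_bins), n_bins - 1)


# A 600-second sandbox run that starts at t = 0 (timestamps in seconds)
bins = [temporal_bin(ts, 0.0, 600.0) for ts in (0.0, 300.0, 599.0, 600.0)]
print(bins)
```

Whether real traces show the same phase-to-position alignment as the synthetic data is an open question, so any such feature would need validation on real telemetry.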

Notes on dataset schema

The CYB003 sample dataset README describes some fields differently from the actual schema. The model was trained on the actual schema; this note helps buyers reconcile what they read with what they receive.

What the README says	What the data actually contains
`pe_entropy` (one column)	`pe_entropy_mean` + `pe_entropy_std` (two columns)
`process_injection_count`	`process_injection_flag` (binary, not a count)
`c2_beacon_active`	`c2_beacon_interval_sec` (seconds, 0 when inactive)
`av_detected`, `edr_detected`, `sandbox_evaded`, `dwell_time_hours`, `persistence_mechanism`, `lotl_technique_used` (per-timestep)	None of these columns exists in the per-timestep table; equivalents with different names (`av_signature_hit_flag`, `sandbox_evasion_flag`) do exist
`ep_stack`: 3 values (`legacy_av`, `ngav_ml_based`, `edr_full`)	`ep_stack`: 8 values (`legacy_av_only`, `ngav_ml_based`, `edr_endpoint_detect`, `av_plus_firewall`, `xdr_extended_detect`, `managed_detection_response`, `deception_honeypot`, `no_protection`)
9 malware families listed	10 families in the data (`apt_implant` is the additional one)
`coordinated_campaign_flag` (described as a flag)	Constant = 1 for all rows in the sample (uninformative)

The actual per-timestep table also contains rich PE-static features not listed in the README: import_hash_cluster, section_count, packed_section_ratio, string_entropy_mean, byte_histogram_chi2, code_section_rx_ratio, resource_section_entropy, suspicious_import_count. These are excellent features for family classification work and are documented in the model's feature_engineering.py.

None of these discrepancies affects model correctness — the feature pipeline uses the actual column names. If you build your own pipeline against the dataset, use the actual columns, not the README descriptions.
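
When building a pipeline of your own, a small lookup from README names to actual column names catches these mismatches early. The sketch below encodes part of the table above. The `missing_columns` and `c2_active` helpers are hypothetical, and treating `av_signature_hit_flag` and `sandbox_evasion_flag` as stand-ins for `av_detected` and `sandbox_evaded` is an assumption based on the names.

```python
# README name -> column(s) actually present in the per-timestep table
README_TO_ACTUAL = {
    "pe_entropy": ["pe_entropy_mean", "pe_entropy_std"],
    "process_injection_count": ["process_injection_flag"],
    "c2_beacon_active": ["c2_beacon_interval_sec"],
    "av_detected": ["av_signature_hit_flag"],    # assumed nearest equivalent
    "sandbox_evaded": ["sandbox_evasion_flag"],  # assumed nearest equivalent
}


def missing_columns(requested, actual_columns):
    """Map each requested name that is absent from the data to suggested replacements."""
    actual = set(actual_columns)
    return {name: README_TO_ACTUAL.get(name, []) for name in requested if name not in actual}


def c2_active(c2_beacon_interval_sec):
    """Derive an on/off signal: the interval column is 0 when C2 is inactive."""
    return c2_beacon_interval_sec > 0


actual_columns = ["timestep", "pe_entropy_mean", "pe_entropy_std",
                  "process_injection_flag", "c2_beacon_interval_sec"]
hints = missing_columns(["timestep", "pe_entropy", "c2_beacon_active"], actual_columns)
print(hints)
```

An empty replacement list signals a README field with no known counterpart in the data, which is the case for fields such as `dwell_time_hours`.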

Intended use

Evaluating fit of the CYB003 dataset for your malware-analysis or sandbox-detection research
Baseline reference for new model architectures (especially sequence models, which should beat this baseline on the late/scattered phases)
Teaching and demo for tabular classification on malware telemetry
Feature engineering reference for per-timestep behavioural data

Out-of-scope use

Production sandbox analysis on real malware
EDR phase tagging on real systems
Family attribution (this baseline does not address that task; see why above)
Adversarial-evasion evaluation (dataset not adversarially generated)
Any operational security decision

Reproducibility

Outputs above were produced with seed = 42 (published artifact), group-aware nested GroupShuffleSplit (70/15/15 by sample_id), on the published sample (xpertsystems/cyb003-sample, version 1.0.0, generated 2026-05-16). The feature pipeline in feature_engineering.py is deterministic and the trained weights in this repo correspond exactly to the metrics above.

Multi-seed results (seeds 42, 7, 13, 17, 23, 31, 45, 99, 123, 200) in multi_seed_results.json confirm robust performance across splits.

The training script itself is private to XpertSystems. The published artifacts contain the feature pipeline, model weights, scaler, metadata, and validation results — sufficient to reproduce inference but not training.

Files in this repo

File	Purpose
`model_xgb.json`	XGBoost weights (seed 42)
`model_mlp.safetensors`	PyTorch MLP weights (seed 42)
`feature_engineering.py`	Feature pipeline (load → engineer → encode)
`feature_meta.json`	Feature column order + categorical levels
`feature_scaler.json`	MLP input mean/std (XGBoost ignores)
`validation_results.json`	Per-class metrics, confusion matrix, architecture
`ablation_results.json`	Per-feature-group ablation (timestep, behavioural, PE static, engineered)
`multi_seed_results.json`	XGBoost metrics across 10 seeds with aggregate statistics
`inference_example.ipynb`	End-to-end inference demo notebook
`README.md`	This file

Contact and full product

The full CYB003 dataset contains ~349,000 rows across four files, with calibrated benchmark validation against 12 metrics drawn from authoritative threat intelligence and AV-testing sources (VirusTotal, AV-TEST, MITRE ATT&CK Evaluations, Mandiant, CrowdStrike, Verizon). The full XpertSystems.ai synthetic data catalogue spans 41 SKUs across Cybersecurity, Healthcare, Insurance & Risk, Oil & Gas, and Materials & Energy.

📧 pradeep@xpertsystems.ai
🌐 https://xpertsystems.ai
🗂 Dataset: https://huggingface.co/datasets/xpertsystems/cyb003-sample
🤖 Companion models:
- https://huggingface.co/xpertsystems/cyb001-baseline-classifier (network traffic)
- https://huggingface.co/xpertsystems/cyb002-baseline-classifier (ATT&CK kill-chain)

Citation

@misc{xpertsystems_cyb003_baseline_2026,
  title  = {CYB003 Baseline Classifier: XGBoost and MLP for Malware Execution Phase Classification},
  author = {XpertSystems.ai},
  year   = {2026},
  url    = {https://huggingface.co/xpertsystems/cyb003-baseline-classifier},
  note   = {Baseline reference model trained on xpertsystems/cyb003-sample}
}
