CYB003 Baseline Classifier

Malware execution-phase classifier trained on the CYB003 synthetic malware behaviour sample. Predicts which of 10 execution phases a per-timestep telemetry record belongs to, from observable behavioural and PE-static features.

Baseline reference, not for production use. This model demonstrates that the CYB003 sample dataset is learnable end-to-end and gives prospective buyers a working starting point. It is not a production sandbox, EDR, or threat-detection system. See Limitations.

Model overview

| Property | Value |
| --- | --- |
| Task | 10-class execution_phase classification |
| Training data | xpertsystems/cyb003-sample (6,000 timesteps across 100 malware samples) |
| Models | XGBoost + PyTorch MLP |
| Input features | 69 (after one-hot encoding) |
| Split | Group-aware by sample_id (disjoint train/val/test samples) |
| Validation | Single seed (artifact) + multi-seed aggregate across 10 seeds |
| License | CC-BY-NC-4.0 (matches dataset) |
| Status | Reference baseline |

Why this task instead of malware family classification?

The CYB003 dataset README leads with "training malware family classifiers" as a suggested use case. We piloted that target first and found it is not learnable from the sample dataset under proper group-aware evaluation: with only 100 unique samples spread across 10 families, XGBoost on per-timestep features lands at ~15% accuracy and ROC-AUC ~0.58 — at majority baseline. Per-sample aggregation gives the same result.

This is a sample-size constraint, not a feature-engineering failure. The 69-sample training fold leaves ~7 samples per family on average, and a held-out test set of 15 samples covers at most ~8 families, so the resulting model cannot generalize. The full 280k-row CYB003 product, with ~28 samples per family at the sample's distribution, will not have this constraint.

We pivoted to execution_phase prediction, which has 6,000 rows of per-timestep data and learns cleanly: 91% accuracy, ROC-AUC 0.98, stable across seeds. This is a legitimate SOC use case — dynamic-analysis tools and EDR systems regularly need to tag what phase of execution observed malware activity belongs to — and it shows the dataset is well-calibrated even when the headline product use case needs more data.

Two model artifacts are published. They are designed to be used together — disagreement is a useful triage signal:

  • model_xgb.json — gradient-boosted trees, primary recommendation
  • model_mlp.safetensors — PyTorch MLP in SafeTensors format
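One way to act on disagreement between the two models is sketched below. It assumes both models emit a probability vector per record in the same class order. The helper name `triage` and the 0.6 confidence floor are illustrative choices, not part of the release.

```python
def triage(proba_xgb, proba_mlp, min_confidence=0.6):
    """Return (label_index, needs_review) for one record.

    Hypothetical helper: flags a record for human review when the two
    models disagree, or when the XGBoost top-class probability is low.
    """
    top_xgb = max(range(len(proba_xgb)), key=proba_xgb.__getitem__)
    top_mlp = max(range(len(proba_mlp)), key=proba_mlp.__getitem__)
    disagree = top_xgb != top_mlp
    low_confidence = proba_xgb[top_xgb] < min_confidence
    # XGBoost is the primary recommendation, so its label is the one reported.
    return top_xgb, disagree or low_confidence


# Made-up probability vectors, for illustration only.
agree = triage([0.05, 0.90, 0.05], [0.10, 0.80, 0.10])
split = triage([0.05, 0.90, 0.05], [0.70, 0.20, 0.10])
print(agree, split)
```

The first record passes straight through; the second keeps the XGBoost label but is marked for review because the MLP picked a different class.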

Quick start

pip install xgboost torch safetensors pandas huggingface_hub

import os
import sys

import numpy as np
import xgboost as xgb
from huggingface_hub import hf_hub_download

REPO = "xpertsystems/cyb003-baseline-classifier"

# Download the model artifacts and the bundled feature pipeline.
paths = {n: hf_hub_download(REPO, n) for n in [
    "model_xgb.json", "model_mlp.safetensors",
    "feature_engineering.py", "feature_meta.json", "feature_scaler.json",
]}

# feature_engineering.py ships with the repo, so add its download directory to the import path.
sys.path.insert(0, os.path.dirname(paths["feature_engineering.py"]))
from feature_engineering import transform_single, load_meta, INT_TO_LABEL

meta = load_meta(paths["feature_meta.json"])
xgb_model = xgb.XGBClassifier()
xgb_model.load_model(paths["model_xgb.json"])

# Predict one record. my_timestep_record is your own per-timestep telemetry record
# (see inference_example.ipynb for the full pattern).
X = transform_single(my_timestep_record, meta)
proba = xgb_model.predict_proba(X)[0]
print(INT_TO_LABEL[int(np.argmax(proba))])

See inference_example.ipynb for the full copy-paste demo.

Training data

Trained on the public sample of CYB003, 6,000 per-timestep telemetry rows from 100 malware samples (60 timesteps per sample):

| Phase | Total rows | Share of rows | Test rows (seed 42) |
| --- | --- | --- | --- |
| initial_drop | 801 | 13.4% | 120 |
| lateral_movement | 799 | 13.3% | 120 |
| persistence_establishment | 787 | 13.1% | 119 |
| data_exfiltration | 783 | 13.1% | 100 |
| c2_communication | 709 | 11.8% | 87 |
| privilege_escalation | 705 | 11.8% | 107 |
| payload_execution | 705 | 11.8% | 109 |
| dormancy_dwell | 250 | 4.2% | 83 |
| sandbox_evasion_stall | 234 | 3.9% | 32 |
| self_destruct_cleanup | 227 | 3.8% | 23 |

Group-aware split

A single malware sample generates 60 highly-correlated timesteps. Random row-level splitting would put timesteps from the same sample in both train and test, inflating metrics in a way that does not generalize to new samples.

This release uses GroupShuffleSplit by sample_id (nested, 70/15/15):

| Fold | Samples | Timesteps |
| --- | --- | --- |
| Train | 69 | 4,140 |
| Validation | 16 | 960 |
| Test | 15 | 900 |
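The idea behind a group-aware split can be sketched with the standard library alone. This is a simplified stand-in for scikit-learn's GroupShuffleSplit, not the training code used for the release: the `group_split` helper and the generated sample IDs are hypothetical, and its fold sizes (70/15/15 samples) will not reproduce the 69/16/15 counts above.

```python
import random


def group_split(sample_ids, fractions=(0.70, 0.15, 0.15), seed=42):
    """Split unique sample IDs into disjoint train/val/test groups."""
    unique_ids = sorted(set(sample_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    n_train = round(fractions[0] * len(unique_ids))
    n_val = round(fractions[1] * len(unique_ids))
    train = set(unique_ids[:n_train])
    val = set(unique_ids[n_train:n_train + n_val])
    test = set(unique_ids[n_train + n_val:])
    return train, val, test


# 100 hypothetical sample IDs with 60 timesteps each, mirroring the sample's shape.
rows = [(f"sample_{i:03d}", t) for i in range(100) for t in range(60)]
train_ids, val_ids, test_ids = group_split(sid for sid, _ in rows)

# Every row follows its sample, so no sample straddles two folds.
fold_rows = {
    name: sum(1 for sid, _ in rows if sid in ids)
    for name, ids in (("train", train_ids), ("val", val_ids), ("test", test_ids))
}
print(fold_rows)
```

Splitting on IDs first and assigning rows afterwards is what keeps the 60 correlated timesteps of one sample on the same side of the split.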

All test samples are completely unseen during training. Class imbalance is addressed with balanced class weights, passed to XGBoost as per-row sample_weight, and with weighted cross-entropy for the MLP.
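As a rough illustration of balanced weighting, the sketch below applies the common "balanced" heuristic, weight = n_rows / (n_classes × class_count). It uses the full-sample phase counts from the table above purely as an example; the released models would have been weighted on the training fold, whose per-phase counts are not published, and the exact weighting strategy is not part of this release.

```python
# Full-sample phase counts from the Training data table (illustration only).
phase_counts = {
    "initial_drop": 801,
    "lateral_movement": 799,
    "persistence_establishment": 787,
    "data_exfiltration": 783,
    "c2_communication": 709,
    "privilege_escalation": 705,
    "payload_execution": 705,
    "dormancy_dwell": 250,
    "sandbox_evasion_stall": 234,
    "self_destruct_cleanup": 227,
}

n_rows = sum(phase_counts.values())
n_classes = len(phase_counts)

# "Balanced" heuristic: rare classes get proportionally larger weights.
class_weight = {
    phase: n_rows / (n_classes * count) for phase, count in phase_counts.items()
}


def sample_weights(labels):
    """Expand class weights to one weight per row (hypothetical helper)."""
    return [class_weight[label] for label in labels]


for phase, weight in class_weight.items():
    print(f"{phase:28s} {weight:.3f}")
```

Under this heuristic every class contributes the same total weight, so the three rare phases are not drowned out by the seven common ones.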

Feature pipeline

The bundled feature_engineering.py is the canonical feature recipe. 69 features survive after encoding, drawn from:

  • Per-timestep numeric (10): timestep, api_call_rate, registry_write_count, network_connection_count, process_injection_flag, c2_beacon_interval_sec, av_signature_hit_flag, sandbox_evasion_flag, lateral_propagation_count, privilege_escalation_flag
  • PE static features (11): pe_entropy_mean, pe_entropy_std, import_hash_cluster, section_count, packed_section_ratio, string_entropy_mean, byte_histogram_chi2, code_section_rx_ratio, resource_section_entropy, suspicious_import_count, packer_detected_flag
  • Categorical (6, one-hot encoded): malware_family, threat_actor_tier, target_platform, obfuscation_technique, detection_outcome, ep_stack
  • Engineered (6): api_burst_score, is_c2_active, is_high_net_volume, is_stealth_step, is_destructive_step, lateral_activity_score
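The one-hot step can be sketched for a single categorical column as follows. The ep_stack levels are the eight values listed in the schema notes of this README; the level order, the `one_hot` helper, and the `ep_stack=<level>` column naming are assumptions for illustration. The bundled feature_engineering.py and feature_meta.json remain the authoritative recipe.

```python
# ep_stack levels as listed in this README's schema notes. The order shown
# here is illustrative; the real column order lives in feature_meta.json.
EP_STACK_LEVELS = [
    "legacy_av_only",
    "ngav_ml_based",
    "edr_endpoint_detect",
    "av_plus_firewall",
    "xdr_extended_detect",
    "managed_detection_response",
    "deception_honeypot",
    "no_protection",
]


def one_hot(value, levels, prefix):
    """Encode one categorical value against a fixed list of levels.

    An unseen value yields an all-zero row instead of a new column,
    so the feature width never changes at inference time.
    """
    return {f"{prefix}={level}": int(value == level) for level in levels}


encoded = one_hot("edr_endpoint_detect", EP_STACK_LEVELS, "ep_stack")
unknown = one_hot("some_new_stack", EP_STACK_LEVELS, "ep_stack")
print(encoded)
```

Encoding against a stored level list, rather than against whatever values appear in a given batch, is what keeps a single record compatible with a model trained on 69 fixed columns.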

Leakage audit

No categorical feature has phase->phase purity above 0.17 (uniform random baseline is 0.10), so nothing in the dataset is an oracle for the target. The model relies on a mix of timestep (strong but not deterministic) and behavioural features.
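A purity check of this kind could look like the sketch below. The definition used here is an assumption: for each value of a categorical feature, take the share of its most common phase, then report the maximum over values. The audit behind the 0.17 figure may have been computed differently. The toy rows and the `neutral` and `leaky` column names are made up.

```python
from collections import Counter, defaultdict


def max_purity(rows, feature, target="execution_phase"):
    """Largest share of a single target class among rows sharing a feature value."""
    by_value = defaultdict(Counter)
    for row in rows:
        by_value[row[feature]][row[target]] += 1
    return max(
        max(counts.values()) / sum(counts.values()) for counts in by_value.values()
    )


# Toy data: "leaky" determines the phase exactly, "neutral" carries no signal.
rows = [
    {"neutral": "a", "leaky": "x", "execution_phase": "initial_drop"},
    {"neutral": "a", "leaky": "y", "execution_phase": "c2_communication"},
    {"neutral": "b", "leaky": "x", "execution_phase": "initial_drop"},
    {"neutral": "b", "leaky": "y", "execution_phase": "c2_communication"},
]

print(max_purity(rows, "neutral"), max_purity(rows, "leaky"))
```

A purity of 1.0 marks a feature that acts as an oracle for the target; a value near the class prior means the feature alone says little about the phase.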

Evaluation

Test-set metrics, seed 42 (n = 900 timesteps from 15 disjoint samples)

XGBoost (the published model_xgb.json artifact)

| Metric | Value |
| --- | --- |
| Macro ROC-AUC (OvR) | 0.9792 |
| Accuracy | 0.9178 |
| Macro-F1 | 0.7781 |
| Weighted-F1 | 0.9173 |

MLP (the published model_mlp.safetensors artifact)

| Metric | Value |
| --- | --- |
| Macro ROC-AUC (OvR) | 0.9681 |
| Accuracy | 0.8222 |
| Macro-F1 | 0.7072 |
| Weighted-F1 | 0.8278 |

Multi-seed robustness (XGBoost, 10 seeds)

Accuracy and ROC-AUC are tight across seeds — the task is genuinely learnable, not seed-lucky:

| Metric | Mean | Std | Min | Max |
| --- | --- | --- | --- | --- |
| Accuracy | 0.905 | 0.010 | 0.882 | 0.921 |
| Macro-F1 | 0.784 | 0.013 | 0.759 | 0.807 |
| Macro ROC-AUC OvR | 0.975 | 0.002 | 0.972 | 0.979 |

Full per-seed results in multi_seed_results.json. All 10 seeds yielded all 10 classes in the test fold, supporting clean multi-class ROC-AUC computation.

Per-class F1 (seed 42) — where the signal is and isn't

| Phase | XGBoost F1 | MLP F1 | Note |
| --- | --- | --- | --- |
| c2_communication | 1.000 | 1.000 | Trivial: tight timestep window 52-59 + c2_beacon signal |
| persistence_establishment | 0.992 | 0.870 | Tight timestep window 9-17 + registry writes |
| lateral_movement | 0.992 | 0.907 | Tight timestep window 26-34 + lateral_propagation |
| privilege_escalation | 0.991 | 0.915 | Tight timestep window 18-25 + privilege flag |
| data_exfiltration | 0.970 | 0.918 | Tight timestep window 43-51 + network volume |
| payload_execution | 0.963 | 0.698 | Tight timestep window 35-42 + API bursts |
| initial_drop | 0.945 | 0.886 | Tight timestep window 0-8 |
| dormancy_dwell | 0.530 | 0.520 | Hard: spans full 0-59 timestep range |
| self_destruct_cleanup | 0.273 | 0.282 | Hard: spans full 0-59, low row count (227) |
| sandbox_evasion_stall | 0.125 | 0.077 | Hard: spans full 0-59, low row count (234) |

Seven phases are near-trivially classified because they sit in tight timestep windows with characteristic behavioural signatures. Three phases — dormancy_dwell, sandbox_evasion_stall, self_destruct_cleanup — scatter across the full 0–59 timestep range and lack distinctive behavioural features (idle/evasion phases have low activity by design), so a flat-tabular event-level model can't reliably disambiguate them. Sequence models that consider neighbouring timesteps would help here.
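The gap between accuracy and macro-F1 follows directly from the per-class table: macro-F1 is the unweighted mean of the ten per-class F1 scores, so three weak classes pull it down regardless of how few rows they have. The short calculation below uses the XGBoost column as published; only the variable names are new.

```python
# XGBoost per-class F1, seed 42, copied from the table above.
xgb_f1 = {
    "c2_communication": 1.000,
    "persistence_establishment": 0.992,
    "lateral_movement": 0.992,
    "privilege_escalation": 0.991,
    "data_exfiltration": 0.970,
    "payload_execution": 0.963,
    "initial_drop": 0.945,
    "dormancy_dwell": 0.530,
    "self_destruct_cleanup": 0.273,
    "sandbox_evasion_stall": 0.125,
}
hard_phases = {"dormancy_dwell", "self_destruct_cleanup", "sandbox_evasion_stall"}

# Macro-F1 gives every class the same weight, however many rows it has.
macro_f1 = sum(xgb_f1.values()) / len(xgb_f1)

easy_scores = [f1 for phase, f1 in xgb_f1.items() if phase not in hard_phases]
hard_scores = [f1 for phase, f1 in xgb_f1.items() if phase in hard_phases]
easy_mean = sum(easy_scores) / len(easy_scores)
hard_mean = sum(hard_scores) / len(hard_scores)

print(round(macro_f1, 4), round(easy_mean, 3), round(hard_mean, 3))
```

This reproduces the reported macro-F1 of 0.7781 and splits it into a mean of about 0.979 over the seven windowed phases and about 0.309 over the three scattered ones.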

Ablation: which feature groups matter

| Configuration | Accuracy | Macro-F1 | ROC-AUC | Δ accuracy |
| --- | --- | --- | --- | --- |
| Full feature set (published) | 0.9178 | 0.7781 | 0.9792 | |
| No timestep | 0.6933 | 0.5963 | 0.9264 | −0.2244 |
| No behavioural features | 0.9089 | 0.7579 | 0.9705 | −0.0089 |
| No PE static features | 0.9167 | 0.7808 | 0.9786 | −0.0011 |
| No engineered features | 0.9200 | 0.7931 | 0.9797 | +0.0022 |

Three clear findings:

  1. timestep is by far the dominant feature (drops 22 pp when removed, ROC-AUC still 0.93). Malware execution progresses in time, and where you are in that timeline carries most of the phase signal.
  2. PE static features are barely used for phase prediction. This is honest: PE features (entropy, packed sections, import hashes) inform family classification, not phase classification. A buyer doing family work should expect to use them; for phase work they can be dropped.
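  3. Behavioural features contribute ~0.9 pp, while the engineered features add nothing measurable: removing them changes accuracy by +0.2 pp, well inside the 1.0 pp multi-seed standard deviation. Trees recover most of the engineered features on their own.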
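The ablation deltas can be re-derived from the accuracy column and set against the multi-seed spread. In this sketch the rule of comparing each delta with twice the multi-seed standard deviation is an illustrative yardstick, not a test the authors report; the ablation ran on a single split, so the comparison is only indicative.

```python
baseline_accuracy = 0.9178           # full feature set, seed 42
ablation_accuracy = {
    "No timestep": 0.6933,
    "No behavioural features": 0.9089,
    "No PE static features": 0.9167,
    "No engineered features": 0.9200,
}
seed_std_pp = 1.0                    # 0.010 accuracy std across 10 seeds, in pp

results = {}
for name, accuracy in ablation_accuracy.items():
    delta_pp = (accuracy - baseline_accuracy) * 100
    # Assumed yardstick: a change under 2 std is treated as noise.
    verdict = "beyond seed noise" if abs(delta_pp) > 2 * seed_std_pp else "within seed noise"
    results[name] = (round(delta_pp, 2), verdict)
    print(f"{name:26s} {delta_pp:+7.2f} pp  {verdict}")
```

Only the timestep ablation clears that bar. The recomputed timestep delta differs from the table in the last digit, presumably because the table was derived from unrounded accuracies.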

Architecture

XGBoost: multi-class gradient boosting (multi:softprob, 10 classes), hist tree method, class-balanced sample weights, early stopping on validation mlogloss.

MLP: 69 → 128 → 64 → 10, each hidden layer followed by BatchNorm1d → ReLU → Dropout(0.3), weighted cross-entropy loss, AdamW optimizer, early stopping on validation macro-F1.
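To give a sense of scale for the MLP, the sketch below counts trainable parameters for the stated 69 → 128 → 64 → 10 shape. It assumes every linear layer has a bias and each BatchNorm1d has a learnable scale and shift, which are common defaults but are not confirmed by this README.

```python
layer_sizes = [69, 128, 64, 10]   # input, two hidden layers, 10 phase logits


def count_parameters(sizes):
    """Count trainable parameters for Linear layers plus BatchNorm on hidden layers."""
    total = 0
    last_linear = len(sizes) - 2
    for i, (n_in, n_out) in enumerate(zip(sizes, sizes[1:])):
        total += n_in * n_out + n_out   # Linear weights + bias (assumed)
        if i < last_linear:
            total += 2 * n_out          # BatchNorm1d scale + shift (assumed affine)
    return total


n_params = count_parameters(layer_sizes)
train_rows = 4140                       # training timesteps from the split table
print(n_params, round(n_params / train_rows, 1))
```

Under those assumptions the network has 18,250 trainable parameters, roughly four per training timestep, which is consistent with the heavy regularisation (dropout and early stopping) listed for this architecture.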

Training hyperparameters (learning rate, batch size, n_estimators, early-stopping patience, weight decay, class-weighting strategy) are held internally by XpertSystems and are not part of this release.

Limitations

This is a baseline reference, not a production sandbox or threat detector.

  1. Three phases are genuinely hard at this sample size. dormancy_dwell, sandbox_evasion_stall, and self_destruct_cleanup span the full 0–59 timestep range and have low row counts, giving per-class F1 of 0.13–0.53 (XGBoost). By design, these phases lack distinctive moment-to-moment features, because the malware is staying quiet to evade detection. Sequence models or per-sample aggregation would likely improve them substantially.

  2. The pivot away from malware family classification is dataset-limited, not method-limited. Family classification on 100 samples with 10 classes is at majority baseline. The full 280k-row CYB003 product provides ~5,600 samples and supports proper family classification.

  3. Synthetic-vs-real transfer. The dataset is synthetic and calibrated to threat-intelligence and AV-testing benchmark targets (VirusTotal, AV-TEST, MITRE ATT&CK Evaluations, Mandiant M-Trends, CrowdStrike GTR, Verizon DBIR). Real malware telemetry has different noise characteristics, adversary adaptation, and instrumentation gaps. Do not assume metrics transfer.

  4. Adversarial robustness not evaluated. The dataset is not adversarially generated; the model has not been red-teamed against evasive samples.

  5. MLP brittleness on OOD inputs. With ~4k training timesteps, the MLP can produce confidently-wrong predictions on hand-crafted records far from the training manifold. XGBoost is more robust. Use both; treat disagreement as a signal for human review.

  6. timestep dominance is a property of the dataset. Real malware in production doesn't have a clean "timestep" feature on a per-sample 60-step normalized timeline — that's a simulator artifact. A buyer transferring this baseline to real sandbox traces would need to recover an equivalent temporal-position feature from execution-trace timestamps relative to detonation.
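For the temporal-position point in item 6, one possible starting point is to bucket each event by its elapsed time since detonation. The sketch below is hypothetical throughout: the function name, its arguments, and the linear scaling onto 60 steps are assumptions, and real traces may not map linearly onto the simulator's timeline.

```python
def to_timestep(event_ts, detonation_ts, trace_end_ts, n_steps=60):
    """Map an event time onto a 0..n_steps-1 bucket of a normalized timeline.

    All timestamps are in seconds on the same clock. Linear scaling is an
    assumption made for illustration.
    """
    duration = trace_end_ts - detonation_ts
    if duration <= 0:
        raise ValueError("trace must end after detonation")
    position = (event_ts - detonation_ts) / duration
    position = min(max(position, 0.0), 1.0)          # clamp events outside the trace
    return min(int(position * n_steps), n_steps - 1)  # keep the end point in the last bucket


# A 10-minute sandbox run, with events at the start, 95 s, midpoint, and end.
steps = [to_timestep(t, 0.0, 600.0) for t in (0.0, 95.0, 300.0, 600.0)]
print(steps)
```

Whether a feature built this way carries the same signal as the synthetic timestep is an open question that would need checking on real traces.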

Notes on dataset schema

The CYB003 sample dataset README describes some fields differently from the actual schema. The model was trained on the actual schema; this note helps buyers reconcile what they read with what they receive.

| What the README says | What the data actually contains |
| --- | --- |
| pe_entropy (one column) | pe_entropy_mean + pe_entropy_std (two columns) |
| process_injection_count | process_injection_flag (binary, not a count) |
| c2_beacon_active | c2_beacon_interval_sec (seconds, 0 when inactive) |
| av_detected, edr_detected, sandbox_evaded, dwell_time_hours, persistence_mechanism, lotl_technique_used (per-timestep) | None of these exist in the per-timestep table; equivalents (av_signature_hit_flag, sandbox_evasion_flag) exist under different names |
| ep_stack: 3 values (legacy_av, ngav_ml_based, edr_full) | ep_stack: 8 values (legacy_av_only, ngav_ml_based, edr_endpoint_detect, av_plus_firewall, xdr_extended_detect, managed_detection_response, deception_honeypot, no_protection) |
| 9 malware families listed | 10 families in the data (apt_implant is the additional one) |
| coordinated_campaign_flag (described as a flag) | Constant = 1 for all rows in the sample (uninformative) |

The actual per-timestep table also contains rich PE-static features not listed in the README: import_hash_cluster, section_count, packed_section_ratio, string_entropy_mean, byte_histogram_chi2, code_section_rx_ratio, resource_section_entropy, suspicious_import_count. These are excellent features for family classification work and are documented in the model's feature_engineering.py.

None of these discrepancies affects model correctness — the feature pipeline uses the actual column names. If you build your own pipeline against the dataset, use the actual columns, not the README descriptions.
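If you are porting code written against the README names, a small guard like the one below can catch stale column references early. The mapping is taken from the table above; the helper names `find_stale_columns` and `c2_active` are hypothetical, and the rule that a beacon is active when its interval is above zero follows from the "0 when inactive" note.

```python
# README name -> actual column(s), per the schema table above.
README_TO_ACTUAL = {
    "pe_entropy": ["pe_entropy_mean", "pe_entropy_std"],
    "process_injection_count": ["process_injection_flag"],
    "c2_beacon_active": ["c2_beacon_interval_sec"],
}


def find_stale_columns(requested_columns):
    """Return README-only names in a column list, with their actual replacements."""
    return {
        name: README_TO_ACTUAL[name]
        for name in requested_columns
        if name in README_TO_ACTUAL
    }


def c2_active(record):
    """Derive a boolean beacon flag from the interval column (0 means inactive)."""
    return record["c2_beacon_interval_sec"] > 0


stale = find_stale_columns(["timestep", "pe_entropy", "c2_beacon_active"])
print(stale)
print(c2_active({"c2_beacon_interval_sec": 30.0}))
```

Running the check once against your own feature list is cheaper than debugging a KeyError halfway through a pipeline.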

Intended use

  • Evaluating fit of the CYB003 dataset for your malware-analysis or sandbox-detection research
  • Baseline reference for new model architectures (especially sequence models, which should beat this baseline on the late/scattered phases)
  • Teaching and demo for tabular classification on malware telemetry
  • Feature engineering reference for per-timestep behavioural data

Out-of-scope use

  • Production sandbox analysis on real malware
  • EDR phase tagging on real systems
  • Family attribution (this baseline does not address that task; see why above)
  • Adversarial-evasion evaluation (dataset not adversarially generated)
  • Any operational security decision

Reproducibility

Outputs above were produced with seed = 42 (published artifact), group-aware nested GroupShuffleSplit (70/15/15 by sample_id), on the published sample (xpertsystems/cyb003-sample, version 1.0.0, generated 2026-05-16). The feature pipeline in feature_engineering.py is deterministic and the trained weights in this repo correspond exactly to the metrics above.

Multi-seed results (seeds 42, 7, 13, 17, 23, 31, 45, 99, 123, 200) in multi_seed_results.json confirm robust performance across splits.

The training script itself is private to XpertSystems. The published artifacts contain the feature pipeline, model weights, scaler, metadata, and validation results — sufficient to reproduce inference but not training.

Files in this repo

| File | Purpose |
| --- | --- |
| model_xgb.json | XGBoost weights (seed 42) |
| model_mlp.safetensors | PyTorch MLP weights (seed 42) |
| feature_engineering.py | Feature pipeline (load → engineer → encode) |
| feature_meta.json | Feature column order + categorical levels |
| feature_scaler.json | MLP input mean/std (XGBoost ignores) |
| validation_results.json | Per-class metrics, confusion matrix, architecture |
| ablation_results.json | Per-feature-group ablation (timestep, behavioural, PE static, engineered) |
| multi_seed_results.json | XGBoost metrics across 10 seeds with aggregate statistics |
| inference_example.ipynb | End-to-end inference demo notebook |
| README.md | This file |

Contact and full product

The full CYB003 dataset contains ~349,000 rows across four files, with calibrated benchmark validation against 12 metrics drawn from authoritative threat intelligence and AV-testing sources (VirusTotal, AV-TEST, MITRE ATT&CK Evaluations, Mandiant, CrowdStrike, Verizon). The full XpertSystems.ai synthetic data catalogue spans 41 SKUs across Cybersecurity, Healthcare, Insurance & Risk, Oil & Gas, and Materials & Energy.

Citation

@misc{xpertsystems_cyb003_baseline_2026,
  title  = {CYB003 Baseline Classifier: XGBoost and MLP for Malware Execution Phase Classification},
  author = {XpertSystems.ai},
  year   = {2026},
  url    = {https://huggingface.co/xpertsystems/cyb003-baseline-classifier},
  note   = {Baseline reference model trained on xpertsystems/cyb003-sample}
}