{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# CYB006 Baseline Classifier: Inference Example\n",
    "\n",
    "End-to-end demo: load the trained XGBoost and PyTorch MLP models from the Hugging Face repo and predict the **user risk tier** (`low` / `medium` / `high`) of an identity from per-user aggregates joined with non-leaky session aggregates.\n",
    "\n",
    "**This is a baseline reference model**, not a production identity-security platform. See the model card for full metrics and limitations. Importantly, see **`leakage_diagnostic.json`** for why this baseline targets `user_risk_tier` rather than the README's stated headline use case of threat-actor tier attribution."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Install dependencies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install --quiet xgboost torch safetensors pandas numpy huggingface_hub"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Download model artifacts from Hugging Face"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from huggingface_hub import hf_hub_download\n",
    "\n",
    "REPO_ID = \"xpertsystems/cyb006-baseline-classifier\"\n",
    "\n",
    "files = {}\n",
    "for name in [\"model_xgb.json\", \"model_mlp.safetensors\",\n",
    "             \"feature_engineering.py\", \"feature_meta.json\",\n",
    "             \"feature_scaler.json\"]:\n",
    "    files[name] = hf_hub_download(repo_id=REPO_ID, filename=name)\n",
    "    print(f\"  downloaded: {name}\")"
   ]
  },
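  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import sys\n",
    "\n",
    "# feature_engineering.py sits in the local Hugging Face cache; put its\n",
    "# directory on sys.path so the module can be imported.\n",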
    "fe_dir = os.path.dirname(files[\"feature_engineering.py\"])\n",
    "if fe_dir not in sys.path:\n",
    "    sys.path.insert(0, fe_dir)\n",
    "\n",
    "from feature_engineering import (\n",
    "    transform_single, load_meta, INT_TO_LABEL,\n",
    "    compute_session_aggregates_for_user\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Load models and metadata"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import numpy as np\n",
    "import torch\n",
    "import torch.nn as nn\n",
    "import xgboost as xgb\n",
    "from safetensors.torch import load_file\n",
    "\n",
    "meta = load_meta(files[\"feature_meta.json\"])\n",
    "with open(files[\"feature_scaler.json\"]) as f:\n",
    "    scaler = json.load(f)\n",
    "\n",
    "N_FEATURES = len(meta[\"feature_names\"])\n",
    "N_CLASSES = len(meta[\"int_to_label\"])\n",
    "print(f\"feature count: {N_FEATURES}\")\n",
    "print(f\"class count:   {N_CLASSES}\")\n",
    "print(f\"label classes: {list(meta['int_to_label'].values())}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "xgb_model = xgb.XGBClassifier()\n",
    "xgb_model.load_model(files[\"model_xgb.json\"])\n",
    "\n",
    "# MLP architecture (must match training)\n",
    "class RiskTierMLP(nn.Module):\n",
    "    def __init__(self, n_features, n_classes=3, hidden1=128, hidden2=64, dropout=0.3):\n",
    "        super().__init__()\n",
    "        self.net = nn.Sequential(\n",
    "            nn.Linear(n_features, hidden1),\n",
    "            nn.BatchNorm1d(hidden1),\n",
    "            nn.ReLU(),\n",
    "            nn.Dropout(dropout),\n",
    "            nn.Linear(hidden1, hidden2),\n",
    "            nn.BatchNorm1d(hidden2),\n",
    "            nn.ReLU(),\n",
    "            nn.Dropout(dropout),\n",
    "            nn.Linear(hidden2, n_classes),\n",
    "        )\n",
    "    def forward(self, x):\n",
    "        return self.net(x)\n",
    "\n",
    "mlp_model = RiskTierMLP(N_FEATURES, n_classes=N_CLASSES)\n",
    "mlp_model.load_state_dict(load_file(files[\"model_mlp.safetensors\"]))\n",
    "mlp_model.eval()\n",
    "print(\"models loaded\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Prediction helper"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Standardization statistics for the MLP input; XGBoost uses the unscaled features.\n",
    "MU = np.array(scaler[\"mean\"], dtype=np.float32)\n",
    "SD = np.array(scaler[\"std\"],  dtype=np.float32)\n",
    "\n",
    "def predict_risk_tier(user_record: dict) -> dict:\n",
    "    \"\"\"Predict the user risk tier from a per-user record.\n",
    "\n",
    "    The record should contain the per-user aggregates (from user_risk_summary)\n",
    "    plus the session aggregates produced by compute_session_aggregates_for_user.\n",
    "    See the example record below.\n",
    "    \"\"\"\n",
    "    X = transform_single(user_record, meta)\n",
    "\n",
    "    # XGBoost scores the unscaled features.\n",
    "    xgb_proba = xgb_model.predict_proba(X)[0]\n",
    "    xgb_label = INT_TO_LABEL[int(np.argmax(xgb_proba))]\n",
    "\n",
    "    # The MLP takes standardized features; no_grad skips gradient tracking.\n",
    "    Xs = ((X - MU) / SD).astype(np.float32)\n",
    "    with torch.no_grad():\n",
    "        logits = mlp_model(torch.tensor(Xs))\n",
    "        mlp_proba = torch.softmax(logits, dim=1).numpy()[0]\n",
    "    mlp_label = INT_TO_LABEL[int(np.argmax(mlp_proba))]\n",
    "\n",
    "    return {\n",
    "        \"xgboost\": {\n",
    "            \"label\": xgb_label,\n",
    "            \"probabilities\": {INT_TO_LABEL[i]: float(p) for i, p in enumerate(xgb_proba)},\n",
    "        },\n",
    "        \"mlp\": {\n",
    "            \"label\": mlp_label,\n",
    "            \"probabilities\": {INT_TO_LABEL[i]: float(p) for i, p in enumerate(mlp_proba)},\n",
    "        },\n",
    "    }"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Run on an example record\n",
    "\n",
    "A real high-risk user from the sample dataset: 98 login attempts in the window, 25 failed logins, 9 account lockouts, 9 impossible-travel events, 6 unique countries, and peak privilege `admin_domain`. Both models should predict `high`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Real per-user record from the sample dataset (true tier: high)\n",
    "example_record = {\n",
    "    # Per-user aggregates (from user_risk_summary.csv)\n",
    "    \"total_login_attempts\": 98,\n",
    "    \"successful_logins\": 0,\n",
    "    \"failed_logins\": 25,\n",
    "    \"mfa_failures\": 0,\n",
    "    \"impossible_travel_events\": 9,\n",
    "    \"lateral_hop_count\": 1,\n",
    "    \"privilege_escalations\": 1,\n",
    "    \"account_lockout_count\": 9,\n",
    "    \"geo_dispersion_score\": 0.6474,\n",
    "    \"login_velocity_score\": 0.6387,\n",
    "    \"session_anomaly_rate\": 1.0,\n",
    "    \"ueba_alert_count\": 0,\n",
    "    \"overall_identity_risk_score\": 0.3452,\n",
    "    \"peak_privilege_level_accessed\": \"admin_domain\",\n",
    "    \"insider_threat_indicator_score\": 0.0,\n",
    "    # Session aggregates (computed via compute_session_aggregates_for_user)\n",
    "    \"avg_session_duration_seconds\": 352.24,\n",
    "    \"avg_mfa_response_latency_ms\": 26.67,\n",
    "    \"avg_geo_anomaly_score\": 0.6474,\n",
    "    \"max_geo_anomaly_score\": 1.0,\n",
    "    \"frac_impossible_travel\": 0.36,\n",
    "    \"n_unique_countries\": 6,\n",
    "    \"n_unique_devices\": 25,\n",
    "    \"n_unique_applications\": 1,\n",
    "}\n",
    "\n",
    "result = predict_risk_tier(example_record)\n",
    "\n",
    "print(f\"XGBoost  ->  {result['xgboost']['label']}\")\n",
    "for lbl, p in sorted(result['xgboost']['probabilities'].items(), key=lambda x: -x[1]):\n",
    "    print(f\"    P({lbl:8s}) = {p:.4f}\")\n",
    "\n",
    "print(f\"\\nMLP      ->  {result['mlp']['label']}\")\n",
    "for lbl, p in sorted(result['mlp']['probabilities'].items(), key=lambda x: -x[1]):\n",
    "    print(f\"    P({lbl:8s}) = {p:.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Batch prediction on the sample dataset\n",
    "\n",
    "Score the first 50 users in `user_risk_summary.csv` after joining each user's session aggregates from `login_sessions.csv`. Drop the `.head(50)` call in the loop to score every user."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from huggingface_hub import snapshot_download\n",
    "import pandas as pd\n",
    "\n",
    "ds_path = snapshot_download(repo_id=\"xpertsystems/cyb006-sample\", repo_type=\"dataset\")\n",
    "users = pd.read_csv(f\"{ds_path}/user_risk_summary.csv\")\n",
    "sessions = pd.read_csv(f\"{ds_path}/login_sessions.csv\")\n",
    "\n",
    "preds = []\n",
    "for _, row in users.head(50).iterrows():\n",
    "    user_sessions = sessions[sessions[\"user_id\"] == row[\"user_id\"]]\n",
    "    # Session aggregates need at least one session; skip users without any.\n",
    "    if user_sessions.empty:\n",
    "        continue\n",
    "    rec = row.to_dict()\n",
    "    rec.update(compute_session_aggregates_for_user(user_sessions))\n",
    "    pred = predict_risk_tier(rec)\n",
    "    preds.append({\n",
    "        \"user_id\": row[\"user_id\"],\n",
    "        \"true_tier\": row[\"user_risk_tier\"],\n",
    "        \"xgb_pred\": pred[\"xgboost\"][\"label\"],\n",
    "    })\n",
    "\n",
    "results = pd.DataFrame(preds)\n",
    "ct = pd.crosstab(results[\"true_tier\"], results[\"xgb_pred\"],\n",
    "                 rownames=[\"true\"], colnames=[\"pred\"])\n",
    "print(\"Confusion on first 50 users (XGBoost):\")\n",
    "print(ct)\n",
    "acc = (results[\"true_tier\"] == results[\"xgb_pred\"]).mean()\n",
    "print(f\"\\nbatch accuracy on first 50 users (in-distribution): {acc:.4f}\")\n",
    "print(\"\\nNote: this includes training-set users. See validation_results.json\\n\"\n",
    "      \"for proper held-out test metrics.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. Important: the leakage diagnostic\n",
    "\n",
    "Before using CYB006 sample data to train a threat-actor detector, read **`leakage_diagnostic.json`** in this repo. The README's stated headline use case (4-class threat-actor tier attribution) is not a representative ML task on the sample dataset: the synthetic generator produces threat-actor sessions with non-overlapping anomaly score distributions, so a plain XGBoost reaches 100% accuracy without reflecting any real learning. The diagnostic documents which feature groups carry the leakage and what we recommend to dataset authors.\n",
    "\n",
    "This baseline predicts `user_risk_tier` instead, a target with overlapping per-tier distributions, and it gains roughly 10 percentage points over the majority-class baseline."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 8. Next steps\n",
    "\n",
    "- See `validation_results.json` for held-out test metrics (30 disjoint users).\n",
    "- See `multi_seed_results.json` for results across 10 seeds (accuracy 0.700 ± 0.082, ROC-AUC 0.812 ± 0.048).\n",
    "- See `ablation_results.json` for per-feature-group contributions. The user aggregate counts (failed logins, lateral hops, etc.) carry the most signal.\n",
    "- See **`leakage_diagnostic.json`** for the detailed audit on threat-actor detection.\n",
    "- For the full ~1.1M-row CYB006 dataset and commercial licensing, contact **pradeep@xpertsystems.ai**."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}