---
license: mit
library_name: onnxruntime
tags:
- onnx
- gcn
- graph-neural-network
- cybersecurity
- apt-detection
- provenance-graph
- threat-detection
- graph-classification
- streamspot
pipeline_tag: graph-ml
model-index:
- name: provenance-gcn
  results: []
language:
- en
datasets:
- custom
metrics:
- accuracy
- precision
- recall
- f1
---

# Provenance-GCN — Graph Convolutional Network for APT Detection on Provenance Graphs


## Model Description

**Provenance-GCN** is a lightweight Graph Convolutional Network (GCN) designed to classify OS-level provenance graphs as benign or as belonging to a specific threat category. It operates on fixed-size provenance subgraphs (32 nodes) extracted via 2-hop BFS and outputs a classification across 5 threat classes.

The model implements real graph convolution (message passing) following Kipf & Welling (2017), not a simple MLP. The adjacency matrix and node features are passed as a single flattened tensor for ONNX compatibility and are reconstructed internally for graph-aware computation.

Built as the scoring backend for [**Graph Hunter**](https://github.com/base4security/graph-hunter) — a Rust/Tauri threat hunting tool — and compatible with its `npu_scorer.rs` inference module.

| Property | Value |
|---|---|
| **Architecture** | 2-layer GCN + MLP classifier |
| **Input** | Flattened provenance subgraph (`batch_size × 1536`) |
| **Output** | 5-class logits (`batch_size × 5`) |
| **Parameters** | 6,637 (~6.6K) |
| **File size** | 46.8 KB |
| **Format** | ONNX (opset 18, IR v10) |
| **Producer** | PyTorch 2.10.0+cu128 |
| **Training hardware** | NVIDIA A100-SXM4-80GB (Google Colab) |

## Architecture

```
Input [batch, 1536]
  │
  ├── Reshape → Node features X [batch, 32, 16]
  ├── Reshape → Adjacency matrix A [batch, 32, 32]
  │
  ├── Â = A + I        (add self-loops)
  ├── A_norm = D⁻¹ Â   (row-normalize)
  │
  ▼
GCN Layer 1: H¹ = ReLU(A_norm · X · W₁)     (16 → 64)
  │
  ▼
GCN Layer 2: H² = ReLU(A_norm · H¹ · W₂)    (64 → 32)
  │
  ▼
Global Mean Pooling over 32 nodes           (32×32 → 32)
  │
  ▼
Classifier: Linear(32, 64) → ReLU → Dropout(0.3) → Linear(64, 5)
  │
  ▼
Output [batch, 5]  (logits per class)
```

### Graph Convolution

Each GCN layer performs the operation:

$$H^{(l+1)} = \text{ReLU}\!\left(\hat{D}^{-1}\hat{A}\, H^{(l)}\, W^{(l)}\right)$$

where Â = A + I is the adjacency matrix with self-loops, D̂ is the degree matrix of Â, H⁽ˡ⁾ is the node feature matrix at layer l, and W⁽ˡ⁾ is the learnable weight matrix.

### Weight Tensors

| Layer | Shape | Parameters |
|---|---|---|
| `gc1.weight` (via MatMul) | `16 × 64` | 1,024 |
| `gc1.bias` | `64` | 64 |
| `gc2.weight` (via MatMul) | `64 × 32` | 2,048 |
| `gc2.bias` | `32` | 32 |
| `classifier.0.weight` | `32 × 64` | 2,048 |
| `classifier.0.bias` | `64` | 64 |
| `classifier.3.weight` | `64 × 5` | 320 |
| `classifier.3.bias` | `5` | 5 |
| **Total** | | **~6,637** |

## Input Format

The model expects a **flattened** float32 vector of size `1536` per graph:

| Segment | Shape | Description |
|---|---|---|
| Node features | `32 × 16 = 512` | 16-dim feature vector per node (see below) |
| Adjacency matrix | `32 × 32 = 1024` | Binary adjacency encoding causal edges |
| **Total** | **1536** | Concatenated and flattened |

### Node Feature Schema (16 dimensions)

| Index | Feature | Encoding |
|---|---|---|
| 0–8 | Entity type | One-hot: IP(0), Host(1), User(2), Process(3), File(4), Domain(5), Registry(6), URL(7), Service(8) |
| 9 | Out-degree | `min(out_degree / 32, 1.0)` |
| 10 | In-degree | `min(in_degree / 32, 1.0)` |
| 11–13 | Reserved | `0.0` |
| 14 | Total degree | `min((in+out) / 64, 1.0)` |
| 15 | Is center node | `1.0` if BFS root, else `0.0` |

### Subgraph Extraction

Subgraphs are extracted as **2-hop BFS neighborhoods** from a center node, capped at `K_MAX=32` nodes. This matches the extraction logic in `gnn_bridge.rs` from Graph Hunter.

## Output Classes

| Index | Class | Description |
|---|---|---|
| 0 | **Benign** | Normal system activity |
| 1 | **Exfiltration** | Data theft / unauthorized data transfer |
| 2 | **C2Beacon** | Command & Control communication (process → IP patterns) |
| 3 | **LateralMovement** | Lateral movement (process spawning chains) |
| 4 | **PrivilegeEscalation** | Privilege escalation (auth-related anomalies) |

Apply `softmax` to the logits to obtain probabilities. Threat score = `1.0 - P(Benign)`.
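As a quick sketch of this post-processing step, the logits can be turned into class probabilities and a threat score with NumPy. The logit values below are made up for illustration only:

```python
import numpy as np

# Hypothetical logits for one graph, ordered as in the class table:
# [Benign, Exfiltration, C2Beacon, LateralMovement, PrivilegeEscalation]
logits = np.array([[2.0, 0.1, 1.5, -0.3, 0.2]], dtype=np.float32)

# Numerically stable softmax over the class axis
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

# Threat score: probability mass on the non-benign classes
threat_score = 1.0 - probs[0, 0]
```

Subtracting the row-wise maximum before exponentiating avoids overflow for large logits while leaving the softmax result unchanged.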
## Training Details

### Dataset: StreamSpot

The model was trained on [**StreamSpot**](https://github.com/sbustreamspot/sbustreamspot-data) (Manzoor et al., 2016), a benchmark dataset of OS-level provenance graphs:

| Property | Value |
|---|---|
| Total graphs | 600 |
| Benign graphs | 500 (IDs 0–299, 400–599) |
| Attack graphs | 100 (IDs 300–399, drive-by-download scenario) |
| Raw edges | 89,770,902 |
| Parsed nodes | 5,046,878 |
| Parsed edges | 7,638,242 |
| Node types | Process, File, IP |
| Malicious nodes | 889,080 |

### Subgraph Sampling

From the full graph, **1,000,000 subgraphs** were extracted (balanced: 500K malicious + 500K benign) using 2-hop BFS with `K_MAX=32`. Threat labels were assigned via heuristics on the attack subgraph structure (network fan-out → C2Beacon, process spawning → LateralMovement, auth edges → PrivilegeEscalation, remaining malicious → Exfiltration).

### Training Configuration

| Hyperparameter | Value |
|---|---|
| Train / Val / Test split | 68% / 12% / 20% (stratified) |
| Batch size | 64 |
| Optimizer | Adam (lr=0.001, weight_decay=1e-4) |
| Scheduler | ReduceLROnPlateau (factor=0.5, patience=5) |
| Loss | CrossEntropyLoss with inverse-frequency class weights |
| Max epochs | 100 |
| Early stopping | Patience = 10 (on validation loss) |
| Dropout | 0.3 (classifier only) |
| Hardware | NVIDIA A100-SXM4-80GB |
| Export | `torch.onnx.export` (opset 18, dynamic batch axis) |

## Usage

### Python (ONNX Runtime)

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("provenance_gcn.onnx")

# Build input: 32 nodes × 16 features + 32×32 adjacency, flattened
graph_input = np.zeros((1, 1536), dtype=np.float32)

# Example: center node is a Process (index 3) with high out-degree
graph_input[0, 0 * 16 + 3] = 1.0    # node 0 = Process
graph_input[0, 0 * 16 + 9] = 0.25   # out-degree
graph_input[0, 0 * 16 + 15] = 1.0   # center node flag

# Add neighbor IPs and edges in the adjacency block (offset 512)
for i in range(1, 6):
    graph_input[0, i * 16 + 0] = 1.0        # node i = IP
    graph_input[0, 512 + 0 * 32 + i] = 1.0  # edge: node 0 → node i

# Run inference
logits = session.run(None, {"input": graph_input})[0]
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
threat_score = 1.0 - probs[0, 0]  # 1 - P(Benign)

CLASSES = ["Benign", "Exfiltration", "C2Beacon", "LateralMovement", "PrivilegeEscalation"]
print(f"Prediction: {CLASSES[np.argmax(logits)]}, Threat Score: {threat_score:.3f}")
```

### Batch Inference

```python
batch = np.random.randn(256, 1536).astype(np.float32)
logits = session.run(None, {"input": batch})[0]
predictions = np.argmax(logits, axis=1)
```

### Rust (Graph Hunter / ort crate)

```rust
use ort::{Session, Value};
use ndarray::Array2;

let session = Session::builder()?.commit_from_file("provenance_gcn.onnx")?;
let input = Array2::<f32>::zeros((1, 1536));
let outputs = session.run(ort::inputs!["input" => input]?)?;
let logits = outputs[0].extract_tensor::<f32>()?;
```

The ONNX interface is identical to what `npu_scorer.rs` expects: 1536 floats in, 5 logits out. Drop-in compatible with Graph Hunter.

## Intended Use

- **Primary**: Real-time classification of OS provenance subgraphs for threat detection in SOC/EDR pipelines, as the scoring backend for Graph Hunter.
- **Secondary**: Research on graph-based threat detection, GNN benchmarking for cybersecurity, and edge/NPU deployment scenarios (the model fits comfortably on NPUs such as the Hailo-8L).
- **Out of scope**: This model is not a standalone security product. It should be integrated as one signal within a broader detection pipeline.

## Limitations

- **Fixed graph size**: 32 nodes max — larger provenance graphs must be windowed via k-hop BFS before inference.
- **Heuristic labels**: Threat subcategories (Exfiltration, C2, LateralMovement, PrivEsc) are assigned via structural heuristics on StreamSpot attack graphs, not from ground-truth APT campaign labels.
- **Single attack type**: StreamSpot contains only one attack scenario (drive-by download). Generalization to other APT campaigns (APT29, APT3, etc.) requires retraining on datasets such as DARPA TC E3 CADETS/Theia.
- **Lightweight by design**: 6.6K parameters enable edge deployment but limit capacity for complex multi-stage attack patterns.
- **No adversarial evaluation**: Not tested against evasion techniques targeting GNN-based detectors.

## References

- Kipf, T. N. & Welling, M. (2017). *Semi-Supervised Classification with Graph Convolutional Networks*. ICLR 2017.
- Manzoor, E., Milajerdi, S. M. & Akoglu, L. (2016). *Fast Memory-efficient Anomaly Detection in Streaming Heterogeneous Graphs*. KDD 2016. ([StreamSpot dataset](https://github.com/sbustreamspot/sbustreamspot-data))
- DARPA Transparent Computing Program — [TC Engagement 3 datasets](https://github.com/darpa-i2o/Transparent-Computing)

## Citation

```bibtex
@misc{provenance-gcn-2025,
  title  = {Provenance-GCN: Lightweight Graph Convolutional Network for APT Detection on Provenance Graphs},
  author = {BASE4 Security R\&D+i},
  year   = {2025},
  note   = {Trained on StreamSpot, exported as ONNX for Graph Hunter}
}
```

## Contact

- **Organization**: [BASE4 Security](https://base4sec.com/) — R&D+i Division
- **Location**: Buenos Aires, Argentina