---
license: mit
library_name: onnxruntime
tags:
- onnx
- gcn
- graph-neural-network
- cybersecurity
- apt-detection
- provenance-graph
- threat-detection
- graph-classification
- streamspot
pipeline_tag: graph-ml
model-index:
- name: provenance-gcn
  results: []
language:
- en
datasets:
- custom
metrics:
- accuracy
- precision
- recall
- f1
---

# Provenance-GCN: Graph Convolutional Network for APT Detection on Provenance Graphs

<p align="center">
<img src="https://img.shields.io/badge/format-ONNX-blue?logo=onnx" alt="ONNX">
<img src="https://img.shields.io/badge/framework-PyTorch%202.10-ee4c2c?logo=pytorch" alt="PyTorch">
<img src="https://img.shields.io/badge/task-APT%20Detection-critical" alt="APT Detection">
<img src="https://img.shields.io/badge/params-6.6K-green" alt="Parameters">
<img src="https://img.shields.io/badge/dataset-StreamSpot-blueviolet" alt="StreamSpot">
<img src="https://img.shields.io/badge/license-MIT-brightgreen" alt="MIT License">
</p>

## Model Description

**Provenance-GCN** is a lightweight Graph Convolutional Network (GCN) designed to classify OS-level provenance graphs as benign or as belonging to specific threat categories. It operates on fixed-size provenance subgraphs (32 nodes) extracted via 2-hop BFS and outputs a classification over 5 threat classes.

The model implements real graph convolution (message passing) following Kipf & Welling (2017), not a simple MLP. For ONNX compatibility, the adjacency matrix and node features are passed as a single flattened tensor and reconstructed internally for graph-aware computation.

Built as the scoring backend for [**Graph Hunter**](https://github.com/base4security/graph-hunter), a Rust/Tauri threat-hunting tool, and compatible with its `npu_scorer.rs` inference module.

| Property | Value |
|---|---|
| **Architecture** | 2-layer GCN + MLP classifier |
| **Input** | Flattened provenance subgraph (`batch_size × 1536`) |
| **Output** | 5-class logits (`batch_size × 5`) |
| **Parameters** | 6,637 (~6.6K) |
| **File size** | 46.8 KB |
| **Format** | ONNX (opset 18, IR v10) |
| **Producer** | PyTorch 2.10.0+cu128 |
| **Training hardware** | NVIDIA A100-SXM4-80GB (Google Colab) |

## Architecture

```
Input [batch, 1536]
  │
  ├─ Reshape → Node features X    [batch, 32, 16]
  ├─ Reshape → Adjacency matrix A [batch, 32, 32]
  │
  ├─ Ã = A + I          (add self-loops)
  ├─ A_norm = D⁻¹ Ã     (row-normalize)
  │
  ▼
GCN Layer 1: H¹ = ReLU(A_norm · X · W₁)    (16 → 64)
  │
  ▼
GCN Layer 2: H² = ReLU(A_norm · H¹ · W₂)   (64 → 32)
  │
  ▼
Global Mean Pooling over 32 nodes  (32×32 → 32)
  │
  ▼
Classifier:
  Linear(32, 64) → ReLU → Dropout(0.3) → Linear(64, 5)
  │
  ▼
Output [batch, 5]  (logits per class)
```

### Graph Convolution

Each GCN layer performs the operation:

$$H^{(l+1)} = \text{ReLU}\!\left(\hat{D}^{-1}\hat{A}\, H^{(l)}\, W^{(l)}\right)$$

where $\hat{A} = A + I$ is the adjacency matrix with self-loops, $\hat{D}$ is the degree matrix of $\hat{A}$, $H^{(l)}$ is the node feature matrix at layer $l$, and $W^{(l)}$ is the learnable weight matrix.

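Concretely, the forward pass can be sketched in NumPy with untrained random weights (shapes follow the diagram and equation above, not the actual exported graph; biases are omitted for brevity):

```python
import numpy as np

def forward(flat, rng=np.random.default_rng(0)):
    """Sketch of the Provenance-GCN forward pass with random, untrained weights."""
    B = flat.shape[0]
    X = flat[:, :512].reshape(B, 32, 16)            # node features
    A = flat[:, 512:].reshape(B, 32, 32)            # adjacency matrix
    A_hat = A + np.eye(32, dtype=flat.dtype)        # add self-loops
    A_norm = A_hat / A_hat.sum(-1, keepdims=True)   # row-normalize: D^-1 * A_hat
    W1 = rng.standard_normal((16, 64)) * 0.1        # gc1
    W2 = rng.standard_normal((64, 32)) * 0.1        # gc2
    H1 = np.maximum(A_norm @ X @ W1, 0)             # GCN layer 1 (16 -> 64)
    H2 = np.maximum(A_norm @ H1 @ W2, 0)            # GCN layer 2 (64 -> 32)
    g = H2.mean(axis=1)                             # global mean pooling -> [B, 32]
    Wc1 = rng.standard_normal((32, 64)) * 0.1       # classifier.0
    Wc2 = rng.standard_normal((64, 5)) * 0.1        # classifier.3
    return np.maximum(g @ Wc1, 0) @ Wc2             # logits [B, 5]

logits = forward(np.random.rand(2, 1536).astype(np.float32))
print(logits.shape)  # (2, 5)
```

In the real model, `gc1.bias` and `gc2.bias` are added after each MatMul (see the weight-tensor table below).
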
### Weight Tensors

| Layer | Shape | Parameters |
|---|---|---|
| `gc1.weight` (via MatMul) | `16 × 64` | 1,024 |
| `gc1.bias` | `64` | 64 |
| `gc2.weight` (via MatMul) | `64 × 32` | 2,048 |
| `gc2.bias` | `32` | 32 |
| `classifier.0.weight` | `32 × 64` | 2,048 |
| `classifier.0.bias` | `64` | 64 |
| `classifier.3.weight` | `64 × 5` | 320 |
| `classifier.3.bias` | `5` | 5 |
| **Total** | | **~6,637** |

## Input Format

The model expects a **flattened** float32 vector of size `1536` per graph:

| Segment | Shape | Description |
|---|---|---|
| Node features | `32 × 16 = 512` | 16-dim feature vector per node (see below) |
| Adjacency matrix | `32 × 32 = 1024` | Binary adjacency encoding causal edges |
| **Total** | **1536** | Concatenated and flattened |

### Node Feature Schema (16 dimensions)

| Index | Feature | Encoding |
|---|---|---|
| 0–8 | Entity type | One-hot: IP(0), Host(1), User(2), Process(3), File(4), Domain(5), Registry(6), URL(7), Service(8) |
| 9 | Out-degree | `min(out_degree / 32, 1.0)` |
| 10 | In-degree | `min(in_degree / 32, 1.0)` |
| 11–13 | Reserved | `0.0` |
| 14 | Total degree | `min((in + out) / 64, 1.0)` |
| 15 | Is center node | `1.0` if BFS root, else `0.0` |

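As an illustration, a hypothetical `encode_node` helper (the name is ours, not from Graph Hunter) that fills one 16-dim slot according to this schema:

```python
import numpy as np

ENTITY_TYPES = ["IP", "Host", "User", "Process", "File",
                "Domain", "Registry", "URL", "Service"]

def encode_node(entity_type, out_deg, in_deg, is_center=False):
    """Encode one node into the 16-dim feature vector from the schema above."""
    f = np.zeros(16, dtype=np.float32)
    f[ENTITY_TYPES.index(entity_type)] = 1.0    # dims 0-8: one-hot entity type
    f[9] = min(out_deg / 32, 1.0)               # dim 9: normalized out-degree
    f[10] = min(in_deg / 32, 1.0)               # dim 10: normalized in-degree
    f[14] = min((in_deg + out_deg) / 64, 1.0)   # dim 14: normalized total degree
    f[15] = 1.0 if is_center else 0.0           # dim 15: BFS-root flag
    return f                                    # dims 11-13 stay reserved (0.0)

vec = encode_node("Process", out_deg=8, in_deg=2, is_center=True)
```
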
### Subgraph Extraction

Subgraphs are extracted as **2-hop BFS neighborhoods** from a center node, capped at `K_MAX=32` nodes. This matches the extraction logic in `gnn_bridge.rs` from Graph Hunter.

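A minimal sketch of that extraction in Python (assuming a plain dict-of-neighbors graph; the actual implementation is the Rust code in `gnn_bridge.rs`):

```python
from collections import deque

K_MAX = 32

def extract_subgraph(adj, center, hops=2, k_max=K_MAX):
    """2-hop BFS from `center`, capped at k_max nodes; returns nodes in visit order."""
    seen = {center}
    order = [center]
    queue = deque([(center, 0)])
    while queue and len(order) < k_max:
        node, depth = queue.popleft()
        if depth == hops:                # do not expand beyond the hop limit
            continue
        for nbr in adj.get(node, []):
            if nbr not in seen:
                seen.add(nbr)
                order.append(nbr)
                queue.append((nbr, depth + 1))
                if len(order) == k_max:  # hard cap at K_MAX nodes
                    break
    return order

adj = {0: [1, 2], 1: [3], 2: [4], 3: [5]}  # node 5 is 3 hops from node 0
print(extract_subgraph(adj, 0))  # [0, 1, 2, 3, 4] -- node 5 excluded (3 hops away)
```
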
## Output Classes

| Index | Class | Description |
|---|---|---|
| 0 | **Benign** | Normal system activity |
| 1 | **Exfiltration** | Data theft / unauthorized data transfer |
| 2 | **C2Beacon** | Command & Control communication (process → IP patterns) |
| 3 | **LateralMovement** | Lateral movement (process spawning chains) |
| 4 | **PrivilegeEscalation** | Privilege escalation (auth-related anomalies) |

Apply `softmax` to the logits to obtain probabilities. Threat score = `1.0 - P(Benign)`.

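For instance, with a numerically stable softmax:

```python
import numpy as np

def threat_score(logits):
    """Softmax over the 5 class logits; threat score is 1 - P(Benign)."""
    z = logits - logits.max()         # subtract max for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return 1.0 - p[0]                 # index 0 is Benign

score = threat_score(np.array([4.0, 0.0, 0.0, 0.0, 0.0]))  # confident benign -> low score
```
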
## Training Details

### Dataset: StreamSpot

The model was trained on [**StreamSpot**](https://github.com/sbustreamspot/sbustreamspot-data) (Manzoor et al., 2016), a benchmark dataset of OS-level provenance graphs:

| Property | Value |
|---|---|
| Total graphs | 600 |
| Benign graphs | 500 (IDs 0–299, 400–599) |
| Attack graphs | 100 (IDs 300–399, drive-by-download scenario) |
| Raw edges | 89,770,902 |
| Parsed nodes | 5,046,878 |
| Parsed edges | 7,638,242 |
| Node types | Process, File, IP |
| Malicious nodes | 889,080 |

### Subgraph Sampling

From the full graph, **1,000,000 subgraphs** were extracted (balanced: 500K malicious + 500K benign) using 2-hop BFS with `K_MAX=32`. Threat labels were assigned via heuristics on the attack subgraph structure (network fan-out → C2Beacon, process spawning → LateralMovement, auth edges → PrivilegeEscalation, default malicious → Exfiltration).

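Schematically, the labeling heuristic behaves like the following sketch; the thresholds and field names here are illustrative assumptions, not the actual sampling code:

```python
def heuristic_label(stats):
    """Assign a threat subcategory to a malicious subgraph (illustrative thresholds)."""
    if stats["ip_fanout"] >= 4:           # many process -> IP edges (network fan-out)
        return "C2Beacon"
    if stats["spawn_chain_len"] >= 3:     # long process spawning chains
        return "LateralMovement"
    if stats["has_auth_edges"]:           # auth-related activity
        return "PrivilegeEscalation"
    return "Exfiltration"                 # default label for malicious subgraphs

label = heuristic_label({"ip_fanout": 6, "spawn_chain_len": 1, "has_auth_edges": False})
# -> "C2Beacon"
```
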
### Training Configuration

| Hyperparameter | Value |
|---|---|
| Train / Val / Test split | 68% / 12% / 20% (stratified) |
| Batch size | 64 |
| Optimizer | Adam (lr=0.001, weight_decay=1e-4) |
| Scheduler | ReduceLROnPlateau (factor=0.5, patience=5) |
| Loss | CrossEntropyLoss with inverse-frequency class weights |
| Max epochs | 100 |
| Early stopping | Patience = 10 (on validation loss) |
| Dropout | 0.3 (classifier only) |
| Hardware | NVIDIA A100-SXM4-80GB |
| Export | `torch.onnx.export` (opset 18, dynamic batch axis) |

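The inverse-frequency class weights used by the loss can be derived from label counts; a NumPy sketch (in training these would be passed as `CrossEntropyLoss(weight=...)`):

```python
import numpy as np

def class_weights(labels, num_classes=5):
    """Inverse-frequency weights: rare classes get proportionally larger weight."""
    counts = np.bincount(labels, minlength=num_classes).astype(np.float64)
    return len(labels) / (num_classes * counts)  # assumes every class appears

w = class_weights(np.array([0, 0, 0, 0, 1, 2, 3, 4]))  # class 0 is 4x more frequent
# -> [0.4, 1.6, 1.6, 1.6, 1.6]
```
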
## Usage

### Python (ONNX Runtime)

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("provenance_gcn.onnx")

# Build input: 32 nodes × 16 features + 32×32 adjacency, flattened
graph_input = np.zeros((1, 1536), dtype=np.float32)

# Example: center node is a Process (index 3) with high out-degree
graph_input[0, 0 * 16 + 3] = 1.0   # node 0 = Process
graph_input[0, 0 * 16 + 9] = 0.25  # out-degree
graph_input[0, 0 * 16 + 15] = 1.0  # center node flag

# Add neighbor IPs and edges in the adjacency block (offset 512)
for i in range(1, 6):
    graph_input[0, i * 16 + 0] = 1.0        # node i = IP
    graph_input[0, 512 + 0 * 32 + i] = 1.0  # edge: node 0 -> node i

# Run inference
logits = session.run(None, {"input": graph_input})[0]
probs = np.exp(logits - logits.max(axis=1, keepdims=True))  # numerically stable softmax
probs /= probs.sum(axis=1, keepdims=True)
threat_score = 1.0 - probs[0, 0]  # 1 - P(Benign)

CLASSES = ["Benign", "Exfiltration", "C2Beacon", "LateralMovement", "PrivilegeEscalation"]
print(f"Prediction: {CLASSES[np.argmax(logits)]}, Threat Score: {threat_score:.3f}")
```

### Batch Inference

```python
batch = np.random.randn(256, 1536).astype(np.float32)
logits = session.run(None, {"input": batch})[0]
predictions = np.argmax(logits, axis=1)
```

### Rust (Graph Hunter / ort crate)

```rust
use ort::Session;
use ndarray::Array2;

let session = Session::builder()?.commit_from_file("provenance_gcn.onnx")?;
let input = Array2::<f32>::zeros((1, 1536));
let outputs = session.run(ort::inputs!["input" => input]?)?;
let logits = outputs[0].extract_tensor::<f32>()?;
```

The ONNX interface is identical to what `npu_scorer.rs` expects: 1536 floats in, 5 logits out. Drop-in compatible with Graph Hunter.

## Intended Use

- **Primary**: Real-time classification of OS provenance subgraphs for threat detection in SOC/EDR pipelines, as the scoring backend for Graph Hunter.
- **Secondary**: Research on graph-based threat detection, GNN benchmarking for cybersecurity, and edge/NPU deployment scenarios (the model fits comfortably on NPUs such as the Hailo-8L).
- **Out of scope**: This model is not a standalone security product. It should be integrated as one signal within a broader detection pipeline.

## Limitations

- **Fixed graph size**: 32 nodes max; larger provenance graphs must be windowed via k-hop BFS before inference.
- **Heuristic labels**: Threat subcategories (Exfiltration, C2, LateralMovement, PrivEsc) are assigned via structural heuristics on StreamSpot attack graphs, not from ground-truth APT campaign labels.
- **Single attack type**: StreamSpot contains only one attack scenario (drive-by-download). Generalization to other APT campaigns (APT29, APT3, etc.) requires retraining on datasets such as DARPA TC E3 CADETS/Theia.
- **Lightweight by design**: 6.6K parameters enable edge deployment but limit capacity for complex multi-stage attack patterns.
- **No adversarial evaluation**: Not tested against evasion techniques targeting GNN-based detectors.

## References

- Kipf, T. N. & Welling, M. (2017). *Semi-Supervised Classification with Graph Convolutional Networks*. ICLR 2017.
- Manzoor, E., Milajerdi, S. M. & Akoglu, L. (2016). *Fast Memory-efficient Anomaly Detection in Streaming Heterogeneous Graphs*. KDD 2016. ([StreamSpot dataset](https://github.com/sbustreamspot/sbustreamspot-data))
- DARPA Transparent Computing Program: [TC Engagement 3 datasets](https://github.com/darpa-i2o/Transparent-Computing)

## Citation

```bibtex
@misc{provenance-gcn-2025,
  title  = {Provenance-GCN: Lightweight Graph Convolutional Network for APT Detection on Provenance Graphs},
  author = {BASE4 Security R\&D+i},
  year   = {2025},
  note   = {Trained on StreamSpot, exported as ONNX for Graph Hunter}
}
```

## Contact

- **Organization**: [BASE4 Security](https://base4sec.com/), R&D+i Division
- **Location**: Buenos Aires, Argentina