Provenance-GCN: Graph Convolutional Network for APT Detection on Provenance Graphs
Model Description
Provenance-GCN is a lightweight Graph Convolutional Network (GCN) that classifies OS-level provenance graphs as benign or as belonging to a specific threat category. It operates on fixed-size provenance subgraphs (32 nodes) extracted via 2-hop BFS and outputs logits over 5 classes: Benign plus four threat categories.
The model implements real graph convolution (message passing) following Kipf & Welling (2017), not a simple MLP. The adjacency matrix and node features are passed as a single flattened tensor for ONNX compatibility, and internally reconstructed for graph-aware computation.
Built as the scoring backend for Graph Hunter, a Rust/Tauri threat hunting tool, and compatible with its npu_scorer.rs inference module.
| Property | Value |
|---|---|
| Architecture | 2-layer GCN + MLP classifier |
| Input | Flattened provenance subgraph (batch_size × 1536) |
| Output | 5-class logits (batch_size × 5) |
| Parameters | 6,637 (~6.6K) |
| File size | 46.8 KB |
| Format | ONNX (opset 18, IR v10) |
| Producer | PyTorch 2.10.0+cu128 |
| Training hardware | NVIDIA A100-SXM4-80GB (Google Colab) |
Architecture
Input [batch, 1536]
│
├── Reshape → Node features X [batch, 32, 16]
├── Reshape → Adjacency matrix A [batch, 32, 32]
│
├── Ã = A + I (add self-loops)
├── A_norm = D⁻¹ Ã (row-normalize)
│
▼
GCN Layer 1: H¹ = ReLU(A_norm · X · W₁) (16 → 64)
│
▼
GCN Layer 2: H² = ReLU(A_norm · H¹ · W₂) (64 → 32)
│
▼
Global Mean Pooling over 32 nodes (32×32 → 32)
│
▼
Classifier:
Linear(32, 64) → ReLU → Dropout(0.3) → Linear(64, 5)
│
▼
Output [batch, 5] (logits per class)
Graph Convolution
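Each GC layer performs the operation:

$$H^{l} = \mathrm{ReLU}\left(\tilde{D}^{-1}\tilde{A}\,H^{l-1}\,W_{l} + b_{l}\right), \qquad \tilde{A} = A + I, \quad l = 1, 2$$

Here $\tilde{D}$ is the diagonal degree matrix of $\tilde{A}$, so $\tilde{D}^{-1}\tilde{A}$ is the row-normalized adjacency (A_norm), $H^{0} = X$ is the node feature matrix, and $W_{l}$, $b_{l}$ are the layer weight and bias listed under Weight Tensors. The point at which the bias is added is not stated in this card; it is written here after aggregation.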
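The layer operation (add self-loops, row-normalize, aggregate neighbors, project, apply ReLU) can be sketched in plain Python. This is a minimal illustration using nested lists, not the exported model's implementation; the helper names are hypothetical, and adding the bias after aggregation is an assumption.

```python
def add_self_loops(adj):
    """Return A + I for a square adjacency matrix given as nested lists."""
    n = len(adj)
    return [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
            for i in range(n)]


def row_normalize(mat):
    """Divide each row by its sum (D^-1 M); all-zero rows are left unchanged."""
    out = []
    for row in mat:
        total = sum(row)
        out.append([v / total for v in row] if total > 0 else list(row))
    return out


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def gcn_layer(adj, feats, weight, bias):
    """One layer: ReLU(A_norm @ feats @ weight + bias) (bias placement assumed)."""
    a_norm = row_normalize(add_self_loops(adj))
    z = matmul(matmul(a_norm, feats), weight)
    return [[max(0.0, v + b) for v, b in zip(row, bias)] for row in z]


# Toy example: 2 nodes, one edge 0 -> 1, 1-dim features, identity weight.
adj = [[0.0, 1.0], [0.0, 0.0]]
feats = [[2.0], [4.0]]
hidden = gcn_layer(adj, feats, weight=[[1.0]], bias=[0.0])
print(hidden)
```

In the toy graph, node 0 averages its own feature with that of node 1, while node 1 (no outgoing edges) keeps only its self-loop contribution.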
Weight Tensors
| Layer | Shape | Parameters |
|---|---|---|
| gc1.weight (via MatMul) | 16 × 64 | 1,024 |
| gc1.bias | 64 | 64 |
| gc2.weight (via MatMul) | 64 × 32 | 2,048 |
| gc2.bias | 32 | 32 |
| classifier.0.weight | 32 × 64 | 2,048 |
| classifier.0.bias | 64 | 64 |
| classifier.3.weight | 64 × 5 | 320 |
| classifier.3.bias | 5 | 5 |
| Total | | ~6,637 |
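As a sanity check on the table, the element counts can be recomputed from the listed shapes. The eight listed tensors come to 5,605 elements, which is 1,032 fewer than the reported total of 6,637. This card does not say what accounts for the difference; one possibility, stated here only as an assumption, is that the reported figure also counts constants stored in the ONNX file (for example the identity matrix used for self-loops).

```python
from math import prod

# Shapes copied from the weight table above.
LISTED_SHAPES = {
    "gc1.weight": (16, 64),
    "gc1.bias": (64,),
    "gc2.weight": (64, 32),
    "gc2.bias": (32,),
    "classifier.0.weight": (32, 64),
    "classifier.0.bias": (64,),
    "classifier.3.weight": (64, 5),
    "classifier.3.bias": (5,),
}

REPORTED_TOTAL = 6637  # value stated in the model card

listed_total = sum(prod(shape) for shape in LISTED_SHAPES.values())
unexplained = REPORTED_TOTAL - listed_total

print(f"listed tensors: {listed_total}, reported: {REPORTED_TOTAL}, "
      f"difference: {unexplained}")
```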
Input Format
The model expects a flattened float32 vector of size 1536 per graph:
| Segment | Shape | Description |
|---|---|---|
| Node features | 32 × 16 = 512 | 16-dim feature vector per node (see below) |
| Adjacency matrix | 32 × 32 = 1024 | Binary adjacency encoding causal edges |
| Total | 1536 | Concatenated and flattened |
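The layout above can be illustrated with a pair of helpers that pack a feature matrix and an adjacency matrix into the 1536-element vector and split it back. This is a sketch with hypothetical function names; the row-major ordering (features first, adjacency at offset 512) follows the indexing used in the Python usage example of this card.

```python
NUM_NODES = 32
NUM_FEATURES = 16
FEATURE_LEN = NUM_NODES * NUM_FEATURES            # 512
INPUT_LEN = FEATURE_LEN + NUM_NODES * NUM_NODES   # 1536


def pack_graph(features, adjacency):
    """Flatten features (32x16) then adjacency (32x32), both row-major."""
    flat = [float(v) for row in features for v in row]
    flat += [float(v) for row in adjacency for v in row]
    if len(flat) != INPUT_LEN:
        raise ValueError(f"expected {INPUT_LEN} values, got {len(flat)}")
    return flat


def unpack_graph(flat):
    """Inverse of pack_graph: return (features, adjacency) as nested lists."""
    if len(flat) != INPUT_LEN:
        raise ValueError(f"expected {INPUT_LEN} values, got {len(flat)}")
    features = [flat[i * NUM_FEATURES:(i + 1) * NUM_FEATURES]
                for i in range(NUM_NODES)]
    adjacency = [flat[FEATURE_LEN + i * NUM_NODES:FEATURE_LEN + (i + 1) * NUM_NODES]
                 for i in range(NUM_NODES)]
    return features, adjacency


# Example: node 0 is a Process (feature index 3) with an edge to node 5.
features = [[0.0] * NUM_FEATURES for _ in range(NUM_NODES)]
adjacency = [[0.0] * NUM_NODES for _ in range(NUM_NODES)]
features[0][3] = 1.0
adjacency[0][5] = 1.0
flat = pack_graph(features, adjacency)
```

Subgraphs with fewer than 32 nodes would leave the unused rows and columns at zero, as the usage example does.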
Node Feature Schema (16 dimensions)
| Index | Feature | Encoding |
|---|---|---|
| 0–8 | Entity type | One-hot: IP(0), Host(1), User(2), Process(3), File(4), Domain(5), Registry(6), URL(7), Service(8) |
| 9 | Out-degree | min(out_degree / 32, 1.0) |
| 10 | In-degree | min(in_degree / 32, 1.0) |
| 11–13 | Reserved | 0.0 |
| 14 | Total degree | min((in+out) / 64, 1.0) |
| 15 | Is center node | 1.0 if BFS root, else 0.0 |
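The schema can be expressed as a small encoder. The function name and signature below are hypothetical; the index assignments and clamping formulas are taken directly from the table.

```python
ENTITY_TYPES = ["IP", "Host", "User", "Process", "File",
                "Domain", "Registry", "URL", "Service"]


def encode_node(entity_type, in_degree, out_degree, is_center=False):
    """Build the 16-dim feature vector for one node."""
    vec = [0.0] * 16
    vec[ENTITY_TYPES.index(entity_type)] = 1.0          # indices 0-8: one-hot type
    vec[9] = min(out_degree / 32, 1.0)                  # clamped out-degree
    vec[10] = min(in_degree / 32, 1.0)                  # clamped in-degree
    # indices 11-13 are reserved and stay 0.0
    vec[14] = min((in_degree + out_degree) / 64, 1.0)   # clamped total degree
    vec[15] = 1.0 if is_center else 0.0                 # BFS root flag
    return vec


# A center Process node with 8 outgoing and no incoming edges.
center = encode_node("Process", in_degree=0, out_degree=8, is_center=True)
print(center)
```

An unknown entity type raises `ValueError` from `list.index`, which seems preferable to silently emitting an all-zero one-hot block.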
Subgraph Extraction
Subgraphs are extracted as 2-hop BFS neighborhoods from a center node, capped at K_MAX=32 nodes. This matches the extraction logic in gnn_bridge.rs from Graph Hunter.
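A capped 2-hop BFS of this kind might look like the sketch below. It is not a port of gnn_bridge.rs: the function name, the adjacency-dict input, and the choice to follow edges only in the direction given by `neighbors` are all assumptions made for illustration.

```python
from collections import deque

K_MAX = 32
MAX_HOPS = 2


def extract_subgraph(neighbors, center, k_max=K_MAX, max_hops=MAX_HOPS):
    """Return up to k_max node ids within max_hops of center, in BFS order.

    `neighbors` maps a node id to an iterable of adjacent node ids.
    The center is always first, so it ends up as node 0 of the subgraph.
    """
    order = [center]
    depth = {center: 0}
    queue = deque([center])
    while queue and len(order) < k_max:
        node = queue.popleft()
        if depth[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in neighbors.get(node, ()):
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                order.append(nxt)
                queue.append(nxt)
                if len(order) == k_max:
                    break  # node cap reached
    return order


# "d" is three hops from "p" and is therefore excluded.
graph = {"p": ["a", "b"], "a": ["c"], "c": ["d"]}
nodes = extract_subgraph(graph, "p")
print(nodes)
```

The position of each id in the returned list can then serve as the row index for the node feature and adjacency blocks.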
Output Classes
| Index | Class | Description |
|---|---|---|
| 0 | Benign | Normal system activity |
| 1 | Exfiltration | Data theft / unauthorized data transfer |
| 2 | C2Beacon | Command & Control communication (process → IP patterns) |
| 3 | LateralMovement | Lateral movement (process spawning chains) |
| 4 | PrivilegeEscalation | Privilege escalation (auth-related anomalies) |
Apply softmax to logits for probabilities. Threat score = 1.0 - P(Benign).
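The post-processing step can be written without any dependencies. The sketch below uses the max-subtraction form of softmax, which avoids overflow for large logits; the function names are illustrative, not part of Graph Hunter.

```python
import math

CLASSES = ["Benign", "Exfiltration", "C2Beacon",
           "LateralMovement", "PrivilegeEscalation"]


def softmax(logits):
    """Numerically stable softmax over a flat list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def threat_score(logits):
    """Threat score = 1 - P(Benign), where Benign is class index 0."""
    return 1.0 - softmax(logits)[0]


# With all logits equal, every class gets probability 1/5.
uniform_score = threat_score([0.0, 0.0, 0.0, 0.0, 0.0])

# A very large Benign logit would overflow a naive exp() but is fine here.
benign_score = threat_score([1000.0, 0.0, 0.0, 0.0, 0.0])
```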
Training Details
Dataset: StreamSpot
The model was trained on StreamSpot (Manzoor et al., 2016), a benchmark dataset of OS-level provenance graphs:
| Property | Value |
|---|---|
| Total graphs | 600 |
| Benign graphs | 500 (IDs 0–299, 400–599) |
| Attack graphs | 100 (IDs 300–399, drive-by-download scenario) |
| Raw edges | 89,770,902 |
| Parsed nodes | 5,046,878 |
| Parsed edges | 7,638,242 |
| Node types | Process, File, IP |
| Malicious nodes | 889,080 |
Subgraph Sampling
From the full graph, 1,000,000 subgraphs were extracted (balanced 500K malicious + 500K benign) using 2-hop BFS with K_MAX=32. Threat labels were assigned via heuristics on the attack subgraph structure (network fan-out → C2Beacon, process spawning → LateralMovement, auth edges → PrivilegeEscalation, default malicious → Exfiltration).
Training Configuration
| Hyperparameter | Value |
|---|---|
| Train / Val / Test split | 68% / 12% / 20% (stratified) |
| Batch size | 64 |
| Optimizer | Adam (lr=0.001, weight_decay=1e-4) |
| Scheduler | ReduceLROnPlateau (factor=0.5, patience=5) |
| Loss | CrossEntropyLoss with inverse-frequency class weights |
| Max epochs | 100 |
| Early stopping | Patience = 10 (on validation loss) |
| Dropout | 0.3 (classifier only) |
| Hardware | NVIDIA A100-SXM4-80GB |
| Export | torch.onnx.export (opset 18, dynamic batch axis) |
Usage
Python (ONNX Runtime)
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("provenance_gcn.onnx")

# Build input: 32 nodes × 16 features + 32×32 adjacency, flattened
graph_input = np.zeros((1, 1536), dtype=np.float32)

# Example: center node is a Process (index 3) with high out-degree
graph_input[0, 0 * 16 + 3] = 1.0   # node 0 = Process
graph_input[0, 0 * 16 + 9] = 0.25  # out-degree
graph_input[0, 0 * 16 + 15] = 1.0  # center node flag

# Add neighbor IPs and edges in adjacency block (offset 512)
for i in range(1, 6):
    graph_input[0, i * 16 + 0] = 1.0        # node i = IP
    graph_input[0, 512 + 0 * 32 + i] = 1.0  # edge: node 0 → node i

# Run inference
logits = session.run(None, {"input": graph_input})[0]

# Numerically stable softmax: subtract the row max before exponentiating
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
threat_score = 1.0 - probs[0, 0]  # 1 - P(Benign)

CLASSES = ["Benign", "Exfiltration", "C2Beacon", "LateralMovement", "PrivilegeEscalation"]
print(f"Prediction: {CLASSES[np.argmax(logits[0])]}, Threat Score: {threat_score:.3f}")
Batch Inference
batch = np.random.randn(256, 1536).astype(np.float32)
logits = session.run(None, {"input": batch})[0]
predictions = np.argmax(logits, axis=1)
Rust (Graph Hunter / ort crate)
use ort::{Session, Value};
use ndarray::Array2;
let session = Session::builder()?.commit_from_file("provenance_gcn.onnx")?;
let input = Array2::<f32>::zeros((1, 1536));
let outputs = session.run(ort::inputs!["input" => input]?)?;
let logits = outputs[0].extract_tensor::<f32>()?;
The ONNX interface is identical to what npu_scorer.rs expects: 1536 floats in, 5 logits out. Drop-in compatible with Graph Hunter.
Intended Use
- Primary: Real-time classification of OS provenance subgraphs for threat detection in SOC/EDR pipelines, as the scoring backend for Graph Hunter.
- Secondary: Research on graph-based threat detection, GNN benchmarking for cybersecurity, edge/NPU deployment scenarios (model fits comfortably on NPUs like Hailo-8L).
- Out of scope: This model is not a standalone security product. It should be integrated as one signal within a broader detection pipeline.
Limitations
- Fixed graph size: 32 nodes max; larger provenance graphs must be windowed via k-hop BFS before inference.
- Heuristic labels: Threat subcategories (Exfiltration, C2, LateralMovement, PrivEsc) are assigned via structural heuristics on StreamSpot attack graphs, not from ground-truth APT campaign labels.
- Single attack type: StreamSpot only contains one attack scenario (drive-by-download). Generalization to other APT campaigns (APT29, APT3, etc.) requires retraining on datasets like DARPA TC E3 CADETS/Theia.
- Lightweight by design: 6.6K parameters enables edge deployment but limits capacity for complex multi-stage attack patterns.
- No adversarial evaluation: Not tested against evasion techniques targeting GNN-based detectors.
References
- Kipf, T.N. & Welling, M. (2017). Semi-Supervised Classification with Graph Convolutional Networks. ICLR 2017.
- Manzoor, E., Milajerdi, S.M. & Akoglu, L. (2016). Fast Memory-efficient Anomaly Detection in Streaming Heterogeneous Graphs. KDD 2016. (StreamSpot dataset)
- DARPA Transparent Computing Program: TC Engagement 3 datasets
Citation
@misc{provenance-gcn-2025,
title = {Provenance-GCN: Lightweight Graph Convolutional Network for APT Detection on Provenance Graphs},
author = {BASE4 Security R\&D+i},
year = {2026},
note = {Trained on StreamSpot, exported as ONNX for Graph Hunter}
}
Contact
- Organization: BASE4 Security, R&D+i Division
- Location: Buenos Aires, Argentina