---
license: mit
library_name: onnxruntime
tags:
- onnx
- gcn
- graph-neural-network
- cybersecurity
- apt-detection
- provenance-graph
- threat-detection
- graph-classification
- streamspot
pipeline_tag: graph-ml
model-index:
- name: provenance-gcn
results: []
language:
- en
datasets:
- custom
metrics:
- accuracy
- precision
- recall
- f1
---
# Provenance-GCN: Graph Convolutional Network for APT Detection on Provenance Graphs
<p align="center">
<img src="https://img.shields.io/badge/format-ONNX-blue?logo=onnx" alt="ONNX">
<img src="https://img.shields.io/badge/framework-PyTorch%202.10-ee4c2c?logo=pytorch" alt="PyTorch">
<img src="https://img.shields.io/badge/task-APT%20Detection-critical" alt="APT Detection">
<img src="https://img.shields.io/badge/params-6.6K-green" alt="Parameters">
<img src="https://img.shields.io/badge/dataset-StreamSpot-blueviolet" alt="StreamSpot">
<img src="https://img.shields.io/badge/license-MIT-brightgreen" alt="MIT License">
</p>
## Model Description
**Provenance-GCN** is a lightweight Graph Convolutional Network (GCN) designed to classify OS-level provenance graphs as benign or belonging to specific threat categories. It operates on fixed-size provenance subgraphs (32 nodes) extracted via 2-hop BFS and outputs a classification across 5 threat classes.
The model implements real graph convolution (message passing) in the style of Kipf & Welling (2017), not a simple MLP. It uses row normalization of the self-loop adjacency (D⁻¹Â) rather than the symmetric normalization of the original paper. The adjacency matrix and node features are passed as a single flattened tensor for ONNX compatibility and reconstructed internally for graph-aware computation.
Built as the scoring backend for [**Graph Hunter**](https://github.com/base4security/graph-hunter), a Rust/Tauri threat hunting tool, and compatible with its `npu_scorer.rs` inference module.
| Property | Value |
|---|---|
| **Architecture** | 2-layer GCN + MLP classifier |
| **Input** | Flattened provenance subgraph (`batch_size × 1536`) |
| **Output** | 5-class logits (`batch_size × 5`) |
| **Parameters** | 6,637 (~6.6K) |
| **File size** | 46.8 KB |
| **Format** | ONNX (opset 18, IR v10) |
| **Producer** | PyTorch 2.10.0+cu128 |
| **Training hardware** | NVIDIA A100-SXM4-80GB (Google Colab) |
## Architecture
```
Input [batch, 1536]
  │
  ├── Reshape → Node features X      [batch, 32, 16]
  ├── Reshape → Adjacency matrix A   [batch, 32, 32]
  │
  ├── Â = A + I          (add self-loops)
  ├── A_norm = D⁻¹ Â     (row-normalize)
  │
  ▼
GCN Layer 1: H¹ = ReLU(A_norm · X · W₁)     (16 → 64)
  │
  ▼
GCN Layer 2: H² = ReLU(A_norm · H¹ · W₂)    (64 → 32)
  │
  ▼
Global Mean Pooling over 32 nodes           (32×32 → 32)
  │
  ▼
Classifier:
  Linear(32, 64) → ReLU → Dropout(0.3) → Linear(64, 5)
  │
  ▼
Output [batch, 5]  (logits per class)
```
### Graph Convolution
Each GC layer performs the operation:
$$H^{(l+1)} = \text{ReLU}\!\left(\hat{D}^{-1}\hat{A}\, H^{(l)}\, W^{(l)}\right)$$
$$\text{where }\hat{A} = A + I \text{ (adjacency with self-loops) }, \hat{D} \text{ is the degree matrix of }\hat{A}, H^{(l)}\text{ is the node feature matrix at layer l, and }W^{(l)}\text{ is the learnable weight matrix.}$$
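The layer equation and the architecture diagram can be traced in a few lines of NumPy. This is a minimal sketch, not the exported graph: the weights are random stand-ins rather than the trained parameters, the helper names (`split_input`, `row_normalize`, `gcn_layer`) are hypothetical, and adding the bias inside the ReLU is an assumption, since the equation above omits the bias terms listed in the weight table.

```python
import numpy as np

N_NODES, N_FEATS = 32, 16  # sizes from the Input Format section


def split_input(flat):
    """Split [batch, 1536] into X [batch, 32, 16] and A [batch, 32, 32]."""
    x = flat[:, : N_NODES * N_FEATS].reshape(-1, N_NODES, N_FEATS)
    a = flat[:, N_NODES * N_FEATS:].reshape(-1, N_NODES, N_NODES)
    return x, a


def row_normalize(a):
    """Compute D^-1 (A + I): add self-loops, then divide each row by its sum."""
    a_hat = a + np.eye(N_NODES, dtype=a.dtype)
    degree = a_hat.sum(axis=-1, keepdims=True)  # always >= 1 due to self-loops
    return a_hat / degree


def gcn_layer(a_norm, h, w, b):
    """One layer: ReLU(A_norm @ H @ W + b). Bias placement is an assumption."""
    return np.maximum(a_norm @ h @ w + b, 0.0)


# Random stand-in weights (hypothetical, NOT the trained parameters).
rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(16, 64)).astype(np.float32), np.zeros(64, np.float32)
w2, b2 = rng.normal(size=(64, 32)).astype(np.float32), np.zeros(32, np.float32)

flat = np.zeros((2, 1536), dtype=np.float32)  # two empty graphs
x, a = split_input(flat)
a_norm = row_normalize(a)
h2 = gcn_layer(a_norm, gcn_layer(a_norm, x, w1, b1), w2, b2)
pooled = h2.mean(axis=1)  # global mean pooling over the 32 nodes
print(pooled.shape)  # (2, 32)
```

Because every row of `a_norm` sums to 1, each layer averages a node's features with those of its out-neighbors before the linear map, which is what distinguishes it from a per-node MLP.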
### Weight Tensors
| Layer | Shape | Parameters |
|---|---|---|
| `gc1.weight` (via MatMul) | `16 × 64` | 1,024 |
| `gc1.bias` | `64` | 64 |
| `gc2.weight` (via MatMul) | `64 × 32` | 2,048 |
| `gc2.bias` | `32` | 32 |
| `classifier.0.weight` | `32 × 64` | 2,048 |
| `classifier.0.bias` | `64` | 64 |
| `classifier.3.weight` | `64 × 5` | 320 |
| `classifier.3.bias` | `5` | 5 |
| **Total** | | **~6,637** |
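As a quick arithmetic check, the per-tensor counts can be recomputed from the shapes in the table. The sketch below only sums the rows listed here; they add up to 5,605, so the reported total of 6,637 presumably includes 1,032 further values that this table does not itemize (an inference from the numbers, not something the card states).

```python
from math import prod

# Shapes copied from the weight table above.
shapes = {
    "gc1.weight": (16, 64),
    "gc1.bias": (64,),
    "gc2.weight": (64, 32),
    "gc2.bias": (32,),
    "classifier.0.weight": (32, 64),
    "classifier.0.bias": (64,),
    "classifier.3.weight": (64, 5),
    "classifier.3.bias": (5,),
}

counts = {name: prod(shape) for name, shape in shapes.items()}
listed_total = sum(counts.values())
print(listed_total)  # 5605
```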
## Input Format
The model expects a **flattened** float32 vector of size `1536` per graph:
| Segment | Shape | Description |
|---|---|---|
| Node features | `32 × 16 = 512` | 16-dim feature vector per node (see below) |
| Adjacency matrix | `32 × 32 = 1024` | Binary adjacency encoding causal edges |
| **Total** | **1536** | Concatenated and flattened |
### Node Feature Schema (16 dimensions)
| Index | Feature | Encoding |
|---|---|---|
| 0–8 | Entity type | One-hot: IP(0), Host(1), User(2), Process(3), File(4), Domain(5), Registry(6), URL(7), Service(8) |
| 9 | Out-degree | `min(out_degree / 32, 1.0)` |
| 10 | In-degree | `min(in_degree / 32, 1.0)` |
| 11–13 | Reserved | `0.0` |
| 14 | Total degree | `min((in+out) / 64, 1.0)` |
| 15 | Is center node | `1.0` if BFS root, else `0.0` |
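The schema can be turned into an encoder that produces the flattened 1536-float input. This is a sketch under stated assumptions: `encode_subgraph` is a hypothetical helper, degrees are counted inside the extracted subgraph from the binary adjacency, unused node slots are left as zeros, and adjacency rows index the source node (as in the usage example, where `512 + 0 * 32 + i` encodes an edge from node 0 to node i). Graph Hunter's own encoder may differ on any of these points.

```python
import numpy as np

ENTITY_TYPES = ["IP", "Host", "User", "Process", "File",
                "Domain", "Registry", "URL", "Service"]
K_MAX, N_FEATS = 32, 16


def encode_subgraph(node_types, edges, center=0):
    """Build a [1, 1536] float32 input from node types and directed edges."""
    if len(node_types) > K_MAX:
        raise ValueError(f"subgraph has more than {K_MAX} nodes")
    feats = np.zeros((K_MAX, N_FEATS), dtype=np.float32)
    adj = np.zeros((K_MAX, K_MAX), dtype=np.float32)
    for src, dst in edges:
        adj[src, dst] = 1.0  # row = source, column = destination
    out_deg = adj.sum(axis=1)
    in_deg = adj.sum(axis=0)
    for i, entity in enumerate(node_types):
        feats[i, ENTITY_TYPES.index(entity)] = 1.0           # indices 0-8
        feats[i, 9] = min(out_deg[i] / 32, 1.0)              # out-degree
        feats[i, 10] = min(in_deg[i] / 32, 1.0)              # in-degree
        feats[i, 14] = min((in_deg[i] + out_deg[i]) / 64, 1.0)  # total degree
    feats[center, 15] = 1.0                                  # BFS root flag
    # Node features first, then adjacency, as in the Input Format table.
    return np.concatenate([feats.ravel(), adj.ravel()])[None, :]


# One Process (node 0) connecting to five IPs (nodes 1-5).
vec = encode_subgraph(["Process"] + ["IP"] * 5, [(0, i) for i in range(1, 6)])
print(vec.shape, vec.dtype)  # (1, 1536) float32
```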
### Subgraph Extraction
Subgraphs are extracted as **2-hop BFS neighborhoods** from a center node, capped at `K_MAX=32` nodes. This matches the extraction logic in `gnn_bridge.rs` from Graph Hunter.
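The extraction step can be sketched as a depth-limited breadth-first search. The function below is illustrative and is not a port of `gnn_bridge.rs`: the name `extract_subgraph`, the dictionary-based `neighbors` structure, and the choice to truncate in BFS order once the cap is reached are all assumptions. Whether edges are followed in one direction or both depends on how the caller builds `neighbors`.

```python
from collections import deque


def extract_subgraph(neighbors, center, max_hops=2, k_max=32):
    """Return up to k_max node IDs within max_hops of center, in BFS order."""
    order = [center]          # center is always slot 0
    depth = {center: 0}
    queue = deque([center])
    while queue and len(order) < k_max:
        node = queue.popleft()
        if depth[node] == max_hops:
            continue          # do not expand beyond the hop limit
        for nxt in neighbors.get(node, ()):
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                order.append(nxt)
                queue.append(nxt)
                if len(order) == k_max:
                    break     # cap reached: stop collecting nodes
    return order


# Hypothetical toy graph: f2 is three hops from p1 and is left out.
toy = {"p1": ["f1", "ip1"], "f1": ["p2"], "p2": ["f2"], "ip1": []}
print(extract_subgraph(toy, "p1"))  # ['p1', 'f1', 'ip1', 'p2']
```

Placing the center at position 0 keeps the result aligned with the usage example, where node 0 carries the center-node flag.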
## Output Classes
| Index | Class | Description |
|---|---|---|
| 0 | **Benign** | Normal system activity |
| 1 | **Exfiltration** | Data theft / unauthorized data transfer |
| 2 | **C2Beacon** | Command & Control communication (process → IP patterns) |
| 3 | **LateralMovement** | Lateral movement (process spawning chains) |
| 4 | **PrivilegeEscalation** | Privilege escalation (auth-related anomalies) |
Apply `softmax` to logits for probabilities. Threat score = `1.0 - P(Benign)`.
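For a whole batch, the conversion from logits to probabilities and threat scores can be written as a small helper. This is a sketch with made-up logits; `threat_scores` is a hypothetical name, and subtracting the row maximum before exponentiating is a standard stability measure rather than something the model requires.

```python
import numpy as np

CLASSES = ["Benign", "Exfiltration", "C2Beacon",
           "LateralMovement", "PrivilegeEscalation"]


def threat_scores(logits):
    """Return (probabilities, threat score) for logits of shape [batch, 5]."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # avoid overflow in exp
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=1, keepdims=True)
    return probs, 1.0 - probs[:, 0]  # threat score = 1 - P(Benign)


# Made-up logits: one clearly benign graph, one leaning towards C2Beacon.
logits = np.array([[4.0, 0.0, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 3.0, 0.0, 0.0]], dtype=np.float32)
probs, scores = threat_scores(logits)
for p, s in zip(probs, scores):
    print(CLASSES[int(p.argmax())], round(float(s), 3))
```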
## Training Details
### Dataset: StreamSpot
The model was trained on [**StreamSpot**](https://github.com/sbustreamspot/sbustreamspot-data) (Manzoor et al., 2016), a benchmark dataset of OS-level provenance graphs:
| Property | Value |
|---|---|
| Total graphs | 600 |
| Benign graphs | 500 (IDs 0–299, 400–599) |
| Attack graphs | 100 (IDs 300–399, drive-by-download scenario) |
| Raw edges | 89,770,902 |
| Parsed nodes | 5,046,878 |
| Parsed edges | 7,638,242 |
| Node types | Process, File, IP |
| Malicious nodes | 889,080 |
### Subgraph Sampling
From the full graph, **1,000,000 subgraphs** were extracted (balanced 500K malicious + 500K benign) using 2-hop BFS with `K_MAX=32`. Threat labels were assigned via heuristics on the attack subgraph structure (network fan-out → C2Beacon, process spawning → LateralMovement, auth edges → PrivilegeEscalation, default malicious → Exfiltration).
### Training Configuration
| Hyperparameter | Value |
|---|---|
| Train / Val / Test split | 68% / 12% / 20% (stratified) |
| Batch size | 64 |
| Optimizer | Adam (lr=0.001, weight_decay=1e-4) |
| Scheduler | ReduceLROnPlateau (factor=0.5, patience=5) |
| Loss | CrossEntropyLoss with inverse-frequency class weights |
| Max epochs | 100 |
| Early stopping | Patience = 10 (on validation loss) |
| Dropout | 0.3 (classifier only) |
| Hardware | NVIDIA A100-SXM4-80GB |
| Export | `torch.onnx.export` (opset 18, dynamic batch axis) |
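The card does not spell out how the inverse-frequency class weights were computed. One common form, shown here as an assumption, is `n_samples / (n_classes * count_per_class)`, the same scaling scikit-learn uses for `class_weight="balanced"`. The label counts below are invented for illustration and are not the actual training distribution.

```python
import numpy as np


def inverse_frequency_weights(labels, n_classes=5):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = np.bincount(labels, minlength=n_classes).astype(np.float64)
    counts = np.maximum(counts, 1.0)  # guard against division by zero
    return counts.sum() / (n_classes * counts)


# Invented label distribution over the 5 output classes (illustration only).
labels = np.array([0] * 500 + [1] * 300 + [2] * 100 + [3] * 60 + [4] * 40)
weights = inverse_frequency_weights(labels)
print(weights.round(3))
```

In a PyTorch training loop these values would typically be passed as the `weight` argument of `torch.nn.CrossEntropyLoss`, so that rarer classes contribute more to the loss.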
## Usage
### Python (ONNX Runtime)
```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("provenance_gcn.onnx")

# Build input: 32 nodes × 16 features + 32×32 adjacency, flattened
graph_input = np.zeros((1, 1536), dtype=np.float32)

# Example: center node is a Process (index 3) with high out-degree
graph_input[0, 0 * 16 + 3] = 1.0   # node 0 = Process
graph_input[0, 0 * 16 + 9] = 0.25  # out-degree
graph_input[0, 0 * 16 + 15] = 1.0  # center node flag

# Add neighbor IPs and edges in adjacency block (offset 512)
for i in range(1, 6):
    graph_input[0, i * 16 + 0] = 1.0        # node i = IP
    graph_input[0, 512 + 0 * 32 + i] = 1.0  # edge: node 0 → node i

# Run inference
logits = session.run(None, {"input": graph_input})[0]

# Softmax, shifted by the row maximum for numerical stability
exp = np.exp(logits - logits.max(axis=1, keepdims=True))
probs = exp / exp.sum(axis=1, keepdims=True)
threat_score = 1.0 - probs[0, 0]  # 1 - P(Benign)

CLASSES = ["Benign", "Exfiltration", "C2Beacon", "LateralMovement", "PrivilegeEscalation"]
print(f"Prediction: {CLASSES[int(np.argmax(logits[0]))]}, Threat Score: {threat_score:.3f}")
```
### Batch Inference
```python
# Random values only exercise the dynamic batch axis; real inputs follow the Input Format layout
batch = np.random.randn(256, 1536).astype(np.float32)
logits = session.run(None, {"input": batch})[0]
predictions = np.argmax(logits, axis=1)
```
### Rust (Graph Hunter / ort crate)
```rust
use ort::{Session, Value};
use ndarray::Array2;
let session = Session::builder()?.commit_from_file("provenance_gcn.onnx")?;
let input = Array2::<f32>::zeros((1, 1536));
let outputs = session.run(ort::inputs!["input" => input]?)?;
let logits = outputs[0].extract_tensor::<f32>()?;
```
The ONNX interface is identical to what `npu_scorer.rs` expects: 1536 floats in, 5 logits out. Drop-in compatible with Graph Hunter.
## Intended Use
- **Primary**: Real-time classification of OS provenance subgraphs for threat detection in SOC/EDR pipelines, as the scoring backend for Graph Hunter.
- **Secondary**: Research on graph-based threat detection, GNN benchmarking for cybersecurity, edge/NPU deployment scenarios (model fits comfortably on NPUs like Hailo-8L).
- **Out of scope**: This model is not a standalone security product. It should be integrated as one signal within a broader detection pipeline.
## Limitations
- **Fixed graph size**: 32 nodes max; larger provenance graphs must be windowed via k-hop BFS before inference.
- **Heuristic labels**: Threat subcategories (Exfiltration, C2, LateralMovement, PrivEsc) are assigned via structural heuristics on StreamSpot attack graphs, not from ground-truth APT campaign labels.
- **Single attack type**: StreamSpot only contains one attack scenario (drive-by-download). Generalization to other APT campaigns (APT29, APT3, etc.) requires retraining on datasets like DARPA TC E3 CADETS/Theia.
- **Lightweight by design**: 6.6K parameters enables edge deployment but limits capacity for complex multi-stage attack patterns.
- **No adversarial evaluation**: Not tested against evasion techniques targeting GNN-based detectors.
## References
- Kipf, T.N. & Welling, M. (2017). *Semi-Supervised Classification with Graph Convolutional Networks*. ICLR 2017.
- Manzoor, E., Milajerdi, S.M. & Akoglu, L. (2016). *Fast Memory-efficient Anomaly Detection in Streaming Heterogeneous Graphs*. KDD 2016. ([StreamSpot dataset](https://github.com/sbustreamspot/sbustreamspot-data))
- DARPA Transparent Computing Program: [TC Engagement 3 datasets](https://github.com/darpa-i2o/Transparent-Computing)
## Citation
```bibtex
@misc{provenance-gcn-2025,
title = {Provenance-GCN: Lightweight Graph Convolutional Network for APT Detection on Provenance Graphs},
author = {BASE4 Security R\&D+i},
year = {2026},
note = {Trained on StreamSpot, exported as ONNX for Graph Hunter}
}
```
## Contact
- **Organization**: [BASE4 Security](https://base4sec.com/), R&D+i Division
- **Location**: Buenos Aires, Argentina