---
license: mit
library_name: onnxruntime
tags:
  - onnx
  - gcn
  - graph-neural-network
  - cybersecurity
  - apt-detection
  - provenance-graph
  - threat-detection
  - graph-classification
  - streamspot
pipeline_tag: graph-ml
model-index:
  - name: provenance-gcn
    results: []
language:
  - en
datasets:
  - custom
metrics:
  - accuracy
  - precision
  - recall
  - f1
---

# Provenance-GCN: Graph Convolutional Network for APT Detection on Provenance Graphs

<p align="center">
  <img src="https://img.shields.io/badge/format-ONNX-blue?logo=onnx" alt="ONNX">
  <img src="https://img.shields.io/badge/framework-PyTorch%202.10-ee4c2c?logo=pytorch" alt="PyTorch">
  <img src="https://img.shields.io/badge/task-APT%20Detection-critical" alt="APT Detection">
  <img src="https://img.shields.io/badge/params-6.6K-green" alt="Parameters">
  <img src="https://img.shields.io/badge/dataset-StreamSpot-blueviolet" alt="StreamSpot">
  <img src="https://img.shields.io/badge/license-MIT-brightgreen" alt="MIT License">
</p>

## Model Description

**Provenance-GCN** is a lightweight Graph Convolutional Network (GCN) that classifies OS-level provenance graphs as benign or as belonging to a specific threat category. It operates on fixed-size provenance subgraphs (32 nodes) extracted via 2-hop BFS and outputs logits over 5 classes: Benign plus four threat categories.

The model implements actual graph convolution (message passing) in the style of Kipf & Welling (2017) rather than a plain MLP, using row normalization of the adjacency matrix instead of the symmetric normalization from the original paper. The adjacency matrix and node features are passed as a single flattened tensor for ONNX compatibility and are reconstructed internally for graph-aware computation.

Built as the scoring backend for [**Graph Hunter**](https://github.com/base4security/graph-hunter), a Rust/Tauri threat hunting tool, and compatible with its `npu_scorer.rs` inference module.

| Property | Value |
|---|---|
| **Architecture** | 2-layer GCN + MLP classifier |
| **Input** | Flattened provenance subgraph (`batch_size × 1536`) |
| **Output** | 5-class logits (`batch_size × 5`) |
| **Parameters** | 6,637 (~6.6K) |
| **File size** | 46.8 KB |
| **Format** | ONNX (opset 18, IR v10) |
| **Producer** | PyTorch 2.10.0+cu128 |
| **Training hardware** | NVIDIA A100-SXM4-80GB (Google Colab) |

## Architecture

```
Input [batch, 1536]
  │
  ├── Reshape → Node features X  [batch, 32, 16]
  ├── Reshape → Adjacency matrix A  [batch, 32, 32]
  │
  ├── Â = A + I  (add self-loops)
  ├── A_norm = D⁻¹ Â  (row-normalize)
  │
  ▼
GCN Layer 1:  H¹ = ReLU(A_norm · X · W₁)           (16 → 64)
  │
  ▼
GCN Layer 2:  H² = ReLU(A_norm · H¹ · W₂)          (64 → 32)
  │
  ▼
Global Mean Pooling over 32 nodes                    (32×32 → 32)
  │
  ▼
Classifier:
  Linear(32, 64) → ReLU → Dropout(0.3) → Linear(64, 5)
  │
  ▼
Output [batch, 5]   (logits per class)
```

### Graph Convolution

Each GC layer performs the operation:

$$H^{(l+1)} = \text{ReLU}\!\left(\hat{D}^{-1}\hat{A}\, H^{(l)}\, W^{(l)}\right)$$

$$\text{where } \hat{A} = A + I \text{ is the adjacency matrix with self-loops, } \hat{D} \text{ is the degree matrix of } \hat{A},\ H^{(l)} \text{ is the node feature matrix at layer } l, \text{ and } W^{(l)} \text{ is the learnable weight matrix.}$$
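
The layer equation above can be sketched in NumPy as follows. This is an illustrative reimplementation and not the exported graph: the function names and the random weights are placeholders, and the bias term is an assumption based on the `gc1.bias` entry in the weight table rather than on the formula itself.

```python
import numpy as np

def normalize_adjacency(adj: np.ndarray) -> np.ndarray:
    """Return D^-1 (A + I) for a batch of adjacency matrices [batch, n, n]."""
    n = adj.shape[-1]
    a_hat = adj + np.eye(n, dtype=adj.dtype)      # add self-loops
    degree = a_hat.sum(axis=-1, keepdims=True)    # row sums, always >= 1
    return a_hat / degree                         # each row now sums to 1

def gcn_layer(a_norm, h, weight, bias):
    """One graph convolution: ReLU(A_norm @ H @ W + b). Bias is assumed."""
    return np.maximum(a_norm @ h @ weight + bias, 0.0)

# Placeholder inputs with the shapes used by the first layer (16 -> 64)
rng = np.random.default_rng(0)
adj = (rng.random((2, 32, 32)) < 0.1).astype(np.float32)
x = rng.random((2, 32, 16), dtype=np.float32)
w1 = (rng.standard_normal((16, 64)) * 0.1).astype(np.float32)
b1 = np.zeros(64, dtype=np.float32)

a_norm = normalize_adjacency(adj)
h1 = gcn_layer(a_norm, x, w1, b1)
print(h1.shape)  # (2, 32, 64)
```

Because every row of the normalized adjacency sums to 1, each node's new representation is an average over itself and its neighbors before the linear map is applied.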

### Weight Tensors

| Layer | Shape | Parameters |
|---|---|---|
| `gc1.weight` (via MatMul) | `16 × 64` | 1,024 |
| `gc1.bias` | `64` | 64 |
| `gc2.weight` (via MatMul) | `64 × 32` | 2,048 |
| `gc2.bias` | `32` | 32 |
| `classifier.0.weight` | `32 × 64` | 2,048 |
| `classifier.0.bias` | `64` | 64 |
| `classifier.3.weight` | `64 × 5` | 320 |
| `classifier.3.bias` | `5` | 5 |
| **Total** | | **~6,637** |

## Input Format

The model expects a **flattened** float32 vector of size `1536` per graph:

| Segment | Shape | Description |
|---|---|---|
| Node features | `32 × 16 = 512` | 16-dim feature vector per node (see below) |
| Adjacency matrix | `32 × 32 = 1024` | Binary adjacency encoding causal edges |
| **Total** | **1536** | Concatenated and flattened |
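
As a rough illustration of this layout, the helpers below pack a feature matrix and an adjacency matrix into the flat vector and split it back. `pack_graph` and `unpack_graph` are hypothetical names, not part of the model or of Graph Hunter; the row-major ordering follows the index arithmetic used in the usage example (`node * 16 + feature` and `512 + src * 32 + dst`).

```python
import numpy as np

K_MAX, N_FEATS = 32, 16
FEAT_LEN = K_MAX * N_FEATS   # 512 values of node features
ADJ_LEN = K_MAX * K_MAX      # 1024 values of adjacency

def pack_graph(features: np.ndarray, adjacency: np.ndarray) -> np.ndarray:
    """Concatenate row-major features and adjacency into one [1, 1536] row."""
    flat = np.concatenate([features.reshape(-1), adjacency.reshape(-1)])
    return flat.astype(np.float32)[None, :]

def unpack_graph(flat: np.ndarray):
    """Split a [batch, 1536] array back into X [batch, 32, 16] and A [batch, 32, 32]."""
    batch = flat.shape[0]
    x = flat[:, :FEAT_LEN].reshape(batch, K_MAX, N_FEATS)
    a = flat[:, FEAT_LEN:].reshape(batch, K_MAX, K_MAX)
    return x, a

features = np.zeros((K_MAX, N_FEATS), dtype=np.float32)
adjacency = np.zeros((K_MAX, K_MAX), dtype=np.float32)
features[0, 3] = 1.0      # node 0 is a Process
adjacency[0, 5] = 1.0     # edge from node 0 to node 5

flat = pack_graph(features, adjacency)
print(flat.shape)         # (1, 1536)
```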

### Node Feature Schema (16 dimensions)

| Index | Feature | Encoding |
|---|---|---|
| 0–8 | Entity type | One-hot: IP(0), Host(1), User(2), Process(3), File(4), Domain(5), Registry(6), URL(7), Service(8) |
| 9 | Out-degree | `min(out_degree / 32, 1.0)` |
| 10 | In-degree | `min(in_degree / 32, 1.0)` |
| 11–13 | Reserved | `0.0` |
| 14 | Total degree | `min((in+out) / 64, 1.0)` |
| 15 | Is center node | `1.0` if BFS root, else `0.0` |
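
A minimal sketch of this schema as code, assuming the degrees are already known for each node. `encode_node` and `ENTITY_TYPES` are illustrative names only; the actual encoder lives in Graph Hunter and may differ in detail.

```python
import numpy as np

# Order matches the one-hot indices 0-8 in the schema table
ENTITY_TYPES = ["IP", "Host", "User", "Process", "File",
                "Domain", "Registry", "URL", "Service"]

def encode_node(entity_type: str, in_degree: int, out_degree: int,
                is_center: bool = False) -> np.ndarray:
    """Build the 16-dim feature vector for one node."""
    vec = np.zeros(16, dtype=np.float32)
    vec[ENTITY_TYPES.index(entity_type)] = 1.0           # indices 0-8: one-hot type
    vec[9] = min(out_degree / 32, 1.0)                   # out-degree, clipped to 1
    vec[10] = min(in_degree / 32, 1.0)                   # in-degree, clipped to 1
    # indices 11-13 are reserved and stay at 0.0
    vec[14] = min((in_degree + out_degree) / 64, 1.0)    # total degree
    vec[15] = 1.0 if is_center else 0.0                  # BFS root flag
    return vec

center = encode_node("Process", in_degree=0, out_degree=8, is_center=True)
print(center[[3, 9, 14, 15]])
```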

### Subgraph Extraction

Subgraphs are extracted as **2-hop BFS neighborhoods** from a center node, capped at `K_MAX=32` nodes. This matches the extraction logic in `gnn_bridge.rs` from Graph Hunter.
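
The extraction can be approximated with a depth-limited BFS like the one below. It is a sketch under assumptions: `extract_subgraph` and the `neighbors` dictionary are hypothetical, and whether `gnn_bridge.rs` follows edges in one or both directions is not stated here, so the traversal simply follows whatever neighbor lists it is given.

```python
from collections import deque

def extract_subgraph(neighbors, center, max_hops=2, k_max=32):
    """Return node ids in BFS order from `center`, limited by hops and node count."""
    order = [center]              # center is always node 0 of the subgraph
    seen = {center}
    queue = deque([(center, 0)])
    while queue and len(order) < k_max:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue              # do not expand beyond the hop limit
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append((nxt, depth + 1))
                if len(order) == k_max:
                    break         # node cap reached
    return order

chain = {0: [1], 1: [2], 2: [3]}
print(extract_subgraph(chain, center=0))  # [0, 1, 2]
```

The position of each node in the returned list can then serve as its row index in the feature and adjacency blocks.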

## Output Classes

| Index | Class | Description |
|---|---|---|
| 0 | **Benign** | Normal system activity |
| 1 | **Exfiltration** | Data theft / unauthorized data transfer |
| 2 | **C2Beacon** | Command & Control communication (process → IP patterns) |
| 3 | **LateralMovement** | Lateral movement (process spawning chains) |
| 4 | **PrivilegeEscalation** | Privilege escalation (auth-related anomalies) |

Apply `softmax` to logits for probabilities. Threat score = `1.0 - P(Benign)`.

## Training Details

### Dataset: StreamSpot

The model was trained on [**StreamSpot**](https://github.com/sbustreamspot/sbustreamspot-data) (Manzoor et al., 2016), a benchmark dataset of OS-level provenance graphs:

| Property | Value |
|---|---|
| Total graphs | 600 |
| Benign graphs | 500 (IDs 0–299, 400–599) |
| Attack graphs | 100 (IDs 300–399, drive-by-download scenario) |
| Raw edges | 89,770,902 |
| Parsed nodes | 5,046,878 |
| Parsed edges | 7,638,242 |
| Node types | Process, File, IP |
| Malicious nodes | 889,080 |

### Subgraph Sampling

From the full graph, **1,000,000 subgraphs** were extracted (balanced 500K malicious + 500K benign) using 2-hop BFS with `K_MAX=32`. Threat labels were assigned via heuristics on the attack subgraph structure (network fan-out → C2Beacon, process spawning → LateralMovement, auth edges → PrivilegeEscalation, default malicious → Exfiltration).

### Training Configuration

| Hyperparameter | Value |
|---|---|
| Train / Val / Test split | 68% / 12% / 20% (stratified) |
| Batch size | 64 |
| Optimizer | Adam (lr=0.001, weight_decay=1e-4) |
| Scheduler | ReduceLROnPlateau (factor=0.5, patience=5) |
| Loss | CrossEntropyLoss with inverse-frequency class weights |
| Max epochs | 100 |
| Early stopping | Patience = 10 (on validation loss) |
| Dropout | 0.3 (classifier only) |
| Hardware | NVIDIA A100-SXM4-80GB |
| Export | `torch.onnx.export` (opset 18, dynamic batch axis) |
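
The exact weighting formula is not given in this card. One common way to derive inverse-frequency class weights is sketched below, assuming the "balanced" normalization `N / (K * n_c)`; the function name and the toy labels are illustrative only.

```python
import numpy as np

def inverse_frequency_weights(labels: np.ndarray, num_classes: int = 5) -> np.ndarray:
    """Weight each class by N / (K * n_c), so rare classes count more in the loss."""
    counts = np.bincount(labels, minlength=num_classes).astype(np.float64)
    total = counts.sum()
    counts = np.maximum(counts, 1.0)   # guard against classes with no samples
    return total / (num_classes * counts)

# Toy label vector: class 0 is the most frequent, classes 2 and 3 the rarest
labels = np.array([0, 0, 0, 0, 1, 1, 2, 3, 4, 4])
weights = inverse_frequency_weights(labels)
print(weights)  # [0.5 1.  2.  2.  1. ]
```

The resulting vector could be passed as the `weight` argument of `torch.nn.CrossEntropyLoss` after conversion to a float tensor.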

## Usage

### Python (ONNX Runtime)

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("provenance_gcn.onnx")

# Build input: 32 nodes × 16 features + 32×32 adjacency, flattened
graph_input = np.zeros((1, 1536), dtype=np.float32)

# Example: center node is a Process (index 3) with five outgoing edges
graph_input[0, 0 * 16 + 3] = 1.0     # node 0 = Process
graph_input[0, 0 * 16 + 9] = 5 / 32  # out-degree, normalized as in the schema
graph_input[0, 0 * 16 + 15] = 1.0    # center node flag

# Add neighbor IPs and edges in adjacency block (offset 512)
for i in range(1, 6):
    graph_input[0, i * 16 + 0] = 1.0          # node i = IP
    graph_input[0, 512 + 0 * 32 + i] = 1.0    # edge: node 0 → node i

# Run inference
logits = session.run(None, {"input": graph_input})[0]

# Numerically stable softmax: subtract the row maximum before exponentiating
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
threat_score = 1.0 - probs[0, 0]  # 1 - P(Benign)

CLASSES = ["Benign", "Exfiltration", "C2Beacon", "LateralMovement", "PrivilegeEscalation"]
print(f"Prediction: {CLASSES[np.argmax(logits[0])]}, Threat Score: {threat_score:.3f}")
```

### Batch Inference

```python
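# Reuses `session` from the example above. Random values only exercise the
# dynamic batch axis; they are not valid graphs, so the predictions are not meaningful.
batch = np.random.randn(256, 1536).astype(np.float32)
logits = session.run(None, {"input": batch})[0]
predictions = np.argmax(logits, axis=1)  # shape (256,)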
```

### Rust (Graph Hunter / ort crate)

```rust
use ort::{Session, Value};
use ndarray::Array2;

let session = Session::builder()?.commit_from_file("provenance_gcn.onnx")?;
let input = Array2::<f32>::zeros((1, 1536));
let outputs = session.run(ort::inputs!["input" => input]?)?;
let logits = outputs[0].extract_tensor::<f32>()?;
```

The ONNX interface is identical to what `npu_scorer.rs` expects (1536 floats in, 5 logits out), so the model is drop-in compatible with Graph Hunter.

## Intended Use

- **Primary**: Real-time classification of OS provenance subgraphs for threat detection in SOC/EDR pipelines, as the scoring backend for Graph Hunter.
- **Secondary**: Research on graph-based threat detection, GNN benchmarking for cybersecurity, edge/NPU deployment scenarios (model fits comfortably on NPUs like Hailo-8L).
- **Out of scope**: This model is not a standalone security product. It should be integrated as one signal within a broader detection pipeline.

## Limitations

- **Fixed graph size**: 32 nodes max; larger provenance graphs must be windowed via k-hop BFS before inference.
- **Heuristic labels**: Threat subcategories (Exfiltration, C2, LateralMovement, PrivEsc) are assigned via structural heuristics on StreamSpot attack graphs, not from ground-truth APT campaign labels.
- **Single attack type**: StreamSpot only contains one attack scenario (drive-by-download). Generalization to other APT campaigns (APT29, APT3, etc.) requires retraining on datasets like DARPA TC E3 CADETS/Theia.
- **Lightweight by design**: 6.6K parameters enable edge deployment but limit capacity for complex multi-stage attack patterns.
- **No adversarial evaluation**: Not tested against evasion techniques targeting GNN-based detectors.

## References

- Kipf, T.N. & Welling, M. (2017). *Semi-Supervised Classification with Graph Convolutional Networks*. ICLR 2017.
- Manzoor, E., Milajerdi, S.M. & Akoglu, L. (2016). *Fast Memory-efficient Anomaly Detection in Streaming Heterogeneous Graphs*. KDD 2016. ([StreamSpot dataset](https://github.com/sbustreamspot/sbustreamspot-data))
- DARPA Transparent Computing Program: [TC Engagement 3 datasets](https://github.com/darpa-i2o/Transparent-Computing)

## Citation

```bibtex
@misc{provenance-gcn-2025,
  title   = {Provenance-GCN: Lightweight Graph Convolutional Network for APT Detection on Provenance Graphs},
  author  = {BASE4 Security R\&D+i},
  year    = {2026},
  note    = {Trained on StreamSpot, exported as ONNX for Graph Hunter}
}
```

## Contact

- **Organization**: [BASE4 Security](https://base4sec.com/), R&D+i Division
- **Location**: Buenos Aires, Argentina