---
license: mit
library_name: onnxruntime
tags:
- onnx
- gcn
- graph-neural-network
- cybersecurity
- apt-detection
- provenance-graph
- threat-detection
- graph-classification
- streamspot
pipeline_tag: graph-ml
model-index:
- name: provenance-gcn
  results: []
language:
- en
datasets:
- custom
metrics:
- accuracy
- precision
- recall
- f1
---

# Provenance-GCN — Graph Convolutional Network for APT Detection on Provenance Graphs


## Model Description

**Provenance-GCN** is a lightweight Graph Convolutional Network (GCN) designed to classify OS-level provenance graphs as benign or as belonging to a specific threat category. It operates on fixed-size provenance subgraphs (32 nodes) extracted via 2-hop BFS and outputs a classification across 5 threat classes.

The model implements real graph convolution (message passing) following Kipf & Welling (2017), not a simple MLP. The adjacency matrix and node features are passed as a single flattened tensor for ONNX compatibility and are reconstructed internally for graph-aware computation.

Built as the scoring backend for [**Graph Hunter**](https://github.com/base4security/graph-hunter) — a Rust/Tauri threat hunting tool — and compatible with its `npu_scorer.rs` inference module.

| Property | Value |
|---|---|
| **Architecture** | 2-layer GCN + MLP classifier |
| **Input** | Flattened provenance subgraph (`batch_size × 1536`) |
| **Output** | 5-class logits (`batch_size × 5`) |
| **Parameters** | 6,637 (~6.6K) |
| **File size** | 46.8 KB |
| **Format** | ONNX (opset 18, IR v10) |
| **Producer** | PyTorch 2.10.0+cu128 |
| **Training hardware** | NVIDIA A100-SXM4-80GB (Google Colab) |

## Architecture

```
Input [batch, 1536]
  │
  ├── Reshape → Node features X [batch, 32, 16]
  ├── Reshape → Adjacency matrix A [batch, 32, 32]
  │
  ├── Â = A + I        (add self-loops)
  ├── A_norm = D⁻¹ Â   (row-normalize)
  │
  ▼
GCN Layer 1: H¹ = ReLU(A_norm · X · W₁)     (16 → 64)
  │
  ▼
GCN Layer 2: H² = ReLU(A_norm · H¹ · W₂)    (64 → 32)
  │
  ▼
Global Mean Pooling over 32 nodes           (32×32 → 32)
  │
  ▼
Classifier: Linear(32, 64) → ReLU → Dropout(0.3) → Linear(64, 5)
  │
  ▼
Output [batch, 5]  (logits per class)
```

### Graph Convolution

Each GCN layer performs the operation:

$$H^{(l+1)} = \text{ReLU}\!\left(\hat{D}^{-1}\hat{A}\, H^{(l)}\, W^{(l)}\right)$$

where Â = A + I is the adjacency matrix with self-loops, D̂ is the degree matrix of Â, H⁽ˡ⁾ is the node feature matrix at layer l, and W⁽ˡ⁾ is the learnable weight matrix.

### Weight Tensors

| Layer | Shape | Parameters |
|---|---|---|
| `gc1.weight` (via MatMul) | `16 × 64` | 1,024 |
| `gc1.bias` | `64` | 64 |
| `gc2.weight` (via MatMul) | `64 × 32` | 2,048 |
| `gc2.bias` | `32` | 32 |
| `classifier.0.weight` | `32 × 64` | 2,048 |
| `classifier.0.bias` | `64` | 64 |
| `classifier.3.weight` | `64 × 5` | 320 |
| `classifier.3.bias` | `5` | 5 |
| **Total** | | **~6,637** |

## Input Format

The model expects a **flattened** float32 vector of size `1536` per graph:

| Segment | Shape | Description |
|---|---|---|
| Node features | `32 × 16 = 512` | 16-dim feature vector per node (see below) |
| Adjacency matrix | `32 × 32 = 1024` | Binary adjacency encoding causal edges |
| **Total** | **1536** | Concatenated and flattened |

### Node Feature Schema (16 dimensions)

| Index | Feature | Encoding |
|---|---|---|
| 0–8 | Entity type | One-hot: IP(0), Host(1), User(2), Process(3), File(4), Domain(5), Registry(6), URL(7), Service(8) |
| 9 | Out-degree | `min(out_degree / 32, 1.0)` |
| 10 | In-degree | `min(in_degree / 32, 1.0)` |
| 11–13 | Reserved | `0.0` |
| 14 | Total degree | `min((in+out) / 64, 1.0)` |
| 15 | Is center node | `1.0` if BFS root, else `0.0` |

### Subgraph Extraction

Subgraphs are extracted as **2-hop BFS neighborhoods** from a center node, capped at `K_MAX=32` nodes. This matches the extraction logic in `gnn_bridge.rs` from Graph Hunter.

## Output Classes

| Index | Class | Description |
|---|---|---|
| 0 | **Benign** | Normal system activity |
| 1 | **Exfiltration** | Data theft / unauthorized data transfer |
| 2 | **C2Beacon** | Command & Control communication (process → IP patterns) |
| 3 | **LateralMovement** | Lateral movement (process spawning chains) |
| 4 | **PrivilegeEscalation** | Privilege escalation (auth-related anomalies) |

Apply `softmax` to the logits to obtain probabilities. Threat score = `1.0 - P(Benign)`.
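As a quick sketch of this post-processing step, the logits can be turned into class probabilities and a threat score with NumPy. The logit values below are made up for illustration only:

```python
import numpy as np

# Hypothetical logits for one graph, ordered as in the class table:
# [Benign, Exfiltration, C2Beacon, LateralMovement, PrivilegeEscalation]
logits = np.array([[2.0, 0.1, 1.5, -0.3, 0.2]], dtype=np.float32)

# Numerically stable softmax over the class axis
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

# Threat score: probability mass on the non-benign classes
threat_score = 1.0 - probs[0, 0]
```

Subtracting the row-wise maximum before exponentiating avoids overflow for large logits while leaving the softmax result unchanged.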
## Training Details

### Dataset: StreamSpot

The model was trained on [**StreamSpot**](https://github.com/sbustreamspot/sbustreamspot-data) (Manzoor et al., 2016), a benchmark dataset of OS-level provenance graphs:

| Property | Value |
|---|---|
| Total graphs | 600 |
| Benign graphs | 500 (IDs 0–299, 400–599) |
| Attack graphs | 100 (IDs 300–399, drive-by-download scenario) |
| Raw edges | 89,770,902 |
| Parsed nodes | 5,046,878 |
| Parsed edges | 7,638,242 |
| Node types | Process, File, IP |
| Malicious nodes | 889,080 |

### Subgraph Sampling

From the full graph, **1,000,000 subgraphs** were extracted (balanced: 500K malicious + 500K benign) using 2-hop BFS with `K_MAX=32`. Threat labels were assigned via heuristics on the attack subgraph structure (network fan-out → C2Beacon, process spawning → LateralMovement, auth edges → PrivilegeEscalation, remaining malicious → Exfiltration).

### Training Configuration

| Hyperparameter | Value |
|---|---|
| Train / Val / Test split | 68% / 12% / 20% (stratified) |
| Batch size | 64 |
| Optimizer | Adam (lr=0.001, weight_decay=1e-4) |
| Scheduler | ReduceLROnPlateau (factor=0.5, patience=5) |
| Loss | CrossEntropyLoss with inverse-frequency class weights |
| Max epochs | 100 |
| Early stopping | Patience = 10 (on validation loss) |
| Dropout | 0.3 (classifier only) |
| Hardware | NVIDIA A100-SXM4-80GB |
| Export | `torch.onnx.export` (opset 18, dynamic batch axis) |

## Usage

### Python (ONNX Runtime)

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("provenance_gcn.onnx")

# Build input: 32 nodes × 16 features + 32×32 adjacency, flattened
graph_input = np.zeros((1, 1536), dtype=np.float32)

# Example: center node is a Process (index 3) with high out-degree
graph_input[0, 0 * 16 + 3] = 1.0    # node 0 = Process
graph_input[0, 0 * 16 + 9] = 0.25   # out-degree
graph_input[0, 0 * 16 + 15] = 1.0   # center node flag

# Add neighbor IPs and edges in the adjacency block (offset 512)
for i in range(1, 6):
    graph_input[0, i * 16 + 0] = 1.0        # node i = IP
    graph_input[0, 512 + 0 * 32 + i] = 1.0  # edge: node 0 → node i

# Run inference
logits = session.run(None, {"input": graph_input})[0]
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
threat_score = 1.0 - probs[0, 0]  # 1 - P(Benign)

CLASSES = ["Benign", "Exfiltration", "C2Beacon", "LateralMovement", "PrivilegeEscalation"]
print(f"Prediction: {CLASSES[np.argmax(logits)]}, Threat Score: {threat_score:.3f}")
```

### Batch Inference

```python
batch = np.random.randn(256, 1536).astype(np.float32)
logits = session.run(None, {"input": batch})[0]
predictions = np.argmax(logits, axis=1)
```

### Rust (Graph Hunter / ort crate)

```rust
use ort::{Session, Value};
use ndarray::Array2;

let session = Session::builder()?.commit_from_file("provenance_gcn.onnx")?;
let input = Array2::<f32>::zeros((1, 1536));
let outputs = session.run(ort::inputs!["input" => input]?)?;
let logits = outputs[0].extract_tensor::<f32>()?;
```

The ONNX interface is identical to what `npu_scorer.rs` expects: 1536 floats in, 5 logits out. Drop-in compatible with Graph Hunter.

## Intended Use

- **Primary**: Real-time classification of OS provenance subgraphs for threat detection in SOC/EDR pipelines, as the scoring backend for Graph Hunter.
- **Secondary**: Research on graph-based threat detection, GNN benchmarking for cybersecurity, and edge/NPU deployment scenarios (the model fits comfortably on NPUs such as the Hailo-8L).
- **Out of scope**: This model is not a standalone security product. It should be integrated as one signal within a broader detection pipeline.

## Limitations

- **Fixed graph size**: 32 nodes max — larger provenance graphs must be windowed via k-hop BFS before inference.
- **Heuristic labels**: Threat subcategories (Exfiltration, C2, LateralMovement, PrivEsc) are assigned via structural heuristics on StreamSpot attack graphs, not from ground-truth APT campaign labels.
- **Single attack type**: StreamSpot contains only one attack scenario (drive-by download). Generalization to other APT campaigns (APT29, APT3, etc.) requires retraining on datasets such as DARPA TC E3 CADETS/Theia.
- **Lightweight by design**: 6.6K parameters enable edge deployment but limit capacity for complex multi-stage attack patterns.
- **No adversarial evaluation**: Not tested against evasion techniques targeting GNN-based detectors.

## References

- Kipf, T. N. & Welling, M. (2017). *Semi-Supervised Classification with Graph Convolutional Networks*. ICLR 2017.
- Manzoor, E., Milajerdi, S. M. & Akoglu, L. (2016). *Fast Memory-efficient Anomaly Detection in Streaming Heterogeneous Graphs*. KDD 2016. ([StreamSpot dataset](https://github.com/sbustreamspot/sbustreamspot-data))
- DARPA Transparent Computing Program — [TC Engagement 3 datasets](https://github.com/darpa-i2o/Transparent-Computing)

## Citation

```bibtex
@misc{provenance-gcn-2025,
  title  = {Provenance-GCN: Lightweight Graph Convolutional Network for APT Detection on Provenance Graphs},
  author = {BASE4 Security R\&D+i},
  year   = {2025},
  note   = {Trained on StreamSpot, exported as ONNX for Graph Hunter}
}
```

## Contact

- **Organization**: [BASE4 Security](https://base4sec.com/) — R&D+i Division
- **Location**: Buenos Aires, Argentina