---
license: mit
library_name: onnxruntime
tags:
- onnx
- gcn
- graph-neural-network
- cybersecurity
- apt-detection
- provenance-graph
- threat-detection
- graph-classification
- streamspot
pipeline_tag: graph-ml
model-index:
- name: provenance-gcn
  results: []
language:
- en
datasets:
- custom
metrics:
- accuracy
- precision
- recall
- f1
---

# Provenance-GCN: Graph Convolutional Network for APT Detection on Provenance Graphs

<p align="center">
<img src="https://img.shields.io/badge/format-ONNX-blue?logo=onnx" alt="ONNX">
<img src="https://img.shields.io/badge/framework-PyTorch%202.10-ee4c2c?logo=pytorch" alt="PyTorch">
<img src="https://img.shields.io/badge/task-APT%20Detection-critical" alt="APT Detection">
<img src="https://img.shields.io/badge/params-6.6K-green" alt="Parameters">
<img src="https://img.shields.io/badge/dataset-StreamSpot-blueviolet" alt="StreamSpot">
<img src="https://img.shields.io/badge/license-MIT-brightgreen" alt="MIT License">
</p>

## Model Description

**Provenance-GCN** is a lightweight Graph Convolutional Network (GCN) designed to classify OS-level provenance graphs as benign or as belonging to specific threat categories. It operates on fixed-size provenance subgraphs (32 nodes) extracted via 2-hop BFS and outputs a classification over 5 threat classes.

The model implements real graph convolution (message passing) following Kipf & Welling (2017), not a simple MLP. For ONNX compatibility, the adjacency matrix and node features are passed as a single flattened tensor and reconstructed internally for graph-aware computation.

Built as the scoring backend for [**Graph Hunter**](https://github.com/base4security/graph-hunter), a Rust/Tauri threat-hunting tool, and compatible with its `npu_scorer.rs` inference module.

| Property | Value |
|---|---|
| **Architecture** | 2-layer GCN + MLP classifier |
| **Input** | Flattened provenance subgraph (`batch_size × 1536`) |
| **Output** | 5-class logits (`batch_size × 5`) |
| **Parameters** | 6,637 (~6.6K) |
| **File size** | 46.8 KB |
| **Format** | ONNX (opset 18, IR v10) |
| **Producer** | PyTorch 2.10.0+cu128 |
| **Training hardware** | NVIDIA A100-SXM4-80GB (Google Colab) |

## Architecture

```
Input [batch, 1536]
  │
  ├─ Reshape → Node features X    [batch, 32, 16]
  ├─ Reshape → Adjacency matrix A [batch, 32, 32]
  │
  ├─ Ã = A + I          (add self-loops)
  ├─ A_norm = D⁻¹ Ã     (row-normalize)
  │
  ▼
GCN Layer 1: H¹ = ReLU(A_norm · X · W₁)    (16 → 64)
  │
  ▼
GCN Layer 2: H² = ReLU(A_norm · H¹ · W₂)   (64 → 32)
  │
  ▼
Global Mean Pooling over 32 nodes  (32×32 → 32)
  │
  ▼
Classifier:
  Linear(32, 64) → ReLU → Dropout(0.3) → Linear(64, 5)
  │
  ▼
Output [batch, 5]  (logits per class)
```

### Graph Convolution

Each GCN layer performs the operation:

$$H^{(l+1)} = \text{ReLU}\!\left(\hat{D}^{-1}\hat{A}\, H^{(l)}\, W^{(l)}\right)$$

where $\hat{A} = A + I$ is the adjacency matrix with self-loops, $\hat{D}$ is the degree matrix of $\hat{A}$, $H^{(l)}$ is the node feature matrix at layer $l$, and $W^{(l)}$ is the learnable weight matrix.

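Concretely, the forward pass can be sketched in NumPy with untrained random weights (shapes follow the diagram and equation above, not the actual exported graph; biases are omitted for brevity):

```python
import numpy as np

def forward(flat, rng=np.random.default_rng(0)):
    """Sketch of the Provenance-GCN forward pass with random, untrained weights."""
    B = flat.shape[0]
    X = flat[:, :512].reshape(B, 32, 16)            # node features
    A = flat[:, 512:].reshape(B, 32, 32)            # adjacency matrix
    A_hat = A + np.eye(32, dtype=flat.dtype)        # add self-loops
    A_norm = A_hat / A_hat.sum(-1, keepdims=True)   # row-normalize: D^-1 * A_hat
    W1 = rng.standard_normal((16, 64)) * 0.1        # gc1
    W2 = rng.standard_normal((64, 32)) * 0.1        # gc2
    H1 = np.maximum(A_norm @ X @ W1, 0)             # GCN layer 1 (16 -> 64)
    H2 = np.maximum(A_norm @ H1 @ W2, 0)            # GCN layer 2 (64 -> 32)
    g = H2.mean(axis=1)                             # global mean pooling -> [B, 32]
    Wc1 = rng.standard_normal((32, 64)) * 0.1       # classifier.0
    Wc2 = rng.standard_normal((64, 5)) * 0.1        # classifier.3
    return np.maximum(g @ Wc1, 0) @ Wc2             # logits [B, 5]

logits = forward(np.random.rand(2, 1536).astype(np.float32))
print(logits.shape)  # (2, 5)
```

In the real model, `gc1.bias` and `gc2.bias` are added after each MatMul (see the weight-tensor table below).
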
### Weight Tensors

| Layer | Shape | Parameters |
|---|---|---|
| `gc1.weight` (via MatMul) | `16 × 64` | 1,024 |
| `gc1.bias` | `64` | 64 |
| `gc2.weight` (via MatMul) | `64 × 32` | 2,048 |
| `gc2.bias` | `32` | 32 |
| `classifier.0.weight` | `32 × 64` | 2,048 |
| `classifier.0.bias` | `64` | 64 |
| `classifier.3.weight` | `64 × 5` | 320 |
| `classifier.3.bias` | `5` | 5 |
| **Total** | | **~6,637** |

## Input Format

The model expects a **flattened** float32 vector of size `1536` per graph:

| Segment | Shape | Description |
|---|---|---|
| Node features | `32 × 16 = 512` | 16-dim feature vector per node (see below) |
| Adjacency matrix | `32 × 32 = 1024` | Binary adjacency encoding causal edges |
| **Total** | **1536** | Concatenated and flattened |

### Node Feature Schema (16 dimensions)

| Index | Feature | Encoding |
|---|---|---|
| 0–8 | Entity type | One-hot: IP(0), Host(1), User(2), Process(3), File(4), Domain(5), Registry(6), URL(7), Service(8) |
| 9 | Out-degree | `min(out_degree / 32, 1.0)` |
| 10 | In-degree | `min(in_degree / 32, 1.0)` |
| 11–13 | Reserved | `0.0` |
| 14 | Total degree | `min((in + out) / 64, 1.0)` |
| 15 | Is center node | `1.0` if BFS root, else `0.0` |

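As an illustration, a hypothetical `encode_node` helper (the name is ours, not from Graph Hunter) that fills one 16-dim slot according to this schema:

```python
import numpy as np

ENTITY_TYPES = ["IP", "Host", "User", "Process", "File",
                "Domain", "Registry", "URL", "Service"]

def encode_node(entity_type, out_deg, in_deg, is_center=False):
    """Encode one node into the 16-dim feature vector from the schema above."""
    f = np.zeros(16, dtype=np.float32)
    f[ENTITY_TYPES.index(entity_type)] = 1.0    # dims 0-8: one-hot entity type
    f[9] = min(out_deg / 32, 1.0)               # dim 9: normalized out-degree
    f[10] = min(in_deg / 32, 1.0)               # dim 10: normalized in-degree
    f[14] = min((in_deg + out_deg) / 64, 1.0)   # dim 14: normalized total degree
    f[15] = 1.0 if is_center else 0.0           # dim 15: BFS-root flag
    return f                                    # dims 11-13 stay reserved (0.0)

vec = encode_node("Process", out_deg=8, in_deg=2, is_center=True)
```
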
### Subgraph Extraction

Subgraphs are extracted as **2-hop BFS neighborhoods** from a center node, capped at `K_MAX=32` nodes. This matches the extraction logic in `gnn_bridge.rs` from Graph Hunter.

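A minimal sketch of that extraction in Python (assuming a plain dict-of-neighbors graph; the actual implementation is the Rust code in `gnn_bridge.rs`):

```python
from collections import deque

K_MAX = 32

def extract_subgraph(adj, center, hops=2, k_max=K_MAX):
    """2-hop BFS from `center`, capped at k_max nodes; returns nodes in visit order."""
    seen = {center}
    order = [center]
    queue = deque([(center, 0)])
    while queue and len(order) < k_max:
        node, depth = queue.popleft()
        if depth == hops:                # do not expand beyond the hop limit
            continue
        for nbr in adj.get(node, []):
            if nbr not in seen:
                seen.add(nbr)
                order.append(nbr)
                queue.append((nbr, depth + 1))
                if len(order) == k_max:  # hard cap at K_MAX nodes
                    break
    return order

adj = {0: [1, 2], 1: [3], 2: [4], 3: [5]}  # node 5 is 3 hops from node 0
print(extract_subgraph(adj, 0))  # [0, 1, 2, 3, 4] -- node 5 excluded (3 hops away)
```
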
## Output Classes

| Index | Class | Description |
|---|---|---|
| 0 | **Benign** | Normal system activity |
| 1 | **Exfiltration** | Data theft / unauthorized data transfer |
| 2 | **C2Beacon** | Command & Control communication (process → IP patterns) |
| 3 | **LateralMovement** | Lateral movement (process spawning chains) |
| 4 | **PrivilegeEscalation** | Privilege escalation (auth-related anomalies) |

Apply `softmax` to the logits to obtain probabilities. Threat score = `1.0 - P(Benign)`.

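For instance, with a numerically stable softmax:

```python
import numpy as np

def threat_score(logits):
    """Softmax over the 5 class logits; threat score is 1 - P(Benign)."""
    z = logits - logits.max()         # subtract max for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return 1.0 - p[0]                 # index 0 is Benign

score = threat_score(np.array([4.0, 0.0, 0.0, 0.0, 0.0]))  # confident benign -> low score
```
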
## Training Details

### Dataset: StreamSpot

The model was trained on [**StreamSpot**](https://github.com/sbustreamspot/sbustreamspot-data) (Manzoor et al., 2016), a benchmark dataset of OS-level provenance graphs:

| Property | Value |
|---|---|
| Total graphs | 600 |
| Benign graphs | 500 (IDs 0–299, 400–599) |
| Attack graphs | 100 (IDs 300–399, drive-by-download scenario) |
| Raw edges | 89,770,902 |
| Parsed nodes | 5,046,878 |
| Parsed edges | 7,638,242 |
| Node types | Process, File, IP |
| Malicious nodes | 889,080 |

### Subgraph Sampling

From the full graph, **1,000,000 subgraphs** were extracted (balanced: 500K malicious + 500K benign) using 2-hop BFS with `K_MAX=32`. Threat labels were assigned via heuristics on the attack subgraph structure (network fan-out → C2Beacon, process spawning → LateralMovement, auth edges → PrivilegeEscalation, default malicious → Exfiltration).

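Schematically, the labeling heuristic behaves like the following sketch; the thresholds and field names here are illustrative assumptions, not the actual sampling code:

```python
def heuristic_label(stats):
    """Assign a threat subcategory to a malicious subgraph (illustrative thresholds)."""
    if stats["ip_fanout"] >= 4:           # many process -> IP edges (network fan-out)
        return "C2Beacon"
    if stats["spawn_chain_len"] >= 3:     # long process spawning chains
        return "LateralMovement"
    if stats["has_auth_edges"]:           # auth-related activity
        return "PrivilegeEscalation"
    return "Exfiltration"                 # default label for malicious subgraphs

label = heuristic_label({"ip_fanout": 6, "spawn_chain_len": 1, "has_auth_edges": False})
# -> "C2Beacon"
```
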
### Training Configuration

| Hyperparameter | Value |
|---|---|
| Train / Val / Test split | 68% / 12% / 20% (stratified) |
| Batch size | 64 |
| Optimizer | Adam (lr=0.001, weight_decay=1e-4) |
| Scheduler | ReduceLROnPlateau (factor=0.5, patience=5) |
| Loss | CrossEntropyLoss with inverse-frequency class weights |
| Max epochs | 100 |
| Early stopping | Patience = 10 (on validation loss) |
| Dropout | 0.3 (classifier only) |
| Hardware | NVIDIA A100-SXM4-80GB |
| Export | `torch.onnx.export` (opset 18, dynamic batch axis) |

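The inverse-frequency class weights used by the loss can be derived from label counts; a NumPy sketch (in training these would be passed as `CrossEntropyLoss(weight=...)`):

```python
import numpy as np

def class_weights(labels, num_classes=5):
    """Inverse-frequency weights: rare classes get proportionally larger weight."""
    counts = np.bincount(labels, minlength=num_classes).astype(np.float64)
    return len(labels) / (num_classes * counts)  # assumes every class appears

w = class_weights(np.array([0, 0, 0, 0, 1, 2, 3, 4]))  # class 0 is 4x more frequent
# -> [0.4, 1.6, 1.6, 1.6, 1.6]
```
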
## Usage

### Python (ONNX Runtime)

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("provenance_gcn.onnx")

# Build input: 32 nodes × 16 features + 32×32 adjacency, flattened
graph_input = np.zeros((1, 1536), dtype=np.float32)

# Example: center node is a Process (index 3) with high out-degree
graph_input[0, 0 * 16 + 3] = 1.0   # node 0 = Process
graph_input[0, 0 * 16 + 9] = 0.25  # out-degree
graph_input[0, 0 * 16 + 15] = 1.0  # center node flag

# Add neighbor IPs and edges in the adjacency block (offset 512)
for i in range(1, 6):
    graph_input[0, i * 16 + 0] = 1.0        # node i = IP
    graph_input[0, 512 + 0 * 32 + i] = 1.0  # edge: node 0 -> node i

# Run inference
logits = session.run(None, {"input": graph_input})[0]
probs = np.exp(logits - logits.max(axis=1, keepdims=True))  # numerically stable softmax
probs /= probs.sum(axis=1, keepdims=True)
threat_score = 1.0 - probs[0, 0]  # 1 - P(Benign)

CLASSES = ["Benign", "Exfiltration", "C2Beacon", "LateralMovement", "PrivilegeEscalation"]
print(f"Prediction: {CLASSES[np.argmax(logits)]}, Threat Score: {threat_score:.3f}")
```

### Batch Inference

```python
batch = np.random.randn(256, 1536).astype(np.float32)
logits = session.run(None, {"input": batch})[0]
predictions = np.argmax(logits, axis=1)
```

### Rust (Graph Hunter / ort crate)

```rust
use ort::Session;
use ndarray::Array2;

let session = Session::builder()?.commit_from_file("provenance_gcn.onnx")?;
let input = Array2::<f32>::zeros((1, 1536));
let outputs = session.run(ort::inputs!["input" => input]?)?;
let logits = outputs[0].extract_tensor::<f32>()?;
```

The ONNX interface is identical to what `npu_scorer.rs` expects: 1536 floats in, 5 logits out. Drop-in compatible with Graph Hunter.

## Intended Use

- **Primary**: Real-time classification of OS provenance subgraphs for threat detection in SOC/EDR pipelines, as the scoring backend for Graph Hunter.
- **Secondary**: Research on graph-based threat detection, GNN benchmarking for cybersecurity, and edge/NPU deployment scenarios (the model fits comfortably on NPUs such as the Hailo-8L).
- **Out of scope**: This model is not a standalone security product. It should be integrated as one signal within a broader detection pipeline.

## Limitations

- **Fixed graph size**: 32 nodes max; larger provenance graphs must be windowed via k-hop BFS before inference.
- **Heuristic labels**: Threat subcategories (Exfiltration, C2, LateralMovement, PrivEsc) are assigned via structural heuristics on StreamSpot attack graphs, not from ground-truth APT campaign labels.
- **Single attack type**: StreamSpot contains only one attack scenario (drive-by-download). Generalization to other APT campaigns (APT29, APT3, etc.) requires retraining on datasets such as DARPA TC E3 CADETS/Theia.
- **Lightweight by design**: 6.6K parameters enable edge deployment but limit capacity for complex multi-stage attack patterns.
- **No adversarial evaluation**: Not tested against evasion techniques targeting GNN-based detectors.

## References

- Kipf, T. N. & Welling, M. (2017). *Semi-Supervised Classification with Graph Convolutional Networks*. ICLR 2017.
- Manzoor, E., Milajerdi, S. M. & Akoglu, L. (2016). *Fast Memory-efficient Anomaly Detection in Streaming Heterogeneous Graphs*. KDD 2016. ([StreamSpot dataset](https://github.com/sbustreamspot/sbustreamspot-data))
- DARPA Transparent Computing Program: [TC Engagement 3 datasets](https://github.com/darpa-i2o/Transparent-Computing)

## Citation

```bibtex
@misc{provenance-gcn-2025,
  title  = {Provenance-GCN: Lightweight Graph Convolutional Network for APT Detection on Provenance Graphs},
  author = {BASE4 Security R\&D+i},
  year   = {2025},
  note   = {Trained on StreamSpot, exported as ONNX for Graph Hunter}
}
```

## Contact

- **Organization**: [BASE4 Security](https://base4sec.com/), R&D+i Division
- **Location**: Buenos Aires, Argentina